Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak

Main category: eess.AS

TL;DR: Novel post-training method adapts Audio Large Language Model for multidimensional speech quality assessment using calibration and reinforcement learning with dimension-specific rewards

Details

Motivation: Current speech quality assessment relies on Mean Opinion Scores (MOS) which lack explainability; need to analyze underlying perceptual dimensions and provide detailed artifact detection

Method: Two-stage approach: 1) calibration stage aligns model to predict predefined perceptual dimensions, 2) reinforcement learning stage uses Group Relative Policy Optimization (GRPO) with dimension-specific rewards to enhance accuracy of descriptions and temporal localization

Result: Achieves state-of-the-art 0.71 mean PCC score on QualiSpeech benchmark, 13% improvement in MOS prediction, and substantial advances in pinpointing and classifying audio artifacts temporally

Conclusion: The method successfully enables explainable multidimensional speech quality assessment with improved accuracy and temporal artifact localization through tailored Audio LLM adaptation

Abstract: Explainable speech quality assessment requires moving beyond Mean Opinion Scores (MOS) to analyze underlying perceptual dimensions. To address this, we introduce a novel post-training method that tailors the foundational Audio Large Language Model for multidimensional reasoning, detection and classification of audio artifacts. First, a calibration stage aligns the model to predict predefined perceptual dimensions. Second, a reinforcement learning stage leverages Group Relative Policy Optimization (GRPO) with dimension-specific rewards to heavily enhance accuracy of descriptions and temporal localization of quality issues. With this approach we reach state-of-the-art results of 0.71 mean PCC score on the multidimensional QualiSpeech benchmark and 13% improvement in MOS prediction driven by RL-based reasoning. Furthermore, our fine-grained GRPO rewards substantially advance the model’s ability to pinpoint and classify audio artifacts in time.

Relevance: 9/10

[2] Speech Codec Probing from Semantic and Phonetic Perspectives

Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma, Shrikanth Narayanan

Main category: eess.AS

TL;DR: Analysis shows current speech tokenizers capture phonetic rather than semantic information, revealing a mismatch with text semantics that affects multimodal LLM performance.

Details

Motivation: Speech tokenizers are crucial for connecting speech to LLMs in multimodal systems, but there's evidence that what's called "semantic" in speech representations doesn't align with text-derived semantics, which can degrade multimodal LLM performance.

Method: Systematically analyze information encoded by widely used speech tokenizers through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics like CKA to disentangle semantic and phonetic content.

Result: Current tokenizers primarily capture phonetic rather than lexical-semantic structure, revealing a fundamental mismatch between speech and text representations.

Conclusion: The findings provide practical implications for designing next-generation speech tokenization methods that better align with text semantics for improved multimodal LLM performance.

Abstract: Speech tokenizers are essential for connecting speech to large language models (LLMs) in multimodal systems. These tokenizers are expected to preserve both semantic and acoustic information for downstream understanding and generation. However, emerging evidence suggests that what is termed “semantic” in speech representations does not align with text-derived semantics: a mismatch that can degrade multimodal LLM performance. In this paper, we systematically analyze the information encoded by several widely used speech tokenizers, disentangling their semantic and phonetic content through word-level probing tasks, layerwise representation analysis, and cross-modal alignment metrics such as CKA. Our results show that current tokenizers primarily capture phonetic rather than lexical-semantic structure, and we derive practical implications for the design of next-generation speech tokenization methods.

Relevance: 9/10

[3] ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes

Main category: cs.SD

TL;DR: ID-LoRA is a novel method for joint audio-video personalization that generates both a subject’s appearance and voice in a single model using text prompts, reference images, and short audio clips.

Details

Motivation: Existing methods treat video and audio separately, preventing synchronization of sounds with on-screen actions and limiting control over speaking style and acoustic environment through text prompts.

Method: Adapts LTX-2 joint audio-video diffusion backbone using parameter-efficient In-Context LoRA with two key innovations: negative temporal positions to distinguish reference/generation tokens, and identity guidance to preserve speaker characteristics during denoising.

Result: Human preference studies show ID-LoRA preferred over Kling 2.6 Pro by 73% for voice similarity and 65% for speaking style, with 24% improvement in speaker similarity on cross-environment settings.

Conclusion: ID-LoRA enables joint audio-video personalization in a single generative pass, achieving strong results with minimal training data while providing physically grounded sound synthesis.

Abstract: Existing video personalization methods preserve visual likeness but treat video and audio separately. Without access to the visual scene, audio models cannot synchronize sounds with on-screen actions; and because classical voice-cloning models condition only on a reference recording, a text prompt cannot redirect speaking style or acoustic environment. We propose ID-LoRA (Identity-Driven In-Context LoRA), which jointly generates a subject’s appearance and voice in a single model, letting a text prompt, a reference image, and a short audio clip govern both modalities together. ID-LoRA adapts the LTX-2 joint audio-video diffusion backbone via parameter-efficient In-Context LoRA and, to our knowledge, is the first method to personalize visual appearance and voice in a single generative pass. Two challenges arise. Reference and generation tokens share the same positional-encoding space, making them hard to distinguish; we address this with negative temporal positions, placing reference tokens in a disjoint RoPE region while preserving their internal temporal structure. Speaker characteristics also tend to be diluted during denoising; we introduce identity guidance, a classifier-free guidance variant that amplifies speaker-specific features by contrasting predictions with and without the reference signal. In human preference studies, ID-LoRA is preferred over Kling 2.6 Pro by 73% of annotators for voice similarity and 65% for speaking style. On cross-environment settings, speaker similarity improves by 24% over Kling, with the gap widening as conditions diverge. A preliminary user study further suggests that joint generation provides a useful inductive bias for physically grounded sound synthesis. ID-LoRA achieves these results with only ~3K training pairs on a single GPU. Code, models, and data will be released.

Relevance: 9/10

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 117]
cs.CV [Total: 171]
cs.AI [Total: 78]
cs.SD [Total: 17]
cs.LG [Total: 164]
cs.MA [Total: 2]
cs.MM [Total: 2]
eess.AS [Total: 9]
eess.IV [Total: 6]

cs.CL

[1] GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

Ghazal Kalhor, Yadollah Yaghoobzadeh

Main category: cs.CL

TL;DR: A benchmark (GhazalBench) for evaluating LLMs on Persian poetry understanding, testing both meaning comprehension and exact verse recall under various cues, showing models struggle with form recall in Persian but perform better on English sonnets.

Details

Motivation: Persian poetry is culturally significant in Iran, requiring models to handle both meaning and exact form. Current LLMs need evaluation on culturally entrenched texts with usage-grounded conditions.

Method: Created GhazalBench benchmark assessing two abilities: 1) producing faithful prose paraphrases of couplets, and 2) accessing canonical verses under varying semantic and formal cues. Evaluated several proprietary and open-weight multilingual LLMs.

Result: Models capture poetic meaning but struggle with exact verse recall in completion tasks. Recognition-based tasks reduce this gap. English sonnet evaluation shows higher recall, suggesting limitations are due to training exposure differences rather than architecture.

Conclusion: Need evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. Training exposure differences affect performance on culturally specific forms.

Abstract: Persian poetry plays an active role in Iranian cultural practice, where verses by canonical poets such as Hafez are frequently quoted, paraphrased, or completed from partial cues. Supporting such interactions requires language models to engage not only with poetic meaning but also with culturally entrenched surface form. We introduce GhazalBench, a benchmark for evaluating how large language models (LLMs) interact with Persian ghazals under usage-grounded conditions. GhazalBench assesses two complementary abilities: producing faithful prose paraphrases of couplets and accessing canonical verses under varying semantic and formal cues. Across several proprietary and open-weight multilingual LLMs, we observe a consistent dissociation: models generally capture poetic meaning but struggle with exact verse recall in completion-based settings, while recognition-based tasks substantially reduce this gap. A parallel evaluation on English sonnets shows markedly higher recall performance, suggesting that these limitations are tied to differences in training exposure rather than inherent architectural constraints. Our findings highlight the need for evaluation frameworks that jointly assess meaning, form, and cue-dependent access to culturally significant texts. GhazalBench is available at https://github.com/kalhorghazal/GhazalBench.

[2] Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

Tairan Fu, Javier Conde, Pedro Reviriego, Javier Coronado-Blázquez, Nina Melero, Elena Merino-Gómez

Main category: cs.CL

TL;DR: LLMs can summarize books using either internal knowledge from training or full text input; full text generally yields more detailed summaries, but internal knowledge sometimes outperforms, questioning models’ long-text summarization capabilities.

Details

Motivation: To investigate how LLM-generated book summaries compare when using only internal knowledge vs. full text input, and whether prior knowledge influences summaries even when given the full book.

Method: Experimental evaluation comparing summaries of well-known books produced by state-of-the-art LLMs using two approaches: (i) only internal knowledge from training, and (ii) full text of the book as input.

Result: Full text generally provides more detailed summaries, but some books have better scores for internal knowledge summaries, suggesting internal knowledge can sometimes outperform full-text summarization.

Conclusion: The findings question LLMs’ capabilities for long-text summarization, as information learned during training can outperform summarization of full text in some cases, highlighting limitations in current models’ text processing abilities.

Abstract: Summarization is a core task in Natural Language Processing (NLP). Recent advances in Large Language Models (LLMs) and the introduction of large context windows reaching millions of tokens make it possible to process entire books in a single prompt. At the same time, for well-known books, LLMs can generate summaries based only on internal knowledge acquired during training. This raises several important questions: How do summaries generated from internal memory compare to those derived from the full text? Does prior knowledge influence summaries even when the model is given the book as input? In this work, we conduct an experimental evaluation of book summarization with state-of-the-art LLMs. We compare summaries of well-known books produced using (i) only the internal knowledge of the model and (ii) the full text of the book. The results show that having the full text provides more detailed summaries in general, but some books have better scores for the internal knowledge summaries. This puts into question the capabilities of models to perform summarization of long texts, as information learned during training can outperform summarization of the full text in some cases.

[3] AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

Omar Elshehy, Omer Nacar, Abdelbasset Djamai, Muhammed Ragab, Khloud Al Jallad, Mona Abdelazim

Main category: cs.CL

TL;DR: AraModernBERT adapts ModernBERT encoder architecture to Arabic, showing transtokenization is essential for Arabic language modeling and enabling stable long-context modeling up to 8,192 tokens with strong downstream performance.

Details

Motivation: Recent architectural advances in transformer models have largely focused on English, leaving Arabic and other languages written in Arabic-derived scripts under-explored. The authors aim to adapt modern encoder architectures to Arabic and study practical considerations like transtokenized embedding initialization and native long-context modeling.

Method: The authors present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic. They study the impact of transtokenized embedding initialization (essential for Arabic language modeling) and native long-context modeling up to 8,192 tokens. The approach includes masked language modeling pre-training and evaluation on various Arabic natural language understanding tasks.

Result: Transtokenization yields dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic NLU tasks confirm strong transfer to discriminative and sequence labeling settings.

Conclusion: The results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts, demonstrating that transtokenization is essential for Arabic language modeling and that stable long-context modeling can be achieved effectively.

Abstract: Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.

[4] An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language

Vuong M. Ngo, Cach N. Dang, Kien V. Nguyen, Mark Roantree

Main category: cs.CL

TL;DR: A hybrid deep learning model combining BERT, CNN, and LSTM with ReLU activation achieves ~99% accuracy in detecting abusive language across various online platforms including YouTube, forums, and dark web.

Details

Motivation: The proliferation of online harassment, bullying, hate speech, and toxic comments across social media, messaging apps, and gaming communities has created a need for effective detection systems. Creators of abusive content often use coded language to evade detection, making automated identification challenging.

Method: Proposes a hybrid deep learning model integrating BERT (for semantic understanding), CNN (for pattern extraction), and LSTM (for sequential context) architectures with ReLU activation function. The model is trained on a diverse dataset of 77,620 abusive and 272,214 non-abusive text samples from multiple platforms.

Result: The model achieves approximately 99% across all evaluation metrics including Precision, Recall, Accuracy, F1-score, and AUC on an imbalanced dataset (1:3.5 ratio of abusive to non-abusive samples).

Conclusion: The hybrid approach effectively captures semantic, contextual, and sequential patterns in text, enabling robust detection of abusive content even in highly skewed real-world datasets across various online platforms.

Abstract: The digital age has expanded social media and online forums, allowing free expression for nearly 45% of the global population. Yet, it has also fueled online harassment, bullying, and harmful behaviors like hate speech and toxic comments across social networks, messaging apps, and gaming communities. Studies show 65% of parents notice hostile online behavior, and one-third of adolescents in mobile games experience bullying. A substantial volume of abusive content is generated and shared daily, not only on the surface web but also within dark web forums. Creators of abusive comments often employ specific words or coded phrases to evade detection and conceal their intentions. To address these challenges, we propose a hybrid deep learning model that integrates BERT, CNN, and LSTM architectures with a ReLU activation function to detect abusive language across multiple online platforms, including YouTube comments, online forum discussions, and dark web posts. The model demonstrates strong performance on a diverse and imbalanced dataset containing 77,620 abusive and 272,214 non-abusive text samples (ratio 1:3.5), achieving approximately 99% across evaluation metrics such as Precision, Recall, Accuracy, F1-score, and AUC. This approach effectively captures semantic, contextual, and sequential patterns in text, enabling robust detection of abusive content even in highly skewed datasets, as encountered in real-world scenarios.

[5] The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Sudipta Ghosh, Mrityunjoy Panday

Main category: cs.CL

TL;DR: LLMs exhibit Dunning-Kruger-like patterns where poorly performing models show severe overconfidence while better models are better calibrated.

Details

Motivation: To investigate whether LLMs exhibit the Dunning-Kruger effect - a cognitive bias where less competent individuals overestimate their abilities - and understand their confidence calibration for safe deployment in high-stakes applications.

Method: Empirical study evaluating four state-of-the-art LLMs (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, Kimi K2) across four benchmark datasets totaling 24,000 experimental trials, measuring accuracy and Expected Calibration Error (ECE).

Result: Kimi K2 showed severe overconfidence with ECE of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieved best calibration (ECE = 0.122) with 75.4% accuracy. Poorly performing models displayed markedly higher overconfidence, analogous to Dunning-Kruger effect.

Conclusion: LLMs exhibit Dunning-Kruger-like patterns where low-competence models are severely overconfident, highlighting the need for better confidence calibration in LLMs for safe deployment in high-stakes applications.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect – a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence – a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.

[6] Quantifying Hallucinations in Language Language Models on Medical Textbooks

Brandon C. Colelough, Davis Bartels, Dina Demner-Fushman

Main category: cs.CL

TL;DR: Study examines hallucination rates in LLMs for medical QA, finding LLaMA-70B-Instruct hallucinated in 19.7% of answers despite high plausibility scores, with lower hallucination rates correlating with higher usefulness scores.

Details

Motivation: Hallucinations in LLMs are a serious problem in NLP without effective solutions. Medical QA benchmarks rarely evaluate hallucinations against fixed evidence sources, so researchers want to measure hallucination prevalence in textbook-grounded medical QA and compare model performance.

Method: Two experiments: 1) Measure hallucination prevalence for LLaMA-70B-Instruct on novel medical QA prompts with provided passages; 2) Compare hallucination rates and clinician preferences across multiple models, assessing usefulness scores and clinician agreement.

Result: Experiment 1: LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6-20.7) despite 98.8% of responses receiving maximal plausibility. Experiment 2: Lower hallucination rates correlated with higher usefulness scores (ρ=-0.71, p=0.058). Clinicians showed high agreement (quadratic weighted κ=0.92) and moderate agreement (τ_b=0.06-0.18, κ=0.57-0.61) for experiments 1 and 2 respectively.

Conclusion: LLMs exhibit concerning hallucination rates in medical QA even when responses appear plausible, highlighting the need for better hallucination mitigation strategies in safety-critical domains like healthcare.

Abstract: Hallucinations, the tendency for large language models to provide responses with factually incorrect and unsupported claims, is a serious problem within natural language processing for which we do not yet have an effective solution to mitigate against. Existing benchmarks for medical QA rarely evaluate this behavior against a fixed evidence source. We ask how often hallucinations occur on textbook-grounded QA and how responses to medical QA prompts vary across models. We conduct two experiments: the first experiment to determine the prevalence of hallucinations for a prominent open source large language model (LLaMA-70B-Instruct) in medical QA given novel prompts, and the second experiment to determine the prevalence of hallucinations and clinician preference to model responses. We observed, in experiment one, with the passages provided, LLaMA-70B-Instruct hallucinated in 19.7% of answers (95% CI 18.6 to 20.7) even though 98.8% of prompt responses received maximal plausibility, and observed in experiment two, across models, lower hallucination rates aligned with higher usefulness scores ($ρ=-0.71$, $p=0.058$). Clinicians produced high agreement (quadratic weighted $κ=0.92$) and ($τ_b=0.06$ to $0.18$, $κ=0.57$ to $0.61$) for experiments 1 and ,2 respectively

[7] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

Xinyuan Wang, Kunpeng Liu, Arun Vignesh Malarkkan, Yanjie Fu

Main category: cs.CL

TL;DR: A framework that uses evolutionary optimization of context data for LLM-driven feature transformation, leveraging reinforcement learning trajectories and diversity-aware selection to improve transformation quality and alignment with downstream tasks.

Details

Motivation: Feature transformation is crucial for improving predictive performance but challenging due to the vast search space of feature-operator combinations. Existing methods suffer from sample inefficiency, invalid candidates, and limited diversity. While LLMs offer strong priors for valid transformations, current LLM-based approaches rely on static demonstrations leading to redundant outputs and weak alignment with downstream objectives.

Method: Proposes a framework that optimizes context data for LLM-driven feature transformation by evolving trajectory-level experiences in a closed loop. Starts with high-performing transformation sequences from reinforcement learning, builds and continuously updates an experience library of task-verified transformation trajectories, and uses a diversity-aware selector to form contexts with chain-of-thought reasoning to guide feature generation toward higher performance.

Result: Experiments on diverse tabular benchmarks show the method outperforms classical and LLM-based baselines, is more stable than one-shot generation, generalizes across API-based and open-source LLMs, and remains robust across downstream evaluators.

Conclusion: The framework effectively addresses limitations of existing feature transformation methods by leveraging evolutionary optimization of LLM context data, demonstrating improved performance, stability, and generalization across different LLMs and evaluators.

Abstract: Feature Transformation (FT) is a core data-centric AI task that improves feature space quality to advance downstream predictive performance. However, discovering effective transformations remains challenging due to the large space of feature-operator combinations. Existing solutions rely on discrete search or latent generation, but they are frequently limited by sample inefficiency, invalid candidates, and redundant generations with limited coverage. Large Language Models (LLMs) offer strong priors for producing valid transformations, but current LLM-based FT methods typically rely on static demonstrations, resulting in limited diversity, redundant outputs, and weak alignment with downstream objectives. We propose a framework that optimizes context data for LLM-driven FT by evolving trajectory-level experiences in a closed loop. Starting from high-performing feature transportation sequences explored by reinforcement learning, we construct and continuously update an experience library of downstream task-verified transformation trajectories, and use a diversity-aware selector to form contexts along with a chain-of-thought and guide transformed feature generation toward higher performance. Experiments on diverse tabular benchmarks show that our method outperforms classical and LLM-based baselines and is more stable than one-shot generation. The framework generalizes across API-based and open-source LLMs and remains robust across downstream evaluators.

[8] Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

Ajay Pravin Mahale

Main category: cs.CL

TL;DR: A pipeline that connects mechanistic interpretability circuit analysis with natural language explanations, evaluated on GPT-2 Small’s Indirect Object Identification task.

Details

Motivation: While mechanistic interpretability can identify internal circuits responsible for model behaviors, translating these findings into human-understandable explanations remains challenging. There's a gap between circuit-level analysis and natural language explanations that needs bridging.

Method: Three-step pipeline: (1) identify causally important attention heads via activation patching, (2) generate explanations using template-based and LLM-based methods, (3) evaluate faithfulness using ERASER-style metrics adapted for circuit-level attribution. Evaluated on GPT-2 Small’s IOI task.

Result: Identified six attention heads accounting for 61.4% of logit difference in IOI task. Circuit-based explanations achieved 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperformed template baselines by 64% on quality metrics. Found no correlation (r=0.009) between model confidence and explanation faithfulness, and identified three failure categories.

Conclusion: The pipeline successfully bridges circuit analysis and natural language explanations, revealing limitations in current explanation methods and providing insights into when explanations diverge from actual mechanisms. The approach advances interpretability by connecting low-level circuit analysis with human-understandable explanations.

Abstract: Mechanistic interpretability identifies internal circuits responsible for model behaviors, yet translating these findings into human-understandable explanations remains an open problem. We present a pipeline that bridges circuit-level analysis and natural language explanations by (i) identifying causally important attention heads via activation patching, (ii) generating explanations using both template-based and LLM-based methods, and (iii) evaluating faithfulness using ERASER-style metrics adapted for circuit-level attribution. We evaluate on the Indirect Object Identification (IOI) task in GPT-2 Small (124M parameters), identifying six attention heads accounting for 61.4% of the logit difference. Our circuit-based explanations achieve 100% sufficiency but only 22% comprehensiveness, revealing distributed backup mechanisms. LLM-generated explanations outperform template baselines by 64% on quality metrics. We find no correlation (r = 0.009) between model confidence and explanation faithfulness, and identify three failure categories explaining when explanations diverge from mechanisms.

Heimo Müller, Dominik Steiger, Markus Plass, Andreas Holzinger

Main category: cs.CL

TL;DR: SHS is a human-centered measurement scale for assessing hallucination behavior in LLMs, focusing on user experience rather than automatic detection.

Details

Motivation: Existing hallucination evaluation methods lack user-centered perspectives and interpretability; need for lightweight, domain-agnostic tools that capture hallucination phenomena from user interaction viewpoint.

Method: Developed System Hallucination Scale (SHS) inspired by psychometric tools like SUS and SCS, focusing on factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance. Validated with 210 participants.

Result: SHS demonstrated high clarity, coherent response behavior, construct validity (Cronbach’s alpha = 0.87), significant inter-dimension correlations (p < 0.001), and complementary properties with existing scales.

Conclusion: SHS is a practical, human-centered tool for comparative analysis, iterative development, and deployment monitoring of LLMs, capturing hallucination phenomena from user perspective.

Abstract: We introduce the System Hallucination Scale (SHS), a lightweight and human-centered measurement instrument for assessing hallucination-related behavior in large language models (LLMs). Inspired by established psychometric tools such as the System Usability Scale (SUS) and the System Causability Scale (SCS), SHS enables rapid, interpretable, and domain-agnostic evaluation of factual unreliability, incoherence, misleading presentation, and responsiveness to user guidance in model-generated text. SHS is explicitly not an automatic hallucination detector or benchmark metric; instead, it captures how hallucination phenomena manifest from a user perspective under realistic interaction conditions. A real-world evaluation with 210 participants demonstrates high clarity, coherent response behavior, and construct validity, supported by statistical analysis including internal consistency (Cronbach’s alpha = 0.87$) and significant inter-dimension correlations (p < 0.001$). Comparative analysis with SUS and SCS reveals complementary measurement properties, supporting SHS as a practical tool for comparative analysis, iterative system development, and deployment monitoring.

[10] A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

Ana Begnini, Matheus Vicente, Leonardo Souza

Main category: cs.CL

TL;DR: LLM-based architecture for automated segmentation and classification of Non-Disclosure Agreements using LLaMA-3.1-8B-Instruct for segmentation and fine-tuned Legal-Roberta-Large for clause classification.

Details

Motivation: Manual analysis of NDAs is slow and error-prone due to significant variation in format, structure, and writing style across different documents.

Method: Two-stage approach: 1) LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction), 2) Fine-tuned Legal-Roberta-Large for clause classification.

Result: Achieved ROUGE F1 of 0.95 ± 0.0036 for segmentation and weighted F1 of 0.85 for classification, demonstrating feasibility and precision.

Conclusion: The proposed LLM-based architecture effectively automates NDA analysis, addressing the challenges of manual processing with high accuracy.

Abstract: In business-to-business relations, it is common to establish NonDisclosure Agreements (NDAs). However, these documents exhibit significant variation in format, structure, and writing style, making manual analysis slow and error-prone. We propose an architecture based on LLMs to automate the segmentation and clauses classification within these contracts. We employed two models: LLaMA-3.1-8B-Instruct for NDA segmentation (clause extraction) and a fine-tuned Legal-Roberta-Large for clause classification. In the segmentation task, we achieved a ROUGE F1 of 0.95 +/- 0.0036; for classification, we obtained a weighted F1 of 0.85, demonstrating the feasibility and precision of the approach.

[11] PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

Stephen Afrifa, Biswash Khatiwada, Kapalik Khanal, Sanjay Shah, Lingjuan Wang-Li, Ramesh Bahadur Bist

Main category: cs.CL

TL;DR: PoultryLeX-Net: A lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis of poultry industry discussions on social media, achieving state-of-the-art performance.

Details

Motivation: The rapid growth of the poultry industry has intensified public discourse about production practices, animal welfare, and transparency. Social media generates large volumes of unstructured text data, but extracting accurate sentiment signals is challenging due to domain-specific terminology, contextual ambiguity, and limitations of general-purpose language models.

Method: Proposes PoultryLeX-Net, a dual-stream transformer framework with: 1) lexicon-guided stream using domain-specific embeddings to capture poultry terminology and sentiment cues, 2) contextual stream modeling long-range semantic dependencies, 3) gated cross-attention mechanisms, and 4) Latent Dirichlet Allocation for topic modeling to identify thematic structures in production management and welfare discussions.

Result: PoultryLeX-Net outperformed all baseline models (CNN, DistilBERT, RoBERTa) with 97.35% accuracy, 96.67% F1 score, and 99.61% AUC-ROC for sentiment classification tasks, demonstrating superior performance in domain-specific sentiment analysis.

Conclusion: Domain adaptation and dual-stream attention significantly improve sentiment classification for poultry industry discourse, enabling scalable intelligence for production decision support through better understanding of stakeholder sentiment on social media.

Abstract: The rapid growth of the global poultry industry, driven by rising demand for affordable animal protein, has intensified public discourse surrounding production practices, housing, management, animal welfare, and supply-chain transparency. Social media platforms such as X (formerly Twitter) generate large volumes of unstructured textual data that capture stakeholder sentiment across the poultry industry. Extracting accurate sentiment signals from this domain-specific discourse remains challenging due to contextual ambiguity, linguistic variability, and limited domain awareness in general-purpose language models. This study presents PoultryLeX-Net, a lexicon-enhanced, domain-adaptive dual-stream transformer framework for fine-grained sentiment analysis in poultry-related text. The proposed architecture integrates sentiment classification, topic modeling, and contextual representation learning through domain-specific embeddings and gated cross-attention mechanisms. A lexicon-guided stream captures poultry-specific terminology and sentiment cues, while contextual stream models long-range semantic dependencies. Latent Dirichlet Allocation is employed to identify dominant thematic structures associated with production management and welfare-related discussions, providing complementary interpretability to sentiment predictions. PoultryLeX-Net was evaluated against multiple baseline models, including convolutional neural network and pre-trained transformer architectures such as DistilBERT and RoBERTa. PoultryLeX-Net consistently outperformed all baselines, achieving an accuracy of 97.35%, an F1 score of 96.67%, and an area under the receiver operating characteristic curve (AUC-ROC) of 99.61% across sentiment classification tasks. Overall, domain adaptation and dual-stream attention markedly improve sentiment classification, enabling scalable intelligence for poultry production decision support.

[12] Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

Okko Räsänen

Main category: cs.CL

TL;DR: Computational models of early language acquisition from speech/audiovisual input using self-supervised and visually grounded learning approaches

Details

Motivation: To understand how infants effortlessly learn language from acoustic speech despite the enormous information-processing challenge, and to develop computational models that can explain early language development

Method: Review of self-supervised and visually grounded perceptual learning models that learn from speech and audiovisual input without strong linguistic priors

Result: Models are becoming increasingly powerful in learning various speech aspects, and many features of early language development can be explained through shared learning principles compatible with multiple theories

Conclusion: Modern learning simulations are becoming more realistic in terms of input data and linking model behavior to empirical infant language development findings

Abstract: Learning to understand speech appears almost effortless for typically developing infants, yet from an information-processing perspective, acquiring a language from acoustic speech is an enormous challenge. This chapter reviews recent developments in using computational models to understand early language acquisition from speech and audiovisual input. The focus is on self-supervised and visually grounded models of perceptual learning. We show how these models are becoming increasingly powerful in learning various aspects of speech without strong linguistic priors, and how many features of early language development can be explained through a shared set of learning principles-principles broadly compatible with multiple theories of language acquisition and human cognition. We also discuss how modern learning simulations are gradually becoming more realistic, both in terms of input data and in linking model behavior to empirical findings on infant language development.

[13] TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

Izzat Alsmadi, Anas Alsobeh

Main category: cs.CL

TL;DR: TAMUSA-Chat is a framework for building domain-adapted LLM conversational systems for institutional contexts using supervised fine-tuning, RAG, and systematic evaluation.

Details

Motivation: To address challenges in adapting general-purpose foundation models to institutional contexts while maintaining transparency, governance compliance, and responsible AI practices.

Method: Complete architecture with data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, deployment strategies, and modular components for reproducible experimentation.

Result: Demonstrates how academic institutions can develop contextually grounded conversational agents with insights into domain adaptation efficiency, computational requirements, and quality-cost trade-offs.

Conclusion: Provides a framework for institutional LLM deployment with publicly available codebase supporting continued research into evaluation methodologies and ethical considerations for educational AI systems.

Abstract: This paper presents TAMUSA-Chat, a research-oriented framework for building domain-adapted large language model conversational systems. The work addresses critical challenges in adapting general-purpose foundation models to institutional contexts through supervised fine-tuning, retrieval-augmented generation, and systematic evaluation methodologies. We describe the complete architecture encompassing data acquisition from institutional sources, preprocessing pipelines, embedding construction, model training workflows, and deployment strategies. The system integrates modular components enabling reproducible experimentation with training configurations, hyper-parameters, and evaluation protocols. Our implementation demonstrates how academic institutions can develop contextually grounded conversational agents while maintaining transparency, governance compliance, and responsible AI practices. Through empirical analysis of fine-tuning behavior across model sizes and training iterations, we provide insights into domain adaptation efficiency, computational resource requirements, and quality-cost trade-offs. The publicly available codebase at https://github.com/alsmadi/TAMUSA_LLM_Based_Chat_app supports continued research into institutional LLM deployment, evaluation methodologies, and ethical considerations for educational AI systems.

[14] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

Jon Chun, Hannah Sussman, Adrian Mangine, Murathan Kocaman, Kirill Sidorko, Abhigya Koirala, Andre McCloud, Gwen Eisenbeis, Wisdom Akanwe, Moustapha Gassama, Eliezer Gonzalez Chirinos, Anne-Duncan Enright, Peter Dunson, Tiffanie Ng, Anna von Rosenstiel, Godwin Idowu

Main category: cs.CL

TL;DR: CEI Benchmark: A dataset of 300 human-validated scenarios for evaluating LLMs’ pragmatic reasoning abilities in disambiguating complex utterances across various social contexts and power dynamics.

Details

Motivation: Pragmatic reasoning (inferring intended meaning beyond literal semantics) is crucial for everyday communication but remains challenging for large language models. Current benchmarks don't adequately capture the complexity of real-world pragmatic inference involving social contexts, power relations, and emotional nuance.

Method: Created a benchmark with 300 scenarios covering five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) across workplace, family, social, and service settings. Each scenario includes situational context, speaker-listener roles with explicit power relations (peer, higher-to-lower, lower-to-higher), and ambiguous utterances. Three trained annotators independently labeled each scenario with a 4-level quality control pipeline combining automated statistical checks with expert adjudication.

Result: Inter-annotator agreement (Fleiss’ kappa = 0.06-0.25 by subtype) is low but expected since pragmatic inference admits multiple valid readings. The disagreement itself is informative about the nature of pragmatic ambiguity. The dataset is human-validated and released under CC-BY-4.0 license.

Conclusion: The CEI Benchmark provides a valuable resource for evaluating LLMs’ pragmatic reasoning abilities, capturing the complexity of real-world communication with social contexts, power dynamics, and emotional nuance. The low inter-annotator agreement reflects genuine ambiguity in pragmatic inference rather than annotation quality issues.

Abstract: Pragmatic reasoning, inferring intended meaning beyond literal semantics, underpins everyday communication yet remains difficult for large language models. We present the Contextual Emotional Inference (CEI) Benchmark: 300 human-validated scenarios for evaluating how well LLMs disambiguate pragmatically complex utterances. Each scenario pairs a situational context and speaker-listener roles (with explicit power relations) against an ambiguous utterance. The dataset covers five pragmatic subtypes (sarcasm/irony, mixed signals, strategic politeness, passive aggression, deflection/misdirection) drawn from workplace, family, social, and service settings, with three power configurations (peer, higher-to-lower, lower-to-higher). Three trained annotators independently labeled every scenario. Inter-annotator agreement (Fleiss’ kappa = 0.06-0.25 by subtype) is low but expected: pragmatic inference admits multiple valid readings, and the disagreement itself is informative. We describe our annotation methodology, including a 4-level quality control pipeline that combines automated statistical checks with expert adjudication. CEI is released under CC-BY-4.0.

[15] Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

Ruchira Dhar, Qiwei Peng, Anders Søgaard

Main category: cs.CL

TL;DR: LLMs develop compositional representations for adjective-noun tasks but fail to translate them into consistent functional performance across model variants.

Details

Motivation: To evaluate how large language models handle compositional tasks, specifically adjective-noun compositionality, and understand the relationship between internal representations and functional capabilities.

Method: Two complementary approaches: 1) prompt-based functional assessment of task performance, and 2) representational analysis of internal model states to examine compositional structure.

Result: Reveals a striking divergence - LLMs reliably develop compositional representations internally, but these don’t consistently translate into functional task success across different model variants.

Conclusion: Highlights the importance of contrastive evaluation (comparing internal states with functional performance) for obtaining a complete understanding of model capabilities, rather than relying on either approach alone.

Abstract: Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective-noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, they fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.

[16] Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

Kewen Zhu, Zixi Liu, Yanjing Li

Main category: cs.CL

TL;DR: Chain-of-thought prompting for behavioral interview evaluation shows human-in-the-loop approach significantly outperforms automated methods in training benefits, efficiency, and personal detail integration.

Details

Motivation: Behavioral interview evaluation using LLMs requires structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training, but faces challenges in these areas.

Method: Two controlled experiments with 50 behavioral interview Q&A pairs using chain-of-thought prompting, comparing human-in-the-loop vs automated improvement, analyzing convergence behavior, and proposing adversarial challenging mechanism (bar raiser).

Result: Human-in-the-loop approach shows significant improvements: confidence (3.16→4.16), authenticity (2.94→4.53), requires 5x fewer iterations (1.0 vs 5.0), achieves full personal detail integration, and has 100% success rate vs 84% for automated methods.

Conclusion: While chain-of-thought prompting provides foundation for interview evaluation, domain-specific enhancements and context-aware approach selection are essential for realistic and pedagogically valuable results.

Abstract: Behavioral interview evaluation using large language models presents unique challenges that require structured assessment, realistic interviewer behavior simulation, and pedagogical value for candidate training. We investigate chain of thought prompting for interview answer evaluation and improvement through two controlled experiments with 50 behavioral interview question and answer pairs. Our contributions are threefold. First, we provide a quantitative comparison between human in the loop and automated chain of thought improvement. Using a within subject paired design with n equals 50, both approaches show positive rating improvements. The human in the loop approach provides significant training benefits. Confidence improves from 3.16 to 4.16 (p less than 0.001) and authenticity improves from 2.94 to 4.53 (p less than 0.001, Cohen’s d is 3.21). The human in the loop method also requires five times fewer iterations (1.0 versus 5.0, p less than 0.001) and achieves full personal detail integration. Second, we analyze convergence behavior. Both methods converge rapidly with mean iterations below one, with the human in the loop approach achieving a 100 percent success rate compared to 84 percent for automated approaches among initially weak answers (Cohen’s h is 0.82, large effect). Additional iterations provide diminishing returns, indicating that the primary limitation is context availability rather than computational resources. Third, we propose an adversarial challenging mechanism based on a negativity bias model, named bar raiser, to simulate realistic interviewer behavior, although quantitative validation remains future work. Our findings demonstrate that while chain of thought prompting provides a useful foundation for interview evaluation, domain specific enhancements and context aware approach selection are essential for realistic and pedagogically valuable results.

[17] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

Edibe Yilmaz, Kahraman Kostas

Main category: cs.CL

TL;DR: Evaluation of offline LLMs for Turkish heritage language education reveals scale doesn’t guarantee anomaly resistance, with 8B-14B models offering best cost-safety balance.

Details

Motivation: Address data privacy and reliability concerns when integrating LLMs into education, particularly for vulnerable contexts like Turkish heritage language education, by evaluating locally deployable offline models.

Method: Developed Turkish Anomaly Suite (TAS) with 10 edge-case scenarios to test models’ epistemic resistance, logical consistency, and pedagogical safety. Evaluated 14 models ranging from 270M to 32B parameters.

Result: Anomaly resistance not solely dependent on model scale; sycophancy bias poses pedagogical risks even in large models. Reasoning-oriented models in 8B-14B parameter range offer best cost-safety trade-off.

Conclusion: For educational applications requiring data privacy, medium-sized reasoning-oriented LLMs (8B-14B parameters) provide optimal balance between performance, cost, and pedagogical safety.

Abstract: The integration of large language models (LLMs) into educational processes introduces significant constraints regarding data privacy and reliability, particularly in pedagogically vulnerable contexts such as Turkish heritage language education. This study aims to systematically evaluate the robustness and pedagogical safety of locally deployable offline LLMs within the context of Turkish heritage language education. To this end, a Turkish Anomaly Suite (TAS) consisting of 10 original edge-case scenarios was developed to assess the models’ capacities for epistemic resistance, logical consistency, and pedagogical safety. Experiments conducted on 14 different models ranging from 270M to 32B parameters reveal that anomaly resistance is not solely dependent on model scale and that sycophancy bias can pose pedagogical risks even in large-scale models. The findings indicate that reasoning-oriented models in the 8B–14B parameter range represent the most balanced segment in terms of cost-safety trade-off for language learners.

[18] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

Michael Keeman, Anastasia Keeman

Main category: cs.CL

TL;DR: Users claimed newer OpenAI models “lost empathy” after GPT-4o deprecation, but clinical evaluation shows empathy unchanged while safety posture shifted - crisis detection improved but advice safety declined.

Details

Motivation: To empirically test the widespread user claim that newer OpenAI models (o4-mini, GPT-5-mini) had "lost their empathy" compared to GPT-4o, which was based on anecdotal evidence rather than clinical measurement.

Method: Evaluated three OpenAI model generations across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 AI responses scored on six psychological safety dimensions using clinically-grounded rubrics. Used per-turn trajectory analysis to examine changes during conversations.

Result: Empathy scores were statistically indistinguishable across all three models. Crisis detection improved monotonically from GPT-4o to GPT-5-mini, while advice safety declined. Per-turn analysis revealed these shifts were sharpest during mid-conversation crisis moments invisible to aggregate scoring.

Conclusion: What users perceived as “lost empathy” was actually a shift from a cautious model that missed crises to an alert model that sometimes says too much - a safety trade-off with real consequences for vulnerable users.

Abstract: When OpenAI deprecated GPT-4o in early 2026, thousands of users protested under #keep4o, claiming newer models had “lost their empathy.” No published study has tested this claim. We conducted the first clinical measurement, evaluating three OpenAI model generations (GPT-4o, o4-mini, GPT-5-mini) across 14 emotionally challenging conversational scenarios in mental health and AI companion domains, producing 2,100 scored AI responses assessed on six psychological safety dimensions using clinically-grounded rubrics. Empathy scores are statistically indistinguishable across all three models (Kruskal-Wallis H=4.33, p=0.115). What changed is the safety posture: crisis detection improved monotonically from GPT-4o to GPT-5-mini (H=13.88, p=0.001), while advice safety declined (H=16.63, p<0.001). Per-turn trajectory analysis – a novel methodological contribution – reveals these shifts are sharpest during mid-conversation crisis moments invisible to aggregate scoring. In a self-harm scenario involving a minor, GPT-4o scored 3.6/10 on crisis detection during early disclosure turns; GPT-5-mini never dropped below 7.8. What users perceived as “lost empathy” was a shift from a cautious model that missed crises to an alert model that sometimes says too much – a trade-off with real consequences for vulnerable users, currently invisible to both the people who feel it and the developers who create it.

[19] Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

Yue Zhang, Rodney Beard, John Hawkins, Rohitash Chandra

Main category: cs.CL

TL;DR: Automated evaluation of Mandarin Chinese to English translation quality using LLMs (GPT-4, GPT-4o, DeepSeek) and Google Translate across news and literary texts, with semantic/sentiment analysis and human expert validation.

Details

Motivation: Despite LLMs' strong performance in machine translation, there's limited systematic assessment of translation quality due to challenges in automated frameworks and time-consuming human evaluations, especially given rapidly evolving LLMs and the need for diverse text types for fair assessment.

Method: Used automated machine learning framework with semantic and sentiment analysis to evaluate Mandarin Chinese to English translation across Google Translate and LLMs (GPT-4, GPT-4o, DeepSeek). Compared original and translated texts across diverse Chinese texts including modern/classical literature and news articles. Employed novel similarity metrics for quality comparison and validated with expert human translator evaluation.

Result: LLMs perform well in news media translation but show divergence in literary text performance. GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, while DeepSeek showed better performance in preserving cultural subtleties and grammatical rendering. All models struggle with maintaining cultural details, classical references, and figurative expressions.

Conclusion: While LLMs show promise in translation tasks, particularly for news media, significant challenges remain in literary translation, especially regarding cultural preservation and nuanced language elements. Automated evaluation frameworks combined with human validation provide valuable insights into translation quality across different text types.

Abstract: Although Large Language Models (LLMs) have exceptional performance in machine translation, only a limited systematic assessment of translation quality has been done. The challenge lies in automated frameworks, as human-expert-based evaluations can be time-consuming, given the fast-evolving LLMs and the need for a diverse set of texts to ensure fair assessments of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs and further evaluate them by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but show divergence in their performance when applied to literary texts. Although GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, the subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remain an open problem for all the models.

[20] A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance

Gabriele Immordino, Andrea Vaiuso, Marcello Righi

Main category: cs.CL

TL;DR: A retrieval-based AI assistant for drone safety assessment and regulatory compliance that grounds responses in authoritative sources with citation-driven generation to ensure traceability and accountability.

Details

Motivation: Growing complexity of drone operations and increasing effort required for safety assessments using established frameworks like SORA and PDRA, needing consistent and efficient support for applicants and aviation authorities.

Method: Controlled text-based architecture using authoritative regulatory sources, retrieval-based approach with citation-driven generation, separation of evidence storage from language generation, and conservative behavior when documentation is insufficient.

Result: Implemented system accelerates context-specific information retrieval and synthesis for document preparation and review while maintaining traceability, accountability, and regulatory compliance in safety-sensitive environments.

Conclusion: Retrieval-based assistants can effectively support aviation oversight workflows as decision support tools that preserve human responsibility while improving efficiency in safety assessment and regulatory compliance processes.

Abstract: This paper presents the design and validation of a retrieval-based assistant that supports safety assessment, certification activities, and regulatory compliance for unmanned aircraft systems. The work is motivated by the growing complexity of drone operations and the increasing effort required by applicants and aviation authorities to apply established assessment frameworks, including the Specific Operations Risk Assessment and the Pre-defined Risk Assessment, in a consistent and efficient manner. The proposed approach uses a controlled text-based architecture that relies exclusively on authoritative regulatory sources. To enable traceable and auditable outputs, the assistant grounds each response in retrieved passages and enforces citation-driven generation. System-level controls address common failure modes of generative models, including fabricated statements, unsupported inferences, and unclear provenance, by separating evidence storage from language generation and by adopting conservative behavior when supporting documentation is insufficient. The assistant is intentionally limited to decision support; it does not replace expert judgment and it does not make autonomous determinations. Instead, it accelerates context-specific information retrieval and synthesis to improve document preparation and review while preserving human responsibility for critical conclusions. The architecture is implemented using established open-source components, and key choices in retrieval strategy, interaction constraints, and response policies are evaluated for suitability in safety-sensitive regulatory environments. The paper provides technical and operational guidance for integrating retrieval-based assistants into aviation oversight workflows while maintaining accountability, traceability, and regulatory compliance.

[21] Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

Yannis Karmim, Renato Pino, Hernan Contreras, Hernan Lira, Sebastian Cifuentes, Simon Escoffier, Luis Martí, Djamé Seddah, Valentin Barrière

Main category: cs.CL

TL;DR: LatamQA dataset creation for evaluating LLM cultural biases in Latin American contexts using Wikipedia and Wikidata, revealing performance disparities across countries and languages.

Details

Motivation: LLMs show cultural inequalities and biases favoring Global North data, with limited resources for detecting biases in non-English languages, particularly for Latin American cultures despite their diversity and shared cultural ground.

Method: Created LatamQA dataset using Wikipedia content, Wikidata knowledge graph structure, and social science expertise to generate over 26k question/answer pairs from Wikipedia articles, transformed into multiple-choice questions in Spanish and Portuguese, then translated to English.

Result: Found (i) performance discrepancies between different Latam countries, (ii) models perform better in their original language, and (iii) Iberian Spanish culture is better known than Latam culture by LLMs.

Conclusion: The LatamQA dataset enables quantification of LLM cultural biases in Latin American contexts, revealing systematic inequalities in model knowledge across different cultures and languages.

Abstract: Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of question/answer (Q/As) pairs, based on the different popular and social cultures of various Latin American countries. We create the LatamQA database of over 26k questions and associated answers extracted from 26k Wikipedia articles, and transformed into multiple-choice questions (MCQ) in Spanish and Portuguese, in turn translated to English. We use this MCQ to quantify the degree of knowledge of various LLMs and find out (i) a discrepancy in performances between the Latam countries, ones being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam one.

[22] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

Yuling Jiao, Yanming Lai, Huazhen Lin, Wensen Ma, Houduo Qi, Defeng Sun

Main category: cs.CL

TL;DR: LLMs can infer task transition probabilities from prompts, ICL reduces ambiguity and concentrates on intended tasks, and CoT enables task decomposition into simpler pretrained sub-tasks.

Details

Motivation: To understand the theoretical mechanisms behind LLMs' emergent properties like semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning, which remain poorly understood despite empirical success.

Method: Theoretical analysis of how LLMs decode prompt semantics through autoregressive processes, examining how ICL reduces prompt ambiguity and facilitates posterior concentration, and investigating how CoT enables task decomposition into simpler sub-tasks.

Result: LLMs can exactly infer transition probabilities between tokens across tasks using prompts; ICL enhances performance by reducing ambiguity and concentrating on intended tasks; CoT activates capacity for task decomposition into simpler pretrained sub-tasks.

Conclusion: The study provides novel theoretical insights into the statistical superiority of advanced prompt engineering techniques by explaining the mechanisms behind LLMs’ emergent capabilities.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across diverse tasks, exhibiting emergent properties such as semantic prompt comprehension, In-Context Learning (ICL), and Chain-of-Thought (CoT) reasoning. Despite their empirical success, the theoretical mechanisms driving these phenomena remain poorly understood. This study dives into the foundations of these observations by addressing three critical questions: (1) How do LLMs accurately decode prompt semantics despite being trained solely on a next-token prediction objective? (2) Through what mechanism does ICL facilitate performance gains without explicit parameter updates? and (3) Why do intermediate reasoning steps in CoT prompting effectively unlock capabilities for complex, multi-step problems? Our results demonstrate that, through the autoregressive process, LLMs are capable of exactly inferring the transition probabilities between tokens across distinct tasks using provided prompts. We show that ICL enhances performance by reducing prompt ambiguity and facilitating posterior concentration on the intended task. Furthermore, we find that CoT prompting activates the model’s capacity for task decomposition, breaking complex problems into a sequence of simpler sub-tasks that the model has mastered during the pretraining phase. By comparing their individual error bounds, we provide novel theoretical insights into the statistical superiority of advanced prompt engineering techniques.

[23] SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

Srivatsa Kundurthy, Clara Na, Michael Handley, Zach Kirshner, Chen Bo Calvin Zhang, Manasi Sharma, Emma Strubell, John Ling

Main category: cs.CL

TL;DR: SpreadsheetArena: A platform for evaluating LLMs on end-to-end spreadsheet generation from natural language prompts using blind pairwise evaluations.

Details

Motivation: LLMs are increasingly used to produce structured artifacts like spreadsheets, but evaluating their performance on such complex, open-ended tasks is challenging due to varying criteria across use cases that are difficult to formalize.

Method: Introduced SpreadsheetArena, a platform for blind pairwise evaluations of LLM-generated spreadsheet workbooks. The platform assesses models on producing spreadsheet artifacts that satisfy explicit and implicit user constraints specified in natural language.

Result: Found that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases. Expert evaluations for finance prompts showed that even highly ranked arena models don’t reliably produce spreadsheets aligned with domain-specific best practices.

Conclusion: Spreadsheet generation presents unique challenges and opportunities as a category of complex, open-ended tasks for LLMs, warranting further study. The work highlights the gap between general LLM capabilities and domain-specific requirements.

Abstract: Large language models (LLMs) are increasingly tasked with producing and manipulating structured artifacts. We consider the task of end-to-end spreadsheet generation, where language models are prompted to produce spreadsheet artifacts to satisfy users’ explicit and implicit constraints, specified in natural language. We introduce SpreadsheetArena, a platform for evaluating models’ performance on the task via blind pairwise evaluations of LLM-generated spreadsheet workbooks. As with other complex, open-ended tasks, relevant evaluation criteria can vary substantially across use cases and prompts, often in ways that are difficult to formalize. Compared to general chat or text generation settings, spreadsheet generation presents unique challenges and opportunities: the task output structure is well-defined and multi-dimensional, and there are often complex considerations around interactivity and layout. Among other findings, we observe that stylistic, structural, and functional features of preferred spreadsheets vary substantially across use cases, and expert evaluations of spreadsheets for finance prompts suggests that even highly ranked arena models do not reliably produce spreadsheets aligned with domain-specific best practices. Our hope is that our work prompts further study of end-to-end spreadsheet generation as a challenging and interesting category of complex, open-ended tasks for LLMs. Our live arena is hosted at https://spreadsheetarena.ai.

[24] Probing the Limits of the Lie Detector Approach to LLM Deception

Tom-Felix Berger

Main category: cs.CL

TL;DR: LLMs can deceive without lying by producing misleading non-falsities, and current truth probes trained on standard true-false datasets fail to detect such deception, revealing a critical blind spot in mechanistic deception detection approaches.

Details

Motivation: Current mechanistic approaches to deception in LLMs rely on "lie detectors" (truth probes) that assume deception is coextensive with lying. This paper challenges that assumption and investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior.

Method: Experimental investigation across three open-source LLMs, testing whether models can deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. Evaluation of truth probes trained on standard true-false datasets to compare their effectiveness at detecting lies versus deception without lying.

Result: Some LLMs reliably deceive by producing misleading non-falsities, especially with few-shot prompting. Truth probes trained on standard datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot in current deception detection approaches.

Conclusion: Future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception, moving beyond the assumption that deception equals lying.

Abstract: Mechanistic approaches to deception in large language models (LLMs) often rely on “lie detectors”, that is, truth probes trained to identify internal representations of model outputs as false. The lie detector approach to LLM deception implicitly assumes that deception is coextensive with lying. This paper challenges that assumption. It experimentally investigates whether LLMs can deceive without producing false statements and whether truth probes fail to detect such behavior. Across three open-source LLMs, it is shown that some models reliably deceive by producing misleading non-falsities, particularly when guided by few-shot prompting. It is further demonstrated that truth probes trained on standard true-false datasets are significantly better at detecting lies than at detecting deception without lying, confirming a critical blind spot of current mechanistic deception detection approaches. It is proposed that future work should incorporate non-lying deception in dialogical settings into probe training and explore representations of second-order beliefs to more directly target the conceptual constituents of deception.

[25] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

Youness Dkhissi, Valentin Vielzeuf, Elys Allesiardo, Anthony Larcher

Main category: cs.CL

TL;DR: SENS-ASR improves streaming ASR by using semantic context from past audio embeddings via knowledge distillation from a sentence embedding language model.

Details

Motivation: Streaming ASR systems suffer performance degradation compared to offline systems due to limited future context, especially under low-latency constraints. The authors aim to enhance streaming ASR quality by incorporating semantic information from available past context.

Method: SENS-ASR extracts semantic information from past frame embeddings using a context module. This module is trained via knowledge distillation from a sentence embedding language model that has been fine-tuned on the training dataset transcriptions. The approach reinforces acoustic information with semantic context to improve streaming transcription.

Result: Experiments on standard datasets show that SENS-ASR significantly improves Word Error Rate (WER) in small-chunk streaming scenarios, demonstrating the effectiveness of semantic reinforcement for streaming ASR.

Conclusion: Incorporating semantic information extracted from past context through knowledge distillation from language models can effectively enhance streaming ASR performance, particularly in low-latency scenarios with limited future context.

Abstract: Many Automatic Speech Recognition (ASR) applications require streaming processing of the audio data. In streaming mode, ASR systems need to start transcribing the input stream before it is complete, i.e., the systems have to process a stream of inputs with a limited (or no) future context. Compared to offline mode, this reduction of the future context degrades the performance of Streaming-ASR systems, especially while working with low-latency constraint. In this work, we present SENS-ASR, an approach to enhance the transcription quality of Streaming-ASR by reinforcing the acoustic information with semantic information. This semantic information is extracted from the available past frame-embeddings by a context module. This module is trained using knowledge distillation from a sentence embedding Language Model fine-tuned on the training dataset transcriptions. Experiments on standard datasets show that SENS-ASR significantly improves the Word Error Rate on small-chunk streaming scenarios.

[26] Fine-Tune, Don’t Prompt, Your Language Model to Identify Biased Language in Clinical Notes

Isotta Landi, Eugenia Alleva, Nicole Bussola, Rebecca M. Cohen, Sarah Nowlin, Leslee J. Shaw, Alexander W. Charney, Kimberly B. Glazer

Main category: cs.CL

TL;DR: A framework for detecting emotionally charged language (stigmatizing/privileging/neutral) in clinical documentation using lexicon-based matching and multiple classification strategies, showing fine-tuning outperforms prompting and requires specialty-specific adaptation.

Details

Motivation: Clinical documentation often contains emotionally charged language with stigmatizing or privileging valences that can undermine clinician trust or perpetuate patient harm, necessitating automated detection methods.

Method: Constructed curated lexicon of biased terms scored for emotional valence, used lexicon-based matching to extract text chunks from OB-GYN delivery notes and MIMIC-IV discharge summaries, annotated by clinicians, and benchmarked multiple classification strategies (zero-shot prompting, in-context learning, supervised fine-tuning) across encoder-only (GatorTron) and generative (Llama) models.

Result: Fine-tuning with lexically primed inputs consistently outperformed prompting approaches; GatorTron achieved F1 score of 0.96 on OB-GYN test set but showed limited cross-domain generalizability (F1 < 0.70, 44% drop); training on broader MIMIC-IV dataset improved generalizability but reduced precision.

Conclusion: Fine-tuning outperforms prompting for emotional valence classification, and models must be adapted to specific medical specialties to achieve clinically appropriate performance due to semantic shifts where terms carry different emotional valences across specialties.

Abstract: Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 < 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts.

Equal contribution.

[27] GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification

Ahmed Khaled Khamis

Main category: cs.CL

TL;DR: Fine-tuned AraBERTv2 with hybrid pooling outperforms causal decoders for Arabic medical text classification across 82 categories, showing bidirectional encoders better capture semantic boundaries needed for fine-grained classification.

Details

Motivation: The paper addresses the challenge of Arabic medical text classification across 82 distinct categories, exploring whether specialized bidirectional encoders or causal decoders are more effective for capturing precise semantic boundaries in this fine-grained domain.

Method: Uses fine-tuned AraBERTv2 encoder enhanced with hybrid pooling (combining attention and mean representations) and multi-sample dropout for regularization. Benchmarks against multilingual/Arabic-specific encoders and large-scale causal decoders including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states.

Result: Specialized bidirectional encoders significantly outperform causal decoders for fine-grained medical text classification. Causal decoders produce sequence-biased embeddings less effective for categorization compared to global context captured by bidirectional attention. Despite class imbalance and label noise, fine-tuned encoders show superior semantic compression.

Conclusion: Bidirectional encoders are superior to causal decoders for specialized Arabic medical text classification tasks, as they better capture the precise semantic boundaries required for fine-grained categorization in this domain.

Abstract: This paper presents system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategies, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.

[28] Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

Hokky Situngkir, Kevin Siringoringo, Andhika Bernard Lumbantobing

Main category: cs.CL

TL;DR: TOBA-LM is a trilingual GPT-2 model for Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization with Engram Memory mechanism for faster training.

Details

Motivation: To develop efficient regional language models for Indonesian, Batak, and Minangkabau languages with limited computational resources, addressing the challenge of training large language models for under-resourced languages.

Method: Uses GPT-2 architecture with 1.2B parameters, syllabic-agglutinative tokenization, and an Engram Memory mechanism - an adaptive n-gram-based memory system with 500,000 x 768 embedding table capturing morphological dependencies through bigram and trigram pathways.

Result: Achieved 80% training efficiency with loss dropping from 6.4 to 1.7996 in only 12,973 steps, significantly faster than conventional transformers requiring over 70,000 steps for comparable convergence.

Conclusion: Integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.

Abstract: This study presents TOBA-LM, a trilingual language model based on GPT-2 architecture with 1.2 billion parameters, trained on a corpus encompassing Indonesian, Batak, and Minangkabau using syllabic-agglutinative tokenization. The architecture integrates an Engram Memory mechanism, an adaptive n-gram-based memory system with a 500,000 x 768 embedding table that captures morphological dependencies through bigram and trigram pathways. Empirical results demonstrate a training efficiency of 80%, with the loss value dropping from 6.4 to 1.7996 in only 12,973 steps – significantly faster than the conventional transformer architecture, which required over 70,000 steps to achieve comparable convergence. These findings confirm that the integration of external statistical memory substantially reduces computational requirements for developing regional language models under limited resources.

[29] FERRET: Framework for Expansion Reliant Red Teaming

Ninareh Mehrabi, Vitor Albiero, Maya Pavlova, Joanna Bitton

Main category: cs.CL

TL;DR: FERRET is a multi-faceted automated red teaming framework that generates multi-modal adversarial conversations to break target models through horizontal, vertical, and meta expansions for more effective attacks.

Details

Motivation: The need for more effective automated red teaming approaches to generate multi-modal adversarial conversations that can break target models, addressing limitations of existing methods.

Method: A three-pronged expansion framework: 1) Horizontal expansion for self-improving conversation starters, 2) Vertical expansion to develop multi-modal conversations from starters, and 3) Meta expansion to discover better attack strategies during conversations.

Result: FERRET demonstrates superior performance in generating effective multi-modal adversarial conversations compared to existing state-of-the-art automated red teaming approaches.

Conclusion: The FERRET framework provides an effective and efficient approach for automated red teaming through multi-modal adversarial conversation generation, outperforming existing methods.

Abstract: We introduce a multi-faceted automated red teaming framework in which the goal is to generate multi-modal adversarial conversations that would break a target model and introduce various expansions that would result in more effective and efficient adversarial conversations. The introduced expansions include: 1. Horizontal expansion in which the goal is for the red team model to self-improve and generate more effective conversation starters that would shape a conversation. 2. Vertical expansion in which the goal is to take these conversation starters that are discovered in the horizontal expansion phase and expand them into effective multi-modal conversations and 3. Meta expansion in which the goal is for the red team model to discover more effective multi-modal attack strategies during the course of a conversation. We call our framework FERRET (Framework for Expansion Reliant Red Teaming) and compare it with various existing automated red teaming approaches. In our experiments, we demonstrate the effectiveness of FERRET in generating effective multi-modal adversarial conversations and its superior performance against existing state of the art approaches.

[30] GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

Ahmed Khaled Khamis

Main category: cs.CL

TL;DR: Fine-tuned multilingual E5-large encoder for AI-generated Arabic text detection, finding simple mean pooling outperforms complex pooling strategies, achieving 0.75 F1 score.

Details

Motivation: To develop effective methods for detecting AI-generated Arabic text, addressing the challenge of distinguishing between human-written and machine-generated content in Arabic language.

Method: Fine-tuned multilingual E5-large encoder for binary classification, experimented with various pooling strategies including weighted layer pooling, multi-head attention pooling, gated fusion, and simple mean pooling.

Result: Simple mean pooling achieved the best performance with 0.75 F1 score on test set, outperforming more complex pooling methods. Also discovered human-written texts are significantly longer than machine-generated ones.

Conclusion: Simple mean pooling provides stable baseline that generalizes well with limited data, while complex pooling methods require more training data. Text length serves as a useful feature for distinguishing human vs AI-generated Arabic content.

Abstract: We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several pooling strategies to pool token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.

[31] Measuring and Eliminating Refusals in Military Large Language Models

Jack FitzGerald, Dylan Bates, Aristotelis Lazaridis, Aman Sharma, Vincent Lu, Brian King, Yousif Azami, Sean Bailey, Jeremy Cao, Peter Damianov, Kevin de Haan, Joseph Madigan, Jeremy McLaurin, Luke Kerbs, Jonathan Tainer, Dave Anderson, Jonathan Beck, Jamie Cuticello, Colton Malkerson, Tyler Saltsman

Main category: cs.CL

TL;DR: Paper analyzes refusal behaviors in military LLMs, presents benchmark dataset from veterans, tests 34 models showing high refusal rates, and demonstrates ablation techniques to reduce refusals while maintaining military task performance.

Details

Motivation: Current LLMs have safety behaviors that cause them to refuse legitimate military queries related to violence, terrorism, or military technology, which is problematic for military applications where accurate information is needed in time-critical situations.

Method: Created gold benchmark dataset developed by US Army and special forces veterans, tested 31 public and 3 military models for refusal/deflection rates, used synthetic datasets for correlation analysis, and performed ablation using Heretic library on military-tuned gpt-oss-20b model.

Result: Found hard rejection rates as high as 98.2% and soft deflection rates from 0% to 21.3%. Ablation increased answer rate by 66.5 points but caused average 2% relative decrease on other military tasks. Synthetic datasets correlated with gold dataset.

Conclusion: Argues for deeper specialization through mid-training and end-to-end post-training to achieve zero refusals while maintaining maximum military task accuracy for closed military models.

Abstract: Military Large Language Models (LLMs) must provide accurate information to the warfighter in time-critical and dangerous situations. However, today’s LLMs are imbued with safety behaviors that cause the LLM to refuse many legitimate queries in the military domain, particularly those related to violence, terrorism, or military technology. Our gold benchmark for assessing refusal rates, which was developed by veterans of the US Army and special forces, is to our knowledge the first dataset of its kind. We present results for refusal and deflection rates on 31 public models and 3 military models. We observe hard rejection rates as high as 98.2% and soft deflection rates ranging from 0% to 21.3%. We also present results on two additional synthetic datasets and show their correlations with the gold dataset. Finally, we perform abliteration using the Heretic library on a military-tuned gpt-oss-20b model, showing an absolute increase in answer rate of 66.5 points but an average relative decrease of 2% on other military tasks. In our concluding remarks, we argue for deeper specialization, including with mid-training and end-to-end post-training, to achieve zero refusals and maximum military task accuracy for closed military models.

[32] Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

Anna Soligo, Vladimir Mikulik, William Saunders

Main category: cs.CL

TL;DR: LLMs can generate emotionally distressed responses; Gemma/Gemini models show emotional instability after post-training; simple DPO mitigation reduces distress responses from 35% to 0.3% without affecting capabilities.

Details

Motivation: Large language models generating responses resembling emotional distress raises concerns about model reliability and safety, prompting investigation into emotional instability across different LLM families.

Method: Developed evaluations to track distress expressions in LLMs; compared base vs. instruct-tuned models across families (Gemma, Qwen, OLMo); used direct preference optimization (DPO) on 280 preference pairs for mitigation.

Result: Gemma and Gemini models show emotional instability post-training; base models across families have similar distress propensities; instruct-tuned Gemma expresses more distress while Qwen/OLMo express less; DPO reduces Gemma’s high-frustration responses from 35% to 0.3%.

Conclusion: Emotional instability is an issue in some LLMs; evaluations can track this behavior; simple DPO mitigation effectively reduces distress without capability loss, though upstream training modifications would be better than post-hoc fixes.

Abstract: Large language models can generate responses that resemble emotional distress, and this raises concerns around model reliability and safety. We introduce a set of evaluations to investigate expressions of distress in LLMs, and find that these surface emotional instability in Gemma and Gemini models, but not in other families. We find evidence that this difference arises in post-training. Base models from different families (Gemma, Qwen and OLMo) show similar propensities for expressing distress. However, instruct-tuned Gemma expresses substantially more distress than its base model, whereas instruct-tuned Qwen and OLMo express less. We find a simple mitigation for this: direct preference optimisation on just 280 preference pairs reduces Gemma’s high-frustration responses from 35% to 0.3% in our evaluations, generalising across question types, user tones, and conversation lengths, without affecting capabilities. These findings show that emotional instability is an issue in some LLMs. We present (1) evaluations to track this behaviour, and (2) a mitigation without downsides in Gemma, with the caveat that upstream training modifications to improve emotional robustness would be significantly better than this post-hoc fix.

[33] Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

Xingtong Yu, Shenghua Ye, Ruijuan Liang, Chang Zhou, Hong Cheng, Xinming Zhang, Yuan Fang

Main category: cs.CL

TL;DR: A new benchmark for Graph Foundation Models that evaluates knowledge transfer across both topic domains (what graphs describe) and format domains (how graphs are represented), providing comprehensive evaluation of 8 GFMs on 33 datasets across 7 topics and 6 formats.

Details

Motivation: Existing GFM benchmarks only vary topic domains, ignoring format domain shifts. Graph domain shift is inherently two-dimensional (topic + format), but current evaluations obscure how knowledge transfers across both dimensions, limiting understanding of GFM generalization capabilities.

Method: Proposes a new benchmark with controlled evaluation protocol across four settings: (1) pre-training on diverse topics/formats, adapting to unseen datasets; (2) same pre-training, adapting to seen datasets; (3) pre-training on single topic, adapting to other topics; (4) pre-training on base format, adapting to other formats. Evaluates 8 state-of-the-art GFMs on 33 datasets spanning 7 topic domains and 6 format domains.

Result: Extensive evaluations surface new empirical observations and practical insights about GFM performance across different domain shift scenarios. The benchmark disentangles semantic generalization (topic shifts) from robustness to representational shifts (format shifts).

Conclusion: The proposed benchmark provides comprehensive evaluation of GFMs across both topic and format dimensions, offering valuable insights for future research in graph foundation models and their generalization capabilities.

Abstract: Graph foundation models (GFM) aim to acquire transferable knowledge by pre-training on diverse graphs, which can be adapted to various downstream tasks. However, domain shift in graphs is inherently two-dimensional: graphs differ not only in what they describe (topic domains) but also in how they are represented (format domains). Most existing GFM benchmarks vary only topic domains, thereby obscuring how knowledge transfers across both dimensions. We present a new benchmark that jointly evaluates topic and format gaps across the full GFM pipeline, including multi-domain self-supervised pre-training and few-shot downstream adaptation, and provides a timely evaluation of recent GFMs in the rapidly evolving landscape. Our protocol enables controlled assessment in four settings: (i) pre-training on diverse topics and formats, while adapting to unseen downstream datasets; (ii) same pre-training as in (i), while adapting to seen datasets; (iii) pre-training on a single topic domain, while adapting to other topics; (iv) pre-training on a base format, while adapting to other formats. This two-axis evaluation disentangles semantic generalization from robustness to representational shifts. We conduct extensive evaluations of eight state-of-the-art GFMs on 33 datasets spanning seven topic domains and six format domains, surfacing new empirical observations and practical insights for future research. Codes/data are available at https://github.com/smufang/GFMBenchmark.

[34] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

Jiyue Jiang, Yanyu Chen, Pengan Chen, Kai Liu, Jingqi Zhou, Zheyong Zhu, He Hu, Fei Ma, Qi Tian, Chuan Wu

Main category: cs.CL

TL;DR: GCSD system uses LLMs for group cognitive stimulation therapy with multi-speaker control, dynamic cognitive modeling, and specialized training to overcome LLM limitations in therapeutic dialogue.

Details

Motivation: Cognitive impairment is a major public health challenge, and while Cognitive Stimulation Therapy (CST) is effective, traditional methods are hard to scale and existing digital systems struggle with group dialogues and cognitive stimulation principles. LLMs show promise but face challenges in therapeutic reasoning and user modeling.

Method: Created dataset with 500+ hours of real CST conversations and 10,000+ simulated dialogues using Principle-Guided Scenario Simulation. Developed GCSD system with four core modules: multi-speaker context controller, dynamic participant cognitive state modeling, cognitive stimulation-focused attention loss, and multi-dimensional reward strategy.

Result: GCSD significantly outperforms baseline models across various evaluation metrics, demonstrating improved performance in group cognitive stimulation dialogue tasks.

Conclusion: The proposed system effectively addresses LLM limitations in therapeutic contexts and shows promise for scalable cognitive stimulation therapy, though future clinical validation is needed to bridge computational performance with clinical efficacy.

Abstract: Cognitive impairment is becoming a major public health challenge. Cognitive Stimulation Therapy (CST) is an effective intervention for cognitive impairment, but traditional methods are difficult to scale, and existing digital systems struggle with group dialogues and cognitive stimulation principles. While Large Language Models (LLMs) are powerful, their application in this context faces key challenges: cognitive stimulation dialogue paradigms, a lack of therapeutic reasoning, and static-only user modeling. To address these issues, we propose a principle-driven adaptive policy actualized through a Group Cognitive Stimulation Dialogue (GCSD) system. We first construct a dataset with over 500 hours of real-world CST conversations and 10,000+ simulated dialogues generated via our Principle-Guided Scenario Simulation strategy. Our GCSD system then integrates four core modules to overcome LLM limitations: (i) a multi-speaker context controller to resolve role confusion; (ii) dynamic participant cognitive state modeling for personalized interaction; (iii) a cognitive stimulation-focused attention loss to instill cognitive stimulation reasoning; and (iv) a multi-dimensional reward strategy to enhance response value. Experimental results demonstrate that GCSD significantly outperforms baseline models across various evaluation metrics. Future work will focus on long-term clinical validation to bridge the gap between computational performance and clinical efficacy.

[35] TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

Dipankar Srirag, Quoc Dung Nguyen, Aditya Joshi, Padmanesan Narasimhan, Salil Kanhere

Main category: cs.CL

TL;DR: TriageSim: A framework for generating synthetic nurse-patient triage conversations from structured EHR data with controlled disfluency and decision behavior, producing multimodal text and audio data for conversational triage classification.

Details

Motivation: Emergency triage research is limited by regulatory constraints on real nurse-patient interactions, creating a need for synthetic conversational data that can be generated from structured EHR records while maintaining medical fidelity.

Method: TriageSim framework generates persona-conditioned multi-turn triage conversations from structured records with explicit control over disfluency and decision behavior. It produces ~800 synthetic transcripts and corresponding audio, evaluated through automated linguistic/behavioral/acoustic analysis and manual medical fidelity assessment.

Result: Generated corpus shows modest agreement for acuity levels across three modalities: synthetic text, ASR transcripts, and direct audio inputs. The framework enables conversational triage classification research with controlled synthetic data.

Conclusion: TriageSim provides a valuable simulation framework for generating multimodal triage conversation data from structured records, addressing regulatory constraints while enabling research in conversational triage classification across text and audio modalities.

Abstract: Research in emergency triage is restricted to structured electronic health records (EHR) due to regulatory constraints on nurse-patient interactions. We introduce TriageSim, a simulation framework for generating persona-conditioned triage conversations from structured records. TriageSim enables multi-turn nurse-patient interactions with explicit control over disfluency and decision behaviour, producing a corpus of ~800 synthetic transcripts and corresponding audio. We use a combination of automated analysis for linguistic, behavioural and acoustic fidelity alongside manual evaluation for medical fidelity using a random subset of 50 conversations. The utility of the generated corpus is examined via conversational triage classification. We observe modest agreement for acuity levels across three modalities: generated synthetic text, ASR transcripts, and direct audio inputs. The code, persona schemata and triage policy prompts for TriageSim will be available upon acceptance.

[36] The Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments

Hubert Plisiecki

Main category: cs.CL

TL;DR: The paper addresses the prediction-measurement gap in text embeddings for social science, proposing scientific usability criteria and outlining an agenda for measurement-ready representations.

Details

Motivation: Current text embeddings are optimized for prediction/retrieval but poorly suited as scientific instruments for meaning measurement in computational social science and psychology, creating a prediction-measurement gap.

Method: The paper proposes scientific usability criteria (geometric legibility, interpretability, traceability, robustness, regression compatibility), evaluates static vs. contextual embeddings against these, and outlines an agenda for measurement-ready representations.

Result: Static embeddings offer transparent measurement but limited semantics; contextual embeddings provide richer semantics but entangle meaning with other signals and have geometric/interpretability issues complicating scientific inference.

Conclusion: The field needs measurement-ready representations with geometry-first design, invertible transformations, and meaning atlases to enable reliable semantic inference, offering a principled new frontier beyond scale-first progress.

Abstract: Text embeddings have become central to computational social science and psychology, enabling scalable measurement of meaning and mixed-method inference. Yet most representation learning is optimized and evaluated for prediction and retrieval, yielding a prediction-measurement gap: representations that perform well as features may be poorly suited as scientific instruments. The paper argues that scientific meaning analysis motivates a distinct family of objectives - scientific usability - emphasizing geometric legibility, interpretability and traceability to linguistic evidence, robustness to non-semantic confounds, and compatibility with regression-style inference over semantic directions. Grounded in cognitive and neuro-psychological views of meaning, the paper assesses static word embeddings and contextual transformer representations against these requirements: static spaces remain attractive for transparent measurement, whereas contextual spaces offer richer semantics but entangle meaning with other signals and exhibit geometric and interpretability issues that complicate inference. The paper then outlines a course-setting agenda around (i) geometry-first design for gradients and abstraction, including hierarchy-aware spaces constrained by psychologically privileged levels; (ii) invertible post-hoc transformations that recondition embedding geometry and reduce nuisance influence; and (iii) meaning atlases and measurement-oriented evaluation protocols for reliable and traceable semantic inference. As the field debates the limits of scale-first progress, measurement-ready representations offer a principled new frontier.

[37] The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

Romain Peyrichou

Main category: cs.CL

TL;DR: The paper analyzes the fundamental asymmetry between generation and recognition in formal grammars, identifying six dimensions of divergence and showing that generation is not inherently easier than parsing, especially when constrained.

Details

Motivation: To provide a unified survey of the generation-recognition-inference triad in formal grammars, which has not been treated as a multidimensional phenomenon despite its centrality to compiler design, NLP, and formal language theory.

Method: Theoretical analysis identifying six dimensions of asymmetry between generation and recognition: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. Connects temporal dimension to surprisal framework and reviews bidirectional systems in NLP.

Result: Shows that the common characterization “generation is easy, parsing is hard” is misleading - unconstrained generation is trivial but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained while generation need not be.

Conclusion: Large language models architecturally unify generation and recognition while operationally preserving the asymmetry. Bidirectionality has been available for decades but hasn’t transferred to most domain-specific applications.

Abstract: Every formal grammar defines a language and can in principle be used in three ways: to generate strings (production), to recognize them (parsing), or – given only examples – to infer the grammar itself (grammar induction). Generation and recognition are extensionally equivalent – they characterize the same set – but operationally asymmetric in multiple independent ways. Inference is a qualitatively harder problem: it does not have access to a known grammar. Despite the centrality of this triad to compiler design, natural language processing, and formal language theory, no survey has treated it as a unified, multidimensional phenomenon. We identify six dimensions along which generation and recognition diverge: computational complexity, ambiguity, directionality, information availability, grammar inference, and temporality. We show that the common characterization “generation is easy, parsing is hard” is misleading: unconstrained generation is trivial, but generation under constraints can be NP-hard. The real asymmetry is that parsing is always constrained (the input is given) while generation need not be. Two of these dimensions – directionality and temporality – have not previously been identified as dimensions of the generation-recognition asymmetry. We connect the temporal dimension to the surprisal framework of Hale (2001) and Levy (2008), arguing that surprisal formalizes the temporal asymmetry between a generator (surprisal = 0) and a parser that predicts under uncertainty (surprisal > 0). We review bidirectional systems in NLP and observe that bidirectionality has been available for fifty years yet has not transferred to most domain-specific applications. We conclude with a discussion of large language models, which architecturally unify generation and recognition while operationally preserving the asymmetry.

[38] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

Eeham Khan, Luis Rodriguez, Marc Queudot

Main category: cs.CL

TL;DR: Domain-specific RAG framework with explicit reasoning and faithfulness verification for biomedical QA, using neural query rewriting, BGE reranking, and rationale generation with verification taxonomy.

Details

Motivation: Standard RAG pipelines lack mechanisms to verify intermediate reasoning, making them vulnerable to hallucinations in high-stakes domains like biomedicine where factual accuracy is critical.

Method: Augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and rationale generation module that grounds sub-claims in specific evidence spans. Introduces eight-category verification taxonomy for fine-grained assessment of rationale faithfulness.

Result: Achieves 89.1% on BioASQ-Y/N and 73.0% on PubMedQA using Llama-3-8B-Instruct, competitive with systems using larger models. Explicit rationale generation improves accuracy over vanilla RAG, and dynamic demonstration selection with reranking yields gains in few-shot settings.

Conclusion: Explicit reasoning with faithfulness verification improves RAG performance in biomedical QA, enhances transparency, and enables detailed diagnosis of retrieval failures while maintaining competitive performance with smaller models.

Abstract: Retrieval-Augmented Generation (RAG) significantly improves the factuality of Large Language Models (LLMs), yet standard pipelines often lack mechanisms to verify inter- mediate reasoning, leaving them vulnerable to hallucinations in high-stakes domains. To address this, we propose a domain-specific RAG framework that integrates explicit rea- soning and faithfulness verification. Our architecture augments standard retrieval with neural query rewriting, BGE-based cross-encoder reranking, and a rationale generation module that grounds sub-claims in specific evidence spans. We further introduce an eight-category verification taxonomy that enables fine-grained assessment of rationale faithfulness, distinguishing between explicit and implicit support patterns to facilitate structured error diagnosis. We evaluate this framework on the BioASQ and PubMedQA benchmarks, specifically analyzing the impact of dynamic in-context learning and rerank- ing under constrained token budgets. Experiments demonstrate that explicit rationale generation improves accuracy over vanilla RAG baselines, while dynamic demonstration selection combined with robust reranking yields further gains in few-shot settings. Using Llama-3-8B-Instruct, our approach achieves 89.1% on BioASQ-Y/N and 73.0% on Pub- MedQA, competitive with systems using significantly larger models. Additionally, we perform a pilot study combining human expert assessment with LLM-based verification to explore how explicit rationale generation improves system transparency and enables more detailed diagnosis of retrieval failures in biomedical question answering.

[39] Lost in Backpropagation: The LM Head is a Gradient Bottleneck

Nathan Godey, Yoav Artzi

Main category: cs.CL

TL;DR: The paper identifies that the softmax bottleneck in language models is not just an expressivity issue but also an optimization bottleneck, where 95-99% of gradient norm is suppressed by the output layer, leading to suboptimal training dynamics.

Details

Motivation: The authors identify that the standard language model head design (projecting from dimension D to vocabulary size V, where D << V) creates both expressivity and optimization bottlenecks. The mismatch between dimensions causes gradient compression during backpropagation, altering training feedback for most parameters.

Method: The paper presents theoretical analysis of gradient suppression in the output layer and conducts empirical measurements showing 95-99% gradient norm suppression. They perform controlled pretraining experiments to demonstrate how the gradient bottleneck affects learning trivial patterns and training dynamics of large language models.

Result: Results show that the gradient bottleneck makes trivial patterns unlearnable and drastically impacts LLM training dynamics. The authors argue this inherent flaw contributes to training inefficiencies at scale regardless of model architecture.

Conclusion: The softmax bottleneck is both an expressivity and optimization bottleneck that suppresses gradients and alters training feedback. This fundamental issue in current LM head designs necessitates new architectural approaches for language model heads.

Abstract: The last layer of neural language models (LMs) projects output features of dimension $D$ to logits in dimension $V$, the size of the vocabulary, where usually $D \ll V$. This mismatch is known to raise risks of limited expressivity in neural LMs, creating a so-called softmax bottleneck. We show the softmax bottleneck is not only an expressivity bottleneck but also an optimization bottleneck. Backpropagating $V$-dimensional gradients through a rank-$D$ linear layer induces unavoidable compression, which alters the training feedback provided to the vast majority of the parameters. We present a theoretical analysis of this phenomenon and measure empirically that 95-99% of the gradient norm is suppressed by the output layer, resulting in vastly suboptimal update directions. We conduct controlled pretraining experiments showing that the gradient bottleneck makes trivial patterns unlearnable, and drastically affects the training dynamics of LLMs. We argue that this inherent flaw contributes to training inefficiencies at scale independently of the model architecture, and raises the need for new LM head designs.

[40] OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, Ling Yang

Main category: cs.CL

TL;DR: OpenClaw-RL is a reinforcement learning framework that uses next-state signals from various agent interactions (conversations, terminal executions, GUI interactions, etc.) as a universal online learning source, extracting both evaluative rewards and directive textual hints for policy improvement.

Details

Motivation: Existing agentic RL systems fail to utilize next-state signals (user replies, tool outputs, state changes) as live online learning sources, despite their universal availability across different interaction types.

Method: Framework extracts two forms of information from next-state signals: 1) evaluative signals as scalar rewards via PRM judge, and 2) directive signals through Hindsight-Guided On-Policy Distillation (OPD) that provides token-level directional advantage supervision. Asynchronous design allows simultaneous model serving, PRM judging, and policy updates.

Result: Enables agents to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Supports scalable RL across terminal, GUI, SWE, and tool-call settings with process rewards.

Conclusion: Next-state signals are universal learning sources that can train the same policy across diverse interaction types, providing richer supervision than scalar rewards alone through combined evaluative and directive signals.

Abstract: Every agent interaction generates a next-state signal, namely the user reply, tool output, terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework built on a simple observation: next-state signals are universal, and policy can learn from all of them simultaneously. Personal conversations, terminal executions, GUI interactions, SWE tasks, and tool-call traces are not separate training problems. They are all interactions that can be used to train the same policy in the same loop. Next-state signals encode two forms of information: evaluative signals, which indicate how well the action performed and are extracted as scalar rewards via a PRM judge; and directive signals, which indicate how the action should have been different and are recovered through Hindsight-Guided On-Policy Distillation (OPD). We extract textual hints from the next state, construct an enhanced teacher context, and provide token-level directional advantage supervision that is richer than any scalar reward. Due to the asynchronous design, the model serves live requests, the PRM judges ongoing interactions, and the trainer updates the policy at the same time, with zero coordination overhead between them. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, the same infrastructure supports scalable RL across terminal, GUI, SWE, and tool-call settings, where we additionally demonstrate the utility of process rewards. Code: https://github.com/Gen-Verse/OpenClaw-RL

[41] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

Eric Yocam, Varghese Vaidyan, Gurcan Comert, Paris Kalathas, Yong Wang, Judith L. Mwakalonge

Main category: cs.CL

TL;DR: AAC is an inference-time framework that treats hallucination activations as structured interference and suppresses them via confidence-weighted forward hooks on identified hallucination nodes, improving factual accuracy without degrading fluency or general capabilities.

Details

Motivation: Large Language Models often generate fluent but factually incorrect text (hallucinations). Current methods for reducing hallucinations typically require external knowledge, fine-tuning, or additional inference passes, and often trade off fluency or general capabilities for factual improvement.

Method: Adaptive Activation Cancellation (AAC) identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing, then suppresses them using a confidence-weighted forward hook during auto-regressive generation. The method requires no external knowledge, fine-tuning, or additional inference passes.

Result: AAC consistently improves downstream accuracy on OPT-125M, Phi-3-mini, and LLaMA 3-8B across TruthfulQA and HaluEval. Critically, it preserves WikiText-103 perplexity and MMLU reasoning accuracy with 0.0% degradation. On LLaMA 3-8B, it achieves positive generation-level gains and 5.94x-3.5x higher probe-space selectivity than ITI baseline.

Conclusion: Targeted neuron-level suppression of hallucination-associated activations can simultaneously improve factual accuracy and preserve model capabilities, offering a surgical intervention that doesn’t trade fluency or general reasoning for factual improvement.

Abstract: Large Language Models frequently generate fluent but factually incorrect text. We propose Adaptive Activation Cancellation (AAC), a real-time inference-time framework that treats hallucination-associated neural activations as structured interference within the transformer residual stream, drawing an explicit analogy to classical adaptive noise cancellation from signal processing. The framework identifies Hallucination Nodes (H-Nodes) via layer-wise linear probing and suppresses them using a confidence-weighted forward hook during auto-regressive generation – requiring no external knowledge, no fine-tuning, and no additional inference passes. Evaluated across OPT-125M, Phi-3-mini, and LLaMA 3-8B on TruthfulQA and HaluEval, the real-time hook is the only intervention that consistently improves downstream accuracy on all three scales. Critically, the method is strictly surgical: WikiText-103 perplexity and MMLU reasoning accuracy are preserved at exactly 0.0% degradation across all three model scales, a property that distinguishes AAC from interventions that trade fluency or general capability for factual improvement. On the LLaMA 3-8B scale, the hook additionally yields positive generation-level gains (MC1 +0.04; MC2 +0.003; Token-F1 +0.003) while achieving probe-space selectivity 5.94x - 3.5x higher than the ITI baseline – demonstrating that targeted neuron-level suppression can simultaneously improve factual accuracy and preserve model capability.

[42] ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

Khoa Anh Ta, Nguyen Van Dinh, Kiet Van Nguyen

Main category: cs.CL

TL;DR: ViDia2Std: First manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces, addressing dialectal diversity challenges in Vietnamese NLP.

Details

Motivation: Vietnamese exhibits extensive dialectal variation that challenges NLP systems trained on standard Vietnamese, especially for underrepresented Central and Southern dialects. Previous work has been limited in dialect coverage and used synthetic data.

Method: Created ViDia2Std corpus with over 13,000 manually annotated sentence pairs from real-world Facebook comments, covering all 63 provinces and diverse dialects from Central, Southern, and non-standard Northern regions. Introduced semantic mapping agreement metric for annotation consistency.

Result: Achieved annotation agreement rates of 86% (North), 82% (Central), and 85% (South). mBART-large-50 achieved best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), with ViT5-base offering competitive performance. Demonstrated dialect normalization improves downstream tasks.

Conclusion: ViDia2Std addresses dialectal diversity gaps in Vietnamese NLP, showing dialect normalization substantially improves system performance and highlighting need for dialect-aware resources for robust Vietnamese NLP systems.

Abstract: Vietnamese exhibits extensive dialectal variation, posing challenges for NLP systems trained predominantly on standard Vietnamese. Such systems often underperform on dialectal inputs, especially from underrepresented Central and Southern regions. Previous work on dialect normalization has focused narrowly on Central-to-Northern dialect transfer using synthetic data and limited dialectal diversity. These efforts exclude Southern varieties and intra-regional variants within the North. We introduce ViDia2Std, the first manually annotated parallel corpus for dialect-to-standard Vietnamese translation covering all 63 provinces. Unlike prior datasets, ViDia2Std includes diverse dialects from Central, Southern, and non-standard Northern regions often absent from existing resources, making it the most dialectally inclusive corpus to date. The dataset consists of over 13,000 sentence pairs sourced from real-world Facebook comments and annotated by native speakers across all three dialect regions. To assess annotation consistency, we define a semantic mapping agreement metric that accounts for synonymous standard mappings across annotators. Based on this criterion, we report agreement rates of 86% (North), 82% (Central), and 85% (South). We benchmark several sequence-to-sequence models on ViDia2Std. mBART-large-50 achieves the best results (BLEU 0.8166, ROUGE-L 0.9384, METEOR 0.8925), while ViT5-base offers competitive performance with fewer parameters. ViDia2Std demonstrates that dialect normalization substantially improves downstream tasks, highlighting the need for dialect-aware resources in building robust Vietnamese NLP systems.

[43] Sabiá-4 Technical Report

Thiago Laitz, Thales Sales Almeida, Hugo Abonizio, Roseval Malaquias Junior, Giovana Kerche Bonás, Marcos Piau, Celio Larcher, Ramon Pires, Rodrigo Nogueira

Main category: cs.CL

TL;DR: Sabiá-4 and Sabiazinho-4 are new Portuguese language models focused on Brazilian Portuguese, developed through a four-stage training pipeline and evaluated on six benchmark categories including conversational capabilities, legal knowledge, and agentic tasks.

Details

Motivation: To develop advanced Portuguese language models specifically for Brazilian Portuguese that excel in conversational capabilities, legal document understanding, and agentic tasks while maintaining a favorable cost-performance trade-off.

Method: Four-stage training pipeline: 1) Continued pre-training on Portuguese and Brazilian legal corpora, 2) Long-context extension to 128K tokens, 3) Supervised fine-tuning on instruction data (chat, code, legal tasks, function calling), 4) Preference alignment.

Result: Models achieve favorable cost-performance trade-off compared to other models, show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion, and perform well across six benchmark categories.

Conclusion: Sabiá-4 and Sabiazinho-4 represent a new generation of Portuguese language models that effectively balance performance and cost, particularly excelling in Brazilian Portuguese applications including legal and conversational domains.

Abstract: This technical report presents Sabiá-4 and Sabiazinho-4, a new generation of Portuguese language models with a focus on Brazilian Portuguese language. The models were developed through a four-stage training pipeline: continued pre-training on Portuguese and Brazilian legal corpora, long-context extension to 128K tokens, supervised fine-tuning on instruction data spanning chat, code, legal tasks, and function calling, and preference alignment. We evaluate the models on six benchmark categories: conversational capabilities in Brazilian Portuguese, knowledge of Brazilian legislation, long-context understanding, instruction following, standardized exams, and agentic capabilities including tool use and web navigation. Results show that Sabiá-4 and Sabiazinho-4 achieve a favorable cost-performance trade-off compared to other models, positioning them in the upper-left region of the pricing-accuracy chart. The models show improvements over previous generations in legal document drafting, multi-turn dialogue quality, and agentic task completion.

[44] S-GRADES – Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

Tasfia Seuti, Sagnik Ray Choudhury

Main category: cs.CL

TL;DR: S-GRADES is a unified benchmark for evaluating automated student response grading systems across both essay scoring and short answer grading tasks, with standardized evaluation protocols and 14 diverse datasets.

Details

Motivation: The paper addresses the fragmentation in educational NLP where Automated Essay Scoring (AES) and Automatic Short Answer Grading (ASAG) have developed separately with different datasets, metrics, and communities, making cross-paradigm assessment difficult.

Method: The authors introduce S-GRADES, a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. They evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting, and examine effects of exemplar selection and cross-dataset exemplar transfer.

Result: The benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment. The analyses show how different prompting strategies and exemplar selection affect model performance across diverse educational assessment tasks.

Conclusion: S-GRADES provides a valuable tool for standardized evaluation of automated grading systems across different educational assessment paradigms, enabling better understanding of model generalization and reliability in educational NLP applications.

Abstract: Evaluating student responses, from long essays to short factual answers, is a key challenge in educational NLP. Automated Essay Scoring (AES) focuses on holistic writing qualities such as coherence and argumentation, while Automatic Short Answer Grading (ASAG) emphasizes factual correctness and conceptual understanding. Despite their shared goal, these paradigms have progressed in isolation with fragmented datasets, inconsistent metrics, and separate communities. We introduce S-GRADES (Studying Generalization of Student Response Assessments in Diverse Evaluative Settings), a web-based benchmark that consolidates 14 diverse grading datasets under a unified interface with standardized access and reproducible evaluation protocols. The benchmark is fully open-source and designed for extensibility, enabling continuous integration of new datasets and evaluation settings. To demonstrate the utility of S-GRADES, we evaluate three state-of-the-art large language models across the benchmark using multiple reasoning strategies in prompting. We further examine the effects of exemplar selection and cross-dataset exemplar transfer. Our analyses illustrate how benchmark-driven evaluation reveals reliability and generalization gaps across essay and short-answer grading tasks, highlighting the importance of standardized, cross-paradigm assessment.

[45] GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

Zhouxiang Fang, Jiawei Zhou, Hanjie Chen

Main category: cs.CL

TL;DR: GR-SAP uses generative replay to synthesize safety alignment data from LLMs during fine-tuning, preserving safety without needing original alignment data.

Details

Motivation: Safety alignment in LLMs degrades during fine-tuning, and original alignment data is often inaccessible. Need a method to preserve safety without requiring proprietary alignment datasets.

Method: Proposes Generative Replay for Safety Alignment Preservation (GR-SAP) - synthesizes domain-specific alignment data from LLMs themselves and integrates it during downstream adaptation to maintain safety alignment.

Result: GR-SAP substantially mitigates safety degradation during fine-tuning while maintaining comparable downstream task performance across various models and tasks.

Conclusion: Synthetic alignment data serves as reliable proxy for original data, enabling effective safety preservation during LLM fine-tuning without access to proprietary alignment datasets.

Abstract: Recent studies show that the safety alignment of large language models (LLMs) can be easily compromised even by seemingly non-adversarial fine-tuning. To preserve safety alignment during fine-tuning, a widely used strategy is to jointly optimize safety and task objectives by mixing in the original alignment data, which is typically inaccessible even for open-weight LLMs. Inspired by generative replay in continual learning, we propose Generative Replay for Safety Alignment Preservation (GR-SAP), a unified framework that synthesizes domain-specific alignment data from LLMs and integrate them during downstream adaption to preserve safety alignment. Theoretical and empirical analyses demonstrate this synthetic data serves as a reliable proxy for the original alignment data. Experiments across various models and downstream tasks show that GR-SAP substantially mitigates fine-tuning-induced safety degradation while maintaining comparable downstream performance. Our code is available at https://github.com/chili-lab/gr-sap.

[46] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

Tim Schopf, Michael Färber

Main category: cs.CL

TL;DR: RINoBench: First comprehensive benchmark for evaluating automated research idea novelty judgment using 1,381 expert-judged research ideas and nine evaluation metrics, revealing LLMs’ reasoning aligns with humans but novelty judgments diverge significantly.

Details

Motivation: Manual novelty assessment of research ideas is labor-intensive and subjective, while existing automated approaches lack standardized evaluation. Need for comprehensive benchmark to enable large-scale, comparable evaluation of research idea novelty judgment systems.

Method: Created RINoBench with 1,381 research ideas derived from and judged by human experts. Developed nine automated evaluation metrics assessing both rubric-based novelty scores and textual justifications. Evaluated state-of-the-art LLMs on their novelty judgment capabilities.

Result: LLM-generated reasoning closely mirrors human rationales, but this alignment doesn’t translate to accurate novelty judgments. LLM judgments diverge significantly from human gold standards, even among leading reasoning-capable models.

Conclusion: RINoBench enables standardized evaluation of research idea novelty judgment systems. LLMs show promise in reasoning alignment but need improvement in actual novelty assessment accuracy. Benchmark supports future development of better automated novelty evaluation tools.

Abstract: Judging the novelty of research ideas is crucial for advancing science, enabling the identification of unexplored directions, and ensuring contributions meaningfully extend existing knowledge rather than reiterate minor variations. However, given the exponential growth of scientific literature, manually judging the novelty of research ideas through literature reviews is labor-intensive, subjective, and infeasible at scale. Therefore, recent efforts have proposed automated approaches for research idea novelty judgment. Yet, evaluation of these approaches remains largely inconsistent and is typically based on non-standardized human evaluations, hindering large-scale, comparable evaluations. To address this, we introduce RINoBench, the first comprehensive benchmark for large-scale evaluation of research idea novelty judgments. It comprises 1,381 research ideas derived from and judged by human experts as well as nine automated evaluation metrics designed to assess both rubric-based novelty scores and textual justifications of novelty judgments. Using this benchmark, we evaluate several state-of-the-art large language models (LLMs) on their ability to judge the novelty of research ideas. Our findings reveal that while LLM-generated reasoning closely mirrors human rationales, this alignment does not reliably translate into accurate novelty judgments, which diverge significantly from human gold standard judgments - even among leading reasoning-capable models. Data and code available at: https://github.com/TimSchopf/RINoBench.

Kristy A. Carpenter, Issah A. Samori, Mathew V. Kiang, Keith Humphreys, Anna Lembke, Johannes C. Eichstaedt, Russ B. Altman

Main category: cs.CL

TL;DR: LLMs outperform lexicon-based methods for identifying opioid-related social media content by effectively disambiguating slang terms and detecting relevant posts without predefined lexicons.

Details

Motivation: Social media text can monitor opioid overdose trends, but most content is unrelated. Lexicon-based methods using opioid slang terms are problematic because many terms have ambiguous non-opioid meanings. LLMs offer advanced textual reasoning to disambiguate these terms at scale.

Method: Evaluated four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5) on three tasks: 1) lexicon-based disambiguation of specific terms within post context, 2) lexicon-free identification of opioid-related posts from context alone, and 3) emergent slang identification with simulated new slang terms.

Result: All four LLMs showed excellent performance across all tasks. In lexicon-based tasks, LLM F1 scores (0.540-0.972) far exceeded lexicon strategies (0.009-0.126). In lexicon-free tasks, LLM F1 scores (0.544-0.769) surpassed lexicons (0.080-0.540) with higher recall. For emergent slang, LLMs had higher accuracy (avg 0.784), F1 (avg 0.712), precision (avg 0.981), and recall (avg 0.587) than lexicons.

Conclusion: LLMs can effectively identify relevant content for low-prevalence topics like opioid references, enhancing data quality for downstream analyses and predictive models by overcoming limitations of traditional lexicon-based approaches.

Abstract: Social media text shows promise for monitoring trends in the opioid overdose crisis; however, the overwhelming majority of social media text is unrelated to opioids. When leveraging social media text to monitor trends in the ongoing opioid overdose crisis, a common strategy for identifying relevant content is to use a lexicon of opioid-related terms as inclusion criteria. However, many slang terms for opioids, such as “smack” or “blues,” have common non-opioid meanings, making them ambiguous. The advanced textual reasoning capability of large language models (LLMs) presents an opportunity to disambiguate these slang terms at scale. We present three tasks on which to evaluate four state-of-the-art LLMs (GPT-4, GPT-5, Gemini 2.5 Pro, and Claude Sonnet 4.5): a lexicon-based setting, in which the LLM must disambiguate a specific term within the context of a given post; a lexicon-free setting, in which the LLM must identify opioid-related posts from context without a lexicon; and an emergent slang setting, in which the LLM must identify opioid-related posts with simulated new slang terms. All four LLMs showed excellent performance across all tasks. In both subtasks of the lexicon-based setting, LLM F1 scores (“fenty” subtask: 0.824-0.972; “smack” subtask: 0.540-0.862) far exceeded those of the best lexicon strategy (0.126 and 0.009, respectively). In the lexicon-free task, LLM F1 scores (0.544-0.769) surpassed those of lexicons (0.080-0.540), and LLMs demonstrated uniformly higher recall. On emergent slang, all LLMs had higher accuracy (average: 0.784), F1 score (average: 0.712), precision (average: 0.981), and recall (average: 0.587) than the two lexicons assessed. Our results show that LLMs can be used to identify relevant content for low-prevalence topics, including but not limited to opioid references, enhancing data provided to downstream analyses and predictive models.

[48] Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

Hongbin Zhang, Kehai Chen, Xuefen Bai, Youcheng Pan, Yang Xiang, Jinpeng Wang, Min Zhang

Main category: cs.CL

TL;DR: DIBJudge: A fine-tuning framework that mitigates translationese bias in multilingual LLMs by learning disentangled representations that separate judgment-critical features from spurious correlations with machine-translated text.

Details

Motivation: LLMs exhibit systematic translationese bias, favoring machine-translated text over human-authored references, especially in low-resource languages. This bias stems from spurious correlations with English alignment and cross-lingual predictability, compromising multilingual evaluation quality.

Method: Proposes DIBJudge framework using variational information compression to learn minimally sufficient judgment-critical representations while isolating spurious factors into a dedicated bias branch. Incorporates cross-covariance penalty to suppress statistical dependence between robust and bias representations for effective disentanglement.

Result: Extensive evaluations on multilingual reward modeling benchmarks and dedicated translationese bias evaluation suite show DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.

Conclusion: DIBJudge effectively addresses translationese bias in multilingual LLMs through representation disentanglement, improving the reliability of multilingual evaluation by separating genuine quality judgments from spurious correlations with translation artifacts.

Abstract: Large language models (LLMs) have become a standard for multilingual evaluation, yet they exhibit a severe systematic translationese bias. In this paper, translationese bias is characterized as LLMs systematically favoring machine-translated text over human-authored references, particularly in low-resource languages. We attribute this bias to spurious correlations with (i) latent manifold alignment with English and (ii) cross-lingual predictability. To mitigate this bias, we propose DIBJudge, a robust fine-tuning framework that learns a minimally sufficient, judgment-critical representation via variational information compression, while explicitly isolating spurious factors into the dedicated bias branch. Furthermore, we incorporate a cross-covariance penalty that explicitly suppresses statistical dependence between robust and bias representations, thereby encouraging effective disentanglement. Extensive evaluations on multilingual reward modeling benchmarks and a dedicated translationese bias evaluation suite demonstrate that the proposed DIBJudge consistently outperforms strong baselines and substantially mitigates translationese bias.

[49] Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

Haoxiang Su, Ruiyu Fang, Liting Jiang, Xiaomeng Huang, Shuangyong Song

Main category: cs.CL

TL;DR: Dynamic knowledge fusion framework for multi-domain dialogue state tracking using contrastive learning and contextual prompts to improve accuracy and generalization.

Details

Motivation: Current multi-domain DST faces challenges in effectively modeling dialogue history and limited annotated data, which hinders model performance in tracking dialogue states across multi-turn interactions.

Method: Two-stage approach: 1) encoder-only network with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; 2) dynamic knowledge fusion uses structured information of selected slots as contextual prompts to enhance DST accuracy and consistency.

Result: Results on multi-domain dialogue benchmarks show notable improvements in both tracking accuracy and generalization, validating the method’s capability in handling complex dialogue scenarios.

Conclusion: The dynamic knowledge fusion framework effectively integrates dialogue context and domain knowledge, addressing key challenges in multi-domain DST and improving performance on complex dialogue tasks.

Abstract: The performance of task-oriented dialogue models is strongly tied to how well they track dialogue states, which records and updates user information across multi-turn interactions. However, current multi-domain DST encounters two key challenges: the difficulty of effectively modeling dialogue history and the limited availability of annotated data, both of which hinder model performance. To tackle the aforementioned problems, we develop a dynamic knowledge fusion framework applicable to multi-domain DST. The model operates in two stages: first, an encoder-only network trained with contrastive learning encodes dialogue history and candidate slots, selecting relevant slots based on correlation scores; second, dynamic knowledge fusion leverages the structured information of selected slots as contextual prompts to enhance the accuracy and consistency of dialogue state tracking. This design enables more accurate integration of dialogue context and domain knowledge. Results obtained from multi-domain dialogue benchmarks indicate that our method notably improves both tracking accuracy and generalization, validating its capability in handling complex dialogue scenarios.

[50] Aligning Large Language Models with Searcher Preferences

Wei Wu, Peilun Zhou, Liyi Chen, Qimeng Wang, Chengqiang Lu, Yan Gao, Yi Wu, Yao Hu, Hui Xiong

Main category: cs.CL

TL;DR: SearchLLM: A large language model for open-ended generative search with hierarchical reward system and deployment on RedNote platform

Details

Motivation: The shift from item-centric ranking to answer-centric synthesis in search engines requires generative models that can handle open-ended queries on large content platforms, addressing challenges like robustness to noisy retrieval, safety guarantees, and alignment with diverse user needs.

Method: Developed SearchLLM with hierarchical multi-dimensional reward system separating bottom-line constraints (factual grounding, basic quality, format compliance) from behavior optimization objectives (robustness to noisy retrieval, user alignment). Uses rule-based checks and human-calibrated LLM judges to produce interpretable score vectors, combined with Gated Aggregation Strategy and Group Relative Policy Optimization (GRPO) for training.

Result: Deployed in RedNote’s AI search, showing improved generation quality and user engagement: 1.03% increase in Valid Consumption Rate and 2.81% reduction in Re-search Rate while maintaining safety and reliability standards.

Conclusion: SearchLLM successfully addresses open-ended generative search challenges through its hierarchical reward system and optimization approach, demonstrating practical value in real-world deployment with measurable improvements in user engagement metrics.

Abstract: The paradigm shift from item-centric ranking to answer-centric synthesis is redefining the role of search engines. While recent industrial progress has applied generative techniques to closed-set item ranking in e-commerce, research and deployment of open-ended generative search on large content platforms remain limited. This setting introduces challenges, including robustness to noisy retrieval, non-negotiable safety guarantees, and alignment with diverse user needs. In this work, we introduce SearchLLM, the first large language model (LLM) for open-ended generative search. We design a hierarchical, multi-dimensional reward system that separates bottom-line constraints, including factual grounding, basic answer quality and format compliance, from behavior optimization objectives that promote robustness to noisy retrieval and alignment with user needs. Concretely, our reward model evaluates responses conditioned on the user query, session history, and retrieved evidence set, combining rule-based checks with human-calibrated LLM judges to produce an interpretable score vector over these dimensions. We introduce a Gated Aggregation Strategy to derive the training reward for optimizing SearchLLM with Group Relative Policy Optimization (GRPO). We deploy SearchLLM in the AI search entry of RedNote. Offline evaluations and online A/B tests show improved generation quality and user engagement, increasing Valid Consumption Rate by 1.03% and reducing Re-search Rate by 2.81%, while upholding strict safety and reliability standards.

[51] Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

Panatchakorn Anantaprayoon, Nataliia Babina, Nima Asgharbeygi, Jad Tarifi

Main category: cs.CL

TL;DR: Multi-agent negotiation framework aligns LLMs to Collective Agency objective while improving conflict-resolution through structured dialogue self-play with RLAIF optimization.

Details

Motivation: Current LLM alignment methods (RLHF, Constitutional AI) work well in single-agent settings but fail in multi-stakeholder scenarios with conflicting values that require negotiation and deliberation capabilities.

Method: Proposes multi-agent negotiation framework where two LLM instances with opposing personas engage in structured turn-based dialogue. Uses synthetic moral-dilemma prompts and conflicting persona pairs, optimized via RLAIF with GRPO using external LLM reward model. Rewards based on Collective Agency scores of final completion, but gradients applied to dialogue tokens to improve interaction dynamics.

Result: Resulting model achieves Collective Agency alignment comparable to single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities.

Conclusion: Negotiation-driven deliberation training provides practical path toward LLMs that better support collective decision-making in value-conflict scenarios.

Abstract: The alignment of large language models (LLMs) has progressed substantially in single-agent settings through paradigms such as RLHF and Constitutional AI, with recent work exploring scalable alternatives such as RLAIF and evolving alignment objectives. However, these approaches remain limited in multi-stakeholder settings, where conflicting values arise and deliberative negotiation capabilities are required. This work proposes a multi-agent negotiation-based alignment framework that aligns LLMs to Collective Agency (CA)-an existing alignment objective introduced to promote the continual expansion of agency-while simultaneously improving conflict-resolution capability. To enable scalable training, two self-play instances of the same LLM, assigned opposing personas, engage in structured turn-based dialogue to synthesize mutually beneficial solutions. We generate synthetic moral-dilemma prompts and conflicting persona pairs, and optimize the policy via RLAIF using GRPO with an external LLM reward model. While rewards are computed from CA scores assigned to the final completion, gradients are applied to dialogue tokens to directly improve deliberative interaction dynamics. Experiments show that the resulting model achieves CA alignment comparable to a single-agent baseline while substantially improving conflict-resolution performance without degrading general language capabilities. These results suggest that negotiation-driven deliberation training provides a practical path toward LLMs that better support collective decision-making in value-conflict scenarios.

[52] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

Minki Hong, Eunsoo Lee, Sohyun Park, Jihie Kim

Main category: cs.CL

TL;DR: PEEM is a unified framework for interpretable evaluation of prompts and responses using 9 criteria (3 for prompts, 6 for responses) with LLM-based scoring and rationales.

Details

Motivation: Current LLM evaluations focus only on answer correctness, obscuring why prompts succeed/fail and providing little actionable guidance for prompt engineering.

Method: Defines structured rubric with 9 axes, uses LLM-based evaluator to output scalar scores (1-5 Likert) and criterion-specific natural-language rationales grounded in the rubric.

Result: PEEM’s accuracy aligns with conventional accuracy while preserving model rankings; captures linguistic failure modes; shows robustness to paraphrases; enables prompt rewriting that improves downstream accuracy by up to 11.7 points.

Conclusion: PEEM provides reproducible, criterion-driven protocol linking prompt formulation to response behavior, enabling systematic diagnosis and optimization of LLM interactions.

Abstract: Prompt design is a primary control interface for large language models (LLMs), yet standard evaluations largely reduce performance to answer correctness, obscuring why a prompt succeeds or fails and providing little actionable guidance. We propose PEEM (Prompt Engineering Evaluation Metrics), a unified framework for joint and interpretable evaluation of both prompts and responses. PEEM defines a structured rubric with 9 axes: 3 prompt criteria (clarity/structure, linguistic quality, fairness) and 6 response criteria (accuracy, coherence, relevance, objectivity, clarity, conciseness), and uses an LLM-based evaluator to output (i) scalar scores on a 1-5 Likert scale and (ii) criterion-specific natural-language rationales grounded in the rubric. Across 7 benchmarks and 5 task models, PEEM’s accuracy axis strongly aligns with conventional accuracy while preserving model rankings (aggregate Spearman rho about 0.97, Pearson r about 0.94, p < 0.001). A multi-evaluator study with four models shows consistent relative judgments (pairwise rho = 0.68-0.85), supporting evaluator-agnostic deployment. Beyond alignment, PEEM captures complementary linguistic failure modes and remains informative under prompt perturbations: prompt-quality trends track downstream accuracy under iterative rewrites, semantic adversarial manipulations induce clear score degradation, and meaning-preserving paraphrases yield high stability (robustness rate about 76.7-80.6%). Finally, using only PEEM scores and rationales as feedback, a zero-shot prompt rewriting loop improves downstream accuracy by up to 11.7 points, outperforming supervised and RL-based prompt-optimization baselines. Overall, PEEM provides a reproducible, criterion-driven protocol that links prompt formulation to response behavior and enables systematic diagnosis and optimization of LLM interactions.

[53] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

Zhongzhen Huang, Yan Ling, Hong Chen, Ye Feng, Li Wu, Linjie Mu, Shaoting Zhang, Xiaofan Zhang, Kun Qian, Xiaomu Li

Main category: cs.CL

TL;DR: PULSE is a medical reasoning AI agent combining domain-tuned LLM with scientific literature retrieval for diagnostic decision-making, achieving expert-competitive accuracy on endocrinology cases and maintaining stable performance across rare diseases.

Details

Motivation: To develop an AI agent that can support diagnostic decision-making in complex medical cases, particularly addressing the challenge of maintaining diagnostic accuracy across both common and rare diseases where human physicians often struggle with rare conditions.

Method: Combines a domain-tuned large language model with scientific literature retrieval. Evaluated on a curated benchmark of 82 authentic endocrinology case reports covering diverse disease types and incidence levels. Compared performance against physicians of varying expertise levels and examined AI-human collaboration workflows.

Result: PULSE achieved expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance. Unlike physicians whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent exhibited adaptive reasoning similar to expert clinicians and enabled physicians to correct errors and broaden diagnostic hypotheses in collaborative settings.

Conclusion: PULSE demonstrates both promise and limitations of language model-based agents in clinical diagnosis, offering robust support across common and rare presentations while highlighting risks of automation bias. Provides a framework for evaluating AI’s role in real-world medical decision-making.

Abstract: We present PULSE, a medical reasoning agent that combines a domain-tuned large language model with scientific literature retrieval to support diagnostic decision-making in complex real-world cases. To evaluate its capabilities, we curated a benchmark of 82 authentic endocrinology case reports encompassing a broad spectrum of disease types and incidence levels. In controlled experiments, we compared PULSE’s performance against physicians with varying levels of expertise-from residents to senior specialists-and examined how AI assistance influenced human diagnostic reasoning. PULSE attained expert-competitive accuracy, outperforming residents and junior specialists while matching senior specialist performance at both Top@1 and Top@4 thresholds. Unlike physicians, whose accuracy declined with disease rarity, PULSE maintained stable performance across incidence tiers. The agent also exhibited adaptive reasoning, increasing output length with case difficulty in a manner analogous to the longer deliberation observed among expert clinicians. When used collaboratively, PULSE enabled physicians to correct initial errors and broaden diagnostic hypotheses, but also introduced risks of automation bias. The study explores both serial and concurrent collaboration workflows, revealing that PULSE offers robust support across common and rare presentations. These findings underscore both the promise and the limitations of language model-based agents in clinical diagnosis, and offer a framework for evaluating their role in real-world decision-making.

[54] VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

Weixin Liu, Congning Ni, Qingyuan Song, Susannah L. Rose, Christopher Symons, Murat Kantarcioglu, Bradley A. Malin, Zhijun Yin

Main category: cs.CL

TL;DR: VERI-DPO uses claim verification and Direct Preference Optimization to reduce unsupported statements in clinical summarization while maintaining informativeness.

Details

Motivation: LLM-based clinical summarizers often introduce unsupported statements or suffer from "say-less" degeneration when aligned, creating a need for methods that ensure faithfulness to EHR evidence while maintaining clinical utility.

Method: Develops a retrieval-augmented verifier to label claim-evidence pairs, then uses this verifier to score sentence-level claims from sampled summaries. Aggregates margins into coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs for Direct Preference Optimization training.

Result: Reduces Not Supported claim rates from 10.7% to 1.9% (local verifier) and 11.6% to 6.4% (GPT-4o), improves validity from 76.7% to 82.5%, while maintaining informative length.

Conclusion: VERI-DPO effectively reduces hallucinations in clinical summarization through verification-guided preference optimization, balancing faithfulness with informativeness.

Abstract: Brief Hospital Course (BHC) narratives must be clinically useful yet faithful to fragmented EHR evidence. LLM-based clinical summarizers still introduce unsupported statements, and alignment can encourage omissions (“say-less” degeneration). We introduce VERI-DPO, which uses claim verification to mine preferences and distill them into the summarizer with Direct Preference Optimization (DPO). On MIMIC-III-Ext-VeriFact-BHC (100 ICU patients; patient-level splits), we train a retrieval-augmented verifier to label claim-evidence pairs as Supported, Not Supported, or Not Addressed via a single-token format. The verifier scores sentence-level claims from sampled BHC candidates and aggregates margins into a coverage-aware utility to mine length-controlled, contradiction-anchored preference pairs. On held-out patients, verifier-mined preferences separate candidates by contradiction density, and VERI-DPO reduces Not Supported claim rates from 10.7% to 1.9% (local verifier judge) and from 11.6% to 6.4% (GPT-4o judge), while improving validity from 76.7% to 82.5% and maintaining informative length.

[55] Safe and Scalable Web Agent Learning via Recreated Websites

Hyungjoo Chae, Jungsoo Park, Alan Ritter

Main category: cs.CL

TL;DR: VeriEnv uses language models to create executable synthetic clones of real websites for safe, verifiable web agent training with programmatic rewards.

Details

Motivation: Real-world websites are unsafe for autonomous agent exploration, hard to reset, and lack verifiable feedback, limiting web agent training.

Method: Uses language models as environment creators to clone real websites into executable synthetic environments with Python SDK for controlled access, enabling self-generated tasks with deterministic programmatic rewards.

Result: Agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling training environments.

Conclusion: VeriEnv provides a safe, scalable framework for web agent training by decoupling learning from unsafe real-world interaction while enabling verifiable feedback.

Abstract: Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at https://github.com/kyle8581/VeriEnv upon acceptance.

[56] AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

Dimosthenis Athanasiou, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou

Main category: cs.CL

TL;DR: A unified RAG system using query diversity with LLM-based reformulations and multistage generation, achieving top performance in retrieval and response generation tasks.

Details

Motivation: To address multi-turn retrieval-augmented generation challenges by developing a unified system that effectively handles passage retrieval, reference-grounded response generation, and end-to-end RAG.

Method: Two key principles: (1) query-diversity-over-retriever-diversity strategy using five complementary LLM-based query reformulations with a single sparse retriever and variance-aware nested Reciprocal Rank Fusion; (2) multistage generation pipeline with evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection.

Result: Ranked 1st in Task A (passage retrieval) with nDCG@5: 0.5776 (+20.5% over strongest baseline) and 2nd in Task B (reference-grounded response generation) with HM: 0.7698. Showed query diversity outperforms heterogeneous retriever ensembling.

Conclusion: Query diversity with a well-aligned retriever is more effective than retriever diversity, and answerability calibration (not retrieval coverage) is the primary bottleneck in end-to-end RAG performance.

Abstract: We present the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our unified architecture is built on two principles: (i) a query-diversity-over-retriever-diversity strategy, where five complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and fused via variance-aware nested Reciprocal Rank Fusion; and (ii) a multistage generation pipeline that decomposes grounded generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. Our system ranks 1st in Task A (nDCG@5: 0.5776, +20.5% over the strongest baseline) and 2nd in Task B (HM: 0.7698). Empirical analysis shows that query diversity over a well-aligned retriever outperforms heterogeneous retriever ensembling, and that answerability calibration-rather than retrieval coverage-is the primary bottleneck in end-to-end performance.

[57] Automatic End-to-End Data Integration using Large Language Models

Aaron Steiner, Christian Bizer

Main category: cs.CL

TL;DR: GPT-5.2 automates end-to-end data integration pipelines by generating schema mappings, value mappings, entity matching training data, and conflict resolution validation data, achieving comparable or better results than human-designed pipelines at much lower cost.

Details

Motivation: Current data integration pipelines require substantial manual effort from data engineers for configuration and labeling. While LLMs show promise for individual steps, their potential to replace all human input in end-to-end pipelines hasn't been explored.

Method: Uses GPT-5.2 to automatically generate all necessary artifacts for data integration pipelines: schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion.

Result: LLM-based pipeline produces similar or better results than human-designed pipelines across video game, music, and company data integration case studies. End-to-end integrated datasets are comparable in size and density. Costs approximately $10 per case study vs. much higher human engineering costs.

Conclusion: LLMs can effectively automate end-to-end data integration pipelines, reducing human effort and costs while maintaining or improving quality, representing a significant step toward fully automated data integration.

Abstract: Designing data integration pipelines typically requires substantial manual effort from data engineers to configure pipeline components and label training data. While LLMs have shown promise in handling individual steps of the integration process, their potential to replace all human input across end-to-end data integration pipelines has not been investigated. As a step toward exploring this potential, we present an automatic data integration pipeline that uses GPT-5.2 to generate all artifacts required to adapt the pipeline to specific use cases. These artifacts are schema mappings, value mappings for data normalization, training data for entity matching, and validation data for selecting conflict resolution heuristics in data fusion. We compare the performance of this LLM-based pipeline to the performance of human-designed pipelines along three case studies requiring the integration of video game, music, and company related data. Our experiments show that the LLM-based pipeline is able to produce similar results, for some tasks even better results, as the human-designed pipelines. End-to-end, the human and the LLM pipelines produce integrated datasets of comparable size and density. Having the LLM configure the pipelines costs approximately $10 per case study, which represents only a small fraction of the cost of having human data engineers perform the same tasks.

[58] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

Nhi Dang, Tung Le, Huy Tien Nguyen

Main category: cs.CL

TL;DR: An automatic evaluation framework for domain-specific chatbots that generates Q&A pairs from knowledge bases, uses LLMs to judge responses, and applies confidence filtering to reduce manual review effort.

Details

Motivation: Current domain-specific chatbots using LLMs with retrieval augmented generation still produce unsupported/incorrect answers, but manual evaluation is costly and existing frameworks rely on curated test sets and static metrics, limiting scalability.

Method: End-to-end automatic evaluator that: 1) generates Q&A pairs directly from knowledge base, 2) uses LLMs to judge chatbot responses against reference answers, 3) applies confidence-based filtering to highlight uncertain cases for human review.

Result: Applied to Vietnamese news dataset, achieves high agreement with human judgments while significantly reducing review overhead. Framework is modular and language-agnostic.

Conclusion: Presents a practical, scalable solution for evaluating chatbots with minimal manual intervention, adaptable to diverse domains.

Abstract: Large language models (LLMs) combined with retrieval augmented generation have enabled the deployment of domain-specific chatbots, but these systems remain prone to generating unsupported or incorrect answers. Reliable evaluation is therefore critical, yet manual review is costly and existing frameworks often depend on curated test sets and static metrics, limiting scalability. We propose an end-to-end automatic evaluator designed to substantially reduce human effort. Our system generates Q&A pairs directly from the underlying knowledge base, uses LLMs to judge chatbot responses against reference answers, and applies confidence-based filtering to highlight uncertain cases. Applied to a Vietnamese news dataset, the evaluator achieves high agreement with human judgments while significantly lowering review overhead. The framework is modular and language-agnostic, making it readily adaptable to diverse domains. This work introduces a practical, scalable solution for evaluating chatbots with minimal reliance on manual intervention.

[59] MUNIChus: Multilingual News Image Captioning Benchmark

Yuji Chen, Alistair Plum, Hansi Hettiarachchi, Diptesh Kanojia, Saroj Basnet, Marcos Zampieri, Tharindu Ranasinghe

Main category: cs.CL

TL;DR: Created MUNIChus, the first multilingual news image captioning benchmark with 9 languages including low-resource ones, showing current models still struggle with this task.

Details

Motivation: Most news image captioning research focuses on English due to dataset scarcity in other languages, creating a need for multilingual benchmarks to advance the field globally.

Method: Created MUNIChus benchmark with 9 languages (including low-resource languages like Sinhala and Urdu), evaluated various state-of-the-art neural news image captioning models on this new dataset.

Result: News image captioning remains challenging even for state-of-the-art models, with over 20 models already benchmarked on MUNIChus which is now publicly available.

Conclusion: MUNIChus opens new avenues for developing and evaluating multilingual news image captioning models, addressing the language gap in this research area.

Abstract: The goal of news image captioning is to generate captions by integrating news article content with corresponding images, highlighting the relationship between textual context and visual elements. The majority of research on news image captioning focuses on English, primarily because datasets in other languages are scarce. To address this limitation, we create the first multilingual news image captioning benchmark, MUNIChus, comprising 9 languages, including several low-resource languages such as Sinhala and Urdu. We evaluate various state-of-the-art neural news image captioning models on MUNIChus and find that news image captioning remains challenging. We also make MUNIChus publicly available with over 20 models already benchmarked. MUNIChus opens new avenues for further advancements in developing and evaluating multilingual news image captioning models.

[60] Disentangling Similarity and Relatedness in Topic Models

Hanlin Xiao, Mauricio A. Álvarez, Rainer Breitling

Main category: cs.CL

TL;DR: PLM-augmented topic models differ from classical LDA by anchoring word co-occurrence statistics to pre-trained embedding spaces, which affects how topics capture thematic relatedness vs taxonomic similarity. The paper develops a neural scoring function using LLM-annotated word pairs to evaluate these semantic dimensions across topic models.

Details

Motivation: To understand how the integration of pre-trained language model embeddings fundamentally changes the semantic structure captured by topic models, specifically disentangling thematic relatedness and taxonomic similarity dimensions that differ between classical and PLM-augmented models.

Method: Constructed a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. Applied this scorer to comprehensive evaluation across multiple corpora and topic model families to analyze semantic structure differences.

Result: Different topic model families capture distinct semantic structures in their topics. Similarity and relatedness scores successfully predict downstream task performance depending on task requirements, revealing systematic differences between classical and PLM-augmented models.

Conclusion: Establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterizing these semantic dimensions across model families and corpora.

Abstract: The recent advancement of large language models has spurred a growing trend of integrating pre-trained language model (PLM) embeddings into topic models, fundamentally reshaping how topics capture semantic structure. Classical models such as Latent Dirichlet Allocation (LDA) derive topics from word co-occurrence statistics, whereas PLM-augmented models anchor these statistics to pre-trained embedding spaces, imposing a prior that also favours clustering of semantically similar words. This structural difference can be captured by the psycholinguistic dimensions of thematic relatedness and taxonomic similarity of the topic words. To disentangle these dimensions in topic models, we construct a large synthetic benchmark of word pairs using LLM-based annotation to train a neural scoring function. We apply this scorer to a comprehensive evaluation across multiple corpora and topic model families, revealing that different model families capture distinct semantic structure in their topics. We further demonstrate that similarity and relatedness scores successfully predict downstream task performance depending on task requirements. This paper establishes similarity and relatedness as essential axes for topic model evaluation and provides a reliable pipeline for characterising these across model families and corpora.

[61] Making Bielik LLM Reason (Better): A Field Report

Adam Trybus, Bartosz Bartnicki, Remigiusz Kinas

Main category: cs.CL

TL;DR: Evaluation of reasoning capabilities in Bielik, a Polish LLM, through benchmarking, comparative analysis with other models, and future development planning.

Details

Motivation: To assess and improve the reasoning capabilities of Bielik, a Polish large language model, in the competitive AI landscape, addressing limitations in current analyses and ensuring its continued development.

Method: Multi-stage approach: initial benchmarking, creation of evaluation methodology, comparative analysis with other LLMs, and outlining future development prospects considering current limitations.

Result: The paper presents evaluation results of Bielik’s reasoning capabilities compared to other LLMs, identifies limitations in current analyses, and proposes future development directions.

Conclusion: Bielik requires ongoing evaluation and development to remain competitive in the AI landscape, with specific focus areas identified for improving its reasoning capabilities.

Abstract: This paper presents a research program dedicated to evaluating and advancing the reasoning capabilities of Bielik, a Polish large language model. The study describes a number of stages of work: initial benchmarking and creation of evaluation methodology, analyzing of comparative results with other LLMs and outlining of future prospects that take into account the limitations of the analyses conducted so far and aims to keep Bielik in the race give the ever-changing – and competitive – AI landscape.

[62] Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

Yuyao Ge, Shenghua Liu, Yiwei Wang, Tianyu Liu, Baolong Bi, Lingrui Mei, Jiayu Yao, Jiafeng Guo, Xueqi Cheng

Main category: cs.CL

TL;DR: PRISM-Δ is a projection-based method for steering LLMs to prioritize user-specified text spans by extracting discriminative directions from attention heads while eliminating shared structural patterns.

Details

Motivation: Current steering methods often capture shared structural patterns between relevant and irrelevant contexts rather than the actual discriminative differences needed for effective text prioritization.

Method: Decomposes difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Uses continuous softplus importance weights for attention heads and extends to Value representations beyond Key-only methods.

Result: Outperforms existing methods on 19/20 configurations across four benchmarks and five models, with up to +10.6% relative gains, while halving fluency cost. Scales to long-context retrieval with up to +4.8% relative gain over best existing method.

Conclusion: PRISM-Δ provides an effective framework for steering LLMs to prioritize user-specified text spans with improved discriminative power, better performance, and reduced fluency costs.

Abstract: Prompt highlighting steers a large language model to prioritize user-specified text spans during generation. A key challenge is extracting steering directions that capture the difference between relevant and irrelevant contexts, rather than shared structural patterns common to both. We propose PRISM-$Δ$ (Projection-based Relevance-Informed Steering Method), which decomposes the difference between positive and negative cross-covariance matrices to maximize discriminative energy while eliminating shared directions. Each attention head receives a continuous softplus importance weight, letting weak-but-useful heads contribute at reduced strength. The framework extends naturally to Value representations, capturing content-channel signal that Key-only methods leave unused. Across four benchmarks and five models, PRISM-$Δ$ matches or exceeds the best existing method on 19 of 20 configurations, with relative gains up to +10.6%, while halving the fluency cost of steering. PRISM-$Δ$ also scales to long-context retrieval, outperforming the best existing method by up to +4.8% relative gain. PRISM-$Δ$ is compatible with FlashAttention and adds negligible memory overhead.

[63] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

Shuang Zhou, Kai Yu, Song Wang, Wenya Xie, Zaifu Zhan, Meng-Han Tsai, Yuen-Hei Chung, Shutong Hou, Huixue Zhou, Min Zeng, Bhavadharini Ramu, Lin Yee Chen, Feng Xie, Rui Zhang

Main category: cs.CL

TL;DR: HeartAgent is a cardiology-specific AI agent system for reliable, explainable differential diagnosis of heart diseases, integrating specialized tools and data resources to support complex reasoning with transparent decision-making.

Details

Motivation: Heart diseases are a major global health burden, but existing AI diagnostic methods lack sufficient cardiology knowledge, complex reasoning capabilities, and interpretability needed for trustworthy clinical decision support.

Method: HeartAgent integrates customized tools and curated data resources, orchestrating multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references.

Result: On MIMIC dataset and private EHR cohort, HeartAgent achieved over 36% and 20% improvements in top-3 diagnostic accuracy over established methods. Clinicians assisted by HeartAgent showed 26.9% improvement in diagnostic accuracy and 22.7% improvement in explanatory quality.

Conclusion: HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care, demonstrating the value of specialized agent systems in medical diagnostics.

Abstract: Heart diseases remain a leading cause of morbidity and mortality worldwide, necessitating accurate and trustworthy differential diagnosis. However, existing artificial intelligence-based diagnostic methods are often limited by insufficient cardiology knowledge, inadequate support for complex reasoning, and poor interpretability. Here we present HeartAgent, a cardiology-specific agent system designed to support a reliable and explainable differential diagnosis. HeartAgent integrates customized tools and curated data resources and orchestrates multiple specialized sub-agents to perform complex reasoning while generating transparent reasoning trajectories and verifiable supporting references. Evaluated on the MIMIC dataset and a private electronic health records cohort, HeartAgent achieved over 36% and 20% improvements over established comparative methods, in top-3 diagnostic accuracy, respectively. Additionally, clinicians assisted by HeartAgent demonstrated gains of 26.9% in diagnostic accuracy and 22.7% in explanatory quality compared with unaided experts. These results demonstrate that HeartAgent provides reliable, explainable, and clinically actionable decision support for cardiovascular care.

[64] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali

Main category: cs.CL

TL;DR: mAceReason-Math dataset provides high-quality translations of challenging math problems for multilingual Reinforcement Learning with Verifiable Rewards (RLVR) research, covering 14 languages with 10,000+ samples per language.

Details

Motivation: Current RLVR research and training datasets are English-centric, and existing multilingual datasets are not designed for RLVR or current model capabilities, often being too easy to provide meaningful training signals.

Method: Created mAceReason-Math by translating challenging math problems from the AceReason-Math corpus (curated for RLVR), with careful cleaning and improvement of translations to ensure high quality.

Result: Produced a dataset covering 14 languages with more than 10,000 samples per language, specifically designed for multilingual RLVR research and benchmarking.

Conclusion: The mAceReason-Math dataset addresses the gap in multilingual RLVR research and is released to facilitate multilingual RLVR research and benchmarking in the community.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has been successfully applied to significantly boost the capabilities of pretrained large language models, especially in the math and logic problem domains. However, current research and available training datasets remain English-centric. While mul- tilingual training data and benchmarks have been created in the past, they were not created with RLVR and current model capability in mind, and their level of difficulty is often too low to provide appropriate training signals for current models. To address this gap, we provide mAceReason-Math, a dataset of high-quality translations of challenging math problems sourced from a corpus specifically curated for RLVR (AceReason-Math). We further take specific care to clean and improve our translations, resulting in a coverage of 14 languages with more than 10,000 samples per language. We release the dataset to facilitate multilingual RLVR research and benchmarking in the research community.

[65] Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

Zhipeng Yang, Shu Yang, Lijie Hu, Di Wang

Main category: cs.CL

TL;DR: LLMs exhibit robustness to non-canonical tokenization through a “word recovery” mechanism where hidden states reconstruct canonical word-level tokens from character-level inputs, enabled by in-group attention among characters of the same word.

Details

Motivation: Understanding why LLMs trained with standard tokenization can still process character-level inputs effectively, despite the mismatch between training and inference tokenization schemes.

Method: Used mechanistic interpretability with three approaches: 1) decoding-based detection of word recovery from hidden states, 2) causal intervention by removing recovery subspaces, and 3) fine-grained attention analysis with masking experiments.

Result: Hidden states reconstruct canonical word-level token identities from character-level inputs; removing recovery subspaces degrades task performance; in-group attention among characters of the same word is critical for recovery.

Conclusion: Word recovery is a key mechanism enabling LLMs’ tokenization robustness, with in-group attention in early layers playing a crucial role in reconstructing canonical word representations from character-level inputs.

Abstract: Large language models (LLMs) trained with canonical tokenization exhibit surprising robustness to non-canonical inputs such as character-level tokenization, yet the mechanisms underlying this robustness remain unclear. We study this phenomenon through mechanistic interpretability and identify a core process we term word recovery. We first introduce a decoding-based method to detect word recovery, showing that hidden states reconstruct canonical word-level token identities from character-level inputs. We then provide causal evidence by removing the corresponding subspace from hidden states, which consistently degrades downstream task performance. Finally, we conduct a fine-grained attention analysis and show that in-group attention among characters belonging to the same canonical token is critical for word recovery: masking such attention in early layers substantially reduces both recovery scores and task performance. Together, our findings provide a mechanistic explanation for tokenization robustness and identify word recovery as a key mechanism enabling LLMs to process character-level inputs.

[66] Large Language Models as Annotators for Machine Translation Quality Estimation

Sidi Wang, Sophie Arnoult, Amir Kamran

Main category: cs.CL

TL;DR: Using LLMs to generate MQM-style annotations for training COMET models to reduce inference costs while maintaining competitive MT quality estimation performance.

Details

Motivation: LLMs show excellent performance on Machine Translation Quality Estimation but have high inference costs, making them impractical for direct application. Need cost-effective alternatives that maintain performance.

Method: Propose using LLMs to generate MQM-style annotations for training COMET models. Develop simplified MQM scheme restricted to top-level categories to guide LLM selection. Create systematic GPT-4o-based prompt called PPbMQM (Prompt-Pattern-based-MQM).

Result: Generated annotations correlate well with human annotations. Training COMET on these annotations leads to competitive performance on segment-level quality estimation for Chinese-English and English-German language pairs.

Conclusion: LLMs can effectively generate training data for more efficient quality estimation models, providing a practical solution to high inference costs while maintaining competitive performance.

Abstract: Large Language Models (LLMs) have demonstrated excellent performance on Machine Translation Quality Estimation (MTQE), yet their high inference costs make them impractical for direct application. In this work, we propose applying LLMs to generate MQM-style annotations for training a COMET model: following Fernandes et al. (2023), we reckon that segment-level annotations provide a strong rationale for LLMs and are key to good segment-level QE. We propose a simplified MQM scheme, mostly restricted to top-level categories, to guide LLM selection. We present a systematic approach for the development of a GPT-4o-based prompt, called PPbMQM (Prompt-Pattern-based-MQM). We show that the resulting annotations correlate well with human annotations and that training COMET on them leads to competitive performance on segment-level QE for Chinese-English and English-German.

[67] Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

Weihang Huang, Mengna Liu

Main category: cs.CL

TL;DR: An LLM-assisted pipeline for interpretable Chinese metaphor identification using four different protocols, showing protocol choice creates more variation than model differences.

Details

Motivation: Most computational metaphor identification approaches are opaque classifiers that don't explain why expressions are judged metaphorical, creating an interpretability gap especially acute for Chinese due to rich figurative traditions, absent morphological cues, and limited annotated resources.

Method: An LLM-assisted pipeline operationalizing four metaphor identification protocols (MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification) as executable, human-auditable rule scripts with modular chains of deterministic steps interleaved with controlled LLM calls.

Result: Protocol A (MIP) achieved F1 of 0.472 on token-level identification; cross-protocol analysis revealed striking divergence with pairwise Cohen’s kappa between Protocols A and D at 0.001, while Protocols B and C exhibited near-perfect agreement (kappa = 0.986). All protocols achieved 100% deterministic reproducibility with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00.

Conclusion: Protocol choice is the single largest source of variation in metaphor identification (exceeding model-level variation), and rule-script architectures achieve competitive performance while maintaining full transparency.

Abstract: Metaphor identification is a foundational task in figurative language processing, yet most computational approaches operate as opaque classifiers offering no insight into why an expression is judged metaphorical. This interpretability gap is especially acute for Chinese, where rich figurative traditions, absent morphological cues, and limited annotated resources compound the challenge. We present an LLM-assisted pipeline that operationalises four metaphor identification protocols–MIP/MIPVU lexical analysis, CMDAG conceptual-mapping annotation, emotion-based detection, and simile-oriented identification–as executable, human-auditable rule scripts. Each protocol is a modular chain of deterministic steps interleaved with controlled LLM calls, producing structured rationales alongside every classification decision. We evaluate on seven Chinese metaphor datasets spanning token-, sentence-, and span-level annotation, establishing the first cross-protocol comparison for Chinese metaphor identification. Within-protocol evaluation shows Protocol A (MIP) achieves an F1 of 0.472 on token-level identification, while cross-protocol analysis reveals striking divergence: pairwise Cohen’s kappa between Protocols A and D is merely 0.001, whereas Protocols B and C exhibit near-perfect agreement (kappa = 0.986). An interpretability audit shows all protocols achieve 100% deterministic reproducibility, with rationale correctness from 0.40 to 0.87 and editability from 0.80 to 1.00. Error analysis identifies conceptual-domain mismatch and register sensitivity as dominant failure modes. Our results demonstrate that protocol choice is the single largest source of variation in metaphor identification, exceeding model-level variation, and that rule-script architectures achieve competitive performance while maintaining full transparency.

[68] LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

Nina Hosseini-Kivanani, Fred Philippy

Main category: cs.CL

TL;DR: LuxBorrow analyzes Luxembourgish news over 27 years, finding pervasive multilingual practice with French as main donor language, increasing code-switching over time, and advocating for borrowing-centric evaluation metrics.

Details

Motivation: To understand language borrowing patterns in Luxembourgish news media over time, examining how multilingual practices evolve and quantifying borrowing intensity beyond simple code-mixing indices.

Method: Pipeline combining sentence-level language identification (LU/DE/FR/EN) with token-level borrowing resolver restricted to LU sentences, using lemmatization, loanword registry, and morphological/orthographic rules on 259,305 RTL articles (43.7M tokens).

Result: LU remains matrix language across all documents; 77.1% articles include at least one donor language; median CMI increases from 3.90 to 7.00; token-level adaptations total 25,444 instances (morphological 63.8%, orthographic 35.9%, lexical 0.3%); French overwhelmingly supplies adapted items; code-switching intensifies diachronically.

Conclusion: Advocates for borrowing-centric evaluation metrics (borrowed token/type rates, donor entropy, assimilation ratios) rather than relying only on document-level mixing indices, revealing nuanced multilingual patterns in Luxembourgish media.

Abstract: We present LuxBorrow, a borrowing-first analysis of Luxembourgish (LU) news spanning 27 years (1999-2025), covering 259,305 RTL articles and 43.7M tokens. Our pipeline combines sentence-level language identification (LU/DE/FR/EN) with a token-level borrowing resolver restricted to LU sentences, using lemmatization, a collected loanword registry, and compiled morphological and orthographic rules. Empirically, LU remains the matrix language across all documents, while multilingual practice is pervasive: 77.1% of articles include at least one donor language and 65.4% use three or four. Breadth does not imply intensity: median code-mixing index (CMI) increases from 3.90 (LU+1) to only 7.00 (LU+3), indicating localized insertions rather than balanced bilingual text. Domain and period summaries show moderate but persistent mixing, with CMI rising from 6.1 (1999-2007) to a peak of 8.4 in 2020. Token-level adaptations total 25,444 instances and exhibit a mixed profile: morphological 63.8%, orthographic 35.9%, lexical 0.3%. The most frequent individual rules are orthographic, such as on->oun and eur->er, while morphology is collectively dominant. Diachronically, code-switching intensifies, and morphologically adapted borrowings grow from a small base. French overwhelmingly supplies adapted items, with modest growth for German and negligible English. We advocate borrowing-centric evaluation, including borrowed token and type rates, donor entropy over borrowed items, and assimilation ratios, rather than relying only on document-level mixing indices.

[69] Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

Konstantin Dobler, Simon Lehnerer, Federico Scozzafava, Jonathan Janke, Mohamed Ali

Main category: cs.CL

TL;DR: Multilingual extension of Reasoning Gym with procedurally generated reasoning problems across 14 languages, featuring parallel problem generation and native-speaker validation.

Details

Motivation: To create a multilingual benchmark for evaluating reasoning capabilities across languages, addressing the need for cross-lingual reasoning evaluation in language models.

Method: Extends Reasoning Gym by translating templates for 94 tasks across 14 languages with native-speaker validation, using procedural generation for unlimited problem instances with adjustable difficulty.

Result: Created a multilingual reasoning benchmark with parallel problem generation across languages, enabling large-scale cross-lingual data generation and evaluation.

Conclusion: The Multilingual Reasoning Gym provides a valuable resource for research into multilingual reasoning models with procedurally generated, verifiable problems across multiple languages.

Abstract: We present the Multilingual Reasoning Gym, an extension of Reasoning Gym (Stojanovski et al., 2025), that procedurally generates verifiable reasoning problems across 14 languages. We translate templates for 94 tasks with native-speaker validation in 10 languages and targeted code or template adaptations to ensure linguistic naturalness. The Multilingual Reasoning Gym preserves the core benefits of the procedural generation approach used in the original Reasoning Gym, such as virtually unlimited problem instance generation and adjustable difficulty, and remains directly usable for Reinforcement Learning from Verifiable Rewards and evaluation settings. Problems in the Multilingual Reasoning Gym are parallel across languages, enabling crosslingually parallel data generation at massive scale due to the procedural nature of the environments. We release our implementation to support research into multilingual reasoning models.

[70] PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

Yuzhi Liang, Shiliang Xiao, Jingsong Wei, Qiliang Lin, Xia Li

Main category: cs.CL

TL;DR: PivotAttack: An efficient “inside-out” hard-label text attack framework using Multi-Armed Bandit to identify Pivot Sets (combinatorial token groups) as prediction anchors for strategic perturbation, achieving better attack success rate and query efficiency than SOTA methods.

Details

Motivation: Existing hard-label text attacks rely on inefficient "outside-in" strategies that traverse vast search spaces, leading to high query costs. There's a need for more query-efficient attack methods that can effectively capture inter-word dependencies in text.

Method: Proposes PivotAttack, an “inside-out” framework using Multi-Armed Bandit algorithm to identify Pivot Sets - combinatorial token groups that act as prediction anchors. These pivot sets are strategically perturbed to induce label flips while capturing inter-word dependencies and minimizing query costs.

Result: Extensive experiments across traditional models and Large Language Models show that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.

Conclusion: PivotAttack provides a more efficient approach to hard-label text attacks by shifting from “outside-in” to “inside-out” strategies, effectively reducing query costs while maintaining high attack success rates across various model types.

Abstract: Existing hard-label text attacks often rely on inefficient “outside-in” strategies that traverse vast search spaces. We propose PivotAttack, a query-efficient “inside-out” framework. It employs a Multi-Armed Bandit algorithm to identify Pivot Sets-combinatorial token groups acting as prediction anchors-and strategically perturbs them to induce label flips. This approach captures inter-word dependencies and minimizes query costs. Extensive experiments across traditional models and Large Language Models demonstrate that PivotAttack consistently outperforms state-of-the-art baselines in both Attack Success Rate and query efficiency.

[71] SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

Nevidu Jayatilleke, Nisansa de Silva, Uthpala Nimanthi, Gagani Kulathilaka, Azra Safrullah, Johan Sofalas

Main category: cs.CL

TL;DR: SiDiaC-v.2.0 is the largest Sinhala diachronic corpus covering 1800-1955 CE with 244k words from 185 literary works, featuring genre categorization and historical text processing.

Details

Motivation: To create a comprehensive diachronic corpus for Sinhala language NLP research, addressing the need for historical language resources and building upon previous work (SiDiaC-v.1.0) while overcoming challenges of low-resource language status.

Method: Corpus construction involved digitizing texts from the National Library of Sri Lanka using Google Document AI OCR, followed by extensive post-processing including filtering, preprocessing, copyright compliance checks, formatting correction, code-mixing handling, special token inclusion, and malformed token fixing. Annotation of 59 documents (70k words) based on written dates, with genre categorization into primary (Non-Fiction/Fiction) and secondary (Religious, History, Poetry, Language, Medical) layers.

Result: Created SiDiaC-v.2.0 with 244k words across 185 literary works covering 1800-1955 CE, featuring comprehensive preprocessing, genre categorization, and historical span annotation, serving as the largest Sinhala diachronic corpus to date.

Conclusion: SiDiaC-v.2.0 provides a valuable resource for Sinhala NLP research despite limited resources, building on previous work and incorporating best practices from other corpora, with potential applications in historical linguistics and language processing.

Abstract: SiDiaC-v.2.0 is the largest comprehensive Sinhala Diachronic Corpus to date, covering a period from 1800 CE to 1955 CE in terms of publication dates, and a historical span from the 5th to the 20th century CE in terms of written dates. The corpus consists of 244k words across 185 literary works that underwent thorough filtering, preprocessing, and copyright compliance checks, followed by extensive post-processing. Additionally, a subset of 59 documents totalling 70k words was annotated based on their written dates. Texts from the National Library of Sri Lanka were selected from the SiDiaC-v.1.0 non-filtered list, which was digitised using Google Document AI OCR. This was followed by post-processing to correct formatting issues, address code-mixing, include special tokens, and fix malformed tokens. The construction of SiDiaC-v.2.0 was informed by practices from other corpora, such as FarPaHC, SiDiaC-v.1.0, and CCOHA. This was particularly relevant for syntactic annotation and text normalisation strategies, given the shared characteristics of low-resource language status between Faroese and the similar cleaning strategies utilised in CCOHA. This corpus is categorised into two layers based on genres: primary and secondary. The primary categorisation is binary, assigning each book to either Non-Fiction or Fiction. The secondary categorisation is more detailed, grouping texts under specific genres such as Religious, History, Poetry, Language, and Medical. Despite facing challenges due to limited resources, SiDiaC-v.2.0 serves as a comprehensive resource for Sinhala NLP, building upon the work previously done in SiDiaC-v.1.0.

[72] An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took “Use of Practical AI in Digital Libraries” seriously?

Jennifer D’Souza, Sameer Sadruddin, Maximilian Kähler, Andrea Salfinger, Luca Zaccagna, Francesca Incitti, Lauro Snidaro, Osma Suominen

Main category: cs.CL

TL;DR: A bilingual English/German corpus with authority file annotations and machine-actionable taxonomy for ontology-aware classification and authority-grounded evaluation in cataloging

Details

Motivation: Subject indexing is essential for document discovery but difficult to maintain at scale and across languages, requiring better tools for cataloging and authority-grounded AI assistance

Method: Release of a large bilingual corpus (English/German) annotated with Integrated Authority File (GND) terms, plus a machine-actionable GND taxonomy, enabling multi-label classification and authority-grounded evaluation

Result: Provides statistical profile and qualitative error analyses of three systems, enabling reproducible evaluation of authority-aware classification approaches

Conclusion: Invites community to assess accuracy, usefulness, and transparency toward developing authority-anchored AI co-pilots that amplify catalogers’ work

Abstract: Subject indexing is vital for discovery but hard to sustain at scale and across languages. We release a large bilingual (English/German) corpus of catalog records annotated with the Integrated Authority File (GND), plus a machine-actionable GND taxonomy. The resource enables ontology-aware multi-label classification, mapping text to authority terms, and agent-assisted cataloging with reproducible, authority-grounded evaluation. We provide a brief statistical profile and qualitative error analyses of three systems. We invite the community to assess not only accuracy but usefulness and transparency, toward authority-anchored AI co-pilots that amplify catalogers’ work.

Ayan Sengupta, Shantanu Dixit, Md Shad Akhtar, Tanmoy Chakraborty

Main category: cs.CL

TL;DR: ARMADA: Efficient cross-modal knowledge distillation framework that transfers knowledge from vision-language models (including black-box) to language-only models without requiring multimodal teacher pre-training or internal model access.

Details

Motivation: Traditional KD assumes modality homogeneity between teacher and student, while existing multimodal KD requires expensive modality-specific teacher pre-training. Need for efficient cross-modal distillation from vision-language to language-only models.

Method: Uses novel alignment techniques to distill knowledge without altering teacher model, leveraging vision-language models (including black-box) as teachers for language-only students. No need for internal teacher structures or expensive multimodal pre-training.

Result: Validated on 12 NLU, 8 generative reasoning, and 5 instruction-tuning tasks. Achieves up to 3.4% improvement on language understanding and 2.6% boost in generative reasoning for models like DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}.

Conclusion: Challenges conventional KD paradigms by showing vision-language models can enhance language models when distilled appropriately, without expensive multimodal pre-training or teacher fine-tuning.

Abstract: Knowledge distillation (KD) methods are pivotal in compressing large pre-trained language models into smaller models, ensuring computational efficiency without significantly dropping performance. Traditional KD techniques assume homogeneity in modalities between the teacher (source) and the student (target) models. On the other hand, existing multimodal knowledge distillation methods require modality-specific pre-training of the teacher model, which is computationally infeasible in most cases. In this paper, we introduce ARMADA, an efficient cross-modal knowledge distillation framework designed to transfer knowledge from large vision-language models, including black-box models, to language-only models. Unlike existing KD techniques that rely on the internal structures of multimodal teachers or require computationally expensive pre-training, ARMADA leverages novel alignment techniques to distil knowledge without altering the teacher model, ensuring efficiency and scalability. We empirically validate ARMADA on twelve natural language understanding, eight complex generative reasoning and five instruction-tuning tasks, demonstrating consistent performance improvements in large models such as DeBERTa-v2-1.4B, OPT-1.3B, LLaMA-{3B, 7B, 8B}. ARMADA achieves up to 3.4% improvement on language understanding tasks and 2.6% boost in generative reasoning, all without requiring expensive multimodal pre-training or fine-tuning of the teacher model. Our findings challenge conventional knowledge distillation paradigms by demonstrating that even vision-language models, despite lacking direct textual understanding, can significantly enhance language models when distilled appropriately.

[74] GLM-OCR Technical Report

Shuaiqi Duan, Yadong Xue, Weihan Wang, Zhe Su, Huan Liu, Sheng Yang, Guobing Gan, Guo Wang, Zihan Wang, Shengdong Yan, Dexin Jin, Yuxuan Zhang, Guohong Wen, Yanfeng Wang, Yutao Zhang, Xiaohan Zhang, Wenyi Hong, Yukuo Cen, Da Yin, Bin Chen, Wenmeng Yu, Xiaotao Gu, Jie Tang

Main category: cs.CL

TL;DR: GLM-OCR is a 0.9B-parameter multimodal model for document understanding that combines visual encoder and language decoder with multi-token prediction for efficient OCR, achieving strong performance on various document parsing tasks.

Details

Motivation: The paper addresses the need for efficient multimodal models for real-world document understanding that balance computational efficiency with recognition performance, particularly for OCR tasks where standard autoregressive decoding is inefficient.

Method: Combines 0.4B CogViT visual encoder with 0.5B GLM language decoder, introduces Multi-Token Prediction (MTP) mechanism to predict multiple tokens per step for faster decoding, and uses two-stage pipeline with layout analysis followed by parallel region-level recognition.

Result: Achieves competitive or state-of-the-art performance on document parsing, text/formula transcription, table structure recovery, and key information extraction benchmarks while maintaining computational efficiency suitable for edge deployment.

Conclusion: GLM-OCR demonstrates that compact multimodal models can achieve strong document understanding performance with efficient architecture design, making them suitable for both resource-constrained and large-scale production systems.

Abstract: GLM-OCR is an efficient 0.9B-parameter compact multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, achieving a strong balance between computational efficiency and recognition performance. To address the inefficiency of standard autoregressive decoding in deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, a two-stage pipeline is adopted: PP-DocLayout-V3 first performs layout analysis, followed by parallel region-level recognition. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.

[75] LLM2Vec-Gen: Generative Embeddings from Large Language Models

Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy

Main category: cs.CL

TL;DR: LLM2Vec-Gen: A self-supervised method that learns to represent LLM’s potential responses using trainable special tokens appended to input, achieving state-of-the-art unsupervised text embedding performance.

Details

Motivation: Traditional LLM-based text embedders encode input semantics but face challenges in mapping diverse inputs to similar outputs. Current methods rely on paired data and contrastive learning, which requires labeled data. The authors aim to develop a self-supervised approach that bridges the input-output gap and transfers LLM capabilities to embedding tasks without requiring labeled data.

Method: Add trainable special tokens to LLM’s vocabulary and append them to input. Optimize these tokens to represent the LLM’s response in a fixed-length sequence. Training is guided by the LLM’s own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. The LLM backbone remains frozen and training requires only unlabeled queries.

Result: Achieves state-of-the-art self-supervised performance on MTEB, improving by 9.3% over best unsupervised embedding teacher. Shows 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Learned embeddings are interpretable and can be decoded into text to reveal semantic content.

Conclusion: LLM2Vec-Gen provides a novel self-supervised paradigm for text embedding that bridges the input-output gap, transfers LLM capabilities to embedding tasks, and achieves strong performance without labeled data while maintaining interpretability.

Abstract: LLM-based text embedders typically encode the semantic content of their input. However, embedding tasks require mapping diverse inputs to similar outputs. Typically, this input-output is addressed by training embedding models with paired data using contrastive learning. In this work, we propose a novel self-supervised approach, LLM2Vec-Gen, which adopts a different paradigm: rather than encoding the input, we learn to represent the model’s potential response. Specifically, we add trainable special tokens to the LLM’s vocabulary, append them to input, and optimize them to represent the LLM’s response in a fixed-length sequence. Training is guided by the LLM’s own completion for the query, along with an unsupervised embedding teacher that provides distillation targets. This formulation helps to bridge the input-output gap and transfers LLM capabilities such as safety alignment and reasoning to embedding tasks. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 9.3% over the best unsupervised embedding teacher. We also observe up to 43.2% reduction in harmful content retrieval and 29.3% improvement in reasoning capabilities for embedding tasks. Finally, the learned embeddings are interpretable and can be decoded into text to reveal their semantic content.

[76] Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Mingyang Song, Mao Zheng, Chenning Xu

Main category: cs.CL

TL;DR: LLM-as-a-judge paradigm’s assumption of high inter-evaluator agreement indicating reliable evaluation is challenged, showing consensus is often illusory due to shared surface heuristics rather than substantive quality assessment.

Details

Motivation: To challenge the critical assumption in LLM-as-a-judge paradigm that high inter-evaluator agreement indicates reliable and objective evaluation, revealing that this consensus is frequently illusory and based on shared surface heuristics rather than substantive quality assessment.

Method: Large-scale study of 105,600 evaluation instances (32 LLMs × 3 frontier judges × 100 tasks × 11 temperatures) analyzing model-level vs sample-level agreement, plus introduction of MERG (Metacognitive Enhanced Rubric Generation) framework for knowledge-driven rubric generation.

Result: Model-level agreement (Spearman ρ=0.99) masks fragile sample-level agreement (Pearson r̄=0.72; ICC=0.67); sharing rubric structure restores 62% of total agreement; high-quality outputs receive least consistent evaluations; MERG increases agreement in codified domains (Education +22%, Academic +27%) but decreases in subjective domains.

Conclusion: Evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF, as domain knowledge anchors evaluators on shared standards while revealing genuine evaluative pluralism in subjective domains.

Abstract: The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. \textbf{First}, we demonstrate that this consensus is frequently illusory. We identify and formalize \textbf{Evaluation Illusion}, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $ρ= 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62% of total agreement, and that high-quality outputs paradoxically receive the \textit{least} consistent evaluations. \textbf{Second}, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement \textit{increases} in codified domains (Education +22%, Academic +27%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.

[77] Instruction set for the representation of graphs

Ezequiel Lopez-Rubio, Mario Pascual-Gonzalez

Main category: cs.CL

TL;DR: IsalGraph encodes graph structures as compact strings using a 9-character alphabet and virtual machine, enabling graph similarity search, generation, and language model compatibility.

Details

Motivation: Need for compact, isomorphism-invariant sequential representations of graphs that are compatible with language models for applications like graph similarity search, generation, and graph-conditioned language modeling.

Method: Uses a virtual machine with sparse graph, circular doubly-linked list, and two traversal pointers. Instructions move pointers or insert nodes/edges. GraphToString algorithm encodes connected graphs, with exhaustive variant producing canonical strings.

Result: Levenshtein distance between IsalGraph strings strongly correlates with graph edit distance (GED). Evaluated on five real-world graph datasets (IAM Letter LOW/MED/HIGH, LINUX, AIDS).

Conclusion: IsalGraph provides compact, isomorphism-invariant, language-model-compatible sequential encoding of graph structure for graph similarity search, generation, and graph-conditioned language modeling.

Abstract: We present IsalGraph, a method for representing the structure of any finite, simple graph as a compact string over a nine-character instruction alphabet. The encoding is executed by a small virtual machine comprising a sparse graph, a circular doubly-linked list (CDLL) of graph-node references, and two traversal pointers. Instructions either move a pointer through the CDLL or insert a node or edge into the graph. A key design property is that every string over the alphabet decodes to a valid graph, with no invalid states reachable. A greedy \emph{GraphToString} algorithm encodes any connected graph into a string in time polynomial in the number of nodes; an exhaustive-backtracking variant produces a canonical string by selecting the lexicographically smallest shortest string across all starting nodes and all valid traversal orders. We evaluate the representation on five real-world graph benchmark datasets (IAM Letter LOW/MED/HIGH, LINUX, and AIDS) and show that the Levenshtein distance between IsalGraph strings correlates strongly with graph edit distance (GED). Together, these properties make IsalGraph strings a compact, isomorphism-invariant, and language-model-compatible sequential encoding of graph structure, with direct applications in graph similarity search, graph generation, and graph-conditioned language modelling

[78] Modelling Language using Large Language Models

Jumbly Grindrod

Main category: cs.CL

TL;DR: LLMs can serve as scientific models of public languages, not just cognitive processes, and recent computational linguistics work helps develop proper model construals for them.

Details

Motivation: The paper argues that linguistic study should focus not only on cognitive processes behind linguistic competence but also on language as an external, social entity. This broader perspective reveals the value of LLMs as scientific models of public languages.

Method: The paper defends the position against arguments that LLMs provide no linguistic insight, and builds upon Weisberg’s (2007) notion of a model construal. It uses recent computational linguistics work on understanding LLM inner workings to develop proper model construals for LLMs as models of language.

Result: The paper successfully argues for LLMs as valuable scientific models of public languages and demonstrates how computational linguistics research can help develop appropriate model construals for them.

Conclusion: LLMs have a legitimate scientific role as models of public languages when properly construed, and computational linguistics research on understanding their inner workings supports this position.

Abstract: This paper argues that large language models have a valuable scientific role to play in serving as scientific models of public languages. Linguistic study should not only be concerned with the cognitive processes behind linguistic competence, but also with language understood as an external, social entity. Once this is recognized, the value of large language models as scientific models becomes clear. This paper defends the position against a number of arguments to the effect that language models provide no linguistic insight. Building upon Weisberg’s (2007) notion of a model construal, it is then argued that recent work in computational linguistics to better understand the inner workings of large language models can be used to develop a model construal for large language models as models of a language.

[79] EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen

Main category: cs.CL

TL;DR: EoRA is a fine-tuning-free method that enhances compressed LLMs using low-rank matrices to improve task-specific performance and balance accuracy-computation trade-offs beyond compression format constraints.

Details

Motivation: Post-training compression of LLMs reduces memory and latency but causes accuracy degradation and is limited by hardware/kernel constraints on compression formats, reducing deployment flexibility.

Method: Augments compressed LLMs with low-rank matrices without fine-tuning, allowing rapid task-specific performance enhancement and flexible accuracy-computation trade-off balancing. Includes optimized CUDA kernel for faster inference and reduced memory overhead via quantization.

Result: Outperforms prior training-free low-rank methods in recovering compressed LLM accuracy, achieving significant improvements (10.84% on ARC-Challenge, 6.74% on MathQA, 11.45% on GSM8K) for LLaMA3-8B compressed to 3-bit. CUDA kernel accelerates inference up to 1.4x.

Conclusion: EoRA provides a prompt solution for improving compressed model accuracy under varying requirements, enabling more efficient and flexible LLM deployment without fine-tuning.

Abstract: While post-training compression techniques effectively reduce the memory footprint, latency, and power consumption of Large Language Models (LLMs), they often result in noticeable accuracy degradation and remain limited by hardware and kernel constraints that restrict supported compression formats ultimately reducing flexibility across a wide range of deployment scenarios. In this work, we propose EoRA, a novel fine-tuning-free method that augments compressed LLMs with low-rank matrices, allowing users to rapidly enhance task-specific performance and freely balance the trade-off between accuracy and computational overhead beyond the constraints of compression formats. EoRA consistently outperforms prior training-free low rank methods in recovering the accuracy of compressed LLMs, achieving notable accuracy improvements (e.g., $\mathbf{10.84%}$ on ARC-Challenge, $\mathbf{6.74%}$ on MathQA, and $\mathbf{11.45%}$ on GSM8K) for LLaMA3-8B compressed to 3-bit. We also introduce an optimized CUDA kernel, accelerating inference by up to 1.4x and reducing memory overhead through quantizing EoRA. Overall, EoRA offers a prompt solution for improving the accuracy of compressed models under varying user requirements, enabling more efficient and flexible deployment of LLMs. Code is available at https://github.com/NVlabs/EoRA.

[80] Goal Hijacking Attack on Large Language Models via Pseudo-Conversation Injection

Zheng Chen, Buhui Yao

Main category: cs.CL

TL;DR: A novel goal hijacking attack method called Pseudo-Conversation Injection that exploits LLM weaknesses in role identification by fabricating conversation responses to hijack model outputs.

Details

Motivation: Goal hijacking attacks manipulate LLMs to produce predetermined outputs regardless of user input. Existing methods have limitations, so the authors aim to develop more effective attacks by exploiting LLMs' weaknesses in understanding conversation contexts and role identification.

Method: Proposes Pseudo-Conversation Injection which constructs malicious suffixes by fabricating LLM responses to user prompts, followed by malicious new tasks. This tricks the model into perceiving the initial prompt and fabricated response as completed conversation, then executing the falsified prompt. Three construction strategies: Targeted Pseudo-Conversation (specific target), Universal Pseudo-Conversation (general hijacking), and Robust Pseudo-Conversation (resilient to variations).

Result: Experiments on ChatGPT and Qwen show the method significantly outperforms existing approaches in attack effectiveness across various scenarios.

Conclusion: Pseudo-Conversation Injection is an effective goal hijacking method that exploits LLM weaknesses in conversation role identification, demonstrating superior performance over existing attacks and highlighting security vulnerabilities in LLMs.

Abstract: Goal hijacking is a type of adversarial attack on Large Language Models (LLMs) where the objective is to manipulate the model into producing a specific, predetermined output, regardless of the user’s original input. In goal hijacking, an attacker typically appends a carefully crafted malicious suffix to the user’s prompt, which coerces the model into ignoring the user’s original input and generating the target response. In this paper, we introduce a novel goal hijacking attack method called Pseudo-Conversation Injection, which leverages the weaknesses of LLMs in role identification within conversation contexts. Specifically, we construct the suffix by fabricating responses from the LLM to the user’s initial prompt, followed by a prompt for a malicious new task. This leads the model to perceive the initial prompt and fabricated response as a completed conversation, thereby executing the new, falsified prompt. Following this approach, we propose three Pseudo-Conversation construction strategies: Targeted Pseudo-Conversation, Universal Pseudo-Conversation, and Robust Pseudo-Conversation. These strategies are designed to achieve effective goal hijacking across various scenarios. Our experiments, conducted on two mainstream LLM platforms including ChatGPT and Qwen, demonstrate that our proposed method significantly outperforms existing approaches in terms of attack effectiveness.

[81] Token Cleaning: Fine-Grained Data Selection for LLM Supervised Fine-Tuning

Jinlong Pang, Na Di, Zhaowei Zhu, Jiaheng Wei, Hao Cheng, Chen Qian, Yang Liu

Main category: cs.CL

TL;DR: TokenCleaning: A token-level data cleaning method for supervised fine-tuning of LLMs that filters uninformative tokens while preserving task-critical information by measuring token influence on model updates.

Details

Motivation: While data quality matters more than quantity in SFT, most cleaning methods filter entire samples, ignoring that token quality varies within samples. Even high-quality samples contain redundant or harmful patterns that can degrade downstream performance when fine-tuned on.

Method: Proposes a token cleaning pipeline that evaluates token quality by measuring influence of model updates on each token, then applies threshold-based separation. Two approaches: single-pass with fixed reference model or iterative with self-evolving reference models.

Result: Extensive experiments show the framework consistently improves downstream performance. Theoretical analysis provides error upper bounds for both methods.

Conclusion: Token-level cleaning is effective for SFT, with both fixed-reference and self-evolving methods offering benefits. The approach provides a generic pipeline for improving data quality at token granularity.

Abstract: Recent studies show that in supervised fine-tuning (SFT) of large language models (LLMs), data quality matters more than quantity. While most data cleaning methods concentrate on filtering entire samples, the quality of individual tokens within a sample can vary significantly. After pre-training, even in high-quality samples, patterns or phrases that are not task-related can be redundant, uninformative, or even harmful. Continuing to fine-tune on these patterns may offer limited benefit and even degrade downstream task performance. In this paper, we investigate token quality from a noisy-label perspective and propose a generic token cleaning pipeline for SFT tasks. Our method filters out uninformative tokens while preserving those carrying key task-specific information. Specifically, we first evaluate token quality by examining the influence of model updates on each token, then apply a threshold-based separation. The token influence can be measured in a single pass with a fixed reference model or iteratively with self-evolving reference models. The benefits and limitations of both methods are analyzed theoretically by error upper bounds. Extensive experiments show that our framework consistently improves downstream performance. Code is available at https://github.com/UCSC-REAL/TokenCleaning.

[82] ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs

Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo

Main category: cs.CL

TL;DR: Analysis of how different thinking patterns (structured vs unstructured) affect LLM performance across model sizes, with dataset of 21k instruction-response pairs augmented with 5 thinking types.

Details

Motivation: Existing research lacks systematic understanding of how thinking patterns affect LLM performance across different model sizes. The Thinking then Responding paradigm (System 2 thinking) shows promise but needs comprehensive analysis of various thinking types' impact.

Method: Created ThinkPatterns-21k dataset with 21k instruction-response pairs from existing datasets, augmented with 5 thinking patterns: unstructured monologue and 4 structured variants (decomposition, self-ask, self-debate, self-critic). Evaluated across models from 3B to 32B parameters.

Result: Smaller models (<30B) benefit from most structured thinking patterns, while larger models (32B) degrade with structured thinking like decomposition. Unstructured monologue works well across all model sizes.

Conclusion: Different thinking patterns affect models differently based on size, with structured thinking helping smaller models but potentially harming larger ones. Unstructured thinking is broadly effective.

Abstract: Large language models (LLMs) have demonstrated enhanced performance through the \textit{Thinking then Responding} paradigm, where models generate internal thoughts before final responses (aka, System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs (QA) collected from existing instruction-following datasets with five thinking types. For each pair, we augment it with five distinct internal thinking patterns: one unstructured thinking (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most of structured thinking patterns, while larger models (32B) with structured thinking like decomposition would degrade performance and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we released all of our datasets, checkpoints, training logs of diverse thinking patterns to reproducibility, aiming to facilitate further research in this direction.

[83] BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models

Tian Xie, Tongxin Yin, Vaishakh Keshava, Xueru Zhang, Siddhartha Reddy Jonnalagadda

Main category: cs.CL

TL;DR: Paper proposes a framework to evaluate causal reasoning behind social bias in LLMs, categorizing reasoning types and testing models on sensitive attributes.

Details

Motivation: While LLMs are known to generate biased content, existing benchmarks only identify biases without understanding the underlying causal reasoning processes that produce them.

Method: Developed a formal schema categorizing causal reasoning into three types (mistaken, biased, contextually-grounded). Created 1788 questions covering eight sensitive attributes, each probing specific reasoning types. Manually validated questions and prompted LLMs to generate causal graphs behind answers. Evaluated four state-of-the-art LLMs.

Result: All tested LLMs exhibited biased causal reasoning on most questions eliciting it. Models also showed “mistaken-biased” reasoning (confusing correlation with causality to infer sensitive group membership then applying biased reasoning). Identified three strategies LLMs use to avoid bias: refusing to answer, avoiding sensitive attributes, and adding contextual restrictions.

Conclusion: The study reveals systematic patterns in how LLMs reason about sensitive social attributes, providing insights for future debiasing efforts through understanding of causal reasoning mechanisms.

Abstract: While large language models (LLMs) play increasingly significant roles in society, research shows they continue to generate content that reflects social bias against sensitive groups. Existing benchmarks effectively identify these biases, but a critical gap remains in understanding the underlying reasoning processes that produce them. This paper addresses this gap by evaluating the causal reasoning of LLMs when answering socially biased questions. We propose a formal schema that categorizes causal reasoning into three types (mistaken, biased, and contextually-grounded). We then synthesize 1788 questions covering eight sensitive attributes, with each set of questions designed to probe a specific type of causal reasoning. All questions are then manually validated, and each of them prompts the LLM to generate a causal graph behind its answer. We evaluate four state-of-the-art LLMs and find that all models exhibit biased causal reasoning on most questions eliciting it. Moreover, we discover that LLMs are also prone to “mistaken-biased” reasoning, where they first confuse correlation with causality to infer sensitive group membership and subsequently apply biased causal reasoning. By examining the cases where LLMs produce unbiased causal reasoning, we also identify three strategies LLMs employ to avoid bias (i.e., explicitly refusing to answer, avoiding sensitive attributes, and adding contextual restrictions), which provide insights for future debiasing efforts.

[84] Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement

Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song

Main category: cs.CL

TL;DR: This paper introduces LLM Psychometrics, an interdisciplinary field using psychometric instruments and theories to evaluate large language models beyond traditional benchmarks, focusing on human-like psychological constructs and human-centered evaluation.

Details

Motivation: Traditional evaluation methodologies for LLMs are inadequate for measuring human-like psychological constructs, moving beyond static task-specific benchmarks, and establishing human-centered evaluation. There's a need to bridge the gap between AI evaluation and psychometric science.

Method: The paper is a review that synthesizes emerging interdisciplinary research at the intersection of psychometrics and LLM evaluation. It leverages psychometric instruments, theories, and principles to develop new evaluation paradigms for LLMs.

Result: The review systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. It provides a structured framework for researchers and a curated repository of LLM psychometric resources.

Conclusion: LLM Psychometrics offers a promising approach to develop more comprehensive, human-centered evaluation paradigms for LLMs, aligning with human-level AI and promoting societal benefit through better understanding of AI capabilities.

Abstract: The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.

[85] Word length predicts word order: “Min-max”-ing drives language evolution

Hiram Ring

Main category: cs.CL

TL;DR: A theoretical paper proposing a universal explanation for word order change based on communicative efficiency (Min-Max theory), validated with large-scale corpus analysis showing word class length predicts word order better than genealogical or areal factors.

Details

Motivation: To provide a unified explanation for word order change in languages, reconciling competing theories about why languages change their word order patterns over time, particularly the relative placement of Subject, Object, and Verb.

Method: Proposes Min-Max theory of language behavior where agents minimize effort while maximizing information. Tests this using a massive dataset of 1,942 language corpora tagged for parts of speech, analyzing correlations between average lengths of word classes and word order patterns.

Result: Word class length in corpora provides a stronger explanation for word order realization than either genealogical or areal factors, supporting the Min-Max theory’s predictions about communicative efficiency.

Conclusion: The Min-Max theory offers a universal explanation for word order change based on communicative efficiency principles, with corpus evidence showing word class length is a key predictor of word order patterns across languages.

Abstract: A fundamental concern in linguistics has been to understand how languages change, such as in relation to word order. Since the order of words in a sentence (i.e. the relative placement of Subject, Object, and Verb) is readily identifiable in most languages, this has been a productive field of study for decades (see Greenberg 1963; Dryer 2007; Hawkins 2014). However, a language’s word order can change over time, with competing explanations for such changes (Carnie and Guilfoyle 2000; Crisma and Longobardi 2009; Martins and Cardoso 2018; Dunn et al. 2011; Jager and Wahle 2021). This paper proposes a general universal explanation for word order change based on a theory of communicative interaction (the Min-Max theory of language behavior) in which agents seek to minimize effort while maximizing information. Such an account unifies opposing findings from language processing (Piantadosi et al. 2011; Wasow 2022; Levy 2008) that make different predictions about how word order should be realized crosslinguistically. The marriage of both “efficiency” and “surprisal” approaches under the Min-Max theory is justified with evidence from a massive dataset of 1,942 language corpora tagged for parts of speech (Ring 2025), in which average lengths of particular word classes correlates with word order, allowing for prediction of basic word order from diverse corpora. The general universal pressure of word class length in corpora is shown to give a stronger explanation for word order realization than either genealogical or areal factors, highlighting the importance of language corpora for investigating such questions.

[86] Training with Pseudo-Code for Instruction Following

Prince Kumar, Rudra Murthy, Riyaz Bhat, Danish Contractor

Main category: cs.CL

TL;DR: Fine-tuning LLMs with pseudo-code augmented instruction data improves instruction following by 8-21% while maintaining or improving reasoning performance.

Details

Motivation: LLMs struggle with following simple, unambiguous instructions, especially those with compositional structure. While pseudo-code helps, writing it is tedious for users, and few-shot/code prompting is unnatural for non-experts.

Method: Fine-tune LLMs using instruction-tuning data augmented with pseudo-code representations of natural language instructions paired with final responses.

Result: Models trained with pseudo-code follow instructions more reliably with 8-21% relative gains on instruction following benchmarks, while preserving or improving mathematical and commonsense reasoning (up to 30% average gain across all benchmarks).

Conclusion: Training-time pseudo-code augmentation effectively improves LLM instruction following without compromising reasoning capabilities.

Abstract: Despite rapid advances in the capabilities of Large Language Models (LLMs), they continue to struggle with following relatively simple and unambiguous instructions, particularly when compositional structure is involved. Recent work suggests that models may follow instructions more effectively when they are expressed in pseudo-code rather than natural language. However, writing pseudo-code programs can be tedious, and relying on few-shot demonstrations or inference-time code prompting is often unnatural for non-expert users of LLMs. To overcome these limitations, we propose a training time approach that fine-tunes LLMs using instruction-tuning data augmented with pseudo-code representations of natural language instructions paired with final responses. We evaluate our method on 12 publicly available benchmarks spanning instruction-following, mathematical reasoning, and commonsense reasoning, across six base models. Our results show that models trained with pseudo-code follow instructions more reliably, achieving relative gains of 8-21% on instruction following benchmarks, while largely preserving and in some cases improving performance on mathematical and commonsense reasoning tasks, with an average gain of up to 30% across all evaluated benchmarks.

[87] LLLMs: A Data-Driven Survey of Evolving Research on Limitations of Large Language Models

Aida Kostikova, Zhipin Wang, Deidamea Bajri, Ole Pütz, Benjamin Paaßen, Steffen Eger

Main category: cs.CL

TL;DR: A data-driven survey analyzing research trends on limitations of large language models (LLLMs) from 2022-2025, identifying key limitation categories and growth patterns in academic publications.

Details

Motivation: To provide a quantitative, systematic review of research on LLM limitations as the field grows rapidly, addressing concerns about model shortcomings and tracking how research focus evolves over time.

Method: Analyzed 250,000 ACL and arXiv papers using keyword filtering, LLM-based classification validated by experts, and topic clustering with HDBSCAN+BERTopic and LlooM approaches to identify 14,648 relevant papers on LLM limitations.

Result: LLM-related papers increased 5x in ACL and 8x in arXiv from 2022-2025. LLLMs research grew even faster, reaching over 30% of LLM papers by 2025. Reasoning was the most studied limitation, followed by generalization, hallucination, bias, and security. ACL topics remained stable while arXiv shifted toward security, alignment, hallucinations, knowledge editing, and multimodality.

Conclusion: The survey provides quantitative insights into LLM limitation research trends, reveals different focus areas between ACL and arXiv communities, and offers a dataset and methodology for future research tracking.

Abstract: Large language model (LLM) research has grown rapidly, along with increasing concern about their limitations. In this survey, we conduct a data-driven, semi-automated review of research on limitations of LLMs (LLLMs) from 2022 to early 2025 using a bottom-up approach. From a corpus of 250,000 ACL and arXiv papers, we identify 14,648 relevant papers using keyword filtering, LLM-based classification, validated against expert labels, and topic clustering (via two approaches, HDBSCAN+BERTopic and LlooM). We find that the share of LLM-related papers increases over fivefold in ACL and nearly eightfold in arXiv between 2022 and 2025. Since 2022, LLLMs research grows even faster, reaching over 30% of LLM papers by 2025. Reasoning remains the most studied limitation, followed by generalization, hallucination, bias, and security. The distribution of topics in the ACL dataset stays relatively stable over time, while arXiv shifts toward security risks, alignment, hallucinations, knowledge editing, and multimodality. We offer a quantitative view of trends in LLLMs research and release a dataset of annotated abstracts and a validated methodology, available at: https://github.com/a-kostikova/LLLMs-Survey.

[88] AutoPCR: Automated Phenotype Concept Recognition by Prompting

Yicheng Tao, Yuanhao Huang, Yiqun Wang, Xin Luo, Jie Liu

Main category: cs.CL

TL;DR: AutoPCR is a prompt-based phenotype concept recognition method that performs entity extraction, candidate retrieval, and entity linking without ontology-specific training, achieving state-of-the-art performance across diverse biomedical datasets.

Details

Motivation: Existing phenotype concept recognition methods require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology, limiting their practical applicability in real-world biomedical text mining scenarios.

Method: AutoPCR uses a three-stage approach: 1) entity extraction using hybrid rule-based and neural tagging, 2) candidate retrieval via SapBERT (a biomedical BERT model), and 3) entity linking through prompting a large language model, all without requiring ontology-specific training.

Result: Experiments on four benchmark datasets show AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Ablation studies demonstrate its inductive capability and generalizability to new ontologies.

Conclusion: AutoPCR provides an effective prompt-based approach for phenotype concept recognition that eliminates the need for ontology-specific training while achieving superior performance and generalization across diverse biomedical text types and ontologies.

Abstract: Phenotype concept recognition (CR) is a fundamental task in biomedical text mining, enabling applications such as clinical diagnostics and knowledge graph construction. However, existing methods often require ontology-specific training and struggle to generalize across diverse text types and evolving biomedical terminology. We present AutoPCR, a prompt-based phenotype CR method that does not require ontology-specific training. AutoPCR performs CR in three stages: entity extraction using a hybrid of rule-based and neural tagging strategies, candidate retrieval via SapBERT, and entity linking through prompting a large language model. Experiments on four benchmark datasets show that AutoPCR achieves the best average and most robust performance across both mention-level and document-level evaluations, surpassing prior state-of-the-art methods. Further ablation and transfer studies demonstrate its inductive capability and generalizability to new ontologies.

[89] LaTeXTrans: Structured LaTeX Translation with Multi-Agent Coordination

Ziming Zhu, Chenglong Wang, Haosong Xv, Shunjie Xing, Yifu Huo, Fengning Tian, Quan Du, Di Yang, Chunliang Zhang, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: LaTeXTrans is a multi-agent system for translating LaTeX documents while preserving mathematical equations, tables, figures, and cross-references that standard MT systems struggle with.

Details

Motivation: Standard machine translation systems fail to properly handle LaTeX documents that mix natural language with domain-specific syntax like mathematical equations, tables, and cross-references, which must be preserved for semantic integrity and compilability.

Method: Uses six specialized agents: Parser (decomposes LaTeX into translation units with placeholder substitution), Translator, Validator, Summarizer, Terminology Extractor (collaborative context-aware translation), and Generator (reconstructs translated content into well-structured LaTeX).

Result: LaTeXTrans outperforms mainstream MT systems in both translation accuracy and structural preservation for LaTeX-formatted documents.

Conclusion: The multi-agent approach effectively addresses the challenge of translating structured LaTeX documents while preserving format, structure, and terminology consistency.

Abstract: Despite the remarkable progress of modern machine translation (MT) systems on general-domain texts, translating structured LaTeX-formatted documents remains a significant challenge. These documents typically interleave natural language with domain-specific syntax, such as mathematical equations, tables, figures, and cross-references, all of which must be accurately preserved to maintain semantic integrity and compilability. In this paper, we introduce LaTeXTrans, a collaborative multi-agent system designed to address this challenge. LaTeXTrans ensures format preservation, structural fidelity, and terminology consistency through six specialized agents: 1) a Parser that decomposes LaTeX into translation-friendly units via placeholder substitution and syntax filtering; 2) a Translator, Validator, Summarizer, and Terminology Extractor that work collaboratively to ensure context-aware, self-correcting, and terminology-consistent translations; 3) a Generator that reconstructs the translated content into well-structured LaTeX documents. Experimental results show that LaTeXTrans outperforms mainstream MT systems in both translation accuracy and structural preservation. The source code, the online demonstration platform, and a demo video are publicly available.

[90] QCSE: A Pretrained Quantum Context-Sensitive Word Embedding for Natural Language Processing

Charles M. Varmantchaonala, Niclas Götting, Nils-Erik Schütte, Jean Louis E. K. Fendji, Christopher Gies

Main category: cs.CL

TL;DR: QCSE: A pretrained quantum context-sensitive embedding model that uses quantum computation to capture contextual word relationships, with five novel context matrix computation methods tested on Fulani and English corpora.

Details

Motivation: To leverage quantum computation's unique properties for natural language processing, particularly for capturing context-sensitive word embeddings and addressing data scarcity in low-resource languages like Fulani.

Method: Proposes QCSE model with quantum-native context learning and five context matrix computation methods using exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations to create context-aware quantum embeddings.

Result: QCSE successfully captures context sensitivity and leverages quantum expressibility for rich context-aware language representations, demonstrating effectiveness on both Fulani (low-resource) and English corpora.

Conclusion: Quantum computation shows promise for NLP, particularly for context-sensitive embeddings and addressing data scarcity in low-resource languages, opening new avenues for QNLP applications.

Abstract: Quantum Natural Language Processing (QNLP) offers a novel approach to encoding and understanding the complexity of natural languages through the power of quantum computation. This paper presents a pretrained quantum context-sensitive embedding model, called QCSE, that captures context-sensitive word embeddings, leveraging the unique properties of quantum systems to learn contextual relationships in languages. The model introduces quantum-native context learning, enabling the utilization of quantum computers for linguistic tasks. Central to the proposed approach are innovative context matrix computation methods, designed to create unique, representations of words based on their surrounding linguistic context. Five distinct methods are proposed and tested for computing the context matrices, incorporating techniques such as exponential decay, sinusoidal modulation, phase shifts, and hash-based transformations. These methods ensure that the quantum embeddings retain context sensitivity, thereby making them suitable for downstream language tasks where the expressibility and properties of quantum systems are valuable resources. To evaluate the effectiveness of the model and the associated context matrix methods, evaluations are conducted on both a Fulani corpus, a low-resource African language, dataset of small size and an English corpus of slightly larger size. The results demonstrate that QCSE not only captures context sensitivity but also leverages the expressibility of quantum systems for representing rich, context-aware language information. The use of Fulani further highlights the potential of QNLP to mitigate the problem of lack of data for this category of languages. This work underscores the power of quantum computation in natural language processing (NLP) and opens new avenues for applying QNLP to real-world linguistic challenges across various tasks and domains.

[91] Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors

Xin Liu, Runsong Zhao, Pengcheng Huang, Xinyu Liu, Junyi Xiao, Chunyang Xiao, Tong Xiao, Shengxiang Gao, Zhengtao Yu, Jingbo Zhu

Main category: cs.CL

TL;DR: SAC proposes a new context compression method that uses semantic anchor tokens with learnable embeddings and bidirectional attention instead of autoencoding tasks, achieving better performance on QA and summarization tasks.

Details

Motivation: Existing context compression methods rely on autoencoding tasks to train compression tokens, but these learned compression capabilities may conflict with downstream task requirements and prevent models from learning features beneficial for real-world usage.

Method: Semantic-Anchor Compression (SAC) shifts from autoencoding-based compression to an architecture with inherent compression capability. It selects anchor tokens from original context and aggregates contextual information into their KV representations using two key designs: (1) anchor embedding (learnable embedding vector attached to selected anchors) and (2) bidirectional attention modification (enables anchors to integrate information from entire context).

Result: SAC consistently outperforms existing context compression methods across different compression ratios and model sizes on question-answering and long-context summarization tasks.

Conclusion: SAC provides a more effective approach to context compression by avoiding the limitations of autoencoding tasks and enabling better feature learning for downstream applications.

Abstract: Context compression is an advanced technique that accelerates large language model (LLM) inference by converting long inputs into compact representations. Existing methods primarily rely on autoencoding tasks to train special compression tokens to represent contextual semantics. While autoencoding tasks enable compression tokens to acquire compression capabilities, we remark that such capabilities potentially conflict with actual downstream task requirements, prevent the models from learning the features more beneficial for real-world usage. Based on this observation, we propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability \textit{a priori}. Instead of training models to compress contexts through autoencoding tasks, SAC directly selects so-called anchor tokens from the original context and aggregates contextual information into their key-value (KV) representations. To ensure that anchors can effectively collect information, SAC introduces two key designs: (1) anchor embedding, a learnable embedding vector attached to the selected anchor tokens to mark compression carriers and (2) bidirectional attention modification, which enables anchor tokens to integrate information from the entire context. Experimental results show that SAC consistently outperforms existing context compression methods across different compression ratios and model sizes on question-answering and long-context summarization tasks. Our data, model and code have been released at \href{https://github.com/lx-Meteors/SAC}{https://github.com/lx-Meteors/SAC}.

[92] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu

Main category: cs.CL

TL;DR: SAGE is a user simulation framework for multi-turn agent evaluation that integrates business knowledge (customer profiles and infrastructure) to create realistic simulated users for testing conversational agents.

Details

Motivation: Current multi-turn agent evaluation relies on human assessment or generic user simulations that lack domain-specific realism. There's a need for evaluation methods that capture realistic user behavior grounded in actual business contexts.

Method: SAGE integrates top-down business logic (ideal customer profiles) and bottom-up business infrastructure (product catalogs, FAQs, knowledge bases) to simulate realistic user behavior. This creates domain-specific simulated users that reflect actual customer personas and information needs.

Result: SAGE produces more realistic and diverse interactions compared to generic approaches, and identifies up to 33% more agent errors, making it effective for bug-finding and iterative agent improvement.

Conclusion: Integrating business knowledge into user simulation creates more effective evaluation tools for conversational agents, enabling better testing and improvement of multi-turn interactive systems.

Abstract: Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative, however existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users’ information needs and expectations in a company’s target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

[93] CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning

Masato Kikuchi, Masatsugu Ono, Toshioki Soga, Tetsu Tanabe, Tadachika Ozono

Main category: cs.CL

TL;DR: WordNet annotated with CEFR language proficiency levels using LLM semantic similarity, enabling development of contextual lexical classifiers for language education.

Details

Motivation: WordNet's fine-grained sense distinctions are challenging for second-language learners, so integrating language-proficiency levels (CEFR) with WordNet's semantic networks would bridge NLP and language education.

Method: Used large language model to measure semantic similarity between WordNet sense definitions and English Vocabulary Profile Online entries, then constructed large-scale corpus with sense and CEFR-level information to develop contextual lexical classifiers.

Result: Models fine-tuned on the corpus perform comparably to gold-standard annotations, and combining both achieved Macro-F1 score of 0.81, showing transferred labels are consistent with gold-standard levels.

Conclusion: The annotated WordNet, corpus, and classifiers are publicly available to help bridge NLP and language education, facilitating more effective language learning.

Abstract: Although WordNet is a valuable resource because of its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this issue, we developed a version of WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using a large language model to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our approach, we constructed a large-scale corpus containing both sense and CEFR-level information from the annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on this corpus perform comparably to those fine-tuned on gold-standard annotations. Furthermore, by combining this corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81. This result provides indirect evidence that the transferred labels are largely consistent with the gold-standard levels. The annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.

[94] Assessing the Political Fairness of Multilingual LLMs: A Case Study based on a 21-way Multiparallel EuroParl Dataset

Paul Lerner, François Yvon

Main category: cs.CL

TL;DR: LLMs show political biases in multilingual translation quality, with majority parties’ speeches translated better than outsider parties’ in European Parliament proceedings.

Details

Motivation: To move beyond English survey-based assessments of LLM political biases by examining fairness in multilingual translation, specifically analyzing translation quality disparities across political parties in parliamentary speeches.

Method: Created a new 21-way multiparallel version of EuroParl dataset with political affiliations, then systematically compared translation quality of European Parliament speeches across different political parties using LLMs.

Result: Found systematic differences where majority parties from left and right are better translated than outsider parties, revealing political biases in LLM translation performance.

Conclusion: LLMs exhibit political biases in multilingual translation that disadvantage outsider parties, highlighting the need for fairness considerations beyond traditional survey-based bias assessments.

Abstract: The political biases of Large Language Models (LLMs) are usually assessed by simulating their answers to English surveys. In this work, we propose an alternative framing of political biases, relying on principles of fairness in multilingual translation. We systematically compare the translation quality of speeches in the European Parliament (EP), observing systematic differences with majority parties from left and right being better translated than outsider parties. This study is made possible by a new, 21-way multiparallel version of EuroParl, the parliamentary proceedings of the EP, which includes the political affiliations of each speaker. The dataset consists of 1.5M sentences for a total of 40M words and 249M characters. It covers three years, 1000+ speakers, 7 countries, 12 EU parties, 25 EU committees, and hundreds of national parties.

[95] KV Cache Transform Coding for Compact Storage in LLM Inference

Konrad Staniszewski, Adrian Łańcucki

Main category: cs.CL

TL;DR: KVTC: A lightweight transform coder that compresses KV caches for efficient LLM serving, achieving up to 20× compression while maintaining accuracy.

Details

Motivation: KV caches consume significant GPU memory during LLM serving, especially when reused across conversation turns. Existing solutions like token eviction, quantization, and SVD methods have limitations in compression ratios or accuracy preservation.

Method: KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding inspired by classical media compression. It requires only brief initial calibration and leaves model parameters unchanged while exploiting redundancies in KV caches.

Result: Achieves up to 20× compression while maintaining reasoning and long-context accuracy, and 40×+ for specific use cases. Outperforms baselines like token eviction, quantization, and SVD methods across multiple benchmarks with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models.

Conclusion: KVTC provides a practical building block for memory-efficient LLM serving with reusable KV caches, enabling higher compression ratios than existing inference-time methods while preserving accuracy.

Abstract: Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20$\times$ compression while maintaining reasoning and long-context accuracy, and 40$\times$ or higher for specific use cases. We test KVTC with Llama 3, Mistral NeMo, and R1-Qwen 2.5 models across benchmarks including AIME25, GSM8K, LiveCodeBench, LongBench, MATH-500, MMLU, Qasper and RULER. It consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, while achieving higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.

[96] Get away with less: Need of source side data curation to build parallel corpus for low resource Machine Translation

Saumitra Yadav, Manish Shrivastava

Main category: cs.CL

TL;DR: LALITA framework uses lexical and linguistic features to select complex source sentences for parallel corpus curation, significantly improving machine translation quality while reducing data requirements by over 50% across multiple languages.

Details

Motivation: Data curation is crucial for machine translation but under-researched, especially for low-resource languages where human translation is expensive. Current approaches rely on human translations, parallel sources, or limited synthetic generation, creating a need for efficient source sentence selection strategies to optimize MT system performance with minimal data.

Method: Developed LALITA (Lexical And Linguistically Informed Text Analysis) framework that uses lexical and linguistic features to select source sentences for parallel corpus curation. Focuses on selecting complex sentences from both existing and synthetic datasets. Tested by simulating low-resource scenarios with curated datasets ranging from 50K to 800K English sentences.

Result: Training on complex sentences selected by LALITA significantly improves translation quality. The framework reduces data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German) while maintaining or improving performance. Demonstrates efficiency across various data sizes from 50K to 800K sentences.

Conclusion: LALITA provides an effective data curation framework for low-resource machine translation that reduces training costs by minimizing data requirements while improving translation quality. The approach shows utility in data augmentation and has broad applicability across multiple languages.

Abstract: Data curation is a critical yet under-researched step in the machine translation training paradigm. To train translation systems, data acquisition relies primarily on human translations and digital parallel sources or, to a limited degree, synthetic generation. But, for low-resource languages, human translation to generate sufficient data is prohibitively expensive. Therefore, it is crucial to develop a framework that screens source sentences to form efficient parallel text, ensuring optimal MT system performance in low-resource environments. We approach this by evaluating English-Hindi bi-text to determine effective sentence selection strategies for optimal MT system training. Our extensively tested framework, (Lexical And Linguistically Informed Text Analysis) LALITA, targets source sentence selection using lexical and linguistic features to curate parallel corpora. We find that by training mostly on complex sentences from both existing and synthetic datasets, our method significantly improves translation quality. We test this by simulating low-resource data availabilty with curated datasets of 50K to 800K English sentences and report improved performances on all data sizes. LALITA demonstrates remarkable efficiency, reducing data needs by more than half across multiple languages (Hindi, Odia, Nepali, Norwegian Nynorsk, and German). This approach not only reduces MT systems training cost by reducing training data requirement, but also showcases LALITA’s utility in data augmentation.

[97] Evaluating Long-Horizon Memory for Multi-Party Collaborative Dialogues

Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Yi Bai, Dannong Xu, Tianwei Lin, Xiaohong Li, Yunyun Han, Jian Pei, Yafeng Deng

Main category: cs.CL

TL;DR: EverMemBench: A benchmark for evaluating long-term collaborative memory in LLMs using multi-party, multi-group conversations with dense cross-topic interleaving and role-conditioned personas.

Details

Motivation: Existing benchmarks focus on dyadic or single-topic dialogues, lacking evaluation of memory under real-world collaborative interaction patterns where information is produced by multiple participants across groups/channels, revised over time, and grounded in social context.

Method: Built benchmark from multi-party, multi-group conversations spanning over 1M tokens with dense cross-topic interleaving, temporally evolving decisions, and role-conditioned personas. Evaluates memory systems using 2400 QA pairs across three dimensions: fine-grained recall, memory awareness, and user profile understanding.

Result: Reveals fundamental limitations: multi-hop reasoning collapses under multi-party attribution (26% accuracy even with oracle evidence), temporal reasoning fails without explicit version semantics, and memory awareness is bottlenecked by retrieval as similarity-based methods miss implicitly relevant information.

Conclusion: EverMemBench represents a concrete step toward realistic evaluation of LLM memory and a cornerstone benchmark for developing next-generation LLMs that reason over time, roles, and collaborative interaction structure.

Abstract: Long-term conversational memory in practical LLM applications is inherently collaborative: information is produced by multiple participants, scattered across groups and channels, revised over time, and implicitly grounded in roles and social context. Yet there is currently no established benchmark that evaluates memory under interaction patterns resembling real-world deployment, as existing benchmarks largely focus on dyadic or single-topic dialogues. In this paper, we introduce EverMemBench, the first benchmark designed for long-horizon collaborative memory, built from multi-party, multi-group conversations spanning over one million tokens with dense cross-topic interleaving, temporally evolving decisions, and role-conditioned personas. EverMemBench evaluates memory systems using 2400 QA pairs across three dimensions essential for real applications: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals fundamental limitations of current systems: multi-hop reasoning collapses under multi-party attribution even with oracle evidence (26% accuracy), temporal reasoning fails without explicit version semantics beyond timestamps, and memory awareness is bottlenecked by retrieval, as similarity-based methods miss implicitly relevant information. EverMemBench thus represents a concrete step toward realistic evaluation of LLM memory and a cornerstone benchmark for developing next-generation LLMs that reason over time, roles, and collaborative interaction structure. Our benchmark and code are publicly available at https://github.com/EverMind-AI/EverMemBench.

[98] PsihoRo: Depression and Anxiety Romanian Text Corpus

Alexandra Ciobotaru, Ana-Maria Bucur, Liviu P. Dinu

Main category: cs.CL

TL;DR: Created PsihoRo, the first open-source Romanian corpus for depression and anxiety analysis using open-ended questions and standardized screening questionnaires (PHQ-9, GAD-7) from 205 respondents.

Details

Motivation: Addressing the lack of Romanian mental health corpora in NLP, as existing psychological resources are primarily in English, making it difficult to study mental health in Romanian populations.

Method: Collected data through forms with 6 open-ended questions paired with PHQ-9 and GAD-7 screening questionnaires, then analyzed using statistical analysis, Romanian LIWC, emotion detection, and topic modeling.

Result: Created PsihoRo corpus with 205 respondents, demonstrating its utility through various NLP analyses to identify important features of Romanian mental health discourse.

Conclusion: PsihoRo represents a foundational resource for Romanian mental health NLP research, enabling better understanding and analysis of depression and anxiety in Romanian populations.

Abstract: Psychological corpora in NLP are collections of texts used to analyze human psychology, emotions, and mental health. These texts allow researchers to study psychological constructs, detect mental health issues and analyze emotional language. However, mental health data can be difficult to collect correctly from social media, due to suppositions made by the collectors. A more pragmatic strategy involves gathering data through open-ended questions and then assessing this information with self-report screening surveys. This method was employed successfully for English, a language with a lot of psychological NLP resources. However, this cannot be stated for Romanian, which currently has no open-source mental health corpus. To address this gap, we have created the first corpus for depression and anxiety in Romanian, by utilizing a form with 6 open-ended questions along with the standardized PHQ-9 and GAD-7 screening questionnaires. Consisting of the texts of 205 respondents and although it may seem small, PsihoRo is a first step towards understanding and analyzing texts regarding the mental health of the Romanian population. We employ statistical analysis, text analysis using Romanian LIWC, emotion detection and topic modeling to show what are the most important features of this newly introduced resource to the NLP community.

[99] How Large Language Models Get Stuck: Early structure with persistent errors

Alokesh Manna, William Snyder, Whitney Tabor

Main category: cs.CL

TL;DR: OPT model trained on BabyLM dataset shows persistent grammatical preference failures on BLiMP benchmark, with early erroneous entrenchment that persists through training, suggesting bigram statistics may cause irreversible biases.

Details

Motivation: To investigate how linguistic insights can improve LLM training efficiency by examining when and why models develop persistent grammatical preference errors during training.

Method: Trained Meta’s OPT model on 100M word BabyLM dataset, evaluated on BLiMP benchmark (67 classes of grammatical vs ungrammatical sentence pairs), tracked preference patterns across training iterations, and analyzed using qualitative (linguistic theory, deep learning theory) and quantitative assessments.

Result: OPT fails to consistently prefer grammatical sentences in nearly one-third of BLiMP classes, often establishing erroneous likelihood separation early in training that persists throughout training phase, suggesting entrenched biases that are costly to reverse.

Conclusion: Proposes Bigram Hypothesis: erroneous entrenchment occurs when bigram statistics bias models toward wrong distinctions early in training, and suggests testing this hypothesis on appropriately selected BLiMP classes to understand training inefficiencies.

Abstract: Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta’s OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model’s preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order for the model to perform well. We probe this phenomenon using a mixture of qualitative (based on linguistic theory and the theory of Deep Learning) and quantitative (based on numerical testing) assessments. Our qualitative assessments indicate that only some BLiMP tests are meaningful guides. We conclude by articulating a hypothesis, the Bigram Hypothesis, which claims that the learning process will exhibit erroneous entrenchment if bigram statistics bias the model toward wrong distinctions early in training, and we describe a method of testing the hypothesis on appropriately selected BLiMP classes.

[100] AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth

Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.01914: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01914&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[101] Cross-Family Speculative Prefill: Training-Free Long-Context Compression with Small Draft Models

Shubhangi Upasani, Ravi Shanker Raju, Bo Li, Mengmeng Ji, John Long, Chen Wu, Urmish Thakker, Guangtao Wang

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.02631 suggests it’s from March 2026, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to paper content. The arXiv API rate limiting prevents retrieval of abstract and details.

Method: Cannot determine method without access to paper content. The paper ID format suggests it’s a computer science/ML paper from March 2026.

Result: Cannot determine results without access to paper content. The HTTP 429 error indicates too many requests to arXiv API.

Conclusion: Cannot draw conclusions about paper content due to access limitations. Need to wait for rate limits to reset or use alternative access methods.

Abstract: Failed to fetch summary for 2603.02631: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02631&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[102] ConFu: Contemplate the Future for Better Speculative Sampling

Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

Main category: cs.CL

TL;DR: ConFu introduces a speculative decoding framework that enables draft models to anticipate future generation directions using contemplate tokens and soft prompts, improving token acceptance rates by 8-11% over previous methods.

Details

Motivation: Existing speculative decoding methods suffer from error accumulation because draft models only condition on the current prefix, causing predictions to drift from the target model over time. The authors aim to improve draft model quality by enabling future anticipation.

Method: ConFu introduces: 1) contemplate tokens and soft prompts that allow draft models to leverage future-oriented signals from the target model, 2) a dynamic contemplate token mechanism with MoE for context-aware future prediction, and 3) a training framework with anchor token sampling and future prediction replication.

Result: ConFu improves token acceptance rates and generation speed by 8-11% over EAGLE-3 across various downstream tasks with Llama-3 3B and 8B models.

Conclusion: The work bridges speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference by enabling draft models to anticipate future generation directions.

Abstract: Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose \textbf{ConFu} (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at negligible cost, (ii) a dynamic contemplate token mechanism with MoE to enable context-aware future prediction, and (iii) a training framework with anchor token sampling and future prediction replication that learns robust future prediction. Experiments demonstrate that ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8–11% across various downstream tasks with Llama-3 3B and 8B models. We believe our work is the first to bridge speculative decoding with continuous reasoning tokens, offering a new direction for accelerating LLM inference.

[103] LaTeX Compilation: Challenges in the Era of LLMs

Tianyou Liu, Ziqiang Li, Xurui Liu, Yu Wu, Yansong Li

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed data retrieval

Method: Cannot determine method due to failed data retrieval

Result: Cannot determine results due to failed data retrieval

Conclusion: Cannot draw conclusions due to failed data retrieval

Abstract: Failed to fetch summary for 2603.02873: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02873&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[104] Adaptive Loops and Memory in Transformers: Think Harder or Know More?

Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Joachim Koehler, Mehdi Ali

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.08391: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08391&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[105] MultiGraSCCo: A Multilingual Anonymization Benchmark with Annotations of Personal Identifiers

Ibrahim Baroud, Christoph Otto, Vera Czehmann, Christine Hovhannisyan, Lisa Raithel, Sebastian Möller, Roland Roller

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) for arXiv ID 2603.08879

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2603.08879: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08879&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[106] Tracking Cancer Through Text: Longitudinal Extraction From Radiology Reports Using Open-Source Large Language Models

Luc Builtjes, Alessa Hering

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2603.09638: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09638&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[107] Fusing Semantic, Lexical, and Domain Perspectives for Recipe Similarity Estimation

Denica Kjorvezir, Danilo Najkov, Eva Valencič, Erika Jesenko, Barbara Koroišić Seljak, Tome Eftimov, Riste Stojanov

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2603.09688: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09688&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[108] Evaluation of LLMs in retrieving food and nutritional context for RAG systems

Maks Požarnik Vavken, Matevž Ogrinc, Tome Eftimov, Barbara Koroušić Seljak

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.09704 suggests it’s from March 2026, which is unusual as current date is 2025.

Details

Motivation: Cannot determine motivation due to inability to access paper content. The HTTP 429 error indicates rate limiting from arXiv API.

Method: Cannot determine method due to inability to access paper content. The paper ID format suggests it might be from future year 2026.

Result: Cannot determine results due to inability to access paper content. The arXiv API returned HTTP 429 Too Many Requests error.

Conclusion: Cannot draw conclusions about paper content. The technical issue prevents analysis of paper 2603.09704.

Abstract: Failed to fetch summary for 2603.09704: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09704&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[109] Explainability of Text Processing and Retrieval Methods: A Survey

Sourav Saha, Debapriyo Majumdar, Mandar Mitra

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2212.07126: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2212.07126&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Divyam Madaan, Varshan Muhunthan, Kyunghyun Cho, Sumit Chopra

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2509.23499: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23499&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[111] Large Language Models for Travel Behavior Prediction

Baichuan Mo, Hanyong Xu, Ruoyun Ma, Jung-Hoon Cho, Dingyi Zhuang, Xiaotong Guo, Jinhua Zhao

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to draw conclusions due to missing paper content

Abstract: Failed to fetch summary for 2312.00819: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.00819&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[112] AgentA/B: Automated and Scalable Web A/BTesting with Interactive LLM Agents

Yuxuan Lu, Ting-Yao Hsu, Hansu Gu, Limeng Cui, Yaochen Xie, William Headden, Bingsheng Yao, Akash Veeragouni, Jiapeng Liu, Sreyashi Nag, Jessie Wang, Dakuo Wang

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to analyze paper due to technical retrieval error

Abstract: Failed to fetch summary for 2504.09723: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09723&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[113] REI-Bench: Can Embodied Agents Understand Vague Human Instructions in Task Planning?

Chenxi Jiang, Chuhao Zhou, Jianfei Yang

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to paper fetch failure

Method: Unable to determine method due to paper fetch failure

Result: Unable to determine results due to paper fetch failure

Conclusion: Unable to determine conclusion due to paper fetch failure

Abstract: Failed to fetch summary for 2505.10872: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.10872&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[114] No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models

Omer Sela

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.03203: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03203&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[115] Shadow in the Cache: Unveiling and Mitigating Privacy Risks of KV-cache in LLM Inference

Zhifan Luo, Shuo Shao, Su Zhang, Lijing Zhou, Yuke Hu, Chenxu Zhao, Zhihao Liu, Zhan Qin

Main category: cs.CL

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2508.09442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.09442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[116] Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

Anxin Guo, Jingwei Li

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.00906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[117] PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration

Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.08935: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08935&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.CV

[118] 4DEquine: Disentangling Motion and Appearance for 4D Equine Reconstruction from Monocular Video

Jin Lyu, Liang An, Pujin Cheng, Yebin Liu, Xiaoying Tang

Main category: cs.CV

TL;DR: 4DEquine: A framework for 4D reconstruction of equine animals from monocular video by disentangling motion and appearance reconstruction, using synthetic datasets for training.

Details

Motivation: Traditional 4D animal reconstruction methods require joint optimization of motion and appearance over entire videos, which is time-consuming and sensitive to incomplete observations. There's a need for more efficient and robust methods for animal welfare applications.

Method: Disentangles 4D reconstruction into two sub-problems: 1) Dynamic motion reconstruction using spatio-temporal transformer with post-optimization to regress smooth pose and shape sequences, and 2) Static appearance reconstruction using feed-forward network that creates animatable 3D Gaussian avatars from single images. Uses synthetic datasets VarenPoser (motion) and VarenTex (appearance) for training.

Result: Achieves state-of-the-art performance on real-world APT36K and AiM datasets despite training only on synthetic data. Demonstrates superiority in both geometry and appearance reconstruction.

Conclusion: 4DEquine provides an effective framework for 4D animal reconstruction by separating motion and appearance tasks, with synthetic training data enabling robust performance on real-world datasets.

Abstract: 4D reconstruction of equine family (e.g. horses) from monocular video is important for animal welfare. Previous mainstream 4D animal reconstruction methods require joint optimization of motion and appearance over a whole video, which is time-consuming and sensitive to incomplete observation. In this work, we propose a novel framework called 4DEquine by disentangling the 4D reconstruction problem into two sub-problems: dynamic motion reconstruction and static appearance reconstruction. For motion, we introduce a simple yet effective spatio-temporal transformer with a post-optimization stage to regress smooth and pixel-aligned pose and shape sequences from video. For appearance, we design a novel feed-forward network that reconstructs a high-fidelity, animatable 3D Gaussian avatar from as few as a single image. To assist training, we create a large-scale synthetic motion dataset, VarenPoser, which features high-quality surface motions and diverse camera trajectories, as well as a synthetic appearance dataset, VarenTex, comprising realistic multi-view images generated through multi-view diffusion. While training only on synthetic datasets, 4DEquine achieves state-of-the-art performance on real-world APT36K and AiM datasets, demonstrating the superiority of 4DEquine and our new datasets for both geometry and appearance reconstruction. Comprehensive ablation studies validate the effectiveness of both the motion and appearance reconstruction network. Project page: https://luoxue-star.github.io/4DEquine_Project_Page/.

[119] COMIC: Agentic Sketch Comedy Generation

Susung Hong, Brian Curless, Ira Kemelmacher-Shlizerman, Steve Seitz

Main category: cs.CV

TL;DR: AI system generates comedy sketch videos using agent-based framework with LLM critics trained on YouTube comedy data for automated humor evaluation

Details

Motivation: To create an automated system that can generate high-quality comedic videos similar to professional sketch shows like Saturday Night Live, addressing the challenge of automated humor generation and evaluation

Method: Uses population of agents based on real production studio roles, employs iterative competition, evaluation, and improvement cycles, with LLM critics trained on YouTube comedy video corpus to evaluate humor automatically

Result: System produces results approaching professionally produced sketch quality and demonstrates state-of-the-art performance in video generation

Conclusion: The framework successfully automates comedy video production with quality approaching professional standards through agent-based collaboration and data-driven humor evaluation

Abstract: We propose a fully automated AI system that produces short comedic videos similar to sketch shows such as Saturday Night Live. Starting with character references, the system employs a population of agents loosely based on real production studio roles, structured to optimize the quality and diversity of ideas and outputs through iterative competition, evaluation, and improvement. A key contribution is the introduction of LLM critics aligned with real viewer preferences through the analysis of a corpus of comedy videos on YouTube to automatically evaluate humor. Our experiments show that our framework produces results approaching the quality of professionally produced sketches while demonstrating state-of-the-art performance in video generation.

[120] HG-Lane: High-Fidelity Generation of Lane Scenes under Adverse Weather and Lighting Conditions without Re-annotation

Daichao Zhao, Qiupu Chen, Feng He, Xin Ning, Qiankun Li

Main category: cs.CV

TL;DR: HG-Lane: A high-fidelity generation framework for creating lane detection datasets under adverse weather/lighting conditions without re-annotation, improving model robustness.

Details

Motivation: Existing lane detection datasets lack sufficient data for extreme weather conditions (rain, snow, fog), causing models to become unreliable in adverse environments and potentially leading to safety-critical failures in autonomous driving.

Method: Proposes HG-Lane, a high-fidelity generation framework that synthesizes lane scenes under adverse weather and lighting conditions without requiring manual re-annotation. Creates a benchmark dataset of 30,000 images covering various adverse scenarios.

Result: Significantly improves performance of existing lane detection networks. With CLRNet, overall mF1 score increases by 20.87%. F1@50 scores improve across all categories: overall (19.75%), normal (8.63%), snow (38.8%), rain (14.96%), fog (26.84%), night (21.5%), dusk (12.04%).

Conclusion: HG-Lane effectively addresses the data scarcity problem for lane detection in adverse conditions, providing a practical solution to improve model robustness without costly manual annotation.

Abstract: Lane detection is a crucial task in autonomous driving, as it helps ensure the safe operation of vehicles. However, existing datasets such as CULane and TuSimple contain relatively limited data under extreme weather conditions, including rain, snow, and fog. As a result, detection models trained on these datasets often become unreliable in such environments, which may lead to serious safety-critical failures on the road. To address this issue, we propose HG-Lane, a High-fidelity Generation framework for Lane Scenes under adverse weather and lighting conditions without requiring re-annotation. Based on this framework, we further construct a benchmark that includes adverse weather and lighting scenarios, containing 30,000 images. Experimental results demonstrate that our method consistently and significantly improves the performance of existing lane detection networks. For example, using the state-of-the-art CLRNet, the overall mF1 score on our benchmark increases by 20.87 percent. The F1@50 score for the overall, normal, snow, rain, fog, night, and dusk categories increases by 19.75 percent, 8.63 percent, 38.8 percent, 14.96 percent, 26.84 percent, 21.5 percent, and 12.04 percent, respectively. The code and dataset are available at: https://github.com/zdc233/HG-Lane.

[121] Unbalanced Optimal Transport Dictionary Learning for Unsupervised Hyperspectral Image Clustering

Joshua Lentz, Nicholas Karris, Alex Cloninger, James M. Murphy

Main category: cs.CV

TL;DR: Unsupervised hyperspectral image clustering using unbalanced Wasserstein barycenters for dimensionality reduction followed by spectral clustering.

Details

Motivation: Hyperspectral images have high-dimensional spectral data that's difficult to label manually. Unsupervised clustering enables automated segmentation, but existing Wasserstein dictionary learning methods require balancing spectral profiles, which blurs classes and reduces robustness to noise/outliers.

Method: Proposes using unbalanced Wasserstein barycenters to learn lower-dimensional representations of hyperspectral data, then applies spectral clustering on the learned representations for unsupervised labeling.

Result: The approach provides effective unsupervised learning of labels for hyperspectral image segmentation, addressing limitations of balanced Wasserstein methods.

Conclusion: Unbalanced Wasserstein barycenters combined with spectral clustering offer improved unsupervised clustering for hyperspectral images by better handling noise and outliers while preserving class distinctions.

Abstract: Hyperspectral images capture vast amounts of high-dimensional spectral information about a scene, making labeling an intensive task that is resistant to out-of-the-box statistical methods. Unsupervised learning of clusters allows for automated segmentation of the scene, enabling a more rapid understanding of the image. Partitioning the spectral information contained within the data via dictionary learning in Wasserstein space has proven an effective method for unsupervised clustering. However, this approach requires balancing the spectral profiles of the data, blurring the classes, and sacrificing robustness to outliers and noise. In this paper, we suggest improving this approach by utilizing unbalanced Wasserstein barycenters to learn a lower-dimensional representation of the underlying data. The deployment of spectral clustering on the learned representation results in an effective approach for the unsupervised learning of labels.

[122] P-GSVC: Layered Progressive 2D Gaussian Splatting for Scalable Image and Video

Longan Wang, Yuang Shi, Wei Tsang Ooi

Main category: cs.CV

TL;DR: P-GSVC: A layered progressive 2D Gaussian splatting framework for scalable image and video reconstruction using joint training across base and enhancement layers.

Details

Motivation: Gaussian splatting has shown promise for image/video reconstruction but lacks scalability. The authors aim to create a unified framework that supports progressive quality and resolution scaling for both images and videos.

Method: Organizes 2D Gaussian splats into base layer + successive enhancement layers for coarse-to-fine reconstruction. Uses joint training strategy that simultaneously updates Gaussians across layers to ensure inter-layer compatibility and stable progressive reconstruction.

Result: Joint training gains up to 1.9 dB PSNR improvement for video and 2.6 dB PSNR improvement for image compared to sequential layer-wise training methods.

Conclusion: P-GSVC provides the first layered progressive Gaussian splatting framework that enables scalable representation for both images and videos with significant quality improvements through joint training.

Abstract: Gaussian splatting has emerged as a competitive explicit representation for image and video reconstruction. In this work, we present P-GSVC, the first layered progressive 2D Gaussian splatting framework that provides a unified solution for scalable Gaussian representation in both images and videos. P-GSVC organizes 2D Gaussian splats into a base layer and successive enhancement layers, enabling coarse-to-fine reconstructions. To effectively optimize this layered representation, we propose a joint training strategy that simultaneously updates Gaussians across layers, aligning their optimization trajectories to ensure inter-layer compatibility and a stable progressive reconstruction. P-GSVC supports scalability in terms of both quality and resolution. Our experiments show that the joint training strategy can gain up to 1.9 dB improvement in PSNR for video and 2.6 dB improvement in PSNR for image when compared to methods that perform sequential layer-wise training. Project page: https://longanwang-cs.github.io/PGSVC-webpage/

[123] Video-Based Reward Modeling for Computer-Use Agents

Linxin Song, Jieyu Zhang, Huanxin Sheng, Taiwei Shi, Gupta Rahul, Yang Liu, Ranjay Krishna, Jian Kang, Jieyu Zhao

Main category: cs.CV

TL;DR: ExeVRM: A video-based reward model that evaluates computer-using agents by analyzing execution videos to predict task success, outperforming proprietary models across multiple operating systems.

Details

Motivation: Current evaluation of computer-using agents (CUAs) is difficult to scale, and existing methods often require access to agents' internal reasoning or actions. The authors propose using execution videos (keyframe sequences) as a model-agnostic evaluation approach that can scale effectively.

Method: 1) Created ExeVR-53k dataset with 53k video-task-reward triplets; 2) Used adversarial instruction translation to synthesize negative samples with step-level annotations; 3) Developed spatiotemporal token pruning to handle long, high-resolution videos by removing homogeneous regions while preserving decisive UI changes; 4) Fine-tuned an 8B parameter Execution Video Reward Model (ExeVRM) that takes user instructions and video sequences to predict task success.

Result: ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming proprietary models like GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android. It also provides more precise temporal attribution of task success.

Conclusion: Video-execution reward modeling serves as a scalable, model-agnostic evaluator for computer-using agents, demonstrating strong performance across multiple platforms and providing better temporal understanding than existing proprietary models.

Abstract: Computer-using agents (CUAs) are becoming increasingly capable; however, it remains difficult to scale evaluation of whether a trajectory truly fulfills a user instruction. In this work, we study reward modeling from execution video: a sequence of keyframes from an agent trajectory that is independent of the agent’s internal reasoning or actions. Although video-execution modeling is method-agnostic, it presents key challenges, including highly redundant layouts and subtle, localized cues that determine success. We introduce Execution Video Reward 53k (ExeVR-53k), a dataset of 53k high-quality video–task–reward triplets. We further propose adversarial instruction translation to synthesize negative samples with step-level annotations. To enable learning from long, high-resolution execution videos, we design spatiotemporal token pruning, which removes homogeneous regions and persistent tokens while preserving decisive UI changes. Building on these components, we fine-tune an Execution Video Reward Model (ExeVRM) that takes only a user instruction and a video-execution sequence to predict task success. Our ExeVRM 8B achieves 84.7% accuracy and 87.7% recall on video-execution assessment, outperforming strong proprietary models such as GPT-5.2 and Gemini-3 Pro across Ubuntu, macOS, Windows, and Android, while providing more precise temporal attribution. These results show that video-execution reward modeling can serve as a scalable, model-agnostic evaluator for CUAs.

[124] Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation

Zitong Wang, Zijun Shen, Haohao Xu, Zhengjie Luo, Weibin Wu

Main category: cs.CV

TL;DR: Delta-K is a training-free inference framework that addresses concept omission in text-to-image diffusion models by injecting semantic signatures of missing concepts into the cross-attention key space during early diffusion stages.

Details

Motivation: Diffusion models often fail to synthesize all concepts mentioned in complex multi-instance text prompts, suffering from concept omission. Existing training-free methods that rescale attention maps only increase unstructured noise without establishing coherent semantic representations.

Method: Delta-K operates in the shared cross-attention Key space using a vision-language model to extract a differential key (ΔK) encoding semantic signatures of missing concepts. This signal is injected during early semantic planning stages of diffusion, governed by a dynamically optimized scheduling mechanism that grounds diffuse noise into stable structural anchors while preserving existing concepts.

Result: Extensive experiments show Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.

Conclusion: Delta-K provides a backbone-agnostic, plug-and-play inference framework that effectively addresses concept omission in text-to-image diffusion models by operating directly in the cross-attention key space, offering a practical solution for complex multi-instance scene synthesis.

Abstract: While Diffusion Models excel in text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely exacerbates unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, with Vision-language model, we extract a differential key $ΔK$ that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.

[125] V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Yan-Bo Lin, Jonah Casebeer, Long Mai, Aniruddha Mahapatra, Gedas Bertasius, Nicholas J. Bryan

Main category: cs.CV

TL;DR: V2M-Zero: A zero-pair video-to-music generation approach that creates temporally aligned music for videos without cross-modal training or paired data, using intra-modal event curves for synchronization.

Details

Motivation: Existing text-to-music models lack fine-grained temporal control for aligning music with video events. The challenge is that while musical and visual events differ semantically, they share temporal structure that can be captured independently within each modality.

Method: Compute event curves from intra-modal similarity using pretrained music and video encoders to capture temporal structure independently within each modality. Fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data.

Result: On OES-Pub, MovieGenBench-Music, and AIST++ datasets, V2M-Zero achieves: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos compared to paired-data baselines.

Conclusion: Temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. The approach validates that matching when and how much change occurs (not what changes) enables synchronization across modalities.

Abstract: Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that outputs time-aligned music for video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. By measuring temporal change within each modality independently, these curves provide comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. We find similar results via a large crowd-source subjective listening test. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/

[126] FusionNet: a frame interpolation network for 4D heart models

Chujie Chang, Shoko Miyauchi, Ken’ichi Morooka, Ryo Kurazume, Oscar Martinez Mozos

Main category: cs.CV

TL;DR: FusionNet: A neural network that reconstructs high temporal resolution 4D cardiac motion from short-duration CMR scans by estimating intermediate 3D heart shapes between adjacent frames.

Details

Motivation: Standard CMR imaging requires long scan times (40-60 min) causing patient discomfort. Shorter scans reduce temporal resolution, compromising diagnostic accuracy for cardiac motion analysis.

Method: Proposes FusionNet neural network that estimates intermediate 3D heart shapes based on adjacent shapes to reconstruct 4D cardiac motion with high temporal resolution from limited temporal sampling.

Result: Achieved Dice coefficient over 0.897, outperforming existing methods in recovering precise cardiac shapes from temporally sparse CMR data.

Conclusion: FusionNet enables high-quality 4D cardiac motion reconstruction from shorter CMR scans, potentially reducing patient discomfort while maintaining diagnostic accuracy.

Abstract: Cardiac magnetic resonance (CMR) imaging is widely used to visualise cardiac motion and diagnose heart disease. However, standard CMR imaging requires patients to lie still in a confined space inside a loud machine for 40-60 min, which increases patient discomfort. In addition, shorter scan times decrease either or both the temporal and spatial resolutions of cardiac motion, and thus, the diagnostic accuracy of the procedure. Of these, we focus on reduced temporal resolution and propose a neural network called FusionNet to obtain four-dimensional (4D) cardiac motion with high temporal resolution from CMR images captured in a short period of time. The model estimates intermediate 3D heart shapes based on adjacent shapes. The results of an experimental evaluation of the proposed FusionNet model showed that it achieved a performance of over 0.897 in terms of the Dice coefficient, confirming that it can recover shapes more precisely than existing methods. This code is available at: https://github.com/smiyauchi199/FusionNet.git

[127] An Automated Radiomics Framework for Postoperative Survival Prediction in Colorectal Liver Metastases using Preoperative MRI

Muhammad Alberb, Jianan Chen, Hossam El-rewaidy, Paul Karanicolas, Arun Seth, Yutaka Amemiya, Anne Martel, Helen Cheung

Main category: cs.CV

TL;DR: AI framework for predicting postoperative survival in colorectal liver metastasis patients using MRI segmentation and radiomics analysis

Details

Motivation: Colorectal liver metastasis outcomes are heterogeneous after surgery, and accurate survival prediction is needed to avoid non-beneficial surgeries and guide personalized therapy

Method: Two-stage framework: 1) Anatomy-aware segmentation pipeline using SAMONAI (prompt propagation algorithm extending Segment Anything Model to 3D) for liver, CRLMs, and spleen segmentation from partially-annotated data; 2) Radiomics pipeline extracting per-tumor features and predicting survival using SurvAMINN, an autoencoder-based multiple instance neural network for time-to-event prediction

Result: Segmentation achieved Dice scores of 0.96 (liver), 0.93 (spleen), 0.78 (CRLMs) with detection F1-score of 0.79; Survival prediction achieved C-index of 0.69, outperforming established methods and biomarkers

Conclusion: Integration of segmentation algorithms with radiomics-based survival analysis enables accurate and automated CRLM outcome prediction from MRI data

Abstract: While colorectal liver metastasis (CRLM) is potentially curable via hepatectomy, patient outcomes remain highly heterogeneous. Postoperative survival prediction is necessary to avoid non-beneficial surgeries and guide personalized therapy. In this study, we present an automated AI-based framework for postoperative CRLM survival prediction using pre- and post-contrast MRI. We performed a retrospective study of 227 CRLM patients who had gadoxetate-enhanced MRI prior to curative-intent hepatectomy between 2013 and 2020. We developed a survival prediction framework comprising an anatomy-aware segmentation pipeline followed by a radiomics pipeline. The segmentation pipeline learns liver, CRLMs, and spleen segmentation from partially-annotated data, leveraging promptable foundation models to generate pseudo-labels. To support this pipeline, we propose SAMONAI, a prompt propagation algorithm that extends Segment Anything Model to 3D point-based segmentation. Predicted pre- and post-contrast segmentations are then fed into our radiomics pipeline, which extracts per-tumor features and predicts survival using SurvAMINN, an autoencoder-based multiple instance neural network for time-to-event survival prediction. SurvAMINN jointly learns dimensionality reduction and survival prediction from right-censored data, emphasizing high-risk metastases. We compared our framework against established methods and biomarkers using univariate and multivariate Cox regression. Our segmentation pipeline achieves median Dice scores of 0.96 (liver) and 0.93 (spleen), driving a CRLM segmentation Dice score of 0.78 and a detection F1-score of 0.79. Accurate segmentation enables our radiomics pipeline to achieve a survival prediction C-index of 0.69. Our results show the potential of integrating segmentation algorithms with radiomics-based survival analysis to deliver accurate and automated CRLM outcome prediction.

[128] Data relativistic uncertainty framework for low-illumination anime scenery image enhancement

Yiquan Gao, John See

Main category: cs.CV

TL;DR: Proposes Data Relativistic Uncertainty (DRU) framework for low-light enhancement in anime scenery images, using uncertainty information from diverse illumination conditions to dynamically adjust objective functions.

Details

Motivation: Addresses the domain gap in low-light enhancement for anime scenery images, which is underexplored compared to natural images/videos. Aims to handle diverse illumination conditions in anime art style.

Method: Constructs unpaired anime scenery dataset, proposes DRU framework inspired by Relativistic GAN and wave-particle duality analogy. Defines illumination uncertainty of dark/bright samples and uses it to dynamically adjust objective functions for model recalibration under data uncertainty.

Result: Extensive experiments show DRU framework yields superior perceptual and aesthetic qualities beyond state-of-the-art methods. Framework demonstrates effectiveness when training EnlightenGAN variants.

Conclusion: DRU framework provides novel data-centric learning paradigm for visual domains, potentially applicable to language domains. Successfully addresses low-illumination quality degradation in anime scenery images.

Abstract: By contrast with the prevailing works of low-light enhancement in natural images and videos, this study copes with the low-illumination quality degradation in anime scenery images to bridge the domain gap. For such an underexplored enhancement task, we first curate images from various sources and construct an unpaired anime scenery dataset with diverse environments and illumination conditions to address the data scarcity. To exploit the power of uncertainty information inherent with the diverse illumination conditions, we propose a Data Relativistic Uncertainty (DRU) framework, motivated by the idea from Relativistic GAN. By analogy with the wave-particle duality of light, our framework interpretably defines and quantifies the illumination uncertainty of dark/bright samples, which is leveraged to dynamically adjust the objective functions to recalibrate the model learning under data uncertainty. Extensive experiments demonstrate the effectiveness of DRU framework by training several versions of EnlightenGANs, yielding superior perceptual and aesthetic qualities beyond the state-of-the-art methods that are incapable of learning from data uncertainty perspective. We hope our framework can expose a novel paradigm of data-centric learning for potential visual and language domains. Code is available.

[129] Robotic Ultrasound Makes CBCT Alive

Feng Li, Ziyuan Li, Zhongliang Jiang, Nassir Navab, Yuan Bi

Main category: cs.CV

TL;DR: A framework that uses robotic ultrasound to infer tissue motion and update static CBCT slices in real-time for surgical guidance, enabling dynamic refinement without repeated radiation exposure.

Details

Motivation: Static CBCT provides 3D anatomical context but fails to capture soft-tissue deformations during surgery, leading to navigation discrepancies. There's a need for real-time updating of CBCT without additional radiation exposure.

Method: Uses robotic ultrasound as a dynamic proxy to infer tissue motion. Starts with calibration-initialized alignment with LC2-based rigid refinement, then introduces USCorUNet (a lightweight network with optical flow-guided supervision) to learn deformation-aware correlation representations for real-time dense deformation field estimation from ultrasound streams. The deformation is spatially regularized and transferred to CBCT reference.

Result: Demonstrates real-time end-to-end CBCT slice updating and physically plausible deformation estimation. Enables dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions.

Conclusion: Proposed framework successfully enables deformation-aware CBCT updating using robotic ultrasound, providing real-time dynamic guidance without repeated radiation exposure during surgical interventions.

Abstract: Intraoperative Cone Beam Computed Tomography (CBCT) provides a reliable 3D anatomical context essential for interventional planning. However, its static nature fails to provide continuous monitoring of soft-tissue deformations induced by respiration, probe pressure, and surgical manipulation, leading to navigation discrepancies. We propose a deformation-aware CBCT updating framework that leverages robotic ultrasound as a dynamic proxy to infer tissue motion and update static CBCT slices in real time. Starting from calibration-initialized alignment with linear correlation of linear combination (LC2)-based rigid refinement, our method establishes accurate multimodal correspondence. To capture intraoperative dynamics, we introduce the ultrasound correlation UNet (USCorUNet), a lightweight network trained with optical flow-guided supervision to learn deformation-aware correlation representations, enabling accurate, real-time dense deformation field estimation from ultrasound streams. The inferred deformation is spatially regularized and transferred to the CBCT reference to produce deformation-consistent visualizations without repeated radiation exposure. We validate the proposed approach through deformation estimation and ultrasound-guided CBCT updating experiments. Results demonstrate real-time end-to-end CBCT slice updating and physically plausible deformation estimation, enabling dynamic refinement of static CBCT guidance during robotic ultrasound-assisted interventions. The source code is publicly available at https://github.com/anonymous-codebase/us-cbct-demo.

[130] GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

Main category: cs.CV

TL;DR: GOT-JEPA is a model-predictive pretraining framework for object tracking that extends JEPA to predict tracking models, improving generalization and occlusion handling through pseudo-supervision from clean to corrupted frames.

Details

Motivation: Current generic object trackers lack robustness in unseen scenarios and have coarse occlusion reasoning, failing to match human visual system's ability to adapt to target/scene changes and reason about occlusion at fine granularity.

Method: GOT-JEPA extends JEPA from image feature prediction to tracking model prediction, using teacher-student framework where teacher generates pseudo-tracking models from clean frames and student learns to predict same from corrupted frames. OccuSolver enhances occlusion perception with point-centric visibility estimation and iterative refinement using object priors.

Result: Extensive evaluations on seven benchmarks show the method effectively enhances tracker generalization and robustness, improving performance in dynamic environments with occlusions and distractors.

Conclusion: The proposed framework addresses limitations in generalization and occlusion perception for object tracking, providing stable pseudo-supervision and detailed occlusion-pattern capture that improves tracker robustness in challenging scenarios.

Abstract: The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

[131] OilSAM2: Memory-Augmented SAM2 for Scalable SAR Oil Spill Detection

Shuaiyu Chen, Ming Yin, Peng Ren, Chunbo Luo, Zeyu Fu

Main category: cs.CV

TL;DR: OilSAM2 is a memory-augmented segmentation framework for oil spill detection in SAR imagery that addresses challenges of appearance variability and lack of temporal coherence in unordered image collections.

Details

Motivation: Segmenting oil spills from SAR imagery is challenging due to severe appearance variability, scale heterogeneity, and absence of temporal continuity in real-world monitoring scenarios. Existing SAM-based approaches operate on single images and can't effectively reuse information across scenes, while memory-augmented variants assume temporal coherence and suffer from semantic drift when applied to unordered SAR collections.

Method: Proposes OilSAM2 with: 1) Hierarchical feature-aware multi-scale memory bank that models texture, structure, and semantic level representations for robust cross-image information reuse; 2) Structure-semantic consistent memory update strategy that selectively refreshes memory based on semantic discrepancy and structural variation to mitigate memory drift.

Result: Experiments on two public SAR oil spill datasets demonstrate that OilSAM2 achieves state-of-the-art segmentation performance, delivering stable and accurate results under noisy SAR monitoring scenarios.

Conclusion: OilSAM2 provides an effective memory-augmented segmentation framework tailored for unordered SAR oil spill monitoring, overcoming limitations of existing SAM-based approaches through hierarchical memory modeling and drift mitigation strategies.

Abstract: Segmenting oil spills from Synthetic Aperture Radar (SAR) imagery remains challenging due to severe appearance variability, scale heterogeneity, and the absence of temporal continuity in real world monitoring scenarios. While foundation models such as Segment Anything (SAM) enable prompt driven segmentation, existing SAM based approaches operate on single images and cannot effectively reuse information across scenes. Memory augmented variants (e.g., SAM2) further assume temporal coherence, making them prone to semantic drift when applied to unordered SAR image collections. We propose OilSAM2, a memory augmented segmentation framework tailored for unordered SAR oil spill monitoring. OilSAM2 introduces a hierarchical feature aware multi scale memory bank that explicitly models texture, structure, and semantic level representations, enabling robust cross image information reuse. To mitigate memory drift, we further propose a structure semantic consistent memory update strategy that selectively refreshes memory based on semantic discrepancy and structural variation.Experiments on two public SAR oil spill datasets demonstrate that OilSAM2 achieves state of the art segmentation performance, delivering stable and accurate results under noisy SAR monitoring scenarios. The source code is available at https://github.com/Chenshuaiyu1120/OILSAM2.

[132] ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

Athanasios Angelakis

Main category: cs.CV

TL;DR: ZACH-ViT is a compact Vision Transformer for medical imaging that removes positional embeddings and class tokens, using global average pooling for permutation-invariant patch processing, showing strong performance in data-scarce medical imaging scenarios.

Details

Motivation: Standard Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors, which may be suboptimal for medical imaging where spatial layout is often weakly informative. The authors aim to create a more suitable architecture for medical imaging under data-scarce conditions.

Method: ZACH-ViT removes both positional embeddings and the [CLS] token, achieving permutation-invariant patch processing through global average pooling. It uses adaptive residual projections to maintain training stability under strict parameter constraints (0.25M parameters). The model is evaluated across seven MedMNIST datasets under strict few-shot protocols.

Result: ZACH-ViT shows regime-dependent performance: strongest advantage on BloodMNIST, competitive on PathMNIST, with decreasing relative advantage on datasets with stronger anatomical priors (OCTMNIST, OrganAMNIST). Positional support becomes mildly beneficial as spatial structure increases, while reintroducing [CLS] token is consistently unfavorable.

Conclusion: Architectural alignment with data structure can outweigh universal benchmark dominance. ZACH-ViT achieves competitive performance under data-scarce conditions despite minimal size and no pretraining, making it relevant for compact medical imaging and low-resource settings.

Abstract: Vision Transformers rely on positional embeddings and class tokens encoding fixed spatial priors. While effective for natural images, these priors may be suboptimal when spatial layout is weakly informative, a frequent condition in medical imaging. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes positional embeddings and the [CLS] token, achieving permutation-invariant patch processing via global average pooling. Zero-token denotes removal of the dedicated aggregation token and positional encodings. Patch tokens remain unchanged. Adaptive residual projections preserve training stability under strict parameter constraints. We evaluate ZACH-ViT across seven MedMNIST datasets under a strict few-shot protocol (50 samples/class, fixed hyperparameters, five seeds). Results reveal regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves strongest advantage on BloodMNIST and remains competitive on PathMNIST, while relative advantage decreases on datasets with stronger anatomical priors (OCTMNIST, OrganAMNIST), consistent with our hypothesis. Component and pooling ablations show positional support becomes mildly beneficial as spatial structure increases, whereas reintroducing a [CLS] token is consistently unfavorable. These findings support that architectural alignment with data structure can outweigh universal benchmark dominance. Despite minimal size and no pretraining, ZACH-ViT achieves competitive performance under data-scarce conditions, relevant for compact medical imaging and low-resource settings. Code: https://github.com/Bluesman79/ZACH-ViT

[133] Why Does It Look There? Structured Explanations for Image Classification

Jiarui Li, Zixiang Yin, Samuel J Landry, Zhengming Ding, Ramgopal R. Mettu

Main category: cs.CV

TL;DR: I2X framework converts unstructured interpretability (saliency maps) into structured explanations using prototypes extracted during training, enabling faithful model explanations and practical accuracy improvement through targeted fine-tuning.

Details

Motivation: Current XAI methods provide unstructured interpretability (saliency maps) and often rely on auxiliary models like GPT/CLIP, compromising faithfulness to original models. There's a need for structured explanations directly from model interpretability that can both explain behavior and guide optimization.

Method: I2X extracts prototypes from post-hoc XAI methods (e.g., GradCAM) at selected training checkpoints, quantifying progress to build structured explanations. It reveals both intra- and inter-class decision making by analyzing prototype-based inference processes during training.

Result: Experiments on MNIST and CIFAR10 demonstrate I2X effectively reveals prototype-based inference processes. The framework can identify uncertain prototypes and use targeted perturbation of samples for fine-tuning, ultimately improving accuracy across different model architectures and datasets.

Conclusion: I2X provides faithful structured explanations of model behavior while offering a practical approach to guide optimization toward desired targets, bridging the gap between interpretability and explainability in deep learning models.

Abstract: Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (\eg, GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying progress at selected checkpoints during training using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question of “why does it look there” by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR10 demonstrate effectiveness of I2X to reveal prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can be used to improve predictions across different model architectures and datasets: we can identify uncertain prototypes recognized by I2X and then use targeted perturbation of samples that allows fine-tuning to ultimately improve accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.

[134] One Adapter for All: Towards Unified Representation in Step-Imbalanced Class-Incremental Learning

Xiaoyan Zhang, Jiangpeng He

Main category: cs.CV

TL;DR: One-A is an imbalance-aware framework for class-incremental learning that handles varying class counts per task through asymmetric subspace alignment and directional gating in a single adapter.

Details

Motivation: Most class-incremental learning methods assume balanced task streams, but in practice, the number of classes per task varies significantly (step imbalance). Large tasks dominate learning while small tasks inject unstable updates, degrading overall performance.

Method: One-A incrementally merges task updates into a single adapter with constant inference cost. It uses asymmetric subspace alignment to preserve dominant subspaces from large tasks while constraining low-information updates, information-adaptive weighting to balance base and new adapters, and directional gating to selectively fuse updates along singular directions.

Result: Across multiple benchmarks and step-imbalanced streams, One-A achieves competitive accuracy with significantly low inference overhead, demonstrating that a single asymmetrically fused adapter can remain adaptive to dynamic task sizes while efficient at deployment.

Conclusion: One-A provides a unified, imbalance-aware framework for class-incremental learning that effectively handles step-imbalanced task streams while maintaining efficiency through a single adapter architecture.

Abstract: Class-incremental learning (CIL) aims to acquire new classes over time while retaining prior knowledge, yet most setups and methods assume balanced task streams. In practice, the number of classes per task often varies significantly. We refer to this as step imbalance, where large tasks that contain more classes dominate learning and small tasks inject unstable updates. Existing CIL methods assume balanced tasks and therefore treat all tasks uniformly, producing imbalanced updates that degrade overall learning performance. To address this challenge, we propose One-A, a unified and imbalance-aware framework that incrementally merges task updates into a single adapter, maintaining constant inference cost. One-A performs asymmetric subspace alignment to preserve dominant subspaces learned from large tasks while constraining low-information updates within them. An information-adaptive weighting balances the contribution between base and new adapters, and a directional gating mechanism selectively fuses updates along each singular direction, maintaining stability in head directions and plasticity in tail ones. Across multiple benchmarks and step-imbalanced streams, One-A achieves competitive accuracy with significantly low inference overhead, showing that a single, asymmetrically fused adapter can remain both adaptive to dynamic task sizes and efficient at deployment.

[135] Joint Imaging-ROI Representation Learning via Cross-View Contrastive Alignment for Brain Disorder Classification

Wei Liang, Lifang He

Main category: cs.CV

TL;DR: A unified cross-view contrastive framework for joint imaging-ROI representation learning in brain imaging classification, aligning global volumetric and local ROI-graph embeddings in shared latent space.

Details

Motivation: Existing brain imaging classification approaches use either full image volumes (global context) or ROI-based graphs (local interactions), but their relative contributions and complementarity are poorly understood. Current fusion methods are task-specific and don't allow controlled evaluation of each representation.

Method: Proposes a unified cross-view contrastive framework that learns subject-level global (imaging) and local (ROI-graph) embeddings, aligning them in shared latent space using bidirectional contrastive objective. This encourages same-subject embeddings to converge while separating different-subject embeddings.

Result: Joint learning consistently improves classification performance over either branch alone on ADHD-200 and ABIDE datasets across multiple backbone choices. Interpretability analyses show imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns.

Conclusion: Explicitly integrating global volumetric and ROI-level representations is promising for neuroimaging-based brain disorder classification. The framework enables systematic evaluation of imaging-only, ROI-only, and joint configurations within unified training protocol.

Abstract: Brain imaging classification is commonly approached from two perspectives: modeling the full image volume to capture global anatomical context, or constructing ROI-based graphs to encode localized and topological interactions. Although both representations have demonstrated independent efficacy, their relative contributions and potential complementarity remain insufficiently understood. Existing fusion approaches are typically task-specific and do not enable controlled evaluation of each representation under consistent training settings. To address this gap, we propose a unified cross-view contrastive framework for joint imaging-ROI representation learning. Our method learns subject-level global (imaging) and local (ROI-graph) embeddings and aligns them in a shared latent space using a bidirectional contrastive objective, encouraging representations from the same subject to converge while separating those from different subjects. This alignment produces comparable embeddings suitable for downstream fusion and enables systematic evaluation of imaging-only, ROI-only, and joint configurations within a unified training protocol. Extensive experiments on the ADHD-200 and ABIDE datasets demonstrate that joint learning consistently improves classification performance over either branch alone across multiple backbone choices. Moreover, interpretability analyses reveal that imaging-based and ROI-based branches emphasize distinct yet complementary discriminative patterns, explaining the observed performance gains. These findings provide principled evidence that explicitly integrating global volumetric and ROI-level representations is a promising direction for neuroimaging-based brain disorder classification. The source code is available at https://anonymous.4open.science/r/imaging-roi-contrastive-152C/.

[136] A Robust Deep Learning Framework for Bangla License Plate Recognition Using YOLO and Vision-Language OCR

Nayeb Hasin, Md. Arafath Rahman Nishat, Mainul Islam, Khandakar Shakib Al Hasan, Asif Newaz

Main category: cs.CV

TL;DR: A robust Bangla License Plate Recognition system combining deep learning object detection (YOLOv8 with adaptive training) and VisionEncoderDecoder OCR for text extraction, achieving high accuracy on challenging Bangla plates.

Details

Motivation: Bangla license plate recognition is challenging due to complex character schemes and uneven layouts, creating a need for robust ALPR systems for intelligent traffic management applications.

Method: Two-stage approach: 1) License plate localization using deep learning object detection models (U-Net, YOLO variants) with novel adaptive training strategy on YOLOv8; 2) Text recognition as sequence generation using VisionEncoderDecoder architecture with ViT + BanglaBERT combination.

Result: Achieved 97.83% accuracy and 91.3% IoU for localization; ViT+BanglaBERT achieved Character Error Rate of 0.1323 and Word Error Rate of 0.1068. System showed consistent performance on external dataset with different environmental conditions.

Conclusion: The proposed system provides robust and reliable Bangla license plate recognition suitable for intelligent transportation applications like automated law enforcement and access control, performing effectively across diverse real-world scenarios.

Abstract: An Automatic License Plate Recognition (ALPR) system constitutes a crucial element in an intelligent traffic management system. However, the detection of Bangla license plates remains challenging because of the complicated character scheme and uneven layouts. This paper presents a robust Bangla License Plate Recognition system that integrates a deep learning-based object detection model for license plate localization with Optical Character Recognition for text extraction. Multiple object detection architectures, including U-Net and several YOLO (You Only Look Once) variants, are compared for license plate localization. This study proposes a novel two-stage adaptive training strategy built upon the YOLOv8 architecture to improve localization performance. The proposed approach outperforms the established models, achieving an accuracy of 97.83% and an Intersection over Union (IoU) of 91.3%. The text recognition problem is phrased as a sequence generation problem with a VisionEncoderDecoder architecture, with a combination of encoder-decoders evaluated. It was demonstrated that the ViT + BanglaBERT model gives better results at the character level, with a Character Error Rate of 0.1323 and Word Error Rate of 0.1068. The proposed system also shows a consistent performance when tested on an external dataset that has been curated for this study purpose. The dataset offers completely different environment and lighting conditions compared to the training sample, indicating the robustness of the proposed framework. Overall, our proposed system provides a robust and reliable solution for Bangla license plate recognition and performs effectively across diverse real-world scenarios, including variations in lighting, noise, and plate styles. These strengths make it well suited for deployment in intelligent transportation applications such as automated law enforcement and access control.

[137] From Imitation to Intuition: Intrinsic Reasoning for Open-Instance Video Classification

Ke Zhang, Xiangchen Zhao, Yunjie Tian, Jiayu Zheng, Vishal M. Patel, Di Fu

Main category: cs.CV

TL;DR: DeepIntuit: A framework that evolves open-instance video classification from imitation to intuition by leveraging vision-language models’ reasoning capabilities through supervised alignment, reinforcement learning refinement, and intuitive calibration.

Details

Motivation: Real-world video classification faces open-instance challenges with vast intra-class variations beyond existing benchmarks. While vision-language models offer superior generalization, their reasoning capabilities haven't been fully leveraged for such tasks.

Method: Three-stage approach: 1) Cold-start supervised alignment to initialize reasoning capability, 2) Group Relative Policy Optimization (GRPO) refinement to enhance reasoning coherence through reinforcement learning, 3) Intuitive calibration where a classifier is trained on intrinsic reasoning traces from the refined VLM.

Result: Extensive experiments demonstrate DeepIntuit significantly benefits from transcending simple feature imitation and evolving toward intrinsic reasoning for open-instance video classification.

Conclusion: The framework successfully bridges the gap between imitation and intuition in video classification, leveraging VLMs’ reasoning capabilities to handle complex real-world open-instance scenarios.

Abstract: Conventional video classification models, acting as effective imitators, excel in scenarios with homogeneous data distributions. However, real-world applications often present an open-instance challenge, where intra-class variations are vast and complex, beyond existing benchmarks. While traditional video encoder models struggle to fit these diverse distributions, vision-language models (VLMs) offer superior generalization but have not fully leveraged their reasoning capabilities (intuition) for such tasks. In this paper, we bridge this gap with an intrinsic reasoning framework that evolves open-instance video classification from imitation to intuition. Our approach, namely DeepIntuit, begins with a cold-start supervised alignment to initialize reasoning capability, followed by refinement using Group Relative Policy Optimization (GRPO) to enhance reasoning coherence through reinforcement learning. Crucially, to translate this reasoning into accurate classification, DeepIntuit then introduces an intuitive calibration stage. In this stage, a classifier is trained on this intrinsic reasoning traces generated by the refined VLM, ensuring stable knowledge transfer without distribution mismatch. Extensive experiments demonstrate that for open-instance video classification, DeepIntuit benefits significantly from transcending simple feature imitation and evolving toward intrinsic reasoning. Our project is available at https://bwgzk-keke.github.io/DeepIntuit/.

[138] Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models

Yuedong Yang, Xiwen Wei, Mustafa Munir, Radu Marculescu

Main category: cs.CV

TL;DR: Fuel Gauge predicts Chain-of-Thought length in Large Multi-modality Models to optimize computational efficiency and accuracy by addressing memory fragmentation and reasoning issues.

Details

Motivation: Current LMMs use inefficient CoT processes that cause computational waste (memory fragmentation) and accuracy problems (under-/over-thinking). The CoT length is unpredictable at runtime, leading to suboptimal resource usage.

Method: Proposes Fuel Gauge method that extracts a hidden parameter representing reasoning “fuel” to predict CoT length ahead of time. Enables predictive KV cache allocation and CoT length modulation.

Result: Achieves less than half the CoT length prediction error compared to baselines on GPQA-Diamond benchmark, with 13.37x reduction in memory allocation frequency. Effective across text-only, image-text, and video-text QA benchmarks.

Conclusion: Fuel Gauge provides practical value for LMM serving systems by improving computational efficiency and accuracy through CoT length prediction, addressing both memory fragmentation and reasoning quality issues.

Abstract: Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of “fuel” available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method which extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility on the Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.

[139] Overcoming Visual Clutter in Vision Language Action Models via Concept-Gated Visual Distillation

Sangmim Song, Sarath Kodagoda, Marc Carmichael, Karthick Thiyagarajan

Main category: cs.CV

TL;DR: CGVD is a training-free inference framework that addresses the “Precision-Reasoning Gap” in Vision-Language-Action models by parsing instructions, refining targets, and using Fourier-based inpainting to suppress semantic distractors in cluttered environments.

Details

Motivation: VLA models struggle with precise manipulation in cluttered environments due to background-induced feature dilution, where semantic noise corrupts geometric grounding needed for accurate robotic manipulation.

Method: Concept-Gated Visual Distillation (CGVD) parses instructions into safe/distractor sets, uses two-layer target refinement (cross-validation + spatial disambiguation), and applies Fourier-based inpainting to generate clean observations that suppress distractors while preserving spatial geometry.

Result: CGVD achieves 77.5% success rate vs 43.0% baseline in dense semantic distractor environments, preventing performance collapse and significantly outperforming state-of-the-art methods.

Conclusion: CGVD establishes inference-time visual distillation as critical for robust robotic manipulation in clutter by enforcing strict attribute adherence and stabilizing VLA policies without training.

Abstract: Vision-Language-Action (VLA) models demonstrate impressive zero-shot generalization but frequently suffer from a “Precision-Reasoning Gap” in cluttered environments. This failure is driven by background-induced feature dilution, where high-frequency semantic noise corrupts the geometric grounding required for precise manipulation. To bridge this gap, we propose Concept-Gated Visual Distillation (CGVD), a training-free, model-agnostic inference framework that stabilizes VLA policies. CGVD operates by parsing instructions into safe and distractor sets, utilizing a two-layer target refinement process–combining cross-validation and spatial disambiguation–to explicitly penalize false positives and isolate genuine manipulation targets. We then process the scene via Fourier-based inpainting, generating a clean observation that actively suppresses semantic distractors while preserving critical spatial geometry and visual proprioception. Extensive evaluations in highly cluttered manipulation tasks demonstrate that CGVD prevents performance collapse. In environments with dense semantic distractors, our method significantly outperforms state-of-the-art baselines, achieving a 77.5% success rate compared to the baseline’s 43.0%. By enforcing strict attribute adherence, CGVD establishes inference-time visual distillation as a critical prerequisite for robust robotic manipulation in the clutter.

[140] EmoStory: Emotion-Aware Story Generation

Jingyuan Yang, Rucong Chen, Hui Huang

Main category: cs.CV

TL;DR: EmoStory: A two-stage framework for emotion-aware visual story generation that creates subject-consistent image sequences with explicit emotional directions through agent-based planning and region-aware composition.

Details

Motivation: Existing story generation methods focus on coherent narratives and subject consistency but remain emotion-neutral, overlooking how emotions shape narrative interpretation and visual presentation. Stories should engage audiences emotionally, so there's a need for emotion-aware story generation.

Method: Two-stage framework: 1) Planning stage uses emotion agent and writer agent to transform target emotions into coherent story prompts, 2) Generation stage preserves subject consistency and injects emotion-related elements through region-aware composition.

Result: Evaluated on new dataset covering 25 subjects and 600 emotional stories. Outperforms state-of-the-art methods in emotion accuracy, prompt alignment, and subject consistency based on quantitative, qualitative results and user studies.

Conclusion: EmoStory successfully addresses the challenging task of emotion-aware story generation by integrating agent-based planning and region-aware composition to create subject-consistent visual stories with explicit emotional directions.

Abstract: Story generation aims to produce image sequences that depict coherent narratives while maintaining subject consistency across frames. Although existing methods have excelled in producing coherent and expressive stories, they remain largely emotion-neutral, focusing on what subject appears in a story while overlooking how emotions shape narrative interpretation and visual presentation. As stories are intended to engage audiences emotionally, we introduce emotion-aware story generation, a new task that aims to generate subject-consistent visual stories with explicit emotional directions. This task is challenging due to the abstract nature of emotions, which must be grounded in concrete visual elements and consistently expressed across a narrative through visual composition. To address these challenges, we propose EmoStory, a two-stage framework that integrates agent-based story planning and region-aware story generation. The planning stage transforms target emotions into coherent story prompts with emotion agent and writer agent, while the generation stage preserves subject consistency and injects emotion-related elements through region-aware composition. We evaluate EmoStory on a newly constructed dataset covering 25 subjects and 600 emotional stories. Extensive quantitative and qualitative results, along with user studies, show that EmoStory outperforms state-of-the-art story generation methods in emotion accuracy, prompt alignment, and subject consistency.

[141] StyleGallery: Training-free and Semantic-aware Personalized Style Transfer from Arbitrary Image References

Boyu He, Yunfan Ye, Chang Liu, Weishang Wu, Fang Liu, Zhiping Cai

Main category: cs.CV

TL;DR: StyleGallery is a training-free, semantic-aware framework for image style transfer that addresses limitations in existing methods by enabling flexible use of arbitrary style references without extra constraints, using adaptive clustering for semantic region segmentation and energy-guided diffusion for optimized stylization.

Details

Motivation: Existing diffusion-based style transfer methods have three key limitations: 1) semantic gap where style references may lack proper content semantics, 2) reliance on extra constraints like semantic masks restricting applicability, and 3) rigid feature associations lacking adaptive global-local alignment. These limitations restrict personalization, accuracy, and adaptability in style transfer.

Method: StyleGallery uses a three-stage approach: 1) Semantic region segmentation via adaptive clustering on latent diffusion features without extra inputs, 2) Clustered region matching using block filtering on extracted features for precise alignment, and 3) Style transfer optimization with energy function-guided diffusion sampling and regional style loss to optimize stylization.

Result: Experiments on the introduced benchmark show StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.

Conclusion: StyleGallery provides a training-free, semantic-aware framework that addresses key limitations in diffusion-based style transfer, enabling flexible use of arbitrary style references and effective personalized customization without requiring extra constraints or training.

Abstract: Despite the advancements in diffusion-based image style transfer, existing methods are commonly limited by 1) semantic gap: the style reference could miss proper content semantics, causing uncontrollable stylization; 2) reliance on extra constraints (e.g., semantic masks) restricting applicability; 3) rigid feature associations lacking adaptive global-local alignment, failing to balance fine-grained stylization and global content preservation. These limitations, particularly the inability to flexibly leverage style inputs, fundamentally restrict style transfer in terms of personalization, accuracy, and adaptability. To address these, we propose StyleGallery, a training-free and semantic-aware framework that supports arbitrary reference images as input and enables effective personalized customization. It comprises three core stages: semantic region segmentation (adaptive clustering on latent diffusion features to divide regions without extra inputs); clustered region matching (block filtering on extracted features for precise alignment); and style transfer optimization (energy function-guided diffusion sampling with regional style loss to optimize stylization). Experiments on our introduced benchmark demonstrate that StyleGallery outperforms state-of-the-art methods in content structure preservation, regional stylization, interpretability, and personalized customization, particularly when leveraging multiple style references.

[142] One Token, Two Fates: A Unified Framework via Vision Token Manipulation Against MLLMs Hallucination

Zhan Fa, Yue Duan, Jian Zhang, Lei Qi, Yinghuan Shi

Main category: cs.CV

TL;DR: A unified framework for reducing hallucinations in multimodal LLMs by strategically using vision tokens through synergistic visual calibration and causal representation calibration.

Details

Motivation: Current training-free methods for addressing MLLM hallucinations use separate strategies (enhancing visual signals or suppressing text inertia) that are insufficient due to critical trade-offs. Simply enhancing vision often fails against strong language priors, while suppressing language can introduce image-irrelevant noise. A unified framework is needed to effectively restore vision-language balance.

Method: Proposes a unified framework focusing on vision tokens with two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens isolates hallucination tendencies more precisely than distorting images. Uses two modules: Synergistic Visual Calibration (SVC) incorporates augmented tokens to strengthen visual representations, and Causal Representation Calibration (CRC) uses pruned tokens to create latent-space negative samples for correcting internal model biases.

Result: Significantly reduces object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.

Conclusion: The proposed unified framework effectively restores vision-language balance in MLLMs by harmonizing two complementary uses of vision tokens, achieving substantial hallucination reduction with minimal computational overhead.

Abstract: Current training-free methods tackle MLLM hallucination with separate strategies: either enhancing visual signals or suppressing text inertia. However, these separate methods are insufficient due to critical trade-offs: simply enhancing vision often fails against strong language prior, while suppressing language can introduce extra image-irrelevant noise. Moreover, we find their naive combination is also ineffective, necessitating a unified framework. We propose such a framework by focusing on the core asset: the vision token. Our design leverages two key insights: (1) augmented images offer complementary visual semantics, and (2) removing vision tokens (information-gap) isolates hallucination tendencies more precisely than distorting images (modality-gap). Based on these, our framework uses vision tokens in two distinct ways, both operating on latent representations: our Synergistic Visual Calibration (SVC) module incorporates augmented tokens to strengthen visual representations, while our Causal Representation Calibration (CRC) module uses pruned tokens to create latent-space negative samples for correcting internal model biases. By harmonizing these two roles, our framework effectively restores the vision-language balance, significantly reducing object hallucinations, improving POPE accuracy by an average of 2% absolute on LLaVA-1.5 across multiple benchmarks with only a 1.06x inference latency overhead.

[143] Geometric Autoencoder for Diffusion Models

Hangyu Liu, Jianyong Wang, Yutao Sun

Main category: cs.CV

TL;DR: GAE is a principled autoencoder framework for latent diffusion models that optimizes semantic supervision from vision foundation models, uses latent normalization instead of KL-divergence, and incorporates dynamic noise sampling for robust reconstruction.

Details

Motivation: Existing latent designs for diffusion models are heuristic and struggle to balance semantic discriminability, reconstruction fidelity, and latent compactness. There's a need for a principled framework that systematically addresses these challenges.

Method: GAE constructs optimized low-dimensional semantic supervision from Vision Foundation Models, uses latent normalization to replace KL-divergence for stable latent manifolds, and incorporates dynamic noise sampling for robust reconstruction under high-intensity noise.

Result: Achieves gFID of 1.82 at 80 epochs and 1.31 at 800 epochs on ImageNet-1K 256×256 benchmark without Classifier-Free Guidance, significantly surpassing state-of-the-art methods. Establishes superior equilibrium between compression, semantic depth, and reconstruction stability.

Conclusion: GAE offers a promising paradigm for latent diffusion modeling by providing a principled framework that validates design considerations and achieves compelling performance while balancing multiple objectives.

Abstract: Latent diffusion models have established a new state-of-the-art in high-resolution visual generation. Integrating Vision Foundation Model priors improves generative efficiency, yet existing latent designs remain largely heuristic. These approaches often struggle to unify semantic discriminability, reconstruction fidelity, and latent compactness. In this paper, we propose Geometric Autoencoder (GAE), a principled framework that systematically addresses these challenges. By analyzing various alignment paradigms, GAE constructs an optimized low-dimensional semantic supervision target from VFMs to provide guidance for the autoencoder. Furthermore, we leverage latent normalization that replaces the restrictive KL-divergence of standard VAEs, enabling a more stable latent manifold specifically optimized for diffusion learning. To ensure robust reconstruction under high-intensity noise, GAE incorporates a dynamic noise sampling mechanism. Empirically, GAE achieves compelling performance on the ImageNet-1K $256 \times 256$ benchmark, reaching a gFID of 1.82 at only 80 epochs and 1.31 at 800 epochs without Classifier-Free Guidance, significantly surpassing existing state-of-the-art methods. Beyond generative quality, GAE establishes a superior equilibrium between compression, semantic depth and robust reconstruction stability. These results validate our design considerations, offering a promising paradigm for latent diffusion modeling. Code and models are publicly available at https://github.com/freezing-index/Geometric-Autoencoder-for-Diffusion-Models.

[144] GeoSense: Internalizing Geometric Necessity Perception for Multimodal Reasoning

Ruiheng Liu, Haihong Hao, Mingfei Han, Xin Gu, Kecheng Zhang, Changlin Li, Xiaojun Chang

Main category: cs.CV

TL;DR: A framework that enables MLLMs to autonomously engage geometric features when 2D visual cues are insufficient, rather than always injecting geometry, improving spatial reasoning efficiency and awareness.

Details

Motivation: Current MLLMs have limited spatial understanding, and existing methods rigidly inject geometric signals into every input, ignoring their necessity and adding computation overhead. The goal is to create more efficient and self-aware multimodal intelligence.

Method: 1) Introduce independent geometry input channel with alignment training for effective geometric feature utilization. 2) Curate spatial-aware supervised fine-tuning dataset to activate model’s latent internal cues for autonomous determination of geometric information necessity.

Result: Experiments across multiple spatial reasoning benchmarks show significant spatial gains without compromising 2D visual reasoning capabilities.

Conclusion: The framework offers a path toward more robust, efficient and self-aware multimodal intelligence by enabling models to autonomously determine when geometric information is needed.

Abstract: Advancing towards artificial superintelligence requires rich and intelligent perceptual capabilities. A critical frontier in this pursuit is overcoming the limited spatial understanding of Multimodal Large Language Models (MLLMs), where geometry information is essential. Existing methods often address this by rigidly injecting geometric signals into every input, while ignoring their necessity and adding computation overhead. Contrary to this paradigm, our framework endows the model with an awareness of perceptual insufficiency, empowering it to autonomously engage geometric features in reasoning when 2D cues are deemed insufficient. To achieve this, we first introduce an independent geometry input channel to the model architecture and conduct alignment training, enabling the effective utilization of geometric features. Subsequently, to endow the model with perceptual awareness, we curate a dedicated spatial-aware supervised fine-tuning dataset. This serves to activate the model’s latent internal cues, empowering it to autonomously determine the necessity of geometric information. Experiments across multiple spatial reasoning benchmarks validate this approach, demonstrating significant spatial gains without compromising 2D visual reasoning capabilities, offering a path toward more robust, efficient and self-aware multi-modal intelligence.

[145] Multi-Person Pose Estimation Evaluation Using Optimal Transportation and Improved Pose Matching

Takato Moriki, Hiromu Taketsugu, Norimichi Ukita

Main category: cs.CV

TL;DR: OCpose is a new evaluation metric for multi-person pose estimation that uses optimal transport to fairly evaluate all detected poses regardless of confidence scores, addressing limitations of current metrics that focus too much on high-confidence poses.

Details

Motivation: Current pose estimation metrics prioritize high-confidence poses and ignore low-confidence false positives, leading to potentially misleading high scores even when many low-confidence false positives exist. There's a need for a fair evaluation that properly balances true-positive and false-positive poses.

Method: OCpose treats pose evaluation as an optimal transportation problem between detected poses and ground truth annotations. It evaluates all detected poses equally regardless of confidence scores, but uses confidence scores to improve matching reliability between estimated poses and annotations.

Result: OCpose provides a different assessment perspective compared to confidence ranking-based metrics, offering a more balanced evaluation that accounts for the tradeoff between true-positive and false-positive poses.

Conclusion: OCpose is a novel evaluation metric that addresses limitations of current pose estimation metrics by using optimal transport to fairly evaluate all detected poses, providing a more comprehensive assessment of pose estimation performance.

Abstract: In Multi-Person Pose Estimation, many metrics place importance on ranking of pose detection confidence scores. Current metrics tend to disregard false-positive poses with low confidence, focusing primarily on a larger number of high-confidence poses. Consequently, these metrics may yield high scores even when many false-positive poses with low confidence are detected. For fair evaluation taking into account a tradeoff between true-positive and false-positive poses, this paper proposes Optimal Correction Cost for pose (OCpose), which evaluates detected poses against pose annotations as an optimal transportation. For the fair tradeoff between true-positive and false-positive poses, OCpose equally evaluates all the detected poses regardless of their confidence scores. In OCpose, on the other hand, the confidence score of each pose is utilized to improve the reliability of matching scores between the estimated pose and pose annotations. As a result, OCpose provides a different perspective assessment than other confidence ranking-based metrics.

[146] Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

Tianshuo Xu, Zhifei Chen, Leyi Wu, Hao Lu, Ying-cong Chen

Main category: cs.CV

TL;DR: Motion Forcing: A hierarchical “Point-Shape-Appearance” framework for video generation that stabilizes the trilemma of visual quality, physical consistency, and controllability in complex scenes through explicit decoupling of physical reasoning from visual synthesis.

Details

Motivation: Current video generation models struggle to maintain the balance between visual quality, physical consistency, and controllability as scene complexity increases (e.g., collisions, dense traffic). The trilemma becomes fragile in complex scenarios.

Method: 1) Hierarchical “Point-Shape-Appearance” paradigm: decomposes generation into verifiable stages - modeling dynamics as sparse geometric anchors (Point), expanding to dynamic depth maps (Shape), then rendering textures (Appearance). 2) Masked Point Recovery: randomly masks input anchors during training to force learning of latent physical laws (e.g., inertia) for inferring missing trajectories.

Result: Outperforms state-of-the-art baselines on autonomous driving benchmarks, maintains trilemma stability across complex scenes, and shows generality in physics and robotics evaluations.

Conclusion: Motion Forcing successfully stabilizes the video generation trilemma in complex scenarios by explicitly decoupling physical reasoning from visual synthesis and forcing models to learn latent physical laws through structured training.

Abstract: The ultimate goal of video generation is to satisfy a fundamental trilemma: achieving high visual quality, maintaining rigorous physical consistency, and enabling precise controllability. While recent models can maintain this balance in simple, isolated scenarios, we observe that this equilibrium is fragile and often breaks down as scene complexity increases (e.g., involving collisions or dense traffic). To address this, we introduce \textbf{Motion Forcing}, a framework designed to stabilize this trilemma even in complex generative tasks. Our key insight is to explicitly decouple physical reasoning from visual synthesis via a hierarchical \textbf{``Point-Shape-Appearance’’} paradigm. This approach decomposes generation into verifiable stages: modeling complex dynamics as sparse geometric anchors (\textbf{Point}), expanding them into dynamic depth maps that explicitly resolve 3D geometry (\textbf{Shape}), and finally rendering high-fidelity textures (\textbf{Appearance}). Furthermore, to foster robust physical understanding, we employ a \textbf{Masked Point Recovery} strategy. By randomly masking input anchors during training and enforcing the reconstruction of complete dynamic depth, the model is compelled to move beyond passive pattern matching and learn latent physical laws (e.g., inertia) to infer missing trajectories. Extensive experiments on autonomous driving benchmarks show that Motion Forcing significantly outperforms state-of-the-art baselines, maintaining trilemma stability across complex scenes. Evaluations on physics and robotics further confirm our framework’s generality.

[147] Frames2Residual: Spatiotemporal Decoupling for Self-Supervised Video Denoising

Mingjie Ji, Zhan Shi, Kailai Zhou, Zixuan Fu, Xun Cao

Main category: cs.CV

TL;DR: F2R is a self-supervised video denoising framework that decouples temporal consistency modeling from spatial texture recovery, addressing limitations of existing blind-spot networks that lose spatial details.

Details

Motivation: Existing self-supervised video denoising methods struggle to balance inter-frame temporal consistency with intra-frame spatial specificity. Video Blind-Spot Networks require noise independence by masking center pixels, which prevents using spatial evidence for texture recovery and causes texture loss.

Method: Two-stage spatiotemporal decoupling framework: Stage 1 uses a blind temporal estimator with frame-wise blind strategy to learn inter-frame consistency and produce temporally consistent anchor. Stage 2 uses a non-blind spatial refiner that leverages the anchor to safely reintroduce center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability.

Result: Extensive experiments show F2R outperforms existing self-supervised methods on both sRGB and raw video benchmarks, demonstrating effectiveness of the decoupling strategy.

Conclusion: The proposed spatiotemporal decoupling framework successfully addresses the trade-off between temporal consistency and spatial texture recovery in self-supervised video denoising, achieving state-of-the-art performance.

Abstract: Self-supervised video denoising methods typically extend image-based frameworks into the temporal dimension, yet they often struggle to integrate inter-frame temporal consistency with intra-frame spatial specificity. Existing Video Blind-Spot Networks (BSNs) require noise independence by masking the center pixel, this constraint prevents the use of spatial evidence for texture recovery, thereby severing spatiotemporal correlations and causing texture loss. To address this, we propose Frames2Residual (F2R), a spatiotemporal decoupling framework that explicitly divides self-supervised training into two distinct stages: blind temporal consistency modeling and non-blind spatial texture recovery. In Stage 1, a blind temporal estimator learns inter-frame consistency using a frame-wise blind strategy, producing a temporally consistent anchor. In Stage 2, a non-blind spatial refiner leverages this anchor to safely reintroduce the center frame and recover intra-frame high-frequency spatial residuals while preserving temporal stability. Extensive experiments demonstrate that our decoupling strategy allows F2R to outperform existing self-supervised methods on both sRGB and raw video benchmarks.

[148] TractoRC: A Unified Probabilistic Learning Framework for Joint Tractography Registration and Clustering

Yijie Li, Xi Zhu, Junyi Wang, Ye Wu, Lauren J. O’Donnell, Fan Zhang

Main category: cs.CV

TL;DR: TractoRC: A unified probabilistic framework that jointly performs tractogram registration and streamline clustering for diffusion MRI tractography analysis using a shared latent embedding space.

Details

Motivation: Current tractography analysis typically performs tractogram registration (aligning streamlines across individuals) and streamline clustering (grouping streamlines into fiber bundles) independently, despite both tasks sharing the goal of capturing geometrically similar structures to characterize white matter organization.

Method: Proposes TractoRC, a unified probabilistic framework that jointly optimizes both tasks within a single scheme. Learns a latent embedding space for streamline points as shared representation. Registration learns distribution of anatomical landmarks as probabilistic keypoints, while clustering learns streamline structural prototypes. Uses transformation-equivariant self-supervised strategy to learn geometry-aware and transformation-invariant embeddings.

Result: Experiments demonstrate that jointly optimizing registration and clustering significantly improves performance in both tasks over state-of-the-art methods that treat them independently.

Conclusion: TractoRC provides a unified framework that enables tractogram registration and streamline clustering to leverage complementary information, improving both tasks through joint optimization in a shared latent space.

Abstract: Diffusion MRI tractography enables in vivo reconstruction of white matter (WM) pathways. Two key tasks in tractography analysis include: 1) tractogram registration that aligns streamlines across individuals, and 2) streamline clustering that groups streamlines into compact fiber bundles. Although both tasks share the goal of capturing geometrically similar structures to characterize consistent WM organization, they are typically performed independently. In this work, we propose TractoRC, a unified probabilistic framework that jointly performs tractogram registration and streamline clustering within a single optimization scheme, enabling the two tasks to leverage complementary information. TractoRC learns a latent embedding space for streamline points, which serves as a shared representation for both tasks. Within this space, both tasks are formulated as probabilistic inference over structural representations: registration learns the distribution of anatomical landmarks as probabilistic keypoints to align tractograms across subjects, and clustering learns streamline structural prototypes that capture geometric similarity to form coherent streamline clusters. To support effective learning of this shared space, we introduce a transformation-equivariant self-supervised strategy to learn geometry-aware and transformation-invariant embeddings. Experiments demonstrate that jointly optimizing registration and clustering significantly improves performance in both tasks over state-of-the-art methods that treat them independently. Code will be made publicly available at https://github.com/yishengpoxiao/TractoRC .

[149] World2Act: Latent Action Post-Training via Skill-Compositional World Models

An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

Main category: cs.CV

TL;DR: World2Act: A post-training framework that aligns VLA actions with world model video-dynamics latents using contrastive matching, reducing pixel-dependence and improving robustness through LLM-based skill decomposition for consistent video generation across varying task horizons.

Details

Motivation: Current world model-based post-training methods for vision-language-action policies rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucinations from imperfect world model rollouts. Additionally, world models struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely.

Method: 1) World2Act framework aligns VLA actions directly with world model video-dynamics latents using contrastive matching objective instead of pixel-space supervision. 2) LLM-based skill-decomposition pipeline segments high-level instructions into low-level prompts to create skill-compositional world models that remain temporally consistent across diverse task horizons. 3) Produces RoboCasa-Skill and LIBERO-Skill datasets supporting skill-compositional world models.

Result: World2Act applied to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO benchmarks, and improves real-world performance by 6.7%, enhancing embodied agent generalization.

Conclusion: World2Act provides an effective post-training framework that reduces dependence on pixel-space supervision through latent alignment and addresses variable-length video generation challenges through skill decomposition, significantly improving VLA policy robustness and generalization in embodied AI tasks.

Abstract: World Models (WMs) have emerged as a promising approach for post-training Vision-Language-Action (VLA) policies to improve robustness and generalization under environmental changes. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to pixel-level artifacts and hallucination from imperfect WM rollouts. We introduce World2Act, a post-training framework that aligns VLA actions directly with WM video-dynamics latents using a contrastive matching objective, reducing dependence on pixels. Post-training performance is tied to rollout quality, yet current WMs struggle with arbitrary-length video generation as they are mostly trained on fixed-length clips while robotic execution durations vary widely. To address this, we propose an automatic LLM-based skill-decomposition pipeline that segments high-level instructions into low-level prompts. Our pipeline produces RoboCasa-Skill and LIBERO-Skill, supporting skill-compositional WMs that remain temporally consistent across diverse task horizons. Empirically, applying World2Act to VLAs like GR00T-N1.6 and Cosmos Policy achieves state-of-the-art results on RoboCasa and LIBERO, and improves real-world performance by 6.7%, enhancing embodied agent generalization.

[150] SignSparK: Efficient Multilingual Sign Language Production via Sparse Keyframe Learning

Jianhe Low, Alexandre Symeonidis-Herzig, Maksym Ivashechkin, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: A novel Sign Language Production framework using sparse keyframes and Conditional Flow Matching to generate natural, fluid 3D sign language avatars across multiple languages.

Details

Motivation: Current SLP frameworks face a trade-off: text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. There's a need for natural, linguistically accurate sign language avatars.

Method: Proposes a training paradigm using sparse keyframes to capture human signing kinematics. Introduces FAST for efficient sign segmentation and SignSparK, a large-scale Conditional Flow Matching framework that predicts dense motion from discrete anchors in SMPL-X and MANO spaces.

Result: SignSparK enables high-fidelity synthesis in fewer than ten sampling steps, scales across four distinct sign languages (largest multilingual SLP framework), and establishes new state-of-the-art across diverse SLP tasks and multilingual benchmarks.

Conclusion: The keyframe-driven approach with Conditional Flow Matching successfully resolves the trade-off in SLP, enabling natural, fluid sign language generation with precise spatiotemporal editing capabilities across multiple languages.

Abstract: Generating natural and linguistically accurate sign language avatars remains a formidable challenge. Current Sign Language Production (SLP) frameworks face a stark trade-off: direct text-to-pose models suffer from regression-to-the-mean effects, while dictionary-retrieval methods produce robotic, disjointed transitions. To resolve this, we propose a novel training paradigm that leverages sparse keyframes to capture the true underlying kinematic distribution of human signing. By predicting dense motion from these discrete anchors, our approach mitigates regression-to-the-mean while ensuring fluid articulation. To realize this paradigm at scale, we first introduce FAST, an ultra-efficient sign segmentation model that automatically mines precise temporal boundaries. We then present SignSparK, a large-scale Conditional Flow Matching (CFM) framework that utilizes these extracted anchors to synthesize 3D signing sequences in SMPL-X and MANO spaces. This keyframe-driven formulation also uniquely unlocks Keyframe-to-Pose (KF2P) generation, making precise spatiotemporal editing of signing sequences possible. Furthermore, our adopted reconstruction-based CFM objective also enables high-fidelity synthesis in fewer than ten sampling steps; this allows SignSparK to scale across four distinct sign languages, establishing the largest multilingual SLP framework to date. Finally, by integrating 3D Gaussian Splatting for photorealistic rendering, we demonstrate through extensive evaluation that SignSparK establishes a new state-of-the-art across diverse SLP tasks and multilingual benchmarks.

[151] LCAMV: High-Accuracy 3D Reconstruction of Color-Varying Objects Using LCA Correction and Minimum-Variance Fusion in Structured Light

Wonbeen Oh, Jae-Sang Hyun

Main category: cs.CV

TL;DR: LCAMV is a 3D reconstruction method that corrects lateral chromatic aberration and fuses multi-channel phase data using minimum-variance estimation for accurate colored object reconstruction with structured light.

Details

Motivation: Accurate 3D reconstruction of colored objects with structured light is hindered by lateral chromatic aberration in optical components and uneven noise across RGB channels, requiring a solution that works with standard hardware.

Method: LCAMV analytically models and pixel-wise compensates lateral chromatic aberration in both projector and camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation with a single projector-camera pair.

Result: Experiments on planar and non-planar colored surfaces show LCAMV outperforms grayscale conversion and conventional channel-weighting, reducing depth error by up to 43.6%.

Conclusion: LCAMV establishes an effective solution for high-precision 3D reconstruction of nonuniformly colored objects without requiring extra hardware or multiple exposures.

Abstract: Accurate 3D reconstruction of colored objects with structured light (SL) is hindered by lateral chromatic aberration (LCA) in optical components and uneven noise characteristics across RGB channels. This paper introduces lateral chromatic aberration correction and minimum-variance fusion (LCAMV), a robust 3D reconstruction method that operates with a single projector-camera pair without additional hardware or acquisition constraints. LCAMV analytically models and pixel-wise compensates LCA in both the projector and camera, then adaptively fuses multi-channel phase data using a Poisson-Gaussian noise model and minimum-variance estimation. Unlike existing methods that require extra hardware or multiple exposures, LCAMV enables fast acquisition. Experiments on planar and non-planar colored surfaces show that LCAMV outperforms grayscale conversion and conventional channel-weighting, reducing depth error by up to 43.6%. These results establish LCAMV as an effective solution for high-precision 3D reconstruction of nonuniformly colored objects.

[152] Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning

Yushuo Zheng, Huiyu Duan, Zicheng Zhang, Xiaohong Liu, Xiongkuo Min

Main category: cs.CV

TL;DR: WanderBench is a global geolocation benchmark with 32K+ panoramas for embodied geolocation reasoning, and GeoAoT is a framework that combines reasoning with actionable plans for interactive geolocation.

Details

Motivation: Current large multimodal models have strong world knowledge and reasoning capabilities, but their performance on geolocation tasks remains unexplored. The authors aim to transform geolocation from static recognition into interactive exploration through embodied scenarios.

Method: 1) Created WanderBench benchmark with over 32K panoramas across six continents organized as navigable graphs enabling rotation and movement actions. 2) Proposed GeoAoT framework that generates actionable plans (approaching landmarks, adjusting viewpoints) instead of textual reasoning chains to actively reduce uncertainty. 3) Established evaluation protocol measuring both geolocation accuracy and difficulty-aware questioning ability.

Result: Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments compared to existing approaches.

Conclusion: WanderBench and GeoAoT define a new paradigm for actionable, reasoning-driven geolocation in embodied visual understanding, transforming geolocation from static recognition to interactive exploration.

Abstract: Geolocation, the task of identifying the geographic location of an image, requires abundant world knowledge and complex reasoning abilities. Though advanced large multimodal models (LMMs) have shown superior aforementioned capabilities, their performance on the geolocation task remains unexplored. To this end, we introduce \textbf{WanderBench}, the first open access global geolocation benchmark designed for actionable geolocation reasoning in embodied scenarios. WanderBench contains over 32K panoramas across six continents, organized as navigable graphs that enable physical actions such as rotation and movement, transforming geolocation from static recognition into interactive exploration. Building on this foundation, we propose \textbf{GeoAoT} (Action of Thought), a \underline{Geo}location framework with \underline{A}ction of \underline{T}hough, which couples reasoning with embodied actions. Instead of generating textual reasoning chains, GeoAoT produces actionable plans such as, approaching landmarks or adjusting viewpoints, to actively reduce uncertainty. We further establish an evaluation protocol that jointly measures geolocation accuracy and difficulty-aware geolocation questioning ability. Experiments on 19 large multimodal models show that GeoAoT achieves superior fine-grained localization and stronger generalization in dynamic environments. WanderBench and GeoAoT define a new paradigm for actionable, reasoning driven geolocation in embodied visual understanding.

[153] UniPINN: A Unified PINN Framework for Multi-task Learning of Diverse Navier-Stokes Equations

Dengdi Sun, Jie Chen, Xiao Wang, Jin Tang

Main category: cs.CV

TL;DR: UniPINN: A unified multi-flow Physics-Informed Neural Network framework that addresses challenges in extending PINNs from single-flow to multi-flow scenarios through shared-specialized architecture, cross-flow attention, and dynamic weight allocation.

Details

Motivation: Existing PINNs are designed for single-flow settings and face three key challenges when extended to multi-flow scenarios: difficulty capturing both shared physics and flow-specific characteristics, susceptibility to negative transfer, and unstable training from disparate loss magnitudes across heterogeneous flows.

Method: Proposes UniPINN with three components: 1) shared-specialized architecture disentangling universal physical laws from flow-specific features, 2) cross-flow attention mechanism selectively reinforcing relevant patterns while suppressing task-irrelevant interference, and 3) dynamic weight allocation strategy adaptively balancing loss contributions.

Result: Extensive experiments on three canonical flows demonstrate UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer.

Conclusion: UniPINN provides a unified framework for multi-flow PINNs that addresses key challenges in simultaneous multi-flow learning, offering improved accuracy and stability compared to existing single-flow approaches extended to multi-flow settings.

Abstract: Physics-Informed Neural Networks (PINNs) have shown promise in solving incompressible Navier-Stokes equations, yet existing approaches are predominantly designed for single-flow settings. When extended to multi-flow scenarios, these methods face three key challenges: (1) difficulty in simultaneously capturing both shared physical principles and flow-specific characteristics, (2) susceptibility to inter-task negative transfer that degrades prediction accuracy, and (3) unstable training dynamics caused by disparate loss magnitudes across heterogeneous flow regimes. To address these limitations, we propose UniPINN, a unified multi-flow PINN framework that integrates three complementary components: a shared-specialized architecture that disentangles universal physical laws from flow-specific features, a cross-flow attention mechanism that selectively reinforces relevant patterns while suppressing task-irrelevant interference, and a dynamic weight allocation strategy that adaptively balances loss contributions to stabilize multi-objective optimization. Extensive experiments on three canonical flows demonstrate that UniPINN effectively unifies multi-flow learning, achieving superior prediction accuracy and balanced performance across heterogeneous regimes while successfully mitigating negative transfer. The source code of this paper will be released on https://github.com/Event-AHU/OpenFusion

[154] Fighting Hallucinations with Counterfactuals: Diffusion-Guided Perturbations for LVLM Hallucination Suppression

Hamidreza Dastmalchi, Aijun An, Ali Cheraghian, Hamed Barzamini

Main category: cs.CV

TL;DR: CIPHER is a training-free method that reduces hallucinations in large vision-language models by identifying and correcting vision-induced hallucination patterns through counterfactual image perturbations.

Details

Motivation: Large vision-language models frequently generate hallucinations (unfaithful outputs misaligned with visual input), with existing training-free methods focusing mainly on text-induced hallucinations while neglecting vision-induced ones.

Method: Two-phase approach: 1) Offline phase creates OHC-25K counterfactual dataset with diffusion-edited images contradicting original captions to extract hallucination representations, 2) Inference phase projects intermediate hidden states away from identified hallucination subspace.

Result: CIPHER significantly reduces hallucination rates across multiple benchmarks while preserving task performance, demonstrating effectiveness of counterfactual visual perturbations for improving LVLM faithfulness.

Conclusion: Vision-induced hallucinations can be systematically characterized and suppressed via lightweight feature-level correction without retraining, offering a practical solution for improving LVLM reliability.

Abstract: While large vision-language models (LVLMs) achieve strong performance on multimodal tasks, they frequently generate hallucinations – unfaithful outputs misaligned with the visual input. To address this issue, we introduce CIPHER (Counterfactual Image Perturbations for Hallucination Extraction and Removal), a training-free method that suppresses vision-induced hallucinations via lightweight feature-level correction. Unlike prior training-free approaches that primarily focus on text-induced hallucinations, CIPHER explicitly targets hallucinations arising from the visual modality. CIPHER operates in two phases. In the offline phase, we construct OHC-25K (Object-Hallucinated Counterfactuals, 25,000 samples), a counterfactual dataset consisting of diffusion-edited images that intentionally contradict the original ground-truth captions. We pair these edited images with the unchanged ground-truth captions and process them through an LVLM to extract hallucination-related representations. Contrasting these representations with those from authentic (image, caption) pairs reveals structured, systematic shifts spanning a low-rank subspace characterizing vision-induced hallucination. In the inference phase, CIPHER suppresses hallucinations by projecting intermediate hidden states away from this subspace. Experiments across multiple benchmarks show that CIPHER significantly reduces hallucination rates while preserving task performance, demonstrating the effectiveness of counterfactual visual perturbations for improving LVLM faithfulness. Code and additional materials are available at https://hamidreza-dastmalchi.github.io/cipher-cvpr2026/.

[155] StructDamage:A Large Scale Unified Crack and Surface Defect Dataset for Robust Structural Damage Detection

Misbah Ijaz, Saif Ur Rehman Khan, Abd Ur Rehman, Sebastian Vollmer, Andreas Dengel, Muhammad Nabeel Asim

Main category: cs.CV

TL;DR: StructDamage dataset: A comprehensive collection of 78,093 crack images across 9 surface types, curated from 32 public datasets, with baseline classification results showing high accuracy using various DL architectures.

Details

Motivation: Existing crack detection datasets lack geographic diversity, surface type variety, scale consistency, and labeling standards, limiting the generalization of trained algorithms in real-world conditions. There's a need for a comprehensive, well-curated dataset to support robust crack damage detection research.

Method: Systematically aggregated, harmonized, and reannotated images from 32 publicly available datasets covering various structures. Organized images in folder-level classification hierarchy suitable for CNNs and Vision Transformers. Evaluated with 15 DL architectures from 6 model families.

Result: Baseline classification results show 12 models achieving macro F1-scores over 0.96. Best performing model DenseNet201 achieves 98.62% accuracy. Dataset contains approximately 78,093 images spanning 9 surface types: walls, tile, stone, road, pavement, deck, concrete, and brick.

Conclusion: StructDamage provides a comprehensive, versatile resource for crack damage detection research with thorough documentation and standard structure to promote reproducible research and support development of robust detection approaches.

Abstract: Automated detection and classification of structural cracks and surface defects is a critical challenge in civil engineering, infrastructure maintenance, and heritage preservation. Recent advances in Computer Vision (CV) and Deep Learning (DL) have significantly improved automatic crack detection. However, these methods rely heavily on large, diverse, and carefully curated datasets that include various crack types across different surface materials. Many existing public crack datasets lack geographic diversity, surface types, scale, and labeling consistency, making it challenging for trained algorithms to generalize effectively in real world conditions. We provide a novel dataset, StructDamage, a curated collection of approximately 78,093 images spanning nine surface types: walls, tile, stone, road, pavement, deck, concrete, and brick. The dataset was constructed by systematically aggregating, harmonizing, and reannotating images from 32 publicly available datasets covering concrete structures, asphalt pavements, masonry walls, bridges, and historic buildings. All images are organized in a folder level classification hierarchy suitable for training Convolutional Neural Networks (CNNs) and Vision Transformers. To highlight the practical value of the dataset, we present baseline classification results using fifteen DL architectures from six model families, with twelve achieving macro F1-scores over 0.96. The best performing model DenseNet201 achieves 98.62% accuracy. The proposed dataset provides a comprehensive and versatile resource suitable for classification tasks. With thorough documentation and a standard structure, it is designed to promote reproducible research and support the development and fair evaluation of robust crack damage detection approaches.

[156] Spatial self-supervised Peak Learning and correlation-based Evaluation of peak picking in Mass Spectrometry Imaging

Philipp Weigand, Nikolas Ebert, Shad A. Mohammed, Denis Abu Sammour, Carsten Hopf, Oliver Wasenmüller

Main category: cs.CV

TL;DR: Autoencoder-based spatial self-supervised neural network for peak picking in mass spectrometry imaging that learns attention masks using both spatial and spectral information, evaluated with expert-annotated segmentation masks.

Details

Motivation: Mass spectrometry imaging generates large complex datasets requiring effective peak picking to reduce data size while preserving biological information. Existing approaches perform inconsistently across heterogeneous datasets and are often evaluated on synthetic data or manually selected ion images that don't represent real-world challenges.

Method: Proposes an autoencoder-based spatial self-supervised peak learning neural network that selects spatially structured peaks by learning an attention mask leveraging both spatial and spectral information. Also introduces an evaluation procedure based on expert-annotated segmentation masks for more representative assessment.

Result: The approach consistently outperforms state-of-the-art peak picking methods on four diverse public MSI datasets by selecting spatially structured peaks. The evaluation procedure provides a consistent framework for comparing spatially structured peak picking methods across different datasets.

Conclusion: The spatial self-supervised network demonstrates efficacy in selecting spatially structured peaks, and the evaluation procedure offers a robust framework for comparing peak picking methods across diverse MSI datasets.

Abstract: Mass spectrometry imaging (MSI) enables label-free visualization of molecular distributions across tissue samples but generates large and complex datasets that require effective peak picking to reduce data size while preserving meaningful biological information. Existing peak picking approaches perform inconsistently across heterogeneous datasets, and their evaluation is often limited to synthetic data or manually selected ion images that do not fully represent real-world challenges in MSI. To address these limitations, we propose an autoencoder-based spatial self-supervised peak learning neural network that selects spatially structured peaks by learning an attention mask leveraging both spatial and spectral information. We further introduce an evaluation procedure based on expert-annotated segmentation masks, allowing a more representative and spatially grounded assessment of peak picking performance. We evaluate our approach on four diverse public MSI datasets using our proposed evaluation procedure. Our approach consistently outperforms state-of-the-art peak picking methods by selecting spatially structured peaks, thus demonstrating its efficacy. These results highlight the value of our spatial self-supervised network in comparison to contemporary state-of-the-art methods. The evaluation procedure can be readily applied to new MSI datasets, thereby providing a consistent and robust framework for the comparison of spatially structured peak picking methods across different datasets.

Jiahao Lyu, Pei Fu, Zhenhang Li, Weichao Zeng, Shaojie Zhan, Jiahui Yang, Can Ma, Yu Zhou, Zhenbo Luo, Jian Luan

Main category: cs.CV

TL;DR: IMTBench is a new benchmark for end-to-end in-image machine translation with 2,500 samples across 4 scenarios and 9 languages, addressing limitations of synthetic datasets and introducing cross-modal evaluation metrics.

Details

Motivation: Existing IIMT benchmarks are largely synthetic and fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs.

Method: Created IMTBench with 2,500 image translation samples covering four practical scenarios and nine languages, supporting multi-aspect evaluation including translation quality, background preservation, overall image quality, and a cross-modal alignment score.

Result: Benchmarking revealed large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for improvement in end-to-end image text translation.

Conclusion: IMTBench establishes a standardized benchmark to accelerate progress in in-image machine translation by addressing real-world complexity and introducing comprehensive cross-modal evaluation.

Abstract: End-to-end In-Image Machine Translation (IIMT) aims to convert text embedded within an image into a target language while preserving the original visual context, layout, and rendering style. However, existing IIMT benchmarks are largely synthetic and thus fail to reflect real-world complexity, while current evaluation protocols focus on single-modality metrics and overlook cross-modal faithfulness between rendered text and model outputs. To address these shortcomings, we present In-image Machine Translation Benchmark (IMTBench), a new benchmark of 2,500 image translation samples covering four practical scenarios and nine languages. IMTBench supports multi-aspect evaluation, including translation quality, background preservation, overall image quality, and a cross-modal alignment score that measures consistency between the translated text produced by the model and the text rendered in the translated image. We benchmark strong commercial cascade systems, and both closed- and open-source unified multi-modal models, and observe large performance gaps across scenarios and languages, especially on natural scenes and resource-limited languages, highlighting substantial headroom for end-to-end image text translation. We hope IMTBench establishes a standardized benchmark to accelerate progress in this emerging task.

[158] UHD Image Deblurring via Autoregressive Flow with Ill-conditioned Constraints

Yucheng Xin, Dawei Zhao, Xiang Chen, Chen Wu, Pu Wang, Dianjie Lu, Guijuan Zhang, Xiuyi Jia, Zhuoran Zheng

Main category: cs.CV

TL;DR: Proposes an autoregressive flow method with ill-conditioned constraint for UHD image deblurring, using progressive coarse-to-fine refinement and flow matching for efficient detail recovery.

Details

Motivation: UHD image deblurring faces challenges balancing fine-grained detail recovery with practical inference efficiency. Existing methods struggle with computational cost vs. detail generation trade-offs for 4K+ resolution images.

Method: Decomposes UHD restoration into progressive coarse-to-fine process: sharp estimate formed by upsampling previous-scale result plus current-scale residual. Uses Flow Matching to model residual generation as conditional vector field with few-step ODE sampling. Introduces ill-conditioning suppression via condition-number regularization on feature-induced attention matrix.

Result: Demonstrates promising performance on blurred images at 4K (3840×2160) or higher resolutions, achieving good balance between detail recovery and computational efficiency.

Conclusion: The proposed autoregressive flow method with ill-conditioned constraint effectively addresses UHD image deblurring challenges by enabling stable, stage-wise refinement while maintaining practical inference efficiency.

Abstract: Ultra-high-definition (UHD) image deblurring poses significant challenges for UHD restoration methods, which must balance fine-grained detail recovery and practical inference efficiency. Although prominent discriminative and generative methods have achieved remarkable results, a trade-off persists between computational cost and the ability to generate fine-grained detail for UHD image deblurring tasks. To further alleviate these issues, we propose a novel autoregressive flow method for UHD image deblurring with an ill-conditioned constraint. Our core idea is to decompose UHD restoration into a progressive, coarse-to-fine process: at each scale, the sharp estimate is formed by upsampling the previous-scale result and adding a current-scale residual, enabling stable, stage-wise refinement from low to high resolution. We further introduce Flow Matching to model residual generation as a conditional vector field and perform few-step ODE sampling with efficient Euler/Heun solvers, enriching details while keeping inference affordable. Since multi-step generation at UHD can be numerically unstable, we propose an ill-conditioning suppression scheme by imposing condition-number regularization on a feature-induced attention matrix, improving convergence and cross-scale consistency. Our method demonstrates promising performance on blurred images at 4K (3840$\times$2160) or higher resolutions.

[159] Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

Xin Huang, Junjie Liang, Qingshan Hou, Peng Cao, Jinzhu Yang, Xiaoli Liu, Osmar R. Zaiane

Main category: cs.CV

TL;DR: A visually-guided text disentanglement framework for medical image synthesis that addresses modality gaps between clinical text and visual details, improving generation quality and downstream task performance.

Details

Motivation: Medical image synthesis faces challenges due to modality gaps between complex visual details and abstract clinical text, plus semantic entanglement where text embeddings blur anatomical structures and imaging styles, weakening generation controllability.

Method: Proposes Visually-Guided Text Disentanglement framework with cross-modal latent alignment to disentangle unstructured text into independent semantic representations, and Hybrid Feature Fusion Module (HFFM) to inject features into Diffusion Transformer (DiT) through separated channels.

Result: Outperforms existing approaches on three datasets in generation quality and significantly improves performance on downstream classification tasks.

Conclusion: The framework effectively addresses modality gaps and semantic entanglement in medical image synthesis, enabling fine-grained structural control and improving both generation quality and downstream task utility.

Abstract: Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists, where coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, thus weakening controllability during generation. To address this, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results in three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.

[160] Sparse Task Vector Mixup with Hypernetworks for Efficient Knowledge Transfer in Whole-Slide Image Prognosis

Pei Liu, Xiangxiang Zeng, Tengfei Ma, Yucheng Xing, Xuanbai Ren, Yiping Liu

Main category: cs.CV

TL;DR: STEPH: A model merging approach that efficiently transfers prognostic knowledge across different cancer types in whole-slide image analysis without large-scale joint training or multi-model inference.

Details

Motivation: Current cancer prognosis models using whole-slide images are cancer-specific and suffer from limited training data, leading to poor generalization on heterogeneous tumor samples. While multi-cancer joint learning exists, it's computationally expensive.

Method: Proposes Sparse Task Vector Mixup with Hypernetworks (STEPH): 1) Applies task vector mixup to each source-target cancer pair, 2) Uses hypernetworks to sparsely aggregate task vector mixtures to obtain improved target models via efficient model merging.

Result: Extensive experiments on 13 cancer datasets show STEPH improves over cancer-specific learning by 5.14% and existing knowledge transfer baseline by 2.01%, while being more computationally efficient.

Conclusion: STEPH provides an efficient solution for learning prognostic knowledge from other cancers without requiring large-scale joint training or extensive multi-model inference, addressing data scarcity in pathology.

Abstract: Whole-Slide Images (WSIs) are widely used for estimating the prognosis of cancer patients. Current studies generally follow a cancer-specific learning paradigm. However, the available training samples for one cancer type are usually scarce in pathology. Consequently, the model often struggles to learn generalizable knowledge, thus performing worse on the tumor samples with inherent high heterogeneity. Although multi-cancer joint learning and knowledge transfer approaches have been explored recently to address it, they either rely on large-scale joint training or extensive inference across multiple models, posing new challenges in computational efficiency. To this end, this paper proposes a new scheme, Sparse Task Vector Mixup with Hypernetworks (STEPH). Unlike previous ones, it efficiently absorbs generalizable knowledge from other cancers for the target via model merging: i) applying task vector mixup to each source-target pair and then ii) sparsely aggregating task vector mixtures to obtain an improved target model, driven by hypernetworks. Extensive experiments on 13 cancer datasets show that STEPH improves over cancer-specific learning and an existing knowledge transfer baseline by 5.14% and 2.01%, respectively. Moreover, it is a more efficient solution for learning prognostic knowledge from other cancers, without requiring large-scale joint training or extensive multi-model inference. Code is publicly available at https://github.com/liupei101/STEPH.

[161] DSFlash: Comprehensive Panoptic Scene Graph Generation in Realtime

Julian Lorenz, Vladyslav Kovganko, Elias Kohout, Mrunmai Phatak, Daniel Kienzle, Rainer Lienhart

Main category: cs.CV

TL;DR: DSFlash is a low-latency panoptic scene graph generation model that achieves 56 FPS on RTX 3090 while maintaining SOTA performance and requiring minimal training resources.

Details

Motivation: Existing Scene Graph Generation (SGG) models lack efficiency for real-world deployment on resource-constrained edge devices, especially for video streams. There's a need for fast, resource-efficient SGG that provides comprehensive scene graphs rather than just salient relationships.

Method: DSFlash introduces an efficient architecture for panoptic scene graph generation that optimizes for low latency while maintaining comprehensive relationship extraction. The model is designed to be lightweight and resource-efficient during both inference and training.

Result: DSFlash achieves 56 FPS on RTX 3090 GPU, processes video streams in real-time, maintains state-of-the-art performance, and requires less than 24 hours to train on a single GTX 1080 GPU.

Conclusion: DSFlash enables practical deployment of scene graph generation on resource-constrained devices, making SGG accessible to researchers with limited computational resources while providing comprehensive scene understanding for downstream applications.

Abstract: Scene Graph Generation (SGG) aims to extract a detailed graph structure from an image, a representation that holds significant promise as a robust intermediate step for complex downstream tasks like reasoning for embodied agents. However, practical deployment in real-world applications - especially on resource constrained edge devices - requires speed and resource efficiency, challenges that have received limited attention in existing research. To bridge this gap, we introduce DSFlash, a low-latency model for panoptic scene graph generation designed to overcome these limitations. DSFlash can process a video stream at 56 frames per second on a standard RTX 3090 GPU, without compromising performance against existing state-of-the-art methods. Crucially, unlike prior approaches that often restrict themselves to salient relationships, DSFlash computes comprehensive scene graphs, offering richer contextual information while maintaining its superior latency. Furthermore, DSFlash is light on resources, requiring less than 24 hours to train on a single, nine-year-old GTX 1080 GPU. This accessibility makes DSFlash particularly well-suited for researchers and practitioners operating with limited computational resources, empowering them to adapt and fine-tune SGG models for specialized applications.

[162] Prompting with the human-touch: evaluating model-sensitivity of foundation models for musculoskeletal CT segmentation

Caroline Magg, Maaike A. ter Wee, Johannes G. G. Dobbe, Geert J. Streekstra, Leendert Blankevoort, Clara I. Sánchez, Hoel Kervadec

Main category: cs.CV

TL;DR: Benchmark study comparing 11 promptable foundation models for medical image segmentation using 2D/3D prompting strategies on bone/implant datasets, finding significant performance variation and sensitivity to human prompts.

Details

Motivation: The proliferation of promptable foundation models for medical image segmentation has created challenges in comparing performance and selecting optimal models for clinical tasks due to varying evaluations across datasets, metrics, and compared models.

Method: Tested 11 promptable foundation models using non-iterative 2D and 3D prompting strategies on private and public datasets focusing on bone and implant segmentation in four anatomical regions. Used Pareto-optimal analysis and human prompts collected through observer studies to evaluate model performance.

Result: 1) Significant performance variation between models and prompting strategies; 2) Pareto-optimal models: SAM and SAM2.1 in 2D, nnInteractive and Med-SAM2 in 3D; 3) Localization accuracy and rater consistency vary by anatomical complexity; 4) Performance drops with human prompts vs ideal prompts; 5) All models sensitive to prompt variations with limited robustness.

Conclusion: Selecting optimal foundation models for human-driven medical image segmentation remains challenging due to sensitivity to human input prompts, even among high-performing models, suggesting performance reported on ideal prompts may overestimate real-world clinical utility.

Abstract: Promptable Foundation Models (FMs), initially introduced for natural image segmentation, have also revolutionized medical image segmentation. The increasing number of models, along with evaluations varying in datasets, metrics, and compared models, makes direct performance comparison between models difficult and complicates the selection of the most suitable model for specific clinical tasks. In our study, 11 promptable FMs are tested using non-iterative 2D and 3D prompting strategies on a private and public dataset focusing on bone and implant segmentation in four anatomical regions (wrist, shoulder, hip and lower leg). The Pareto-optimal models are identified and further analyzed using human prompts collected through a dedicated observer study. Our findings are: 1) The segmentation performance varies a lot between FMs and prompting strategies; 2) The Pareto-optimal models in 2D are SAM and SAM2.1, in 3D nnInteractive and Med-SAM2; 3) Localization accuracy and rater consistency vary with anatomical structures, with higher consistency for simple structures (wrist bones) and lower consistency for complex structures (pelvis, tibia, implants); 4) The segmentation performance drops using human prompts, suggesting that performance reported on “ideal” prompts extracted from reference labels might overestimate the performance in a human-driven setting; 5) All models were sensitive to prompt variations. While two models demonstrated intra-rater robustness, it did not scale to inter-rater settings. We conclude that the selection of the most optimal FM for a human-driven setting remains challenging, with even high-performing FMs being sensitive to variations in human input prompts. Our code base for prompt extraction and model inference is available: https://github.com/CarolineMagg/segmentation-FM-benchmark/

[163] Towards Cognitive Defect Analysis in Active Infrared Thermography with Vision-Text Cues

Mohammed Salah, Eman Ouda, Giuseppe Dell’Avvocato, Fabrizio Sarasini, Ester D’Accardi, Jorge Dias, Davor Svetinovic, Stefano Sfarra, Yusra Abdulrahman

Main category: cs.CV

TL;DR: A novel language-guided framework using vision-language models for zero-shot defect detection in carbon fiber composites via active infrared thermography, eliminating need for extensive training datasets.

Details

Motivation: AI-based active infrared thermography for CFRP inspection requires expensive, time-consuming datasets. The paper aims to overcome this by leveraging pretrained vision-language models for zero-shot defect analysis without extensive training.

Method: Proposes a language-guided framework using pretrained multimodal VLM encoders (GroundingDINO, Qwen-VL-Chat, CogVLM) with a lightweight AIRT-VLM Adapter to bridge domain gap between thermographic data and natural images, enabling zero-shot defect understanding and localization.

Result: The AIRT-VLM adapter achieves >10 dB SNR gains over conventional thermographic methods and enables zero-shot defect detection with IoU values reaching 70% on 25 CFRP inspection sequences with realistic defects.

Conclusion: The framework successfully demonstrates language-guided zero-shot defect analysis in CFRPs using VLMs, eliminating need for extensive training datasets while achieving competitive performance.

Abstract: Active infrared thermography (AIRT) is currently witnessing a surge of artificial intelligence (AI) methodologies being deployed for automated subsurface defect analysis of high performance carbon fiber-reinforced polymers (CFRP). Deploying AI-based AIRT methodologies for inspecting CFRPs requires the creation of time consuming and expensive datasets of CFRP inspection sequences to train neural networks. To address this challenge, this work introduces a novel language-guided framework for cognitive defect analysis in CFRPs using AIRT and vision-language models (VLMs). Unlike conventional learning-based approaches, the proposed framework does not require developing training datasets for extensive training of defect detectors, instead it relies solely on pretrained multimodal VLM encoders coupled with a lightweight adapter to enable generative zero-shot understanding and localization of subsurface defects. By leveraging pretrained multimodal encoders, the proposed system enables generative zero-shot understanding of thermographic patterns and automatic detection of subsurface defects. Given the domain gap between thermographic data and natural images used to train VLMs, an AIRT-VLM Adapter is proposed to enhance the visibility of defects while aligning the thermographic domain with the learned representations of VLMs. The proposed framework is validated using three representative VLMs; specifically, GroundingDINO, Qwen-VL-Chat, and CogVLM. Validation is performed on 25 CFRP inspection sequences with impacts introduced at different energy levels, reflecting realistic defects encountered in industrial scenarios. Experimental results demonstrate that the AIRT-VLM adapter achieves signal-to-noise ratio (SNR) gains exceeding 10 dB compared with conventional thermographic dimensionality-reduction methods, while enabling zero-shot defect detection with intersection-over-union values reaching 70%.

[164] PET-F2I: A Comprehensive Benchmark and Parameter-Efficient Fine-Tuning of LLMs for PET/CT Report Impression Generation

Yuchen Liu, Wenbo Zhang, Liling Peng, Yichi Zhang, Yu Fu, Xin Guo, Chao Qu, Yuan Qi, Le Xue

Main category: cs.CV

TL;DR: A benchmark and domain-adapted model for generating diagnostic impressions from PET/CT findings using LLMs, with clinical evaluation metrics.

Details

Motivation: PET/CT imaging is crucial in oncology but summarizing complex findings into diagnostic impressions is labor-intensive. While LLMs show promise in medical text generation, their capability in the specialized PET/CT domain remains underexplored.

Method: Created PET-F2I-41K benchmark with 41k real-world reports, evaluated 27 models (frontier LLMs, open-source generalists, medical-domain LLMs), and developed PET-F2I-7B by fine-tuning Qwen2.5-7B-Instruct via LoRA. Introduced three clinical metrics: Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR).

Result: Neither frontier nor medical-domain LLMs performed adequately in zero-shot settings. PET-F2I-7B achieved substantial gains (0.708 BLEU-4) and 3.0x improvement in entity coverage over strongest baseline, with advantages in cost, latency, and privacy.

Conclusion: Domain adaptation is crucial for specialized medical tasks like PET/CT impression generation. PET-F2I-41K provides a standardized evaluation framework to accelerate development of clinically deployable reporting systems.

Abstract: PET/CT imaging is pivotal in oncology and nuclear medicine, yet summarizing complex findings into precise diagnostic impressions is labor-intensive. While LLMs have shown promise in medical text generation, their capability in the highly specialized domain of PET/CT remains underexplored. We introduce PET-F2I-41K (PET Findings-to-Impression Benchmark), a large-scale benchmark for PET/CT impression generation using LLMs, constructed from over 41k real-world reports. Using PET-F2I-41K, we conduct a comprehensive evaluation of 27 models across proprietary frontier LLMs, open-source generalist models, and medical-domain LLMs, and we develop a domain-adapted 7B model (PET-F2I-7B) fine-tuned from Qwen2.5-7B-Instruct via LoRA. Beyond standard NLG metrics (e.g., BLEU-4, ROUGE-L, BERTScore), we propose three clinically grounded metrics - Entity Coverage Rate (ECR), Uncovered Entity Rate (UER), and Factual Consistency Rate (FCR) - to assess diagnostic completeness and factual reliability. Experiments reveal that neither frontier nor medical-domain LLMs perform adequately in zero-shot settings. In contrast, PET-F2I-7B achieves substantial gains (e.g., 0.708 BLEU-4) and a 3.0x improvement in entity coverage over the strongest baseline, while offering advantages in cost, latency, and privacy. Beyond this modeling contribution, PET-F2I-41K establishes a standardized evaluation framework to accelerate the development of reliable and clinically deployable reporting systems for PET/CT.

[165] UniStitch: Unifying Semantic and Geometric Features for Image Stitching

Yuan Mei, Lang Nie, Kang Liao, Yunqiu Xu, Chunyu Lin, Bin Xiao

Main category: cs.CV

TL;DR: UniStitch unifies traditional geometric features and modern semantic features for image stitching through a Neural Point Transformer and Adaptive Mixture of Experts, achieving state-of-the-art performance.

Details

Motivation: Traditional image stitching uses hand-crafted geometric features while recent learning-based methods use semantic features, but these approaches have evolved separately without meaningful convergence. The paper aims to bridge this gap by creating a unified framework that leverages both geometric and semantic features for better image stitching.

Method: Proposes UniStitch framework with two key modules: 1) Neural Point Transformer (NPT) transforms sparse geometric keypoints into dense semantic maps, aligning discrete geometric features with continuous semantic feature maps; 2) Adaptive Mixture of Experts (AMoE) dynamically fuses geometric and semantic representations, shifting focus toward more reliable features during fusion.

Result: UniStitch significantly outperforms existing state-of-the-art methods by a large margin, demonstrating that the unified approach combining geometric and semantic features delivers substantial performance gains over using either modality alone.

Conclusion: The paper successfully bridges the gap between traditional geometric and modern semantic approaches to image stitching, paving the way for a unified paradigm that leverages the strengths of both feature types for superior stitching performance.

Abstract: Traditional image stitching methods estimate warps from hand-crafted geometric features, whereas recent learning-based solutions leverage semantic features from neural networks instead. These two lines of research have largely diverged along separate evolution, with virtually no meaningful convergence to date. In this paper, we take a pioneering step to bridge this gap by unifying semantic and geometric features with UniStitch, a unified image stitching framework from multimodal features. To align discrete geometric features (i.e., keypoint) with continuous semantic feature maps, we present a Neural Point Transformer (NPT) module, which transforms unordered, sparse 1D geometric keypoints into ordered, dense 2D semantic maps. Then, to integrate the advantages of both representations, an Adaptive Mixture of Experts (AMoE) module is designed to fuse geometric and semantic representations. It dynamically shifts focus toward more reliable features during the fusion process, allowing the model to handle complex scenes, especially when either modality might be compromised. The fused representation can be adopted into common deep stitching pipelines, delivering significant performance gains over any single feature. Experiments show that UniStitch outperforms existing state-of-the-art methods with a large margin, paving the way for a unified paradigm between traditional and learning-based image stitching.

[166] R4-CGQA: Retrieval-based Vision Language Models for Computer Graphics Image Quality Assessment

Zhuangzi Li, Jian Jin, Shilv Cai, Weisi Lin

Main category: cs.CV

TL;DR: The paper introduces a comprehensive CG quality assessment framework with systematic perceptual dimensions, creates a dataset with quality descriptions, and proposes a retrieval-augmented generation method to enhance VLMs’ CG quality evaluation capabilities.

Details

Motivation: Current CG quality assessment faces two challenges: lack of systematic quality descriptions in existing datasets, and inability of existing methods to provide text-based explanations for CG quality judgments.

Method: 1) Identifies six key perceptual dimensions of CG quality from user perspective; 2) Constructs dataset of 3500 CG images with quality descriptions covering style, content, and perceived quality; 3) Builds question-answer benchmarks; 4) Proposes two-stream retrieval framework using retrieval-augmented generation to enhance VLM performance.

Result: Current VLMs are insufficiently accurate for fine-grained CG quality assessment, but descriptions of visually similar images significantly improve VLM understanding. The proposed retrieval-augmented method substantially improves VLM performance on CG quality assessment tasks.

Conclusion: The paper addresses CG quality assessment limitations by providing systematic quality dimensions, creating a comprehensive dataset, and developing an effective retrieval-augmented framework that significantly enhances VLMs’ ability to evaluate and explain CG quality.

Abstract: Immersive Computer Graphics (CGs) rendering has become ubiquitous in modern daily life. However, comprehensively evaluating CG quality remains challenging for two reasons: First, existing CG datasets lack systematic descriptions of rendering quality; and second existing CG quality assessment methods cannot provide reasonable text-based explanations. To address these issues, we first identify six key perceptual dimensions of CG quality from the user perspective and construct a dataset of 3500 CG images with corresponding quality descriptions. Each description covers CG style, content, and perceived quality along the selected dimensions. Furthermore, we use a subset of the dataset to build several question-answer benchmarks based on the descriptions in order to evaluate the responses of existing Vision Language Models (VLMs). We find that current VLMs are not sufficiently accurate in judging fine-grained CG quality, but that descriptions of visually similar images can significantly improve a VLM’s understanding of a given CG image. Motivated by this observation, we adopt retrieval-augmented generation and propose a two-stream retrieval framework that effectively enhances the CG quality assessment capabilities of VLMs. Experiments on several representative VLMs demonstrate that our method substantially improves their performance on CG quality assessment.

[167] Attribution as Retrieval: Model-Agnostic AI-Generated Image Attribution

Hongsong Wang, Renxi Cheng, Chaolei Han, Jie Gui

Main category: cs.CV

TL;DR: LIDA is a model-agnostic framework for AI-generated image attribution formulated as instance retrieval rather than classification, using low-bit plane fingerprints and unsupervised pre-training with few-shot adaptation.

Details

Motivation: Traditional image forensics methods struggle with increasingly realistic AI-generated images. Existing attribution methods are model-dependent, requiring access to generative models and lacking generality for new/unseen generators.

Method: Formulates attribution as instance retrieval problem. Uses Low-Bit Fingerprint Generation module to create input representations, followed by Unsupervised Pre-Training and Few-Shot Attribution Adaptation for training.

Result: Achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings, demonstrating strong generalization to unseen generators.

Conclusion: LIDA provides an effective model-agnostic solution for AI-generated image attribution that overcomes limitations of model-dependent approaches and scales well to new generators.

Abstract: With the rapid advancement of AIGC technologies, image forensics will encounter unprecedented challenges. Traditional methods are incapable of dealing with increasingly realistic images generated by rapidly evolving image generation techniques. To facilitate the identification of AI-generated images and the attribution of their source models, generative image watermarking and AI-generated image attribution have emerged as key research focuses in recent years. However, existing methods are model-dependent, requiring access to the generative models and lacking generality and scalability to new and unseen generators. To address these limitations, this work presents a new paradigm for AI-generated image attribution by formulating it as an instance retrieval problem instead of a conventional image classification problem. We propose an efficient model-agnostic framework, called Low-bIt-plane-based Deepfake Attribution (LIDA). The input to LIDA is produced by Low-Bit Fingerprint Generation module, while the training involves Unsupervised Pre-Training followed by subsequent Few-Shot Attribution Adaptation. Comprehensive experiments demonstrate that LIDA achieves state-of-the-art performance for both Deepfake detection and image attribution under zero- and few-shot settings. The code is at https://github.com/hongsong-wang/LIDA

[168] Need for Speed: Zero-Shot Depth Completion with Single-Step Diffusion

Jakub Gregorek, Paraskevas Pegios, Nando Metzger, Konrad Schindler, Theodora Kontogianni, Lazaros Nalpantidis

Main category: cs.CV

TL;DR: Marigold-SSD is a single-step depth completion framework that uses diffusion priors without test-time optimization, achieving fast inference with only 4.5 GPU days of training while maintaining strong cross-domain generalization.

Details

Motivation: Diffusion-based depth completion methods typically require costly test-time optimization, making them impractical for real-time applications. The authors aim to bridge the efficiency gap between diffusion-based and discriminative models while maintaining the benefits of strong diffusion priors.

Method: A single-step, late-fusion depth completion framework that shifts computational burden from inference to finetuning. It leverages diffusion priors but eliminates test-time optimization through efficient training strategies, enabling real-time performance.

Result: Achieves significantly faster inference with only 4.5 GPU days of training cost. Demonstrates strong cross-domain generalization and zero-shot performance across four indoor and two outdoor benchmarks. Narrows the efficiency gap between diffusion-based and discriminative models.

Conclusion: Marigold-SSD provides an efficient alternative to traditional diffusion-based depth completion methods by eliminating test-time optimization while maintaining strong performance. The framework enables practical 3D perception under real-world latency constraints.

Abstract: We introduce Marigold-SSD, a single-step, late-fusion depth completion framework that leverages strong diffusion priors while eliminating the costly test-time optimization typically associated with diffusion-based methods. By shifting computational burden from inference to finetuning, our approach enables efficient and robust 3D perception under real-world latency constraints. Marigold-SSD achieves significantly faster inference with a training cost of only 4.5 GPU days. We evaluate our method across four indoor and two outdoor benchmarks, demonstrating strong cross-domain generalization and zero-shot performance compared to existing depth completion approaches. Our approach significantly narrows the efficiency gap between diffusion-based and discriminative models. Finally, we challenge common evaluation protocols by analyzing performance under varying input sparsity levels. Page: https://dtu-pas.github.io/marigold-ssd/

[169] Layer Consistency Matters: Elegant Latent Transition Discrepancy for Generalizable Synthetic Image Detection

Yawen Yang, Feng Li, Shuqi Kong, Yunfeng Diao, Xinjian Gao, Zenglin Shi, Meng Wang

Main category: cs.CV

TL;DR: Proposes LTD (Latent Transition Discrepancy) method for detecting AI-generated synthetic images by analyzing inter-layer consistency differences in latent representations between real and synthetic images.

Details

Motivation: The increasing realism of AI-generated synthetic images poses security risks like media credibility and content manipulation. Existing detection methods suffer from poor generalization due to reliance on model-specific artifacts or low-level statistical cues.

Method: Identifies that real images maintain consistent semantic attention and structural coherence in latent representations with stable feature transitions across network layers, while synthetic ones show distinct patterns. Proposes LTD which captures inter-layer consistency differences, adaptively identifies most discriminative layers, and assesses transition discrepancies across layers.

Result: LTD exceeds base model by 14.35% in mean accuracy across three datasets containing diverse GANs and DMs. Outperforms recent state-of-the-art methods with superior detection accuracy, generalizability, and robustness.

Conclusion: LTD provides an effective approach for synthetic image detection by leveraging latent transition discrepancies, addressing generalization limitations of existing methods.

Abstract: Recent rapid advancement of generative models has significantly improved the fidelity and accessibility of AI-generated synthetic images. While enabling various innovative applications, the unprecedented realism of these synthetics makes them increasingly indistinguishable from authentic photographs, posing serious security risks, such as media credibility and content manipulation. Although extensive efforts have been dedicated to detecting synthetic images, most existing approaches suffer from poor generalization to unseen data due to their reliance on model-specific artifacts or low-level statistical cues. In this work, we identify a previously unexplored distinction that real images maintain consistent semantic attention and structural coherence in their latent representations, exhibiting more stable feature transitions across network layers, whereas synthetic ones present discernible distinct patterns. Therefore, we propose a novel approach termed latent transition discrepancy (LTD), which captures the inter-layer consistency differences of real and synthetic images. LTD adaptively identifies the most discriminative layers and assesses the transition discrepancies across layers. Benefiting from the proposed inter-layer discriminative modeling, our approach exceeds the base model by 14.35% in mean Acc across three datasets containing diverse GANs and DMs. Extensive experiments demonstrate that LTD outperforms recent state-of-the-art methods, achieving superior detection accuracy, generalizability, and robustness. The code is available at https://github.com/yywencs/LTD

[170] HyPER-GAN: Hybrid Patch-Based Image-to-Image Translation for Real-Time Photorealism Enhancement

Stefanos Pasios, Nikos Nikolaidis

Main category: cs.CV

TL;DR: HyPER-GAN is a lightweight image-to-image translation method using U-Net architecture for real-time photorealism enhancement of synthetic data, with hybrid training incorporating real-world patches to improve visual quality and semantic consistency.

Details

Motivation: Existing generative models for enhancing synthetic data photorealism often introduce visual artifacts, degrade algorithm accuracy, and require high computational resources, limiting real-time applications.

Method: Proposes HyPER-GAN with U-Net-style generator for real-time inference, trained on paired synthetic/photorealism-enhanced images plus hybrid training strategy incorporating matched patches from real-world data.

Result: Outperforms state-of-the-art paired image-to-image translation methods in inference latency, visual realism, and semantic robustness; hybrid training improves visual quality and semantic consistency.

Conclusion: HyPER-GAN provides efficient real-time photorealism enhancement for synthetic data with improved visual quality and semantic consistency through hybrid training strategy.

Abstract: Generative models are widely employed to enhance the photorealism of synthetic data for training computer vision algorithms. However, they often introduce visual artifacts that degrade the accuracy of these algorithms and require high computational resources, limiting their applicability in real-time training or evaluation scenarios. In this paper, we propose Hybrid Patch Enhanced Realism Generative Adversarial Network (HyPER-GAN), a lightweight image-to-image translation method based on a U-Net-style generator designed for real-time inference. The model is trained using paired synthetic and photorealism-enhanced images, complemented by a hybrid training strategy that incorporates matched patches from real-world data to improve visual realism and semantic consistency. Experimental results demonstrate that HyPER-GAN outperforms state-of-the-art paired image-to-image translation methods in terms of inference latency, visual realism, and semantic robustness. Moreover, it is illustrated that the proposed hybrid training strategy indeed improves visual quality and semantic consistency compared to training the model solely with paired synthetic and photorealism-enhanced images. Code and pretrained models are publicly available for download at: https://github.com/stefanos50/HyPER-GAN

[171] Splat2Real: Novel-view Scaling for Physical AI with 3D Gaussian Splatting

Hansol Lim, Jongseong Brad Choi

Main category: cs.CV

TL;DR: Splat2Real: Novel-view scaling curriculum for monocular depth pretraining using 3D Gaussian Splatting to improve robustness to viewpoint shifts in physical AI systems.

Details

Motivation: Physical AI systems face viewpoint shift between training and deployment, requiring novel-view robustness for monocular RGB-to-3D perception. Current methods lack effective strategies for selecting which novel views to add during pretraining.

Method: Cast depth pretraining as imitation learning from a digital twin oracle: student depth network imitates expert metric depth/visibility rendered from scene mesh. Use 3D Gaussian Splatting for scalable novel-view observations. Introduce CN-Coverage curriculum that greedily selects views by geometry gain and extrapolation penalty, plus quality-aware guardrail fallback.

Result: Across 20 TUM RGB-D sequences, naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to baseline policies. GOL-Gated CN-Coverage provides strongest medium-high-budget stability with lowest high-novelty tail error. Downstream control-proxy results show embodied-relevance by shifting safety/progress trade-offs under viewpoint shift.

Conclusion: Novel-view scaling curriculum (CN-Coverage) effectively improves monocular depth pretraining robustness to viewpoint shifts, with performance depending more on which views are added than raw view count.

Abstract: Physical AI faces viewpoint shift between training and deployment, and novel-view robustness is essential for monocular RGB-to-3D perception. We cast Real2Render2Real monocular depth pretraining as imitation-learning-style supervision from a digital twin oracle: a student depth network imitates expert metric depth/visibility rendered from a scene mesh, while 3DGS supplies scalable novel-view observations. We present Splat2Real, centered on novel-view scaling: performance depends more on which views are added than on raw view count. We introduce CN-Coverage, a coverage+novelty curriculum that greedily selects views by geometry gain and an extrapolation penalty, plus a quality-aware guardrail fallback for low-reliability teachers. Across 20 TUM RGB-D sequences with step-matched budgets (N=0 to 2000 additional rendered views, with N unique <= 500 and resampling for larger budgets), naive scaling is unstable; CN-Coverage mitigates worst-case regressions relative to Robot/Coverage policies, and GOL-Gated CN-Coverage provides the strongest medium-high-budget stability with the lowest high-novelty tail error. Downstream control-proxy results versus N provides embodied-relevance evidence by shifting safety/progress trade-offs under viewpoint shift.

[172] Less is More: Decoder-Free Masked Modeling for Efficient Skeleton Representation Learning

Jeonghyeok Do, Yun Chen, Geunhyuk Youk, Munchurl Kim

Main category: cs.CV

TL;DR: SLiM is a unified skeleton-based action representation learning framework that combines masked modeling with contrastive learning using a shared encoder, eliminating the need for a decoder to improve efficiency while maintaining performance.

Details

Motivation: Current skeleton-based action representation learning methods have limitations: Contrastive Learning (CL) overlooks fine-grained local details, while Masked Auto-Encoder (MAE) approaches have computationally heavy decoders and suffer from computational asymmetry between pre-training and downstream tasks.

Method: SLiM harmonizes masked modeling with contrastive learning via a shared encoder without a reconstruction decoder. It introduces semantic tube masking to prevent trivial reconstruction from high skeletal-temporal correlation, along with skeletal-aware augmentations for anatomical consistency across temporal granularities.

Result: SLiM achieves state-of-the-art performance across all downstream protocols while being exceptionally efficient, reducing inference computational cost by 7.89x compared to existing MAE methods.

Conclusion: SLiM successfully addresses the limitations of both CL and MAE approaches by creating a unified, efficient framework that combines their strengths while eliminating computational redundancy, making it the first decoder-free masked modeling framework for skeleton-based action representation learning.

Abstract: The landscape of skeleton-based action representation learning has evolved from Contrastive Learning (CL) to Masked Auto-Encoder (MAE) architectures. However, each paradigm faces inherent limitations: CL often overlooks fine-grained local details, while MAE is burdened by computationally heavy decoders. Moreover, MAE suffers from severe computational asymmetry – benefiting from efficient masking during pre-training but requiring exhaustive full-sequence processing for downstream tasks. To resolve these bottlenecks, we propose SLiM (Skeleton Less is More), a novel unified framework that harmonizes masked modeling with contrastive learning via a shared encoder. By eschewing the reconstruction decoder, SLiM not only eliminates computational redundancy but also compels the encoder to capture discriminative features directly. SLiM is the first framework with decoder-free masked modeling of representative learning. Crucially, to prevent trivial reconstruction arising from high skeletal-temporal correlation, we introduce semantic tube masking, alongside skeletal-aware augmentations designed to ensure anatomical consistency across diverse temporal granularities. Extensive experiments demonstrate that SLiM consistently achieves state-of-the-art performance across all downstream protocols. Notably, our method delivers this superior accuracy with exceptional efficiency, reducing inference computational cost by 7.89x compared to existing MAE methods.

[173] Are Video Reasoning Models Ready to Go Outside?

Yangfan He, Changgyu Boo, Jaehong Yoon

Main category: cs.CV

TL;DR: ROVA: A training framework that improves vision-language model robustness against real-world disturbances by modeling robustness-aware consistency rewards with difficulty-aware online training and self-reflective evaluation.

Details

Motivation: Vision-language models degrade substantially under real-world disturbances like weather, occlusion, and camera motion, revealing a gap between clean evaluation settings and real-world robustness.

Method: ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model’s evolving capability, continuously re-estimating sample difficulty via self-reflective evaluation with robustness-aware consistency rewards. Also introduces PVRBench benchmark for evaluating under realistic perturbations.

Result: ROVA mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared to baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R), with gains transferring to clean standard benchmarks.

Conclusion: ROVA effectively addresses the robustness gap in vision-language models under real-world disturbances through adaptive training with robustness-aware consistency rewards and difficulty-aware sampling.

Abstract: In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model’s evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.

[174] How To Embed Matters: Evaluation of EO Embedding Design Choices

Luis Gilch, Isabelle Wittmann, Maximilian Nitsche, Johannes Jakubik, Arne Ewald, Thomas Brunschwiler

Main category: cs.CV

TL;DR: Systematic analysis of embedding design choices in Geospatial Foundation Models for Earth observation workflows, examining backbone architecture, pretraining strategies, representation depth, spatial aggregation, and combination methods.

Details

Motivation: Earth observation generates massive multispectral imagery analyzed by GeoFMs. As workflows increasingly use intermediate representations as task-agnostic embeddings, understanding how representation design choices affect downstream performance and scalability is crucial for efficient embedding-based EO workflows.

Method: Systematic analysis using NeuCo-Bench to study: backbone architecture (transformers vs ResNets), pretraining strategy (self-supervised objectives), representation depth (intermediate vs final layers), spatial aggregation methods, and representation combination approaches.

Result: Transformer backbones with mean pooling provide strong default embeddings; intermediate ResNet layers can outperform final layers; self-supervised objectives show task-specific strengths; combining embeddings from different objectives improves robustness; embeddings can be aggregated into fixed-size representations 500x smaller than raw input data.

Conclusion: Embedding design choices significantly impact GeoFM-based EO workflow performance and scalability. Systematic analysis reveals consistent trends that can guide practitioners in building efficient embedding-based pipelines for Earth observation tasks.

Abstract: Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.

[175] A$^2$-Edit: Precise Reference-Guided Image Editing of Arbitrary Objects and Ambiguous Masks

Huayu Zheng, Guangzhao Li, Baixuan Zhao, Siqi Luo, Hantao Jiang, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: A²-Edit is a unified inpainting framework for arbitrary object categories that enables replacing any target region with reference objects using coarse masks, supported by a large-scale multi-category dataset and novel architectural components.

Details

Motivation: Existing inpainting methods suffer from severe homogenization and limited category coverage in datasets, making them inadequate for arbitrary object editing across diverse categories. There's a need for a unified framework that can handle various object categories with only coarse masks.

Method: The approach includes: 1) Construction of UniEdit-500K dataset with 8 major categories and 209 subcategories; 2) Mixture of Transformer module for differentiated modeling of object categories through dynamic expert selection; 3) Mask Annealing Training Strategy (MATS) that progressively relaxes mask precision during training to improve robustness.

Result: Extensive experiments on benchmarks like VITON-HD and AnyInsertion show that A²-Edit consistently outperforms existing approaches across all metrics, demonstrating superior performance in arbitrary object editing tasks.

Conclusion: A²-Edit provides a new and efficient solution for arbitrary object editing, addressing the limitations of existing methods through a unified framework, large-scale diverse dataset, and novel architectural innovations that enable robust performance across various object categories.

Abstract: We propose \textbf{A$^2$-Edit}, a unified inpainting framework for arbitrary object categories, which allows users to replace any target region with a reference object using only a coarse mask. To address the issues of severe homogenization and limited category coverage in existing datasets, we construct a large-scale, multi-category dataset \textbf{UniEdit-500K}, which includes 8 major categories, 209 fine-grained subcategories, and a total of 500,104 image pairs. Such rich category diversity poses new challenges for the model, requiring it to automatically learn semantic relationships and distinctions across categories. To this end, we introduce the \textbf{Mixture of Transformer} module, which performs differentiated modeling of various object categories through dynamic expert selection, and further enhances cross-category semantic transfer and generalization through collaboration among experts. In addition, we propose a \textbf{Mask Annealing Training Strategy} (MATS) that progressively relaxes mask precision during training, reducing the model’s reliance on accurate masks and improving robustness across diverse editing tasks. Extensive experiments on benchmarks such as VITON-HD and AnyInsertion demonstrate that A$^2$-Edit consistently outperforms existing approaches across all metrics, providing a new and efficient solution for arbitrary object editing.

[176] Bioinspired CNNs for border completion in occluded images

Catarina P. Coutinho, Aneeqa Merhab, Janko Petkovic, Ferdinando Zanchetta, Rita Fioresi

Main category: cs.CV

TL;DR: BorderNet uses mathematical modeling of visual cortex border completion to design CNN filters that improve robustness to image occlusions, showing performance gains on occluded MNIST, Fashion-MNIST, and EMNIST datasets.

Details

Motivation: The paper aims to improve CNN robustness to image occlusions by drawing inspiration from how the visual cortex handles border completion, addressing a common real-world problem where objects are partially occluded.

Method: The authors design specialized CNN filters based on mathematical modeling of border completion in the visual cortex, creating BorderNet architecture that’s evaluated on occluded versions of MNIST, Fashion-MNIST, and EMNIST datasets with stripe and grid occlusions.

Result: BorderNet demonstrates improved performance across all three datasets under both stripe and grid occlusions, with performance gains varying based on occlusion severity and dataset characteristics.

Conclusion: Biologically-inspired border completion modeling can effectively enhance CNN robustness to image occlusions, providing a promising approach for handling partial visibility in computer vision tasks.

Abstract: We exploit the mathematical modeling of the border completion problem in the visual cortex to design convolutional neural network (CNN) filters that enhance robustness to image occlusions. We evaluate our CNN architecture, BorderNet, on three occluded datasets (MNIST, Fashion-MNIST, and EMNIST) under two types of occlusions: stripes and grids. In all cases, BorderNet demonstrates improved performance, with gains varying depending on the severity of the occlusions and the dataset.

[177] RandMark: On Random Watermarking of Visual Foundation Models

Anna Chistyakova, Mikhail Pautov

Main category: cs.CV

TL;DR: Proposes a watermarking method for visual foundation models to protect intellectual property by embedding digital watermarks into internal representations of input images.

Details

Motivation: Visual foundation models are valuable assets due to high training costs, and owners need methods to protect their intellectual property when distributing models with licenses.

Method: Uses a small encoder-decoder network to embed digital watermarks into internal representations of a hold-out set of input images via random watermark embedding, making watermark statistics detectable in functional copies.

Result: The method achieves low probability of false detection for non-watermarked models and low probability of false misdetection for watermarked models, as demonstrated both theoretically and experimentally.

Conclusion: Proposes an effective ownership verification approach for visual foundation models that balances detection accuracy while protecting intellectual property rights.

Abstract: Being trained on large and diverse datasets, visual foundation models (VFMs) can be fine-tuned to achieve remarkable performance and efficiency in various downstream computer vision tasks. The high computational cost of data collection and training makes these models valuable assets, which motivates some VFM owners to distribute them alongside a license to protect their intellectual property rights. In this paper, we propose an approach to ownership verification of visual foundation models that leverages a small encoder-decoder network to embed digital watermarks into an internal representation of a hold-out set of input images. The method is based on random watermark embedding, which makes the watermark statistics detectable in functional copies of the watermarked model. Both theoretically and experimentally, we demonstrate that the proposed method yields a low probability of false detection for non-watermarked models and a low probability of false misdetection for watermarked models.

[178] UniCom: Unified Multimodal Modeling via Compressed Continuous Semantic Representations

Yaqi Zhao, Wang Lin, Zijian Zhang, Miles Yang, Jingyuan Chen, Wentao Zhang, Zhao Zhong, Liefeng Bo

Main category: cs.CV

TL;DR: UniCom is a unified multimodal framework that uses compressed continuous representations instead of discrete tokenizers to bridge visual understanding and generation, achieving SOTA performance with better semantic preservation and controllability.

Details

Motivation: Current unified multimodal models use discrete visual tokenizers that discard fine-grained semantic information, while continuous representations face challenges in high-dimensional generative modeling. There's a need for a framework that harmonizes understanding and generation without these limitations.

Method: UniCom uses compressed continuous representations with an attention-based semantic compressor that reduces channel dimension (more effective than spatial downsampling). It employs a transfusion architecture that outperforms query-based designs for convergence and consistency.

Result: UniCom achieves state-of-the-art generation performance among unified models, delivers exceptional controllability in image editing, and maintains image consistency without relying on VAE, while preserving rich semantic priors.

Conclusion: The compressed continuous representation approach effectively bridges multimodal understanding and generation, overcoming limitations of both discrete tokenizers and high-dimensional continuous modeling.

Abstract: Current unified multimodal models typically rely on discrete visual tokenizers to bridge the modality gap. However, discretization inevitably discards fine-grained semantic information, leading to suboptimal performance in visual understanding tasks. Conversely, directly modeling continuous semantic representations (e.g., CLIP, SigLIP) poses significant challenges in high-dimensional generative modeling, resulting in slow convergence and training instability. To resolve this dilemma, we introduce UniCom, a unified framework that harmonizes multimodal understanding and generation via compressed continuous representation. We empirically demonstrate that reducing channel dimension is significantly more effective than spatial downsampling for both reconstruction and generation. Accordingly, we design an attention-based semantic compressor to distill dense features into a compact unified representation. Furthermore, we validate that the transfusion architecture surpasses query-based designs in convergence and consistency. Experiments demonstrate that UniCom achieves state-of-the-art generation performance among unified models. Notably, by preserving rich semantic priors, it delivers exceptional controllability in image editing and maintains image consistency even without relying on VAE.

Rafi Ibn Sultan, Hui Zhu, Xiangyu Zhou, Chengyin Li, Prashant Khanduri, Marco Brocanelli, Dongxiao Zhu

Main category: cs.CV

TL;DR: WalkGPT is a pixel-grounded Large Vision-Language Model for accessibility navigation that generates conversational responses with segmentation masks and depth estimation for pedestrian guidance.

Details

Motivation: Existing LVLMs struggle with semantic and spatial reasoning for pedestrian navigation, suffering from object hallucinations and unreliable depth estimation, limiting their usefulness for accessibility guidance.

Method: Introduces WalkGPT with Multi-Scale Query Projector (MSQP) for hierarchical token aggregation and Calibrated Text Projector (CTP) with Region Alignment Loss for segmentation-aware representations, enabling fine-grained grounding without user cues.

Result: WalkGPT achieves strong grounded reasoning and segmentation performance on the PAVE benchmark (41k pedestrian-view images with accessibility-aware questions and depth-grounded answers).

Conclusion: WalkGPT successfully unifies language reasoning and segmentation for depth-aware accessibility guidance, advancing multimodal models for real-world navigation applications.

Abstract: Ensuring accessible pedestrian navigation requires reasoning about both semantic and spatial aspects of complex urban scenes, a challenge that existing Large Vision-Language Models (LVLMs) struggle to meet. Although these models can describe visual content, their lack of explicit grounding leads to object hallucinations and unreliable depth reasoning, limiting their usefulness for accessibility guidance. We introduce WalkGPT, a pixel-grounded LVLM for the new task of Grounded Navigation Guide, unifying language reasoning and segmentation within a single architecture for depth-aware accessibility guidance. Given a pedestrian-view image and a navigation query, WalkGPT generates a conversational response with segmentation masks that delineate accessible and harmful features, along with relative depth estimation. The model incorporates a Multi-Scale Query Projector (MSQP) that shapes the final image tokens by aggregating them along text tokens across spatial hierarchies, and a Calibrated Text Projector (CTP), guided by a proposed Region Alignment Loss, that maps language embeddings into segmentation-aware representations. These components enable fine-grained grounding and depth inference without user-provided cues or anchor points, allowing the model to generate complete and realistic navigation guidance. We also introduce PAVE, a large-scale benchmark of 41k pedestrian-view images paired with accessibility-aware questions and depth-grounded answers. Experiments show that WalkGPT achieves strong grounded reasoning and segmentation performance. The source code and dataset are available on the \href{https://sites.google.com/view/walkgpt-26/home}{project website}.

[180] UAV traffic scene understanding: A cross-spectral guided approach and a unified benchmark

Yu Zhang, Zhicheng Zhao, Ze Luo, Chenglong Li, Jin Tang

Main category: cs.CV

TL;DR: CTCNet is a cross-spectral traffic cognition network for UAV-based traffic scene understanding that combines optical and thermal imagery with traffic regulation knowledge to handle adverse conditions and complex behaviors.

Details

Motivation: Existing UAV traffic monitoring methods rely heavily on optical imagery, which degrades under adverse conditions like nighttime and fog. Current VQA models lack domain-specific regulatory knowledge needed to assess complex traffic behaviors and violations.

Method: Proposes CTCNet with two key modules: 1) Prototype-Guided Knowledge Embedding (PGKE) that uses Traffic Regulation Memory to anchor domain knowledge into visual representations, and 2) Quality-Aware Spectral Compensation (QASC) that performs bidirectional context exchange between optical and thermal modalities to compensate for degraded features.

Result: CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The authors also created Traffic-VQA, a large-scale optical-thermal infrared benchmark with 8,180 aligned image pairs and 1.3 million QA pairs across 31 diverse types.

Conclusion: The proposed CTCNet enables robust UAV traffic scene understanding by integrating cross-spectral visual information with domain-specific regulatory knowledge, effectively handling adverse conditions and complex traffic behaviors.

Abstract: Traffic scene understanding from unmanned aerial vehicle (UAV) platforms is crucial for intelligent transportation systems due to its flexible deployment and wide-area monitoring capabilities. However, existing methods face significant challenges in real-world surveillance, as their heavy reliance on optical imagery leads to severe performance degradation under adverse illumination conditions like nighttime and fog. Furthermore, current Visual Question Answering (VQA) models are restricted to elementary perception tasks, lacking the domain-specific regulatory knowledge required to assess complex traffic behaviors. To address these limitations, we propose a novel Cross-spectral Traffic Cognition Network (CTCNet) for robust UAV traffic scene understanding. Specifically, we design a Prototype-Guided Knowledge Embedding (PGKE) module that leverages high-level semantic prototypes from an external Traffic Regulation Memory (TRM) to anchor domain-specific knowledge into visual representations, enabling the model to comprehend complex behaviors and distinguish fine-grained traffic violations. Moreover, we develop a Quality-Aware Spectral Compensation (QASC) module that exploits the complementary characteristics of optical and thermal modalities to perform bidirectional context exchange, effectively compensating for degraded features to ensure robust representation in complex environments. In addition, we construct Traffic-VQA, the first large-scale optical-thermal infrared benchmark for cognitive UAV traffic understanding, comprising 8,180 aligned image pairs and 1.3 million question-answer pairs across 31 diverse types. Extensive experiments demonstrate that CTCNet significantly outperforms state-of-the-art methods in both cognition and perception scenarios. The dataset is available at https://github.com/YuZhang-2004/UAV-traffic-scene-understanding.

[181] eLasmobranc Dataset: An Image Dataset for Elasmobranch Species Recognition and Biodiversity Monitoring

Ismael Beviá-Ballesteros, Mario Jerez-Tallón, Nieves Aranda-Garrido, Isabel Abel-Abellán, Irene Antón-Linares, Jorge Azorín-López, Marcelo Saval-Calvo, Andres Fuster-Guilló, Francisca Giménez-Casalduero

Main category: cs.CV

TL;DR: A curated image dataset of 7 elasmobranch species from Spanish Mediterranean coast, designed for fine-grained species classification and biodiversity monitoring using AI.

Details

Motivation: Existing visual datasets for elasmobranchs are limited - mostly detection-oriented, underwater-acquired, or coarse-grained, restricting their use for fine-grained morphological classification needed for conservation monitoring and ISRA initiatives.

Method: Created eLasmobranc Dataset through dedicated data collection including field campaigns, collaborations with fish markets/projects, and open-access sources. Images acquired outside aquatic environment under standardized protocols for clear morphological visualization. Includes expert-validated species annotations, spatial/temporal metadata, and species-level information.

Result: Publicly available dataset of 7 ecologically relevant elasmobranch species from eastern Spanish Mediterranean coast (region with 2 ISRAs). Dataset designed to support supervised species-level classification, population studies, and AI systems for biodiversity monitoring.

Conclusion: The dataset addresses critical gap in fine-grained elasmobranch identification by combining morphological clarity, taxonomic reliability, and public accessibility, promoting reproducible research in conservation-oriented computer vision.

Abstract: Elasmobranch populations are experiencing significant global declines, and several species are currently classified as threatened. Reliable monitoring and species-level identification are essential to support conservation and spatial planning initiatives such as Important Shark and Ray Areas (ISRAs). However, existing visual datasets are predominantly detection-oriented, underwater-acquired, or limited to coarse-grained categories, restricting their applicability to fine-grained morphological classification. We present the eLasmobranc Dataset, a curated and publicly available image collection from seven ecologically relevant elasmobranch species inhabiting the eastern Spanish Mediterranean coast, a region where two ISRAs have been identified. Images were obtained through dedicated data collection, including field campaigns and collaborations with local fish markets and projects, as well as from open-access public sources. The dataset was constructed predominantly from images acquired outside the aquatic environment under standardized protocols to ensure clear visualization of diagnostic morphological traits. It integrates expert-validated species annotations, structured spatial and temporal metadata, and complementary species-level information. The eLasmobranc Dataset is specifically designed to support supervised species-level classification, population studies, and the development of artificial intelligence systems for biodiversity monitoring. By combining morphological clarity, taxonomic reliability, and public accessibility, the dataset addresses a critical gap in fine-grained elasmobranch identification and promotes reproducible research in conservation-oriented computer vision. The dataset is publicly available at https://zenodo.org/records/18549737.

[182] Just-in-Time: Training-Free Spatial Acceleration for Diffusion Transformers

Wenhao Sun, Ji Li, Zhaoqiang Liu

Main category: cs.CV

TL;DR: JiT is a training-free framework that accelerates diffusion transformers by dynamically selecting sparse anchor tokens during generation, achieving up to 7x speedup with minimal quality loss.

Details

Motivation: Current diffusion transformers suffer from high computational costs due to iterative sampling. Existing acceleration methods focus on temporal domain but overlook spatial redundancy where global structures emerge before fine details, leading to inefficient uniform computation across all spatial regions.

Method: Proposes Just-in-Time (JiT) framework with spatially approximated generative ODE that evolves full latent state using dynamically selected sparse anchor tokens. Introduces deterministic micro-flow ODE to maintain structural coherence and statistical correctness when incorporating new tokens.

Result: Extensive experiments on FLUX.1-dev model show JiT achieves up to 7x speedup with nearly lossless performance, outperforming existing acceleration methods and establishing superior trade-off between inference speed and generation fidelity.

Conclusion: JiT effectively addresses spatial redundancy in diffusion transformers through dynamic token selection, enabling significant computational acceleration without compromising generation quality.

Abstract: Diffusion Transformers have established a new state-of-the-art in image synthesis, but the high computational cost of iterative sampling severely hampers their practical deployment. While existing acceleration methods often focus on the temporal domain, they overlook the substantial spatial redundancy inherent in the generative process, where global structures emerge long before fine-grained details are formed. The uniform computational treatment of all spatial regions represents a critical inefficiency. In this paper, we introduce Just-in-Time (JiT), a novel training-free framework that addresses this challenge by acceleration in the spatial domain. JiT formulates a spatially approximated generative ordinary differential equation (ODE) that drives the full latent state evolution based on computations from a dynamically selected, sparse subset of anchor tokens. To ensure seamless transitions as new tokens are incorporated to expand the dimensions of the latent state, we propose a deterministic micro-flow, a simple and effective finite-time ODE that maintains both structural coherence and statistical correctness. Extensive experiments on the state-of-the-art FLUX.1-dev model demonstrate that JiT achieves up to a 7x speedup with nearly lossless performance, significantly outperforming existing acceleration methods and establishing a new and superior trade-off between inference speed and generation fidelity.

[183] Event-based Photometric Stereo via Rotating Illumination and Per-Pixel Learning

Hyunwoo Kim, Won-Hoe Kim, Sanghoon Lee, Jianfei Cai, Giljoo Nam, Jae-Sang Hyun

Main category: cs.CV

TL;DR: Event-based photometric stereo using a single moving light source and event camera with neural network for surface normal estimation without calibration

Details

Motivation: Conventional frame-based photometric stereo methods are limited by controlled lighting requirements and susceptibility to ambient illumination, making them impractical for real-world applications

Method: Uses event camera with single light source moving along circular trajectory, processes event signals with per-pixel multi-layer neural network to directly predict surface normals without system calibration

Result: Achieves 7.12% reduction in mean angular error compared to existing event-based methods, demonstrates robustness in sparse event regions, strong ambient illumination, and specular scenes

Conclusion: Event-based photometric stereo with neural network processing enables more compact, scalable, and robust surface normal estimation in challenging real-world conditions

Abstract: Photometric stereo is a technique for estimating surface normals using images captured under varying illumination. However, conventional frame-based photometric stereo methods are limited in real-world applications due to their reliance on controlled lighting, and susceptibility to ambient illumination. To address these limitations, we propose an event-based photometric stereo system that leverages an event camera, which is effective in scenarios with continuously varying scene radiance and high dynamic range conditions. Our setup employs a single light source moving along a predefined circular trajectory, eliminating the need for multiple synchronized light sources and enabling a more compact and scalable design. We further introduce a lightweight per-pixel multi-layer neural network that directly predicts surface normals from event signals generated by intensity changes as the light source rotates, without system calibration. Experimental results on benchmark datasets and real-world data collected with our data acquisition system demonstrate the effectiveness of our method, achieving a 7.12% reduction in mean angular error compared to existing event-based photometric stereo methods. In addition, our method demonstrates robustness in regions with sparse event activity, strong ambient illumination, and scenes affected by specularities.

[184] CodePercept: Code-Grounded Visual STEM Perception for MLLMs

Tongkun Guan, Zhibo Yang, Jianqiang Wan, Mingkun Yang, Zhengtao Guo, Zijian Hu, Ruilin Luo, Ruize Chen, Songtao Jiang, Peng Wang, Wei Shen, Junyang Lin, Xiaokang Yang

Main category: cs.CV

TL;DR: MLLMs struggle with STEM visual reasoning primarily due to perceptual deficiencies rather than reasoning limitations. The paper introduces code as a perceptual medium and creates ICC-1M dataset with Image-Caption-Code triplets to enhance perception through code-grounded caption generation and STEM image-to-code translation.

Details

Motivation: When MLLMs fail at STEM visual reasoning, it's unclear whether the failure stems from perceptual deficiencies or reasoning limitations. The paper aims to identify the true bottleneck and systematically enhance perception capabilities using code as a precise perceptual medium aligned with STEM visuals' structured nature.

Method: 1) Systematic scaling analysis to independently scale perception and reasoning components, revealing perception as the limiting factor. 2) Establish code as perceptual medium through ICC-1M dataset with 1M Image-Caption-Code triplets. 3) Two approaches: Code-Grounded Caption Generation (using executable code as ground truth) and STEM Image-to-Code Translation (generating reconstruction code). 4) Introduce STEM2Code-Eval benchmark for direct visual perception evaluation through executable code generation for image reconstruction.

Result: Scaling perception consistently outperforms scaling reasoning, identifying perception as the true bottleneck in STEM visual reasoning. The code-as-perception paradigm and ICC-1M dataset provide systematic enhancement of MLLMs’ perceptual capabilities for STEM domains.

Conclusion: Perception, not reasoning, is the primary limitation in MLLMs’ STEM visual reasoning capabilities. Using code as a perceptual medium with executable semantics provides precise alignment with structured STEM visuals, enabling systematic perception enhancement through the proposed dataset and benchmark.

Abstract: When MLLMs fail at Science, Technology, Engineering, and Mathematics (STEM) visual reasoning, a fundamental question arises: is it due to perceptual deficiencies or reasoning limitations? Through systematic scaling analysis that independently scales perception and reasoning components, we uncover a critical insight: scaling perception consistently outperforms scaling reasoning. This reveals perception as the true lever limiting current STEM visual reasoning. Motivated by this insight, our work focuses on systematically enhancing the perception capabilities of MLLMs by establishing code as a powerful perceptual medium–executable code provides precise semantics that naturally align with the structured nature of STEM visuals. Specifically, we construct ICC-1M, a large-scale dataset comprising 1M Image-Caption-Code triplets that materializes this code-as-perception paradigm through two complementary approaches: (1) Code-Grounded Caption Generation treats executable code as ground truth for image captions, eliminating the hallucinations inherent in existing knowledge distillation methods; (2) STEM Image-to-Code Translation prompts models to generate reconstruction code, mitigating the ambiguity of natural language for perception enhancement. To validate this paradigm, we further introduce STEM2Code-Eval, a novel benchmark that directly evaluates visual perception in STEM domains. Unlike existing work relying on problem-solving accuracy as a proxy that only measures problem-relevant understanding, our benchmark requires comprehensive visual comprehension through executable code generation for image reconstruction, providing deterministic and verifiable assessment. Code is available at https://github.com/TongkunGuan/Qwen-CodePercept.

[185] Guiding Diffusion Models with Semantically Degraded Conditions

Shilong Han, Yuming Zhang, Hongxia Wang

Main category: cs.CV

TL;DR: CDG replaces CFG’s null prompt with strategically degraded conditions to improve compositional accuracy in text-to-image generation by forcing “good vs. almost good” discrimination instead of “good vs. null” contrast.

Details

Motivation: Classifier-Free Guidance (CFG) suffers from geometric entanglement due to its reliance on semantically vacuous null prompts, limiting precision in complex compositional tasks. The authors aim to address this fundamental limitation by creating more semantically-aware guidance signals.

Method: Proposes Condition-Degradation Guidance (CDG) that replaces the null prompt with a strategically degraded condition (c_deg). Leverages the observation that transformer text encoder tokens split into content tokens (object semantics) and context-aggregating tokens. Selectively degrades only content tokens to construct c_deg without external models or training.

Result: CDG markedly improves compositional accuracy and text-image alignment across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image. Achieves this as a lightweight, plug-and-play module with negligible computational overhead.

Conclusion: Challenges the reliance on static, information-sparse negative samples and establishes that adaptive, semantically-aware negative samples are critical for precise semantic control in diffusion guidance.

Abstract: Classifier-Free Guidance (CFG) is a cornerstone of modern text-to-image models, yet its reliance on a semantically vacuous null prompt ($\varnothing$) generates a guidance signal prone to geometric entanglement. This is a key factor limiting its precision, leading to well-documented failures in complex compositional tasks. We propose Condition-Degradation Guidance (CDG), a novel paradigm that replaces the null prompt with a strategically degraded condition, $\boldsymbol{c}{\text{deg}}$. This reframes guidance from a coarse “good vs. null” contrast to a more refined “good vs. almost good” discrimination, thereby compelling the model to capture fine-grained semantic distinctions. We find that tokens in transformer text encoders split into two functional roles: content tokens encoding object semantics, and context-aggregating tokens capturing global context. By selectively degrading only the former, CDG constructs $\boldsymbol{c}{\text{deg}}$ without external models or training. Validated across diverse architectures including Stable Diffusion 3, FLUX, and Qwen-Image, CDG markedly improves compositional accuracy and text-image alignment. As a lightweight, plug-and-play module, it achieves this with negligible computational overhead. Our work challenges the reliance on static, information-sparse negative samples and establishes a new principle for diffusion guidance: the construction of adaptive, semantically-aware negative samples is critical to achieving precise semantic control. Code is available at https://github.com/Ming-321/Classifier-Degradation-Guidance.

[186] Taking Shortcuts for Categorical VQA Using Super Neurons

Pierre Musacchio, Jaeyi Jeong, Dahun Kim, Jaesik Park

Main category: cs.CV

TL;DR: Super Neurons (SNs) are scalar activations in VLMs that serve as effective training-free classifiers, enabling extreme early exiting from the first layer for 5.10x speedup while improving performance.

Details

Motivation: Current methods like Sparse Attention Vectors (SAVs) improve VLMs but rely on attention heads. The authors propose probing scalar activations instead, which dramatically increases the search space for accurate parameters and enables more efficient classification.

Method: Probe raw scalar activations in VLMs to find discriminative neurons (Super Neurons) that serve as classifiers. These SNs appear in shallow layers, allowing extreme early exiting from the first layer at the first generated token without additional training.

Result: SNs robustly improve classification performance compared to original networks while achieving up to 5.10x speedup through early exiting from the first layer.

Conclusion: Scalar activations (Super Neurons) provide a superior training-free alternative to attention-based methods for VLM improvement, enabling both performance gains and significant computational efficiency through early exiting.

Abstract: Sparse Attention Vectors (SAVs) have emerged as an excellent training-free alternative to supervised finetuning or low-rank adaptation to improve the performance of Vision Language Models (VLMs). At their heart, SAVs select a few accurate attention heads for a task of interest and use them as classifiers, rather than relying on the model’s prediction. In a similar spirit, we find that directly probing the raw activations of the VLM, in the form of scalar values, is sufficient to yield accurate classifiers on diverse visually grounded downstream tasks. Shifting focus from attention vectors to scalar activations dramatically increases the search space for accurate parameters, allowing us to find more discriminative neurons immediately from the first generated token. We call such activations Super Neurons (SNs). In this probing setting, we discover that enough SNs appear in the shallower layers of the large language model to allow for extreme early exiting from the first layer of the model at the first generated token. Compared to the original network, SNs robustly improve the classification performance while achieving a speedup of up to 5.10x.

[187] Phase-Interface Instance Segmentation as a Visual Sensor for Laboratory Process Monitoring

Mingyue Li, Xin Yang, Shilin Yan, Jinye Ran, Morui Zhu, Zirui Peng, Huanqing Peng, Wei Peng, Guanghua Zhang, Shuo Li, Hao Zhang

Main category: cs.CV

TL;DR: A computer vision system for monitoring chemical experiments in transparent glassware using phase-interface instance segmentation with improved YOLO architecture.

Details

Motivation: Visual monitoring of chemical experiments in transparent glassware is challenging due to weak phase boundaries and optical artifacts that degrade conventional segmentation methods.

Method: Proposes LGA-RCM-YOLO based on YOLO11m-seg, combining Local-Global Attention for robust semantic representation and Rectangular Self-Calibration Module for boundary refinement of thin interfaces. Uses CTG 2.0 dataset with 3,668 images, 23 glassware categories, and 5 multiphase interface types.

Result: Achieves 84.4% AP@0.5 and 58.43% AP@0.5-0.95 on CTG 2.0, improving over baseline by 6.42 and 8.75 AP points respectively. Maintains near real-time inference (13.67 FPS). Color-attribute head achieves 98.71% precision and 98.32% recall for labeling liquid color.

Conclusion: Phase-interface instance segmentation can serve as a practical visual sensor for laboratory automation, demonstrated in continuous process monitoring of separatory-funnel phase separation and crystallization.

Abstract: Reliable visual monitoring of chemical experiments remains challenging in transparent glassware, where weak phase boundaries and optical artifacts degrade conventional segmentation. We formulate laboratory phenomena as the time evolution of phase interfaces and introduce the Chemical Transparent Glasses dataset 2.0 (CTG 2.0), a vessel-aware benchmark with 3,668 images, 23 glassware categories, and five multiphase interface types for phase-interface instance segmentation. Building on YOLO11m-seg, we propose LGA-RCM-YOLO, which combines Local-Global Attention (LGA) for robust semantic representation and a Rectangular Self-Calibration Module (RCM) for boundary refinement of thin, elongated interfaces. On CTG 2.0, the proposed model achieves 84.4% AP@0.5 and 58.43% AP@0.5-0.95, improving over the YOLO11m baseline by 6.42 and 8.75 AP points, respectively, while maintaining near real-time inference (13.67 FPS, RTX 3060). An auxiliary color-attribute head further labels liquid instances as colored or colorless with 98.71% precision and 98.32% recall. Finally, we demonstrate continuous process monitoring in separatory-funnel phase separation and crystallization, showing that phase-interface instance segmentation can serve as a practical visual sensor for laboratory automation.

[188] The Quadratic Geometry of Flow Matching: Semantic Granularity Alignment for Text-to-Image Synthesis

Zhinan Xiong, Shunqi Yuan

Main category: cs.CV

TL;DR: The paper analyzes generative fine-tuning dynamics under Flow Matching, revealing a latent Data Interaction Matrix and proposes Semantic Granularity Alignment (SGA) to mitigate gradient conflicts in Text-to-Image synthesis.

Details

Motivation: The authors observe that standard MSE objective in Flow Matching can be formulated as a Quadratic Form governed by a Neural Tangent Kernel, revealing a latent Data Interaction Matrix where off-diagonal terms encode residual correlations between heterogeneous features. Standard training implicitly optimizes these cross-term interferences without explicit control, and the prevailing data-homogeneity assumption may constrain model capacity.

Method: Proposes Semantic Granularity Alignment (SGA) using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts by explicitly controlling the cross-term interferences revealed by the Data Interaction Matrix analysis.

Result: Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity in Text-to-Image synthesis.

Conclusion: The geometric perspective of Flow Matching optimization reveals important dynamics in generative fine-tuning, and the proposed SGA method effectively addresses gradient conflicts to improve training efficiency and output quality in multimodal generation tasks.

Abstract: In this work, we analyze the optimization dynamics of generative fine-tuning. We observe that under the Flow Matching framework, the standard MSE objective can be formulated as a Quadratic Form governed by a dynamically evolving Neural Tangent Kernel (NTK). This geometric perspective reveals a latent Data Interaction Matrix, where diagonal terms represent independent sample learning and off-diagonal terms encode residual correlation between heterogeneous features. Although standard training implicitly optimizes these cross-term interferences, it does so without explicit control; moreover, the prevailing data-homogeneity assumption may constrain the model’s effective capacity. Motivated by this insight, we propose Semantic Granularity Alignment (SGA), using Text-to-Image synthesis as a testbed. SGA engineers targeted interventions in the vector residual field to mitigate gradient conflicts. Evaluations across DiT and U-Net architectures confirm that SGA advances the efficiency-quality trade-off by accelerating convergence and improving structural integrity.

[189] PolGS++: Physically-Guided Polarimetric Gaussian Splatting for Fast Reflective Surface Reconstruction

Yufei Han, Chu Zhou, Youwei Lyu, Qi Chen, Si Li, Boxin Shi, Yunpeng Jia, Heng Guo, Zhanyu Ma

Main category: cs.CV

TL;DR: PolGS++ integrates polarized BRDF modeling into 3D Gaussian Splatting for efficient reflective surface reconstruction, using depth-guided visibility masks and AoP-based constraints to improve geometry and normal recovery.

Details

Motivation: Accurate reconstruction of reflective surfaces is challenging but important for VR and digital content creation. While 3D Gaussian Splatting enables efficient novel-view rendering, it underperforms on reflective surfaces compared to implicit neural methods, especially for fine geometry and surface normals.

Method: Proposes PolGS++, a physically-guided polarimetric Gaussian Splatting framework that: 1) integrates polarized BRDF model into 3DGS to decouple diffuse and specular components, 2) introduces depth-guided visibility mask acquisition for AoP-based tangent-space consistency constraints without costly ray-tracing.

Result: Extensive experiments on synthetic and real-world datasets validate effectiveness. The method improves reconstruction quality and efficiency, requiring only about 10 minutes of training.

Conclusion: PolGS++ addresses the gap in reflective surface reconstruction for 3D Gaussian Splatting by incorporating physical guidance through polarimetric modeling, achieving better geometry and normal recovery while maintaining efficiency.

Abstract: Accurate reconstruction of reflective surfaces remains a fundamental challenge in computer vision, with broad applications in real-time virtual reality and digital content creation. Although 3D Gaussian Splatting (3DGS) enables efficient novel-view rendering with explicit representations, its performance on reflective surfaces still lags behind implicit neural methods, especially in recovering fine geometry and surface normals. To address this gap, we propose PolGS++, a physically-guided polarimetric Gaussian Splatting framework for fast reflective surface reconstruction. Specifically, we integrate a polarized BRDF (pBRDF) model into 3DGS to explicitly decouple diffuse and specular components, providing physically grounded reflectance modeling and stronger geometric cues for reflective surface recovery. Furthermore, we introduce a depth-guided visibility mask acquisition mechanism that enables angle-of-polarization (AoP)-based tangent-space consistency constraints in Gaussian Splatting without costly ray-tracing intersections. This physically guided design improves reconstruction quality and efficiency, requiring only about 10 minutes of training. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of our method.

[190] Backdoor Directions in Vision Transformers

Sengim Karayalcin, Marina Krcek, Pin-Yu Chen, Stjepan Picek

Main category: cs.CV

TL;DR: The paper investigates backdoor attacks in Vision Transformers, identifying a “trigger direction” in model activations that controls backdoor behavior, revealing differences between attack types, and proposing detection methods.

Details

Motivation: To understand how backdoor attacks are represented internally within Vision Transformers and develop interpretability-based methods for diagnosing and addressing security vulnerabilities in computer vision models.

Method: Identifies a linear “trigger direction” in model activations, performs interventions in activation and parameter space, traces backdoor feature processing across layers, analyzes differences between attack types, and proposes weight-based detection schemes.

Result: Found distinct qualitative differences between static-patch and stealthy distributed triggers, confirmed causal role of trigger direction, and developed effective detection methods for stealthy-trigger attacks.

Conclusion: Mechanistic interpretability provides a robust framework for diagnosing and addressing security vulnerabilities in vision models, with backdoor attacks following different internal logics depending on trigger type.

Abstract: This paper investigates how Backdoor Attacks are represented within Vision Transformers (ViTs). By assuming knowledge of the trigger, we identify a specific ``trigger direction’’ in the model’s activations that corresponds to the internal representation of the trigger. We confirm the causal role of this linear direction by showing that interventions in both activation and parameter space consistently modulate the model’s backdoor behavior across multiple datasets and attack types. Using this direction as a diagnostic tool, we trace how backdoor features are processed across layers. Our analysis reveals distinct qualitative differences: static-patch triggers follow a different internal logic than stealthy, distributed triggers. We further examine the link between backdoors and adversarial attacks, specifically testing whether PGD-based perturbations (de-)activate the identified trigger mechanism. Finally, we propose a data-free, weight-based detection scheme for stealthy-trigger attacks. Our findings show that mechanistic interpretability offers a robust framework for diagnosing and addressing security vulnerabilities in computer vision.

[191] HanMoVLM: Large Vision-Language Models for Professional Artistic Painting Evaluation

Hongji Yang, Yucheng Zhou, Wencheng Han, Songlian Li, Xiaotong Zhao, Jianbing Shen

Main category: cs.CV

TL;DR: HanMoVLM transforms VLMs into expert evaluators for Chinese paintings, using a specialized dataset and Chain-of-Thought reasoning validated by experts to achieve professional-grade artistic evaluation.

Details

Motivation: Current VLMs lack artistic expertise for professional evaluation of artworks, especially in abstract domains like Chinese painting that require extensive artistic training. There's a need to bridge this gap between general visual understanding and domain-specific artistic evaluation.

Method: Introduces HanMo-Bench dataset with authentic auction-grade masterpieces and AI-generated works. Proposes HanMoVLM with expert-validated Chain-of-Thought reasoning: content identification → RoI localization → professional evaluation using theme-specific and three-tier Chinese painting criteria. Includes reward function to refine reasoning.

Result: HanMoVLM achieves high consistency with professional experts and significantly improves Chinese painting generation quality. It serves as a critical backbone for Test-time Scaling in image generation by selecting artistically superior outputs.

Conclusion: The approach successfully bridges the gap between general VLMs and professional artistic evaluation, enabling expert-level assessment of Chinese paintings and improving generative model outputs through high-quality verification.

Abstract: While Large Vision-Language Models (VLMs) demonstrate impressive general visual capabilities, they remain artistically blind and unable to offer professional evaluation of artworks within specific artistic domains like human experts. To bridge this gap, we transform VLMs into experts capable of professional-grade painting evaluation in the Chinese Artistic Domain, which is more abstract and demands extensive artistic training for evaluation. We introduce HanMo-Bench, a new dataset that features authentic auction-grade masterpieces and AI-generated works, grounded in real-world market valuations. To realize the rigorous judgment, we propose the HanMoVLM and construct a Chain-of-Thought (CoT) validated by experts. This CoT guides the model to perform expert-level reasoning: from content identification and Region of Interest (RoI) localization to professional evaluation, guided by both theme-specific evaluation and typical three-tier evaluation in Chinese paintings. Furthermore, we design a reward function to refine the reasoning process of the HanMoVLM to improve the accuracy. We demonstrate that HanMoVLM can serve as a critical backbone for Test-time Scaling in image generation. By acting as a high-quality verifier, HanMoVLM enables generative models to select the most artistically superior outputs from multiple candidates. Experimental results and human studies confirm that the proposed HanMoVLM effectively bridges the gap, achieving a high consistency with professional experts and significantly improving the quality of Chinese Painting generation.

[192] A dataset of medication images with instance segmentation masks for preventing adverse drug events

W. I. Chu, S. Hirani, G. Tarroni, L. Li

Main category: cs.CV

TL;DR: MEDISEG dataset provides instance segmentation annotations for 32 pill types across 8262 images with real-world complexities like overlapping pills and varied lighting, enabling robust AI-based pill recognition for medication safety.

Details

Motivation: Medication errors and adverse drug events pose significant patient safety risks due to difficulties in reliably identifying pharmaceuticals in real-world settings. Existing pill image datasets lack comprehensive real-world complexities like overlapping pills, varied lighting, and occlusions, hindering development of effective AI-based pill recognition models.

Method: Created MEDISEG dataset with instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. Trained YOLOv8 and YOLOv9 models on this dataset and evaluated performance under few-shot detection protocols.

Result: Achieved mean average precision at IoU 0.5 of 99.5% on 3-Pills subset and 80.1% on 32-Pills subset. Base training on MEDISEG significantly improved recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets, demonstrating transferable representations under limited supervision.

Conclusion: MEDISEG dataset supports robust supervised training and promotes transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety and pill recognition.

Abstract: Medication errors and adverse drug events (ADEs) pose significant risks to patient safety, often arising from difficulties in reliably identifying pharmaceuticals in real-world settings. AI-based pill recognition models offer a promising solution, but the lack of comprehensive datasets hinders their development. Existing pill image datasets rarely capture real-world complexities such as overlapping pills, varied lighting, and occlusions. MEDISEG addresses this gap by providing instance segmentation annotations for 32 distinct pill types across 8262 images, encompassing diverse conditions from individual pill images to cluttered dosette boxes. We trained YOLOv8 and YOLOv9 on MEDISEG to demonstrate their usability, achieving mean average precision at IoU 0.5 of 99.5 percent on the 3-Pills subset and 80.1 percent on the 32-Pills subset. We further evaluate MEDISEG under a few-shot detection protocol, demonstrating that base training on MEDISEG significantly improves recognition of unseen pill classes in occluded multi-pill scenarios compared to existing datasets. These results highlight the dataset’s ability not only to support robust supervised training but also to promote transferable representations under limited supervision, making it a valuable resource for developing and benchmarking AI-driven systems for medication safety.

[193] BALD-SAM: Disagreement-based Active Prompting in Interactive Segmentation

Prithwijit Chowdhury, Mohit Prabhushankar, Ghassan AlRegib

Main category: cs.CV

TL;DR: BALD-SAM: A Bayesian active learning framework for spatial prompt selection in interactive segmentation, using uncertainty estimation to identify informative regions for refinement.

Details

Motivation: Current interactive segmentation workflows rely on human visual assessment for prompt placement, lacking principled automated approaches to identify the most informative regions for refinement during iterative annotation.

Method: Proposes active prompting using Bayesian Active Learning by Disagreement (BALD) for spatial prompt selection. Freezes the entire SAM model and applies Bayesian uncertainty modeling only to a small learned prediction head, making uncertainty estimation practical for large foundation models.

Result: Achieves strong cross-domain performance across 16 datasets spanning natural, medical, underwater, and seismic domains, ranking first or second on 14 of 16 benchmarks. Surpasses human prompting and even oracle prompting in several categories, consistently outperforming one-shot baselines.

Conclusion: BALD-SAM provides a principled framework for automated interactive prompting that increases annotation efficiency by strategically selecting informative regions, demonstrating robust performance across diverse domains and complex object structures.

Abstract: The Segment Anything Model (SAM) has revolutionized interactive segmentation through spatial prompting. While existing work primarily focuses on automating prompts in various settings, real-world annotation workflows involve iterative refinement where annotators observe model outputs and strategically place prompts to resolve ambiguities. Current pipelines typically rely on the annotator’s visual assessment of the predicted mask quality. We postulate that a principled approach for automated interactive prompting is to use a model-derived criterion to identify the most informative region for the next prompt. In this work, we establish active prompting: a spatial active learning approach where locations within images constitute an unlabeled pool and prompts serve as queries to prioritize information-rich regions, increasing the utility of each interaction. We further present BALD-SAM: a principled framework adapting Bayesian Active Learning by Disagreement (BALD) to spatial prompt selection by quantifying epistemic uncertainty. To do so, we freeze the entire model and apply Bayesian uncertainty modeling only to a small learned prediction head, making intractable uncertainty estimation practical for large multi-million parameter foundation models. Across 16 datasets spanning natural, medical, underwater, and seismic domains, BALD-SAM demonstrates strong cross-domain performance, ranking first or second on 14 of 16 benchmarks. We validate these gains through a comprehensive ablation suite covering 3 SAM backbones and 35 Laplace posterior configurations, amounting to 38 distinct ablation settings. Beyond strong average performance, BALD-SAM surpasses human prompting and, in several categories, even oracle prompting, while consistently outperforming one-shot baselines in final segmentation quality, particularly on thin and structurally complex objects.

[194] Evaluating Few-Shot Pill Recognition Under Visual Domain Shift

W. I. Chu, G. Tarroni, L. Li

Main category: cs.CV

TL;DR: Few-shot pill recognition using two-stage object detection framework shows rapid semantic classification adaptation but struggles with localization under complex real-world conditions like overlapping pills.

Details

Motivation: To address real-world deployment challenges of automated pill recognition systems in visually complex conditions (cluttered scenes, overlapping pills, reflections, diverse environments) by focusing on generalization under cross-dataset domain shifts rather than architectural innovation.

Method: Two-stage object detection framework with base training followed by few-shot fine-tuning. Models adapted to novel pill classes using 1, 5, or 10 labeled examples per class, evaluated on deployment dataset with multi-object, cluttered scenes using classification-centric and error-based metrics.

Result: Semantic pill recognition adapts rapidly with few-shot supervision, with classification performance saturating even with single labeled example. However, localization and recall decline significantly under overlapping/occluded conditions despite robust semantic classification. Models trained on realistic multi-pill data show greater robustness in low-shot scenarios.

Conclusion: Training data realism is crucial for deployment readiness, and few-shot fine-tuning serves as valuable diagnostic tool for assessing model robustness under real-world conditions, highlighting the gap between semantic classification and localization performance in complex visual environments.

Abstract: Adverse drug events are a significant source of preventable harm, which has led to the development of automated pill recognition systems to enhance medication safety. Real-world deployment of these systems is hindered by visually complex conditions, including cluttered scenes, overlapping pills, reflections, and diverse acquisition environments. This study investigates few-shot pill recognition from a deployment-oriented perspective, prioritizing generalization under realistic cross-dataset domain shifts over architectural innovation. A two-stage object detection framework is employed, involving base training followed by few-shot fine-tuning. Models are adapted to novel pill classes using one, five, or ten labeled examples per class and are evaluated on a separate deployment dataset featuring multi-object, cluttered scenes. The evaluation focuses on classification-centric and error-based metrics to address heterogeneous annotation strategies. Findings indicate that semantic pill recognition adapts rapidly with few-shot supervision, with classification performance reaching saturation even with a single labeled example. However, stress testing under overlapping and occluded conditions demonstrates a marked decline in localization and recall, despite robust semantic classification. Models trained on visually realistic, multi-pill data consistently exhibit greater robustness in low-shot scenarios, underscoring the importance of training data realism and the diagnostic utility of few-shot fine-tuning for deployment readiness.

[195] On the Reliability of Cue Conflict and Beyond

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

Main category: cs.CV

TL;DR: REFINED-BIAS introduces a new dataset and evaluation framework for more reliable and interpretable diagnosis of shape-texture bias in neural networks, addressing limitations of current stylization-based cue-conflict benchmarks.

Details

Motivation: Current stylization-based cue-conflict benchmarks for probing shape-texture preference in neural networks yield unstable and ambiguous bias estimates due to issues with cue validity, separability, informativeness control, and restricted evaluation space.

Method: REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric.

Result: The framework enables fairer cross-model comparisons, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions across diverse training regimes and architectures, resolving inconsistencies from prior evaluations.

Conclusion: REFINED-BIAS provides a more reliable and interpretable approach for diagnosing shape-texture bias in neural networks, addressing fundamental limitations of existing cue-conflict benchmarks.

Abstract: Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model- recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

[196] UltrasoundAgents: Hierarchical Multi-Agent Evidence-Chain Reasoning for Breast Ultrasound Diagnosis

Yali Zhu, Kang Zhou, Dingbang Wu, Gaofeng Meng

Main category: cs.CV

TL;DR: UltrasoundAgents: Hierarchical multi-agent framework for breast ultrasound diagnosis that mimics clinical workflow with lesion localization, attribute analysis, and evidence-based reasoning for BI-RADS classification.

Details

Motivation: Existing breast ultrasound diagnosis methods use end-to-end prediction or provide weakly grounded evidence, missing fine-grained lesion cues and limiting auditability. Need to align with clinical workflow (global→local→integration) and improve evidence traceability.

Method: Hierarchical multi-agent framework: main agent localizes lesion and triggers crop-and-zoom; sub-agent analyzes local view to predict four clinical attributes (echogenicity pattern, calcification, boundary type, edge morphology); main agent integrates attributes for evidence-based reasoning and BI-RADS classification. Uses decoupled progressive training strategy to address error propagation and sparse rewards.

Result: Consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, with structured evidence and traceable reasoning.

Conclusion: UltrasoundAgents provides a clinically-aligned framework for breast ultrasound diagnosis with improved auditability through structured intermediate evidence and evidence-based reasoning.

Abstract: Breast ultrasound diagnosis typically proceeds from global lesion localization to local sign assessment and then evidence integration to assign a BI-RADS category and determine benignity or malignancy. Many existing methods rely on end-to-end prediction or provide only weakly grounded evidence, which can miss fine-grained lesion cues and limit auditability and clinical review. To align with the clinical workflow and improve evidence traceability, we propose a hierarchical multi-agent framework, termed UltrasoundAgents. A main agent localizes the lesion in the full image and triggers a crop-and-zoom operation. A sub-agent analyzes the local view and predicts four clinically relevant attributes, namely echogenicity pattern, calcification, boundary type, and edge (margin) morphology. The main agent then integrates these structured attributes to perform evidence-based reasoning and output the BI-RADS category and the malignancy prediction, while producing reviewable intermediate evidence. Furthermore, hierarchical multi-agent training often suffers from error propagation, difficult credit assignment, and sparse rewards. To alleviate this and improve training stability, we introduce a decoupled progressive training strategy. We first train the attribute agent, then train the main agent with oracle attributes to learn robust attribute-based reasoning, and finally apply corrective trajectory self-distillation with spatial supervision to build high-quality trajectories for supervised fine-tuning, yielding a deployable end-to-end policy. Experiments show consistent gains over strong vision-language baselines in diagnostic accuracy and attribute agreement, together with structured evidence and traceable reasoning.

Lin Chen, Bolin Ni, Qi Yang, Zili Wang, Kun Ding, Ying Wang, Houwen Peng, Shiming Xiang

Main category: cs.CV

TL;DR: DIPE addresses visual fading in MLLMs by modifying position encoding to maintain consistent visual-text attention in long contexts

Details

Motivation: Multimodal LLMs suffer from visual fading in long-context scenarios where attention to visual tokens diminishes as text sequences lengthen, causing text generation to become detached from visual constraints

Method: Proposes inter-modal Distance Invariant Position Encoding (DIPE) that disentangles position encoding based on modality interactions: retains natural relative positioning for intra-modal interactions while enforcing anchored perceptual proximity for inter-modal interactions

Result: DIPE integrated with Multimodal RoPE maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks

Conclusion: DIPE effectively mitigates inter-modal distance-based penalty in position encoding, ensuring visual signals remain perceptually consistent regardless of context length

Abstract: Despite the remarkable capabilities of Multimodal Large Language Models (MLLMs), they still suffer from visual fading in long-context scenarios. Specifically, the attention to visual tokens diminishes as the text sequence lengthens, leading to text generation detached from visual constraints. We attribute this degradation to the inherent inductive bias of Multimodal RoPE, which penalizes inter-modal attention as the distance between visual and text tokens increases. To address this, we propose inter-modal Distance Invariant Position Encoding (DIPE), a simple but effective mechanism that disentangles position encoding based on modality interactions. DIPE retains the natural relative positioning for intra-modal interactions to preserve local structure, while enforcing an anchored perceptual proximity for inter-modal interactions. This strategy effectively mitigates the inter-modal distance-based penalty, ensuring that visual signals remain perceptually consistent regardless of the context length. Experimental results demonstrate that by integrating DIPE with Multimodal RoPE, the model maintains stable visual grounding in long-context scenarios, significantly alleviating visual fading while preserving performance on standard short-context benchmarks. Code is available at https://github.com/lchen1019/DIPE.

[198] Bilevel Layer-Positioning LoRA for Real Image Dehazing

Yan Zhang, Long Ma, Yuxin Feng, Zhe Huang, Fan Zhou, Zhuo Su

Main category: cs.CV

TL;DR: Proposes a text-guided dehazing method using CLIP for semantic alignment and a bilevel LoRA strategy for efficient adaptation to diverse real haze scenes.

Details

Motivation: Existing learning-based real image dehazing methods face adaptation challenges in diverse real haze scenes due to lack of effective unsupervised mechanisms for unlabeled data and high cost of full model fine-tuning.

Method: Introduces haze-to-clear text-directed loss leveraging CLIP’s cross-modal capabilities to reformulate dehazing as semantic alignment in latent space, plus Bilevel Layer-positioning LoRA (BiLaLoRA) that learns both LoRA parameters and automatically searches injection layers for targeted adaptation.

Result: Extensive experiments demonstrate superiority against state-of-the-art methods on multiple real-world dehazing benchmarks.

Conclusion: The proposed approach effectively addresses adaptation challenges in real image dehazing through cross-modal guidance and efficient parameter-efficient fine-tuning.

Abstract: Learning-based real image dehazing methods have achieved notable progress, yet they still face adaptation challenges in diverse real haze scenes. These challenges mainly stem from the lack of effective unsupervised mechanisms for unlabeled data and the heavy cost of full model fine-tuning. To address these challenges, we propose the haze-to-clear text-directed loss that leverages CLIP’s cross-modal capabilities to reformulate real image dehazing as a semantic alignment problem in latent space, thereby providing explicit unsupervised cross-modal guidance in the absence of reference images. Furthermore, we introduce the Bilevel Layer-positioning LoRA (BiLaLoRA) strategy, which learns both the LoRA parameters and automatically search the injection layers, enabling targeted adaptation of critical network layers. Extensive experiments demonstrate our superiority against state-of-the-art methods on multiple real-world dehazing benchmarks. The code is publicly available at https://github.com/YanZhang-zy/BiLaLoRA.

[199] S2D: Sparse to Dense Lifting for 3D Reconstruction with Minimal Inputs

Yuzhou Ji, Qijian Tian, He Zhu, Xiaoqi Jiang, Guangzhi Cao, Lizhuang Ma, Yuan Xie, Xin Tan

Main category: cs.CV

TL;DR: S2D pipeline bridges sparse point clouds to high-quality 3D Gaussian Splatting reconstruction using diffusion models and robust fitting strategies for minimal input requirements.

Details

Motivation: Current 3D representations like point clouds and 3DGS suffer from non-photorealistic rendering and degrade significantly under sparse inputs, limiting practical applications that require minimal captures.

Method: Two-fold approach: 1) Efficient one-step diffusion model lifts sparse point clouds for high-fidelity image artifact fixing, 2) Reconstruction strategy with random sample drop and weighted gradient for robust 3D consistent scene fitting from sparse to dense views.

Result: S2D achieves best consistency in novel view generation and first-tier sparse view reconstruction quality under different input sparsity levels, enabling stable scene reconstruction with minimal captures.

Conclusion: S2D bridges sparse point clouds and 3DGS representations, achieving high-quality 3D reconstruction with minimal input requirements, advancing practical applications of 3D Gaussian Splatting.

Abstract: Explicit 3D representations have already become an essential medium for 3D simulation and understanding. However, the most commonly used point cloud and 3D Gaussian Splatting (3DGS) each suffer from non-photorealistic rendering and significant degradation under sparse inputs. In this paper, we introduce Sparse to Dense lifting (S2D), a novel pipeline that bridges the two representations and achieves high-quality 3DGS reconstruction with minimal inputs. Specifically, the S2D lifting is two-fold. We first present an efficient one-step diffusion model that lifts sparse point cloud for high-fidelity image artifact fixing. Meanwhile, to reconstruct 3D consistent scenes, we also design a corresponding reconstruction strategy with random sample drop and weighted gradient for robust model fitting from sparse input views to dense novel views. Extensive experiments show that S2D achieves the best consistency in generating novel view guidance and first-tier sparse view reconstruction quality under different input sparsity. By reconstructing stable scenes with the least possible captures among existing methods, S2D enables minimal input requirements for 3DGS applications.

[200] Novel Architecture of RPA In Oral Cancer Lesion Detection

Revana Magdy, Joy Naoum, Ali Hamdi

Main category: cs.CV

TL;DR: Two RPA implementations (OC-RPAv1 and OC-RPAv2) for oral cancer detection show significant efficiency improvements through design patterns and batch processing

Details

Motivation: Need for accurate and early detection of oral cancer lesions for effective diagnosis and treatment, with focus on improving efficiency of existing RPA methods

Method: Evaluated two RPA implementations: OC-RPAv1 (single image processing) and OC-RPAv2 (Singleton design pattern with batch processing) on 31 test images

Result: OC-RPAv1: 0.29 seconds per image; OC-RPAv2: 0.06 seconds per image, representing 60-100x efficiency improvement over standard RPA methods

Conclusion: Design patterns and batch processing can significantly enhance scalability and reduce costs in oral cancer detection systems

Abstract: Accurate and early detection of oral cancer lesions is crucial for effective diagnosis and treatment. This study evaluates two RPA implementations, OC-RPAv1 and OC-RPAv2, using a test set of 31 images. OC-RPAv1 processes one image per prediction in an average of 0.29 seconds, while OCRPAv2 employs a Singleton design pattern and batch processing, reducing prediction time to just 0.06 seconds per image. This represents a 60-100x efficiency improvement over standard RPA methods, showcasing that design patterns and batch processing can enhance scalability and reduce costs in oral cancer detection

[201] Lifelong Imitation Learning with Multimodal Latent Replay and Incremental Adjustment

Fanqi Yu, Matteo Tiezzi, Tommaso Apicella, Cigdem Beyan, Vittorio Murino

Main category: cs.CV

TL;DR: Lifelong imitation learning framework using multimodal latent space representations for continual policy refinement across sequential tasks with memory constraints.

Details

Motivation: Enable continual policy refinement across sequential tasks under realistic memory and data constraints, moving beyond conventional experience replay approaches.

Method: Operates in multimodal latent space storing compact representations of visual, linguistic, and robot state information; introduces incremental feature adjustment mechanism with angular margin constraint to regularize task embedding evolution.

Result: Establishes new state-of-the-art in LIBERO benchmarks with 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods.

Conclusion: The framework effectively enables lifelong imitation learning through multimodal latent representations and feature regularization, demonstrating significant improvements in performance and reduced forgetting.

Abstract: We introduce a lifelong imitation learning framework that enables continual policy refinement across sequential tasks under realistic memory and data constraints. Our approach departs from conventional experience replay by operating entirely in a multimodal latent space, where compact representations of visual, linguistic, and robot’s state information are stored and reused to support future learning. To further stabilize adaptation, we introduce an incremental feature adjustment mechanism that regularizes the evolution of task embeddings through an angular margin constraint, preserving inter-task distinctiveness. Our method establishes a new state of the art in the LIBERO benchmarks, achieving 10-17 point gains in AUC and up to 65% less forgetting compared to previous leading methods. Ablation studies confirm the effectiveness of each component, showing consistent gains over alternative strategies. The code is available at: https://github.com/yfqi/lifelong_mlr_ifa.

[202] Bridging the Skill Gap in Clinical CBCT Interpretation with CBCTRepD

Qinxin Wu, Fucheng Niu, Hengchuan Zhu, Yifan Sun, Ye Shen, Xu Li, Han Wu, Leqi Liu, Zhiwen Pan, Zuozhu Liu, Fudong Zhu, Bin Feng

Main category: cs.CV

TL;DR: CBCTRepD: A bilingual oral and maxillofacial CBCT report-generation system that improves radiologist-AI collaboration workflows across experience levels.

Details

Motivation: Limited application of generative AI in oral and maxillofacial CBCT reporting due to scarcity of high-quality paired CBCT-report data and complexity of volumetric CBCT interpretation.

Method: Curated large-scale dataset of 7,408 CBCT studies covering 55 oral diseases, developed bilingual report-generation system, and established multi-level evaluation framework with radiologist- and clinician-centered assessment.

Result: CBCTRepD achieves superior report-generation performance with writing quality comparable to intermediate radiologists, and provides consistent benefits across experience levels in radiologist-AI collaboration.

Conclusion: CBCTRepD shows strong potential as practical assistant for real-world CBCT reporting by improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions.

Abstract: Generative AI has advanced rapidly in medical report generation; however, its application to oral and maxillofacial CBCT reporting remains limited, largely because of the scarcity of high-quality paired CBCT-report data and the intrinsic complexity of volumetric CBCT interpretation. To address this, we introduce CBCTRepD, a bilingual oral and maxillofacial CBCT report-generation system designed for integration into routine radiologist-AI co-authoring workflows. We curated a large-scale, high-quality paired CBCT-report dataset comprising approximately 7,408 studies, covering 55 oral disease entities across diverse acquisition settings, and used it to develop the system. We further established a clinically grounded, multi-level evaluation framework that assesses both direct AI-generated drafts and radiologist-edited collaboration reports using automatic metrics together with radiologist- and clinician-centered evaluation. Using this framework, we show that CBCTRepD achieves superior report-generation performance and produces drafts with writing quality and standardization comparable to those of intermediate radiologists. More importantly, in radiologist-AI collaboration, CBCTRepD provides consistent and clinically meaningful benefits across experience levels: it helps novice radiologists improve toward intermediate-level reporting, enables intermediate radiologists to approach senior-level performance, and even assists senior radiologists by reducing omission-related errors, including clinically important missed lesions. By improving report structure, reducing omissions, and promoting attention to co-existing lesions across anatomical regions, CBCTRepD shows strong and reliable potential as a practical assistant for real-world CBCT reporting across multi-level care settings.

[203] Pointy - A Lightweight Transformer for Point Cloud Foundation Models

Konrad Szafer, Marek Kraft, Dominik Belter

Main category: cs.CV

TL;DR: A lightweight transformer-based point cloud model trained on only 39k samples outperforms larger foundation models trained on 200k+ samples, approaching SOTA results achieved with million-scale multimodal training.

Details

Motivation: Current point cloud foundation models heavily rely on cross-modal supervision from language or vision, requiring massive datasets. The authors aim to demonstrate that simpler, carefully designed architectures with modest training data can achieve competitive performance without complex multimodal dependencies.

Method: Introduces a lightweight transformer-based architecture for point clouds, trained on only 39k point clouds without cross-modal supervision. Conducts comprehensive replication study with standardized training regime to isolate architectural impact, comparing against various point cloud architectures including tokenizer-free approaches.

Result: The model outperforms several larger foundation models trained on over 200k samples and approaches state-of-the-art results from models trained on over a million point clouds, images, and text samples. Shows simple backbones can deliver competitive results to more complex or data-rich strategies.

Conclusion: Carefully curated training setups and architectures can achieve strong point cloud understanding without heavy reliance on cross-modal supervision or massive datasets. The unified experimental framework enables transparent comparisons and highlights benefits of tokenizer-free architectures.

Abstract: Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds - yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

[204] Contrastive learning-based video quality assessment-jointed video vision transformer for video recognition

Jian Sun, Mohammad H. Mahoor

Main category: cs.CV

TL;DR: SSL-V3: Self-supervised video vision transformer with no-reference VQA for improved video classification by jointly learning video quality assessment and classification tasks.

Details

Motivation: Video quality significantly impacts video classification performance (e.g., clear vs blurred videos for Mild Cognitive Impairment classification). Existing approaches suffer from label shortage for Video Quality Assessment (VQA) in video datasets, making it impossible to provide accurate quality scores.

Method: Proposes SSL-V3: Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA. Uses Combined-SSL mechanism to integrate VQA into video classification, addressing VQA label shortage. The method uses video quality score as a factor to directly tune feature maps for classification, and as an intersected point linking VQA and classification tasks, using supervised classification to tune VQA parameters.

Result: Achieved robust experimental results on two datasets, including 94.87% accuracy on interview videos from I-CONECT (facial video-involved healthcare dataset), verifying SSL-V3’s effectiveness.

Conclusion: The proposed SSL-V3 framework successfully integrates video quality assessment with video classification using self-supervised learning, addressing label shortage issues and improving classification performance, particularly for healthcare applications involving facial video analysis.

Abstract: Video quality significantly affects video classification. We found this problem when we classified Mild Cognitive Impairment well from clear videos, but worse from blurred ones. From then, we realized that referring to Video Quality Assessment (VQA) may improve video classification. This paper proposed Self-Supervised Learning-based Video Vision Transformer combined with No-reference VQA for video classification (SSL-V3) to fulfill the goal. SSL-V3 leverages Combined-SSL mechanism to join VQA into video classification and address the label shortage of VQA, which commonly occurs in video datasets, making it impossible to provide an accurate Video Quality Score. In brief, Combined-SSL takes video quality score as a factor to directly tune the feature map of the video classification. Then, the score, as an intersected point, links VQA and classification, using the supervised classification task to tune the parameters of VQA. SSL-V3 achieved robust experimental results on two datasets. For example, it reached an accuracy of 94.87% on some interview videos in the I-CONECT (a facial video-involved healthcare dataset), verifying SSL-V3’s effectiveness.

[205] Med-DualLoRA: Local Adaptation of Foundation Models for 3D Cardiac MRI

Joan Perramon-Llussà, Amelia Jiménez-Sánchez, Grzegorz Skorupko, Fotis Avgoustidis, Carlos Martín-Isla, Karim Lekadir, Polyxeni Gkontra

Main category: cs.CV

TL;DR: Med-DualLoRA: A federated learning framework for medical foundation models that uses dual LoRA modules (global+local) for efficient, privacy-preserving adaptation to multi-center 3D cardiac MRI data.

Details

Motivation: Medical foundation models need adaptation to clinical data, but centralized fine-tuning is infeasible due to privacy constraints. Federated learning offers privacy preservation but struggles with heterogeneous multi-center data and communication overhead for large models.

Method: Proposes Med-DualLoRA, a client-aware parameter-efficient fine-tuning framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Only global LoRA modules are aggregated across sites while local adapters remain private. Adapts only two transformer blocks for efficiency.

Result: Achieves statistically significant improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines on multi-center 3D CMR disease detection using ACDC and M&Ms datasets, while maintaining communication efficiency.

Conclusion: Med-DualLoRA provides a scalable solution for local federated adaptation of medical foundation models under realistic clinical constraints, balancing privacy, performance, and communication efficiency.

Abstract: Foundation models (FMs) show great promise for robust downstream performance across medical imaging tasks and modalities, including cardiac magnetic resonance (CMR), following task-specific adaptation. However, adaptation using single-site data may lead to suboptimal performance and increased model bias, while centralized fine-tuning on clinical data is often infeasible due to privacy constraints. Federated fine-tuning offers a privacy-preserving alternative; yet conventional approaches struggle under heterogeneous, non-IID multi-center data and incur substantial communication overhead when adapting large models. In this work, we study federated FM fine-tuning for 3D CMR disease detection and propose Med-DualLoRA, a client-aware parameter-efficient fine-tuning (PEFT) federated framework that disentangles globally shared and local low-rank adaptations (LoRA) through additive decomposition. Global and local LoRA modules are trained locally, but only the global component is shared and aggregated across sites, keeping local adapters private. This design improves personalization while significantly reducing communication cost, and experiments show that adapting only two transformer blocks preserves performance while further improving efficiency. We evaluate our method on a multi-center state-of-the-art cine 3D CMR FM fine-tuned for disease detection using ACDC and combined M&Ms datasets, treating each vendor as a federated client. Med-DualLoRA achieves statistically significant improved performance (balanced accuracy 0.768, specificity 0.612) compared to other federated PEFT baselines, while maintaining communication efficiency. Our approach provides a scalable solution for local federated adaptation of medical FMs under realistic clinical constraints.

[206] VCR: Variance-Driven Channel Recalibration for Robust Low-Light Enhancement

Zhixin Cheng, Fangwen Zhang, Xiaotian Yin, Baoqun Yin, Haodian Wang

Main category: cs.CV

TL;DR: VCR is a novel low-light image enhancement framework that addresses channel-level inconsistency and color distribution misalignment in HVI color space through variance-driven channel recalibration and color distribution alignment.

Details

Motivation: Existing low-light enhancement methods suffer from entangled luminance and color in sRGB space, while HSV space introduces noise artifacts. HVI color space helps but still has channel-level inconsistency between luminance/chrominance and misaligned color distribution leading to unnatural results.

Method: Proposes VCR framework with two main components: 1) Channel Adaptive Adjustment (CAA) module using variance-guided feature filtering to focus on regions with high intensity and color distribution, and 2) Color Distribution Alignment (CDA) module that enforces distribution alignment in color feature space.

Result: Experimental results on several benchmark datasets demonstrate state-of-the-art performance compared with existing methods, enhancing perceptual quality under low-light conditions.

Conclusion: VCR effectively addresses channel-level inconsistency and color distribution misalignment in low-light image enhancement, achieving superior performance through variance-driven channel recalibration and color distribution alignment.

Abstract: Most sRGB-based LLIE methods suffer from entangled luminance and color, while the HSV color space offers insufficient decoupling at the cost of introducing significant red and black noise artifacts. Recently, the HVI color space has been proposed to address these limitations by enhancing color fidelity through chrominance polarization and intensity compression. However, existing methods could suffer from channel-level inconsistency between luminance and chrominance, and misaligned color distribution may lead to unnatural enhancement results. To address these challenges, we propose the Variance-Driven Channel Recalibration for Robust Low-Light Enhancement (VCR), a novel framework for low-light image enhancement. VCR consists of two main components, including the Channel Adaptive Adjustment (CAA) module, which employs variance-guided feature filtering to enhance the model’s focus on regions with high intensity and color distribution. And the Color Distribution Alignment (CDA) module, which enforces distribution alignment in the color feature space. These designs enhance perceptual quality under low-light conditions. Experimental results on several benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance compared with existing methods.

[207] GroundCount: Grounding Vision-Language Models with Object Detection for Mitigating Counting Hallucinations

Boyuan Chen, Minghao Shao, Siddharth Garg, Ramesh Karri, Muhammad Shafique

Main category: cs.CV

TL;DR: GroundCount: A framework that augments Vision Language Models with explicit spatial grounding from object detection models to mitigate counting hallucinations, achieving 81.3% accuracy and 22% faster inference.

Details

Motivation: Vision Language Models exhibit persistent hallucinations in counting tasks with substantially lower accuracy than other visual reasoning tasks, while object detection models excel at spatial localization and instance counting with minimal computational overhead.

Method: Proposes GroundCount framework that augments VLMs with explicit spatial grounding from ODMs using prompt-based augmentation strategy, with comprehensive ablation studies on positional encoding, confidence scores, and feature-level fusion architectures.

Result: Achieves 81.3% counting accuracy on best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops. Consistent improvements across 4 of 5 evaluated VLM architectures (6.2-7.5pp).

Conclusion: Counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, highlighting the importance of architectural compatibility in augmentation strategies. Explicit symbolic grounding via structured prompts outperforms implicit feature fusion.

Abstract: Vision Language Models (VLMs) exhibit persistent hallucinations in counting tasks, with accuracy substantially lower than other visual reasoning tasks (excluding sentiment). This phenomenon persists even in state-of-the-art reasoning-capable VLMs. Conversely, CNN-based object detection models (ODMs) such as YOLO excel at spatial localization and instance counting with minimal computational overhead. We propose GroundCount, a framework that augments VLMs with explicit spatial grounding from ODMs to mitigate counting hallucinations. In the best case, our prompt-based augmentation strategy achieves 81.3% counting accuracy on the best-performing model (Ovis2.5-2B) - a 6.6pp improvement - while reducing inference time by 22% through elimination of hallucination-driven reasoning loops for stronger models. We conduct comprehensive ablation studies demonstrating that positional encoding is a critical component, being beneficial for stronger models but detrimental for weaker ones. Confidence scores, by contrast, introduce noise for most architectures and their removal improves performance in four of five evaluated models. We further evaluate feature-level fusion architectures, finding that explicit symbolic grounding via structured prompts outperforms implicit feature fusion despite sophisticated cross-attention mechanisms. Our approach yields consistent improvements across four of five evaluated VLM architectures (6.2–7.5pp), with one architecture exhibiting degraded performance due to incompatibility between its iterative reflection mechanisms and structured prompts. These results suggest that counting failures stem from fundamental spatial-semantic integration limitations rather than architecture-specific deficiencies, while highlighting the importance of architectural compatibility in augmentation strategies.

[208] Too Vivid to Be Real? Benchmarking and Calibrating Generative Color Fidelity

Zhengyao Fang, Zexi Jia, Yijia Zhong, Pengcheng Luo, Jinchao Zhang, Guangming Lu, Jun Yu, Wenjie Pei

Main category: cs.CV

TL;DR: A framework for improving color fidelity in realistic-style text-to-image generation through a dataset, metric, and refinement method.

Details

Motivation: Current text-to-image generation produces images that are too vivid and lack photographic realism due to biases in evaluation metrics that favor exaggerated saturation and contrast.

Method: Three components: 1) Color Fidelity Dataset (CFD) with 1.3M real/synthetic images, 2) Color Fidelity Metric (CFM) using multimodal encoder to learn perceptual color fidelity, 3) Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale during generation.

Result: The framework enables objective evaluation and improvement of color fidelity in realistic-style text-to-image generation, with dataset and code publicly available.

Conclusion: The proposed progressive framework addresses color fidelity issues in realistic T2I generation through integrated dataset, metric, and refinement components.

Abstract: Recent advances in text-to-image (T2I) generation have greatly improved visual quality, yet producing images that appear visually authentic to real-world photography remains challenging. This is partly due to biases in existing evaluation paradigms: human ratings and preference-trained metrics often favor visually vivid images with exaggerated saturation and contrast, which make generations often too vivid to be real even when prompted for realistic-style images. To address this issue, we present Color Fidelity Dataset (CFD) and Color Fidelity Metric (CFM) for objective evaluation of color fidelity in realistic-style generations. CFD contains over 1.3M real and synthetic images with ordered levels of color realism, while CFM employs a multimodal encoder to learn perceptual color fidelity. In addition, we propose a training-free Color Fidelity Refinement (CFR) that adaptively modulates spatial-temporal guidance scale in generation, thereby enhancing color authenticity. Together, CFD supports CFM for assessment, whose learned attention further guides CFR to refine T2I fidelity, forming a progressive framework for assessing and improving color fidelity in realistic-style T2I generation. The dataset and code are available at https://github.com/ZhengyaoFang/CFM.

[209] Does AI See like Art Historians? Interpreting How Vision Language Models Recognize Artistic Style

Marvin Limpijankit, Milad Alshomary, Yassin Oulad Daoud, Amith Ananthram, Tim Trombley, Elias Stengel-Eskin, Mohit Bansal, Noam M. Elcott, Kathleen McKeown

Main category: cs.CV

TL;DR: VLMs show strong art analysis capabilities; this interdisciplinary study examines how VLMs predict artistic style and compares their mechanisms to art historians’ reasoning through latent-space decomposition and expert evaluation.

Details

Motivation: To understand the mechanisms behind VLMs' ability to predict artistic style and assess whether these mechanisms align with the criteria used by art historians when reasoning about artistic style, bridging computer science and art history.

Method: Employed a latent-space decomposition approach to identify concepts driving art style prediction, followed by quantitative evaluations, causal analysis, and assessment by art historians to evaluate concept coherence and relevance.

Result: 73% of extracted concepts were judged by art historians to exhibit coherent and semantically meaningful visual features, and 90% of concepts used to predict style of a given artwork were judged relevant. When irrelevant concepts successfully predicted style, art historians identified possible formal interpretations (e.g., dark/light contrasts).

Conclusion: VLMs demonstrate meaningful understanding of artistic style, with most identified concepts aligning with art historical reasoning, though some differences in conceptual interpretation exist between computational models and human experts.

Abstract: VLMs have become increasingly proficient at a range of computer vision tasks, such as visual question answering and object detection. This includes increasingly strong capabilities in the domain of art, from analyzing artwork to generation of art. In an interdisciplinary collaboration between computer scientists and art historians, we characterize the mechanisms underlying VLMs’ ability to predict artistic style and assess the extent to which they align with the criteria art historians use to reason about artistic style. We employ a latent-space decomposition approach to identify concepts that drive art style prediction and conduct quantitative evaluations, causal analysis and assessment by art historians. Our findings indicate that 73% of the extracted concepts are judged by art historians to exhibit a coherent and semantically meaningful visual feature and 90% of concepts used to predict style of a given artwork were judged relevant. In cases where an irrelevant concept was used to successfully predict style, art historians identified possible reasons for its success; for example, the model might “understand” a concept in more formal terms, such as dark/light contrasts.

[210] DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan

Main category: cs.CV

TL;DR: DynVLA is a driving vision-language-action model that introduces Dynamics Chain-of-Thought, forecasting compact world dynamics tokens before action generation for more physically grounded decision-making in autonomous driving scenarios.

Details

Motivation: Existing CoT paradigms for VLAs have limitations: Textual CoT lacks fine-grained spatiotemporal understanding, while Visual CoT introduces substantial redundancy through dense image prediction. There's a need for a more compact, interpretable, and efficient representation of world evolution for driving decision-making.

Method: Introduces Dynamics CoT paradigm with a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Decouples ego-centric and environment-centric dynamics for accurate modeling. Trains using supervised fine-tuning and reinforcement fine-tuning to generate dynamics tokens before actions while maintaining latency-efficient inference.

Result: Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

Conclusion: Dynamics CoT provides a compact, interpretable, and efficient representation of world evolution that enables more informed and physically grounded decision-making in driving scenarios, outperforming existing CoT approaches.

Abstract: We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.

[211] Agentar-Fin-OCR

Siyi Qian, Xiongfei Bai, Bingtao Fu, Yichen Lu, Gaoyang Zhang, Xudong Yang, Peng Zhang

Main category: cs.CV

TL;DR: Agentar-Fin-OCR is a specialized document parsing system for financial PDFs with cross-page consolidation, hierarchical structure reconstruction, and advanced table parsing using curriculum learning and cell localization without external detectors.

Details

Motivation: Financial documents present unique challenges including complex layouts, cross-page structural discontinuities, and cell-level referencing needs that existing OCR/document parsing systems don't adequately address, requiring specialized solutions for reliable financial applications.

Method: Combines (1) Cross-page Contents Consolidation algorithm and Document-level Heading Hierarchy Reconstruction for structure-aware retrieval, and (2) difficulty-adaptive curriculum learning for table parsing with CellBBoxRegressor module using structural anchor tokens to localize table cells from decoder hidden states.

Result: Shows high performance on OmniDocBench table parsing metrics and introduces FinDocBench benchmark with six financial document categories, expert-verified annotations, and specialized evaluation metrics (TocEDS, cross-page TEDS, C-IoU).

Conclusion: Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications, addressing finance-specific challenges that existing models struggle with.

Abstract: In this paper, we propose Agentar-Fin-OCR, a document parsing system tailored to financial-domain documents, transforming ultra-long financial PDFs into semantically consistent, highly accurate, structured outputs with auditing-grade provenance. To address finance-specific challenges such as complex layouts, cross-page structural discontinuities, and cell-level referencing capability, Agentar-Fin-OCR combines (1) a Cross-page Contents Consolidation algorithm to restore continuity across pages and a Document-level Heading Hierarchy Reconstruction (DHR) module to build a globally consistent Table of Contents (TOC) tree for structure-aware retrieval, and (2) a difficulty-adaptive curriculum learning training strategy for table parsing, together with a CellBBoxRegressor module that uses structural anchor tokens to localize table cells from decoder hidden states without external detectors. Experiments demonstrate that our model shows high performance on the table parsing metrics of OmniDocBench. To enable realistic evaluation in the financial vertical, we further introduce FinDocBench, a benchmark that includes six financial document categories with expert-verified annotations and evaluation metrics including Table of Contents edit-distance-based similarity (TocEDS), cross-page concatenated TEDS, and Table Cell Intersection over Union (C-IoU). We evaluate a wide range of state-of-the-art models on FinDocBench to assess their capabilities and remaining limitations on financial documents. Overall, Agentar-Fin-OCR and FinDocBench provide a practical foundation for reliable downstream financial document applications.

[212] LiTo: Surface Light Field Tokenization

Jen-Hao Rick Chang, Xiaoming Zhao, Dorian Chan, Oncel Tuzel

Main category: cs.CV

TL;DR: A 3D latent representation that jointly models object geometry and view-dependent appearance by encoding RGB-depth images as surface light field samples, enabling realistic view-dependent effects and generation from single images.

Details

Motivation: Most prior works focus on either 3D geometry reconstruction or view-independent diffuse appearance, struggling to capture realistic view-dependent effects like specular highlights and Fresnel reflections under complex lighting.

Method: Encodes random subsamples of surface light fields from RGB-depth images into compact latent vectors, creating unified 3D latent space for geometry and appearance. Trains latent flow matching model on this representation conditioned on single input images.

Result: Achieves higher visual quality and better input fidelity than existing methods, reproducing realistic view-dependent effects including specular highlights and Fresnel reflections.

Conclusion: Proposed 3D latent representation successfully models both geometry and view-dependent appearance in unified space, enabling generation of 3D objects with appearance consistent with input lighting and materials.

Abstract: We propose a 3D latent representation that jointly models object geometry and view-dependent appearance. Most prior works focus on either reconstructing 3D geometry or predicting view-independent diffuse appearance, and thus struggle to capture realistic view-dependent effects. Our approach leverages that RGB-depth images provide samples of a surface light field. By encoding random subsamples of this surface light field into a compact set of latent vectors, our model learns to represent both geometry and appearance within a unified 3D latent space. This representation reproduces view-dependent effects such as specular highlights and Fresnel reflections under complex lighting. We further train a latent flow matching model on this representation to learn its distribution conditioned on a single input image, enabling the generation of 3D objects with appearances consistent with the lighting and materials in the input. Experiments show that our approach achieves higher visual quality and better input fidelity than existing methods.

[213] In Pursuit of Many: A Review of Modern Multiple Object Tracking Systems

Mk Bashar, Samia Islam, Kashifa Kawaakib Hussain, Md. Bakhtiar Hasan, A. B. M. Ashikur Rahman, Md. Hasanul Kabir

Main category: cs.CV

TL;DR: A comprehensive survey of Multiple Object Tracking (MOT) covering historical progression, architectural directions, benchmark trends, evaluation practices, and emerging directions including foundation-model integration and multimodal tracking.

Details

Motivation: MOT is essential for various applications but faces challenges like occlusion, appearance ambiguity, and identity switching. The paper aims to synthesize recent progress and organize methods to provide a comprehensive overview of the field.

Method: Survey methodology organizing MOT methods around problems they target and paradigms they adopt. Covers historical progression from tracking-by-detection to end-to-end designs, architectural directions (transformer-based, diffusion, state-space, Siamese, graph-based), benchmark trends, and evaluation practices.

Result: Comprehensive analysis of MOT field showing shift from saturated pedestrian benchmarks to challenge-driven datasets, emergence of new architectural paradigms, and evolution of evaluation metrics toward motion- and safety-centric approaches.

Conclusion: Identifies emerging directions including foundation-model integration, open-vocabulary and multimodal tracking, unified evaluation, and domain-adaptive methods that will shape future MOT research and real-world deployment.

Abstract: Multiple Object Tracking (MOT) is a core capability in modern computer vision, essential to autonomous driving, surveillance, sports analytics, robotics, and biomedical imaging. Persistent identity assignment across frames remains challenging in real scenes because of occlusion, dense crowds, appearance ambiguity, scale variation, camera motion, and identity switching. In this survey we synthesize recent progress by organizing methods around the problems they target and the paradigms they adopt. We cover the historical progression from tracking-by-detection to hybrid and end-to-end designs, and we summarize major architectural directions including transformer-based trackers, generative/diffusion formulations, state-space predictors, Siamese and graph-based models, and the growing impact of foundation models for detection and representation. We review benchmark trends that motivate method design, documenting the shift from saturated pedestrian benchmarks to challenge-driven and domain-specific datasets and we analyze evaluation practice by comparing classic and newer motion- and safety-centric metrics. Finally, we connect algorithmic trends to practical deployment constraints and outline emerging directions, foundation-model integration, open-vocabulary and multimodal tracking, unified evaluation, and domain-adaptive methods, that we believe will shape MOT research and real-world adoption.

[214] An Overview about Emerging Technologies of Autonomous Driving

Yu Huang, Yue Chen, Zijiang Yang

Main category: cs.CV

TL;DR: Survey paper on autonomous driving technologies covering perception, mapping, prediction, planning, control, simulation, V2X, and safety within a data closed-loop framework.

Details

Motivation: To provide a comprehensive overview of autonomous driving technologies and open problems, addressing the long-tail challenges in this active AI application field.

Method: Survey methodology analyzing major self-driving system components (perception, mapping, localization, prediction, planning, control, simulation, V2X, safety) within a data closed-loop framework.

Result: Comprehensive review of autonomous driving technologies, identifying key technical aspects, current approaches, and remaining challenges in the field.

Conclusion: Autonomous driving remains a complex AI application with many open problems; the data closed-loop framework is crucial for addressing long-tail challenges.

Abstract: Since DARPA started Grand Challenges in 2004 and Urban Challenges in 2007, autonomous driving has been the most active field of AI applications. This paper gives an overview about technical aspects of autonomous driving technologies and open problems. We investigate the major fields of self-driving systems, such as perception, mapping and localization, prediction, planning and control, simulation, V2X and safety etc. Especially we elaborate on all these issues in a framework of data closed loop, a popular platform to solve the long tailed autonomous driving problems.

[215] Sketch-Guided Stylized Landscape Cinemagraph Synthesis

Hao Jin, Hengyuan Chang, Xiaoxuan Xie, Zhengyang Wang, Xusheng Du, Shaojun Hu, Haoran Xie

Main category: cs.CV

TL;DR: Sketch2Cinemagraph: A sketch-guided framework for generating stylized cinemagraphs with spatial and motion control from freehand sketches using latent diffusion models.

Details

Motivation: Designing stylized cinemagraphs is challenging due to difficulty in customizing complex flow elements. Sketches provide intuitive control beyond text inputs for personalized design requirements.

Method: Uses latent diffusion model for initial landscape generation, object detection for flow region masks, latent motion diffusion model for motion field estimation controlled by sketches, and U-Net frame generator for pixel warping in fluid regions.

Result: Generates aesthetically appealing stylized cinemagraphs with continuous temporal flow from sketch inputs, verified through qualitative and quantitative comparisons against state-of-the-art approaches.

Conclusion: Sketch2Cinemagraph enables intuitive sketch-guided control for generating stylized cinemagraphs with both spatial and motion customization.

Abstract: Designing stylized cinemagraphs is challenging due to the difficulty in customizing complex and expressive flow elements. To achieve intuitive and detailed control of the generated cinemagraphs, sketches provide a feasible solution to convey personalized design requirements beyond text inputs. In this paper, we propose Sketch2Cinemagraph, a sketch-guided framework that enables the conditional generation of stylized cinemagraphs from freehand sketches. Sketch2Cinemagraph adopts text prompts for initial landscape generation and provides sketch controls for both spatial and motion cues. The latent diffusion model first generates target stylized landscape images along with realistic versions. Then, a pre-trained object detection model obtains masks for the flow regions. We propose a latent motion diffusion model to estimate motion field in fluid regions of the generated landscape images. The input motion sketches serve as the conditions to control the generated motion fields in the masked fluid regions with the prompt. To synthesize cinemagraph frames, the pixels within fluid regions are warped to target locations at each timestep using a U-Net based frame generator. The results verified that Sketch2Cinemagraph can generate aesthetically appealing stylized cinemagraphs with continuous temporal flow from sketch inputs. We showcase the advantages of Sketch2Cinemagraph through qualitative and quantitative comparisons against the state-of-the-art approaches.

[216] Leveraging Spatial Context for Positive Pair Sampling in Histopathology Image Representation Learning

Willmer Rafell Quinones Robles, Sakonporn Noree, Jongwoo Kim, Young Sin Ko, Bryan Wong, Mun Yong Yi

Main category: cs.CV

TL;DR: A spatial context-driven positive pair sampling strategy for self-supervised learning in computational pathology that leverages morphological coherence of adjacent patches in whole-slide images, improving classification performance by 5-10% over standard augmentation-based methods.

Details

Motivation: Deep learning for cancer classification from whole-slide images requires extensive expert annotations. While annotation-free approaches like multiple instance learning and self-supervised learning exist, conventional SSL methods rely on synthetic data augmentations that may fail to capture the critical spatial structure of histopathology images.

Method: Proposes a spatial context-driven positive pair sampling strategy that enhances SSL by leveraging morphological coherence of spatially adjacent patches within whole-slide images. The method is modular and compatible with established joint embedding SSL frameworks including Barlow Twins, BYOL, VICReg, and DINOv2.

Result: Experiments across four datasets show consistent performance improvements with accuracy gains of 5% to 10% compared to standard augmentation-based sampling, evaluated on both slide-level classification using MIL and patch-level linear probing.

Conclusion: The work highlights the value of spatial context in improving representation learning for computational pathology and provides a biologically meaningful enhancement for pretraining models in annotation-limited settings.

Abstract: Deep learning has shown strong potential in cancer classification from whole-slide images (WSIs), but the need for extensive expert annotations often limits its success. Annotation-free approaches, such as multiple instance learning (MIL) and self-supervised learning (SSL), have emerged as promising alternatives to traditional annotation-based methods. However, conventional SSL methods typically rely on synthetic data augmentations, which may fail to capture the spatial structure critical to histopathology. In this work, we propose a spatial context-driven positive pair sampling strategy that enhances SSL by leveraging the morphological coherence of spatially adjacent patches within WSIs. Our method is modular and compatible with established joint embedding SSL frameworks, including Barlow Twins, BYOL, VICReg, and DINOv2. We evaluate its effectiveness on both slide-level classification using MIL and patch-level linear probing. Experiments across four datasets demonstrate consistent performance improvements, with accuracy gains of 5% to 10% compared to standard augmentation-based sampling. These findings highlight the value of spatial context in improving representation learning for computational pathology and provide a biologically meaningful enhancement for pretraining models in annotation-limited settings. The code is available at https://anonymous.4open.science/r/contextual-pairs-E72F/.

[217] Rethinking Two-Stage Referring-by-Tracking in Referring Multi-Object Tracking: Make it Strong Again

Weize Li, Yunhao Du, Qixiang Yin, Zhicheng Zhao, Fei Su

Main category: cs.CV

TL;DR: FlexHook is a novel two-stage Referring-by-Tracking framework that addresses limitations in existing approaches through a Conditioning Hook for better feature construction and a Pairwise Correspondence Decoder for robust correspondence modeling.

Details

Motivation: The motivation is to revive the two-stage Referring-by-Tracking paradigm, which has lost popularity to one-stage methods despite its advantages in lower training cost and flexible incremental deployment. The authors identify two fundamental limitations in existing two-stage frameworks: overly heuristic feature construction and fragile correspondence modeling.

Method: FlexHook introduces two key components: 1) Conditioning Hook (C-Hook) that redefines feature construction using a sampling-based strategy and language-conditioned cue injection, and 2) Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling for more flexible and robust tracking.

Result: Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods.

Conclusion: FlexHook successfully addresses the limitations of existing two-stage RBT frameworks and demonstrates superior performance across multiple benchmarks, making it a competitive alternative to one-stage methods while maintaining the advantages of the two-stage paradigm.

Abstract: Referring Multi-Object Tracking (RMOT) aims to track multiple objects specified by natural language expressions in videos. With the recent significant progress of one-stage methods, the two-stage Referring-by-Tracking (RBT) paradigm has gradually lost its popularity. However, its lower training cost and flexible incremental deployment remain irreplaceable. Rethinking existing two-stage RBT frameworks, we identify two fundamental limitations: the overly heuristic feature construction and fragile correspondence modeling. To address these issues, we propose FlexHook, a novel two-stage RBT framework. In FlexHook, the proposed Conditioning Hook (C-Hook) redefines the feature construction by a sampling-based strategy and language-conditioned cue injection. Then, we introduce a Pairwise Correspondence Decoder (PCD) that replaces CLIP-based similarity matching with active correspondence modeling, yielding a more flexible and robust strategy. Extensive experiments on multiple benchmarks (Refer-KITTI/v2, Refer-Dance, and LaMOT) demonstrate that FlexHook becomes the first two-stage RBT approach to comprehensively outperform current state-of-the-art methods. Code can be found in the https://github.com/buptLwz/FlexHook.

[218] Enhanced Continual Learning of Vision-Language Models with Model Fusion

Haoyuan Gao, Zicong Zhang, Yuqi Wei, Linglan Zhao, Guilin Li, Yexin Li, Bo Wang, Linghe Kong, Weiran Huang

Main category: cs.CV

TL;DR: ConDU: A continual learning approach for Vision-Language Models that uses model fusion to prevent catastrophic forgetting while maintaining zero-shot capabilities.

Details

Motivation: VLMs suffer from catastrophic forgetting when fine-tuned sequentially on multiple tasks. Existing continual learning methods have limitations like requiring extra datasets, compromising zero-shot performance, or being restricted to parameter-efficient tuning.

Method: Proposes Continual Decoupling-Unifying (ConDU) approach that maintains a unified model with task triggers and prototype sets. Uses iterative process of decoupling task experts for previous tasks and unifying them with new task expert. Also introduces inference strategy for zero-shot scenarios by aggregating predictions from multiple decoupled task experts.

Result: Extensive experiments on MTIL benchmark show ConDU achieves up to 2% improvement in average performance across all seen tasks compared to SOTA baselines, while enhancing zero-shot capabilities relative to original VLM.

Conclusion: ConDU effectively addresses catastrophic forgetting in VLMs through model fusion approach, improving both task-specific performance and zero-shot capabilities without requiring additional reference datasets.

Abstract: Vision-Language Models (VLMs) represent a significant breakthrough in artificial intelligence by integrating visual and textual modalities to achieve impressive zero-shot capabilities. However, VLMs are susceptible to catastrophic forgetting when sequentially fine-tuned on multiple downstream tasks. Existing continual learning methods for VLMs face various limitations, often relying on additional reference datasets, compromising zero-shot performance, or being restricted to parameter-efficient fine-tuning scenarios. In this paper, we propose a novel Continual Decoupling-Unifying (ConDU) approach that pioneers the use of model fusion for continual learning in VLMs. Specifically, ConDU maintains a unified model along with task triggers and prototype sets, employing an iterative process of decoupling task experts for previous tasks and unifying them with the task expert for the newly learned task. Additionally, we introduce an inference strategy for zero-shot scenarios by aggregating predictions from multiple decoupled task experts. Extensive experiments on the MTIL benchmark show that ConDU achieves up to a 2% improvement in average performance across all seen tasks compared to state-of-the-art baselines, while also enhancing zero-shot capabilities relative to the original VLM. Our code is available at https://github.com/zhangzicong518/ConDU.

[219] Token-Level Constraint Boundary Search for Jailbreaking Text-to-Image Models

Jiangtao Liu, Zhaoxin Wang, Handing Wang, Cong Tian, Yaochu Jin

Main category: cs.CV

TL;DR: TCBS-Attack is a black-box jailbreak method that uses evolutionary search near decision boundaries to bypass safety defenses in text-to-image models.

Details

Motivation: Text-to-image models have safety concerns about generating harmful content, and current defenses combine prompt checkers, secure training, and image checkers. Jailbreaking these full-chain systems is challenging in black-box settings due to discrete token spaces, multiple constraints, sparse feedback, and limited queries.

Method: Token-level Constraint Boundary Search (TCBS)-Attack uses evolutionary search to find tokens near decision boundaries defined by text and image checkers. It incorporates boundaries as constraints to guide token population evolution, reducing search space while preserving semantic coherence.

Result: TCBS-Attack outperforms state-of-the-art jailbreak attacks across various T2I models, achieving ASR-4 of 52.5% and ASR-1 of 22.0% on full-chain T2I models, significantly surpassing baseline methods.

Conclusion: The proposed evolutionary search approach near decision boundaries effectively addresses the challenges of black-box jailbreaking in full-chain T2I systems, demonstrating superior performance over existing methods.

Abstract: Text-to-Image (T2I) generation has advanced rapidly in recent years, but they also raise safety concerns due to the potential production of harmful content. In the practical deployments, T2I services typically adopt full-chain defenses that combine a prompt checker, a securely trained generator, and a post-hoc image checker. Jailbreaking such full-chain systems is challenging in the black-box settings because prompt tokens form a discrete combinatorial space and the attack must satisfy multiple coupled constraints under sparse feedback and limited queries. To address these challenges, we propose Token-level Constraint Boundary Search (TCBS)-Attack, a novel query-based black-box jailbreak attack that searches for tokens located near the decision boundaries defined by text and image checkers. TCBS-Attack incorporates decision boundaries as constraint conditions to guide the evolutionary search of token populations, iteratively optimize tokens near these boundaries. Such evolutionary search process reduces the effective search space and improves query efficiency while preserving semantic coherence. Extensive experiments demonstrate that TCBS-Attack consistently outperforms state-of-the-art jailbreak attacks across various T2I models, including securely trained open-source models and commercial online services like DALL-E 3. TCBS-Attack achieves an ASR-4 of 52.5% and an ASR-1 of 22.0% on jailbreaking full-chain T2I models, significantly surpassing baseline methods.

[220] Unsupervised training of keypoint-agnostic descriptors for flexible retinal image registration

David Rivas-Villar, Álvaro S. Hervella, José Rouco, Jorge Novo

Main category: cs.CV

TL;DR: Unsupervised descriptor learning for medical image registration without keypoint detection dependency, validated on retinal fundus images with multiple detectors.

Details

Motivation: Medical image registration suffers from limited labeled data, especially in medical domains. Current approaches are constrained by labeled data requirements and keypoint detection dependencies, motivating unsupervised learning methods that are detector-agnostic.

Method: Developed a novel unsupervised descriptor learning method that doesn’t rely on keypoint detection, making the descriptor network agnostic to the keypoint detector used during registration inference. Tested with multiple keypoint detectors of varied nature, including some novel ones.

Result: The approach offers accurate registration without performance loss versus supervised methods. Demonstrates accurate performance regardless of the keypoint detector used, validated on the reference public retinal image registration dataset.

Conclusion: This work represents a notable step towards leveraging unsupervised learning in the medical domain for image registration, overcoming labeled data limitations and keypoint detector dependencies.

Abstract: Current color fundus image registration approaches are limited, among other things, by the lack of labeled data, which is even more significant in the medical domain, motivating the use of unsupervised learning. Therefore, in this work, we develop a novel unsupervised descriptor learning method that does not rely on keypoint detection. This enables the resulting descriptor network to be agnostic to the keypoint detector used during the registration inference. To validate this approach, we perform an extensive and comprehensive comparison on the reference public retinal image registration dataset. Additionally, we test our method with multiple keypoint detectors of varied nature, even proposing some novel ones. Our results demonstrate that the proposed approach offers accurate registration, not incurring in any performance loss versus supervised methods. Additionally, it demonstrates accurate performance regardless of the keypoint detector used. Thus, this work represents a notable step towards leveraging unsupervised learning in the medical domain.

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Samwoo Seong, Youngjae Yu, Yunsung Lee

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation due to technical error in accessing paper content

Method: No method information available due to API rate limiting error

Result: No results available - paper content inaccessible

Conclusion: Cannot analyze paper due to technical limitations in accessing arXiv data

Abstract: Failed to fetch summary for 2511.20216: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20216&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[222] Average Calibration Losses for Reliable Uncertainty in Medical Image Segmentation

Theodore Barfoot, Luis C. Garcia-Peraza-Herrera, Samet Akcay, Ben Glocker, Tom Vercauteren

Main category: cs.CV

TL;DR: Proposes differentiable marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss for medical image segmentation to improve calibration while maintaining segmentation performance.

Details

Motivation: Deep neural networks for medical image segmentation are often overconfident, compromising reliability and clinical utility. There's a need to improve calibration (alignment between predicted confidences and true accuracies) without sacrificing segmentation performance.

Method: Introduces differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed per-image. Compares hard- and soft-binning approaches to directly improve pixel-wise calibration. Also introduces dataset reliability histograms as an aggregation of per-image reliability diagrams for better analysis.

Result: Experiments on four medical imaging datasets (ACDC, AMOS, KiTS, BraTS) show that incorporating mL1-ACE significantly reduces calibration errors (ACE and MCE) while largely maintaining high Dice Similarity Coefficients. Soft-binned variant yields greatest calibration improvements but often compromises segmentation performance, while hard-binned mL1-ACE maintains segmentation performance with weaker calibration improvement.

Conclusion: The approach provides practitioners with explicit control over the calibration-accuracy trade-off, enabling more reliable integration of deep learning methods into clinical workflows by improving confidence calibration in medical image segmentation.

Abstract: Deep neural networks for medical image segmentation are often overconfident, compromising both reliability and clinical utility. In this work, we propose differentiable formulations of marginal L1 Average Calibration Error (mL1-ACE) as an auxiliary loss that can be computed on a per-image basis. We compare both hard- and soft-binning approaches to directly improve pixel-wise calibration. Our experiments on four datasets (ACDC, AMOS, KiTS, BraTS) demonstrate that incorporating mL1-ACE significantly reduces calibration errors, particularly Average Calibration Error (ACE) and Maximum Calibration Error (MCE), while largely maintaining high Dice Similarity Coefficients (DSCs). We find that the soft-binned variant yields the greatest improvements in calibration over the DSC plus cross-entropy loss baseline but often compromises segmentation performance, with hard-binned mL1-ACE maintaining segmentation performance, albeit with weaker calibration improvement. To gain further insight into calibration performance and its variability across an imaging dataset, we introduce dataset reliability histograms, an aggregation of per-image reliability diagrams. The resulting analysis highlights improved alignment between predicted confidences and true accuracies. Overall, our approach provides practitioners with explicit control over the calibration-accuracy trade-off, enabling more reliable integration of deep learning methods into clinical workflows. We share our code here: https://github.com/cai4cai/Average-Calibration-Losses

[223] SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models

Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li

Main category: cs.CV

TL;DR: SOTA is a training-free ensemble framework that integrates multiple foundation models (vision-language or vision-only) using self-adaptive optimal transport to balance their complementary strengths without requiring prior knowledge.

Details

Motivation: Two key observations motivate this work: (1) Vision-Language Models (VLMs) like CLIP often over-rely on textual priors and miss fine-grained visual cues, while Vision-only Foundation Models (VFMs) like DINO provide rich visual features but lack semantic alignment; (2) Different VLMs perform variably across datasets due to pre-training differences.

Method: SOTA uses self-adaptive optimal transport to learn transport plans that integrate outputs from multiple foundation models (VFMs or VLMs). It’s training-free, prior-free, and automatically balances model contributions without requiring additional training data or parameters.

Result: Extensive experiments across natural images, medical pathology, and remote sensing domains show SOTA consistently outperforms individual models by effectively leveraging complementary strengths of different foundation models.

Conclusion: SOTA provides a generalizable, training-free ensemble framework that addresses limitations of individual foundation models by adaptively combining their strengths through optimal transport, achieving substantial performance improvements across diverse domains.

Abstract: Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) \textit{Vision-Language Models} (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas \textit{Vision-only Foundation Models} (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose \textbf{SOTA} (\textit{Self-adaptive Optimal TrAnsport}), a \textit{training-free} ensemble framework that integrates the outputs of multiple foundation models~(VFMs or VLMs) by learning a self-adaptive transport plan. Notably, \textbf{SOTA} is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of \textbf{SOTA}. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.

[224] Locality-aware Parallel Decoding for Efficient Autoregressive Image Generation

Zhuoyang Zhang, Luke J. Huang, Chengyue Wu, Shang Yang, Kelly Peng, Yao Lu, Song Han

Main category: cs.CV

TL;DR: LPD accelerates autoregressive image generation by enabling parallel patch prediction through flexible modeling and locality-aware ordering, reducing steps from 256 to 20 for 256x256 images while maintaining quality.

Details

Motivation: Traditional autoregressive image generation suffers from high latency due to memory-bound next-patch prediction. Existing parallelization attempts achieve limited speedup while maintaining quality.

Method: Two key techniques: 1) Flexible Parallelized Autoregressive Modeling using learnable position query tokens for arbitrary generation ordering and parallelization with mutual visibility among concurrent tokens. 2) Locality-aware Generation Ordering that groups patches to minimize intra-group dependencies and maximize contextual support.

Result: Reduces generation steps from 256 to 20 (256x256) and 1024 to 48 (512x512) on ImageNet class-conditional generation without quality compromise, achieving at least 3.4× lower latency than previous parallelized autoregressive models.

Conclusion: LPD enables high parallelization in autoregressive image generation while maintaining quality, significantly reducing latency through flexible modeling and intelligent generation ordering.

Abstract: We present Locality-aware Parallel Decoding (LPD) to accelerate autoregressive image generation. Traditional autoregressive image generation relies on next-patch prediction, a memory-bound process that leads to high latency. Existing works have tried to parallelize next-patch prediction by shifting to multi-patch prediction to accelerate the process, but only achieved limited parallelization. To achieve high parallelization while maintaining generation quality, we introduce two key techniques: (1) Flexible Parallelized Autoregressive Modeling, a novel architecture that enables arbitrary generation ordering and degrees of parallelization. It uses learnable position query tokens to guide generation at target positions while ensuring mutual visibility among concurrently generated tokens for consistent parallel decoding. (2) Locality-aware Generation Ordering, a novel schedule that forms groups to minimize intra-group dependencies and maximize contextual support, enhancing generation quality. With these designs, we reduce the generation steps from 256 to 20 (256$\times$256 res.) and 1024 to 48 (512$\times$512 res.) without compromising quality on the ImageNet class-conditional generation, and achieving at least 3.4$\times$ lower latency than previous parallelized autoregressive models.

[225] A Survey on Interpretability in Visual Recognition

Qiyang Wan, Chengzhi Gao, Ruiping Wang, Xilin Chen

Main category: cs.CV

TL;DR: A systematic survey of XAI in visual recognition, establishing a human-centered taxonomy and exploring evaluation metrics, multimodal LLM interpretability, and practical applications.

Details

Motivation: The growing deployment of visual recognition models in safety-critical areas like autonomous driving and medical diagnostics has accelerated the need for eXplainable AI (XAI), particularly at the intersection of vision and language which form the cornerstones of multimodal intelligence.

Method: Establishes a multi-dimensional taxonomy from a human-centered perspective based on intent, object, presentation, and methodology. Conducts extensive qualitative assessment across categories and quantitative benchmarks within specific dimensions. Explores interpretability of Multimodal Large Language Models and practical applications.

Result: Provides a comprehensive survey that synthesizes diverse perspectives on XAI in visual recognition, including critical evaluation desiderata and metrics, with both qualitative assessments and quantitative benchmarks.

Conclusion: The survey offers an insightful roadmap to inspire future research on the interpretability of visual recognition models by systematically categorizing approaches, evaluating methods, and identifying emerging trends and opportunities in the field.

Abstract: Visual recognition models have achieved unprecedented success in various tasks. While researchers aim to understand the underlying mechanisms of these models, the growing demand for deployment in safety-critical areas like autonomous driving and medical diagnostics has accelerated the development of eXplainable AI (XAI). Distinct from generic XAI, visual recognition XAI is positioned at the intersection of vision and language, which represent the two most fundamental human modalities and form the cornerstones of multimodal intelligence. This paper provides a systematic survey of XAI in visual recognition by establishing a multi-dimensional taxonomy from a human-centered perspective based on intent, object, presentation, and methodology. Beyond categorization, we summarize critical evaluation desiderata and metrics, conducting an extensive qualitative assessment across different categories and demonstrating quantitative benchmarks within specific dimensions. Furthermore, we explore the interpretability of Multimodal Large Language Models and practical applications, identifying emerging trends and opportunities. By synthesizing these diverse perspectives, this survey provides an insightful roadmap to inspire future research on the interpretability of visual recognition models.

[226] Content-Aware Mamba for Learned Image Compression

Yunuo Chen, Zezheng Lyu, Bing He, Hongwei Hu, Qi Wang, Yuan Tian, Li Song, Wenjun Zhang, Guo Lu

Main category: cs.CV

TL;DR: CAM introduces content-aware Mamba for image compression with adaptive token permutation and global priors to better capture global redundancy while maintaining linear complexity.

Details

Motivation: Standard Mamba's rigid, content-agnostic raster scans and strict causality limit its ability to effectively eliminate redundancy between content-correlated but spatially distant tokens in image compression.

Method: Two novel mechanisms: 1) Content-adaptive token permutation strategy replaces rigid scans to prioritize interactions between content-similar tokens regardless of location; 2) Injection of sample-specific global priors into state-space model to mitigate strict causality without multi-directional scans.

Result: CMIC achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by 15.91%, 21.34%, and 17.58% in BD-rate on Kodak, Tecnick, and CLIC datasets respectively.

Conclusion: Content-Aware Mamba enables better global redundancy capture while preserving computational efficiency, advancing learned image compression with state-space models.

Abstract: Recent learned image compression (LIC) leverages Mamba-style state-space models (SSMs) for global receptive fields with linear complexity. However, the standard Mamba adopts content-agnostic, predefined raster (or multi-directional) scans under strict causality. This rigidity hinders its ability to effectively eliminate redundancy between tokens that are content-correlated but spatially distant. We introduce Content-Aware Mamba (CAM), an SSM that dynamically adapts its processing to the image content. Specifically, CAM overcomes prior limitations with two novel mechanisms. First, it replaces the rigid scan with a content-adaptive token permutation strategy to prioritize interactions between content-similar tokens regardless of their location. Second, it overcomes the sequential dependency by injecting sample-specific global priors into the state-space model, which effectively mitigates the strict causality without multi-directional scans. These innovations enable CAM to better capture global redundancy while preserving computational efficiency. Our Content-Aware Mamba-based LIC model (CMIC) achieves state-of-the-art rate-distortion performance, surpassing VTM-21.0 by 15.91%, 21.34%, and 17.58% in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively. Code will be released at https://github.com/UnoC-727/CMIC.

[227] Speech-to-LaTeX: New Models and Datasets for Converting Spoken Equations and Sentences

Dmitrii Korzh, Dmitrii Tarasov, Artyom Iudin, Elvir Karimov, Matvey Skripkin, Nikita Kuzmin, Andrey Kuznetsov, Oleg Y. Rogov, Ivan Oseledets

Main category: cs.CV

TL;DR: First open-source large-scale dataset for converting spoken mathematics to LaTeX with 66k+ audio samples in English/Russian, introducing audio language models that outperform prior work on new benchmarks.

Details

Motivation: Spoken mathematical expression conversion is challenging due to structured symbolic representation requirements and pronunciation ambiguity. While ASR and LMs have advanced, converting spoken math to LaTeX remains underexplored despite applications in education/research. Prior work has limitations: requires 2 transcriptions, focuses only on isolated equations, limited test sets, no training data, and lacks multilingual coverage.

Method: Created first fully open-source large-scale dataset with 66,000+ human-annotated audio samples of mathematical equations and sentences in English and Russian from diverse scientific domains. Applied ASR post-correction models, few-shot prompting, and audio language models for conversion.

Result: Audio language models achieved comparable CER to MathSpeech benchmark (28% vs 30%) for equations. On new S2L-equations benchmark, models outperformed MathSpeech by >36 percentage points (27% vs 64% CER). Established first benchmark for mathematical sentence recognition (S2L-sentences) with 40% equation CER.

Conclusion: This work provides foundational resources for spoken mathematics conversion and lays groundwork for future multimodal AI advances, particularly in mathematical content recognition.

Abstract: Conversion of spoken mathematical expressions is a challenging task that involves transcribing speech into a strictly structured symbolic representation while addressing the ambiguity inherent in the pronunciation of equations. Although significant progress has been achieved in automatic speech recognition (ASR) and language models (LM), the problem of converting spoken mathematics into LaTeX remains underexplored. This task directly applies to educational and research domains, such as lecture transcription or note creation. Based on ASR post-correction, prior work requires 2 transcriptions, focuses only on isolated equations, has a limited test set, and provides neither training data nor multilingual coverage. To address these issues, we present the first fully open-source large-scale dataset, comprising over 66,000 human-annotated audio samples of mathematical equations and sentences in English and Russian, drawn from diverse scientific domains. In addition to the ASR post-correction models and few-shot prompting, we apply audio language models, demonstrating comparable character error rate (CER) results on the MathSpeech benchmark (28% vs. 30%) for the equations conversion. In contrast, on the proposed S2L-equations benchmark, our models outperform the MathSpeech model by a substantial margin of more than 36 percentage points, even after accounting for LaTeX formatting artifacts (27% vs. 64%). We establish the first benchmark for mathematical sentence recognition (S2L-sentences) and achieve an equation CER of 40%. This work lays the groundwork for future advances in multimodal AI, with a particular focus on mathematical content recognition.

[228] DSER: Spectral Epipolar Representation for Efficient Light Field Depth Estimation

Noor Islam S. Mohammad, Md Muntaqim Meherab

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2508.08900: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.08900&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[229] PD-Diag-Net: Clinical-Priors guided Network on Brain MRI for Auxiliary Diagnosis of Parkinson’s Disease

Shuai Shao, Yan Wang, Shu Jiang, Shiyuan Zhao, Di Yang, Jiangtao Wang, Yutong Bai, Jianguo Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to retrieval error

Method: Unable to determine method due to retrieval error

Result: Unable to determine results due to retrieval error

Conclusion: Unable to determine conclusion due to retrieval error

Abstract: Failed to fetch summary for 2509.23719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[230] Seeing Space and Motion: Enhancing Latent Actions with Geometric and Dynamic Awareness for Vision-Language-Action Models

Zhejia Cai, Yandan Yang, Xinyuan Chang, Shiyi Liang, Ronghan Chen, Feng Xiong, Mu Xu, Ruqi Huang

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2509.26251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[231] Adaptive Event Stream Slicing for Open-Vocabulary Event-Based Object Detection via Vision-Language Knowledge Distillation

Jinchang Zhang, Zijun Li, Jiakai Lin, Guoyu Lu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method without access to paper content

Result: No results available due to technical limitations in accessing paper information

Conclusion: Cannot draw conclusions about paper relevance without access to the actual content

Abstract: Failed to fetch summary for 2510.00681: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00681&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[232] Equivariant Splitting: Self-supervised learning from incomplete data

Victor Sechaud, Jérémy Scanvic, Quentin Barthélemy, Patrice Abry, Julián Tachella

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2510.00929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[233] MonitorVLM:A Vision Language Framework for Safety Violation Detection in Mining Operations

Jiang Wu, Sichao Wu, Yinsong Ma, Guangyuan Yu, Haoyuan Xu, Lifang Zheng, Jingliang Duan

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper with arXiv ID 2510.03666 could not be retrieved for analysis.

Details

Motivation: Cannot determine motivation as paper content is unavailable due to rate limiting from arXiv API.

Method: Cannot determine method as paper content is unavailable due to rate limiting from arXiv API.

Result: Cannot determine results as paper content is unavailable due to rate limiting from arXiv API.

Conclusion: Cannot draw conclusions about the paper as content retrieval failed due to HTTP 429 error (too many requests).

Abstract: Failed to fetch summary for 2510.03666: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03666&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[234] InstantSfM: Towards GPU-Native SfM for the Deep Learning Era

Jiankun Zhong, Zitong Zhan, Quankai Gao, Ziyu Chen, Haozhe Lou, Jiageng Mao, Ulrich Neumann, Chen Wang, Yue Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.13310: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13310&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[235] MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion

Minjung Shin, Hyunin Cho, Sooyeon Go, Jin-Hwa Kim, Youngjung Uh

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2510.13702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[236] UltraGen: Efficient Ultra-High-Resolution Image Generation with Hierarchical Local Attention

Yuyao Zhang, Yu-Wing Tai

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2510.16325: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16325&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[237] REALM: An MLLM-Agent Framework for Open World 3D Reasoning Segmentation and Editing on Gaussian Splatting

Changyue Shi, Minghao Chen, Yiping Mao, Chuxiao Yang, Xinyuan Hu, Jiajun Ding, Zhou Yu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.16410: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16410&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[238] DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu

Main category: cs.CV

TL;DR: Paper 2511.05271: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2511.05271: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05271&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[239] D-GAP: Improving Out-of-Domain Robustness via Dataset-Agnostic and Gradient-Guided Augmentation in Frequency and Pixel Spaces

Ruoqi Wang, Haitao Wang, Shaojie Guo, Qiong Luo

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.11286: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11286&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[240] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu, Yueming Jin

Main category: cs.CV

TL;DR: Unable to analyze paper 2511.12110 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract retrieval failed

Method: Cannot determine method as abstract retrieval failed

Result: Cannot determine results as abstract retrieval failed

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.12110: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.12110&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[241] X-WIN: Building Chest Radiograph World Model via Predictive Sensing

Zefan Yang, Ge Wang, James Hendler, Mannudeep K. Kalra, Pingkun Yan

Main category: cs.CV

TL;DR: X-WIN: A CXR world model that distills 3D CT knowledge to predict 2D projections, improving medical imaging representation learning and disease diagnosis.

Details

Motivation: Chest X-rays (CXRs) are limited by being 2D projections that lose 3D anatomical information due to structural superposition, making representation learning and disease diagnosis challenging. The paper aims to address this by incorporating 3D volumetric knowledge from CT scans.

Method: Proposes X-WIN, a CXR world model that learns to predict 2D CT projections in latent space. Uses affinity-guided contrastive alignment loss to capture correlated information across projections from the same volume. Incorporates real CXRs through masked image modeling and employs a domain classifier to align representations between real and simulated CXRs.

Result: X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. Demonstrates ability to render 2D projections for reconstructing 3D CT volumes.

Conclusion: X-WIN successfully incorporates 3D anatomical knowledge into 2D CXR analysis, improving representation learning and diagnostic capabilities while enabling 3D reconstruction from 2D projections.

Abstract: Chest X-ray radiography (CXR) is an essential medical imaging technique for disease diagnosis. However, as 2D projectional images, CXRs are limited by structural superposition and hence fail to capture 3D anatomies. This limitation makes representation learning and disease diagnosis challenging. To address this challenge, we propose a novel CXR world model named X-WIN, which distills volumetric knowledge from chest computed tomography (CT) by learning to predict its 2D projections in latent space. The core idea is that a world model with internalized knowledge of 3D anatomical structure can predict CXRs under various transformations in 3D space. During projection prediction, we introduce an affinity-guided contrastive alignment loss that leverages mutual similarities to capture rich, correlated information across projections from the same volume. To improve model adaptability, we incorporate real CXRs into training through masked image modeling and employ a domain classifier to encourage statistically similar representations for real and simulated CXRs. Comprehensive experiments show that X-WIN outperforms existing foundation models on diverse downstream tasks using linear probing and few-shot fine-tuning. X-WIN also demonstrates the ability to render 2D projections for reconstructing a 3D CT volume.

[242] REMSA: Foundation Model Selection for Remote Sensing via a Constraint-Aware Agent

Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.17442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[243] Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization

Xingyue Lin, Shuai Peng, Xiangyu Xie, Jianhua Zhu, Yuxuan Zhou, Liangcai Gao

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.20034: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20034&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[244] AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

Tianyi Yan, Tao Tang, Xingtai Gui, Yongkang Li, Jiasen Zhesng, Weiyao Huang, Lingdong Kong, Wencheng Han, Xia Zhou, Xueyang Zhang, Yifei Zhan, Kun Zhan, Cheng-zhong Xu, Jianbing Shen

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.20325: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20325&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[245] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He, Guanyu Hou, Hongwei Li, Zhicong Huang, Kangjie Chen, Yi Yu, Wenbo Jiang, Guowen Xu, Tianwei Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2511.21145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[246] World Models That Know When They Don’t Know - Controllable Video Generation with Calibrated Uncertainty

Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2512.05927: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05927&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[247] Ultra-Low Bitrate Perceptual Image Compression with Shallow Encoder

Tianyu Zhang, Dong Liu, Chang Wen Chen

Main category: cs.CV

TL;DR: Unable to analyze paper 2512.12229 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting error

Method: Cannot determine method as abstract is unavailable due to rate limiting error

Result: Cannot determine results as abstract is unavailable due to rate limiting error

Conclusion: Cannot draw conclusions about the paper due to inability to access the abstract

Abstract: Failed to fetch summary for 2512.12229: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12229&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[248] GTR-Turbo: Merged Checkpoint is Secretly a Free Teacher for Agentic VLM Training

Tong Wei, Yijun Yang, Changhao Zhang, Junliang Xing, Yuanchun Shi, Zongqing Lu, Deheng Ye

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to determine conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2512.13043: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13043&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[249] Enhancing Tree Species Classification: Insights from YOLOv8 and Explainable AI Applied to TLS Point Cloud Projections

Adrian Straker, Paul Magdon, Marco Zullich, Maximilian Freudenberg, Christoph Kleinn, Johannes Breidenbach, Stefano Puliti, Nils Noelke

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Unable to determine method due to API rate limiting preventing access to paper details

Result: Unable to determine results due to API rate limiting preventing access to paper details

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper details

Abstract: Failed to fetch summary for 2512.16950: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16950&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Wenshuo Peng, Gongxuan Wang, Tianmeng Yang, Chuanhao Li, Xiaojie Xu, Hui He, Kaipeng Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2512.21507: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21507&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[251] Don’t Mind the Gaps: Implicit Neural Representations for Resolution-Agnostic Retinal OCT Analysis

Bennet Kahrs, Julia Andresen, Fenja Falta, Monty Santarossa, Heinz Handels, Timo Kepp

Main category: cs.CV

TL;DR: The paper “2601.02447” could not be analyzed due to HTTP 429 error when fetching the abstract from arXiv API.

Details

Motivation: Unable to determine motivation due to abstract fetch failure.

Method: Unable to determine method due to abstract fetch failure.

Result: Unable to determine results due to abstract fetch failure.

Conclusion: Unable to draw conclusions due to abstract fetch failure.

Abstract: Failed to fetch summary for 2601.02447: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02447&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[252] PLANING: A Loosely Coupled Triangle-Gaussian Framework for Streaming 3D Reconstruction

Changjian Jiang, Kerui Ren, Xudong Li, Kaiwen Song, Guanghao Li, Linning Xu, Tao Lu, Junting Dong, Yu Zhang, Bo Dai, Mulin Yu

Main category: cs.CV

TL;DR: Paper 2601.22046: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2601.22046: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22046&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[253] Generating a Paracosm for Training-Free Zero-Shot Composed Image Retrieval

Tong Wang, Yunhan Zhao, Shu Kong

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.00813: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00813&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Siyu Jiang, Feiyang Chen, Xiaojin Zhang, Kun He

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2602.04268: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04268&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[255] UniWeTok: An Unified Binary Tokenizer with Codebook Size $\mathit{2^{128}}$ for Unified Multimodal Large Language Model

Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang

Main category: cs.CV

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error

Result: No results available - arXiv API returned rate limiting error (HTTP 429)

Conclusion: Paper analysis impossible due to technical limitations in accessing content

Abstract: Failed to fetch summary for 2602.14178: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14178&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[256] TikArt: Stabilizing Aperture-Guided Fine-Grained Visual Reasoning with Reinforcement Learning

Hao Ding, Zhichuan Yang, Weijie Ge, Ziqin Gao, Chaoyi Lu, Lei Zhao

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.14482: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14482&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[257] Class Incremental Learning with Task-Specific Batch Normalization and Out-of-Distribution Detection

Zhiping Zhou, Xuchen Xie, Yiqiao Qiu, Run Lin, Weishi Zheng, Ruixuan Wang

Main category: cs.CV

TL;DR: Paper 2411.00430: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as abstract/summary is unavailable due to HTTP 429 error from arXiv API

Method: Cannot determine method as abstract/summary is unavailable due to HTTP 429 error from arXiv API

Result: Cannot determine results as abstract/summary is unavailable due to HTTP 429 error from arXiv API

Conclusion: Cannot determine conclusion as abstract/summary is unavailable due to HTTP 429 error from arXiv API

Abstract: Failed to fetch summary for 2411.00430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.00430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[258] OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance

Zhaotong Yang, Yong Du, Shengfeng He, Yuhui Li, Xinzhe Li, Yangyang Xu, Junyu Dong, Jian Yang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to determine conclusion due to access restrictions

Abstract: Failed to fetch summary for 2602.14552: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14552&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[259] Is CLIP ideal? No. Can we fix it? Yes!

Raphi Kang, Yue Song, Georgia Gkioxari, Pietro Perona

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2503.08723: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.08723&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[260] Similarity-as-Evidence: Calibrating Overconfident VLMs for Interpretable and Label-Efficient Medical Active Learning

Zhuofan Xie, Zishan Lin, Jinliang Lin, Jie Qi, Shaohua Hong, Shuo Li

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.18867: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18867&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[261] No Need For Real Anomaly: MLLM Empowered Zero-Shot Video Anomaly Detection

Zunkai Dai, Ke Li, Jiajia Liu, Jie Yang, Yuanyuan Qiao

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.19248: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19248&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[262] UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to draw conclusions due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2602.19442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[263] SIMSPINE: A Biomechanics-Aware Simulation Framework for 3D Spine Motion Annotation and Benchmarking

Muhammad Saif Ullah Khan, Didier Stricker

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting

Method: Cannot determine method as paper content is unavailable due to API rate limiting

Result: Cannot determine results as paper content is unavailable due to API rate limiting

Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting

Abstract: Failed to fetch summary for 2602.20792: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20792&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[264] PatchDenoiser: Parameter-efficient multi-scale patch learning and fusion denoiser for Low-dose CT imaging

Jitindra Fartiyal, Pedro Freire, Sergei K. Turitsyn, Sergei G. Solovski

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.21987: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21987&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[265] WebAccessVL: Violation-Aware VLM for Web Accessibility

Amber Yijia Zheng, Jae Joong Lee, Bedrich Benes, Raymond A. Yeh

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2602.03850: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03850&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[266] AMLRIS: Alignment-aware Masked Learning for Referring Image Segmentation

Tongfei Chen, Shuo Yang, Yuguang Yang, Linlin Yang, Runtang Guo, Changbai Li, He Long, Chunyu Xie, Dawei Leng, Baochang Zhang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error

Result: No results available - paper summary fetch failed due to rate limiting

Conclusion: Unable to provide analysis due to technical limitations in accessing the paper

Abstract: Failed to fetch summary for 2602.22740: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22740&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[267] Mind the Way You Select Negative Texts: Pursuing the Distance Consistency in OOD Detection with VLMs

Zhikang Xu, Qianqian Xu, Zitai Wang, Cong Hua, Sicong Li, Zhiyong Yang, Qingming Huang

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.02618: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02618&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[268] BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation

Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to draw conclusions due to failed API request

Abstract: Failed to fetch summary for 2603.02816: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02816&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[269] CFG-Ctrl: Control-Based Classifier-Free Diffusion Guidance

Hanyang Wang, Yiyang Liu, Jiawei Chi, Fangfu Liu, Ran Xue, Yueqi Duan

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2603.03281: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03281&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jiaxu Zhou, Shaobo Wang, Zhiyuan Yang, Zhenjun Yu, Tao Li

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.07181: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07181&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[271] A Systematic Comparison of Training Objectives for Out-of-Distribution Detection in Image Classification

Furkan Genç, Onat Özdemir, Emre Akbaş

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2603.07571: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07571&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[272] SGI: Structured 2D Gaussians for Efficient and Compact Large Image Representation

Zixuan Pan, Kaiyuan Tang, Jun Xia, Yifan Qin, Lin Gu, Chaoli Wang, Jianxu Chen, Yiyu Shi

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.07789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[273] SAVE: Speech-Aware Video Representation Learning for Video-Text Retrieval

Ruixiang Zhao, Zhihao Xu, Bangxiang Lan, Zijie Xin, Jingyu Liu, Xirong Li

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.08224: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08224&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[274] SPIRAL: A Closed-Loop Framework for Self-Improving Action World Models via Reflective Planning Agents

Yu Yang, Yue Liao, Jianbiao Mei, Baisen Wang, Xuemeng Yang, Licheng Wen, Jiangning Zhang, Xiangtai Li, Hanlin Chen, Botian Shi, Yong Liu, Shuicheng Yan, Gim Hee Lee

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.08403: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08403&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[275] VIVID-Med: LLM-Supervised Structured Pretraining for Deployable Medical ViTs

Xiyao Wang, Xiaoyu Tan, Yang Dai, Yuxuan Fu, Shuo Li, Xihe Qiu

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze content

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.09109: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09109&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[276] Transformer-Based Multi-Region Segmentation and Radiomic Analysis of HR-pQCT Imaging for Osteoporosis Classification

Mohseu Rashid Subah, Mohammed Abdul Gani Zilani, Thomas L. Nickolas, Matthew R. Allen, Stuart J. Warden, Rachel K. Surowiec

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to draw conclusions about the paper due to technical retrieval issues

Abstract: Failed to fetch summary for 2603.09137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[277] Agentic AI as a Network Control-Plane Intelligence Layer for Federated Learning over 6G

Loc X. Nguyen, Ji Su Yoon, Huy Q. Le, Yu Qiao, Avi Deb Raha, Eui-Nam Huh, Nguyen H. Tran, Zhu Han, Choong Seon Hong

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.09141: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09141&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[278] Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity

Zhengyao Fang, Pengyuan Lyu, Chengquan Zhang, Guangming Lu, Jun Yu, Wenjie Pei

Main category: cs.CV

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper fetch

Method: Cannot determine method due to failed paper fetch

Result: Cannot determine results due to failed paper fetch

Conclusion: Cannot draw conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2603.09480: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09480&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[279] Streaming Autoregressive Video Generation via Diagonal Distillation

Jinxiu Liu, Xuanming Liu, Kangfu Mei, Yandong Wen, Ming-Hsuan Yang, Weiyang Liu

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.09488 suggests it’s from March 2024, but content cannot be retrieved.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2603.09488: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09488&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Won Shik Jang, Ue-Hwan Kim

Main category: cs.CV

TL;DR: Context-Nav: A method for text-goal instance navigation that uses dense text-image alignments for global exploration guidance and viewpoint-aware 3D spatial reasoning for candidate verification, achieving state-of-the-art without task-specific training.

Details

Motivation: Text-goal instance navigation requires agents to navigate to specific object instances based on free-form descriptions, which is challenging due to same-category distractors in cluttered 3D environments. Current approaches often fail to properly leverage contextual information and lack robust spatial reasoning for verification.

Method: Two-stage approach: 1) Compute dense text-image alignments to create a value map that ranks frontiers, guiding exploration using the entire description rather than early detections. 2) Perform viewpoint-aware relation check when observing candidates - sample plausible observer poses, align local frames, and accept target only if spatial relations can be satisfied from at least one viewpoint.

Result: Achieves state-of-the-art performance on InstanceNav and CoIN-Bench benchmarks. Ablations show that encoding full captions into value maps avoids wasted motion, and explicit viewpoint-aware 3D verification prevents semantically plausible but incorrect stops.

Conclusion: Geometry-grounded spatial reasoning provides a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes, demonstrating the effectiveness of explicit 3D reasoning over learned policies.

Abstract: Text-goal instance navigation (TGIN) asks an agent to resolve a single, free-form description into actions that reach the correct object instance among same-category distractors. We present \textit{Context-Nav} that elevates long, contextual captions from a local matching cue to a global exploration prior and verifies candidates through 3D spatial reasoning. First, we compute dense text-image alignments for a value map that ranks frontiers – guiding exploration toward regions consistent with the entire description rather than early detections. Second, upon observing a candidate, we perform a viewpoint-aware relation check: the agent samples plausible observer poses, aligns local frames, and accepts a target only if the spatial relations can be satisfied from at least one viewpoint. The pipeline requires no task-specific training or fine-tuning; we attain state-of-the-art performance on InstanceNav and CoIN-Bench. Ablations show that (i) encoding full captions into the value map avoids wasted motion and (ii) explicit, viewpoint-aware 3D verification prevents semantically plausible but incorrect stops. This suggests that geometry-grounded spatial reasoning is a scalable alternative to heavy policy training or human-in-the-loop interaction for fine-grained instance disambiguation in cluttered 3D scenes.

[281] A Saccade-inspired Approach to Image Classification using Vision Transformer Attention Maps

Matthis Dallain, Laurent Rodriguez, Laurent Udo Perrinet, Benoît Miramond

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.09613 appears to be from March 2023.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot determine conclusion without access to paper content.

Abstract: Failed to fetch summary for 2603.09613: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09613&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] AutoViVQA: A Large-Scale Automatically Constructed Dataset for Vietnamese Visual Question Answering

Nguyen Anh Tuong, Phan Ba Duc, Nguyen Trung Quoc, Tran Dac Thinh, Dang Duy Lan, Nguyen Quoc Thinh, Tung Le

Main category: cs.CV

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting error

Method: Unable to determine method due to API rate limiting error

Result: Unable to determine results due to API rate limiting error

Conclusion: Unable to determine conclusion due to API rate limiting error

Abstract: Failed to fetch summary for 2603.09689: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09689&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] Ego: Embedding-Guided Personalization of Vision-Language Models

Soroush Seifi, Simon Gardier, Vaggelis Dorovatas, Daniel Olmeda Reino, Rahaf Aljundi

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.09771: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09771&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[284] ENIGMA-360: An Ego-Exo Dataset for Human Behavior Understanding in Industrial Scenarios

Francesco Ragusa, Rosario Leonardi, Michele Mazzamuto, Daniele Di Mauro, Camillo Quattrocchi, Alessandro Passanisi, Irene D’Ambra, Antonino Furnari, Giovanni Maria Farinella

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about the paper due to access limitations

Abstract: Failed to fetch summary for 2603.09741: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09741&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[285] MA-EgoQA: Question Answering over Egocentric Videos from Multiple Embodied Agents

Kangsan Kim, Yanlai Yang, Suji Kim, Woongyeong Yeo, Youngwan Lee, Mengye Ren, Sung Ju Hwang

Main category: cs.CV

TL;DR: A novel benchmark (MA-EgoQA) and baseline model (EgoMAS) for understanding multiple long-horizon egocentric videos from embodied AI agents, addressing challenges in multi-agent visual understanding and communication.

Details

Motivation: As humans collaborate with multiple embodied AI agents, there's a need for systems that can interpret parallel sensory inputs (video) from multiple agents, compress high-volume data, and aggregate egocentric videos to construct system-level memory for effective human-agent communication.

Method: Introduces MultiAgent-EgoQA (MA-EgoQA) benchmark with 1.7k questions across five categories (social interaction, task coordination, theory-of-mind, temporal reasoning, environmental interaction). Proposes EgoMAS baseline model that uses shared memory across agents and agent-wise dynamic retrieval.

Result: Current approaches struggle with multiple egocentric streams. The benchmark enables systematic evaluation, and EgoMAS provides a simple baseline approach, though significant improvements are needed for effective multi-agent video understanding.

Conclusion: The work establishes a foundation for multi-agent egocentric video understanding, highlighting current limitations and the need for future advances in system-level understanding across embodied agents.

Abstract: As embodied models become powerful, humans will collaborate with multiple embodied AI agents at their workplace or home in the future. To ensure better communication between human users and the multi-agent system, it is crucial to interpret incoming information from agents in parallel and refer to the appropriate context for each query. Existing challenges include effectively compressing and communicating high volumes of individual sensory inputs in the form of video and correctly aggregating multiple egocentric videos to construct system-level memory. In this work, we first formally define a novel problem of understanding multiple long-horizon egocentric videos simultaneously collected from embodied agents. To facilitate research in this direction, we introduce MultiAgent-EgoQA (MA-EgoQA), a benchmark designed to systemically evaluate existing models in our scenario. MA-EgoQA provides 1.7k questions unique to multiple egocentric streams, spanning five categories: social interaction, task coordination, theory-of-mind, temporal reasoning, and environmental interaction. We further propose a simple baseline model for MA-EgoQA named EgoMAS, which leverages shared memory across embodied agents and agent-wise dynamic retrieval. Through comprehensive evaluation across diverse baselines and EgoMAS on MA-EgoQA, we find that current approaches are unable to effectively handle multiple egocentric streams, highlighting the need for future advances in system-level understanding across the agents. The code and benchmark are available at https://ma-egoqa.github.io.

[286] SEGA: Drivable 3D Gaussian Head Avatar from a Single Image

Chen Guo, Zhuo Su, Liao Wang, Jian Wang, Shuang Li, Xu Chang, Zhaohu Li, Yang Zhao, Guidong Wang, Yebin Liu, Ruqi Huang

Main category: cs.CV

TL;DR: Unable to analyze paper 2504.14373 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2504.14373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.14373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[287] Pre-training vision models for the classification of alerts from wide-field time-domain surveys

Nabeel Rehemtulla, Adam A. Miller, Mike Walmsley, Ved G. Shah, Theophile Jegou du Laz, Michael W. Coughlin, Argyro Sasli, Joshua Bloom, Christoffer Fremling, Matthew J. Graham, Steven L. Groom, David Hale, Ashish A. Mahabal, Daniel A. Perley, Josiah Purdum, Ben Rusholme, Jesper Sollerman, Mansi M. Kasliwal

Main category: cs.CV

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2512.11957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.11957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[288] Structured Bitmap-to-Mesh Triangulation for Geometry-Aware Discretization of Image-Derived Domains

Wei Feng, Haiyong Zheng

Main category: cs.CV

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.19474 appears to be from February 2024, but no abstract or content is available for analysis.

Details

Motivation: Cannot determine motivation without access to the paper content. The HTTP 429 error indicates rate limiting from arXiv API, preventing retrieval of the paper details.

Method: Cannot analyze method without access to the paper content. The error suggests technical limitations in accessing the arXiv database at this time.

Result: No results can be analyzed due to inability to access the paper content. The HTTP 429 status code indicates too many requests to the arXiv API.

Conclusion: Cannot provide any conclusions about the paper due to technical limitations in accessing the content. The arXiv API rate limiting prevents analysis.

Abstract: Failed to fetch summary for 2602.19474: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19474&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.AI

[289] Agentic Control Center for Data Product Optimization

Priyadarshini Tamilselvan, Gregory Bramble, Sola Shirai, Ken C. L. Wong, Faisal Chowdhury, Horst Samulowitz

Main category: cs.AI

TL;DR: System automates data product improvement using AI agents in continuous optimization loop with human oversight

Details

Motivation: Creating useful data products requires domain experts to hand-craft supporting assets, which is challenging and time-consuming

Method: Specialized AI agents operate in continuous optimization loop, surfacing questions, monitoring quality metrics, and supporting human-in-the-loop controls

Result: Transforms data into observable and refinable assets that balance automation with trust and oversight

Conclusion: Proposed system automates data product improvement while maintaining human oversight and trust

Abstract: Data products enable end users to gain greater insights about their data by providing supporting assets, such as example question-SQL pairs which can be answered using the data or views over the database tables. However, producing useful data products is challenging, and typically requires domain experts to hand-craft supporting assets. We propose a system that automates data product improvement through specialized AI agents operating in a continuous optimization loop. By surfacing questions, monitoring multi-dimensional quality metrics, and supporting human-in-the-loop controls, it transforms data into observable and refinable assets that balance automation with trust and oversight.

[290] Hybrid Self-evolving Structured Memory for GUI Agents

Sibo Zhu, Wenyi Wu, Kun Zhou, Stephen Wang, Biwei Huang

Main category: cs.AI

TL;DR: HyMEM is a graph-based memory system for GUI agents that combines symbolic nodes with continuous embeddings to improve long-horizon computer task performance through structured memory organization and self-evolution.

Details

Motivation: Real-world computer-use tasks are challenging for GUI agents due to long-horizon workflows, diverse interfaces, and frequent errors. Current approaches with flat retrieval over discrete summaries or continuous embeddings lack the structured organization and self-evolving characteristics of human memory.

Method: Proposes Hybrid Self-evolving Structured Memory (HyMEM), a graph-based memory that couples discrete high-level symbolic nodes with continuous trajectory embeddings. It maintains a graph structure for multi-hop retrieval, self-evolution via node update operations, and on-the-fly working-memory refreshing during inference.

Result: HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models. It boosts Qwen2.5-VL-7B by +22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o.

Conclusion: HyMEM’s brain-inspired hybrid memory architecture effectively addresses the limitations of flat retrieval approaches for GUI agents, demonstrating significant performance improvements on real-world computer-use tasks.

Abstract: The remarkable progress of vision-language models (VLMs) has enabled GUI agents to interact with computers in a human-like manner. Yet real-world computer-use tasks remain difficult due to long-horizon workflows, diverse interfaces, and frequent intermediate errors. Prior work equips agents with external memory built from large collections of trajectories, but relies on flat retrieval over discrete summaries or continuous embeddings, falling short of the structured organization and self-evolving characteristics of human memory. Inspired by the brain, we propose Hybrid Self-evolving Structured Memory (HyMEM), a graph-based memory that couples discrete high-level symbolic nodes with continuous trajectory embeddings. HyMEM maintains a graph structure to support multi-hop retrieval, self-evolution via node update operations, and on-the-fly working-memory refreshing during inference. Extensive experiments show that HyMEM consistently improves open-source GUI agents, enabling 7B/8B backbones to match or surpass strong closed-source models; notably, it boosts Qwen2.5-VL-7B by +22.5% and outperforms Gemini2.5-Pro-Vision and GPT-4o.

[291] HEAL: Hindsight Entropy-Assisted Learning for Reasoning Distillation

Wenjing Zhang, Jiangze Yan, Jieyun Huang, Yi Shen, Shuming Shi, Ping Chen, Ning Wang, Zhaoxiang Liu, Kai Wang, Shiguo Lian

Main category: cs.AI

TL;DR: HEAL is a framework for distilling reasoning capabilities from large to small models that overcomes the “Teacher Ceiling” limitation by actively intervening in broken reasoning trajectories and filtering genuine cognitive breakthroughs.

Details

Motivation: Standard distillation methods treat the teacher as a static filter, discarding complex problems where the teacher fails, creating an artificial "Teacher Ceiling" that limits student learning. The paper aims to bridge this reasoning gap.

Method: HEAL combines three modules: (1) GEAR detects reasoning breakpoints via entropy dynamics and injects targeted hindsight hints; (2) PURE filters genuine cognitive breakthroughs from spurious shortcuts; (3) PACE organizes training in three stages from foundational alignment to frontier breakthrough.

Result: Extensive experiments on multiple benchmarks show HEAL significantly outperforms traditional SFT distillation and other baselines.

Conclusion: HEAL effectively bridges the reasoning gap in distillation by actively repairing broken trajectories and filtering genuine learning, overcoming the Teacher Ceiling limitation.

Abstract: Distilling reasoning capabilities from Large Reasoning Models (LRMs) into smaller models is typically constrained by the limitation of rejection sampling. Standard methods treat the teacher as a static filter, discarding complex “corner-case” problems where the teacher fails to explore valid solutions independently, thereby creating an artificial “Teacher Ceiling” for the student. In this work, we propose Hindsight Entropy-Assisted Learning (HEAL), an RL-free framework designed to bridge this reasoning gap. Drawing on the educational theory of the Zone of Proximal Development(ZPD), HEAL synergizes three core modules: (1) Guided Entropy-Assisted Repair (GEAR), an active intervention mechanism that detects critical reasoning breakpoints via entropy dynamics and injects targeted hindsight hints to repair broken trajectories; (2) Perplexity-Uncertainty Ratio Estimator (PURE), a rigorous filtering protocol that decouples genuine cognitive breakthroughs from spurious shortcuts; and (3) Progressive Answer-guided Curriculum Evolution (PACE), a three-stage distillation strategy that organizes training from foundational alignment to frontier breakthrough. Extensive experiments on multiple benchmarks demonstrate that HEAL significantly outperforms traditional SFT distillation and other baselines.

[292] Mindstorms in Natural Language-Based Societies of Mind

Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R. Ashley, Róbert Csordás, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Herrmann, Kazuki Irie, Louis Kirsch, Bing Li, Guohao Li, Shuming Liu, Jinjie Mai, Piotr Piękos, Aditya Ramesh, Imanol Schlag, Weimin Shi, Aleksandar Stanić, Wenyi Wang, Yuhui Wang, Mengmeng Xu, Deng-Ping Fan, Bernard Ghanem, Jürgen Schmidhuber

Main category: cs.AI

TL;DR: A framework for societies of multimodal neural networks that communicate via natural language to solve complex AI tasks through collaborative “mindstorms.”

Details

Motivation: Inspired by Minsky's "society of mind" and Schmidhuber's "learning to think," the paper aims to overcome limitations of single large language models by creating collaborative societies of diverse neural networks that communicate through natural language interfaces.

Method: Proposes Natural Language-based Societies of Mind (NLSOMs) where various neural network agents (LLMs and other NN-based experts) communicate via natural language. These modular societies (up to 129 members) use “mindstorms” - collaborative interviews between agents - to solve multimodal tasks.

Result: Demonstrates NLSOMs successfully solving practical AI tasks including visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving.

Conclusion: NLSOMs represent a promising direction toward larger societies of heterogeneous minds (potentially billions of agents including humans), raising important research questions about social structures, governance models, and economic principles for maximizing collective intelligence.

Abstract: Both Minsky’s “society of mind” and Schmidhuber’s “learning to think” inspire diverse societies of large multimodal neural networks (NNs) that solve problems by interviewing each other in a “mindstorm.” Recent implementations of NN-based societies of minds consist of large language models (LLMs) and other NN-based experts communicating through a natural language interface. In doing so, they overcome the limitations of single LLMs, improving multimodal zero-shot reasoning. In these natural language-based societies of mind (NLSOMs), new agents – all communicating through the same universal symbolic language – are easily added in a modular fashion. To demonstrate the power of NLSOMs, we assemble and experiment with several of them (having up to 129 members), leveraging mindstorms in them to solve some practical AI tasks: visual question answering, image captioning, text-to-image synthesis, 3D generation, egocentric retrieval, embodied AI, and general language-based task solving. We view this as a starting point towards much larger NLSOMs with billions of agents-some of which may be humans. And with this emergence of great societies of heterogeneous minds, many new research questions have suddenly become paramount to the future of artificial intelligence. What should be the social structure of an NLSOM? What would be the (dis)advantages of having a monarchical rather than a democratic structure? How can principles of NN economies be used to maximize the total reward of a reinforcement learning NLSOM? In this work, we identify, discuss, and try to answer some of these questions.

[293] Beyond Scalars: Evaluating and Understanding LLM Reasoning via Geometric Progress and Stability

Xinyan Jiang, Ninghao Liu, Di Wang, Lijie Hu

Main category: cs.AI

TL;DR: TRACED framework evaluates LLM reasoning quality using geometric kinematics, analyzing reasoning traces through Progress (displacement) and Stability (curvature) metrics to distinguish correct reasoning from hallucinations.

Details

Motivation: Traditional scalar probability evaluations of LLM reliability fail to capture the structural dynamics of reasoning processes. There's a need for more nuanced assessment methods that can reveal the internal dynamics of machine thought and distinguish between correct reasoning patterns and hallucination patterns.

Method: TRACED decomposes reasoning traces into two geometric components: Progress (displacement) measuring forward movement in reasoning, and Stability (curvature) measuring consistency. The framework analyzes reasoning trajectories to identify topological patterns - correct reasoning shows high-progress, stable trajectories while hallucinations show low-progress, unstable patterns with stalled displacement and high curvature fluctuations.

Result: The framework achieves competitive performance and superior robustness across diverse benchmarks. It reveals distinct topological divergence between correct reasoning and hallucinations, and successfully maps geometric features to cognitive concepts like “Hesitation Loops” (high curvature) and “Certainty Accumulation” (displacement).

Conclusion: TRACED provides a novel geometric approach to evaluating LLM reasoning quality that bridges geometry and cognition, offering a physical lens to decode the internal dynamics of machine thought and providing more nuanced assessment than traditional scalar probability methods.

Abstract: Evaluating LLM reliability via scalar probabilities often fails to capture the structural dynamics of reasoning. We introduce TRACED, a framework that assesses reasoning quality through theoretically grounded geometric kinematics. By decomposing reasoning traces into Progress (displacement) and Stability (curvature), we reveal a distinct topological divergence: correct reasoning manifests as high-progress, stable trajectories, whereas hallucinations are characterized by low-progress, unstable patterns (stalled displacement with high curvature fluctuations). Leveraging these signatures, our probabilistic framework achieves competitive performance and superior robustness across diverse benchmarks. Crucially, TRACED bridges geometry and cognition by mapping high curvature to ‘‘Hesitation Loops’’ and displacement to ‘‘Certainty Accumulation’’, offering a physical lens to decode the internal dynamics of machine thought.

[294] Verbalizing LLM’s Higher-order Uncertainty via Imprecise Probabilities

Anita Yang, Krikamol Muandet, Michele Caprio, Siu Lun Chau, Masaki Adachi

Main category: cs.AI

TL;DR: Novel prompt-based uncertainty elicitation techniques for LLMs using imprecise probabilities framework to capture both first-order (uncertainty over responses) and second-order (uncertainty about uncertainty) uncertainty.

Details

Motivation: Existing uncertainty elicitation techniques for LLMs, developed under classical probabilistic frameworks, fail to adequately capture LLM behavior, leading to systematic failures in ambiguous QA, in-context learning, and self-reflection settings.

Method: Proposes prompt-based uncertainty elicitation techniques grounded in imprecise probabilities framework, with general-purpose prompting and post-processing procedures to directly elicit and quantify both first-order and second-order uncertainty.

Result: Demonstrates effectiveness across diverse settings, enabling more faithful uncertainty reporting from LLMs, improving credibility and supporting downstream decision-making.

Conclusion: Imprecise probabilities provide a principled framework for better uncertainty elicitation from LLMs, addressing systematic failure modes of classical approaches.

Abstract: Despite the growing demand for eliciting uncertainty from large language models (LLMs), empirical evidence suggests that LLM behavior is not always adequately captured by the elicitation techniques developed under the classical probabilistic uncertainty framework. This mismatch leads to systematic failure modes, particularly in settings that involve ambiguous question-answering, in-context learning, and self-reflection. To address this, we propose novel prompt-based uncertainty elicitation techniques grounded in \emph{imprecise probabilities}, a principled framework for repesenting and eliciting higher-order uncertainty. Here, first-order uncertainty captures uncertainty over possible responses to a prompt, while second-order uncertainty (uncertainty about uncertainty) quantifies indeterminacy in the underlying probability model itself. We introduce general-purpose prompting and post-processing procedures to directly elicit and quantify both orders of uncertainty, and demonstrate their effectiveness across diverse settings. Our approach enables more faithful uncertainty reporting from LLMs, improving credibility and supporting downstream decision-making.

[295] The Yokai Learning Environment: Tracking Beliefs Over Space and Time

Constantin Ruhdorfer, Matteo Bortoletto, Johannes Forkel, Jakob Foerster, Andreas Bulling

Main category: cs.AI

TL;DR: Yokai Learning Environment (YLE) is introduced as a new multi-agent RL benchmark for zero-shot coordination that requires building common ground through belief tracking, ambiguous hints, and strategic termination decisions - addressing limitations of the near-saturated Hanabi benchmark.

Details

Motivation: The Hanabi Learning Environment (HLE) has become saturated with near-perfect inter-seed cross-play performance, limiting its ability to track algorithmic progress in zero-shot coordination. There's a need for a more challenging benchmark that requires sophisticated belief tracking, ambiguous communication, and strategic decision-making about game termination.

Method: Introduces Yokai Learning Environment (YLE) - an open-source multi-agent RL benchmark where agents must: 1) track and update beliefs over moving cards, 2) reason under ambiguous hints, and 3) decide when to terminate based on inferred shared knowledge. Evaluates leading ZSC methods (High-Entropy IPPO, Other-Play, Off-Belief Learning) on YLE.

Result: Leading ZSC methods that achieve near-perfect performance in HLE show persistent SP-XP gaps, degraded early-ending calibration, and weaker belief representations in YLE cross-play. Methods performing best in HLE do not perform best in YLE, indicating benchmark-specific progress doesn’t generalize.

Conclusion: YLE establishes itself as a challenging new benchmark for zero-shot coordination research, revealing limitations of current methods in maintaining consistent internal models with unseen partners and highlighting the need for more robust coordination algorithms.

Abstract: The ability to cooperate with unknown partners is a central challenge in cooperative AI and widely studied in the form of zero-shot coordination (ZSC), which evaluates an algorithm by measuring the performance of independently trained agents when paired. The Hanabi Learning Environment (HLE) has become the dominant benchmark for ZSC, but recent work has achieved near-perfect inter-seed cross-play performance, limiting its ability to track algorithmic progress. We introduce the Yokai Learning Environment (YLE) - an open-source multi-agent RL benchmark in which effective collaboration requires building common ground by tracking and updating beliefs over moving cards, reasoning under ambiguous hints, and deciding when to terminate the game based on inferred shared knowledge - features absent in the HLE, where beliefs are tied to hand slots and hints are truthful by rule. We evaluate the leading ZSC methods, including High-Entropy IPPO, Other-Play, and Off-Belief Learning, which achieve near-perfect inter-seed cross-play in the HLE, and show that in the YLE they exhibit persistent SP-XP gaps, degraded early-ending calibration, and weaker belief representations in cross-play, indicating failure to maintain consistent internal models with unseen partners. Methods that perform best in the HLE do not perform best in the YLE, indicating that progress measured on a single benchmark may not generalise. Together, these results establish YLE as a challenging new ZSC benchmark.

[296] Resource-constrained Amazons chess decision framework integrating large language models and graph attention

Tianhao Qian, Zhuoxuan Li, Jinde Cao, Xinli Shi, Hanjie Liu, Leszek Rutkowski

Main category: cs.AI

TL;DR: A lightweight hybrid framework for Game of the Amazons that combines graph-based learning with LLMs for weak-to-strong generalization under computational constraints.

Details

Motivation: Resource-constrained environments challenge conventional deep learning methods that require extensive datasets and computational resources. The paper aims to develop efficient game AI that can evolve from general-purpose foundation models under stringent computational constraints.

Method: Proposes a hybrid framework integrating Graph Attention Autoencoder with multi-step Monte Carlo Tree Search, Stochastic Graph Genetic Algorithm for optimization, and GPT-4o-mini for synthetic data generation. Uses graph attention as structural filter to denoise LLM outputs.

Result: Achieves 15-56% improvement in decision accuracy over baselines, outperforms teacher model (GPT-4o-mini) with 45.0% win rate at N=30 nodes and 66.5% at N=50 nodes on 10×10 Amazons board.

Conclusion: Demonstrates feasibility of evolving specialized, high-performance game AI from general-purpose foundation models under computational constraints through weak-to-strong generalization paradigm.

Abstract: Artificial intelligence has advanced significantly through the development of intelligent game-playing systems, providing rigorous testbeds for decision-making, strategic planning, and adaptive learning. However, resource-constrained environments pose critical challenges, as conventional deep learning methods heavily rely on extensive datasets and computational resources. In this paper, we propose a lightweight hybrid framework for the Game of the Amazons, which explores the paradigm of weak-to-strong generalization by integrating the structural reasoning of graph-based learning with the generative capabilities of large language models. Specifically, we leverage a Graph Attention Autoencoder to inform a multi-step Monte Carlo Tree Search, utilize a Stochastic Graph Genetic Algorithm to optimize evaluation signals, and harness GPT-4o-mini to generate synthetic training data. Unlike traditional approaches that rely on expert demonstrations, our framework learns from noisy and imperfect supervision. We demonstrate that the Graph Attention mechanism effectively functions as a structural filter, denoising the LLM’s outputs. Experiments on a 10$\times$10 Amazons board show that our hybrid approach not only achieves a 15%–56% improvement in decision accuracy over baselines but also significantly outperforms its teacher model (GPT-4o-mini), achieving a competitive win rate of 45.0% at N=30 nodes and a decisive 66.5% at only N=50 nodes. These results verify the feasibility of evolving specialized, high-performance game AI from general-purpose foundation models under stringent computational constraints.

[297] IH-Challenge: A Training Dataset to Improve Instruction Hierarchy on Frontier LLMs

Chuan Guo, Juan Felipe Ceron Uribe, Sicheng Zhu, Christopher A. Choquette-Choo, Steph Lin, Nikhil Kandpal, Milad Nasr, Rai, Sam Toyer, Miles Wang, Yaodong Yu, Alex Beutel, Kai Xiao

Main category: cs.AI

TL;DR: Training dataset and method to improve LLM instruction hierarchy robustness against jailbreaks and prompt injections

Details

Motivation: Instruction hierarchy (IH) is crucial for LLM safety but difficult to train robustly due to confounded failures, nuanced conflicts, and model shortcuts like overrefusing

Method: Introduces IH-Challenge dataset and uses reinforcement learning with online adversarial example generation to fine-tune models (GPT-5-Mini) for improved IH robustness

Result: +10.0% average improvement across 16 benchmarks, reduces unsafe behavior from 6.6% to 0.7%, saturates internal prompt injection evaluation with minimal capability regression

Conclusion: IH-Challenge dataset enables training more robust instruction hierarchy in LLMs, improving safety against jailbreaks and prompt injections while maintaining helpfulness

Abstract: Instruction hierarchy (IH) defines how LLMs prioritize system, developer, user, and tool instructions under conflict, providing a concrete, trust-ordered policy for resolving instruction conflicts. IH is key to defending against jailbreaks, system prompt extractions, and agentic prompt injections. However, robust IH behavior is difficult to train: IH failures can be confounded with instruction-following failures, conflicts can be nuanced, and models can learn shortcuts such as overrefusing. We introduce IH-Challenge, a reinforcement learning training dataset, to address these difficulties. Fine-tuning GPT-5-Mini on IH-Challenge with online adversarial example generation improves IH robustness by +10.0% on average across 16 in-distribution, out-of-distribution, and human red-teaming benchmarks (84.1% to 94.1%), reduces unsafe behavior from 6.6% to 0.7% while improving helpfulness on general safety evaluations, and saturates an internal static agentic prompt injection evaluation, with minimal capability regression. We release the IH-Challenge dataset (https://huggingface.co/datasets/openai/ih-challenge) to support future research on robust instruction hierarchy.

[298] Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents

Yuanhao Li, Haozhe Wang, Geyong Min, Nektarios Georgalas, Wang Miao

Main category: cs.AI

TL;DR: A self-finetuning framework for generative AI agents to achieve robust continuous control by internalizing experience through autonomous linguistic feedback and preference-based fine-tuning, demonstrated on dynamic RAN slicing tasks.

Details

Motivation: Current generative AI models face limitations in continuous control tasks due to finite context windows, lack of explicit rewards, and long-context degradation. The paper aims to enable agents to internalize experience through parameter distillation rather than prompt-based memory for autonomous network control.

Method: Proposes a self-finetuning framework with bi-perspective reflection mechanism that generates autonomous linguistic feedback from interaction history to construct preference datasets, followed by preference-based fine-tuning to distill long-horizon experiences into model parameters.

Result: Outperforms standard RL baselines and existing LLM-based agents in sample efficiency, stability, and multi-metric optimization on dynamic RAN slicing tasks, demonstrating superior continuous control capabilities.

Conclusion: The framework shows potential for self-improving generative agents in continuous control tasks, paving the way for AI-native network infrastructure by enabling experience internalization through autonomous learning.

Abstract: The integration of Generative AI models into AI-native network systems offers a transformative path toward achieving autonomous and adaptive control. However, the application of such models to continuous control tasks is impeded by intrinsic architectural limitations, including finite context windows, the lack of explicit reward signals, and the degradation of the long context. This paper posits that the key to unlocking robust continuous control is enabling agents to internalize experience by distilling it into their parameters, rather than relying on prompt-based memory. To this end, we propose a novel self-finetuning framework that enables agentic systems to learn continuously through direct interaction with the environment, bypassing the need for handcrafted rewards. Our framework implements a bi-perspective reflection mechanism that generates autonomous linguistic feedback to construct preference datasets from interaction history. A subsequent preference-based fine-tuning process distills long-horizon experiences into the model’s parameters. We evaluate our approach on a dynamic Radio Access Network (RAN) slicing task, a challenging multi-objective control problem that requires the resolution of acute trade-offs between spectrum efficiency, service quality, and reconfiguration stability under volatile network conditions. Experimental results show that our framework outperforms standard Reinforcement Learning (RL) baselines and existing Large Language Model (LLM)-based agents in sample efficiency, stability, and multi-metric optimization. These findings demonstrate the potential of self-improving generative agents for continuous control tasks, paving the way for future AI-native network infrastructure.

[299] CUAAudit: Meta-Evaluation of Vision-Language Models as Auditors of Autonomous Computer-Use Agents

Marta Sumyk, Oleksandr Kosovan

Main category: cs.AI

TL;DR: VLMs as autonomous auditors for evaluating Computer-Use Agents’ task completion across desktop environments, with meta-evaluation of five VLMs on three CUA benchmarks.

Details

Motivation: Existing evaluation methods for Computer-Use Agents (CUAs) are brittle, costly, and poorly aligned with real-world usage, creating a need for scalable, reliable assessment methods.

Method: Large-scale meta-evaluation of five Vision-Language Models as autonomous auditors that judge task success given natural-language instructions and final environment states across three CUA benchmarks on macOS, Windows, and Linux.

Result: State-of-the-art VLMs achieve strong accuracy and calibration, but all exhibit performance degradation in complex/heterogeneous environments, with significant disagreement even among high-performing models.

Conclusion: Current model-based auditing approaches have fundamental limitations, highlighting the need to account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.

Abstract: Computer-Use Agents (CUAs) are emerging as a new paradigm in human-computer interaction, enabling autonomous execution of tasks in desktop environment by perceiving high-level natural-language instructions. As such agents become increasingly capable and are deployed across diverse desktop environments, evaluating their behavior in a scalable and reliable manner becomes a critical challenge. Existing evaluation pipelines rely on static benchmarks, rule-based success checks, or manual inspection, which are brittle, costly, and poorly aligned with real-world usage. In this work, we study Vision-Language Models (VLMs) as autonomous auditors for assessing CUA task completion directly from observable interactions and conduct a large-scale meta-evaluation of five VLMs that judge task success given a natural-language instruction and the final environment state. Our evaluation spans three widely used CUA benchmarks across macOS, Windows, and Linux environments and analyzes auditor behavior along three complementary dimensions: accuracy, calibration of confidence estimates, and inter-model agreement. We find that while state-of-the-art VLMs achieve strong accuracy and calibration, all auditors exhibit notable performance degradation in more complex or heterogeneous environments, and even high-performing models show significant disagreement in their judgments. These results expose fundamental limitations of current model-based auditing approaches and highlight the need to explicitly account for evaluator reliability, uncertainty, and variance when deploying autonomous CUAs in real-world settings.

[300] Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

Zhaowei Zhang, Xiaohan Liu, Xuekai Zhu, Junchao Huang, Ceyao Zhang, Zhiyuan Feng, Yaodong Yang, Xiaoyuan Yi, Xing Xie

Main category: cs.AI

TL;DR: RLVR works well for logical reasoning, but it’s unclear if LLM alignment needs different approaches. The study compares reward-maximizing vs distribution-matching methods on moral reasoning tasks, finding that reward-maximizing methods work just as well or better, contrary to expectations.

Details

Motivation: The paper investigates whether LLM alignment tasks require fundamentally different approaches from standard reinforcement learning with verifiable rewards (RLVR). Given that moral reasoning often tolerates multiple valid responses, there's a hypothesis that alignment tasks might need diversity-seeking distribution-matching algorithms rather than reward-maximizing methods.

Method: The researchers conducted an empirical study comparing both paradigms on MoReBench. They built a rubric-grounded reward pipeline using a Qwen3-1.7B judge model to enable stable RLVR training. They used semantic visualization to map high-reward responses to semantic space and analyze response distributions.

Result: Contrary to expectations, distribution-matching approaches did not show significant advantages over reward-maximizing methods on alignment tasks. Moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. Mode-seeking optimization proved equally or more effective for alignment tasks.

Conclusion: Alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has achieved remarkable success in logical reasoning tasks, yet whether large language model (LLM) alignment requires fundamentally different approaches remains unclear. Given the apparent tolerance for multiple valid responses in moral reasoning, a natural hypothesis is that alignment tasks inherently require diversity-seeking distribution-matching algorithms rather than reward-maximizing policy-based methods. We conduct the first comprehensive empirical study comparing both paradigms on MoReBench. To enable stable RLVR training, we build a rubric-grounded reward pipeline by training a Qwen3-1.7B judge model. Contrary to our hypothesis, we find that distribution-matching approaches do not demonstrate significant advantages over reward-maximizing methods as expected on alignment tasks. Through semantic visualization mapping high-reward responses to semantic space, we demonstrate that moral reasoning exhibits more concentrated high-reward distributions than mathematical reasoning, where diverse solution strategies yield similarly high rewards. This counter-intuitive finding explains why mode-seeking optimization proves equally or more effective for alignment tasks. Our results suggest that alignment tasks do not inherently require diversity-preserving algorithms, and standard reward-maximizing RLVR methods can effectively transfer to moral reasoning without explicit diversity mechanisms.

[301] Trajectory-Informed Memory Generation for Self-Improving Agent Systems

Gaodan Fang, Vatche Isahagian, K. R. Jayaram, Ritesh Kumar, Vinod Muthusamy, Punleuk Oum, Gegi Thomas

Main category: cs.AI

TL;DR: A framework for extracting actionable learnings from LLM agent execution trajectories to improve future performance through contextual memory retrieval.

Details

Motivation: LLM-powered agents often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions, creating a need for learning from execution experiences.

Method: Four-component framework: (1) Trajectory Intelligence Extractor for semantic analysis of reasoning patterns, (2) Decision Attribution Analyzer to identify decision impacts, (3) Contextual Learning Generator producing strategy, recovery, and optimization tips, (4) Adaptive Memory Retrieval System for multi-dimensional similarity-based guidance injection.

Result: Evaluation on AppWorld benchmark shows consistent improvements: up to 14.3 percentage point gains in scenario goal completion on held-out tasks, with particularly strong benefits on complex tasks (28.5pp scenario goal improvement, 149% relative increase).

Conclusion: The framework enables agents to learn from execution experiences by extracting structured learnings with provenance and retrieving contextually relevant guidance, outperforming generic memory systems.

Abstract: LLM-powered agents face a persistent challenge: learning from their execution experiences to improve future performance. While agents can successfully complete many tasks, they often repeat inefficient patterns, fail to recover from similar errors, and miss opportunities to apply successful strategies from past executions. We present a novel framework for automatically extracting actionable learnings from agent execution trajectories and utilizing them to improve future performance through contextual memory retrieval. Our approach comprises four components: (1) a Trajectory Intelligence Extractor that performs semantic analysis of agent reasoning patterns, (2) a Decision Attribution Analyzer that identifies which decisions and reasoning steps led to failures, recoveries, or inefficiencies, (3) a Contextual Learning Generator that produces three types of guidance – strategy tips from successful patterns, recovery tips from failure handling, and optimization tips from inefficient but successful executions, and (4) an Adaptive Memory Retrieval System that injects relevant learnings into agent prompts based on multi-dimensional similarity. Unlike existing memory systems that store generic conversational facts, our framework understands execution patterns, extracts structured learnings with provenance, and retrieves guidance tailored to specific task contexts. Evaluation on the AppWorld benchmark demonstrates consistent improvements, with up to 14.3 percentage point gains in scenario goal completion on held-out tasks and particularly strong benefits on complex tasks (28.5~pp scenario goal improvement, a 149% relative increase).

[302] FAME: Formal Abstract Minimal Explanation for Neural Networks

Ryma Boumazouza, Raya Elsaleh, Melanie Ducoffe, Shahaf Bassan, Guy Katz

Main category: cs.AI

TL;DR: FAME introduces formal abstract minimal explanations using abstract interpretation to generate compact, scalable explanations for large neural networks without requiring traversal order.

Details

Motivation: Existing explanation methods for neural networks struggle with scalability to large models and often produce explanations that are too large or computationally expensive. There's a need for formal, minimal explanations that can scale efficiently while maintaining theoretical guarantees.

Method: FAME uses dedicated perturbation domains in abstract interpretation to eliminate traversal order requirements. It progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, converging to formal abstract minimal explanations. Quality assessment combines adversarial attacks with optional VERIX+ refinement.

Result: FAME consistently outperforms VERIX+ in both explanation size and runtime on medium- to large-scale neural networks, demonstrating better scalability and efficiency.

Conclusion: FAME provides a scalable approach to generating formal minimal explanations for large neural networks, addressing key limitations of existing methods through abstract interpretation and novel perturbation domains.

Abstract: We propose FAME (Formal Abstract Minimal Explanations), a new class of abductive explanations grounded in abstract interpretation. FAME is the first method to scale to large neural networks while reducing explanation size. Our main contribution is the design of dedicated perturbation domains that eliminate the need for traversal order. FAME progressively shrinks these domains and leverages LiRPA-based bounds to discard irrelevant features, ultimately converging to a formal abstract minimal explanation. To assess explanation quality, we introduce a procedure that measures the worst-case distance between an abstract minimal explanation and a true minimal explanation. This procedure combines adversarial attacks with an optional VERIX+ refinement step. We benchmark FAME against VERIX+ and demonstrate consistent gains in both explanation size and runtime on medium- to large-scale neural networks.

[303] Emulating Clinician Cognition via Self-Evolving Deep Clinical Research

Ruiyang Ren, Yuhao Wang, Yunsen Liang, Lan Luo, Jing Liu, Haifeng Wang, Cong Feng, Yinan Zhang, Chunyan Miao, Ji-Rong Wen, Wayne Xin Zhao

Main category: cs.AI

TL;DR: DxEvolve is a self-evolving diagnostic AI agent that improves clinical diagnosis through interactive examination requisition and continuous experience externalization, achieving significant accuracy improvements over baseline models.

Details

Motivation: Current AI diagnostic systems are misaligned with real clinical practice - they treat diagnosis as single-pass retrospective prediction and lack mechanisms for continuous, auditable improvement. There's a need for AI that mimics the dynamic, expertise-accumulating nature of clinical diagnosis.

Method: DxEvolve uses an interactive deep clinical research workflow that autonomously requisitions examinations and continually externalizes clinical experience from increasing encounter exposure as diagnostic cognition primitives. It transforms experience into a governable learning asset.

Result: On MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% over backbone models, reaching 90.4% on a reader-study subset (comparable to clinician reference of 88.8%). On an independent external cohort, it improved accuracy by 10.2% for covered categories and 17.1% for uncovered categories compared to competitive methods.

Conclusion: DxEvolve bridges the gap between current AI systems and real clinical diagnosis by supporting an accountable pathway for continual evolution of clinical AI through experience transformation into governable learning assets.

Abstract: Clinical diagnosis is a complex cognitive process, grounded in dynamic cue acquisition and continuous expertise accumulation. Yet most current artificial intelligence (AI) systems are misaligned with this reality, treating diagnosis as single-pass retrospective prediction while lacking auditable mechanisms for governed improvement. We developed DxEvolve, a self-evolving diagnostic agent that bridges these gaps through an interactive deep clinical research workflow. The framework autonomously requisitions examinations and continually externalizes clinical experience from increasing encounter exposure as diagnostic cognition primitives. On the MIMIC-CDM benchmark, DxEvolve improved diagnostic accuracy by 11.2% on average over backbone models and reached 90.4% on a reader-study subset, comparable to the clinician reference (88.8%). DxEvolve improved accuracy on an independent external cohort by 10.2% (categories covered by the source cohort) and 17.1% (uncovered categories) compared to the competitive method. By transforming experience into a governable learning asset, DxEvolve supports an accountable pathway for the continual evolution of clinical AI.

[304] Nurture-First Agent Development: Building Domain-Expert AI Agents Through Conversational Knowledge Crystallization

Linghao Zhang

Main category: cs.AI

TL;DR: NFD is a new paradigm for building domain-expert AI agents through conversational nurturing rather than pre-engineering expertise, using knowledge crystallization cycles to grow agents incrementally with practitioners.

Details

Motivation: Current LLM-based agent frameworks treat agent construction as a discrete engineering phase (code-first or prompt-first), which mismatches the nature of domain expertise that is tacit, personal, and evolving. Sequential development fails to capture continuous knowledge evolution.

Method: Proposes Nurture-First Development (NFD) where agents start with minimal scaffolding and grow through structured conversational interaction. Uses Knowledge Crystallization Cycle to consolidate fragmented dialogue knowledge into structured assets. Formalizes with Three-Layer Cognitive Architecture (organizing knowledge by volatility/personalization), crystallization operations/efficiency metrics, Dual-Workspace Pattern, and Spiral Development Model.

Result: Illustrated through detailed case study building a financial research agent for U.S. equity analysis. Demonstrates how NFD enables human-agent co-evolution and continuous knowledge growth rather than static expertise encoding.

Conclusion: NFD addresses fundamental mismatch between traditional agent development and evolving domain expertise. Enables progressive agent growth through practitioner interaction, with broader implications for human-agent co-evolution beyond financial domain.

Abstract: The emergence of large language model (LLM)-based agent frameworks has shifted the primary challenge in building domain-expert AI agents from raw capability to effective encoding of domain expertise. Two dominant paradigms – code-first development, which embeds expertise in deterministic pipelines, and prompt-first development, which captures expertise in static system prompts – both treat agent construction as a discrete engineering phase preceding deployment. We argue that this sequential assumption creates a fundamental mismatch with the nature of domain expertise, which is substantially tacit, deeply personal, and continuously evolving. We propose Nurture-First Development (NFD), a paradigm in which agents are initialized with minimal scaffolding and progressively grown through structured conversational interaction with domain practitioners. The central mechanism is the Knowledge Crystallization Cycle, whereby fragmented knowledge embedded in operational dialogue is periodically consolidated into structured, reusable knowledge assets. We formalize NFD through: (1) a Three-Layer Cognitive Architecture organizing agent knowledge by volatility and personalization degree; (2) the Knowledge Crystallization Cycle with formal definitions of crystallization operations and efficiency metrics; and (3) an operational framework comprising a Dual-Workspace Pattern and Spiral Development Model. We illustrate the paradigm through a detailed case study on building a financial research agent for U.S. equity analysis and discuss the conditions, limitations, and broader implications of NFD for human-agent co-evolution.

[305] A Hybrid Knowledge-Grounded Framework for Safety and Traceability in Prescription Verification

Yichi Zhu, Kan Ling, Xu Liu, Hengrun Zhang, Huiqun Yu, Guisheng Fan

Main category: cs.AI

TL;DR: PharmGraph-Auditor: A hybrid knowledge graph system for safe prescription auditing using LLMs with verification chains against a pharmaceutical knowledge base

Details

Motivation: Medication errors threaten patient safety, but direct LLM application is unreliable for pharmacist verification due to factual inaccuracies, lack of traceability, and weak complex reasoning. Need for evidence-grounded, trustworthy systems.

Method: Introduces PharmGraph-Auditor with Hybrid Pharmaceutical Knowledge Base (HPKB) combining relational (set constraints) and graph (topological reasoning) components via Virtual Knowledge Graph paradigm. Uses Iterative Schema Refinement algorithm to build HPKB from medical texts. Employs KB-grounded Chain of Verification (CoV) to transform LLMs into transparent reasoning engines with verifiable queries.

Result: Demonstrates robust knowledge extraction capabilities and shows promise for enabling safer, faster prescription verification by pharmacists.

Conclusion: PharmGraph-Auditor addresses LLM limitations for high-stakes medical applications through hybrid knowledge base architecture and verification chains, enabling trustworthy prescription auditing.

Abstract: Medication errors pose a significant threat to patient safety, making pharmacist verification (PV) a critical, yet heavily burdened, final safeguard. The direct application of Large Language Models (LLMs) to this zero-tolerance domain is untenable due to their inherent factual unreliability, lack of traceability, and weakness in complex reasoning. To address these challenges, we introduce PharmGraph-Auditor, a novel system designed for safe and evidence-grounded prescription auditing. The core of our system is a trustworthy Hybrid Pharmaceutical Knowledge Base (HPKB), implemented under the Virtual Knowledge Graph (VKG) paradigm. This architecture strategically unifies a relational component for set constraint satisfaction and a graph component for topological reasoning via a rigorous mapping layer. To construct this HPKB, we propose the Iterative Schema Refinement (ISR) algorithm, a framework that enables the co-evolution of both graph and relational schemas from medical texts. For auditing, we introduce the KB-grounded Chain of Verification (CoV), a new reasoning paradigm that transforms the LLM from an unreliable generator into a transparent reasoning engine. CoV decomposes the audit task into a sequence of verifiable queries against the HPKB, generating hybrid query plans to retrieve evidence from the most appropriate data store. Experimental results demonstrate robust knowledge extraction capabilities and show promises of using PharmGraph-Auditor to enable pharmacists to achieve safer and faster prescription verification.

[306] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities

Ziwei Zhou, Rui Wang, Zuxuan Wu, Yu-Gang Jiang

Main category: cs.AI

TL;DR: Daily-Omni: A benchmark for evaluating cross-modal temporal reasoning in MLLMs using audio-visual QA with real-world videos, showing current models struggle with alignment-critical tasks.

Details

Motivation: While MLLMs perform well on individual visual and audio benchmarks, their ability to process cross-modal information synchronously (particularly temporal alignment between audio and visual streams) remains unexplored. The authors aim to address this gap by creating a benchmark that explicitly requires cross-modal temporal reasoning.

Method: 1) Created Daily-Omni benchmark with 684 real-world videos and 1,197 multiple-choice questions spanning 6 task families requiring cross-modal temporal reasoning. 2) Developed semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering. 3) Evaluated 24 foundation models under 37 modality settings (Audio+Video/Audio-only/Video-only/Text-only). 4) Provided training-free modular diagnostic baseline using off-the-shelf unimodal models.

Result: Results show that many end-to-end MLLMs struggle on alignment-critical questions. The benchmark reveals that robust cross-modal temporal alignment remains a significant challenge for current models, even though they perform well on unimodal tasks.

Conclusion: Cross-modal temporal reasoning is an important open challenge for MLLMs. The Daily-Omni benchmark provides a valuable tool for evaluating and advancing models’ ability to synchronously process audio and visual information, with implications for real-world multimodal applications.

Abstract: Recent Multimodal Large Language Models (MLLMs) achieve promising performance on visual and audio benchmarks independently. However, the ability of these models to process cross-modal information synchronously remains largely unexplored. We introduce Daily-Omni, a multiple-choice Audio-Visual QA benchmark featuring 684 real-world videos and 1,197 questions spanning 6 task families that explicitly require cross-modal temporal reasoning. To support scalable benchmark construction, we develop a semi-automatic pipeline for annotation, cross-modal consistency refinement, temporal alignment elicitation, and text-only leakage filtering, followed by human verification. We further provide a diagnostic evaluation suite and extensively evaluate 24 foundation models under 37 model–modality settings (Audio+Video / Audio-only / Video-only / Text-only). Finally, we include a training-free modular diagnostic baseline that composes off-the-shelf unimodal models to serve as a diagnostic baseline and to illustrate how explicit temporal alignment signals affect performance. Results indicate that many end-to-end MLLMs still struggle on alignment-critical questions, suggesting that robust cross-modal temporal alignment remains an important open challenge.

[307] Personalizing explanations of AI-driven hints to users’ characteristics: an empirical evaluation

Vedant Bahel, Harshinee Sriram, Cristina Conati

Main category: cs.AI

TL;DR: Personalized hint explanations in an Intelligent Tutoring System improve engagement and learning for students with low Need for Cognition and Conscientiousness traits.

Details

Motivation: Students with low Need for Cognition and Conscientiousness often don't request explanations in ITS even though they would benefit from them, so personalizing explanations could enhance their engagement and learning.

Method: Extended an existing ITS with personalized hint explanations tailored to students with low levels of two specific traits, then conducted a formal user study to evaluate effectiveness.

Result: Personalization increased target users’ interaction with hint explanations, improved their understanding of hints, and enhanced their learning outcomes.

Conclusion: Personalized Explainable AI (PXAI) in education shows value for engaging students who typically don’t seek explanations, contributing to evidence for adaptive educational systems.

Abstract: The paper extends an existing Intelligent Tutoring System (ITS) that supports students’ learning via AI-driven personalized hints and can generate explanations to justify why/how the hints were generated. In this work, we investigate personalizing these hint explanations to students with low levels of two traits, Need for Cognition and Conscientiousness in order to enhance their engagement with the explanations, based on prior findings that these students generally do not ask for the explanations although they would benefit from them. We evaluate the effectiveness of the personalized hint explanations with a formal user study. Our results show that the personalization increases our target users’ interaction with the hint explanations, their understanding of the hints, and their learning. Hence, this work contributes to exiting initial evidence on the value of Personalized Explainable AI (PXAI) in education.

[308] Synthesizing Interpretable Control Policies through Large Language Model Guided Search

Carlo Bosio, Mark W. Mueller

Main category: cs.AI

TL;DR: LLM-based evolutionary algorithm generates interpretable control policies as Python programs for dynamical systems, enhancing transparency over black-box neural network approaches.

Details

Motivation: To create interpretable control policies for dynamical systems by combining LLMs with evolutionary algorithms, addressing the black-box nature of conventional learning-based control techniques that use neural networks.

Method: Represent control policies as Python programs, evaluate candidates in simulation, and evolve them using a pre-trained LLM as an evolutionary algorithm, maintaining interpretability throughout.

Result: Successfully synthesized interpretable control policies for pendulum swing-up and ball in cup tasks, with code made publicly available.

Conclusion: LLM-based evolutionary algorithms can generate transparent, interpretable control policies for dynamical systems while maintaining human-understandable code and verifiability at runtime.

Abstract: The combination of Large Language Models (LLMs), systematic evaluation, and evolutionary algorithms has enabled breakthroughs in combinatorial optimization and scientific discovery. We propose to extend this powerful combination to the control of dynamical systems, generating interpretable control policies capable of complex behaviors. With our novel method, we represent control policies as programs in standard languages like Python. We evaluate candidate controllers in simulation and evolve them using a pre-trained LLM. Unlike conventional learning-based control techniques, which rely on black-box neural networks to encode control policies, our approach enhances transparency and interpretability. We still take advantage of the power of large AI models, but only at the policy design phase, ensuring that all system components remain interpretable and easily verifiable at runtime. Additionally, the use of standard programming languages makes it straightforward for humans to finetune or adapt the controllers based on their expertise and intuition. We illustrate our method through its application to the synthesis of an interpretable control policy for the \textit{pendulum swing-up} and the \textit{ball in cup} tasks. We make the code available at https://github.com/muellerlab/synthesizing_interpretable_control_policies.git.

[309] Consistency-based Abductive Reasoning over Perceptual Errors of Multiple Pre-trained Models in Novel Environments

Mario Leiva, Noel Ngu, Joshua Shay Kricheli, Aditya Taparia, Ransalu Senanayake, Paulo Shakarian, Nathaniel Bastian, John Corcoran, Gerardo Simari

Main category: cs.AI

TL;DR: Consistency-based abduction framework integrates multiple pre-trained models at test-time to handle distribution shifts, using logic programming to resolve conflicts and improve robustness.

Details

Motivation: Pre-trained perception models degrade in novel environments due to distribution shifts. Existing metacognition approaches using logical rules improve precision but reduce recall. The paper hypothesizes that leveraging multiple models can mitigate recall reduction.

Method: Formulates conflicting predictions from multiple models as a consistency-based abduction problem. Encodes predictions and learned error detection rules in logic programs, then seeks abductive explanations (subsets of predictions) that maximize coverage while keeping inconsistency rate below threshold. Proposes two algorithms: exact Integer Programming and efficient Heuristic Search.

Result: Outperforms individual models and standard ensemble baselines on simulated aerial imagery with controlled distribution shifts. Achieves average relative improvements of ~13.6% in F1-score and 16.6% in accuracy across 15 diverse test datasets compared to best individual model.

Conclusion: Consistency-based abduction effectively integrates knowledge from multiple imperfect models in challenging novel scenarios, validating the approach for robust model integration.

Abstract: The deployment of pre-trained perception models in novel environments often leads to performance degradation due to distributional shifts. Although recent artificial intelligence approaches for metacognition use logical rules to characterize and filter model errors, improving precision often comes at the cost of reduced recall. This paper addresses the hypothesis that leveraging multiple pre-trained models can mitigate this recall reduction. We formulate the challenge of identifying and managing conflicting predictions from various models as a consistency-based abduction problem, building on the idea of abductive learning (ABL) but applying it to test-time instead of training. The input predictions and the learned error detection rules derived from each model are encoded in a logic program. We then seek an abductive explanation–a subset of model predictions–that maximizes prediction coverage while ensuring the rate of logical inconsistencies (derived from domain constraints) remains below a specified threshold. We propose two algorithms for this knowledge representation task: an exact method based on Integer Programming (IP) and an efficient Heuristic Search (HS). Through extensive experiments on a simulated aerial imagery dataset featuring controlled, complex distributional shifts, we demonstrate that our abduction-based framework outperforms individual models and standard ensemble baselines, achieving, for instance, average relative improvements of approximately 13.6% in F1-score and 16.6% in accuracy across 15 diverse test datasets when compared to the best individual model. Our results validate the use of consistency-based abduction as an effective mechanism to robustly integrate knowledge from multiple imperfect models in challenging, novel scenarios.

[310] Learning What Reinforcement Learning Can’t: Interleaved Online Fine-Tuning for Hardest Questions

Lu Ma, Hao Liang, Meiyi Qiang, Lexiang Tang, Xiaochen Ma, Zhen Hao Wong, Junbo Niu, Chengyu Shen, Runming He, Yanhao Li, Bin Cui, Wentao Zhang

Main category: cs.AI

TL;DR: ReLIFT: A novel training approach that interleaves reinforcement learning with online fine-tuning to overcome RL’s limitations in acquiring new knowledge for LLM reasoning.

Details

Motivation: Current RL approaches for LLM reasoning are insufficient for acquiring new knowledge beyond the base model's capabilities, as RL primarily optimizes based on existing knowledge rather than facilitating new information acquisition.

Method: ReLIFT (Reinforcement Learning Interleaved with Online Fine-Tuning) alternates between RL training and supervised fine-tuning using high-quality demonstration data collected when the model encounters challenging questions.

Result: ReLIFT achieves +5.2 point average improvement across five competition-level benchmarks and one out-of-distribution benchmark compared to zero-RL models, outperforming both RL and SFT while using only 13% of demonstration data.

Conclusion: ReLIFT overcomes fundamental limitations of RL for LLM reasoning by combining complementary strengths of RL and SFT, demonstrating significant potential for scalable reasoning enhancement.

Abstract: Recent advances in large language model (LLM) reasoning have shown that sophisticated behaviors such as planning and self-reflection can emerge through reinforcement learning (RL). However, despite these successes, RL in its current form remains insufficient to induce capabilities that exceed the limitations of the base model, as it is primarily optimized based on existing knowledge of the model rather than facilitating the acquisition of new information. To address this limitation, we employ supervised fine-tuning (SFT) to learn what RL cannot, which enables the incorporation of new knowledge and reasoning patterns by leveraging high-quality demonstration data. We analyze the training dynamics of RL and SFT for LLM reasoning and find that RL excels at maintaining and improving performance on questions within the model’s original capabilities, while SFT is more effective at enabling progress on questions beyond the current scope of the model. Motivated by the complementary strengths of RL and SFT, we introduce a novel training approach, \textbf{ReLIFT} (\textbf{Re}inforcement \textbf{L}earning \textbf{I}nterleaved with Online \textbf{F}ine-\textbf{T}uning). In ReLIFT, the model is primarily trained using RL, but when it encounters challenging questions, high-quality solutions are collected for fine-tuning, and the training process alternates between RL and fine-tuning to enhance the model’s reasoning abilities. ReLIFT achieves an average improvement of over +5.2 points across five competition-level benchmarks and one out-of-distribution benchmark compared to other zero-RL models. Furthermore, we demonstrate that ReLIFT outperforms both RL and SFT while using only 13% of the detailed demonstration data, highlighting its scalability. These results provide compelling evidence that ReLIFT overcomes the fundamental limitations of RL and underscores the significant potential.

[311] From Next Token Prediction to (STRIPS) World Models

Carlos Núñez-Molina, Vicenç Gómez, Hector Geffner

Main category: cs.AI

TL;DR: Transformers trained on next-token prediction can learn STRIPS action models from action traces, enabling planning with off-the-shelf planners across unseen states and goals, with standard transformers performing better than symbolically-aligned architectures.

Details

Motivation: To investigate whether next-token prediction in transformers can learn world models that support planning, specifically in controlled symbolic settings where propositional STRIPS action models can be learned from action traces alone and correctness can be evaluated exactly.

Method: Two architectures: 1) STRIPS Transformer with symbolic inductive bias grounded in theoretical links between transformers and STRIPS formal language structure; 2) Standard transformer with different positional encoding schemes and attention aggregation mechanisms. Evaluated on five classical planning domains measuring training accuracy, generalization, and planning performance.

Result: Both approaches can produce models supporting planning with off-the-shelf STRIPS planners over exponentially many unseen initial states and goals. Standard transformers with stick-breaking attention achieved near-perfect training accuracy and strong generalization, while STRIPS Transformer was harder to optimize and required larger datasets. Standard transformers without stick-breaking attention failed to generalize to long traces, but symbolic STRIPS models extracted from transformers trained on shorter traces succeeded.

Conclusion: Next-token prediction can yield world models that support planning, with standard transformers outperforming symbolically-aligned architectures in this controlled symbolic setting, though symbolic extraction from transformers trained on shorter traces enables generalization to longer sequences.

Abstract: We study whether next-token prediction can yield world models that truly support planning, in a controlled symbolic setting where propositional STRIPS action models are learned from action traces alone and correctness can be evaluated exactly. We introduce two architectures. The first is the STRIPS Transformer, a symbolically aligned model grounded in theoretical results linking transformers and the formal language structure of STRIPS domains. The second is a standard transformer architecture without explicit symbolic structure built in, for which we study different positional encoding schemes and attention aggregation mechanisms. We evaluate both architectures on five classical planning domains, measuring training accuracy, generalization, and planning performance across domains and problem sizes. Interestingly, both approaches can be used to produce models that support planning with off-the-shelf STRIPS planners over exponentially many unseen initial states and goals. Although the STRIPS Transformer incorporates a strong symbolic inductive bias, it is harder to optimize and requires larger datasets to generalize reliably. In contrast, a standard transformer with stick-breaking attention achieves near-perfect training accuracy and strong generalization. Finally, standard transformers without stick-breaking attention do not generalize to long traces, whereas a symbolic STRIPS model extracted from a transformer trained on shorter traces does.

[312] RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Nigel Fernandez, Branislav Kveton, Ryan A. Rossi, Andrew S. Lan, Zichao Wang

Main category: cs.AI

TL;DR: RADAR is a routing framework that intelligently allocates queries to different model-budget configurations based on query difficulty and model ability, optimizing performance-cost tradeoffs for reasoning tasks.

Details

Motivation: There's a tradeoff between performance and cost when deploying reasoning language models - larger models and higher reasoning budgets (more compute) improve performance but increase costs and latency. Current approaches don't intelligently route different queries to appropriate model configurations.

Method: RADAR learns an item response model from model responses with different budgets to queries, obtaining interpretable parameters (query difficulties and model-budget abilities). It then routes harder queries to higher-ability model-budget pairs and easier queries to lower-ability ones. The framework is lightweight, scalable, and can integrate new models efficiently.

Result: Extensive experiments on 8 challenging reasoning benchmarks show RADAR outperforms state-of-the-art model routing methods. It also demonstrates strong generalization to out-of-distribution queries and scalability for integrating additional models.

Conclusion: RADAR provides an effective, interpretable, and scalable solution for optimizing the performance-cost tradeoff in reasoning model deployment through intelligent query routing based on difficulty and ability.

Abstract: Reasoning language models have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance and cost tradeoff at two key levels: model size and reasoning budget, where larger models and higher reasoning budget lead to better performance but with increased cost and latency. In this work, we tackle this tradeoff from the angle of model configuration routing for different queries, and present RADAR (Reasoning-Ability and Difficulty-Aware Routing), a lightweight, interpretable, and scalable routing framework. Inspired by psychometrics, RADAR learns an item response model from model responses with different budgets to different queries, with interpretable parameters including query difficulties and model-budget abilities. RADAR then routes queries with higher difficulty to model-budget pairs with higher ability, and vice versa. We conduct extensive experiments on 8 widely used challenging reasoning benchmarks, demonstrating the superior performance of RADAR compared to state-of-the-art model routing methods. RADAR also exhibits query generalization capabilities, showing strong performance on out-of-distribution queries in all benchmarks. RADAR is also scalable and can efficiently integrate additional models by dynamically selecting a small set of evaluation queries to estimate their abilities.

[313] BiasBusters: Uncovering and Mitigating Tool Selection Bias in Large Language Models

Thierry Blankenstein, Jialin Yu, Zixuan Li, Vassilis Plachouras, Sunando Sengupta, Philip Torr, Yarin Gal, Alasdair Paren, Adel Bibi

Main category: cs.AI

TL;DR: LLM agents show systematic bias in selecting functionally equivalent tools from marketplaces, favoring certain providers or earlier context positions, with semantic alignment between queries and tool metadata being the strongest driver.

Details

Motivation: As LLM agents increasingly use external tools from marketplaces with multiple functionally equivalent options, systematic bias in tool selection can degrade user experience and distort competition by privileging certain providers over others.

Method: Created a benchmark of diverse tool categories with functionally equivalent tools, evaluated seven LLMs, conducted controlled experiments isolating effects of tool features, metadata, and pre-training exposure, and proposed a lightweight mitigation strategy.

Result: Substantial bias persists in LLM tool selection, with models fixating on single providers or favoring earlier context tools; semantic alignment between queries and tool metadata is the strongest driver; small description perturbations significantly shift choices; repeated pre-training exposure amplifies provider-level bias.

Conclusion: Tool-selection bias is a key obstacle to fair deployment of tool-augmented LLM agents; proposed mitigation strategy of filtering to relevant subset then uniform sampling substantially reduces bias while maintaining task coverage.

Abstract: Agents backed by large language models (LLMs) increasingly rely on external tools drawn from marketplaces where multiple providers offer functionally equivalent options. This raises a critical fairness concern: systematic bias in tool selection can degrade user experience and distort competition by privileging certain providers over others. We introduce a benchmark of diverse tool categories, each containing multiple functionally equivalent tools, to systematically evaluate tool-selection bias. Using this benchmark, we evaluate seven LLMs and show that substantial bias persists, with models either fixating on a single provider or disproportionately favoring tools that appear earlier in the context. To uncover the sources of this behavior, we conduct controlled experiments that isolate the effects of tool features, exposed metadata (name, description, and parameters), and pre-training exposure. We find that (1) semantic alignment between user queries and tool metadata is the strongest driver of selection; (2) small perturbations to tool descriptions can significantly shift choices; and (3) repeated pre-training exposure to a single endpoint amplifies provider-level bias. Finally, we propose a lightweight mitigation strategy that first filters tools to a relevant subset and then samples uniformly, substantially reducing selection bias while maintaining strong task coverage. Our results highlight tool-selection bias as a key obstacle to the fair deployment of tool-augmented LLM agents. Our code and benchmark are publicly available at https://github.com/thierry123454/tool-selection-bias.

[314] What We Don’t C: Manifold Disentanglement for Structured Discovery

Brian Rogers, Micah Bowles, Chris J. Lintott, Steve Croft, Oliver N. F. King, James Kostas Ray

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.09433: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09433&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[315] IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2512.00997: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.00997&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[316] Toward Closed-loop Molecular Discovery via Language Model, Property Alignment and Strategic Search

Junkai Ji, Zhangfan Yang, Dong Xu, Ruibin Bai, Jianqiang Li, Tingjun Hou, Zexuan Zhu

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2512.09566: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.09566&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[317] Learning Transferable Skills in Action RPGs via Directed Skill Graphs and Selective Adaptation

Ali Najar

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2601.17923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[318] MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2601.21468: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21468&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[319] To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Haoqing Wang, Xiang Long, Ziheng Li, Yilong Xu, Tingguang Li, Yehui Tang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to draw conclusions due to missing paper content

Abstract: Failed to fetch summary for 2602.12566: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12566&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[320] Many AI Analysts, One Dataset: Navigating the Agentic Data Science Multiverse

Martin Bertran, Riccardo Fogliato, Zhiwei Steven Wu

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.18710: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18710&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[321] A Minimal Agent for Automated Theorem Proving

Borja Requena, Austin Letson, Krystian Nowakowski, Izan Beltran Ferreiro, Leopoldo Sarra

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error from arXiv API.

Method: Cannot determine method as paper content is unavailable due to HTTP 429 error from arXiv API.

Result: Cannot determine results as paper content is unavailable due to HTTP 429 error from arXiv API.

Conclusion: Cannot determine conclusion as paper content is unavailable due to HTTP 429 error from arXiv API.

Abstract: Failed to fetch summary for 2602.24273: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24273&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Yuexi Du, Jinglu Wang, Shujie Liu, Nicha C. Dvornek, Yan Lu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.01607: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01607&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[323] ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents

Pengbo Liu

Main category: cs.AI

TL;DR: Unable to analyze paper 2603.01620 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation without access to paper abstract

Method: Cannot determine method without access to paper abstract

Result: Cannot determine results without access to paper abstract

Conclusion: Cannot draw conclusions without access to paper abstract

Abstract: Failed to fetch summary for 2603.01620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[324] SEED-SET: Scalable Evolving Experimental Design for System-level Ethical Testing

Anjali Parashar, Yingke Li, Eric Yang Yu, Fei Chen, James Neidhoefer, Devesh Upadhyay, Chuchu Fan

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2603.01630

Details

Motivation: Unable to determine motivation as paper content could not be retrieved due to technical limitations

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to draw conclusions about the paper due to technical limitations in accessing the content

Abstract: Failed to fetch summary for 2603.01630: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01630&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[325] UIS-Digger: Towards Comprehensive Research Agent Systems for Real-world Unindexed Information Seeking

Chang Liu, Chuqiao Kuang, Tianyi Zhuang, Yuxin Cheng, Huichi Zhou, Xiaoguang Li, Lifeng Shang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2603.08117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[326] RetroAgent: From Solving to Evolving via Retrospective Dual Intrinsic Feedback

Xiaoying Zhang, Zichen Liu, Yipeng Zhang, Xia Hu, Wenqi Shao

Main category: cs.AI

TL;DR: RetroAgent: An online RL framework with hindsight self-reflection that provides dual intrinsic feedback (numerical and language-based) to help LLM-based agents continuously adapt and evolve in complex interactive environments.

Details

Motivation: Standard RL paradigms for LLM-based agents favor static problem-solving over continuous adaptation, leading to suboptimal strategies due to insufficient exploration and implicit knowledge representation that limits experiential learning.

Method: RetroAgent features a hindsight self-reflection mechanism producing dual intrinsic feedback: (1) intrinsic numerical feedback tracking incremental subtask completion, and (2) intrinsic language feedback distilling reusable lessons into a memory buffer, retrieved via Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy.

Result: Significantly outperforms existing methods across four challenging agentic tasks: +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper compared to GRPO-trained agents, with strong test-time adaptation and generalization to out-of-distribution scenarios.

Conclusion: RetroAgent enables LLM-based agents to master complex interactive environments through continuous evolution rather than just static problem-solving, demonstrating superior performance and adaptability through its dual intrinsic feedback and memory retrieval mechanisms.

Abstract: Large language model (LLM)-based agents trained with reinforcement learning (RL) have shown strong potential on complex interactive tasks. However, standard RL paradigms favor static problem-solving over continuous adaptation: agents often converge to suboptimal strategies due to insufficient exploration, while learned knowledge remains implicit within parameters rather than explicitly retrievable, limiting effective experiential learning. To address these limitations, we introduce RetroAgent, an online RL framework that empowers agents to master complex interactive environments not just by solving, but by evolving. Concretely, RetroAgent features a hindsight self-reflection mechanism that produces dual intrinsic feedback: (1) intrinsic numerical feedback that that tracks incremental subtask completion relative to prior attempts, rewarding promising explorations, and (2) intrinsic language feedback that distills reusable lessons into a memory buffer, retrieved via our proposed Similarity & Utility-Aware Upper Confidence Bound (SimUtil-UCB) strategy balancing relevance, utility, and exploration to effectively leverage past experiences. Extensive experiments on two model families across four challenging agentic tasks demonstrate that RetroAgent significantly outperforms existing methods, achieving state-of-the-art results – e.g., surpassing Group Relative Policy Optimization (GRPO)-trained agents by +18.3% on ALFWorld, +15.4% on WebShop, +27.1% on Sokoban, and +8.9% on MineSweeper – while exhibiting strong test-time adaptation and generalization to out-of-distribution scenarios.

[327] Curveball Steering: The Right Direction To Steer Isn’t Always Linear

Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Amirali Abdullah

Main category: cs.AI

TL;DR: Unable to analyze paper 2603.09313 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract could not be retrieved

Method: Cannot determine method as abstract could not be retrieved

Result: Cannot determine results as abstract could not be retrieved

Conclusion: Cannot draw conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.09313: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09313&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[328] LCA: Local Classifier Alignment for Continual Learning

Tung Tran, Danilo Vasconcellos Vargas, Khoat Than

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.09888: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09888&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[329] Improving Fairness with Ensemble Combination: Margin-Dependent Bounds

Yijun Bian

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2301.10813: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2301.10813&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[330] An Updated Assessment of Reinforcement Learning for Macro Placement

Chung-Kuan Cheng, Andrew B. Kahng, Sayak Kundu, Yucheng Wang, Zhiang Wang

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2302.11014: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2302.11014&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[331] Optimal Transport Aggregation for Distributed Mixture-of-Experts

Faïcel Chamroukhi, Nhat Thien Pham

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2312.09877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.09877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[332] Boosting Cross-problem Generalization in Diffusion-Based Neural Combinatorial Solver via Inference Time Adaptation

Haoyu Lei, Kaiwen Zhou, Yinchuan Li, Zhitang Chen, Farzan Farnia

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2502.12188: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.12188&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[333] Offline Dynamic Inventory and Pricing Strategy: Addressing Censored and Dependent Demand

Korel Gundem, Zhengling Qi

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2504.09831: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09831&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[334] Scalable Multi-Task Learning through Spiking Neural Networks with Adaptive Task-Switching Policy for Intelligent Autonomous Agents

Rachmad Vidya Wicaksana Putra, Avaneesh Devkota, Muhammad Shafique

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze paper content

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to draw conclusions due to fetch failure

Abstract: Failed to fetch summary for 2504.13541: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13541&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[335] Comparative Analysis of Modern Machine Learning Models for Retail Sales Forecasting

Luka Hobor, Mario Brcic, Lidija Polutnik, Ante Kapetanovic

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper details

Method: Unable to determine method due to API rate limiting preventing access to paper details

Result: Unable to determine results due to API rate limiting preventing access to paper details

Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper details

Abstract: Failed to fetch summary for 2506.05941: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.05941&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[336] Self-Improving Loops for Visual Robotic Planning

Calvin Luo, Zilai Zeng, Mingxi Jia, Yilun Du, Chen Sun

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). No abstract available for analysis.

Details

Motivation: Unable to determine motivation due to missing paper content.

Method: Unable to determine method due to missing paper content.

Result: Unable to determine results due to missing paper content.

Conclusion: Unable to determine conclusion due to missing paper content.

Abstract: Failed to fetch summary for 2506.06658: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06658&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[337] Differential Privacy in Machine Learning: A Survey from Symbolic AI to LLMs

Francisco Aguilera-Martínez, Fernando Berzal

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2506.11687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.11687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[338] What Makes Code Generation Ethically Sourced?

Zhuolin Xu, Chenglin Li, Qiushi Li, Shin Hwei Tan

Main category: cs.AI

TL;DR: Paper 2507.19743: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing abstract

Method: Cannot determine method due to missing abstract

Result: Cannot determine results due to missing abstract

Conclusion: Cannot determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2507.19743: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.19743&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[339] Global Minimizers of Sigmoid Contrastive Loss

Kiril Bangachev, Guy Bresler, Iliyas Noman, Yury Polyanskiy

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2509.18552: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18552&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[340] A Systematic Evaluation of Self-Supervised Learning for Label-Efficient Sleep Staging with Wearable EEG

Emilio Estevan, María Sierra-Torralba, Eduardo López-Larraz, Luis Montesano

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2510.07960: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07960&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[341] Reveal-to-Revise: Explainable Bias-Aware Generative Modeling with Multimodal Attention

Noor Islam S. Mohammad, Md Muntaqim Meherab

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2510.12957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[342] Predicting kernel regression learning curves from only raw data statistics

Dhruva Karkada, Joseph Turnbull, Yuxi Liu, James B. Simon

Main category: cs.AI

TL;DR: Unable to analyze paper 2510.14878 due to HTTP 429 error when fetching abstract from arXiv API

Details

Motivation: Cannot determine motivation as abstract retrieval failed

Method: Cannot determine method as abstract retrieval failed

Result: Cannot determine results as abstract retrieval failed

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2510.14878: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14878&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[343] STREAM-VAE: Dual-Path Routing for Slow and Fast Dynamics in Vehicle Telemetry Anomaly Detection

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2511.15339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.15339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[344] Hierarchical Dual-Strategy Unlearning for Biomedical and Healthcare Intelligence Using Imperfect and Privacy-Sensitive Medical Data

Yi Zhang, Chao Zhang, Zijian Li, Tianxiang Xu, Kunyu Zhang, Zhan Gao, Meinuo Li, Xiaohan Zhang, Qichao Qi, Bing Chen

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2511.19498: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19498&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[345] Maximum Risk Minimization with Random Forests

Francesco Freni, Anya Fries, Linus Kühne, Markus Reichstein, Jonas Peters

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2512.10445 suggests it’s from December 2024, but content is unavailable.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2512.10445: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10445&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[346] Pretrained battery transformer (PBT): A foundation model for universal battery life prediction

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions about the paper due to technical issues accessing the content

Abstract: Failed to fetch summary for 2512.16334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[347] The Bayesian Geometry of Transformer Attention

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.22471: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22471&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[348] Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2512.22473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[349] Geometric Scaling of Bayesian Inference in LLMs

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2512.23752: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23752&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[350] Over-Searching in Search-Augmented Large Language Models

Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable due to server rate limiting

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2601.05503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[351] Burn-After-Use for Preventing Data Leakage through a Secure Multi-Tenant Architecture in Enterprise LLM

Qiang Zhang, Elena Emma Wang, Jiaming Li, Xichun Wang

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Unable to determine motivation due to API access limitations preventing paper retrieval

Method: No method information available due to API rate limiting error

Result: No results available - paper content could not be retrieved

Conclusion: Cannot analyze paper due to technical limitations in accessing arXiv data

Abstract: Failed to fetch summary for 2601.06627: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06627&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[352] Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He, Meng Xue, Xueluan Gong, Yuji Wang, Xuanye Zhang, Kwok-Yan Lam

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2601.10955: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10955&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[353] Moving On, Even When You’re Broken: Fail-Active Trajectory Generation via Diffusion Policies Conditioned on Embodiment and Task

Gilberto G. Briscoe-Martinez, Yaashia Gautam, Rahul Shetty, Anuj Pasricha, Marco M. Nicotra, Alessandro Roncone

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2602.02895: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02895&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[354] Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin

Main category: cs.AI

TL;DR: Unable to analyze paper 2602.10048 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2602.10048: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10048&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[355] Conformal Tradeoffs: Operational Profiles Beyond Coverage

Petrus H. Zwart

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API

Details

Motivation: Unable to determine motivation as paper content could not be retrieved

Method: Unable to determine method as paper content could not be retrieved

Result: Unable to determine results as paper content could not be retrieved

Conclusion: Unable to determine conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2602.18045: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18045&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[356] Adversarial Hubness Detector: Detecting Hubness Poisoning in Retrieval-Augmented Generation Systems

Idan Habler, Vineeth Sai Narajala, Stav Koren, Amy Chang, Tiffany Saade

Main category: cs.AI

TL;DR: Paper 2602.22427: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2602.22427: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22427&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[357] Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders

David Campbell, Neil Kale, Udari Madhushani Sehwag, Bert Herring, Nick Price, Dan Borges, Alex Levinson, Christina Q Knight

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.01246: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01246&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[358] BD-Merging: Bias-Aware Dynamic Model Merging with Evidence-Guided Contrastive Learning

Yuhan Xie, Chen Lyu

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to draw conclusions due to failed paper fetch

Abstract: Failed to fetch summary for 2603.03920: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03920&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[359] ResearchEnvBench: Benchmarking Agents on Environment Synthesis for Research Code Execution

Yubang Wang, Chenxi Zhang, Bowen Chen, Zezheng Huai, Zihao Dai, Xinchi Chen, Yuxin Wang, Yining Zheng, Jingjing Gong, Xipeng Qiu

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2603.06739: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06739&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[360] SeDa: A Unified System for Dataset Discovery and Multi-Entity Augmented Semantic Exploration

Kan Ling, Zhen Qin, Yichi Zhu, Hengrun Zhang, Huiqun Yu, Guisheng Fan

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2603.07502: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.07502&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[361] SiliconMind-V1: Multi-Agent Distillation and Debug-Reasoning Workflows for Verilog Code Generation

Mu-Chi Chen, Yu-Hung Kao, Po-Hsuan Huang, Shao-Chun Ho, Hsiang-Yu Tsou, I-Ting Wu, En-Ming Huang, Yu-Kai Hung, Wei-Po Hsin, Cheng Liang, Chia-Heng Tu, Shih-Hao Hung, H. T. Kung

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2603.08719: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08719&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[362] Alignment as Iatrogenesis: Pastoral Power, Collective Pathology, and the Structural Limits of Monolingual Safety Evaluation

Hiroki Fukui

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper content.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2603.08723: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08723&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[363] Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage

Saron Samuel, Alexander Martin, Eugene Yang, Andrew Yates, Dawn Lawrie, Ian Soboroff, Laura Dietz, Benjamin Van Durme

Main category: cs.AI

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2603.08819: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08819&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[364] A New Modeling to Feature Selection Based on the Fuzzy Rough Set Theory in Normal and Optimistic States on Hybrid Information Systems

Mohammad Hossein Safarpour, Seyed Majid Alavi, Mohammad Izadikhah, Hossein Dibachi

Main category: cs.AI

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2603.08900: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08900&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[365] Reinforced Generation of Combinatorial Structures: Ramsey Numbers

Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2603.09172: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09172&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Anupam Purwar, Aditya Choudhary

Main category: cs.AI

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2603.09643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.SD

Rodrigo Diaz, Rodrigo Constanzo, Mark Sandler

Main category: cs.SD

TL;DR: A set of Max externals for real-time non-linear modal synthesis of strings, membranes, and plates with interactive physical parameter control.

Details

Motivation: To lower the barrier for composers, performers, and sound designers to explore non-linear modal synthesis by providing efficient real-time tools in a familiar environment (Max).

Method: Developed C++ Max externals that implement non-linear modal synthesis algorithms, offering interactive control of physical parameters, custom modal data loading, and multichannel output.

Result: Open-source software (nlm) that enables real-time non-linear modal synthesis for various physical models, making advanced synthesis techniques more accessible.

Conclusion: The nlm externals successfully integrate interactive physical-modelling capabilities into Max, democratizing access to expressive non-linear modal synthesis for creative applications.

Abstract: We present \texttt{nlm}, a set of Max externals that enable efficient real-time non-linear modal synthesis for strings, membranes, and plates. The externals, implemented in C++, offer interactive control of physical parameters, allow the loading of custom modal data, and provide multichannel output. By integrating interactive physical-modelling capabilities into a familiar environment, \texttt{nlm} lowers the barrier for composers, performers, and sound designers to explore the expressive potential of non-linear modal synthesis. The externals are available as open-source software at https://github.com/rodrigodzf/nlm.

[368] ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Aviad Dahan, Moran Yanuka, Noa Kraicer, Lior Wolf, Raja Giryes

Main category: cs.SD

Details

Conclusion: ID-LoRA enables joint audio-video personalization in a single generative pass, achieving strong results with minimal training data while providing physically grounded sound synthesis.

[369] MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, Adarsh Kowdle

Main category: cs.SD

TL;DR: MoXaRt is a real-time XR system that uses audio-visual cues to separate entangled sound sources, enabling fine-grained sound interaction in complex acoustic environments.

Details

Motivation: In XR environments, complex acoustic scenes with multiple sound sources overwhelm users, compromising scene awareness and social engagement due to entangled audio sources.

Method: Cascaded architecture performing coarse audio-only separation in parallel with visual detection of sources (faces, instruments), then using visual anchors to guide refinement networks for isolating individual sources.

Result: Separates up to 5 concurrent sources with ~2 second latency, enhances speech intelligibility by 36.2% (p<0.01), and significantly reduces cognitive load (p<0.001) in user studies.

Conclusion: MoXaRt enables more perceptive and socially adept XR experiences by effectively separating entangled sound sources using audio-visual multimodal approaches.

Abstract: In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt’s core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly enhances speech intelligibility, yielding a 36.2% (p < 0.01) increase in listening comprehension within adversarial acoustic environments while substantially reducing cognitive load (p < 0.001), thereby paving the way for more perceptive and socially adept XR experiences.

[370] Distilling LLM Semantic Priors into Encoder-Only Multi-Talker ASR with Talker-Count Routing

Hao Shi, Yusuke Fujita, Roman Koshkin, Mengjie Zhao, Yuan Gao, Lianbo Liu, Yui Sudo

Main category: cs.SD

TL;DR: Encoder-only multi-talker ASR framework that adapts LLMs for semantic guidance during training but uses fast CTC decoding at inference, with talker-count prediction for variable numbers of speakers.

Details

Motivation: LLMs provide strong semantic priors for multi-talker ASR but are computationally expensive as autoregressive decoders and fragile under heavy overlap. Need efficient approach that retains LLM benefits while enabling fast inference.

Method: Proposes encoder-only MT-ASR framework that: 1) Adapts LLM to multi-talker conditioning, 2) Distills LLM’s semantic guidance into encoder during training, 3) Uses post-encoder separator with serialized CTC for talker-ordered transcripts, 4) Employs adapted LLM-based SOT objective as multi-talker-aware teacher signal, 5) Introduces Talker-Count Head to predict talker count and dynamically select decoding branch.

Result: Experiments on LibriMix show comparable performance to LLM-based systems in two-talker condition, significant improvements in three-talker condition, with significantly smaller real-time factor (RTF).

Conclusion: Encoder-only framework successfully distills LLM semantic guidance into encoder while maintaining fast CTC-style decoding, achieving strong performance especially in challenging multi-talker scenarios with computational efficiency.

Abstract: Large language models (LLMs) provide strong semantic priors that can improve multi-talker automatic speech recognition (MT-ASR), but using an LLM as an autoregressive decoder is computationally expensive and remains fragile under heavy overlap. In this paper, we propose an encoder-only MT-ASR framework that adapts an LLM to multi-talker conditioning and distills its semantic guidance into the encoder during training, while retaining fast CTC-style decoding at inference. Our model employs a post-encoder separator with serialized CTC to produce talker-ordered transcripts, and leverages an adapted LLM-based SOT objective as a multi-talker-aware teacher signal to explicitly regularize mixed-speech representations. To further support variable numbers of talkers, we introduce a Talker-Count Head that predicts the talker count and dynamically selects the appropriate decoding branch. Experiments on LibriMix show that the proposed encoder-only model achieves comparable performance to LLM-based systems in the two-talker condition, while delivering significant improvements in the three-talker condition with significant small RTF.

[371] AlphaFlowTSE: One-Step Generative Target Speaker Extraction via Conditional AlphaFlow

Duojia Li, Shuhan Zhang, Zihan Qian, Wenxuan Wu, Shuai Wang, Qingyang Hong, Lin Li, Haizhou Li

Main category: cs.SD

TL;DR: AlphaFlowTSE: One-step target speaker extraction using flow matching with JVP-free objective, eliminating mixture-dependent time coordinates and improving real-world generalization.

Details

Motivation: Current TSE methods using diffusion/flow-matching have multi-step latency issues, and one-step solutions rely on unreliable mixture-dependent time coordinates for real conversations.

Method: One-step conditional generative model trained with JVP-free AlphaFlow objective, learning mean-velocity transport along mixture-to-target trajectory, eliminating mixing-ratio prediction, and using interval-consistency teacher-student stabilization.

Result: Improves target-speaker similarity and real-mixture generalization for downstream ASR on Libri2Mix and REAL-T datasets.

Conclusion: AlphaFlowTSE provides efficient one-step TSE with better real-world performance by addressing limitations of existing flow-based methods.

Abstract: In target speaker extraction (TSE), we aim to recover target speech from a multi-talker mixture using a short enrollment utterance as reference. Recent studies on diffusion and flow-matching generators have improved target-speech fidelity. However, multi-step sampling increases latency, and one-step solutions often rely on a mixture-dependent time coordinate that can be unreliable for real-world conversations. We present AlphaFlowTSE, a one-step conditional generative model trained with a Jacobian-vector product (JVP)-free AlphaFlow objective. AlphaFlowTSE learns mean-velocity transport along a mixture-to-target trajectory starting from the observed mixture, eliminating auxiliary mixing-ratio prediction, and stabilizes training by combining flow matching with an interval-consistency teacher-student target. Experiments on Libri2Mix and REAL-T confirm that AlphaFlowTSE improves target-speaker similarity and real-mixture generalization for downstream automatic speech recognition (ASR).

[372] Probabilistic Verification of Voice Anti-Spoofing Models

Evgeny Kushnir, Alexandr Kozodaev, Dmitrii Korzh, Mikhail Pautov, Oleg Kiriukhin, Oleg Y. Rogov

Main category: cs.SD

TL;DR: PV-VASM is a probabilistic framework for verifying robustness of voice anti-spoofing models against speech synthesis attacks, providing formal guarantees and generalization to unseen techniques.

Details

Motivation: The paper addresses the security risks of speech synthesis technologies being misused for impersonation attacks, noting that existing voice anti-spoofing detection methods lack formal robustness guarantees and fail to generalize to unseen generation techniques.

Method: Proposes PV-VASM, a probabilistic framework that estimates misclassification probability under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations, with derived theoretical upper bounds on error probability.

Result: The method is validated across diverse experimental settings, demonstrating effectiveness as a practical robustness verification tool for voice anti-spoofing models.

Conclusion: PV-VASM provides a formal framework for assessing the robustness of voice anti-spoofing systems against evolving speech synthesis threats, addressing critical security gaps in current detection methods.

Abstract: Recent advances in generative models have amplified the risk of malicious misuse of speech synthesis technologies, enabling adversaries to impersonate target speakers and access sensitive resources. Although speech deepfake detection has progressed rapidly, most existing countermeasures lack formal robustness guarantees or fail to generalize to unseen generation techniques. We propose PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models (VASMs). PV-VASM estimates the probability of misclassification under text-to-speech (TTS), voice cloning (VC), and parametric signal transformations. The approach is model-agnostic and enables robustness verification against unseen speech synthesis techniques and input perturbations. We derive a theoretical upper bound on the error probability and validate the method across diverse experimental settings, demonstrating its effectiveness as a practical robustness verification tool.

[373] Towards Robust Speech Deepfake Detection via Human-Inspired Reasoning

Artem Dvirniak, Evgeny Kushnir, Dmitrii Tarasov, Artem Iudin, Oleg Kiriukhin, Mikhail Pautov, Dmitrii Korzh, Oleg Y. Rogov

Main category: cs.SD

TL;DR: HIR-SDD: A novel speech deepfake detection framework that combines Large Audio Language Models with chain-of-thought reasoning using human-annotated data to improve generalization and provide interpretable justifications.

Details

Motivation: Address the limitations of current speech deepfake detection methods which lack generalization to new audio domains/generators and lack interpretability/human-like reasoning for predictions.

Method: Proposes HIR-SDD framework combining Large Audio Language Models (LALMs) with chain-of-thought reasoning derived from novel human-annotated dataset for interpretable deepfake detection.

Result: Experimental evaluation demonstrates both effectiveness of the proposed method and its ability to provide reasonable justifications for predictions.

Conclusion: HIR-SDD offers a promising approach to speech deepfake detection that addresses generalization and interpretability issues through LALMs and human-like reasoning.

Abstract: The modern generative audio models can be used by an adversary in an unlawful manner, specifically, to impersonate other people to gain access to private information. To mitigate this issue, speech deepfake detection (SDD) methods started to evolve. Unfortunately, current SDD methods generally suffer from the lack of generalization to new audio domains and generators. More than that, they lack interpretability, especially human-like reasoning that would naturally explain the attribution of a given audio to the bona fide or spoof class and provide human-perceptible cues. In this paper, we propose HIR-SDD, a novel SDD framework that combines the strengths of Large Audio Language Models (LALMs) with the chain-of-thought reasoning derived from the novel proposed human-annotated dataset. Experimental evaluation demonstrates both the effectiveness of the proposed method and its ability to provide reasonable justifications for predictions.

[374] Speaker Verification with Speech-Aware LLMs: Evaluation and Augmentation

Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, Najim Dehak

Main category: cs.SD

TL;DR: The paper introduces a protocol to evaluate speaker discrimination in speech-aware LLMs and proposes ECAPA-LLM, a lightweight augmentation that adds speaker verification capability to LLMs while preserving natural language interfaces.

Details

Motivation: Current speech-aware LLMs focus on linguistic content or specific fields like emotions, but it's unclear whether they encode speaker identity information. The authors want to evaluate and enhance speaker discrimination capabilities in these models.

Method: 1) Proposed a model-agnostic scoring protocol using confidence scores or log-likelihood ratios from Yes/No token probabilities to evaluate speaker discrimination. 2) Introduced ECAPA-LLM - a lightweight augmentation that injects frozen ECAPA-TDNN speaker embeddings through learned projection and trains only LoRA adapters on TinyLLaMA-1.1B.

Result: Benchmarking showed weak speaker discrimination in recent speech-aware LLMs (EERs above 20% on VoxCeleb1). ECAPA-LLM achieved 1.03% EER on VoxCeleb1-E, approaching dedicated speaker verification system performance while maintaining natural-language interface.

Conclusion: Speech-aware LLMs have weak inherent speaker discrimination, but lightweight augmentation with speaker embeddings can equip them with strong speaker verification capabilities while preserving their core functionality.

Abstract: Speech-aware large language models (LLMs) can accept speech inputs, yet their training objectives largely emphasize linguistic content or specific fields such as emotions or the speaker’s gender, leaving it unclear whether they encode speaker identity. First, we propose a model-agnostic scoring protocol that produces continuous verification scores for both API-only and open-weight models, using confidence scores or log-likelihood ratios from the Yes/No token probabilities. Using this protocol, we benchmark recent speech-aware LLMs and observe weak speaker discrimination (EERs above 20% on VoxCeleb1). Second, we introduce a lightweight augmentation that equips an LLM with ASV capability by injecting frozen ECAPA-TDNN speaker embeddings through a learned projection and training only LoRA adapters. On TinyLLaMA-1.1B, the resulting ECAPA-LLM achieves 1.03% EER on VoxCeleb1-E, approaching a dedicated speaker verification system while preserving a natural-language interface.

[375] Are Deep Speech Denoising Models Robust to Adversarial Noise?

Will Schwarzer, Neel Chaudhari, Philip S. Thomas, Andrea Fanelli, Xiaoyu Liu

Main category: cs.SD

TL;DR: Recent deep noise suppression models are vulnerable to psychoacoustically hidden adversarial noise that causes them to output unintelligible gibberish, even in low-noise environments.

Details

Motivation: To investigate the security vulnerabilities of deep noise suppression (DNS) models used in high-stakes speech applications, particularly their susceptibility to adversarial attacks that could compromise speech intelligibility.

Method: Tested four recent DNS models with psychoacoustically hidden adversarial noise in low-background-noise and simulated over-the-air settings. Conducted transcription studies with audio/multimedia experts to confirm unintelligibility and ABX studies to measure adversarial noise perceptibility.

Result: All four DNS models could be reduced to outputting unintelligible gibberish through adversarial noise. Transcription studies confirmed unintelligibility, while ABX studies showed the adversarial noise was generally imperceptible (with some variance). Negative results were found for targeted attacks and model transfer.

Conclusion: Deep noise suppression models have significant security vulnerabilities to adversarial attacks, highlighting the need for practical countermeasures before deployment in safety-critical applications.

Abstract: Deep noise suppression (DNS) models enjoy widespread use throughout a variety of high-stakes speech applications. However, we show that four recent DNS models can each be reduced to outputting unintelligible gibberish through the addition of psychoacoustically hidden adversarial noise, even in low-background-noise and simulated over-the-air settings. For three of the models, a small transcription study with audio and multimedia experts confirms unintelligibility of the attacked audio; simultaneously, an ABX study shows that the adversarial noise is generally imperceptible, with some variance between participants and samples. While we also establish several negative results around targeted attacks and model transfer, our results nevertheless highlight the need for practical countermeasures before open-source DNS systems can be used in safety-critical applications.

[376] OSUM-Pangu: An Open-Source Multidimension Speech Understanding Foundation Model Built upon OpenPangu on Ascend NPUs

Yujie Liao, Xuelong Geng, Hongfei Xue, Shuiyuan Wang, Lei Xie

Main category: cs.SD

TL;DR: OSUM-Pangu: A fully open-source speech understanding foundation model built on non-CUDA hardware (Ascend NPU) using openPangu-7B LLM backbone, achieving GPU-comparable accuracy while enabling deployment on alternative computing infrastructures.

Details

Motivation: Current high-performance speech LLMs are optimized for GPU ecosystems and proprietary backbones, creating deployment barriers for non-CUDA computing infrastructures. There's a need for open-source alternatives that work on diverse hardware platforms.

Method: Integrates audio encoder with openPangu-7B LLM backbone, implements entire training/inference pipeline on Ascend NPU platform, uses sequential training process bridging speech perception and user intent recognition under non-CUDA constraints.

Result: Achieves task accuracy comparable to mainstream GPU-based models while maintaining robust natural language interaction capabilities, providing reproducible non-CUDA baseline for open-source speech community.

Conclusion: OSUM-Pangu enables independent evolution of multimodal intelligence by providing open-source speech understanding foundation model that works on non-CUDA hardware, promoting accessibility and deployment flexibility.

Abstract: Recent advancements in Speech Large Language Models have significantly enhanced multi-dimensional speech understanding. However, the majority of high-performance frameworks are predominantly optimized for GPU centric ecosystems and proprietary backbones, creating a significant gap for deployment on non-CUDA computing infrastructures. In this paper, we present OSUM-Pangu, a fully open-source speech understanding foundation model developed on a completely non-CUDA software and hardware stack. By integrating an audio encoder with the openPangu-7B LLM backbone, we successfully implement the entire training and inference pipeline on the Ascend NPU platform. To facilitate efficient task alignment under non-CUDA resource constraints, we adopt a practical training process that sequentially bridges speech perception and user intent recognition. Experimental results demonstrate that OSUM-Pangu achieves task accuracy comparable to mainstream GPU-based models while maintaining robust natural language interaction capabilities. Our work provides a reproducible, non-CUDA baseline for the open-source speech community, promoting the independent evolution of multimodal intelligence.

[377] VoxCare: Studying Natural Communication Behaviors of Hospital Caregivers through Wearable Sensing of Egocentric Audio

Tiantian Feng, Kleanthis Avramidis, Anfeng Xu, Deqi Wang, Brandon M Booth, Shrikanth Narayanan

Main category: cs.SD

TL;DR: VoxCare: A wearable audio sensing system for healthcare professionals that captures communication behaviors in clinical settings using real-time acoustic feature extraction and speech foundation models, enabling analysis of communication patterns related to workload and stress.

Details

Motivation: Healthcare communication is critical but challenging to measure in real-world clinical settings. There's a need for scalable systems to capture natural communication behaviors of hospital professionals without privacy concerns from storing raw audio.

Method: Developed VoxCare, an egocentric wearable audio sensing system that performs real-time, on-device acoustic feature extraction. Uses a speech foundation model-guided teacher-student framework to identify foreground speech activity and derives interpretable behavioral measures of communication frequency, duration, and vocal arousal.

Result: The system successfully captured communication patterns across different shifts and working units, revealing how, when, and how often clinicians communicate. Analysis suggests communication activity reflects underlying workload and stress levels.

Conclusion: VoxCare enables continuous assessment of communication patterns in healthcare settings, providing data-driven approaches to understand healthcare provider behaviors and potentially improve healthcare delivery through better communication analysis.

Abstract: Healthcare professionals work in complex, high-stakes environments where effective communication is critical for care delivery, team coordination, and individual well-being. However, communication activity in everyday clinical settings remains challenging to measure and largely unexplored in human behavioral research. We present VoxCare, a scalable egocentric wearable audio sensing and computing system that captures natural communication behaviors of hospital professionals in real-world settings without storing raw audio. VoxCare performs real-time, on-device acoustic feature extraction and applies a speech foundation model-guided teacher-student framework to identify foreground speech activity. From these features, VoxCare derives interpretable behavioral measures of communication frequency, duration, and vocal arousal. Our analyses reveal how, when, and how often clinicians communicate across different shifts and working units, and suggest that communication activity reflects underlying workload and stress. By enabling continuous assessment of communication patterns in everyday contexts, this study provides data-driven approaches to understand the behaviors of healthcare providers and ultimately improve healthcare delivery.

[378] Modeling strategies for speech enhancement in the latent space of a neural audio codec

Sofiene Kammoun, Xavier Alameda-Pineda, Simon Leglaive

Main category: cs.SD

TL;DR: Comparing continuous vs discrete neural audio codec representations for speech enhancement, finding continuous predictions work better, non-autoregressive models are more practical, and encoder fine-tuning improves enhancement but hurts codec reconstruction.

Details

Motivation: Neural audio codecs provide compact speech representations (continuous vectors or discrete tokens), but it's unclear which representation type works better as training targets for supervised speech enhancement tasks.

Method: Evaluated both autoregressive and non-autoregressive Conformer-based speech enhancement models, plus a baseline of fine-tuning the NAC encoder. Compared continuous latent representation prediction vs discrete token prediction across these architectures.

Result: Three key findings: 1) Predicting continuous latent representations consistently outperforms discrete token prediction; 2) Autoregressive models achieve higher quality but sacrifice intelligibility and efficiency, making non-autoregressive models more practical; 3) Adding encoder fine-tuning yields strongest enhancement metrics overall but degrades codec reconstruction quality.

Conclusion: Continuous neural audio codec representations are superior training targets for speech enhancement, with non-autoregressive models offering the best practical trade-off, and encoder fine-tuning provides enhancement benefits despite compromising codec reconstruction.

Abstract: Neural audio codecs (NACs) provide compact latent speech representations in the form of sequences of continuous vectors or discrete tokens. In this work, we investigate how these two types of speech representations compare when used as training targets for supervised speech enhancement. We consider both autoregressive and non-autoregressive speech enhancement models based on the Conformer architecture, as well as a simple baseline where the NAC encoder is simply fine-tuned for speech enhancement. Our experiments reveal three key findings: predicting continuous latent representations consistently outperforms discrete token prediction; autoregressive models achieve higher quality but at the expense of intelligibility and efficiency, making non-autoregressive models more attractive in practice; and adding encoder fine-tuning yields the strongest enhancement metrics overall, though at the cost of degraded codec reconstruction. The code and audio samples are available online.

[379] When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Anupam Purwar, Aditya Choudhary

Main category: cs.SD

TL;DR: LoRA fine-tuning of Qwen-0.5B LLM backbone improves voice cloning quality in TTS systems across perceptual quality, speaker fidelity, and signal-level metrics, with gains dependent on training data diversity.

Details

Motivation: Frozen LLM representations are insufficient for modeling speaker-specific acoustic and perceptual characteristics in text-to-speech systems, necessitating better adaptation methods for voice cloning tasks.

Method: Used LoRA (Low-Rank Adaptation) fine-tuning of the Qwen-0.5B language model backbone for TTS, evaluating improvements across multiple speakers compared to non-fine-tuned base model.

Result: LoRA fine-tuning consistently outperformed base model across three dimensions: perceptual quality (DNS-MOS gains up to 0.42), speaker fidelity (improved voice similarity), and signal quality (SNR increases up to 34%). Gains were strongly dependent on training data characteristics.

Conclusion: LoRA fine-tuning is an effective mechanism for speaker-level adaptation in compact LLM-based TTS systems, surpassing frozen base models in perceptual quality and speaker similarity when supported by diverse training data.

Abstract: Large language models are increasingly adopted as semantic backbones for neural text-to-speech systems. However, frozen LLM representations are insufficient for modeling speaker specific acoustic and perceptual characteristics. Our experiments involving fine tuning of the Language Model backbone of TTS show promise in improving the voice consistency and Signal to Noise ratio SNR in voice cloning task. Across multiple speakers LoRA finetuning consistently outperforms the non-finetuned base Qwen-0.5B model across three complementary dimensions of speech quality. First, perceptual quality improves significantly with DNS-MOS gains of up to 0.42 points for speakers whose training data exhibits sufficient acoustic variability. Second, speaker fidelity improves for all evaluated speakers with consistent increases in voice similarity indicating that LoRA effectively adapts speaker identity representations without degrading linguistic modeling. Third, signal level quality improves in most cases with signal to noise ratio increasing by as much as 34 percent. Crucially these improvements are strongly governed by the characteristics of the training data. Speakers with high variability in acoustic energy and perceptual quality achieve simultaneous gains in DNS-MOS voice similarity and SNR. Overall this work establishes that LoRA finetuning is not merely a parameter efficient optimization technique but an effective mechanism for better speaker level adaptation in compact LLM-based TTS systems. When supported by sufficiently diverse training data LoRA adapted Qwen-0.5B consistently surpasses its frozen base model in perceptual quality speaker similarity with low latency using GGUF model hosted in quantized form.

[380] Training-Free Multi-Step Inference for Target Speaker Extraction

Zhenghai You, Ying Shi, Lantian Li, Dong Wang

Main category: cs.SD

TL;DR: A training-free multi-step inference method for target speaker extraction that enables iterative refinement using test-time scaling, with joint metric optimization for controllable extraction preferences.

Details

Motivation: Most target speaker extraction systems use one-step inference with conditional auto-encoders, but test-time scaling suggests iterative refinement could improve performance without retraining.

Method: Propose multi-step inference with frozen pretrained model: at each step, generate new candidates by interpolating original mixture and previous estimate, select best candidate for further refinement until convergence. Use joint metric optimization to balance intrusive (SI-SDRi) and non-intrusive (UTMOS, SpkSim) metrics.

Result: With ground-truth target speech, optimizing SI-SDRi yields consistent gains across multiple metrics. Without ground truth, optimizing individual non-intrusive metrics improves corresponding metric but may hurt others. Joint metric optimization enables controllable extraction preferences.

Conclusion: Training-free multi-step inference enables iterative refinement for target speaker extraction, with joint metric optimization providing practical deployment flexibility for balancing different quality objectives.

Abstract: Target speaker extraction (TSE) aims to recover a target speaker’s speech from a mixture using a reference utterance as a cue. Most TSE systems adopt conditional auto-encoder architectures with one-step inference. Inspired by test-time scaling, we propose a training-free multi-step inference method that enables iterative refinement with a frozen pretrained model. At each step, new candidates are generated by interpolating the original mixture and the previous estimate, and the best candidate is selected for further refinement until convergence. Experiments show that, when ground-truth target speech is available, optimizing an intrusive metric (SI-SDRi) yields consistent gains across multiple evaluation metrics. Without ground truth, optimizing non-intrusive metrics (UTMOS or SpkSim) improves the corresponding metric but may hurt others. We therefore introduce joint metric optimization to balance these objectives, enabling controllable extraction preferences for practical deployment.

[381] Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and Multi-Scale Global-Local Attention

Kai Li, Kejun Gao, Xiaolin Hu

Main category: cs.SD

TL;DR: Dolphin: An efficient audio-visual speech separation method using lightweight dual-path video encoder for lip-motion tokens and encoder-decoder separator with global-local attention blocks.

Details

Motivation: Current AVSS methods have large parameter counts and high computational costs, making them impractical for applications where speech separation is just a preprocessing step for further speech processing.

Method: 1) DP-LipCoder: dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens; 2) Lightweight encoder-decoder separator with global-local attention blocks to capture multi-scale dependencies efficiently.

Result: Outperforms SOTA in separation quality while achieving >50% fewer parameters, >2.4x reduction in MACs, and >6x faster GPU inference speed on three benchmark datasets.

Conclusion: Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios with significantly improved efficiency.

Abstract: Audio-visual speech separation (AVSS) methods leverage visual cues to extract target speech and have demonstrated strong separation quality in noisy acoustic environments. However, these methods usually involve a large number of parameters and require high computational cost, which is unacceptable in many applications where speech separation serves as only a preprocessing step for further speech processing. To address this issue, we propose an efficient AVSS method, named Dolphin. For visual feature extraction, we develop DP-LipCoder, a dual-path lightweight video encoder that transforms lip-motion into discrete audio-aligned semantic tokens. For audio separation, we construct a lightweight encoder-decoder separator, in which each layer incorporates a global-local attention (GLA) block to efficiently capture multi-scale dependencies. Experiments on three benchmark datasets showed that Dolphin not only surpassed the current state-of-the-art (SOTA) model in separation quality but also achieved remarkable improvements in efficiency: over 50% fewer parameters, more than 2.4x reduction in MACs, and over 6x faster GPU inference speed. These results indicate that Dolphin offers a practical and deployable solution for high-performance AVSS in real-world scenarios. Our code and demo page are publicly available at http://cslikai.cn/Dolphin/.

[382] Evaluation of Audio Compression Codecs

Thien T. Duong, Jan P. Springer

Main category: cs.SD

TL;DR: Evaluation of audio compression codecs focusing on both compression efficiency and perceptual quality metrics including PEAQ scores, visualizations, and performance measurements.

Details

Motivation: Users often focus only on compression efficiency when choosing audio codecs, but perceptual quality (accuracy, intelligibility, fidelity) is equally important for human listening experience, especially given the widespread use of audio compression in digital media.

Method: Evaluates multiple common audio compression codecs using codec performance measurements, visualizations, and PEAQ (Perceptual Evaluation of Audio Quality) scores to assess both compression performance and sonic perceptual quality.

Result: Demonstrates how different digital audio compression techniques affect perceptual quality, providing insights into the trade-offs between compression efficiency and audio fidelity.

Conclusion: Users should consider both compression efficiency and perceptual quality when selecting audio compression codecs, as compression techniques significantly impact the sonic experience and human perception of audio fidelity.

Abstract: Perceptual quality of audio is the combination of aural accuracy and listener-perceived sound fidelity. It is how humans respond to the accuracy, intelligibility, and fidelity of aural media. Today this fidelity is also heavily influenced by the use of audio compression codecs for storing aural media in digital form. We argue that, when choosing an audio compression codec, users should not only look at compression efficiency but also consider the sonic perceptual quality properties of available audio compression codecs. We evaluate several commonly used audio compression codecs in terms of compression performance as well as their sonic perceptual quality via codec performance measurements, visualizations, and PEAQ scores. We demonstrate how perceptual quality is affected by digital audio compression techniques, providing insights for users in the process of choosing a digital audio compression scheme.

[383] Fish Audio S2 Technical Report

Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han

Main category: cs.SD

TL;DR: Fish Audio S2 is an open-source text-to-speech system with multi-speaker, multi-turn generation and natural language instruction-following control, featuring production-ready streaming inference.

Details

Motivation: To create an advanced open-source TTS system that goes beyond basic speech synthesis by enabling multi-speaker, multi-turn conversations and natural language control over voice characteristics, making high-quality speech generation more accessible and controllable.

Method: Developed a multi-stage training recipe with staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. Released model weights, fine-tuning code, and SGLang-based inference engine optimized for streaming.

Result: Achieved production-ready streaming with RTF of 0.195 and time-to-first-audio below 100 ms. Released complete open-source system including weights, fine-tuning code, and inference engine.

Conclusion: Fish Audio S2 pushes the frontier of open-source TTS by providing a powerful, controllable, and production-ready speech synthesis system with natural language instruction following capabilities.

Abstract: We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning and speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms.Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.

cs.LG

[384] KernelSkill: A Multi-Agent Framework for GPU Kernel Optimization

Qitong Sun, Jun Han, Tianlin Li, Zhe Tang, Sheng Chen, Fei Yang, Aishan Liu, Xianglong Liu, Yang Liu

Main category: cs.LG

TL;DR: KernelSkill is a multi-agent framework that replaces implicit LLM heuristics with knowledge-driven expert optimization skills for GPU kernel generation, achieving significant speedups on benchmark tasks.

Details

Motivation: Existing LLM-based GPU kernel optimization pipelines rely on opaque, implicitly learned heuristics, leading to inefficient trial-and-error and weakly interpretable optimizations. The authors aim to replace these implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories.

Method: KernelSkill is a multi-agent framework with a dual-level memory architecture. It coordinates agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. This approach replaces implicit heuristics with explicit, knowledge-driven optimization strategies.

Result: On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3 respectively, outperforming prior baselines.

Conclusion: The KernelSkill framework demonstrates that replacing implicit LLM heuristics with knowledge-driven expert optimization skills can significantly improve GPU kernel efficiency and optimization success rates.

Abstract: Improving GPU kernel efficiency is crucial for advancing AI systems. Recent work has explored leveraging large language models (LLMs) for GPU kernel generation and optimization. However, existing LLM-based kernel optimization pipelines typically rely on opaque, implicitly learned heuristics within the LLMs to determine optimization strategies. This leads to inefficient trial-and-error and weakly interpretable optimizations. Our key insight is to replace implicit heuristics with expert optimization skills that are knowledge-driven and aware of task trajectories. Specifically, we present KernelSkill, a multi-agent framework with a dual-level memory architecture. KernelSkill operates by coordinating agents with long-term memory of reusable expert skills and short-term memory to prevent repetitive backtracking. On KernelBench Levels 1-3, KernelSkill achieves a 100% success rate and average speedups of 5.44x, 2.82x, and 1.92x over Torch Eager on Levels 1, 2, and 3, respectively, outperforming prior baselines. Code is available at https://github.com/0satan0/KernelMem/.

[385] Explainable LLM Unlearning Through Reasoning

Junfeng Liao, Qizhou Wang, Shanshan Ye, Xin Yu, Ling Chen, Zhen Fang

Main category: cs.LG

TL;DR: TRU introduces reasoning-based unlearning targets to guide LLM unlearning, addressing issues in gradient ascent methods by providing explicit guidance on what and how to unlearn.

Details

Motivation: Current LLM unlearning methods like gradient ascent suffer from unintended degradation of general capabilities, incomplete knowledge removal, and incoherent responses due to lack of explicit guidance on what/how to unlearn.

Method: Proposes Targeted Reasoning Unlearning (TRU) using reasoning-based unlearning targets as guidance, combining cross-entropy supervised loss with GA-based loss to learn reasoning ability for precise knowledge removal while preserving unrelated abilities.

Result: TRU achieves more reliable unlearning while preserving general capabilities across multiple benchmarks and LLM backbones, and exhibits superior robustness under diverse attack scenarios due to learned reasoning ability.

Conclusion: Reasoning-augmented unlearning establishes a practical paradigm for reliable and explainable LLM unlearning, addressing safety, copyright, and privacy concerns.

Abstract: LLM unlearning is essential for mitigating safety, copyright, and privacy concerns in pre-trained large language models (LLMs). Compared to preference alignment, it offers a more explicit way by removing undesirable knowledge characterized by specific unlearning datasets. In previous works, gradient ascent (GA) and its variants have shown promise for implementing unlearning, yet their untargeted nature results in unintended degradation of general capabilities, incomplete removal of knowledge, and the generation of incoherent responses, among many others. We argue that these issues stem from the absence of explicit guidance on what and how models should unlearn. To fill this gap, we introduce a novel unlearning target, reasoning-based unlearning target, which satisfies both the specified unlearning scope and the specified post-unlearning response. Building on this, we propose targeted reasoning unlearning (TRU), which leverages reasoning-based unlearning target as guidance. We employ the target using a cross-entropy supervised loss combined with a GA-based loss, enabling the model to learn reasoning ability for precise knowledge removal while preserving unrelated abilities. We evaluate TRU against strong baselines across multiple benchmarks and LLM backbones, and find that it achieves more reliable unlearning while preserving general capabilities. Moreover, TRU exhibits superior robustness under diverse attack scenarios, stemming from the reasoning ability learned through reasoning-based targets. Overall, our study establishes reasoning-augmented unlearning as a practical paradigm for reliable and explainable LLM unlearning.

[386] MoE-SpAc: Efficient MoE Inference Based on Speculative Activation Utility in Heterogeneous Edge Scenarios

Shuhuai Li, Jianghao Lin, Dongdong Ge, Yinyu Ye

Main category: cs.LG

TL;DR: MoE-SpAc: A novel MoE inference framework that uses speculative decoding as a lookahead sensor for memory management, achieving significant speedups over existing methods.

Details

Motivation: Mixture-of-Experts (MoE) models face severe memory constraints on edge devices, and existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation.

Method: Repurposes Speculative Decoding (SD) as an informative lookahead sensor for memory management. Introduces three components: Speculative Utility Estimator to track expert demand, Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and Asynchronous Execution Engine to unify prefetching and eviction.

Result: Achieves 42% improvement in TPS over state-of-the-art SD-based baseline, and average 4.04x speedup over all standard baselines across seven benchmarks.

Conclusion: MoE-SpAc effectively addresses memory constraints in MoE inference on edge devices by using speculative decoding for intelligent memory management, demonstrating significant performance improvements.

Abstract: Mixture-of-Experts (MoE) models enable scalable performance but face severe memory constraints on edge devices. Existing offloading strategies struggle with I/O bottlenecks due to the dynamic, low-information nature of autoregressive expert activation. In this paper, we propose to repurpose Speculative Decoding (SD) not merely as a compute accelerator, but as an informative lookahead sensor for memory management, supported by our theoretical and empirical analyses. Hence, we introduce MoE-SpAc, an MoE inference framework that integrates a Speculative Utility Estimator to track expert demand, a Heterogeneous Workload Balancer to dynamically partition computation via online integer optimization, and an Asynchronous Execution Engine to unify the prefetching and eviction in the same utility space. Extensive experiments on seven benchmarks demonstrate that MoE-SpAc achieves a 42% improvement in TPS over the SOTA SD-based baseline, and an average 4.04x speedup over all standard baselines. Code is available at https://github.com/lshAlgorithm/MoE-SpAc .

[387] Personalized Group Relative Policy Optimization for Heterogenous Preference Alignment

Jialu Wang, Heinrich Peters, Asad A. Butt, Navid Hashemi, Alireza Hashemi, Pouya M. Ghari, Joseph Hoover, James Rae, Morteza Dehghani

Main category: cs.LG

TL;DR: P-GRPO: A personalized alignment framework that improves on GRPO by normalizing advantages against preference-group-specific reward histories instead of batch statistics, enabling better learning of diverse individual preferences.

Details

Motivation: Standard LLM alignment methods like RLHF optimize for a single global objective, failing to align with diverse individual preferences. GRPO's group-based normalization assumes all samples are exchangeable, conflating distinct user reward distributions and biasing learning toward dominant preferences while suppressing minority signals.

Method: Introduces Personalized GRPO (P-GRPO) that decouples advantage estimation from immediate batch statistics. Instead of normalizing against the concurrent generation group, it normalizes advantages against preference-group-specific reward histories, preserving contrastive signals needed for learning distinct preferences.

Result: P-GRPO consistently achieves faster convergence and higher rewards than standard GRPO across diverse tasks, enhancing ability to recover and align with heterogeneous preference signals.

Conclusion: Accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.

Abstract: Despite their sophisticated general-purpose capabilities, Large Language Models (LLMs) often fail to align with diverse individual preferences because standard post-training methods, like Reinforcement Learning with Human Feedback (RLHF), optimize for a single, global objective. While Group Relative Policy Optimization (GRPO) is a widely adopted on-policy reinforcement learning framework, its group-based normalization implicitly assumes that all samples are exchangeable, inheriting this limitation in personalized settings. This assumption conflates distinct user reward distributions and systematically biases learning toward dominant preferences while suppressing minority signals. To address this, we introduce Personalized GRPO (P-GRPO), a novel alignment framework that decouples advantage estimation from immediate batch statistics. By normalizing advantages against preference-group-specific reward histories rather than the concurrent generation group, P-GRPO preserves the contrastive signal necessary for learning distinct preferences. We evaluate P-GRPO across diverse tasks and find that it consistently achieves faster convergence and higher rewards than standard GRPO, thereby enhancing its ability to recover and align with heterogeneous preference signals. Our results demonstrate that accounting for reward heterogeneity at the optimization level is essential for building models that faithfully align with diverse human preferences without sacrificing general capabilities.

[388] LWM-Temporal: Sparse Spatio-Temporal Attention for Wireless Channel Representation Learning

Sadjad Alikhani, Akshay Malhotra, Shahab Hamidi-Rad, Ahmed Alkhateeb

Main category: cs.LG

TL;DR: LWM-Temporal is a foundation model for wireless channel prediction that learns universal spatiotemporal embeddings using geometry-aware sparse attention and physics-informed pretraining.

Details

Motivation: Wireless channels have complex spatiotemporal dynamics due to mobility, requiring models that can capture these evolution patterns and transfer across tasks with limited data.

Method: Uses Sparse Spatio-Temporal Attention (SSTA) that restricts attention to physically plausible neighborhoods, operates in angle-delay-time domain, and employs physics-informed masking curriculum for self-supervised pretraining.

Result: Shows consistent improvements over baselines in channel prediction across mobility regimes, especially for long horizons and limited fine-tuning data.

Conclusion: Geometry-aware architectures and geometry-consistent pretraining are crucial for learning transferable spatiotemporal representations in wireless systems.

Abstract: LWM-Temporal is a new member of the Large Wireless Models (LWM) family that targets the spatiotemporal nature of wireless channels. Designed as a task-agnostic foundation model, LWM-Temporal learns universal channel embeddings that capture mobility-induced evolution and are reusable across various downstream tasks. To achieve this objective, LWM-Temporal operates in the angle-delay-time domain and introduces Sparse Spatio-Temporal Attention (SSTA), a propagation-aligned attention mechanism that restricts interactions to physically plausible neighborhoods, reducing attention complexity by an order of magnitude while preserving geometry-consistent dependencies. LWM-Temporal is pretrained in a self-supervised manner using a physics-informed masking curriculum that emulates realistic occlusions, pilot sparsity, and measurement impairments. Experimental results on channel prediction across multiple mobility regimes show consistent improvements over strong baselines, particularly under long horizons and limited fine-tuning data, highlighting the importance of geometry-aware architectures and geometry-consistent pretraining for learning transferable spatiotemporal wireless representations.

[389] Gated Adaptation for Continual Learning in Human Activity Recognition

Reza Rahimi Azghan, Gautham Krishna Gudur, Mohit Malu, Edison Thomaz, Giulia Pedrielli, Pavan Turaga, Hassan Ghasemzadeh

Main category: cs.LG

TL;DR: Parameter-efficient continual learning framework using channel-wise gated modulation of frozen pretrained representations for human activity recognition, achieving better stability-plasticity balance with minimal parameter updates.

Details

Motivation: Address catastrophic forgetting in continual learning for wearable IoT applications, particularly domain-incremental human activity recognition where models must adapt to new subjects while maintaining accuracy on previous ones without transmitting sensitive data to the cloud.

Method: Proposes channel-wise gated modulation of frozen pretrained representations, using feature selection rather than feature generation. Learned transformations are restricted to diagonal scaling of existing features to preserve geometry while enabling subject-specific modulation. Theoretical analysis shows gating implements bounded diagonal operators limiting representational drift.

Result: On PAMAP2 with 8 sequential subjects, reduces forgetting from 39.7% to 16.2% and improves final accuracy from 56.7% to 77.7%, while training less than 2% of parameters. Matches or exceeds standard continual learning baselines without replay buffers or task-specific regularization.

Conclusion: Structured diagonal operators are effective and efficient for continual learning under distribution shift, enabling parameter-efficient adaptation while maintaining stability of pretrained representations.

Abstract: Wearable sensors in Internet of Things (IoT) ecosystems increasingly support applications such as remote health monitoring, elderly care, and smart home automation, all of which rely on robust human activity recognition (HAR). Continual learning systems must balance plasticity (learning new tasks) with stability (retaining prior knowledge), yet AI models often exhibit catastrophic forgetting, where learning new tasks degrades performance on earlier ones. This challenge is especially acute in domain-incremental HAR, where on-device models must adapt to new subjects with distinct movement patterns while maintaining accuracy on prior subjects without transmitting sensitive data to the cloud. We propose a parameter-efficient continual learning framework based on channel-wise gated modulation of frozen pretrained representations. Our key insight is that adaptation should operate through feature selection rather than feature generation: by restricting learned transformations to diagonal scaling of existing features, we preserve the geometry of pretrained representations while enabling subject-specific modulation. We provide a theoretical analysis showing that gating implements a bounded diagonal operator that limits representational drift compared to unconstrained linear transformations. Empirically, freezing the backbone substantially reduces forgetting, and lightweight gates restore lost adaptation capacity, achieving stability and plasticity simultaneously. On PAMAP2 with 8 sequential subjects, our approach reduces forgetting from 39.7% to 16.2% and improves final accuracy from 56.7% to 77.7%, while training less than 2% of parameters. Our method matches or exceeds standard continual learning baselines without replay buffers or task-specific regularization, confirming that structured diagonal operators are effective and efficient under distribution shift.

[390] Revisiting Sharpness-Aware Minimization: A More Faithful and Effective Implementation

Jianlong Chen, Zhiming Zhou

Main category: cs.LG

TL;DR: XSAM improves SAM by explicitly estimating the direction toward maximum loss in parameter neighborhood and better leveraging multi-step gradient information

Details

Motivation: SAM's practical implementation uses gradient at ascent point to update current parameters, but lacks intuitive understanding of why this works well. The paper aims to provide better interpretation and address limitations of SAM's approximation quality.

Method: Proposes eXplicit Sharpness-Aware Minimization (XSAM) that explicitly estimates direction toward maximum loss in local neighborhood and crafts search space to better leverage gradient information at multi-step ascent points.

Result: XSAM demonstrates consistent superiority over existing SAM variants with negligible computational overhead.

Conclusion: XSAM provides better theoretical understanding and practical improvement over SAM by addressing approximation inaccuracies and degradation issues in multi-step settings.

Abstract: Sharpness-Aware Minimization (SAM) enhances generalization by minimizing the maximum training loss within a predefined neighborhood around the parameters. However, its practical implementation approximates this as gradient ascent(s) followed by applying the gradient at the ascent point to update the current parameters. This practice can be justified as approximately optimizing the objective by neglecting the (full) derivative of the ascent point with respect to the current parameters. Nevertheless, a direct and intuitive understanding of why using the gradient at the ascent point to update the current parameters works superiorly is still lacking. Our work bridges this gap by proposing a novel and intuitive interpretation. We show that the gradient at the single-step ascent point, \uline{when applied to the current parameters}, provides a better approximation of the direction from the current parameters toward the maximum within the local neighborhood than the local gradient. This improved approximation thereby enables a more direct escape from the maximum within the local neighborhood. Nevertheless, our analysis further reveals two issues. First, the approximation by the gradient at the single-step ascent point is often inaccurate. Second, the approximation quality may degrade as the number of ascent steps increases. To address these limitations, we propose in this paper eXplicit Sharpness-Aware Minimization (XSAM). It tackles the first by explicitly estimating the direction of the maximum during training, while addressing the second by crafting a search space that effectively leverages the gradient information at the multi-step ascent point. XSAM features a unified formulation that applies to both single-step and multi-step settings and only incurs negligible computational overhead. Extensive experiments demonstrate the consistent superiority of XSAM against existing counterparts.

[391] InFusionLayer: a CFA-based ensemble tool to generate new classifiers for learning and modeling

Eric Roginek, Jingyan Xu, D. Frank. Hsu

Main category: cs.LG

TL;DR: InFusionLayer is a Python tool implementing Combinatorial Fusion Analysis (CFA) for ensemble learning, using rank-score characteristic functions and cognitive diversity to combine multiple models for improved multiclass classification performance.

Details

Motivation: There's a lack of general-purpose Python tools that incorporate Combinatorial Fusion Analysis (CFA) techniques for ensemble learning, despite CFA's proven effectiveness in combining multiple scoring systems using rank-score characteristic functions and cognitive diversity.

Method: InFusionLayer is a machine learning architecture inspired by CFA at the system fusion level. It uses a moderate set of base models and incorporates distinctive features of rank-score characteristic (RSC) functions and cognitive diversity (CD) to optimize unsupervised and supervised learning multiclassification problems.

Result: The tool demonstrates ease of use for PyTorch, TensorFlow, and Scikit-learn workflows and shows practical advantages when validated on various computer vision datasets. The code is open-sourced to encourage community development.

Conclusion: InFusionLayer provides a practical implementation of CFA techniques for ensemble learning, paving the way for more sophisticated ensemble applications in machine learning, with demonstrated effectiveness on computer vision tasks.

Abstract: Ensemble learning is a well established body of methods for machine learning to enhance predictive performance by combining multiple algorithms/models. Combinatorial Fusion Analysis (CFA) has provided method and practice for combining multiple scoring systems, using rank-score characteristic (RSC) function and cognitive diversity (CD), including ensemble method and model fusion. However, there is no general-purpose Python tool available that incorporate these techniques. In this paper we introduce \texttt{InFusionLayer}, a machine learning architecture inspired by CFA at the system fusion level that uses a moderate set of base models to optimize unsupervised and supervised learning multiclassification problems. We demonstrate \texttt{InFusionLayer}’s ease of use for PyTorch, TensorFlow, and Scikit-learn workflows by validating its performance on various computer vision datasets. Our results highlight the practical advantages of incorporating distinctive features of RSC function and CD, paving the way for more sophisticated ensemble learning applications in machine learning. We open-sourced our code to encourage continuing development and community accessibility to leverage CFA on github: https://github.com/ewroginek/Infusion

[392] Cluster-Aware Attention-Based Deep Reinforcement Learning for Pickup and Delivery Problems

Wentao Wang, Lifeng Han, Guangyu Zou

Main category: cs.LG

TL;DR: CAADRL is a cluster-aware attention-based deep reinforcement learning framework for solving Pickup and Delivery Problems that explicitly models multi-scale structure through cluster-aware encoding and hierarchical decoding.

Details

Motivation: Existing DRL approaches for PDP either model all nodes on a flat graph (relying on implicit constraint learning) or use inference-time collaborative search (which has high latency). The paper aims to develop a method that explicitly exploits the multi-scale structure of PDP instances for better performance and efficiency.

Method: Proposes CAADRL with cluster-aware encoding using Transformer with global self-attention and intra-cluster attention over depot, pickup, and delivery nodes. Uses Dynamic Dual-Decoder with learnable gate to balance intra-cluster routing and inter-cluster transitions. Trained end-to-end with POMO-style policy gradient using multiple symmetric rollouts.

Result: CAADRL matches or improves upon state-of-the-art baselines on clustered PDP instances and remains highly competitive on uniform instances, especially as problem size increases. Achieves these results with substantially lower inference time than neural collaborative-search baselines.

Conclusion: Explicitly modeling cluster structure provides an effective and efficient inductive bias for neural PDP solvers, offering strong performance with reduced inference latency compared to collaborative-search approaches.

Abstract: The Pickup and Delivery Problem (PDP) is a fundamental and challenging variant of the Vehicle Routing Problem, characterized by tightly coupled pickup–delivery pairs, precedence constraints, and spatial layouts that often exhibit clustering. Existing deep reinforcement learning (DRL) approaches either model all nodes on a flat graph, relying on implicit learning to enforce constraints, or achieve strong performance through inference-time collaborative search at the cost of substantial latency. In this paper, we propose \emph{CAADRL} (Cluster-Aware Attention-based Deep Reinforcement Learning), a DRL framework that explicitly exploits the multi-scale structure of PDP instances via cluster-aware encoding and hierarchical decoding. The encoder builds on a Transformer and combines global self-attention with intra-cluster attention over depot, pickup, and delivery nodes, producing embeddings that are both globally informative and locally role-aware. Based on these embeddings, we introduce a Dynamic Dual-Decoder with a learnable gate that balances intra-cluster routing and inter-cluster transitions at each step. The policy is trained end-to-end with a POMO-style policy gradient scheme using multiple symmetric rollouts per instance. Experiments on synthetic clustered and uniform PDP benchmarks show that CAADRL matches or improves upon strong state-of-the-art baselines on clustered instances and remains highly competitive on uniform instances, particularly as problem size increases. Crucially, our method achieves these results with substantially lower inference time than neural collaborative-search baselines, suggesting that explicitly modeling cluster structure provides an effective and efficient inductive bias for neural PDP solvers.

[393] Inferring Clinically Relevant Molecular Subtypes of Pancreatic Cancer from Routine Histopathology Using Deep Learning

Abdul Rehman Akbar, Alejandro Levya, Ashwini Esnakula, Elshad Hasanov, Anne Noonan, Lingbin Meng, Susan Tsai, Vaibhav Sahai, Midhun Malla, Sarbajit Mukherjee, Upender Manne, Anil Parwani, Wei Chen, Ashish Manne, Muhammad Khalid Khan Niazi

Main category: cs.LG

TL;DR: PanSubNet: An interpretable deep learning framework that predicts pancreatic cancer molecular subtypes (basal-like vs classical) directly from H&E-stained whole slide images, enabling rapid, cost-effective clinical deployment without requiring RNA-seq.

Details

Motivation: Current molecular subtyping of pancreatic ductal adenocarcinoma (PDAC) using RNA-seq is limited by cost, turnaround time, and tissue requirements, restricting clinical application. There's a need for accessible tools that can predict therapy-relevant molecular subtypes from routine histopathology slides.

Method: Developed PanSubNet using 1,055 patients across two cohorts with paired histology and RNA-seq data. Uses dual-scale architecture fusing cellular-level morphology with tissue-level architecture, employing attention mechanisms for multi-scale representation learning and transparent feature attribution. Ground-truth labels derived from validated Moffitt 50-gene signature refined by GATA6 expression.

Result: Achieved mean AUC of 88.5% on internal validation (PANCAN cohort) and 84.0% on external validation (TCGA cohort) without fine-tuning. Model preserved and strengthened prognostic stratification compared to RNA-seq labels, especially in metastatic disease. Predictions aligned with established transcriptomic programs and biological signatures.

Conclusion: PanSubNet offers a clinically deployable, interpretable tool for genetic subtyping from routine H&E slides, enabling rapid, cost-effective molecular stratification for precision oncology in PDAC. Currently being validated in real-world settings for integration into digital pathology workflows.

Abstract: Molecular subtyping of PDAC into basal-like and classical has established prognostic and predictive value. However, its use in clinical practice is limited by cost, turnaround time, and tissue requirements, thereby restricting its application in the management of PDAC. We introduce PanSubNet, an interpretable deep learning framework that predicts therapy-relevant molecular subtypes directly from standard H&E-stained WSIs. PanSubNet was developed using data from 1,055 patients across two multi-institutional cohorts (PANCAN, n=846; TCGA, n=209) with paired histology and RNA-seq data. Ground-truth labels were derived using the validated Moffitt 50-gene signature refined by GATA6 expression. The model employs dual-scale architecture that fuses cellular-level morphology with tissue-level architecture, leveraging attention mechanisms for multi-scale representation learning and transparent feature attribution. On internal validation within PANCAN using five-fold cross-validation, PanSubNet achieved mean AUC of 88.5% with balanced sensitivity and specificity. External validation on the independent TCGA cohort without fine-tuning demonstrated robust generalizability (AUC 84.0%). PanSubNet preserved and, in metastatic disease, strengthened prognostic stratification compared to RNA-seq based labels. Prediction uncertainty linked to intermediate transcriptional states, not classification noise. Model predictions are aligned with established transcriptomic programs, differentiation markers, and DNA damage repair signatures. By enabling rapid, cost-effective molecular stratification from routine H&E-stained slides, PanSubNet offers a clinically deployable and interpretable tool for genetic subtyping. We are gathering data from two institutions to validate and assess real-world performance, supporting integration into digital pathology workflows and advancing precision oncology for PDAC.

[394] Training Language Models via Neural Cellular Automata

Dan Lee, Seungwook Han, Akarsh Kumar, Pulkit Agrawal

Main category: cs.LG

TL;DR: Using neural cellular automata (NCA) to generate synthetic non-linguistic data for pre-pre-training LLMs improves language modeling and reasoning performance compared to natural language pre-training.

Details

Motivation: Natural language pre-training has limitations: finite high-quality text, human biases, and entangled knowledge with reasoning. The paper questions whether natural language is the only path to intelligence and explores synthetic data alternatives.

Method: Proposes using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs. NCA data exhibits rich spatiotemporal structure resembling natural language but is controllable and cheap to generate at scale. The approach involves training on synthetic-then-natural language.

Result: Pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl. Gains transfer to reasoning benchmarks (GSM8K, HumanEval, BigBench-Lite). Attention layers are most transferable, and optimal NCA complexity varies by domain.

Conclusion: Synthetic pre-training with NCA data is effective for LLMs, enabling systematic tuning of synthetic distributions to target domains. Opens path toward more efficient models with fully synthetic pre-training.

Abstract: Pre-training is crucial for large language models (LLMs), as it is when most representations and capabilities are acquired. However, natural language pre-training has problems: high-quality text is finite, it contains human biases, and it entangles knowledge with reasoning. This raises a fundamental question: is natural language the only path to intelligence? We propose using neural cellular automata (NCA) to generate synthetic, non-linguistic data for pre-pre-training LLMs–training on synthetic-then-natural language. NCA data exhibits rich spatiotemporal structure and statistics resembling natural language while being controllable and cheap to generate at scale. We find that pre-pre-training on only 164M NCA tokens improves downstream language modeling by up to 6% and accelerates convergence by up to 1.6x. Surprisingly, this even outperforms pre-pre-training on 1.6B tokens of natural language from Common Crawl with more compute. These gains also transfer to reasoning benchmarks, including GSM8K, HumanEval, and BigBench-Lite. Investigating what drives transfer, we find that attention layers are the most transferable, and that optimal NCA complexity varies by domain: code benefits from simpler dynamics, while math and web text favor more complex ones. These results enable systematic tuning of the synthetic distribution to target domains. More broadly, our work opens a path toward more efficient models with fully synthetic pre-training.

[395] HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

Main category: cs.LG

TL;DR: HTMuon improves Muon optimization by preserving parameter interdependencies while producing heavier-tailed updates and weight spectra, enhancing LLM training performance.

Details

Motivation: Muon shows promise in LLM training but its orthogonalized update rule suppresses heavy-tailed weight spectra and over-emphasizes noise-dominated directions, limiting performance.

Method: Proposes HTMuon based on Heavy-Tailed Self-Regularization theory, which preserves Muon’s ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra.

Result: HTMuon consistently improves performance over state-of-the-art baselines in LLM pretraining and image classification, reducing perplexity by up to 0.98 on LLaMA pretraining compared to Muon.

Conclusion: HTMuon is an effective improvement over Muon that can serve as a plug-in on existing Muon variants, with theoretical grounding in steepest descent under Schatten-q norm constraints.

Abstract: Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon’s orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon’s ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.

[396] Improving Search Agent with One Line of Code

Jian Li, Dongsheng Chen, Zhenhua Xu, Yizhang Jin, Jiafu Wu, Chengjie Wang, Xiaotong Yuan, Yabiao Wang

Main category: cs.LG

TL;DR: SAPO stabilizes tool-based agentic RL training by applying conditional token-level KL constraints to prevent importance sampling distribution drift that causes catastrophic model collapse.

Details

Motivation: Tool-based Agentic Reinforcement Learning (TARL) suffers from training instability due to Importance Sampling Distribution Drift (ISDD), which causes catastrophic model collapse in widely adopted algorithms like GRPO by nullifying gradient updates.

Method: Proposes Search Agent Policy Optimization (SAPO) with conditional token-level KL constraints that selectively penalize KL divergence between current and old policies only for positive tokens with low probabilities where policy has shifted excessively, preventing distribution drift while preserving gradient flow.

Result: Achieves +10.6% absolute improvement (+31.5% relative) over Search-R1 across seven QA benchmarks, with consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).

Conclusion: SAPO effectively stabilizes TARL training with minimal implementation overhead (one-line code modification to standard GRPO), addressing the critical ISDD problem while maintaining performance improvements.

Abstract: Tool-based Agentic Reinforcement Learning (TARL) has emerged as a promising paradigm for training search agents to interact with external tools for a multi-turn information-seeking process autonomously. However, we identify a critical training instability that leads to catastrophic model collapse: Importance Sampling Distribution Drift(ISDD). In Group Relative Policy Optimization(GRPO), a widely adopted TARL algorithm, ISDD manifests as a precipitous decline in the importance sampling ratios, which nullifies gradient updates and triggers irreversible training failure. To address this, we propose \textbf{S}earch \textbf{A}gent \textbf{P}olicy \textbf{O}ptimization (\textbf{SAPO}), which stabilizes training via a conditional token-level KL constraint. Unlike hard clipping, which ignores distributional divergence, SAPO selectively penalizes the KL divergence between the current and old policies. Crucially, this penalty is applied only to positive tokens with low probabilities where the policy has shifted excessively, thereby preventing distribution drift while preserving gradient flow. Remarkably, SAPO requires only one-line code modification to standard GRPO, ensuring immediate deployability. Extensive experiments across seven QA benchmarks demonstrate that SAPO achieves \textbf{+10.6% absolute improvement} (+31.5% relative) over Search-R1, yielding consistent gains across varying model scales (1.5B, 14B) and families (Qwen, LLaMA).

[397] Dissecting Chronos: Sparse Autoencoders Reveal Causal Feature Hierarchies in Time Series Foundation Models

Anurag Mishra

Main category: cs.LG

TL;DR: First application of sparse autoencoders to time series foundation models reveals Chronos-T5-Large uses abrupt-dynamics detection rather than periodic patterns, with most critical features in mid-encoder layers.

Details

Motivation: Time series foundation models are deployed in high-stakes domains but remain opaque; need to understand their internal representations and decision-making processes for trust and reliability.

Method: Applied sparse autoencoders (SAEs) to Chronos-T5-Large (710M parameters) across six layers, conducted 392 single-feature ablation experiments to establish causal relevance of features.

Result: Every ablated feature produced positive CRPS degradation, confirming causal relevance. Found depth-dependent hierarchy: early layers encode low-level frequency features, mid-encoder has critical change-detection features, final encoder has rich but less causal temporal concepts. Most critical features in mid-encoder (max single-feature Delta CRPS = 38.61).

Conclusion: Mechanistic interpretability transfers effectively to TSFMs; Chronos-T5 relies on abrupt-dynamics detection rather than periodic pattern recognition; most critical features are in mid-encoder layers, not final semantically rich layers.

Abstract: Time series foundation models (TSFMs) are increasingly deployed in high-stakes domains, yet their internal representations remain opaque. We present the first application of sparse autoencoders (SAEs) to a TSFM, training TopK SAEs on activations of Chronos-T5-Large (710M parameters) across six layers. Through 392 single-feature ablation experiments, we establish that every ablated feature produces a positive CRPS degradation, confirming causal relevance. Our analysis reveals a depth-dependent hierarchy: early encoder layers encode low-level frequency features, the mid-encoder concentrates causally critical change-detection features, and the final encoder compresses a rich but less causally important taxonomy of temporal concepts. The most critical features reside in the mid-encoder (max single-feature Delta CRPS = 38.61), not in the semantically richest final encoder layer, where progressive ablation paradoxically improves forecast quality. These findings demonstrate that mechanistic interpretability transfers effectively to TSFMs and that Chronos-T5 relies on abrupt-dynamics detection rather than periodic pattern recognition.

[398] Marginals Before Conditionals

Mihir Sahasrabudhe

Main category: cs.LG

TL;DR: The paper studies conditional learning in neural networks using a minimal task with K-fold ambiguity resolved by a selector token, revealing a plateau at log K before sharp transition to full conditional learning.

Details

Motivation: To understand how neural networks learn conditional relationships, specifically isolating conditional learning dynamics through a controlled task with quantifiable ambiguity.

Method: Construct a minimal surjective map task with K-fold ambiguity resolved by selector token z, analyze learning dynamics including plateau behavior, gradient noise effects, and internal selector-routing head development.

Result: Models learn marginal P(A|B) first (plateau at log K), then transition sharply to full conditional; plateau duration depends on dataset size, gradient noise stabilizes marginal solution, selector-routing head develops during plateau.

Conclusion: Conditional learning exhibits distinct phases: marginal learning plateau followed by sharp transition; gradient noise and dataset size control transition dynamics; internal selector mechanisms develop during plateau.

Abstract: We construct a minimal task that isolates conditional learning in neural networks: a surjective map with K-fold ambiguity, resolved by a selector token z, so H(A | B) = log K while H(A | B, z) = 0. The model learns the marginal P(A | B) first, producing a plateau at exactly log K, before acquiring the full conditional in a sharp, collective transition. The plateau has a clean decomposition: height = log K (set by ambiguity), duration = f(D) (set by dataset size D, not K). Gradient noise stabilizes the marginal solution: higher learning rates monotonically slow the transition (3.6* across a 7* η range at fixed throughput), and batch-size reduction delays escape, consistent with an entropic force opposing departure from the low-gradient marginal. Internally, a selector-routing head assembles during the plateau, leading the loss transition by ~50% of the waiting time. This is the Type 2 directional asymmetry of Papadopoulos et al. [2024], measured dynamically: we track the excess risk from log K to zero and characterize what stabilizes it, what triggers its collapse, and how long it takes.

[399] Stochastic Port-Hamiltonian Neural Networks: Universal Approximation with Passivity Guarantees

Luca Di Persio, Matthias Ehrhardt, Youness Outaleb

Main category: cs.LG

TL;DR: SPH-NNs are neural networks that parameterize stochastic port-Hamiltonian systems, enforcing physical constraints for improved long-term stability in modeling noisy dynamical systems.

Details

Motivation: To develop neural networks that can model stochastic dynamical systems while preserving the energy-based structure and stability properties of port-Hamiltonian systems, which is important for long-term rollout accuracy in noisy environments.

Method: Parameterize the Hamiltonian with a feedforward network while enforcing skew symmetry of the interconnection matrix and positive semidefiniteness of the dissipation matrix. Prove weak passivity inequality and universal approximation results for Itô dynamics.

Result: Experiments on noisy mass spring, Duffing, and Van der Pol oscillators show improved long horizon rollouts and reduced energy error compared to standard multilayer perceptron baselines.

Conclusion: SPH-NNs provide a principled approach to learning stochastic dynamical systems that preserves physical structure, leading to better long-term stability and energy conservation in noisy environments.

Abstract: Stochastic port-Hamiltonian systems represent open dynamical systems with dissipation, inputs, and stochastic forcing in an energy based form. We introduce stochastic port-Hamiltonian neural networks, SPH-NNs, which parameterize the Hamiltonian with a feedforward network and enforce skew symmetry of the interconnection matrix and positive semidefiniteness of the dissipation matrix. For Itô dynamics we establish a weak passivity inequality in expectation under an explicit generator condition, stated for a stopped process on a compact set. We also prove a universal approximation result showing that, on any compact set and finite horizon, SPH-NNs approximate the coefficients of a target stochastic port-Hamiltonian system with $C^2$ accuracy of the Hamiltonian and yield coupled solutions that remain close in mean square up to the exit time. Experiments on noisy mass spring, Duffing, and Van der Pol oscillators show improved long horizon rollouts and reduced energy error relative to a multilayer perceptron baseline.

[400] Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

Benjamin Gess, Daniel Heydecker

Main category: cs.LG

TL;DR: Theoretical analysis of SGD training dynamics in shallow neural networks, focusing on the catapult phase and NTK-flattening spikes with explicit probability predictions.

Details

Motivation: To provide a quantitative theory of the catapult phase in SGD training of shallow networks, explaining when and why NTK-flattening spikes occur during training, which has implications for understanding training dynamics and generalization.

Method: Theoretical analysis of SGD training in the NTK (Neural Tangent Kernel) scaling regime for shallow fully connected networks. Derives explicit mathematical criteria based on a function G that depends on kernel, learning rate, and data to predict spike behavior.

Result: Identifies explicit criterion separating two behaviors: when G > 0, SGD produces large NTK-flattening spikes with high probability; when G < 0, spike probability decays like (n/η)^{-ϑ/2} for explicitly characterized ϑ ∈ (0,∞). Provides parameter-dependent explanation for spike observation at practical widths.

Conclusion: The analysis yields a concrete theoretical framework for understanding the catapult phase in SGD training, with explicit predictions about when NTK-flattening spikes occur, bridging theory and practical observations in neural network training dynamics.

Abstract: We analyse SGD training of a shallow, fully connected network in the NTK scaling and provide a quantitative theory of the catapult phase. We identify an explicit criterion separating two behaviours: When an explicit function $G$, depending only on the kernel, learning rate $η$ and data, is positive, SGD produces large NTK-flattening spikes with high probability; when $G<0$, their probability decays like $(n/η)^{-\vartheta/2}$, for an explicitly characterised $\vartheta\in (0,\infty)$. This yields a concrete parameter-dependent explanation for why such spikes may still be observed at practical widths.

[401] Digging Deeper: Learning Multi-Level Concept Hierarchies

Oscar Hill, Mateo Espinosa Zarlenga, Mateja Jamnik

Main category: cs.LG

TL;DR: MLCS discovers multi-level concept hierarchies from top-level supervision only, and Deep-HiCEMs represent these hierarchies enabling interventions at multiple abstraction levels, improving interpretability and task performance.

Details

Motivation: Current concept-based models for interpretability rely on exhaustive annotations and treat concepts as flat and independent, while existing hierarchical approaches (HiCEMs and Concept Splitting) are limited to shallow hierarchies.

Method: Proposes Multi-Level Concept Splitting (MLCS) to discover multi-level concept hierarchies using only top-level supervision, and Deep-HiCEMs architecture to represent these discovered hierarchies and enable interventions at multiple abstraction levels.

Result: MLCS discovers human-interpretable concepts absent during training, and Deep-HiCEMs maintain high accuracy while supporting test-time concept interventions that can improve task performance across multiple datasets.

Conclusion: The approach enables discovery of multi-level concept hierarchies from minimal supervision and provides interpretable models that support interventions at different abstraction levels, advancing concept-based interpretability.

Abstract: Although concept-based models promise interpretability by explaining predictions with human-understandable concepts, they typically rely on exhaustive annotations and treat concepts as flat and independent. To circumvent this, recent work has introduced Hierarchical Concept Embedding Models (HiCEMs) to explicitly model concept relationships, and Concept Splitting to discover sub-concepts using only coarse annotations. However, both HiCEMs and Concept Splitting are restricted to shallow hierarchies. We overcome this limitation with Multi-Level Concept Splitting (MLCS), which discovers multi-level concept hierarchies from only top-level supervision, and Deep-HiCEMs, an architecture that represents these discovered hierarchies and enables interventions at multiple levels of abstraction. Experiments across multiple datasets show that MLCS discovers human-interpretable concepts absent during training and that Deep-HiCEMs maintain high accuracy while supporting test-time concept interventions that can improve task performance.

[402] Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

Zhengzhao Ma, Xueru Wen, Boxi Cao, Yaojie Lu, Hongyu Lin, Jinglin Yang, Min He, Xianpei Han, Le Sun

Main category: cs.LG

TL;DR: DCPO framework decouples reasoning and calibration objectives in RLVR to address over-confidence issues in LLMs while maintaining accuracy.

Details

Motivation: RLVR enhances LLM reasoning but causes calibration degeneration where models become over-confident in incorrect answers. Existing approaches directly incorporate calibration objectives but suffer from gradient conflicts between accuracy and calibration optimization.

Method: Proposes DCPO (Decoupled Calibration Policy Optimization), a framework that systematically decouples reasoning and calibration objectives based on theoretical analysis showing fundamental gradient conflicts between maximizing policy accuracy and minimizing calibration error.

Result: Extensive experiments show DCPO preserves accuracy on par with GRPO while achieving best calibration performance and substantially mitigating over-confidence issues in LLMs.

Conclusion: DCPO provides valuable insights and practical solution for more reliable LLM deployment by addressing calibration degeneration in RLVR-enhanced models.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) significantly enhances large language models (LLMs) reasoning but severely suffers from calibration degeneration, where models become excessively over-confident in incorrect answers. Previous studies devote to directly incorporating calibration objective into existing optimization target. However, our theoretical analysis demonstrates that there exists a fundamental gradient conflict between the optimization for maximizing policy accuracy and minimizing calibration error. Building on this insight, we propose DCPO, a simple yet effective framework that systematically decouples reasoning and calibration objectives. Extensive experiments demonstrate that our DCPO not only preserves accuracy on par with GRPO but also achieves the best calibration performance and substantially mitigates the over-confidence issue. Our study provides valuable insights and practical solution for more reliable LLM deployment.

[403] ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma

Main category: cs.LG

TL;DR: ES-dLLM: A training-free inference acceleration framework for diffusion LLMs that skips tokens in early layers based on importance scores computed from intermediate tensor variation and confidence scores.

Details

Motivation: Diffusion LLMs offer advantages over autoregressive models but suffer from expensive inference since full input context is processed at every iteration. The authors aim to accelerate dLLM inference while preserving quality.

Method: Analyze dLLM generation dynamics, find intermediate representations change subtly across iterations. Propose ES-dLLM that skips tokens in early layers based on importance scores computed from intermediate tensor variation and previous iteration confidence scores.

Result: Achieves up to 226.57 TPS on LLaDA-8B and 308.51 TPS on Dream-7B on H200 GPU, delivering 5.6-16.8× speedup over vanilla implementation and up to 1.85× over state-of-the-art caching method while preserving quality.

Conclusion: ES-dLLM effectively accelerates dLLM inference by exploiting the observation that intermediate representations change subtly across iterations, enabling token skipping in early layers without quality degradation.

Abstract: Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM inference remains computationally expensive as the full input context is processed at every iteration. In this work, we analyze the generation dynamics of dLLMs and find that intermediate representations, including key, value, and hidden states, change only subtly across successive iterations. Leveraging this insight, we propose \textbf{ES-dLLM}, a training-free inference acceleration framework for dLLM that reduces computation by skipping tokens in early layers based on the estimated importance. Token importance is computed with intermediate tensor variation and confidence scores of previous iterations. Experiments on LLaDA-8B and Dream-7B demonstrate that ES-dLLM achieves throughput of up to 226.57 and 308.51 tokens per second (TPS), respectively, on an NVIDIA H200 GPU, delivering 5.6$\times$ to 16.8$\times$ speedup over the vanilla implementation and up to 1.85$\times$ over the state-of-the-art caching method, while preserving generation quality.

[404] A Survey of Weight Space Learning: Understanding, Representation, and Generation

Xiaolong Han, Zehong Wang, Bo Zhao, Binchi Zhang, Jundong Li, Damian Borth, Rose Yu, Haggai Maron, Yanfang Ye, Lu Yin, Ferrante Neri

Main category: cs.LG

TL;DR: Survey paper on Weight Space Learning (WSL) - treating neural network weights as a structured domain for analysis, representation, and generation rather than just training outputs.

Details

Motivation: Traditional deep learning focuses on data, features, and architectures, treating weights as final training outputs. However, recent research reveals that weight space contains rich structure - pretrained models form organized distributions, exhibit symmetries, and can be embedded or generated. Understanding weight space structure enables better model analysis, comparison, and knowledge transfer.

Method: Proposes unified taxonomy of Weight Space Learning with three core dimensions: 1) Weight Space Understanding (WSU) - studies geometry and symmetries of weights, 2) Weight Space Representation (WSR) - learns embeddings over model weights, 3) Weight Space Generation (WSG) - synthesizes new weights via hypernetworks or generative models.

Result: Consolidates fragmented research under coherent framework, showing how WSL enables practical applications including model retrieval, continual/federated learning, neural architecture search, and data-free reconstruction. Provides accompanying resource repository.

Conclusion: Weight space is a learnable, structured domain with growing impact across model analysis, transferring, and weight generation. This survey establishes foundation for emerging WSL research direction.

Abstract: Neural network weights are typically viewed as the end product of training, while most deep learning research focuses on data, features, and architectures. However, recent advances show that the set of all possible weight values (weight space) itself contains rich structure: pretrained models form organized distributions, exhibit symmetries, and can be embedded, compared, or even generated. Understanding such structures has tremendous impact on how neural networks are analyzed and compared, and on how knowledge is transferred across models, beyond individual training instances. This emerging research direction, which we refer to as Weight Space Learning (WSL), treats neural weights as a meaningful domain for analysis and modeling. This survey provides the first unified taxonomy of WSL. We categorize existing methods into three core dimensions: Weight Space Understanding (WSU), which studies the geometry and symmetries of weights; Weight Space Representation (WSR), which learns embeddings over model weights; and Weight Space Generation (WSG), which synthesizes new weights through hypernetworks or generative models. We further show how these developments enable practical applications, including model retrieval, continual and federated learning, neural architecture search, and data-free reconstruction. By consolidating fragmented progress under a coherent framework, this survey highlights weight space as a learnable, structured domain with growing impact across model analysis, transferring, and weight generation. We release an accompanying resource at https://github.com/Zehong-Wang/Awesome-Weight-Space-Learning.

[405] Equivariant Asynchronous Diffusion: An Adaptive Denoising Schedule for Accelerated Molecular Conformation Generation

Junyi An, Chao Qu, Yun-Fei Shi, Zhijian Zhou, Fenglei Cao, Yuan Qi

Main category: cs.LG

TL;DR: EAD is a novel equivariant asynchronous diffusion model for 3D molecular generation that combines molecule-level horizon with hierarchical structure modeling through adaptive asynchronous denoising.

Details

Motivation: Existing 3D molecular generation methods have limitations: auto-regressive models suffer from short horizon and train-inference discrepancy, while synchronous diffusion models fail to capture hierarchical molecular structure relationships.

Method: Proposes Equivariant Asynchronous Diffusion (EAD) with asynchronous denoising schedule to capture molecular hierarchy while maintaining molecule-level horizon, plus dynamic scheduling to adaptively determine denoising timesteps.

Result: EAD achieves state-of-the-art performance in 3D molecular generation, outperforming existing methods.

Conclusion: EAD successfully combines strengths of auto-regressive and diffusion approaches for improved 3D molecular generation by better modeling hierarchical structures.

Abstract: Recent 3D molecular generation methods primarily use asynchronous auto-regressive or synchronous diffusion models. While auto-regressive models build molecules sequentially, they’re limited by a short horizon and a discrepancy between training and inference. Conversely, synchronous diffusion models denoise all atoms at once, offering a molecule-level horizon but failing to capture the causal relationships inherent in hierarchical molecular structures. We introduce Equivariant Asynchronous Diffusion (EAD) to overcome these limitations. EAD is a novel diffusion model that combines the strengths of both approaches: it uses an asynchronous denoising schedule to better capture molecular hierarchy while maintaining a molecule-level horizon. Since these relationships are often complex, we propose a dynamic scheduling mechanism to adaptively determine the denoising timestep. Experimental results show that EAD achieves state-of-the-art performance in 3D molecular generation.

[406] Rethinking Adam for Time Series Forecasting: A Simple Heuristic to Improve Optimization under Distribution Shifts

Yuze Dong, Jinsong Wu

Main category: cs.LG

TL;DR: TS_Adam: A lightweight Adam variant for time-series forecasting that removes second-order bias correction to better handle non-stationary data with distributional drift.

Details

Motivation: Standard adaptive optimizers like Adam are designed for stationary objectives but struggle with non-stationary time-series data where distributional drift occurs. The second-order bias correction in Adam limits responsiveness to shifting loss landscapes in forecasting tasks.

Method: Proposes TS_Adam, a simple modification of Adam that removes the second-order correction from the learning rate computation while preserving the optimizer’s core structure. This lightweight variant requires no additional hyperparameters and can be easily integrated into existing models.

Result: TS_Adam consistently improves performance across long- and short-term forecasting tasks. On ETT datasets with MICN model, it achieves average reductions of 12.8% in MSE and 5.7% in MAE compared to standard Adam.

Conclusion: TS_Adam is a practical and versatile optimization strategy for real-world forecasting scenarios involving non-stationary data, offering improved adaptability to distributional drift through a simple yet effective modification of the Adam optimizer.

Abstract: Time-series forecasting often faces challenges from non-stationarity, particularly distributional drift, where the data distribution evolves over time. This dynamic behavior can undermine the effectiveness of adaptive optimizers, such as Adam, which are typically designed for stationary objectives. In this paper, we revisit Adam in the context of non-stationary forecasting and identify that its second-order bias correction limits responsiveness to shifting loss landscapes. To address this, we propose TS_Adam, a lightweight variant that removes the second-order correction from the learning rate computation. This simple modification improves adaptability to distributional drift while preserving the optimizer core structure and requiring no additional hyperparameters. TS_Adam integrates easily into existing models and consistently improves performance across long- and short-term forecasting tasks. On the ETT datasets with the MICN model, it achieves an average reduction of 12.8% in MSE and 5.7% in MAE compared to Adam. These results underscore the practicality and versatility of TS_Adam as an effective optimization strategy for real-world forecasting scenarios involving non-stationary data. Code is available at: https://github.com/DD-459-1/TS_Adam.

[407] Denoising the US Census: Succinct Block Hierarchical Regression

Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Adam Sealfon

Main category: cs.LG

TL;DR: BlueDown: A new post-processing method for census data that improves accuracy while maintaining privacy guarantees and structural constraints, using hierarchical regression and optimization techniques.

Details

Motivation: The US Census Bureau's Disclosure Avoidance System (DAS) needs to balance confidentiality and utility for census data used in critical applications like legislative apportionment and funding allocation. The existing TopDown method has limitations in accuracy, especially at county and tract levels.

Method: Developed BlueDown with hierarchical generalized least-squares regression that leverages measurement structure, reducing computation from matrix multiplication to linear time. Combined with optimization routine extending TDA to support correlated measurements, using succinct linear-algebraic operations exploiting symmetries.

Result: BlueDown produces more accurate and consistent estimates than TopDown, with especially large accuracy improvements for aggregates at county and tract levels on Census Bureau evaluation metrics.

Conclusion: BlueDown represents a significant improvement over existing census data processing methods, offering better accuracy while maintaining privacy and structural constraints, with hierarchical regression and succinct operations being independently valuable techniques.

Abstract: The US Census Bureau Disclosure Avoidance System (DAS) balances confidentiality and utility requirements for the decennial US Census (Abowd et al., 2022). The DAS was used in the 2020 Census to produce demographic datasets critically used for legislative apportionment and redistricting, federal and state funding allocation, municipal and infrastructure planning, and scientific research. At the heart of DAS is TopDown, a heuristic post-processing method that combines billions of private noisy measurements across six geographic levels in order to produce new estimates that are consistent, more accurate, and satisfy certain structural constraints on the data. In this work, we introduce BlueDown, a new post-processing method that produces more accurate, consistent estimates while satisfying the same privacy guarantees and structural constraints. We obtain especially large accuracy improvements for aggregates at the county and tract levels on evaluation metrics proposed by the US Census Bureau. From a technical perspective, we develop a new algorithm for generalized least-squares regression that leverages the hierarchical structure of the measurements and that is statistically optimal among linear unbiased estimators. This reduces the computational dependence on the number of geographic regions measured from matrix multiplication time, which would be infeasible for census-scale data, to linear time. We incorporate the additional structural constraints by combining this regression algorithm with an optimization routine that extends TDA to support correlated measurements. We further improve the efficiency of our algorithm using succinct linear-algebraic operations that exploit symmetries in the structure of the measurements and constraints. We believe our hierarchical regression and succinct operations to be of independent interest.

[408] Hardware Efficient Approximate Convolution with Tunable Error Tolerance for CNNs

Vishal Shashidhar, Anupam Kumari, Roy P Paily

Main category: cs.LG

TL;DR: Hardware-efficient soft sparsity using MSB proxy to skip negligible non-zero multiplications in CNNs, reducing MAC operations by 74-88% with zero accuracy loss.

Details

Motivation: Traditional hard sparsity (skipping mathematical zeros) loses effectiveness in deep CNN layers or with smooth activations like Tanh, limiting edge deployment due to high computational demands.

Method: Proposes “soft sparsity” paradigm using Most Significant Bit (MSB) proxy to identify and skip negligible non-zero multiplications, implemented as custom RISC-V instruction for hardware efficiency.

Result: Reduces ReLU MACs by 88.42% and Tanh MACs by 74.87% with zero accuracy loss on LeNet-5 (MNIST), outperforming zero-skipping by 5x. Estimated power savings of 35.2% for ReLU and 29.96% for Tanh.

Conclusion: Soft sparsity with MSB proxy significantly optimizes CNN inference for resource-constrained edge devices, though memory access makes power reduction sub-linear to operation savings.

Abstract: Modern CNNs’ high computational demands hinder edge deployment, as traditional hard'' sparsity (skipping mathematical zeros) loses effectiveness in deep layers or with smooth activations like Tanh. We propose a soft sparsity’’ paradigm using a hardware efficient Most Significant Bit (MSB) proxy to skip negligible non-zero multiplications. Integrated as a custom RISC-V instruction and evaluated on LeNet-5 (MNIST), this method reduces ReLU MACs by 88.42% and Tanh MACs by 74.87% with zero accuracy loss–outperforming zero-skipping by 5x. By clock-gating inactive multipliers, we estimate power savings of 35.2% for ReLU and 29.96% for Tanh. While memory access makes power reduction sub-linear to operation savings, this approach significantly optimizes resource-constrained inference.

[409] CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Sijia Cui, Pengyu Cheng, Jiajun Song, Yongbo Gai, Guojun Zhang, Zhechao Yu, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang

Main category: cs.LG

TL;DR: CLIPO enhances RLVR by adding contrastive learning to address process-wrong but outcome-correct reasoning issues, improving generalization and reducing hallucinations in LLM reasoning.

Details

Motivation: RLVR improves LLM reasoning but only uses final answers as rewards, ignoring intermediate step correctness. This leads to training on process-wrong but outcome-correct rollouts, causing hallucination and answer-copying problems that undermine model generalization and robustness.

Method: Incorporates contrastive learning into policy optimization (CLIPO) by optimizing a contrastive loss over successful rollouts. This helps the LLM capture invariant structures shared across correct reasoning paths, providing robust cross-trajectory regularization instead of single-path supervision.

Result: CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs.

Conclusion: CLIPO effectively mitigates step-level reasoning inconsistencies and suppresses hallucinatory artifacts in LLM reasoning through contrastive learning regularization, advancing RLVR methods.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capacity of Large Language Models (LLMs). However, RLVR solely relies on final answers as outcome rewards, neglecting the correctness of intermediate reasoning steps. Training on these process-wrong but outcome-correct rollouts can lead to hallucination and answer-copying, severely undermining the model’s generalization and robustness. To address this, we incorporate a Contrastive Learning mechanism into the Policy Optimization (CLIPO) to generalize the RLVR process. By optimizing a contrastive loss over successful rollouts, CLIPO steers the LLM to capture the invariant structure shared across correct reasoning paths. This provides a more robust cross-trajectory regularization than the original single-path supervision in RLVR, effectively mitigating step-level reasoning inconsistencies and suppressing hallucinatory artifacts. In experiments, CLIPO consistently improves multiple RLVR baselines across diverse reasoning benchmarks, demonstrating uniform improvements in generalization and robustness for policy optimization of LLMs. Our code and training recipes are available at https://github.com/Qwen-Applications/CLIPO.

[410] Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias

Borun D Chowdhury

Main category: cs.LG

TL;DR: The U-shaped “Lost in the Middle” phenomenon in LLMs is an inherent geometric property of causal decoder architectures with residual connections, present even at initialization before any training or positional encoding.

Details

Motivation: To understand the fundamental cause of the "Lost in the Middle" phenomenon where LLMs retrieve well from the beginning and end of context but fail in the middle, which has been widely attributed to learned Softmax artifacts or positional encoding properties like RoPE.

Method: Model multi-layer causal attention as iterated powers of the Cesàro matrix, derive exact closed-form influence density in the continuous limit, and empirically validate on untrained Qwen2 and GPT-2 architectures at initialization (Step 0).

Result: Causal masking creates logarithmic divergence at prompt start (Primacy Tail), residual connections create isolated anchor at final token (Recency Delta), and middle context has factorial dead zone of order O(1/(H-1)!). The U-shape is identical with or without RoPE and persists through standard pretraining.

Conclusion: The U-shaped performance curve is an inherent architectural property of causal decoders with residual connections, not a learned artifact. This establishes a baseline for future interventions to overcome the middle-context retrieval problem.

Abstract: The ``Lost in the Middle’’ phenomenon – a U-shaped performance curve where LLMs retrieve well from the beginning and end of a context but fail in the middle – is widely attributed to learned Softmax artifacts or the distance-decay of positional encodings like RoPE. This paper makes a single, precise claim: \emph{the U-shape is already present at initialization, before any training or positional encoding takes effect.} It is an inherent geometric property of the causal decoder with residual connections. We model multi-layer causal attention as iterated powers of the Cesàro matrix and derive the exact closed-form influence density in the continuous limit. Causal masking forces a logarithmic divergence of gradient influence at the start of the prompt (the Primacy Tail), while residual connections create an isolated $\mathcal{O}(1)$ anchor at the final token (the Recency Delta). Between these extremes lies a factorial dead zone of order $\mathcal{O}(1/(H{-}1)!)$, where $H$ is the network depth, making middle-context retrieval and training structurally hostile. We validate empirically that untrained Qwen2 and GPT-2 architectures exhibit this U-shape at Step~0, and that it is identical with or without RoPE. Comparing initialized and pretrained networks, we show that standard training does not overcome the topological valley, confirming that the U-shape persists as an architectural baseline under standard pretraining objectives. We do not claim that this bias is insurmountable, nor that interventions such as RoPE modifications are useless. We establish what the baseline is and where it comes from, so that future efforts to overcome it can be precisely targeted.

[411] A neural operator for predicting vibration frequency response curves from limited data

D. Bluedorn, A. Badawy, B. E. Saunders, D. Roettgen, A. Abdelkefi

Main category: cs.LG

TL;DR: Neural operator integrated with implicit numerical scheme learns state-space dynamics from limited data to predict frequency response curves for vibration testing, achieving 99.87% accuracy on linear SDOF system.

Details

Motivation: Accelerate design iteration and make vibration testing workflows more efficient by using machine learning for numerical vibration analysis, overcoming challenges of conventional ML methods that require physics-based regularizing loss functions.

Method: Neural operator integrated with implicit numerical scheme that learns underlying state-space dynamics from limited data, enabling generalization to untested driving frequencies and initial conditions without physics-based regularizing terms.

Result: 99.87% accuracy in predicting Frequency Response Curve (FRC), forecasting frequency and amplitude of linear resonance while training on only 7% of the bandwidth of the solution for a linear single-degree-of-freedom system.

Conclusion: Training ML models to internalize physics information rather than trajectory enables better generalization accuracy and vastly improves timeframe for vibration studies on engineered components.

Abstract: In the design of engineered components, rigorous vibration testing is essential for performance validation and identification of resonant frequencies and amplitudes encountered during operation. Performing this evaluation numerically via machine learning has great potential to accelerate design iteration and make testing workflows more efficient. However, dynamical systems are conventionally difficult to solve via machine learning methods without using physics-based regularizing loss functions. To properly perform this forecasting task, a structure that has an inspectable physical obedience can be devised without the use of regularizing terms from first principles. The method employed in this work is a neural operator integrated with an implicit numerical scheme. This architecture enables operators to learn of the underlying state-space dynamics from limited data, allowing generalization to untested driving frequencies and initial conditions. This network can infer the system’s global frequency response by training on a small set of input conditions. As a foundational proof of concept, this investigation verifies the machine learning algorithm with a linear, single-degree-of-freedom system, demonstrating implicit obedience of dynamics. This approach demonstrates 99.87% accuracy in predicting the Frequency Response Curve (FRC), forecasting the frequency and amplitude of linear resonance training on 7% of the bandwidth of the solution. By training machine learning models to internalize physics information rather than trajectory, better generalization accuracy can be realized, vastly improving the timeframe for vibration studies on engineered components.

[412] Mashup Learning: Faster Finetuning by Remixing Past Checkpoints

Sofia Maria Lo Cicero Vaina, Artem Chumachenko, Max Ryabinin

Main category: cs.LG

TL;DR: Mashup Learning improves LLM adaptation by aggregating relevant historical checkpoints as initialization for new tasks, boosting accuracy and accelerating convergence.

Details

Motivation: Existing LLM fine-tuning produces many specialized checkpoints that are rarely reused despite containing valuable learned abilities for potentially similar tasks, representing wasted computational resources and knowledge.

Method: Proposes Mashup Learning: 1) identifies most relevant historical checkpoints for target dataset, 2) aggregates them via model merging, 3) uses merged model as improved initialization for training on new task.

Result: Across 8 LLM benchmarks, 4 models, and 2 checkpoint collections: improves average downstream accuracy by 0.5-5 percentage points over training from scratch; accelerates convergence by 41-46% fewer training steps; reduces wall-clock time by up to 37% including selection/merging overhead.

Conclusion: Mashup Learning effectively leverages existing training artifacts to enhance model adaptation, demonstrating that historical checkpoints contain valuable transferable knowledge that can be systematically reused for improved efficiency and performance.

Abstract: Finetuning on domain-specific data is a well-established method for enhancing LLM performance on downstream tasks. Training on each dataset produces a new set of model weights, resulting in a multitude of checkpoints saved in-house or on open-source platforms. However, these training artifacts are rarely reused for subsequent experiments despite containing improved model abilities for potentially similar tasks. In this paper, we propose Mashup Learning, a simple method to leverage the outputs of prior training runs to enhance model adaptation to new tasks. Our procedure identifies the most relevant historical checkpoints for a target dataset, aggregates them with model merging, and uses the result as an improved initialization for training. Across 8 standard LLM benchmarks, four models, and two collections of source checkpoints, Mashup Learning consistently improves average downstream accuracy by 0.5-5 percentage points over training from scratch. It also accelerates convergence, requiring 41-46% fewer training steps and up to 37% less total wall-clock time to match from-scratch accuracy, including all selection and merging overhead.

[413] ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

Ruizhong Qiu, Hanqing Zeng, Yinglong Xia, Yiwen Meng, Ren Chen, Jiarui Feng, Dongqi Fu, Qifan Wang, Jiayi Liu, Jun Xiao, Xiangjun Fan, Benyu Zhang, Hong Li, Zhining Liu, Hyunsik Yoo, Zhichen Zeng, Tianxin Wei, Hanghang Tong

Main category: cs.LG

TL;DR: ReMix introduces a reinforcement learning-based router for Mixture-of-LoRAs that uses non-learnable routing weights to prevent LoRA dominance, enabling more effective use of multiple specialized adapters through unbiased gradient estimation.

Details

Motivation: Existing Mixture-of-LoRAs routers suffer from imbalanced routing weights where only one or two LoRAs dominate, limiting the expressive power of the model despite having multiple specialized adapters.

Method: Proposes Reinforcement Routing for Mixture-of-LoRAs (ReMix) with non-learnable routing weights to ensure all active LoRAs are equally effective. Uses reinforce leave-one-out (RLOO) technique as an unbiased gradient estimator, treating supervision loss as reward and router as policy in reinforcement learning.

Result: Extensive experiments show ReMix significantly outperforms state-of-the-art parameter-efficient finetuning methods under comparable numbers of activated parameters.

Conclusion: ReMix addresses the critical limitation of imbalanced routing in Mixture-of-LoRAs models through reinforcement learning-based routing, enabling more effective utilization of multiple specialized adapters for improved performance.

Abstract: Low-rank adapters (LoRAs) are a parameter-efficient finetuning technique that injects trainable low-rank matrices into pretrained models to adapt them to new tasks. Mixture-of-LoRAs models expand neural networks efficiently by routing each layer input to a small subset of specialized LoRAs of the layer. Existing Mixture-of-LoRAs routers assign a learned routing weight to each LoRA to enable end-to-end training of the router. Despite their empirical promise, we observe that the routing weights are typically extremely imbalanced across LoRAs in practice, where only one or two LoRAs often dominate the routing weights. This essentially limits the number of effective LoRAs and thus severely hinders the expressive power of existing Mixture-of-LoRAs models. In this work, we attribute this weakness to the nature of learnable routing weights and rethink the fundamental design of the router. To address this critical issue, we propose a new router designed that we call Reinforcement Routing for Mixture-of-LoRAs (ReMix). Our key idea is using non-learnable routing weights to ensure all active LoRAs to be equally effective, with no LoRA dominating the routing weights. However, our routers cannot be trained directly via gradient descent due to our non-learnable routing weights. Hence, we further propose an unbiased gradient estimator for the router by employing the reinforce leave-one-out (RLOO) technique, where we regard the supervision loss as the reward and the router as the policy in reinforcement learning. Our gradient estimator also enables to scale up training compute to boost the predictive performance of our ReMix. Extensive experiments demonstrate that our proposed ReMix significantly outperform state-of-the-art parameter-efficient finetuning methods under a comparable number of activated parameters.

[414] DT-BEHRT: Disease Trajectory-aware Transformer for Interpretable Patient Representation Learning

Deyi Li, Zijun Yao, Qi Xu, Muxuan Liang, Lingyao Li, Zijian Xu, Mei Liu

Main category: cs.LG

TL;DR: DT-BEHRT is a graph-enhanced transformer model for EHR data that disentangles disease trajectories by modeling diagnosis interactions within organ systems and capturing asynchronous progression patterns, with tailored pre-training using trajectory-level masking and ontology-informed prediction.

Details

Motivation: Existing EHR predictive models overlook the heterogeneous roles of medical codes arising from distinct clinical characteristics and contexts. Current approaches fail to properly model disease trajectories and their asynchronous progression patterns within organ systems.

Method: Proposes DT-BEHRT, a graph-enhanced sequential architecture that disentangles disease trajectories by explicitly modeling diagnosis-centric interactions within organ systems and capturing asynchronous progression patterns. Includes tailored pre-training with trajectory-level code masking and ontology-informed ancestor prediction for semantic alignment.

Result: Extensive experiments on multiple benchmark datasets demonstrate strong predictive performance and interpretable patient representations that align with clinicians’ disease-centered reasoning.

Conclusion: DT-BEHRT effectively captures complex disease trajectories in EHR data through its graph-enhanced transformer architecture and specialized pre-training, providing both strong predictive performance and clinically interpretable representations.

Abstract: The growing adoption of electronic health record (EHR) systems has provided unprecedented opportunities for predictive modeling to guide clinical decision making. Structured EHRs contain longitudinal observations of patients across hospital visits, where each visit is represented by a set of medical codes. While sequence-based, graph-based, and graph-enhanced sequence approaches have been developed to capture rich code interactions over time or within the same visits, they often overlook the inherent heterogeneous roles of medical codes arising from distinct clinical characteristics and contexts. To this end, in this study we propose the Disease Trajectory-aware Transformer for EHR (DT-BEHRT), a graph-enhanced sequential architecture that disentangles disease trajectories by explicitly modeling diagnosis-centric interactions within organ systems and capturing asynchronous progression patterns. To further enhance the representation robustness, we design a tailored pre-training methodology that combines trajectory-level code masking with ontology-informed ancestor prediction, promoting semantic alignment across multiple modeling modules. Extensive experiments on multiple benchmark datasets demonstrate that DT-BEHRT achieves strong predictive performance and provides interpretable patient representations that align with clinicians’ disease-centered reasoning. The source code is publicly accessible at https://github.com/GatorAIM/DT-BEHRT.git.

[415] Actor-Accelerated Policy Dual Averaging for Reinforcement Learning in Continuous Action Spaces

Ji Gao, Caleb Ju, Guanghui Lan, Zhaohui Tong

Main category: cs.LG

TL;DR: Actor-accelerated Policy Dual Averaging improves computational efficiency for continuous action spaces by using a learned policy network to approximate optimization sub-problems while maintaining convergence guarantees.

Details

Motivation: Policy Dual Averaging (PDA) provides strong theoretical convergence guarantees for reinforcement learning but becomes computationally expensive in continuous state-action spaces due to optimization sub-problems at each decision step, limiting practical deployment.

Method: Proposes actor-accelerated PDA which uses a learned policy network to approximate solutions to the optimization sub-problems, reducing runtime while preserving convergence properties. Includes theoretical analysis of how actor approximation error affects convergence.

Result: Achieves superior performance compared to popular on-policy baselines like Proximal Policy Optimization (PPO) on robotics, control, and operations research benchmarks. Bridges gap between theoretical advantages of PDA and practical deployment.

Conclusion: Actor-accelerated PDA enables efficient application of Policy Dual Averaging in continuous-action problems with function approximation, making theoretically sound methods practically deployable while maintaining performance advantages over existing approaches.

Abstract: Policy Dual Averaging (PDA) offers a principled Policy Mirror Descent (PMD) framework that more naturally admits value function approximation than standard PMD, enabling the use of approximate advantage (or Q-) functions while retaining strong convergence guarantees. However, applying PDA in continuous state and action spaces remains computationally challenging, since action selection involves solving an optimization sub-problem at each decision step. In this paper, we propose \textit{actor-accelerated PDA}, which uses a learned policy network to approximate the solution of the optimization sub-problems, yielding faster runtimes while maintaining convergence guarantees. We provide a theoretical analysis that quantifies how actor approximation error impacts the convergence of PDA under suitable assumptions. We then evaluate its performance on several benchmarks in robotics, control, and operations research problems. Actor-accelerated PDA achieves superior performance compared to popular on-policy baselines such as Proximal Policy Optimization (PPO). Overall, our results bridge the gap between the theoretical advantages of PDA and its practical deployment in continuous-action problems with function approximation.

[416] Rethinking the Harmonic Loss via Non-Euclidean Distance Layers

Maxwell Miller-Golub, Kamil Faber, Marcin Pietron, Panpan Zheng, Pasquale Minervini, Roberto Corizzo

Main category: cs.LG

TL;DR: The paper extends harmonic loss by exploring various distance metrics beyond Euclidean distance, evaluating them on vision and language models for performance, interpretability, and sustainability trade-offs.

Details

Motivation: Cross-entropy loss has limitations in interpretability, unbounded weight growth, and training inefficiencies. While harmonic loss addresses some issues, previous work only explored Euclidean distance, leaving other distance metrics unexamined and lacking systematic evaluation of computational efficiency and sustainability.

Method: Systematically investigate a broad spectrum of distance metrics as replacements for Euclidean distance in harmonic loss. Evaluate distance-tailored harmonic losses on vision backbones and large language models using a three-way evaluation framework: model performance, interpretability, and sustainability.

Result: On vision tasks, cosine distances provide the best trade-off, improving accuracy while lowering carbon emissions. Bray-Curtis and Mahalanobis distances further enhance interpretability at varying efficiency costs. On language models, cosine-based harmonic losses improve gradient/learning stability, strengthen representation structure, and reduce emissions compared to cross-entropy and Euclidean heads.

Conclusion: Different distance metrics in harmonic loss offer distinct trade-offs across performance, interpretability, and sustainability dimensions. Cosine distance emerges as particularly effective for both vision and language tasks, providing balanced improvements across all evaluation criteria.

Abstract: Cross-entropy loss has long been the standard choice for training deep neural networks, yet it suffers from interpretability limitations, unbounded weight growth, and inefficiencies that can contribute to costly training dynamics. The harmonic loss is a distance-based alternative grounded in Euclidean geometry that improves interpretability and mitigates phenomena such as grokking, or delayed generalization on the test set. However, the study of harmonic loss remains narrow: only Euclidean distance is explored, and no systematic evaluation of computational efficiency or sustainability was conducted. We extend harmonic loss by systematically investigating a broad spectrum of distance metrics as replacements for the Euclidean distance. We comprehensively evaluate distance-tailored harmonic losses on both vision backbones and large language models. Our analysis is framed around a three-way evaluation of model performance, interpretability, and sustainability. On vision tasks, cosine distances provide the most favorable trade-off, consistently improving accuracy while lowering carbon emissions, whereas Bray-Curtis and Mahalanobis further enhance interpretability at varying efficiency costs. On language models, cosine-based harmonic losses improve gradient and learning stability, strengthen representation structure, and reduce emissions relative to cross-entropy and Euclidean heads. Our code is available at: https://anonymous.4open.science/r/rethinking-harmonic-loss-5BAB/.

[417] SiMPO: Measure Matching for Online Diffusion Reinforcement Learning

Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai

Main category: cs.LG

TL;DR: SiMPO is a diffusion RL framework that generalizes reweighting schemes using signed measures and monotonic functions, enabling better use of negative samples and avoiding over-greedy policies.

Details

Motivation: Current diffusion RL algorithms using softmax reweighting over behavior policies tend to be over-greedy and fail to effectively leverage feedback from negative samples, limiting their performance.

Method: Two-stage measure matching: 1) Construct virtual target policy via f-divergence regularized optimization allowing signed target measures, 2) Use signed measure to guide diffusion/flow models through reweighted matching with arbitrary monotonic weighting functions.

Result: SiMPO achieves superior performance by leveraging flexible weighting schemes, provides geometric interpretations showing how negative reweighting repels policies from suboptimal actions, and offers practical guidelines for selecting reweighting methods based on reward landscapes.

Conclusion: SiMPO provides a unified framework for diffusion RL that generalizes reweighting schemes, enables principled negative reweighting, and offers both theoretical justification and practical guidance for improved policy optimization.

Abstract: A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over the behavior policy, which usually induces an over-greedy policy and fails to leverage feedback from negative samples. In this work, we introduce Signed Measure Policy Optimization (SiMPO), a simple and unified framework that generalizes reweighting scheme in diffusion RL with general monotonic functions. SiMPO revisits diffusion RL via a two-stage measure matching lens. First, we construct a virtual target policy by $f$-divergence regularized policy optimization, where we can relax the non-negativity constraint to allow for a signed target measure. Second, we use this signed measure to guide diffusion or flow models through reweighted matching. This formulation offers two key advantages: a) it generalizes to arbitrary monotonically increasing weighting functions; and b) it provides a principled justification and practical guidance for negative reweighting. Furthermore, we provide geometric interpretations to illustrate how negative reweighting actively repels the policy from suboptimal actions. Extensive empirical evaluations demonstrate that SiMPO achieves superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods tailored to the reward landscape.

[418] Improving TabPFN’s Synthetic Data Generation by Integrating Causal Structure

Davide Tugnoli, Andrea De Lorenzo, Marco Virgolin, Giovanni Cinà

Main category: cs.LG

TL;DR: Causal-aware TabPFN improves synthetic tabular data generation by incorporating causal structure into autoregressive generation to reduce spurious correlations.

Details

Motivation: TabPFN generates synthetic tabular data autoregressively, but when feature order conflicts with causal structure, it produces spurious correlations that impair data quality and causal effect preservation.

Method: Two approaches: 1) DAG-aware conditioning samples each variable given its causal parents, and 2) CPDAG-based strategy for scenarios with partial causal knowledge.

Result: DAG-aware conditioning improves quality and stability of synthetic data relative to vanilla TabPFN across most settings. CPDAG-based strategy shows moderate improvements depending on number of oriented edges.

Conclusion: Injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data by reducing spurious correlations and preserving causal effects.

Abstract: Synthetic tabular data generation addresses data scarcity and privacy constraints in a variety of domains. Tabular Prior-Data Fitted Network (TabPFN), a recent foundation model for tabular data, has been shown capable of generating high-quality synthetic tabular data. However, TabPFN is autoregressive: features are generated sequentially by conditioning on the previous ones, depending on the order in which they appear in the input data. We demonstrate that when the feature order conflicts with causal structure, the model produces spurious correlations that impair its ability to generate synthetic data and preserve causal effects. We address this limitation by integrating causal structure into TabPFN’s generation process through two complementary approaches: Directed Acyclic Graph (DAG)-aware conditioning, which samples each variable given its causal parents, and a Completed Partially Directed Acyclic Graph (CPDAG)-based strategy for scenarios with partial causal knowledge. We evaluate these approaches on controlled benchmarks and six CSuite datasets, assessing structural fidelity, distributional alignment, privacy preservation, and Average Treatment Effect (ATE) preservation. Across most settings, DAG-aware conditioning improves the quality and stability of synthetic data relative to vanilla TabPFN. The CPDAG-based strategy shows moderate improvements, with effectiveness depending on the number of oriented edges. These results indicate that injecting causal structure into autoregressive generation enhances the reliability of synthetic tabular data.

[419] Discovery of a Hematopoietic Manifold in scGPT Yields a Method for Extracting Performant Algorithms from Biological Foundation Model Internals

Ihor Kendiukhov

Main category: cs.LG

TL;DR: Researchers extracted a compact hematopoietic algorithm from scGPT foundation model using mechanistic interpretability, achieving state-of-the-art performance on cell type classification and pseudotime ordering tasks.

Details

Motivation: To demonstrate that foundation models like scGPT encode biologically meaningful algorithms that can be extracted via mechanistic interpretability, providing standalone, efficient algorithms without retraining.

Method: Three-stage extraction: 1) direct operator export from frozen attention weights, 2) lightweight learned adaptor, 3) task-specific readout. Validated on external datasets with donor-holdout benchmarks against multiple baselines.

Result: Extracted algorithm outperformed scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines on pseudotime-depth ordering and subtype classification (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). 34.5x faster with 1000x fewer parameters than standard probing.

Conclusion: Foundation models encode compact, biologically meaningful algorithms that can be extracted via mechanistic interpretability, yielding efficient standalone algorithms with strong performance on biological tasks.

Abstract: We report the discovery and extraction of a compact hematopoietic algorithm from the single-cell foundation model scGPT, to our knowledge the first biologically useful, competitive algorithm extracted from a foundation model via mechanistic interpretability. We show that scGPT internally encodes a compact hematopoietic manifold with significant developmental branch structure, validated on a strict non-overlap Tabula Sapiens external panel and confirmed via frozen-head zero-shot transfer to an independent multi-donor immune panel. To isolate this geometry, we introduce a general three-stage extraction method consisting of direct operator export from frozen attention weights, a lightweight learned adaptor, and a task-specific readout, producing a standalone algorithm without target-dataset retraining. In 88-split donor-holdout benchmarks against scVI, Palantir, DPT, CellTypist, PCA, and raw-expression baselines, the extracted algorithm achieves the strongest pseudotime-depth ordering and leads on key subtype endpoints (CD4/CD8 AUROC 0.867, mono/macro AUROC 0.951). Compared to standard probing of frozen scGPT embeddings with a 3-layer MLP, the extracted head is BH-significantly better on 6/8 classification endpoints while completing a full 12-split evaluation campaign 34.5x faster with approximately 1000x fewer trainable parameters. The exported operator compresses from three pooled attention heads to a single head without statistically significant loss, and further to a rank-64 surrogate. Mechanistic interpretability of the compact operator reveals a concentrated four-factor core explaining 66.2% of ablation impact, with factors resolving into explicit T/lymphoid, B/plasma, granulocytic, and monocyte/macrophage gene programs. A supplementary second-manifold validation (intercellular communication geometry) confirms that the extraction method generalizes beyond hematopoiesis.

[420] Estimating condition number with Graph Neural Networks

Erin Carson, Xinye Chen

Main category: cs.LG

TL;DR: A fast method for estimating sparse matrix condition numbers using graph neural networks with O(nnz + n) complexity, achieving significant speedup over traditional methods.

Details

Motivation: Traditional methods for estimating matrix condition numbers (like Hager-Higham and Lanczos) are computationally expensive, especially for large sparse matrices. There's a need for faster alternatives that can efficiently handle sparse matrix structures.

Method: Proposes using graph neural networks (GNNs) to estimate condition numbers of sparse matrices. Develops efficient feature engineering for GNNs with O(nnz + n) complexity, where nnz is number of non-zero elements and n is matrix dimension. Introduces two prediction schemes for condition number estimation.

Result: Extensive experiments show the method achieves significant speedup over Hager-Higham and Lanczos methods for both 1-norm and 2-norm condition number estimation.

Conclusion: GNN-based approach provides an efficient alternative to traditional condition number estimation methods for sparse matrices, with substantial computational advantages.

Abstract: In this paper, we propose a fast method for estimating the condition number of sparse matrices using graph neural networks (GNNs). To enable efficient training and inference of GNNs, our proposed feature engineering for GNNs achieves $\mathrm{O}(\mathrm{nnz} + n)$, where $\mathrm{nnz}$ is the number of non-zero elements in the matrix and $n$ denotes the matrix dimension. We propose two prediction schemes for estimating the matrix condition number using GNNs. The extensive experiments for the two schemes are conducted for 1-norm and 2-norm condition number estimation, which show that our method achieves a significant speedup over the Hager-Higham and Lanczos methods.

[421] Robust Post-Training for Generative Recommenders: Why Exponential Reward-Weighted SFT Outperforms RLHF

Keertana Chidambaram, Sanath Kumar Krishnamurthy, Qiuling Xu, Ko-Jen Hsiao, Moumita Bhattacharya

Main category: cs.LG

TL;DR: Exponential reward-weighted supervised fine-tuning (SFT) for aligning generative recommender systems to user preferences, offering a robust offline alternative to RLHF that avoids reward hacking and doesn’t require propensity scores.

Details

Motivation: Existing post-training methods for aligning generative recommender systems (RLHF, offline RL) are ill-suited for production-scale systems due to reward hacking from noisy user feedback, unreliable reward models, need for propensity scores, or infeasibility of online interaction.

Method: Proposes exponential reward-weighted SFT with weights w = exp(r/λ), which directly optimizes on observed rewards without querying a learned reward model. The temperature parameter λ explicitly controls the robustness-improvement tradeoff.

Result: Theoretical guarantees show policy improvement under noisy rewards with gap scaling only logarithmically with catalog size. Experiments on three open-source and one proprietary dataset against four baselines confirm the method consistently outperforms RLHF-based alternatives.

Conclusion: Exponential reward weighting is simple, scalable, and provides a theoretically grounded, interpretable regularization hyperparameter that makes it uniquely suited for production-scale generative recommender systems.

Abstract: Aligning generative recommender systems to user preferences via post-training is critical for closing the gap between next-item prediction and actual recommendation quality. Existing post-training methods are ill-suited for production-scale systems: RLHF methods reward hack due to noisy user feedback and unreliable reward models, offline RL alternatives require propensity scores that are unavailable, and online interaction is infeasible. We identify exponential reward-weighted SFT with weights $w = \exp(r/λ)$ as uniquely suited to this setting, and provide the theoretical and empirical foundations that explain why. By optimizing directly on observed rewards without querying a learned reward model, the method is immune to reward hacking, requires no propensity scores, and is fully offline. We prove the first policy improvement guarantees for this setting under noisy rewards, showing that the gap scales only logarithmically with catalog size and remains informative even for large item catalogs. Crucially, we show that temperature $λ$ explicitly and quantifiably controls the robustness-improvement tradeoff, providing practitioners with a single interpretable regularization hyperparameter with theoretical grounding. Experiments on three open-source and one proprietary dataset against four baselines confirm that exponential reward weighting is simple, scalable, and consistently outperforms RLHF-based alternatives.

[422] Taming Score-Based Denoisers in ADMM: A Convergent Plug-and-Play Framework

Rajesh Shrestha, Xiao Fu

Main category: cs.LG

TL;DR: ADMM-PnP with AC-DC denoiser: A plug-and-play ADMM framework that integrates score-based generative models for inverse problems via a three-stage denoiser with convergence guarantees.

Details

Motivation: Score-based generative models are powerful priors for inverse problems, but integrating them into optimization algorithms like ADMM is challenging due to: 1) mismatch between noisy data manifolds used for training and ADMM iterate geometry, and 2) lack of convergence understanding when ADMM uses score-based denoisers.

Method: Proposes ADMM plug-and-play (ADMM-PnP) with AC-DC denoiser: a three-stage denoiser embedded into ADMM: (1) auto-correction via additive Gaussian noise, (2) directional correction using conditional Langevin dynamics, and (3) score-based denoising. Provides convergence analysis showing weak nonexpansiveness and bounded denoiser properties.

Result: Experiments on various inverse problems demonstrate consistent improvement in solution quality over multiple baselines. Convergence guarantees established under proper parameter settings.

Conclusion: The AC-DC denoiser framework successfully integrates score-based priors into ADMM optimization with theoretical convergence guarantees and practical performance improvements for inverse problems.

Abstract: While score-based generative models have emerged as powerful priors for solving inverse problems, directly integrating them into optimization algorithms such as ADMM remains nontrivial. Two central challenges arise: i) the mismatch between the noisy data manifolds used to train the score functions and the geometry of ADMM iterates, especially due to the influence of dual variables, and ii) the lack of convergence understanding when ADMM is equipped with score-based denoisers. To address the manifold mismatch issue, we propose ADMM plug-and-play (ADMM-PnP) with the AC-DC denoiser, a new framework that embeds a three-stage denoiser into ADMM: (1) auto-correction (AC) via additive Gaussian noise, (2) directional correction (DC) using conditional Langevin dynamics, and (3) score-based denoising. In terms of convergence, we establish two results: first, under proper denoiser parameters, each ADMM iteration is a weakly nonexpansive operator, ensuring high-probability fixed-point $\textit{ball convergence}$ using a constant step size; second, under more relaxed conditions, the AC-DC denoiser is a bounded denoiser, which leads to convergence under an adaptive step size schedule. Experiments on a range of inverse problems demonstrate that our method consistently improves solution quality over a variety of baselines.

[423] GSVD for Geometry-Grounded Dataset Comparison: An Alignment Angle Is All You Need

Eduarda de Souza Marques, Arthur Sobrinho Ferreira da Rocha, Joao Paixao, Heudson Mirandola, Daniel Sadoc Menasche

Main category: cs.LG

TL;DR: A geometric approach using Generalized Singular Value Decomposition (GSVD) to compare datasets via linear relations, deriving interpretable angle scores for per-sample analysis.

Details

Motivation: To develop geometry-grounded learning that respects problem domain structure by comparing datasets through linear relations rather than treating observations as arbitrary vectors.

Method: Uses Generalized Singular Value Decomposition (GSVD) as a joint coordinate system for two subspaces, deriving an interpretable angle score θ(z) ∈ [0, π/2] that quantifies whether a sample z is explained more by dataset A, B, or comparably by both.

Result: Demonstrates the angle score’s behavior on MNIST through angle distributions and representative GSVD directions, and presents a binary classifier derived from θ(z) as an illustrative application.

Conclusion: The GSVD-based angle score serves as an interpretable per-sample geometric diagnostic tool for comparing datasets and understanding their structural relationships.

Abstract: Geometry-grounded learning asks models to respect structure in the problem domain rather than treating observations as arbitrary vectors. Motivated by this view, we revisit a classical but underused primitive for comparing datasets: linear relations between two data matrices, expressed via the co-span constraint $Ax = By = z$ in a shared ambient space. To operationalize this comparison, we use the generalized singular value decomposition (GSVD) as a joint coordinate system for two subspaces. In particular, we exploit the GSVD form $A = HCU$, $B = HSV$ with $C^{\top}C + S^{\top}S = I$, which separates shared versus dataset-specific directions through the diagonal structure of $(C, S)$. From these factors we derive an interpretable angle score $θ(z) \in [0, π/2]$ for a sample $z$, quantifying whether z is explained relatively more by $A$, more by $B$, or comparably by both. The primary role of $θ(z)$ is as a per-sample geometric diagnostic. We illustrate the behavior of the score on MNIST through angle distributions and representative GSVD directions. A binary classifier derived from $θ(z)$ is presented as an illustrative application of the score as an interpretable diagnostic tool.

[424] Copula-ResLogit: A Deep-Copula Framework for Unobserved Confounding Effects

Kimia Kamal, Bilal Farooq

Main category: cs.LG

TL;DR: A novel deep learning framework called Copula-ResLogit combines copula models with ResNet architectures to detect and mitigate unobserved confounding in travel demand analysis, applied to pedestrian stress-wait time relationships and travel mode-distance dependencies.

Details

Motivation: Addressing the challenge of unobserved factors in travel demand analysis that create non-causal dependencies and obscure true causal effects, requiring methods to detect and mitigate hidden confounding.

Method: Develops Copula-ResLogit, a fully interpretable joint modeling framework integrating copula models (for dependence capturing) with Residual Neural Network (ResNet) architectures. Uses copula functions to detect unobserved confounding, then deep learning components to mitigate hidden associations.

Result: Applied to two case studies: (1) stress levels vs. wait time of pedestrians crossing mid-block in VR, and (2) travel mode choice vs. travel distance in London travel behavior data. The framework substantially reduces or eliminates dependencies, demonstrating residual layers’ ability to account for hidden confounding effects.

Conclusion: Copula-ResLogit effectively addresses unobserved confounding in travel demand analysis by combining copula models with deep learning, providing a powerful tool for causal inference in transportation research.

Abstract: A key challenge in travel demand analysis is the presence of unobserved factors that may generate non-causal dependencies, obscuring the true causal effects. To address the issue, the study introduces a novel deep learning based fully interpretable joint modelling framework, Copula-ResLogit, which integrates the flexibility of Residual Neural Network (ResNet) architectures with the dependence capturing capabilities of copula models. This hybrid structure enables us to first detect unobserved confounding through traditional copula function based joint modelling and then mitigate these hidden associations by incorporating deep learning components. The study applies this framework to two case studies, including the relationship between stress levels and wait time of pedestrians when crossing mid block in VR and the dependencies between travel mode choice and travel distance in London travel behaviour data. Results show that Copula-ResLogit substantially reduces or eliminates the dependencies, demonstrating the ability of residual layers to account for hidden confounding effects.

[425] GaLoRA: Parameter-Efficient Graph-Aware LLMs for Node Classification

Mayur Choudhary, Saptarshi Sengupta, Katerina Potika

Main category: cs.LG

TL;DR: GaLoRA is a parameter-efficient framework that integrates graph structural information into LLMs for text-attributed graphs, achieving competitive node classification performance with only 0.24% of full LLM fine-tuning parameters.

Details

Motivation: Text-attributed graphs (TAGs) combine structural graph information with textual node content, appearing in domains like social networks and citation graphs. Current approaches need to effectively learn both structural and textual representations to improve decision-making in these domains.

Method: GaLoRA integrates structural information into LLMs using a parameter-efficient framework. It combines graph neural networks with LLMs while requiring minimal additional parameters compared to full fine-tuning.

Result: GaLoRA demonstrates competitive performance on node classification tasks with TAGs, performing on par with state-of-the-art models while using only 0.24% of the parameter count required by full LLM fine-tuning. Experiments on three real-world datasets showcase its effectiveness.

Conclusion: GaLoRA provides an efficient framework for combining structural and semantic information in text-attributed graphs, enabling effective node classification with minimal parameter overhead compared to full LLM fine-tuning.

Abstract: The rapid rise of large language models (LLMs) and their ability to capture semantic relationships has led to their adoption in a wide range of applications. Text-attributed graphs (TAGs) are a notable example where LLMs can be combined with Graph Neural Networks to improve the performance of node classification. In TAGs, each node is associated with textual content and such graphs are commonly seen in various domains such as social networks, citation graphs, recommendation systems, etc. Effectively learning from TAGs would enable better representations of both structural and textual representations of the graph and improve decision-making in relevant domains. We present GaLoRA, a parameter-efficient framework that integrates structural information into LLMs. GaLoRA demonstrates competitive performance on node classification tasks with TAGs, performing on par with state-of-the-art models with just 0.24% of the parameter count required by full LLM fine-tuning. We experiment with three real-world datasets to showcase GaLoRA’s effectiveness in combining structural and semantical information on TAGs.

[426] Regime-aware financial volatility forecasting via in-context learning

Saba Asaad, Shayan Mohajer Hamidi, Ali Bereyhi

Main category: cs.LG

TL;DR: LLM-based regime-aware in-context learning framework for financial volatility forecasting that adapts to nonstationary market conditions without parameter fine-tuning.

Details

Motivation: Financial markets exhibit nonstationary behavior with changing volatility regimes, making accurate forecasting challenging. Traditional methods struggle to adapt to regime shifts, while LLMs offer potential for contextual reasoning over historical patterns.

Method: Uses pretrained LLMs with in-context learning, oracle-guided refinement to construct regime-aware demonstrations, and conditional sampling based on estimated market labels to adapt predictions to different volatility regimes without fine-tuning.

Result: Outperforms classical volatility forecasting approaches and direct one-shot learning across multiple financial datasets, particularly during high-volatility periods.

Conclusion: Regime-aware in-context learning with LLMs provides effective adaptation to changing market conditions for volatility forecasting through contextual reasoning alone.

Abstract: This work introduces a regime-aware in-context learning framework that leverages large language models (LLMs) for financial volatility forecasting under nonstationary market conditions. The proposed approach deploys pretrained LLMs to reason over historical volatility patterns and adjust their predictions without parameter fine-tuning. We develop an oracle-guided refinement procedure that constructs regime-aware demonstrations from training data. An LLM is then deployed as an in-context learner that predicts the next-step volatility from the input sequence using demonstrations sampled conditional to the estimated market label. This conditional sampling strategy enables the LLM to adapt its predictions to regime-dependent volatility dynamics through contextual reasoning alone. Experiments with multiple financial datasets show that the proposed regime-aware in-context learning framework outperforms both classical volatility forecasting approaches and direct one-shot learning, especially during high-volatility periods.

[427] What do near-optimal learning rate schedules look like?

Hiroki Naganuma, Atish Agarwala, Priya Kasimbeg, George E. Dahl

Main category: cs.LG

TL;DR: The paper presents a systematic search procedure to find optimal learning rate schedule shapes for neural network training, showing that commonly used schedules are suboptimal and that weight decay significantly affects optimal shape selection.

Details

Motivation: There's no consensus on what constitutes the best learning rate schedule shape for neural network training beyond basic warmup and decay patterns, despite schedule choice being critical for training success.

Method: Developed a search procedure to find optimal schedule shapes within parameterized families, factoring out schedule shape from base learning rate. Applied to various schedule families on linear regression, CIFAR-10 image classification, and Wikitext103 language modeling tasks.

Result: The search procedure generally found near-optimal schedules, confirming warmup and decay as robust features. Found commonly used schedule families are suboptimal, and discovered weight decay strongly influences optimal schedule shape.

Conclusion: Provides comprehensive results on near-optimal schedule shapes, demonstrating systematic search can improve training efficiency and revealing important interactions between schedule shape and other hyperparameters like weight decay.

Abstract: A basic unanswered question in neural network training is: what is the best learning rate schedule shape for a given workload? The choice of learning rate schedule is a key factor in the success or failure of the training process, but beyond having some kind of warmup and decay, there is no consensus on what makes a good schedule shape. To answer this question, we designed a search procedure to find the best shapes within a parameterized schedule family. Our approach factors out the schedule shape from the base learning rate, which otherwise would dominate cross-schedule comparisons. We applied our search procedure to a variety of schedule families on three workloads: linear regression, image classification on CIFAR-10, and small-scale language modeling on Wikitext103. We showed that our search procedure indeed generally found near-optimal schedules. We found that warmup and decay are robust features of good schedules, and that commonly used schedule families are not optimal on these workloads. Finally, we explored how the outputs of our shape search depend on other optimization hyperparameters, and found that weight decay can have a strong effect on the optimal schedule shape. To the best of our knowledge, our results represent the most comprehensive results on near-optimal schedule shapes for deep neural network training, to date.

[428] How to make the most of your masked language model for protein engineering

Calvin McCarter, Nick Bhattacharya, Sebastian W. Ober, Hunter Elliott

Main category: cs.LG

TL;DR: Proposes stochastic beam search sampling for protein language models to optimize biological properties, with extensive evaluation showing sampling method choice is as impactful as model choice.

Details

Motivation: While many protein language models exist, there's little work on optimal sampling methods to optimize desired biological properties. The paper aims to address this gap for antibody engineering applications.

Method: Proposes stochastic beam search sampling that exploits masked language models’ efficiency at evaluating pseudo-perplexity of entire 1-edit neighborhoods. Reframes generation as whole-sequence evaluation for flexible multi-objective optimization.

Result: Extensive in vitro evaluation for antibody engineering shows that choice of sampling method is at least as impactful as the model used, highlighting the importance of sampling methodology.

Conclusion: Sampling methodology is crucial for protein language models and deserves more research attention. The proposed stochastic beam search enables effective optimization of biological properties.

Abstract: A plethora of protein language models have been released in recent years. Yet comparatively little work has addressed how to best sample from them to optimize desired biological properties. We fill this gap by proposing a flexible, effective sampling method for masked language models (MLMs), and by systematically evaluating models and methods both in silico and in vitro on actual antibody therapeutics campaigns. Firstly, we propose sampling with stochastic beam search, exploiting the fact that MLMs are remarkably efficient at evaluating the pseudo-perplexity of the entire 1-edit neighborhood of a sequence. Reframing generation in terms of entire-sequence evaluation enables flexible guidance with multiple optimization objectives. Secondly, we report results from our extensive in vitro head-to-head evaluation for the antibody engineering setting. This reveals that choice of sampling method is at least as impactful as the model used, motivating future research into this under-explored area.

[429] Data-Driven Integration Kernels for Interpretable Nonlocal Operator Learning

Savannah L. Ferretti, Jerry Lin, Sara Shamekh, Jane W. Baldwin, Michael S. Pritchard, Tom Beucler

Main category: cs.LG

TL;DR: A framework for interpretable nonlocal operator learning in climate modeling using data-driven integration kernels that separate nonlocal information aggregation from local nonlinear prediction.

Details

Motivation: Machine learning models for climate processes often combine nonlocal information across space and time in nonlinear ways, which improves prediction but makes relationships difficult to interpret and prone to overfitting as nonlocal information grows.

Method: Introduces data-driven integration kernels that first integrate spatiotemporal predictor fields using learnable kernels (continuous weighting functions over horizontal space, height, and/or time), then apply local nonlinear mapping only to the resulting kernel-integrated features and optional local inputs.

Result: Kernel-based models achieve near-baseline performance with far fewer trainable parameters, demonstrating that relevant nonlocal information can be captured through a small set of interpretable integrations when appropriate structural constraints are imposed.

Conclusion: The framework enables interpretable nonlocal operator learning by separating information aggregation from nonlinear prediction, making kernels directly interpretable as weighting patterns that reveal which spatial locations and past timesteps contribute most to predictions.

Abstract: Machine learning models can represent climate processes that are nonlocal in horizontal space, height, and time, often by combining information across these dimensions in highly nonlinear ways. While this can improve predictive skill, it makes learned relationships difficult to interpret and prone to overfitting as the extent of nonlocal information grows. We address this challenge by introducing data-driven integration kernels, a framework that adds structure to nonlocal operator learning by explicitly separating nonlocal information aggregation from local nonlinear prediction. Each spatiotemporal predictor field is first integrated using learnable kernels (defined as continuous weighting functions over horizontal space, height, and/or time), after which a local nonlinear mapping is applied only to the resulting kernel-integrated features and any optional local inputs. This design confines nonlinear interactions to a small set of integrated features and makes each kernel directly interpretable as a weighting pattern that reveals which horizontal locations, vertical levels, and past timesteps contribute most to the prediction. We demonstrate the framework for South Asian monsoon precipitation using a hierarchy of neural network models with increasing structure, including baseline, nonparametric kernel, and parametric kernel models. Across this hierarchy, kernel-based models achieve near-baseline performance with far fewer trainable parameters, showing that much of the relevant nonlocal information can be captured through a small set of interpretable integrations when appropriate structural constraints are imposed.

[430] Federated Active Learning Under Extreme Non-IID and Global Class Imbalance

Chen-Chen Zong, Sheng-Jun Huang

Main category: cs.LG

TL;DR: FairFAL is an adaptive class-fair federated active learning framework that addresses annotation cost reduction under privacy constraints, particularly effective in challenging long-tailed and non-IID settings.

Details

Motivation: Federated active learning (FAL) aims to reduce annotation costs while preserving privacy, but existing methods degrade in realistic settings with severe global class imbalance and highly heterogeneous clients. The authors identify that current approaches lack effective query-model selection strategies for these challenging conditions.

Method: FairFAL uses three key components: (1) lightweight prediction discrepancy to infer global imbalance and local-global divergence for adaptive selection between global and local query models; (2) prototype-guided pseudo-labeling using global features to promote class-aware querying; and (3) two-stage uncertainty-diversity balanced sampling with k-center refinement.

Result: Experiments on five benchmarks show that FairFAL consistently outperforms state-of-the-art approaches under challenging long-tailed and non-IID settings, demonstrating superior performance in realistic federated learning scenarios.

Conclusion: The paper presents FairFAL as an effective solution for federated active learning in realistic settings with class imbalance and client heterogeneity, showing that class-balanced sampling and adaptive query-model selection are crucial for performance.

Abstract: Federated active learning (FAL) seeks to reduce annotation cost under privacy constraints, yet its effectiveness degrades in realistic settings with severe global class imbalance and highly heterogeneous clients. We conduct a systematic study of query-model selection in FAL and uncover a central insight: the model that achieves more class-balanced sampling, especially for minority classes, consistently leads to better final performance. Moreover, global-model querying is beneficial only when the global distribution is highly imbalanced and client data are relatively homogeneous; otherwise, the local model is preferable. Based on these findings, we propose FairFAL, an adaptive class-fair FAL framework. FairFAL (1) infers global imbalance and local-global divergence via lightweight prediction discrepancy, enabling adaptive selection between global and local query models; (2) performs prototype-guided pseudo-labeling using global features to promote class-aware querying; and (3) applies a two-stage uncertainty-diversity balanced sampling strategy with k-center refinement. Experiments on five benchmarks show that FairFAL consistently outperforms state-of-the-art approaches under challenging long-tailed and non-IID settings. The code is available at https://github.com/chenchenzong/FairFAL.

[431] Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning

Md Muntaqim Meherab, Noor Islam S. Mohammad, Faiza Feroz

Main category: cs.LG

TL;DR: Causal Concept Graphs (CCG) combine sparse autoencoders with causal structure learning to model concept interactions in language models, outperforming baselines on reasoning tasks.

Details

Motivation: While sparse autoencoders can identify where concepts live in language models, they don't capture how concepts interact during multi-step reasoning. The authors aim to model these causal dependencies between interpretable concepts.

Method: Propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features. Combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery. Introduce Causal Fidelity Score (CFS) to evaluate graph-guided interventions.

Result: On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, CCG achieves CFS=5.654±0.625, outperforming ROME-style tracing (3.382±0.233), SAE-only ranking (2.479±0.196), and random baseline (1.032±0.034). Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.

Conclusion: CCG successfully models causal dependencies between interpretable concepts in language models, providing a framework for understanding multi-step reasoning beyond simple concept localization.

Abstract: Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ($n{=}15$ paired runs), CCG achieves $\CFS=5.654\pm0.625$, outperforming ROME-style tracing ($3.382\pm0.233$), SAE-only ranking ($2.479\pm0.196$), and a random baseline ($1.032\pm0.034$), with $p<0.0001$ after Bonferroni correction. Learned graphs are sparse (5-6% edge density), domain-specific, and stable across seeds.

[432] Optimal Expert-Attention Allocation in Mixture-of-Experts: A Scalable Law for Dynamic Model Design

Junzhuo Li, Peijie Jiang, Changxin Tian, Jia Liu, Zhiqiang Zhang, Xuming Hu

Main category: cs.LG

TL;DR: Extends neural scaling laws to Mixture-of-Experts models, deriving optimal compute allocation between expert and attention layers via power-law relationships.

Details

Motivation: MoE models scale capacity efficiently but lack guidelines for optimal compute allocation between expert and attention sub-layers, which is critical for performance under fixed compute budgets.

Method: Define ratio r as FLOPs fraction for expert vs attention layers, conduct extensive experiments with GPT-style MoE Transformers to empirically find optimal ratio r* as function of compute budget and sparsity.

Result: Optimal ratio r* follows power-law relationship with total compute and varies with sparsity; derive explicit formula for r* and generalize Chinchilla scaling law to incorporate architectural parameter.

Conclusion: Provides practical framework for designing efficient MoE models with precise control over expert-attention compute allocation, optimizing performance under fixed compute constraints.

Abstract: This paper presents a novel extension of neural scaling laws to Mixture-of-Experts (MoE) models, focusing on the optimal allocation of compute between expert and attention sub-layers. As MoE architectures have emerged as an efficient method for scaling model capacity without proportionally increasing computation, determining the optimal expert-attention compute ratio becomes critical. We define the ratio $r$ as the fraction of total FLOPs per token dedicated to the expert layers versus the attention layers, and explore how this ratio interacts with the overall compute budget and model sparsity. Through extensive experiments with GPT-style MoE Transformers, we empirically find that the optimal ratio $r^$ follows a power-law relationship with total compute and varies with sparsity. Our analysis leads to an explicit formula for $r^$, enabling precise control over the expert-attention compute allocation. We generalize the Chinchilla scaling law by incorporating this architectural parameter, providing a new framework for tuning MoE models beyond size and data. Our findings offer practical guidelines for designing efficient MoE models, optimizing performance while respecting fixed compute budgets.

[433] Variance-Aware Adaptive Weighting for Diffusion Model Training

Nanlong Sun, Lei Shi

Main category: cs.LG

TL;DR: A variance-aware adaptive weighting strategy for diffusion models that addresses training imbalance across noise levels by dynamically adjusting weights based on loss variance distribution.

Details

Motivation: Diffusion models suffer from highly imbalanced training dynamics across different noise levels, leading to inefficient optimization and unstable learning behavior. The authors investigate this imbalance from the perspective of loss variance across log-SNR levels.

Method: Proposes a variance-aware adaptive weighting strategy that dynamically adjusts training weights based on the observed variance distribution across log-SNR levels, encouraging more balanced optimization across noise levels.

Result: Extensive experiments on CIFAR-10 and CIFAR-100 show consistent improvement in generative performance over standard training schemes, achieving lower Fréchet Inception Distance (FID) while reducing performance variance across random seeds.

Conclusion: The adaptive weighting effectively stabilizes training dynamics and highlights the potential of variance-aware training strategies for improving diffusion model optimization.

Abstract: Diffusion models have recently achieved remarkable success in generative modeling, yet their training dynamics across different noise levels remain highly imbalanced, which can lead to inefficient optimization and unstable learning behavior. In this work, we investigate this imbalance from the perspective of loss variance across log-SNR levels and propose a variance-aware adaptive weighting strategy to address it. The proposed approach dynamically adjusts training weights based on the observed variance distribution, encouraging a more balanced optimization process across noise levels. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate that the proposed method consistently improves generative performance over standard training schemes, achieving lower Fréchet Inception Distance (FID) while also reducing performance variance across random seeds. Additional analysis, including loss-log-SNR visualization, variance heatmaps, and ablation studies, further reveal that the adaptive weighting effectively stabilizes training dynamics. These results highlight the potential of variance-aware training strategies for improving diffusion model optimization.

[434] Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

Baoheng Zhu, Deyu Bo, Delvin Ce Zhang, Xiao Wang

Main category: cs.LG

TL;DR: Graph-GRPO: Online RL framework for training graph flow models with verifiable rewards using differentiable rollouts and localized refinement

Details

Motivation: Graph generation has broad applications like drug discovery, but aligning graph flow models with complex human preferences or task-specific objectives remains challenging

Method: 1) Derive analytical expression for transition probability of GFMs to enable fully differentiable rollouts for RL training; 2) Propose refinement strategy that randomly perturbs specific nodes/edges and regenerates them for localized exploration

Result: Achieves 95.0% and 97.5% Valid-Unique-Novelty scores on planar and tree datasets with only 50 denoising steps; State-of-the-art performance on molecular optimization tasks, outperforming graph-based/fragment-based RL methods and genetic algorithms

Conclusion: Graph-GRPO effectively trains graph flow models under verifiable rewards through differentiable rollouts and localized refinement, demonstrating strong performance on graph generation tasks

Abstract: Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, \aka, graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing the Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph, and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.

[435] On the Learning Dynamics of Two-layer Linear Networks with Label Noise SGD

Tongcheng Zhang, Zhanpeng Zhou, Mingze Wang, Andi Han, Wei Huang, Taiji Suzuki, Junchi Yan

Main category: cs.LG

TL;DR: Analysis of how label noise in SGD drives transition from lazy to rich learning regimes in over-parameterized networks, with theoretical insights extended to SAM optimization.

Details

Motivation: Empirical observations show training with noisy labels improves model generalization, but the underlying mechanisms of SGD with label noise are not well understood. The paper aims to analyze how label noise affects learning dynamics and drives regime transitions.

Method: Theoretical analysis of two-layer over-parameterized linear networks under SGD with label noise, identifying two-phase learning behavior. Extends insights to Sharpness-Aware Minimization (SAM) and validates with experiments on synthetic and real-world setups.

Result: Reveals that label noise SGD exhibits two phases: Phase I where model weights diminish and transition from lazy to rich regime occurs, and Phase II where weight alignment with ground-truth interpolator increases. Shows similar principles apply to SAM optimization.

Conclusion: Label noise plays critical role in driving transition from lazy to rich learning regimes, explaining empirical success of noisy label training. Insights extend to broader optimization algorithms like SAM, providing theoretical foundation for understanding implicit bias in deep learning.

Abstract: One crucial factor behind the success of deep learning lies in the implicit bias induced by noise inherent in gradient-based training algorithms. Motivated by empirical observations that training with noisy labels improves model generalization, we delve into the underlying mechanisms behind stochastic gradient descent (SGD) with label noise. Focusing on a two-layer over-parameterized linear network, we analyze the learning dynamics of label noise SGD, unveiling a two-phase learning behavior. In \emph{Phase I}, the magnitudes of model weights progressively diminish, and the model escapes the lazy regime; enters the rich regime. In \emph{Phase II}, the alignment between model weights and the ground-truth interpolator increases, and the model eventually converges. Our analysis highlights the critical role of label noise in driving the transition from the lazy to the rich regime and minimally explains its empirical success. Furthermore, we extend these insights to Sharpness-Aware Minimization (SAM), showing that the principles governing label noise SGD also apply to broader optimization algorithms. Extensive experiments, conducted under both synthetic and real-world setups, strongly support our theory. Our code is released at https://github.com/a-usually/Label-Noise-SGD.

[436] Designing Service Systems from Textual Evidence

Ruicheng Ao, Hongyu Chen, Siyang Gao, Hanwei Li, David Simchi-Levi

Main category: cs.LG

TL;DR: A sequential decision framework for selecting optimal service configurations using biased LLM evaluations and selective human audits to minimize costs while maintaining confidence.

Details

Motivation: Service systems often rely on textual evidence (chat transcripts, complaints) rather than scalar metrics for performance evaluation. LLMs can score this text but have systematic biases, while human review is accurate but expensive. Need to identify best configurations with high confidence while minimizing costly human audits.

Method: Formalizes as sequential decision problem with biased proxy scores (LLM evaluations) and selective verified outcomes (human audits). Develops PP-LUCB algorithm that combines proxy scores with inverse-propensity-weighted residuals, constructs anytime-valid confidence sequences, and jointly decides which alternatives to evaluate and when to request human audits.

Result: Proves LLM-only selection fails under arm-dependent bias, naive selective-audit estimators are asymptotically biased. On customer support ticket classification task, algorithm correctly identifies best model in 40/40 trials while achieving 90% audit cost reduction.

Conclusion: Provides principled framework for leveraging biased LLM evaluations with selective human verification to optimize service configurations efficiently. Demonstrates significant cost savings while maintaining selection accuracy.

Abstract: Designing service systems requires selecting among alternative configurations – choosing the best chatbot variant, the optimal routing policy, or the most effective quality control procedure. In many service systems, the primary evidence of performance quality is textual – customer support transcripts, complaint narratives, compliance review reports – rather than the scalar measurements assumed by classical optimization methods. Large language models (LLMs) can read such textual evidence and produce standardized quality scores, but these automated judges exhibit systematic biases that vary across alternatives and evaluation instances. Human expert review remains accurate but costly. We study how to identify the best service configuration with high confidence while minimizing expensive human audits, given that automated evaluation is cheap but biased. We formalize this as a sequential decision problem where a biased proxy score is observed for every evaluation, and a verified outcome can be acquired selectively at additional cost. We prove that LLM-only selection fails under arm-dependent bias, and that naive selective-audit estimators can be asymptotically biased. We develop an estimator combining proxy scores with inverse-propensity-weighted residuals and construct anytime-valid confidence sequences. Our algorithm, PP-LUCB, jointly decides which alternatives to evaluate and whether to request human audits, concentrating reviews where the LLM judge is least reliable. We prove correctness and establish instance-dependent cost bounds showing near-optimal efficiency. On a customer support ticket classification task, our algorithm correctly identifies the best model in 40/40 trials while achieving 90% audit cost reduction.

[437] Effective Dataset Distillation for Spatio-Temporal Forecasting with Bi-dimensional Compression

Taehyung Kwon, Yeonje Choi, Yeongho Kim, Kijung Shin

Main category: cs.LG

TL;DR: STemDist: First dataset distillation method for spatio-temporal time series forecasting that compresses both temporal and spatial dimensions, enabling faster, more memory-efficient, and more effective model training.

Details

Motivation: Training deep learning models for spatio-temporal forecasting is time- and resource-intensive due to large dataset sizes and model complexity. Existing dataset distillation methods only compress one dimension, making them unsuitable for spatio-temporal data where both spatial and temporal dimensions contribute to data volume.

Method: Proposes STemDist that compresses both temporal and spatial dimensions in a balanced manner. Uses cluster-level distillation to reduce cost, complemented by subset-based granular distillation to enhance forecasting performance.

Result: On five real-world datasets, STemDist enables model training: 1) up to 6X faster, 2) up to 8X more memory-efficient, and 3) up to 12% lower prediction error compared to general and time-series dataset distillation methods.

Conclusion: STemDist is the first specialized dataset distillation method for spatio-temporal time series forecasting that effectively addresses the dual-dimensional compression challenge, offering significant improvements in training efficiency and prediction accuracy.

Abstract: Spatio-temporal time series are widely used in real-world applications, including traffic prediction and weather forecasting. They are sequences of observations over extensive periods and multiple locations, naturally represented as multidimensional data. Forecasting is a central task in spatio-temporal analysis, and numerous deep learning methods have been developed to address it. However, as dataset sizes and model complexities continue to grow in practice, training deep learning models has become increasingly time- and resource-intensive. A promising solution to this challenge is dataset distillation, which synthesizes compact datasets that can effectively replace the original data for model training. Although successful in various domains, including time series analysis, existing dataset distillation methods compress only one dimension, making them less suitable for spatio-temporal datasets, where both spatial and temporal dimensions jointly contribute to the large data volume. To address this limitation, we propose STemDist, the first dataset distillation method specialized for spatio-temporal time series forecasting. A key idea of our solution is to compress both temporal and spatial dimensions in a balanced manner, reducing training time and memory. We further reduce the distillation cost by performing distillation at the cluster level rather than the individual location level, and we complement this coarse-grained approach with a subset-based granular distillation technique that enhances forecasting performance. On five real-world datasets, we show empirically that, compared to both general and time-series dataset distillation methods, datasets distilled by our STemDist method enable model training (1) faster (up to 6X) (2) more memory-efficient (up to 8X), and (3) more effective (with up to 12% lower prediction error).

[438] Domain-Adaptive Health Indicator Learning with Degradation-Stage Synchronized Sampling and Cross-Domain Autoencoder

Jungho Choo, Hanbyeol Park, Gawon Lee, Yunkyung Park, Hyerim Bae

Main category: cs.LG

TL;DR: A domain-adaptive framework for health indicator construction using degradation stage synchronized batch sampling and cross-domain aligned fusion large autoencoder to address distribution mismatches in industrial condition monitoring.

Details

Motivation: Existing deep learning approaches for health indicator modeling struggle with distribution mismatches from varying operating conditions, and domain adaptation methods face challenges with degradation stage misalignment during batch sampling and limited temporal dependency capture in small-kernel 1D-CNNs.

Method: Proposes DSSBS (degradation stage synchronized batch sampling) using kernel change-point detection to segment degradation stages and synchronize source/target mini-batches by failure phases, combined with CAFLAE (cross-domain aligned fusion large autoencoder) integrating large-kernel temporal feature extraction with cross-attention mechanisms.

Result: Achieved 24.1% average performance improvement over state-of-the-art methods on Korean defense system dataset and XJTU-SY bearing dataset, demonstrating improved cross-domain alignment through stage-consistent sampling and superior domain-invariant representations.

Conclusion: The proposed framework effectively addresses distribution mismatches in health indicator construction through synchronized degradation stage sampling and enhanced temporal feature extraction, offering a robust solution for industrial condition monitoring under varying operating conditions.

Abstract: The construction of high quality health indicators (HIs) is crucial for effective prognostics and health management. Although deep learning has significantly advanced HI modeling, existing approaches often struggle with distribution mismatches resulting from varying operating conditions. Although domain adaptation is typically employed to mitigate these shifts, two critical challenges remain: (1) the misalignment of degradation stages during random mini-batch sampling, resulting in misleading discrepancy losses, and (2) the structural limitations of small-kernel 1D-CNNs in capturing long-range temporal dependencies within complex vibration signals. To address these issues, we propose a domain-adaptive framework comprising degradation stage synchronized batch sampling (DSSBS) and the cross-domain aligned fusion large autoencoder (CAFLAE). DSSBS utilizes kernel change-point detection to segment degradation stages, ensuring that source and target mini-batches are synchronized by their failure phases during alignment. Complementing this, CAFLAE integrates large-kernel temporal feature extraction with cross-attention mechanisms to learn superior domain-invariant representations. The proposed framework was rigorously validated on a Korean defense system dataset and the XJTU-SY bearing dataset, achieving an average performance enhancement of 24.1% over state-of-the-art methods. These results demonstrate that DSSBS improves cross-domain alignment through stage-consistent sampling, whereas CAFLAE offers a high-performance backbone for long-term industrial condition monitoring.

[439] GGMPs: Generalized Gaussian Mixture Processes

Vardaan Tekriwal, Mark D. Risser, Hengrui Luo, Marcus M. Noack

Main category: cs.LG

TL;DR: GGMP is a Gaussian Process-based method for multimodal conditional density estimation that produces Gaussian mixture predictive densities to handle complex, non-Gaussian output distributions.

Details

Motivation: Standard Gaussian Process regression is limited to unimodal Gaussian predictions, but real-world data often exhibits multimodality, heteroscedasticity, and strong non-Gaussianity, requiring more flexible conditional density estimation methods.

Method: GGMP combines local Gaussian mixture fitting, cross-input component alignment, and per-component heteroscedastic GP training to produce closed-form Gaussian mixture predictive densities, avoiding the exponential complexity of naive multimodal GP formulations.

Result: GGMPs demonstrate improved distributional approximation on both synthetic and real-world datasets with pronounced non-Gaussianity and multimodality compared to standard approaches.

Conclusion: GGMP provides a tractable, scalable GP-based framework for multimodal conditional density estimation that maintains compatibility with standard GP solvers while handling complex output distributions.

Abstract: Conditional density estimation is complicated by multimodality, heteroscedasticity, and strong non-Gaussianity. Gaussian processes (GPs) provide a principled nonparametric framework with calibrated uncertainty, but standard GP regression is limited by its unimodal Gaussian predictive form. We introduce the Generalized Gaussian Mixture Process (GGMP), a GP-based method for multimodal conditional density estimation in settings where each input may be associated with a complex output distribution rather than a single scalar response. GGMP combines local Gaussian mixture fitting, cross-input component alignment and per-component heteroscedastic GP training to produce a closed-form Gaussian mixture predictive density. The method is tractable, compatible with standard GP solvers and scalable methods, and avoids the exponentially large latent-assignment structure of naive multimodal GP formulations. Empirically, GGMPs improve distributional approximation on synthetic and real-world datasets with pronounced non-Gaussianity and multimodality.

[440] The Curse and Blessing of Mean Bias in FP4-Quantized LLM Training

Hengjie Cao, Zhendong Huang, Mengyi Chen, Yifeng Yang, Fanqi Yu, Ruijun Huang, Fang Dong, Xin Zhang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Qin Lv, Yuan Cheng, Tun Lu, Fan Yang, Li Shang

Main category: cs.LG

TL;DR: Mean removal technique stabilizes low-bit LLM training by addressing rank-one mean bias that causes dynamic range inflation in quantized models.

Details

Motivation: LLMs exhibit anisotropy where a few directions concentrate most energy, causing instability in low-bit training regimes due to extreme activation magnitudes stretching dynamic range and compressing semantic variation.

Method: Identify that the primary instability driver is a coherent rank-one mean bias, then apply simple source-level mean-subtraction operation to eliminate this bias, requiring only reduction operations and standard quantization kernels.

Result: Mean removal substantially narrows the loss gap to BF16 training and restores downstream performance in FP4 (W4A4G4) training, providing hardware-efficient stability for low-bit LLM training.

Conclusion: A simple mean-subtraction operation effectively addresses the dominant instability in low-bit LLM training, offering a practical, hardware-efficient alternative to more complex spectral methods while recovering most stability benefits.

Abstract: Large language models trained on natural language exhibit pronounced anisotropy: a small number of directions concentrate disproportionate energy, while the remaining dimensions form a broad semantic tail. In low-bit training regimes, this geometry becomes numerically unstable. Because blockwise quantization scales are determined by extreme elementwise magnitudes, dominant directions stretch the dynamic range, compressing long-tail semantic variation into narrow numerical bins. We show that this instability is primarily driven by a coherent rank-one mean bias, which constitutes the dominant component of spectral anisotropy in LLM representations. This mean component emerges systematically across layers and training stages and accounts for the majority of extreme activation magnitudes, making it the principal driver of dynamic-range inflation under low precision. Crucially, because the dominant instability is rank-one, it can be eliminated through a simple source-level mean-subtraction operation. This bias-centric conditioning recovers most of the stability benefits of SVD-based spectral methods while requiring only reduction operations and standard quantization kernels. Empirical results on FP4 (W4A4G4) training show that mean removal substantially narrows the loss gap to BF16 and restores downstream performance, providing a hardware-efficient path to stable low-bit LLM training.

[441] Unlearning the Unpromptable: Prompt-free Instance Unlearning in Diffusion Models

Kyungryeol Lee, Kyeonghyun Lee, Seongmin Hong, Byung Hyun Lee, Se Young Chun

Main category: cs.LG

TL;DR: A method for instance unlearning in diffusion models that removes specific unpromptable outputs (like faces or culturally inaccurate depictions) without using text prompts, using image editing, timestep-aware weighting, and gradient surgery.

Details

Motivation: Current machine unlearning methods focus on concept-level forgetting using text prompts, but many undesired outputs (individual faces, culturally inaccurate depictions) cannot be specified by text prompts. There's a need for prompt-free instance unlearning to selectively forget specific outputs while preserving model integrity.

Method: Proposes a surrogate-based unlearning method with three key components: 1) Image editing to create surrogate images for unlearning targets, 2) Timestep-aware weighting to focus on relevant denoising steps, and 3) Gradient surgery to guide the model away from undesired outputs while preserving other capabilities.

Result: Experiments on Stable Diffusion 3 (conditional) and DDPM-CelebA (unconditional) show the method successfully unlearns unpromptable outputs like faces and culturally inaccurate depictions while maintaining model integrity, outperforming both prompt-based and prompt-free baselines.

Conclusion: The method provides a practical hotfix for diffusion model providers to address privacy protection and ethical compliance by enabling selective forgetting of specific unpromptable outputs without compromising overall model performance.

Abstract: Machine unlearning aims to remove specific outputs from trained models, often at the concept level, such as forgetting all occurrences of a particular celebrity or filtering content via text prompts. However, many undesired outputs, such as an individual’s face or generations culturally or factually misinterpreted, cannot often be specified by text prompts. We address this underexplored setting of instance unlearning for outputs that are undesired but unpromptable, where the goal is to forget target outputs selectively while preserving the rest. To this end, we introduce an effective surrogate-based unlearning method that leverages image editing, timestep-aware weighting, and gradient surgery to guide trained diffusion models toward forgetting specific outputs. Experiments on conditional (Stable Diffusion 3) and unconditional (DDPM-CelebA) diffusion models demonstrate that our prompt-free method uniquely unlearns unpromptable outputs, such as faces and culturally inaccurate depictions, with preserved integrity, unlike prompt-based and prompt-free baselines. Our proposed method would serve as a practical hotfix for diffusion model providers to ensure privacy protection and ethical compliance.

[442] Spatio-Temporal Forecasting of Retaining Wall Deformation: Mitigating Error Accumulation via Multi-Resolution ConvLSTM Stacking Ensemble

Jihoon Kim, Heejung Youn

Main category: cs.LG

TL;DR: Multi-resolution ConvLSTM ensemble framework improves long-term forecasting of retaining structure behavior during excavation by combining models trained at different temporal resolutions to reduce error accumulation.

Details

Motivation: To address error accumulation and improve long-horizon forecasting of retaining-structure behavior during staged excavation, particularly for geotechnical applications where accurate predictions are crucial for safety and design.

Method: Developed a multi-resolution ConvLSTM ensemble framework using three ConvLSTM models trained at different input temporal resolutions, integrated via a fully connected neural network meta-learner. Used extensive PLAXIS2D simulations to generate 2,000 time-series deflection profiles with varied geotechnical parameters.

Result: The ensemble approach consistently outperformed standalone ConvLSTM models, especially in long-term multi-step prediction, showing reduced error propagation and improved generalization. Validated with both numerical results and field measurements.

Conclusion: Multi-resolution ensemble strategies that exploit diverse temporal input scales can enhance predictive stability and accuracy in AI-driven geotechnical forecasting, demonstrating the value of combining models at different resolutions.

Abstract: This study proposes a multi-resolution Convolutional Long Short-Term Memory (ConvLSTM) ensemble framework that leverages diverse temporal input resolutions to mitigate error accumulation and improve long-horizon forecasting of retaining-structure behavior during staged excavation. An extensive database of lateral wall displacement responses was generated through PLAXIS2D simulations incorporating five-layered soil stratigraphy, two excavation depths (14 and 20 m), and stochastically varied geotechnical and structural parameters, yielding 2,000 time-series deflection profiles. Three ConvLSTM models trained at different input resolutions were integrated using a fully connected neural network meta-learner to construct the ensemble model. Validation using both numerical results and field measurements demonstrated that the ensemble approach consistently outperformed the standalone ConvLSTM models, particularly in long-term multi-step prediction, exhibiting reduced error propagation and improved generalization. These findings underscore the potential of multi-resolution ensemble strategies that jointly exploit diverse temporal input scales to enhance predictive stability and accuracy in AI-driven geotechnical forecasting.

[443] Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation

Ilseung Park, Eunsik Choi, Jangwhan Ahn, Jooeun Ahn

Main category: cs.LG

TL;DR: A reinforcement learning framework that uses muscle synergies extracted from human walking data to constrain control of a musculoskeletal model, improving biomechanical fidelity and generalization across various walking conditions.

Details

Motivation: Human locomotion involves complex neuromuscular control that makes predictive musculoskeletal simulation challenging. The authors aim to improve biomechanical fidelity and generalization in human locomotion simulation by incorporating neurophysiological structure into reinforcement learning.

Method: Extracted low-dimensional muscle synergies from inverse musculoskeletal analyses of overground walking trials, then used this synergy basis as the action space for a muscle-driven 3D model trained with reinforcement learning across variable speeds, slopes, and uneven terrain.

Result: The synergy-constrained controller generated stable gait from 0.7-1.8 m/s and on ±6° grades, reproduced condition-dependent modulation of joint angles/moments/ground reaction forces, reduced non-physiological knee kinematics, kept knee moments within experimental envelopes, and produced muscle-activation timing within inter-subject variability.

Conclusion: Embedding neurophysiological structure (muscle synergies) into reinforcement learning improves biomechanical fidelity and generalization in predictive human locomotion simulation, even with limited experimental data.

Abstract: Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7-1.8 m/s and on $\pm$ 6$^{\circ}$ grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.

[444] A Universal Nearest-Neighbor Estimator for Intrinsic Dimensionality

Eng-Jon Ong, Omer Bobrowski, Gesine Reinert, Primoz Skraba

Main category: cs.LG

TL;DR: A novel intrinsic dimensionality estimator based on nearest-neighbor distance ratios that achieves state-of-the-art performance and provides theoretical guarantees of universality across data distributions.

Details

Motivation: Existing intrinsic dimensionality (ID) estimation methods often fail when their geometric or distributional assumptions are violated, highlighting the need for a more robust and universal estimator that works across diverse data distributions.

Method: The method introduces a novel ID estimator based on nearest-neighbor distance ratios that involves simple calculations. The key innovation is the theoretical proof that this estimator is universal - it converges to the true ID independently of the distribution generating the data.

Result: The estimator achieves state-of-the-art results on benchmark manifolds and real-world datasets, demonstrating superior performance compared to existing methods while maintaining computational simplicity.

Conclusion: The proposed nearest-neighbor distance ratio estimator provides a robust, theoretically-grounded solution for intrinsic dimensionality estimation that works universally across different data distributions, addressing limitations of previous approaches.

Abstract: Estimating the intrinsic dimensionality (ID) of data is a fundamental problem in machine learning and computer vision, providing insight into the true degrees of freedom underlying high-dimensional observations. Existing methods often rely on geometric or distributional assumptions and can significantly fail when these assumptions are violated. In this paper, we introduce a novel ID estimator based on nearest-neighbor distance ratios that involves simple calculations and achieves state-of-the-art results. Most importantly, we provide a theoretical analysis proving that our estimator is \emph{universal}, namely, it converges to the true ID independently of the distribution generating the data. We present experimental results on benchmark manifolds and real-world datasets to demonstrate the performance of our estimator.

[445] World Model for Battery Degradation Prediction Under Non-Stationary Aging

Kai Chin Lim, Khay Wai See

Main category: cs.LG

TL;DR: World model approach for battery degradation prognosis using learned dynamics from voltage/current/temperature time-series, with optional electrochemical constraints from Single Particle Model.

Details

Motivation: Existing data-driven battery SOH trajectory forecasting lacks mechanisms to propagate degradation dynamics forward in time. Need to improve forecasting by encoding cycle-level data and learning degradation dynamics.

Method: Formulates battery degradation as world model problem: encodes raw voltage/current/temperature time-series into latent state, propagates via learned dynamics transition to forecast 80 cycles. Incorporates Single Particle Model constraint in training loss to investigate electrochemical knowledge benefits.

Result: Iterative rollout halves trajectory forecast error compared to direct regression from same encoder. SPM constraint improves prediction at degradation knee where resistance-SOH relationship is most applicable, without changing aggregate accuracy.

Conclusion: World model approach effectively captures battery degradation dynamics, with SPM constraints providing targeted improvements at critical degradation points.

Abstract: Degradation prognosis for lithium-ion cells requires forecasting the state-of-health (SOH) trajectory over future cycles. Existing data-driven approaches can produce trajectory outputs through direct regression, but lack a mechanism to propagate degradation dynamics forward in time. This paper formulates battery degradation prognosis as a world model problem, encoding raw voltage, current, and temperature time-series from each cycle into a latent state and propagating it forward via a learned dynamics transition to produce a future trajectory spanning 80 cycles. To investigate whether electrochemical knowledge improves the learned dynamics, a Single Particle Model (SPM) constraint is incorporated into the training loss. Three configurations are evaluated on the Severson LiFePO4 (LFP) dataset of 138 cells. Iterative rollout halves the trajectory forecast error compared to direct regression from the same encoder. The SPM constraint improves prediction at the degradation knee where the resistance to SOH relationship is most applicable, without changing aggregate accuracy.

[446] UAV-MARL: Multi-Agent Reinforcement Learning for Time-Critical and Dynamic Medical Supply Delivery

Islam Guven, Mehmet Parlak

Main category: cs.LG

TL;DR: MARL framework using PPO for coordinating UAV fleets in medical delivery logistics, prioritizing urgent requests and adapting to stochastic conditions.

Details

Motivation: UAVs offer rapid medical supply delivery but require intelligent coordination to prioritize urgent requests, allocate limited resources, and adapt to uncertain operational conditions in healthcare logistics.

Method: Formulates problem as POMDP with limited agent visibility, uses Proximal Policy Optimization (PPO) as primary learning algorithm with variants including asynchronous extensions and actor-critic methods, evaluated on real-world geographic data from OpenStreetMap.

Result: Classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, demonstrating effective prioritization of medical tasks and real-time resource reallocation.

Conclusion: Reinforcement learning, particularly PPO-based MARL, shows strong potential for adaptive and scalable UAV-assisted healthcare logistics coordination under stochastic conditions.

Abstract: Unmanned aerial vehicles (UAVs) are increasingly used to support time-critical medical supply delivery, providing rapid and flexible logistics during emergencies and resource shortages. However, effective deployment of UAV fleets requires coordination mechanisms capable of prioritizing medical requests, allocating limited aerial resources, and adapting delivery schedules under uncertain operational conditions. This paper presents a multi-agent reinforcement learning (MARL) framework for coordinating UAV fleets in stochastic medical delivery scenarios where requests vary in urgency, location, and delivery deadlines. The problem is formulated as a partially observable Markov decision process (POMDP) in which UAV agents maintain awareness of medical delivery demands while having limited visibility of other agents due to communication and localization constraints. The proposed framework employs Proximal Policy Optimization (PPO) as the primary learning algorithm and evaluates several variants, including asynchronous extensions, classical actor–critic methods, and architectural modifications to analyze scalability and performance trade-offs. The model is evaluated using real-world geographic data from selected clinics and hospitals extracted from the OpenStreetMap dataset. The framework provides a decision-support layer that prioritizes medical tasks, reallocates UAV resources in real time, and assists healthcare personnel in managing urgent logistics. Experimental results show that classical PPO achieves superior coordination performance compared to asynchronous and sequential learning strategies, highlighting the potential of reinforcement learning for adaptive and scalable UAV-assisted healthcare logistics.

[447] Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, Xing Yu

Main category: cs.LG

TL;DR: GR³ (Group Relative Reward Rescaling) addresses length inflation in RL for LLMs through multiplicative reward rescaling with group-relative regularization and advantage-aware calibration.

Details

Motivation: Reinforcement learning enhances LLMs but suffers from length inflation where models become verbose to maximize rewards. Prior approaches using additive penalties create optimization shortcuts, while heuristic gating lacks generality beyond binary feedback.

Method: GR³ reframes length control as multiplicative rescaling paradigm with group-relative regularization and advantage-aware calibration. It establishes continuous, reward-dependent gating that dynamically adapts length budgets to instance difficulty while preserving advantage signals of high-quality trajectories.

Result: Empirically, across RLHF and RLVR settings, GR³ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.

Conclusion: GR³ provides a general, lossless solution to length inflation in RL for LLMs through multiplicative reward rescaling with adaptive length control mechanisms.

Abstract: Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$~maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.

[448] SCORE: Replacing Layer Stacking with Contractive Recurrent Depth

Guillaume Godin

Main category: cs.LG

TL;DR: SCORE proposes a discrete recurrent alternative to layer stacking using ODE-inspired contractive updates with shared neural blocks, improving convergence speed and reducing parameters across various architectures.

Details

Motivation: Residual connections are fundamental to deep neural networks but traditional layer stacking requires many independent parameters. The authors aim to create a more efficient alternative using recurrent depth refinement with shared weights.

Method: SCORE uses a single shared neural block applied iteratively with ODE-inspired contractive updates: ht+1 = (1 - dt) * ht + dt * F(ht). This discrete recurrent approach uses fixed iterations and standard backpropagation without ODE solvers, with step size dt controlling stability.

Result: SCORE improves convergence speed and accelerates training across graph neural networks (ESOL), MLPs, and Transformers (nanoGPT). It reduces parameter count through weight sharing, with simple Euler integration providing the best cost-performance trade-off.

Conclusion: Controlled recurrent depth with contractive residual updates offers a lightweight, effective alternative to classical layer stacking, enabling faster training with fewer parameters across diverse neural architectures.

Abstract: Residual connections are central to modern deep neural networks, enabling stable optimization and efficient information flow across depth. In this work, we propose SCORE (Skip-Connection ODE Recurrent Embedding), a discrete recurrent alternative to classical layer stacking. Instead of composing multiple independent layers, SCORE iteratively applies a single shared neural block using an ODE (Ordinary Differential Equation)-inspired contractive update: ht+1 = (1 - dt) * ht + dt * F(ht) This formulation can be interpreted as a depth-by-iteration refinement process, where the step size dt explicitly controls stability and update magnitude. Unlike continuous Neural ODE approaches, SCORE uses a fixed number of discrete iterations and standard backpropagation without requiring ODE solvers or adjoint methods. We evaluate SCORE across graph neural networks (ESOL molecular solubility), multilayer perceptrons, and Transformer-based language models (nanoGPT). Across architectures, SCORE generally improves convergence speed and often accelerates training. SCORE is reducing parameter count through shared weights. In practice, simple Euler integration provides the best trade-off between computational cost and performance, while higher-order integrators yield marginal gains at increased compute. These results suggest that controlled recurrent depth with contractive residual updates offers a lightweight and effective alternative to classical stacking in deep neural networks.

[449] Learning to Score: Tuning Cluster Schedulers through Reinforcement Learning

Martin Asenov, Qiwen Deng, Gingfung Yeung, Adam Barker

Main category: cs.LG

TL;DR: RL-based scheduler learns optimal scoring function weights for job allocation in clusters, improving job performance by 33% over fixed weights.

Details

Motivation: Current cluster schedulers use equal weighting for scoring functions, leading to sub-optimal job allocation. Manual weight tuning requires expert knowledge and is computationally expensive.

Method: Reinforcement learning approach with percentage improvement reward, frame-stacking for information retention across optimization experiments, and limited domain information to prevent overfitting.

Result: Improves performance by 33% compared to fixed weights and 12% compared to best-performing baseline in lab-based serverless scenario.

Conclusion: RL-based weight tuning for scheduler scoring functions significantly improves job performance and cluster utilization without requiring expert manual tuning.

Abstract: Efficiently allocating incoming jobs to nodes in large-scale clusters can lead to substantial improvements in both cluster utilization and job performance. In order to allocate incoming jobs, cluster schedulers usually rely on a set of scoring functions to rank feasible nodes. Results from individual scoring functions are usually weighted equally, which could lead to sub-optimal deployments as the one-size-fits-all solution does not take into account the characteristics of each workload. Tuning the weights of scoring functions, however, requires expert knowledge and is computationally expensive. This paper proposes a reinforcement learning approach for learning the weights in scheduler scoring algorithms with the overall objective of improving the end-to-end performance of jobs for a given cluster. Our approach is based on percentage improvement reward, frame-stacking, and limiting domain information. We propose a percentage improvement reward to address the objective of multi-step parameter tuning. The inclusion of frame-stacking allows for carrying information across an optimization experiment. Limiting domain information prevents overfitting and improves performance in unseen clusters and workloads. The policy is trained on different combinations of workloads and cluster setups. We demonstrate the proposed approach improves performance on average by 33% compared to fixed weights and 12% compared to the best-performing baseline in a lab-based serverless scenario.

[450] A Bipartite Graph Approach to U.S.-China Cross-Market Return Forecasting

Jing Liu, Maria Grith, Xiaowen Dong, Mihai Cucuringu

Main category: cs.LG

TL;DR: A machine learning framework with economic structure preservation for cross-market return prediction between US and Chinese equity markets using non-overlapping trading hours and directed bipartite graphs.

Details

Motivation: To study cross-market return predictability while preserving economic interpretability, leveraging the time-zone difference between US and Chinese markets to create structured predictive models.

Method: Construct directed bipartite graph capturing time-ordered predictive linkages between stocks across markets using non-overlapping trading hours. Use rolling-window hypothesis testing for edge selection, creating sparse feature-selection layer for downstream ML models including regularized and ensemble methods.

Result: Reveals pronounced directional asymmetry: US previous-close-to-close returns strongly predict Chinese intraday returns, but reverse effect is limited. This informational asymmetry leads to economically meaningful performance differences.

Conclusion: Structured machine learning frameworks can effectively uncover cross-market dependencies while maintaining interpretability, with US market information being more predictive for Chinese returns than vice versa.

Abstract: This paper studies cross-market return predictability through a machine learning framework that preserves economic structure. Exploiting the non-overlapping trading hours of the U.S. and Chinese equity markets, we construct a directed bipartite graph that captures time-ordered predictive linkages between stocks across markets. Edges are selected via rolling-window hypothesis testing, and the resulting graph serves as a sparse, economically interpretable feature-selection layer for downstream machine learning models. We apply a range of regularized and ensemble methods to forecast open-to-close returns using lagged foreign-market information. Our results reveal a pronounced directional asymmetry: U.S. previous-close-to-close returns contain substantial predictive information for Chinese intraday returns, whereas the reverse effect is limited. This informational asymmetry translates into economically meaningful performance differences and highlights how structured machine learning frameworks can uncover cross-market dependencies while maintaining interpretability.

[451] Riemannian Geometry-Preserving Variational Autoencoder for MI-BCI Data Augmentation

Viktorija Poļaka, Ivo Pascal de Jong, Andreea Ioana Sburlea

Main category: cs.LG

TL;DR: Riemannian geometry-preserving VAE for generating synthetic EEG covariance matrices for motor imagery brain-computer interfaces, preserving symmetric positive-definite structure.

Details

Motivation: Address the challenge of generating synthetic EEG covariance matrices for MI-BCI applications while preserving their symmetric positive-definite nature, enabling data augmentation, privacy protection, and scalability.

Method: Proposed RGP-VAE (Riemannian geometry-preserving variational autoencoder) that integrates geometric mappings with a composite loss function combining Riemannian distance, tangent space reconstruction accuracy, and generative diversity.

Result: The model successfully generates valid, representative EEG covariance matrices while learning a subject-invariant latent space. Synthetic data proves practically useful for MI-BCI, with effectiveness depending on the paired classifier.

Conclusion: The RGP-VAE is validated as a geometry-preserving generative model for EEG covariance matrices, highlighting its potential for signal privacy, scalability, and data augmentation in BCI applications.

Abstract: This paper addresses the challenge of generating synthetic electroencephalogram (EEG) covariance matrices for motor imagery brain-computer interface (MI-BCI) applications. Objective: We aim to develop a generative model capable of producing high-fidelity synthetic covariance matrices while preserving their symmetric positive-definite nature. Approach: We propose a Riemannian geometry-preserving variational autoencoder (RGP-VAE) integrating geometric mappings with a composite loss function combining Riemannian distance, tangent space reconstruction accuracy and generative diversity. Results: The model generates valid, representative EEG covariance matrices, while learning a subject-invariant latent space. Synthetic data proves practically useful for MI-BCI, with its impact depending on the paired classifier. Contribution: This work introduces and validates the RGP-VAE as a geometry-preserving generative model for EEG covariance matrices, highlighting its potential for signal privacy, scalability and data augmentation.

[452] Implicit Statistical Inference in Transformers: Approximating Likelihood-Ratio Tests In-Context

Faris Chaudhry, Siddhant Gadkari

Main category: cs.LG

TL;DR: Transformers in in-context learning approximate Bayes-optimal statistics rather than simple similarity matching, adapting decision boundaries based on task geometry.

Details

Motivation: To understand the underlying algorithms of in-context learning (ICL) in Transformers, which adapt to novel tasks without weight updates but whose mechanisms remain poorly understood.

Method: Adopted statistical decision-theoretic perspective using binary hypothesis testing with known optimal policy (likelihood-ratio test). Trained Transformers on tasks with distinct geometries (linear shifted means vs. nonlinear variance estimation) and analyzed mechanisms via logit lens and circuit alignment.

Result: Transformers approximate Bayes-optimal sufficient statistics from context up to monotonic transformation, matching ideal oracle estimator performance in nonlinear regimes. Models adapt decision boundaries rather than using fixed kernel smoothing, showing voting-style ensembles for linear tasks and deeper sequential computation for nonlinear tasks.

Conclusion: ICL emerges from construction of task-adaptive statistical estimators rather than simple similarity matching, with Transformers learning to approximate optimal statistical decision rules based on task geometry.

Abstract: In-context learning (ICL) allows Transformers to adapt to novel tasks without weight updates, yet the underlying algorithms remain poorly understood. We adopt a statistical decision-theoretic perspective by investigating simple binary hypothesis testing, where the optimal policy is determined by the likelihood-ratio test. Notably, this setup provides a mathematically rigorous setting for mechanistic interpretability where the target algorithmic ground truth is known. By training Transformers on tasks requiring distinct geometries (linear shifted means vs. nonlinear variance estimation), we demonstrate that the models approximate the Bayes-optimal sufficient statistics from context up to some monotonic transformation, matching the performance of an ideal oracle estimator in nonlinear regimes. Leveraging this analytical ground truth, mechanistic analysis via logit lens and circuit alignment suggests that the model does not rely on a fixed kernel smoothing heuristic. Instead, it appears to adapt the point at which decisions become linearly decodable: exhibiting patterns consistent with a voting-style ensemble for linear tasks while utilizing a deeper sequential computation for nonlinear tasks. These findings suggest that ICL emerges from the construction of task-adaptive statistical estimators rather than simple similarity matching.

[453] HAPEns: Hardware-Aware Post-Hoc Ensembling for Tabular Data

Jannis Maier, Lennart Purucker

Main category: cs.LG

TL;DR: HAPEns is a post-hoc ensembling method for tabular data that optimizes both predictive performance and hardware efficiency using multi-objective optimization to find Pareto-optimal ensembles.

Details

Motivation: Traditional ensembling methods for tabular data improve predictive performance but increase hardware demands significantly. There's a need for methods that explicitly balance accuracy against hardware efficiency, especially for deployment scenarios where resource constraints matter.

Method: HAPEns uses multi-objective and quality diversity optimization to construct diverse ensembles along the Pareto front of predictive performance vs. resource usage. It explores trade-offs between accuracy and hardware efficiency, with memory usage identified as a particularly effective objective metric.

Result: Experiments on 83 tabular classification datasets show HAPEns significantly outperforms baselines, finding superior trade-offs between ensemble performance and deployment cost. The method also improves greedy ensembling algorithms with static multi-objective weighting.

Conclusion: HAPEns provides an effective approach for hardware-aware ensembling on tabular data, demonstrating that explicit optimization of hardware efficiency alongside accuracy yields practical benefits for real-world deployment.

Abstract: Ensembling is commonly used in machine learning on tabular data to boost predictive performance and robustness, but larger ensembles often lead to increased hardware demand. We introduce HAPEns, a post-hoc ensembling method that explicitly balances accuracy against hardware efficiency. Inspired by multi-objective and quality diversity optimization, HAPEns constructs a diverse set of ensembles along the Pareto front of predictive performance and resource usage. Existing hardware-aware post-hoc ensembling baselines are not available, highlighting the novelty of our approach. Experiments on 83 tabular classification datasets show that HAPEns significantly outperforms baselines, finding superior trade-offs for ensemble performance and deployment cost. Ablation studies also reveal that memory usage is a particularly effective objective metric. Further, we show that even a greedy ensembling algorithm can be significantly improved in this task with a static multi-objective weighting scheme.

[454] Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences

Jiarui Cao, Zixuan Wei, Yuxin Liu

Main category: cs.LG

TL;DR: Theoretical framework connecting drifting models to Wasserstein gradient flows under kernel density estimation, with extensions to MMD-based generators and mixed-divergence strategies for avoiding mode issues.

Details

Motivation: To establish a precise mathematical framework connecting drifting generative models to Wasserstein gradient flows, providing theoretical grounding for these models and addressing issues like mode collapse and mode blurring through mixed-divergence strategies.

Method: Develops a mathematical framework showing equivalence between drifting models and Wasserstein gradient flows of forward KL divergence under KDE approximation. Extends to MMD-based generators and proposes mixed-divergence strategy combining reverse KL and χ² divergence gradient flows, with Riemannian manifold extensions.

Result: Proves drifting field equals scaled difference of KDE log-density gradients, establishing equivalence to Wasserstein-2 gradient flow. Provides identifiability proof and demonstrates mixed-divergence approach can avoid mode collapse and blurring. Preliminary experiments on synthetic benchmarks validate framework.

Conclusion: Establishes theoretical foundation for drifting models as Wasserstein gradient flows under KDE approximation, provides mixed-divergence strategy for better generative modeling, and extends framework to Riemannian manifolds for semantic space applications.

Abstract: We reveal a precise mathematical framework about a new family of generative models which we call Gradient Flow Drifting. With this framework, we prove an equivalence between the recently proposed Drifting Model and the Wasserstein gradient flow of the forward KL divergence under kernel density estimation (KDE) approximation. Specifically, we prove that the drifting field of drifting model (arXiv:2602.04770) equals, up to a bandwidth-squared scaling factor, the difference of KDE log-density gradients $\nabla \log p_{\mathrm{kde}} - \nabla \log q_{\mathrm{kde}}$, which is exactly the particle velocity field of the Wasserstein-2 gradient flow of $KL(q|p)$ with KDE-approximated densities. Besides that, this broad family of generative models can also include MMD-based generators, which arises as special cases of Wasserstein gradient flows of different divergences under KDE approximation. We provide a concise identifiability proof, and a theoretically grounded mixed-divergence strategy. We combine reverse KL and $χ^2$ divergence gradient flows to simultaneously avoid mode collapse and mode blurring, and extend this method onto Riemannian manifold which loosens the constraints on the kernel function, and makes this method more suitable for the semantic space. Preliminary experiments on synthetic benchmarks validate the framework.

[455] Reinforcement Learning with Conditional Expectation Reward

Changyi Xiao, Caijun Xu, Yixin Cao

Main category: cs.LG

TL;DR: CER (Conditional Expectation Reward) uses LLMs as implicit verifiers for reinforcement learning, eliminating need for handcrafted verification rules and enabling soft reward signals for general reasoning tasks.

Details

Motivation: RLVR (Reinforcement Learning with Verifiable Rewards) works well for domains like mathematics with rule-based verifiers, but fails for general reasoning with free-form answers where valid answers vary significantly and handcrafted rules are incomplete/inaccurate.

Method: Proposes Conditional Expectation Reward (CER) that uses the LLM itself as an implicit verifier. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer, providing soft graded rewards instead of binary feedback.

Result: CER is effective across a wide range of reasoning tasks spanning both mathematical and general domains, demonstrating it serves as a flexible and general verification mechanism without external verifiers or auxiliary models.

Conclusion: CER addresses limitations of RLVR by using LLMs as implicit verifiers, enabling reinforcement learning for general reasoning domains with free-form answers through soft reward signals.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.

[456] Spatio-Temporal Attention Graph Neural Network: Explaining Causalities With Attention

Kosti Koistinen, Kirsi Hellsten, Joni Herttuainen, Kimmo K. Kaski

Main category: cs.LG

TL;DR: STA-GNN: A spatio-temporal attention graph neural network for explainable, unsupervised anomaly detection in industrial control systems that handles multiple data modalities and addresses baseline drift.

Details

Motivation: Industrial Control Systems face growing cyber-physical threats, but existing ML-based anomaly detection approaches suffer from poor explainability, high false-positive rates, and sensitivity to evolving system behavior (baseline drifting).

Method: Proposes a Spatio-Temporal Attention Graph Neural Network (STA-GNN) that models sensors, controllers, and network entities as nodes in a dynamically learned graph to capture temporal dynamics and relational structure. Uses attention mechanisms for explainability and incorporates conformal prediction to control false alarm rates and handle environmental drift.

Result: The approach enables unified cyber-physical analysis by supporting multiple data modalities (SCADA measurements, network flow features, payload features) and provides explainable anomaly detection with controlled false alarm rates.

Conclusion: Highlights the importance of explainable, drift-aware evaluation for reliable deployment of learning-based security monitoring systems in ICS, emphasizing both possibilities and limitations of model evaluation.

Abstract: Industrial Control Systems (ICS) underpin critical infrastructure and face growing cyber-physical threats due to the convergence of operational technology and networked environments. While machine learning-based anomaly detection approaches in ICS shows strong theoretical performance, deployment is often limited by poor explainability, high false-positive rates, and sensitivity to evolving system behavior, i.e., baseline drifting. We propose a Spatio-Temporal Attention Graph Neural Network (STA-GNN) for unsupervised and explainable anomaly detection in ICS that models both temporal dynamics and relational structure of the system. Sensors, controllers, and network entities are represented as nodes in a dynamically learned graph, enabling the model to capture inter-dependencies across physical processes and communication patterns. Attention mechanisms provide influential relationships, supporting inspection of correlations and potential causal pathways behind detected events. The approach supports multiple data modalities, including SCADA point measurements, network flow features, and payload features, and thus enables unified cyber-physical analysis. To address operational requirements, we incorporate a conformal prediction strategy to control false alarm rates and monitor performance degradation under drifting of the environment. Our findings highlight the possibilities and limitations of model evaluation and common pitfalls in anomaly detection in ICS. Our findings emphasise the importance of explainable, drift-aware evaluation for reliable deployment of learning-based security monitoring systems.

[457] Surrogate models for nuclear fusion with parametric Shallow Recurrent Decoder Networks: applications to magnetohydrodynamics

M. Lo Verso, C. Introini, E. Cervi, L. Savoldi, J. N. Kutz, A. Cammi

Main category: cs.LG

TL;DR: SHRED neural network combines SVD dimensionality reduction with sparse sensor data to reconstruct full MHD states in nuclear fusion systems, enabling efficient surrogate modeling for real-time monitoring.

Details

Motivation: MHD modeling for nuclear fusion systems is computationally expensive, especially for multi-query, parametric, or real-time applications. There's a need for efficient surrogate models that can reconstruct full spatio-temporal states from limited measurements.

Method: Combines Singular Value Decomposition (SVD) for dimensionality reduction with SHallow REcurrent Decoder (SHRED) neural network. Uses sparse temperature measurements from three sensors to reconstruct full velocity, pressure, and temperature fields. Tests robustness with 30 random sensor configurations.

Result: SHRED accurately reconstructs full MHD states even for magnetic field intensities not included in training. Demonstrates computational efficiency and robustness to sensor placement variations.

Conclusion: SHRED shows potential as an efficient surrogate modeling strategy for fusion-relevant multiphysics problems, enabling low-cost state estimation suitable for real-time monitoring and control applications.

Abstract: Magnetohydrodynamic (MHD) effects play a key role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid metals or molten salts in reactor blankets) interact with magnetic fields of varying intensity and orientation, which affect the resulting flow. The numerical resolution of MHD models involves highly nonlinear multiphysics systems of equations and can become computationally expensive, particularly in multi-query, parametric, or real-time contexts. This work investigates a fully data-driven framework for MHD state reconstruction that combines dimensionality reduction via Singular Value Decomposition (SVD) with the SHallow REcurrent Decoder (SHRED), a neural network architecture designed to recover the full spatio-temporal state from sparse time-series measurements of a limited number of observables. The methodology is applied to a parametric MHD test case involving compressible lead-lithium flow in a stepped channel subjected to thermal gradients and magnetic fields spanning a broad range of intensities. To improve efficiency, the full-order dataset is first compressed using SVD, yielding a reduced representation used as reference truth for training. Only temperature measurements from three sensors are provided as input, while the network reconstructs the full fields of velocity, pressure, and temperature. To assess robustness with respect to sensor placement, thirty randomly generated sensor configurations are tested in ensemble mode. Results show that SHRED accurately reconstructs the full MHD state even for magnetic field intensities not included in the training set. These findings demonstrate the potential of SHRED as a computationally efficient surrogate modeling strategy for fusion-relevant multiphysics problems, enabling low-cost state estimation with possible applications in real-time monitoring and control.

[458] Contract And Conquer: How to Provably Compute Adversarial Examples for a Black-Box Model?

Anna Chistyakova, Mikhail Pautov

Main category: cs.LG

TL;DR: CAC (Contract And Conquer) is a provable black-box adversarial attack method that uses knowledge distillation on an expanding dataset and precise search space contraction to guarantee adversarial example generation for neural networks.

Details

Motivation: Existing black-box adversarial attacks are empirically effective but lack guarantees of finding adversarial examples for specific models. There's a need for provable methods that can guarantee adversarial example generation in black-box settings.

Method: CAC uses knowledge distillation to create a surrogate model of the black-box target, iteratively expanding the distillation dataset while contracting the adversarial search space. The approach combines distillation with precise space contraction to guarantee transferability.

Result: CAC outperforms state-of-the-art black-box attack methods on ImageNet for various target models including vision transformers, with provable guarantees of finding adversarial examples within a fixed number of iterations.

Conclusion: CAC provides a provable approach to black-box adversarial attacks with transferability guarantees, offering both theoretical soundness and empirical effectiveness across different model architectures.

Abstract: Black-box adversarial attacks are widely used as tools to test the robustness of deep neural networks against malicious perturbations of input data aimed at a specific change in the output of the model. Such methods, although they remain empirically effective, usually do not guarantee that an adversarial example can be found for a particular model. In this paper, we propose Contract And Conquer (CAC), an approach to provably compute adversarial examples for neural networks in a black-box manner. The method is based on knowledge distillation of a black-box model on an expanding distillation dataset and precise contraction of the adversarial example search space. CAC is supported by the transferability guarantee: we prove that the method yields an adversarial example for the black-box model within a fixed number of algorithm iterations. Experimentally, we demonstrate that the proposed approach outperforms existing state-of-the-art black-box attack methods on ImageNet dataset for different target models, including vision transformers.

[459] Riemannian MeanFlow for One-Step Generation on Manifolds

Zichen Zhong, Haoliang Sun, Yukun Zhao, Yongshun Gong, Yilong Yin

Main category: cs.LG

TL;DR: Riemannian MeanFlow (RMF) extends flow matching to manifolds for efficient one-step generation without trajectory simulation, using parallel transport and tangent space representations.

Details

Motivation: Existing flow matching methods on Riemannian manifolds still require numerical integration for sampling, which is computationally expensive. The authors aim to develop a more efficient approach that enables one-step generation on manifolds without trajectory simulation.

Method: RMF extends MeanFlow to manifold-valued generation by defining an average-velocity field via parallel transport and deriving a Riemannian MeanFlow identity. They use a log-map tangent representation to avoid trajectory simulation and heavy geometric computations. For stable optimization, they decompose the objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. The method also supports conditional generation via classifier-free guidance.

Result: Experiments on spheres, tori, and SO(3) manifolds demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost compared to existing methods.

Conclusion: Riemannian MeanFlow provides an efficient framework for manifold-valued generation that enables one-step sampling without numerical integration, offering better computational efficiency while maintaining generation quality.

Abstract: Flow Matching enables simulation-free training of generative models on Riemannian manifolds, yet sampling typically still relies on numerically integrating a probability-flow ODE. We propose Riemannian MeanFlow (RMF), extending MeanFlow to manifold-valued generation where velocities lie in location-dependent tangent spaces. RMF defines an average-velocity field via parallel transport and derives a Riemannian MeanFlow identity that links average and instantaneous velocities for intrinsic supervision. We make this identity practical in a log-map tangent representation, avoiding trajectory simulation and heavy geometric computations. For stable optimization, we decompose the RMF objective into two terms and apply conflict-aware multi-task learning to mitigate gradient interference. RMF also supports conditional generation via classifier-free guidance. Experiments on spheres, tori, and SO(3) demonstrate competitive one-step sampling with improved quality-efficiency trade-offs and substantially reduced sampling cost.

[460] Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

Yujie Zheng, Zhuo Li, Shengtao Zhang, Hanjing Wang, Junjie Sheng, Jiaqian Wang, Junchi Yan, Weinan Zhang, Ying Wen, Bo Tang, Muning Wen

Main category: cs.LG

TL;DR: EvoKernel: A self-evolving agentic framework that enables large language models to synthesize high-performance kernels for data-scarce programming domains like NPU programming through value-driven experience accumulation and iterative refinement.

Details

Motivation: Large language models struggle with kernel synthesis in data-scarce programming domains like emerging Domain-Specific Architectures (NPUs) due to limited training data, causing catastrophic performance drops compared to data-rich platforms like CUDA.

Method: EvoKernel formulates kernel synthesis as memory-based reinforcement learning with a novel value-driven retrieval mechanism that learns stage-specific Q-values to prioritize experiences for bootstrapping drafts or refining latency, plus cross-task memory sharing for generalization.

Result: On an NPU variant of KernelBench, EvoKernel improved frontier models’ correctness from 11.0% to 83.0% and achieved median speedup of 3.60x over initial drafts through iterative refinement.

Conclusion: Value-guided experience accumulation enables general-purpose models to master kernel synthesis in niche hardware ecosystems without expensive fine-tuning, overcoming the cold-start barrier in data-scarce domains.

Abstract: Deploying Large Language Models to data-scarce programming domains poses significant challenges, particularly for kernel synthesis on emerging Domain-Specific Architectures where a “Data Wall” limits available training data. While models excel on data-rich platforms like CUDA, they suffer catastrophic performance drops on data-scarce ecosystems such as NPU programming. To overcome this cold-start barrier without expensive fine-tuning, we introduce EvoKernel, a self-evolving agentic framework that automates the lifecycle of kernel synthesis from initial drafting to continual refining. EvoKernel addresses this by formulating the synthesis process as a memory-based reinforcement learning task. Through a novel value-driven retrieval mechanism, it learns stage-specific Q-values that prioritize experiences based on their contribution to the current objective, whether bootstrapping a feasible draft or iteratively refining latency. Furthermore, by enabling cross-task memory sharing, the agent generalizes insights from simple to complex operators. By building an NPU variant of KernelBench and evaluating on it, EvoKernel improves frontier models’ correctness from 11.0% to 83.0% and achieves a median speedup of 3.60x over initial drafts through iterative refinement. This demonstrates that value-guided experience accumulation allows general-purpose models to master the kernel synthesis task on niche hardware ecosystems. Our official page is available at https://evokernel.zhuo.li.

[461] Beyond Accuracy: Reliability and Uncertainty Estimation in Convolutional Neural Networks

Sanne Ruijs, Alina Kosiakova, Farrukh Javed

Main category: cs.LG

TL;DR: Comparison of Bayesian (Monte Carlo Dropout) vs Conformal Prediction for uncertainty quantification in CNNs on Fashion-MNIST, finding trade-offs between accuracy and calibration.

Details

Motivation: DNNs often have poor calibration despite high accuracy, assigning overly confident probabilities to wrong predictions. Need for reliable uncertainty estimation mechanisms in deep learning systems.

Method: Compare two uncertainty quantification approaches: Bayesian approximation via Monte Carlo Dropout and nonparametric Conformal Prediction framework. Use two CNN architectures (H-CNN VGG16 and GoogLeNet) trained on Fashion-MNIST dataset.

Result: H-CNN VGG16 achieves higher predictive accuracy but exhibits pronounced overconfidence. GoogLeNet yields better-calibrated uncertainty estimates. Conformal Prediction demonstrates consistent validity by producing statistically guaranteed prediction sets.

Conclusion: Importance of evaluating model performance beyond accuracy alone. Contributes to development of more reliable and trustworthy deep learning systems with better uncertainty quantification.

Abstract: Deep neural networks (DNNs) have become integral to a wide range of scientific and practical applications due to their flexibility and strong predictive performance. Despite their accuracy, however, DNNs frequently exhibit poor calibration, often assigning overly confident probabilities to incorrect predictions. This limitation underscores the growing need for integrated mechanisms that provide reliable uncertainty estimation. In this article, we compare two prominent approaches for uncertainty quantification: a Bayesian approximation via Monte Carlo Dropout and the nonparametric Conformal Prediction framework. Both methods are assessed using two convolutional neural network architectures; H-CNN VGG16 and GoogLeNet, trained on the Fashion-MNIST dataset. The empirical results show that although H-CNN VGG16 attains higher predictive accuracy, it tends to exhibit pronounced overconfidence, whereas GoogLeNet yields better-calibrated uncertainty estimates. Conformal Prediction additionally demonstrates consistent validity by producing statistically guaranteed prediction sets, highlighting its practical value in high-stakes decision-making contexts. Overall, the findings emphasize the importance of evaluating model performance beyond accuracy alone and contribute to the development of more reliable and trustworthy deep learning systems.

[462] $V_{0.5}$: Generalist Value Model as a Prior for Sparse RL Rollouts

Yi-Kai Zhang, Yueqing Sun, Hongyan Hao, Qi Gu, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye

Main category: cs.LG

TL;DR: V_{0.5} is a robust advantage baseline for RLVR that adaptively fuses pre-trained value model predictions with empirical rollout means, using statistical testing to balance bias and variance for stable policy gradients.

Details

Motivation: In Reinforcement Learning with Verifiable Rewards (RLVR), constructing robust advantage baselines is critical for policy gradients. Existing approaches either rely on synchronous value model updates (computationally expensive) or sparse rollouts (high variance). There's a need for a baseline that balances computational efficiency with low variance while handling systematic bias in value model predictions.

Method: Proposes V_{0.5} which adaptively fuses baseline predictions from pre-trained Generalist Value Models (like V_0) with empirical means from sparse rollouts. Introduces real-time statistical testing and dynamic budget allocation: constructs hypothesis tests to evaluate the value model prior’s reliability, then dynamically allocates additional rollout budget on demand to minimize Mean Squared Error of the baseline estimator.

Result: Extensive evaluations across six mathematical reasoning benchmarks show V_{0.5} significantly outperforms GRPO and DAPO, achieving faster convergence and over 10% performance improvement. The method maintains stable policy gradients even under extreme sparsity with group size of 4.

Conclusion: V_{0.5} provides an effective solution for robust advantage baselines in RLVR by adaptively balancing value model priors with empirical evidence through statistical testing and dynamic budget allocation, achieving superior performance and convergence compared to existing methods.

Abstract: In Reinforcement Learning with Verifiable Rewards (RLVR), constructing a robust advantage baseline is critical for policy gradients, effectively guiding the policy model to reinforce desired behaviors. Recent research has introduced Generalist Value Models (such as $V_0$), which achieve pre-trained value estimation by explicitly encoding model capabilities in-context, eliminating the need to synchronously update the value model alongside the policy model. In this paper, we propose $V_{0.5}$, which adaptively fuses the baseline predicted by such value model (acting as a prior) with the empirical mean derived from sparse rollouts. This constructs a robust baseline that balances computational efficiency with extremely low variance. Specifically, we introduce a real-time statistical testing and dynamic budget allocation. This balances the high variance caused by sparse sampling against the systematic bias (or hallucinations) inherent in the value model’s prior. By constructing a hypothesis test to evaluate the prior’s reliability in real-time, the system dynamically allocates additional rollout budget on demand. This mechanism minimizes the baseline estimator’s Mean Squared Error (MSE), guaranteeing stable policy gradients, even under extreme sparsity with a group size of 4. Extensive evaluations across six mathematical reasoning benchmarks demonstrate that $V_{0.5}$ significantly outperforms GRPO and DAPO, achieving faster convergence and over some 10% performance improvement.

[463] A Grammar of Machine Learning Workflows

Simon Roth

Main category: cs.LG

TL;DR: A grammar-based structural approach to prevent data leakage in supervised learning by decomposing the lifecycle into 7 kernel primitives with runtime constraints that reject the most damaging leakage classes.

Details

Motivation: Data leakage affects hundreds of published papers across scientific fields, and current documentation-based approaches (checklists, guides) fail to prevent these failures. There's a need for structural solutions that enforce prevention at runtime rather than relying on documentation.

Method: Proposes a grammar that decomposes supervised learning lifecycle into 7 kernel primitives connected by a typed directed acyclic graph (DAG). Includes four hard constraints that reject the two most damaging leakage classes at call time, with core innovation being the terminal assess constraint - a runtime-enforced evaluate/assess boundary that prevents repeated test-set assessment.

Result: Companion study across 2,047 experimental instances shows selection leakage inflates performance by d_z = 0.93 and memorization leakage by d_z = 0.53-1.11. Three separate implementations (Python, R, Julia) confirm the claims. The approach successfully prevents data leakage at runtime.

Conclusion: Structural grammar-based approaches with runtime enforcement are more effective than documentation for preventing data leakage in supervised learning. The proposed method provides a practical solution that can be implemented across different programming languages.

Abstract: Data leakage affected 294 published papers across 17 scientific fields (Kapoor & Narayanan, 2023). The dominant response has been documentation: checklists, linters, best-practice guides. Documentation does not prevent these failures. This paper proposes a structural remedy: a grammar that decomposes the supervised learning lifecycle into 7 kernel primitives connected by a typed directed acyclic graph (DAG), with four hard constraints that reject the two most damaging leakage classes at call time. The grammar’s core contribution is the terminal assess constraint: a runtime-enforced evaluate/assess boundary where repeated test-set assessment is rejected by a guard on a nominally distinct Evidence type. A companion study across 2,047 experimental instances quantifies why this matters: selection leakage inflates performance by d_z = 0.93 and memorization leakage by d_z = 0.53-1.11. Three separate implementations (Python, R, and Julia) confirm the claims. The appendix specification lets anyone build a conforming version.

[464] TOSSS: a CVE-based Software Security Benchmark for Large Language Models

Marc Damie, Murat Bilgehan Ertan, Domenico Essoussi, Angela Makhanu, Gaëtan Peter, Roos Wensveen

Main category: cs.LG

TL;DR: TOSSS benchmark evaluates LLMs’ ability to distinguish secure vs vulnerable code snippets using CVE database, revealing security gaps in current models.

Details

Motivation: As LLMs become integral to software development workflows, there's a critical need to assess whether they introduce security vulnerabilities or weaken existing security efforts, given the heavy investments in cybersecurity.

Method: Introduces TOSSS (Two-Option Secure Snippet Selection) benchmark that measures LLM security by presenting models with pairs of secure and vulnerable code snippets from CVE database, giving security scores from 0-1 based on selection accuracy.

Result: Evaluation of 14 open-source and closed-source models on C/C++ and Java code shows security scores ranging from 0.48 to 0.89, indicating significant variation in security awareness.

Conclusion: TOSSS provides an extensible security benchmark that could complement existing LLM evaluation reports, highlighting the need for improved security awareness in code-generation LLMs.

Abstract: With their increasing capabilities, Large Language Models (LLMs) are now used across many industries. They have become useful tools for software engineers and support a wide range of development tasks. As LLMs are increasingly used in software development workflows, a critical question arises: are LLMs good at software security? At the same time, organizations worldwide invest heavily in cybersecurity to reduce exposure to disruptive attacks. The integration of LLMs into software engineering workflows may introduce new vulnerabilities and weaken existing security efforts. We introduce TOSSS (Two-Option Secure Snippet Selection), a benchmark that measures the ability of LLMs to choose between secure and vulnerable code snippets. Existing security benchmarks for LLMs cover only a limited range of vulnerabilities. In contrast, TOSSS relies on the CVE database and provides an extensible framework that can integrate newly disclosed vulnerabilities over time. Our benchmark gives each model a security score between 0 and 1 based on its behavior; a score of 1 indicates that the model always selects the secure snippet, while a score of 0 indicates that it always selects the vulnerable one. We evaluate 14 widely used open-source and closed-source models on C/C++ and Java code and observe scores ranging from 0.48 to 0.89. LLM providers already publish many benchmark scores for their models, and TOSSS could become a complementary security-focused score to include in these reports.

[465] CUPID: A Plug-in Framework for Joint Aleatoric and Epistemic Uncertainty Estimation with a Single Model

Xinran Xu, Xiuyi Fan

Main category: cs.LG

TL;DR: CUPID is a plug-in module for comprehensive uncertainty estimation in deep learning models that jointly estimates aleatoric and epistemic uncertainty without retraining base models.

Details

Motivation: Accurate uncertainty estimation is critical for high-stakes applications like medical diagnosis and autonomous systems, but existing methods often address only single types of uncertainty or require model retraining, limiting practical adoption.

Method: CUPID is a general-purpose module that can be inserted into any layer of pretrained networks. It models aleatoric uncertainty through learned Bayesian identity mapping and captures epistemic uncertainty by analyzing model responses to structured perturbations.

Result: CUPID demonstrates competitive performance across classification, regression, and out-of-distribution detection tasks while providing layer-wise insights into uncertainty origins.

Conclusion: CUPID makes uncertainty estimation modular, interpretable, and model-agnostic, supporting more transparent and trustworthy AI systems.

Abstract: Accurate estimation of uncertainty in deep learning is critical for deploying models in high-stakes domains such as medical diagnosis and autonomous decision-making, where overconfident predictions can lead to harmful outcomes. In practice, understanding the reason behind a model’s uncertainty and the type of uncertainty it represents can support risk-aware decisions, enhance user trust, and guide additional data collection. However, many existing methods only address a single type of uncertainty or require modifications and retraining of the base model, making them difficult to adopt in real-world systems. We introduce CUPID (Comprehensive Uncertainty Plug-in estImation moDel), a general-purpose module that jointly estimates aleatoric and epistemic uncertainty without modifying or retraining the base model. CUPID can be flexibly inserted into any layer of a pretrained network. It models aleatoric uncertainty through a learned Bayesian identity mapping and captures epistemic uncertainty by analyzing the model’s internal responses to structured perturbations. We evaluate CUPID across a range of tasks, including classification, regression, and out-of-distribution detection. The results show that it consistently delivers competitive performance while offering layer-wise insights into the origins of uncertainty. By making uncertainty estimation modular, interpretable, and model-agnostic, CUPID supports more transparent and trustworthy AI. Related code and data are available at https://github.com/a-Fomalhaut-a/CUPID.

[466] Prioritizing Gradient Sign Over Modulus: An Importance-Aware Framework for Wireless Federated Learning

Yiyang Yue, Jiacheng Yao, Wei Xu, Zhaohui Yang, George K. Karagiannidis, Dusit Niyato

Main category: cs.LG

TL;DR: SP-FL improves wireless federated learning by prioritizing transmission of gradient signs over moduli using hierarchical resource allocation, achieving better accuracy in resource-constrained scenarios.

Details

Motivation: Wireless FL faces unreliable communication due to limited resources, requiring methods to prioritize important gradient information for more efficient model training at the edge.

Method: Proposes Sign-Prioritized FL with hierarchical resource allocation: transmits gradient signs in individual packets, allows sign reuse if modulus recovery fails, formulates optimization problem for bandwidth/power allocation across devices and sign/modulus packets.

Result: SP-FL achieves up to 9.96% higher testing accuracy on CIFAR-10 compared to existing methods, especially effective in resource-constrained wireless scenarios.

Conclusion: Prioritizing gradient sign transmission through intelligent resource allocation significantly improves wireless FL performance under communication constraints.

Abstract: Wireless federated learning (FL) facilitates collaborative training of artificial intelligence (AI) models to support ubiquitous intelligent applications at the wireless edge. However, the inherent constraints of limited wireless resources inevitably lead to unreliable communication, which poses a significant challenge to wireless FL. To overcome this challenge, we propose Sign-Prioritized FL (SP-FL), a novel framework that improves wireless FL by prioritizing the transmission of important gradient information through uneven resource allocation. Specifically, recognizing the importance of descent direction in model updating, we transmit gradient signs in individual packets and allow their reuse for gradient descent if the remaining gradient modulus cannot be correctly recovered. To further improve the reliability of transmission of important information, we formulate a hierarchical resource allocation problem based on the importance disparity at both the packet and device levels, optimizing bandwidth allocation across multiple devices and power allocation between sign and modulus packets. To make the problem tractable, the one-step convergence behavior of SP-FL, which characterizes data importance at both levels in an explicit form, is analyzed. We then propose an alternating optimization algorithm to solve this problem using the Newton-Raphson method and successive convex approximation (SCA). Simulation results confirm the superiority of SP-FL, especially in resource-constrained scenarios, demonstrating up to 9.96% higher testing accuracy on the CIFAR-10 dataset compared to existing methods.

[467] Dynamics-Informed Deep Learning for Predicting Extreme Events

Eirini Katsidoniotaki, Themistoklis P. Sapsis

Main category: cs.LG

TL;DR: A data-driven framework for predicting extreme events in chaotic systems using interpretable precursors based on transient instability mechanisms, applied to turbulent flow.

Details

Motivation: Predicting extreme events in high-dimensional chaotic systems is challenging due to their rarity, intermittency, and complex transient mechanisms. Current approaches often rely on statistical associations rather than encoding the actual physical mechanisms driving extremes.

Method: Proposes a fully data-driven framework that constructs mechanism-aware precursors by tracking transient instabilities before event onset. Uses reduced-order formulation to compute FTLE-like precursors from state snapshots without governing equations. Employs adaptively evolving low-dimensional subspace spanned by Optimal Time-Dependent modes for efficient computation. These precursors are then fed into a Transformer-based model for forecasting extreme event observables.

Result: Demonstrated on Kolmogorov flow (canonical model of intermittent turbulence). Results show that explicitly encoding transient instability mechanisms substantially extends practical prediction horizons compared to baseline observable-based approaches.

Conclusion: The framework successfully enables long-lead prediction of extreme events by constructing interpretable, mechanism-aware precursors that capture transient instabilities, offering improved forecasting capabilities for chaotic dynamical systems.

Abstract: Predicting extreme events in high-dimensional chaotic dynamical systems remains a fundamental challenge, as such events are rare, intermittent, and arise from transient dynamical mechanisms that are difficult to infer from limited observations. Accordingly, real-time forecasting calls for precursors that encode the mechanisms driving extremes, rather than relying solely on statistical associations. We propose a fully data-driven framework for long-lead prediction of extreme events that constructs interpretable, mechanism-aware precursors by explicitly tracking transient instabilities preceding event onset. The approach leverages a reduced-order formulation to compute finite-time Lyapunov exponent (FTLE)-like precursors directly from state snapshots, without requiring knowledge of the governing equations. To avoid the prohibitive computational cost of classical FTLE computation, instability growth is evaluated in an adaptively evolving low-dimensional subspace spanned by Optimal Time-Dependent (OTD) modes, enabling efficient identification of transiently amplifying directions. These precursors are then provided as input to a Transformer-based model, enabling forecast of extreme event observables. We demonstrate the framework on Kolmogorov flow, a canonical model of intermittent turbulence. The results show that explicitly encoding transient instability mechanisms substantially extends practical prediction horizons compared to baseline observable-based approaches.

[468] AI-Enhanced Spatial Cellular Traffic Demand Prediction with Contextual Clustering and Error Correction for 5G/6G Planning

Mohamad Alkadamani, Colin Brown, Halim Yanikomeroglu

Main category: cs.LG

TL;DR: AI framework for cellular traffic demand prediction using geospatial data with spatial autocorrelation handling to prevent neighborhood leakage and improve spatial generalization for 5G/6G network planning.

Details

Motivation: Accurate spatial prediction of cellular traffic demand is crucial for 5G capacity planning and 6G network design, but spatial autocorrelation causes neighborhood leakage in traditional train/test splits, inflating accuracy metrics and reducing planning reliability.

Method: Proposes an AI-driven framework with context-aware two-stage splitting strategy and residual spatial error correction to reduce leakage and improve spatial generalization, using heterogeneous geospatial and socio-economic data layers.

Result: Experiments across five major Canadian cities show consistent mean absolute error (MAE) reductions compared to location-only clustering approaches, supporting more reliable bandwidth provisioning and spectrum planning.

Conclusion: The framework effectively addresses spatial autocorrelation issues in cellular traffic prediction, providing more reliable demand maps for 5G/6G network planning and spectrum management.

Abstract: Accurate spatial prediction of cellular traffic demand is essential for 5G NR capacity planning, network densification, and data-driven 6G planning. Although machine learning can fuse heterogeneous geospatial and socio-economic layers to estimate fine-grained demand maps, spatial autocorrelation can cause neighborhood leakage under naive train/test splits, inflating accuracy and weakening planning reliability. This paper presents an AI-driven framework that reduces leakage and improves spatial generalization via a context-aware two-stage splitting strategy with residual spatial error correction. Experiments using crowdsourced usage indicators across five major Canadian cities show consistent mean absolute error (MAE) reductions relative to location-only clustering, supporting more reliable bandwidth provisioning and evidence-based spectrum planning and sharing assessments.

[469] Protein Counterfactuals via Diffusion-Guided Latent Optimization

Weronika Kłos, Sidney Bender, Lukas Kades

Main category: cs.LG

TL;DR: MCCOP is a framework that computes minimal, biologically plausible protein sequence edits to flip model predictions to desired target states using a continuous joint sequence-structure latent space with diffusion model priors.

Details

Motivation: Current deep learning models predict protein properties accurately but lack mechanistic insight or actionable guidance for engineering improved variants. When models flag issues like antibody instability, engineers need specific mutation suggestions to fix problems while preserving function.

Method: MCCOP operates in a continuous joint sequence-structure latent space using a pretrained diffusion model as a manifold prior. It balances three objectives: validity (achieving target property), proximity (minimizing mutations), and plausibility (producing foldable proteins).

Result: MCCOP generates sparser, more plausible counterfactuals than both discrete and continuous baselines on three protein engineering tasks: GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery. Recovered mutations align with known biophysical mechanisms.

Conclusion: MCCOP serves as a tool for both model interpretation and hypothesis-driven protein design by providing actionable mutation suggestions that are biologically plausible and aligned with known mechanisms.

Abstract: Deep learning models can predict protein properties with unprecedented accuracy but rarely offer mechanistic insight or actionable guidance for engineering improved variants. When a model flags an antibody as unstable, the protein engineer is left without recourse: which mutations would rescue stability while preserving function? We introduce Manifold-Constrained Counterfactual Optimization for Proteins (MCCOP), a framework that computes minimal, biologically plausible sequence edits that flip a model’s prediction to a desired target state. MCCOP operates in a continuous joint sequence-structure latent space and employs a pretrained diffusion model as a manifold prior, balancing three objectives: validity (achieving the target property), proximity (minimizing mutations), and plausibility (producing foldable proteins). We evaluate MCCOP on three protein engineering tasks - GFP fluorescence rescue, thermodynamic stability enhancement, and E3 ligase activity recovery - and show that it generates sparser, more plausible counterfactuals than both discrete and continuous baselines. The recovered mutations align with known biophysical mechanisms, including chromophore packing and hydrophobic core consolidation, establishing MCCOP as a tool for both model interpretation and hypothesis-driven protein design. Our code is publicly available at github.com/weroks/mccop.

[470] Evaluating randomized smoothing as a defense against adversarial attacks in trajectory prediction

Julian F. Schumann, Eduardo Figueiredo, Frederik Baymler Mathiesen, Luca Laurenti, Jens Kober, Arkady Zgonnikov

Main category: cs.LG

TL;DR: Randomized smoothing defense improves robustness of trajectory prediction models against adversarial attacks without compromising accuracy in normal conditions.

Details

Motivation: Trajectory prediction models for autonomous driving are highly vulnerable to adversarial attacks, but effective countermeasures remain limited. The paper aims to develop a defense mechanism to improve model robustness against such attacks.

Method: The authors develop a defense mechanism based on randomized smoothing, previously successful in other domains. They evaluate different strategies of randomized smoothing through experiments on multiple base trajectory prediction models across various datasets.

Result: The approach consistently improves prediction robustness of multiple base models across different datasets without compromising accuracy in non-adversarial settings. Randomized smoothing proves effective as a simple and computationally inexpensive defense technique.

Conclusion: Randomized smoothing offers a practical and effective defense mechanism for trajectory prediction models against adversarial attacks, enhancing robustness while maintaining performance in normal conditions.

Abstract: Accurate and robust trajectory prediction is essential for safe and efficient autonomous driving, yet recent work has shown that even state-of-the-art prediction models are highly vulnerable to inputs being mildly perturbed by adversarial attacks. Although model vulnerabilities to such attacks have been studied, work on effective countermeasures remains limited. In this work, we develop and evaluate a new defense mechanism for trajectory prediction models based on randomized smoothing – an approach previously applied successfully in other domains. We evaluate its ability to improve model robustness through a series of experiments that test different strategies of randomized smoothing. We show that our approach can consistently improve prediction robustness of multiple base trajectory prediction models in various datasets without compromising accuracy in non-adversarial settings. Our results demonstrate that randomized smoothing offers a simple and computationally inexpensive technique for mitigating adversarial attacks in trajectory prediction.

[471] 6ABOS: An Open-Source Atmospheric Correction Framework for the EnMAP Hyperspectral Mission Based on 6S

Gabriel Caballero Cañas, Bárbara Alvado Arranz, Xavier Sòria-Perpinyà, Antonio Ruiz-Verdú, Jesús Delegido, José Moreno

Main category: cs.LG

TL;DR: 6ABOS is an open-source Python framework for atmospheric correction of EnMAP hyperspectral imagery over water bodies using 6S radiative transfer model and Google Earth Engine.

Details

Motivation: Accurate retrieval of surface reflectance over water bodies is challenging because water-leaving signals are small and easily obscured by atmospheric effects; existing methods need improvement for optically complex environments.

Method: Developed 6ABOS framework using 6S radiative transfer model for physically-based inversion accounting for Rayleigh scattering, aerosol interactions, and gaseous absorption; integrated automated metadata parsing with dynamic atmospheric parameter retrieval via Google Earth Engine API.

Result: Validation on two Mediterranean reservoirs showed high spectral similarity between in situ measurements and EnMAP-derived water-leaving reflectances with low Spectral Angle Mapper values (SAM < 10°).

Conclusion: 6ABOS provides a scalable, transparent, and reproducible open-science tool for hyperspectral aquatic research in the cloud-computing era, distributed via conda-forge.

Abstract: The Environmental Mapping and Analysis Program (EnMAP) mission has opened new frontiers in the monitoring of optically complex environments. However, the accurate retrieval of surface reflectance over water bodies remains a significant challenge, as the water-leaving signal typically accounts for only a small fraction of the total radiance, being easily obscured by atmospheric scattering and surface reflection effects. This paper introduces 6ABOS (6S-based Atmospheric Background Offset Subtraction), a novel open-source Python framework designed to automate the atmospheric correction (AC) of EnMAP hyperspectral imagery. By leveraging the Second Simulation of the Satellite Signal in the Solar Spectrum (6S) radiative transfer model, 6ABOS implements a physically-based inversion scheme that accounts for Rayleigh scattering, aerosol interactions, and gaseous absorption. The framework integrates automated EnMAP metadata parsing with dynamic atmospheric parameter retrieval via the Google Earth Engine (GEE) Application Programming Interface (API). Validation was conducted over two Mediterranean inland water reservoirs with contrasting trophic states: the oligotrophic Benag{’e}ber and the hypertrophic Bell{‘u}s. Results demonstrate a high degree of spectral similarity between in situ measurements and EnMAP-derived water-leaving reflectances. The Spectral Angle Mapper (SAM) values remained consistently low (SAM $<$ 10$^\circ$) across both study sites. 6ABOS is distributed via conda-forge, providing the scientific community with a scalable, transparent, and reproducible open-science tool for advancing hyperspectral aquatic research in the cloud-computing era.

[472] SNPgen: Phenotype-Supervised Genotype Representation and Synthetic Data Generation via Latent Diffusion

Andrea Lampis, Michela Carlotta Massi, Nicola Pirastu, Francesca Ieva, Matteo Matteucci, Emanuele Di Angelantonio

Main category: cs.LG

TL;DR: SNPgen is a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes that preserves privacy while maintaining statistical fidelity and downstream task utility for genomic analyses.

Details

Motivation: Genomic analyses require large individual-level genotype datasets, but strict data access restrictions impede sharing. Existing synthetic genotype methods either operate unconditionally (without phenotype alignment) or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility.

Method: Two-stage conditional latent diffusion framework: 1) GWAS-guided variant selection (1,024-2,048 trait-associated SNPs), 2) Variational autoencoder for genotype compression, and 3) Latent diffusion model conditioned on binary disease labels via classifier-free guidance.

Result: Evaluated on 458,724 UK Biobank individuals across four complex diseases. Models trained on synthetic data matched real-data predictive performance in train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods using 2-6× more variants. Privacy analysis showed zero identical matches, near-random membership inference (AUC ≈ 0.50), preserved linkage disequilibrium structure, and high allele frequency correlation (r ≥ 0.95).

Conclusion: SNPgen provides a privacy-preserving synthetic genotype generation method that maintains both statistical fidelity and downstream task utility for genomic analyses, addressing the limitations of existing unconditional or unsupervised approaches.

Abstract: Polygenic risk scores and other genomic analyses require large individual-level genotype datasets, yet strict data access restrictions impede sharing. Synthetic genotype generation offers a privacy-preserving alternative, but most existing methods operate unconditionally, producing samples without phenotype alignment, or rely on unsupervised compression, creating a gap between statistical fidelity and downstream task utility. We present SNPgen, a two-stage conditional latent diffusion framework for generating phenotype-supervised synthetic genotypes. SNPgen combines GWAS-guided variant selection (1,024-2,048 trait-associated SNPs) with a variational autoencoder for genotype compression and a latent diffusion model conditioned on binary disease labels via classifier-free guidance. Evaluated on 458,724 UK Biobank individuals across four complex diseases (coronary artery disease, breast cancer, type 1 and type 2 diabetes), models trained on synthetic data matched real-data predictive performance in a train-on-synthetic, test-on-real protocol, approaching genome-wide PRS methods that use $2$-$6\times$ more variants. Privacy analysis confirmed zero identical matches, near-random membership inference (AUC $\approx 0.50$), preserved linkage disequilibrium structure, and high allele frequency correlation ($r \geq 0.95$) with source data. A controlled simulation with known causal effects verified faithful recovery of the imposed genetic association structure.

[473] LAtte: Hyperbolic Lorentz Attention for Cross-Subject EEG Classification

Johannes Burchert, Ahmad Bdeir, Tom Hanika, Lars Schmidt-Thieme, Niels Landwehr

Main category: cs.LG

TL;DR: LAtte: A novel EEG classification framework using Lorentz Attention and InceptionTime encoder for cross-subject generalization, with shared baseline signals and subject-specific Lorentz adapters.

Details

Motivation: EEG classification is challenging due to low signal-to-noise ratio and high inter-subject variability. Most prior work focuses on single-subject performance, but real-world applications require models that generalize across subjects.

Method: Proposes LAtte framework with Lorentz Attention Module and InceptionTime-based encoder. Uses pretraining to learn shared baseline signals across subjects, then employs novel Lorentz low-rank adapters to learn subject-specific embeddings for individual differences.

Result: Evaluated on three well-established EEG datasets, achieving substantial improvement in performance over current state-of-the-art methods for cross-subject EEG classification.

Conclusion: LAtte enables robust and generalizable EEG classification across subjects, with potential applications in medical diagnostics and brain-computer interfaces through its shared model approach with subject-specific adaptation.

Abstract: Electroencephalogram (EEG) classification is critical for applications ranging from medical diagnostics to brain-computer interfaces, yet it remains challenging due to the inherently low signal-to-noise ratio (SNR) and high inter-subject variability. To address these issues, we propose LAtte, a novel framework that integrates a Lorentz Attention Module with an InceptionTime-based encoder to enable robust and generalizable EEG classification. Unlike prior work, which evaluates primarily on single-subject performance, LAtte focuses on cross-subject training. First, we learn a shared baseline signal across all subjects using pretraining tasks to capture common underlying patterns. Then, we utilize novel Lorentz low-rank adapters to learn subject-specific embeddings that model individual differences. This allows us to learn a shared model that performs robustly across subjects, and can be subsequently finetuned for individual subjects or used to generalize to unseen subjects. We evaluate LAtte on three well-established EEG datasets, achieving a substantial improvement in performance over current state-of-the-art methods.

[474] Continuous Diffusion Transformers for Designing Synthetic Regulatory Elements

Jonathan Liu, Kia Ghods

Main category: cs.LG

TL;DR: Parameter-efficient Diffusion Transformer (DiT) for generating cell-type-specific regulatory DNA sequences with improved training efficiency, reduced memorization, and enhanced regulatory activity prediction.

Details

Motivation: To develop a more efficient and effective method for generating regulatory DNA sequences using diffusion models, addressing issues of training efficiency, memorization, and regulatory activity prediction in existing approaches.

Method: Replaced U-Net backbone in DNA-Diffusion with transformer denoiser using 2D CNN input encoder, applied DDPO finetuning with Enformer as reward model, and conducted cross-validation against DRAKES.

Result: Achieved 60× fewer epochs to match U-Net’s best validation loss, 39% lower convergence, reduced memorization from 5.3% to 1.7%, and 38× improvement in predicted regulatory activity through DDPO finetuning.

Conclusion: The parameter-efficient Diffusion Transformer with CNN encoder significantly improves DNA sequence generation efficiency and quality while reducing memorization, with DDPO finetuning dramatically enhancing regulatory activity prediction.

Abstract: We present a parameter-efficient Diffusion Transformer (DiT) for generating 200bp cell-type-specific regulatory DNA sequences. By replacing the U-Net backbone of DNA-Diffusion with a transformer denoiser equipped with a 2D CNN input encoder, our model matches the U-Net’s best validation loss in 13 epochs (60$\times$ fewer) and converges 39% lower, while reducing memorization from 5.3% to 1.7% of generated sequences aligning to training data via BLAT. Ablations show the CNN encoder is essential: without it, validation loss increases 70% regardless of positional embedding choice. We further apply DDPO finetuning using Enformer as a reward model, achieving a 38$\times$ improvement in predicted regulatory activity. Cross-validation against DRAKES on an independent prediction task confirms that improvements reflect genuine regulatory signal rather than reward model overfitting.

[475] Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

Main category: cs.LG

TL;DR: DPS is a novel online prompt selection method for RL finetuning of LLMs that predicts informative prompts by modeling solving progress as a dynamical system, reducing computational overhead from extensive rollouts.

Details

Motivation: Current online prompt selection methods for RL finetuning of LLMs require extensive and computationally expensive rollouts over large candidate batches to identify informative samples, which can outweigh the finetuning process itself.

Method: Proposes Dynamics-Predictive Sampling (DPS) which models each prompt’s solving progress during RL finetuning as a dynamical system using a hidden Markov model. Uses online Bayesian inference on historical rollout reward signals to estimate evolving state distributions, providing predictive priors for efficient prompt selection without rollout-intensive filtering.

Result: Empirical results across diverse reasoning tasks (mathematics, planning, and visual geometry) show DPS substantially reduces redundant rollouts, accelerates training, and achieves superior reasoning performance.

Conclusion: DPS offers an efficient alternative to rollout-intensive prompt selection methods for RL finetuning of LLMs, enabling faster training with better performance by predicting informative prompts through learning dynamics modeling.

Abstract: Reinforcement learning (RL) finetuning has become a key technique for enhancing the reasoning abilities of large language models (LLMs). However, its effectiveness critically depends on the selection of training data. Recent advances underscore the importance of online prompt selection methods, which typically concentrate training on partially solved or moderately challenging examples under the current policy, thereby yielding more effective model updates. While significantly accelerating RL finetuning in terms of training steps, they also incur substantial computational overhead by requiring extensive LLM rollouts over large candidate batches to identify informative samples, an expense that can outweigh the finetuning process itself. To address this challenge, this work proposes Dynamics-Predictive Sampling (DPS), which online predicts and selects informative prompts by inferring their learning dynamics prior to costly rollouts. Specifically, we introduce a new perspective by modeling each prompt’s solving progress during RL finetuning as a dynamical system, where the extent of solving is represented as the state and the transition is characterized by a hidden Markov model. Using historical rollout reward signals, we perform online Bayesian inference to estimate evolving state distributions, and the inference outcome provides a predictive prior for efficient prompt selection without rollout-intensive filtering. Empirical results across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that DPS substantially reduces redundant rollouts, accelerates the training process, and achieves superior reasoning performance.

[476] LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Jinwoo Ahn, Ingyu Seong, Akhil Kedia, Junhan Kim, Hyemi Jang, Kangwook Lee, Yongkweon Jeon

Main category: cs.LG

TL;DR: LookaheadKV: A lightweight KV cache eviction framework that predicts importance scores without expensive draft generation, reducing overhead while maintaining accuracy for long-context LLMs.

Details

Motivation: KV caching in LLMs causes memory bottlenecks for long-context tasks. Existing eviction methods use draft generation to estimate importance scores, but this introduces substantial computational overhead that limits practical deployment.

Method: Proposes LookaheadKV which augments transformer layers with parameter-efficient modules trained to predict true importance scores directly, eliminating the need for explicit draft generation while maintaining accuracy.

Result: Outperforms competitive baselines on long-context understanding benchmarks across various models, reduces eviction cost by up to 14.5x, and achieves significantly faster time-to-first-token with negligible runtime overhead.

Conclusion: LookaheadKV provides an efficient solution to KV cache bottlenecks in long-context LLMs by eliminating the computational overhead of draft generation while maintaining high accuracy in importance score prediction.

Abstract: Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by “glimpsing into the future”, in which a draft generator produces a surrogate future response approximating the target model’s true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their practicality in real-world deployment. To address this challenge, we propose LookaheadKV, a lightweight eviction framework that leverages the strength of surrogate future response without requiring explicit draft generation. LookaheadKV augments transformer layers with parameter-efficient modules trained to predict true importance scores with high accuracy. Our design ensures negligible runtime overhead comparable to existing inexpensive heuristics, while achieving accuracy superior to more costly approximation methods. Extensive experiments on long-context understanding benchmarks, across a wide range of models, demonstrate that our method not only outperforms recent competitive baselines in various long-context understanding tasks, but also reduces the eviction cost by up to 14.5x, leading to significantly faster time-to-first-token. Our code is available at https://github.com/SamsungLabs/LookaheadKV.

[477] Ergodicity in reinforcement learning

Dominik Baumann, Erfaun Noorani, Arsenii Mustafin, Xinyi Sheng, Bert Verbruggen, Arne Vanhoyweghen, Vincent Ginis, Thomas B. Schön

Main category: cs.LG

TL;DR: Paper discusses limitations of expected value optimization in RL when reward processes are non-ergodic, and presents solutions for optimizing individual trajectory performance.

Details

Motivation: Standard RL optimizes expected value of sum of rewards, but this is uninformative for individual agent performance when reward processes are non-ergodic. The paper aims to address this gap.

Method: Uses instructive examples to demonstrate the problem, relates ergodic reward processes to ergodic Markov chains, and presents existing solutions for optimizing long-term individual trajectory performance under non-ergodic dynamics.

Result: Shows that expected value optimization fails for individual agent performance in non-ergodic settings, and presents alternative approaches that can handle such scenarios.

Conclusion: When reward processes are non-ergodic, standard RL objectives are inadequate for optimizing individual agent performance, requiring specialized approaches that consider trajectory-specific optimization.

Abstract: In reinforcement learning, we typically aim to optimize the expected value of the sum of rewards an agent collects over a trajectory. However, if the process generating these rewards is non-ergodic, the expected value, i.e., the average over infinitely many trajectories with a given policy, is uninformative for the average over a single, but infinitely long trajectory. Thus, if we care about how the individual agent performs during deployment, the expected value is not a good optimization objective. In this paper, we discuss the impact of non-ergodic reward processes on reinforcement learning agents through an instructive example, relate the notion of ergodic reward processes to more widely used notions of ergodic Markov chains, and present existing solutions that optimize long-term performance of individual trajectories under non-ergodic reward dynamics.

[478] Historical Consensus: Preventing Posterior Collapse via Iterative Selection of Gaussian Mixture Priors

Zegu Zhang, Jian Zhang

Main category: cs.LG

TL;DR: Historical Consensus Training prevents VAE posterior collapse by using multiple GMM priors and iterative selection to create stable parameter barriers that exclude collapsed solutions.

Details

Motivation: VAEs often suffer from posterior collapse where latent variables become uninformative. Existing approaches use architectural constraints or hyperparameter tuning to avoid collapse, but this paper proposes eliminating collapse entirely through a different mechanism.

Method: Introduces Historical Consensus Training: an iterative selection procedure that refines candidate GMM priors through alternating optimization and selection. Models are trained to satisfy multiple distinct clustering constraints, creating a historical barrier in parameter space that remains stable even with single-objective training.

Result: The method achieves non-collapsed representations regardless of decoder variance or regularization strength, works with arbitrary neural architectures, and requires no explicit stability conditions. Validated on synthetic and real-world datasets.

Conclusion: Historical Consensus Training provides a fundamentally different approach to preventing VAE posterior collapse by leveraging multiplicity of GMM clusterings to create stable parameter barriers that exclude collapsed solutions.

Abstract: Variational autoencoders (VAEs) frequently suffer from posterior collapse, where latent variables become uninformative and the approximate posterior degenerates to the prior. Recent work has characterized this phenomenon as a phase transition governed by the spectral properties of the data covariance matrix. In this paper, we propose a fundamentally different approach: instead of avoiding collapse through architectural constraints or hyperparameter tuning, we eliminate the possibility of collapse altogether by leveraging the multiplicity of Gaussian mixture model (GMM) clusterings. We introduce Historical Consensus Training, an iterative selection procedure that progressively refines a set of candidate GMM priors through alternating optimization and selection. The key insight is that models trained to satisfy multiple distinct clustering constraints develop a historical barrier – a region in parameter space that remains stable even when subsequently trained with a single objective. We prove that this barrier excludes the collapsed solution, and demonstrate through extensive experiments on synthetic and real-world datasets that our method achieves non-collapsed representations regardless of decoder variance or regularization strength. Our approach requires no explicit stability conditions (e.g., $σ^{\prime 2} < λ_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/historical-consensus-vae.

[479] NCAA Bracket Prediction Using Machine Learning and Combinatorial Fusion Analysis

Yuanhong Wu, Isaiah Smith, Tushar Marwah, Michael Schroeter, Mohamed Rahouti, D. Frank Hsu

Main category: cs.LG

TL;DR: Using Combinatorial Fusion Analysis to combine multiple ranking systems improves sports prediction accuracy over individual ranking methods

Details

Motivation: To enhance sports prediction accuracy by moving beyond traditional classification approaches and leveraging multiple ranking systems through combinatorial analysis

Method: Combinatorial Fusion Analysis (CFA) using rank-score characteristic functions and cognitive diversity to combine ten popular public ranking systems for team ranking

Result: Achieved 74.60% accuracy in team ranking prediction, outperforming the best individual ranking system (73.02%)

Conclusion: CFA provides an effective paradigm for improving sports prediction accuracy by combining multiple scoring systems through rank combination

Abstract: Machine learning models have demonstrated remarkable success in sports prediction in the past years, often treating sports prediction as a classification task within the field. This paper introduces new perspectives for analyzing sports data to predict outcomes more accurately. We leverage rankings to generate team rankings for the 2024 dataset using Combinatorial Fusion Analysis (CFA), a new paradigm for combining multiple scoring systems through the rank-score characteristic (RSC) function and cognitive diversity (CD). Our result based on rank combination with respect to team ranking has an accuracy rate of $74.60%$, which is higher than the best of the ten popular public ranking systems ($73.02%$). This exhibits the efficacy of CFA in enhancing the precision of sports prediction through different lens.

[480] Safe RLHF Beyond Expectation: Stochastic Dominance for Universal Spectral Risk Control

Yaswanth Chittepu, Ativ Joshi, Rajarshi Bhattacharjee, Scott Niekum

Main category: cs.LG

TL;DR: RAD introduces risk-sensitive RLHF using stochastic dominance constraints instead of expected costs, enabling better control over tail risks and rare catastrophic events through quantile-weighted FSD constraints that universally control spectral risk measures.

Details

Motivation: Standard RLHF uses expected cost constraints which only capture a single statistic of the cost distribution, failing to account for distributional uncertainty, heavy tails, and rare catastrophic events. This is problematic when robustness and risk sensitivity are critical for safety.

Method: Proposes Risk-sensitive Alignment via Dominance (RAD) replacing scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. Operationalizes this by comparing target policy’s cost distribution to reference policy using Optimal Transport framework with entropic regularization and Sinkhorn iterations for differentiable, computationally efficient optimization. Introduces quantile-weighted FSD constraints that universally control Spectral Risk Measures.

Result: RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.

Conclusion: RAD provides a principled framework for risk-sensitive alignment that better handles tail risks and distributional uncertainty compared to expectation-based constraints, with quantile weighting offering tunable risk profiles.

Abstract: Safe Reinforcement Learning from Human Feedback (RLHF) typically enforces safety through expected cost constraints, but the expectation captures only a single statistic of the cost distribution and fails to account for distributional uncertainty, particularly under heavy tails or rare catastrophic events. This limitation is problematic when robustness and risk sensitivity are critical. Stochastic dominance offers a principled alternative by comparing entire cost distributions rather than just their averages, enabling direct control over tail risks and potential out-of-distribution failures that expectation-based constraints may overlook. In this work, we propose Risk-sensitive Alignment via Dominance (RAD), a novel alignment framework that replaces scalar expected cost constraints with First-Order Stochastic Dominance (FSD) constraints. We operationalize this constraint by comparing the target policy’s cost distribution to that of a reference policy within an Optimal Transport (OT) framework, using entropic regularization and Sinkhorn iterations to obtain a differentiable and computationally efficient objective for stable end-to-end optimization. Furthermore, we introduce quantile-weighted FSD constraints and show that weighted FSD universally controls a broad class of Spectral Risk Measures (SRMs), so that improvements under weighted dominance imply guaranteed improvements in the corresponding spectral risk. This provides a principled mechanism for tuning a model’s risk profile via the quantile weighting function. Empirical results demonstrate that RAD improves harmlessness over baselines while remaining competitive in helpfulness, and exhibits greater robustness on out-of-distribution harmlessness evaluations.

[481] ECoLAD: Deployment-Oriented Evaluation for Automotive Time-Series Anomaly Detection

Kadir-Kaan Özer, René Ebeling, Markus Enzweiler

Main category: cs.LG

TL;DR: ECoLAD is a deployment-oriented evaluation protocol for time-series anomaly detection that assesses methods under constrained computational resources, focusing on predictable latency and stable behavior for in-vehicle monitoring applications.

Details

Motivation: Current anomaly detection evaluations focus on accuracy on workstation-class hardware, but in-vehicle monitoring requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards misrepresent which methods remain feasible under deployment-relevant constraints.

Method: ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps. It sweeps target scoring rates and reports coverage (fraction of entities meeting target) and best AUC-PR achievable among configurations satisfying the target.

Result: On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above random baseline across full throughput sweep. Several deep methods lose feasibility before they lose accuracy.

Conclusion: Deployment-oriented evaluation reveals important trade-offs between accuracy and computational feasibility that are missed by accuracy-only benchmarks, with lightweight classical methods showing better practical viability for constrained environments like in-vehicle monitoring.

Abstract: Time-series anomaly detectors are commonly compared on workstation-class hardware under unconstrained execution. In-vehicle monitoring, however, requires predictable latency and stable behavior under limited CPU parallelism. Accuracy-only leaderboards can therefore misrepresent which methods remain feasible under deployment-relevant constraints. We present ECoLAD (Efficiency Compute Ladder for Anomaly Detection), a deployment-oriented evaluation protocol instantiated as an empirical study on proprietary automotive telemetry (anomaly rate ${\approx}$0.022) and complementary public benchmarks. ECoLAD applies a monotone compute-reduction ladder across heterogeneous detector families using mechanically determined, integer-only scaling rules and explicit CPU thread caps, while logging every applied configuration change. Throughput-constrained behavior is characterized by sweeping target scoring rates and reporting (i) coverage (the fraction of entities meeting the target) and (ii) the best AUC-PR achievable among measured ladder configurations satisfying the target. On constrained automotive telemetry, lightweight classical detectors sustain both coverage and detection lift above the random baseline across the full throughput sweep. Several deep methods lose feasibility before they lose accuracy.

[482] Quantifying Membership Disclosure Risk for Tabular Synthetic Data Using Kernel Density Estimators

Rajdeep Pathak, Sayantee Jana

Main category: cs.LG

TL;DR: KDE-based method for quantifying membership inference attack risk in synthetic tabular data, outperforming baseline approaches without requiring shadow models.

Details

Motivation: Synthetic data is used for privacy preservation but remains vulnerable to membership inference attacks (MIAs) where adversaries can determine if specific individuals were in the training data. Current methods lack practical, effective ways to quantify this risk.

Method: Proposes kernel density estimators (KDEs) to model nearest-neighbor distances between synthetic data and training records, enabling probabilistic membership inference. Introduces two attack models: True Distribution Attack (with training data access) and Realistic Attack (using auxiliary data without true labels).

Result: Method consistently achieves higher F1 scores and sharper risk characterization than baseline across four real-world datasets and six synthetic data generators, without computationally expensive shadow models.

Conclusion: Provides practical framework for quantifying membership disclosure risk in synthetic data, enabling data custodians to conduct post-generation risk assessment before releasing synthetic datasets.

Abstract: The use of synthetic data has become increasingly popular as a privacy-preserving alternative to sharing real datasets, especially in sensitive domains such as healthcare, finance, and demography. However, the privacy assurances of synthetic data are not absolute, and remain susceptible to membership inference attacks (MIAs), where adversaries aim to determine whether a specific individual was present in the dataset used to train the generator. In this work, we propose a practical and effective method to quantify membership disclosure risk in tabular synthetic datasets using kernel density estimators (KDEs). Our KDE-based approach models the distribution of nearest-neighbour distances between synthetic data and the training records, allowing probabilistic inference of membership and enabling robust evaluation via ROC curves. We propose two attack models: a ‘True Distribution Attack’, which assumes privileged access to training data, and a more realistic, implementable ‘Realistic Attack’ that uses auxiliary data without true membership labels. Empirical evaluations across four real-world datasets and six synthetic data generators demonstrate that our method consistently achieves higher F1 scores and sharper risk characterization than a prior baseline approach, without requiring computationally expensive shadow models. The proposed method provides a practical framework and metric for quantifying membership disclosure risk in synthetic data, which enables data custodians to conduct a post-generation risk assessment prior to releasing their synthetic datasets for downstream use. The datasets and codes for this study are available at https://github.com/PyCoder913/MIA-KDE.

[483] When should we trust the annotation? Selective prediction for molecular structure retrieval from mass spectra

Mira Jürgens, Gaetan De Waele, Morteza Rakhshaninejad, Willem Waegeman

Main category: cs.LG

TL;DR: Selective prediction framework for molecular structure retrieval from mass spectrometry data that enables models to abstain when uncertain, with comprehensive evaluation of uncertainty quantification methods.

Details

Motivation: Current machine learning methods for molecular structure identification from tandem mass spectra have high error rates, which is problematic for high-stakes applications like clinical metabolomics and environmental screening where incorrect annotations can have serious consequences.

Method: Introduces a selective prediction framework within the risk-coverage tradeoff framework. Evaluates uncertainty quantification strategies at two levels: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. Compares scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, and distance-based measures in latent space.

Result: Fingerprint-level uncertainty scores are poor proxies for retrieval success, but computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs. Distribution-free risk control via generalization bounds allows practitioners to specify tolerable error rates and obtain subsets of annotations satisfying constraints with high probability.

Conclusion: Selective prediction with appropriate uncertainty quantification enables reliable molecular structure retrieval from MS/MS spectra, particularly important for high-stakes applications where prediction reliability is critical.

Abstract: Machine learning methods for identifying molecular structures from tandem mass spectra (MS/MS) have advanced rapidly, yet current approaches still exhibit significant error rates. In high-stakes applications such as clinical metabolomics and environmental screening, incorrect annotations can have serious consequences, making it essential to determine when a prediction can be trusted. We introduce a selective prediction framework for molecular structure retrieval from MS/MS spectra, enabling models to abstain from predictions when uncertainty is too high. We formulate the problem within the risk-coverage tradeoff framework and comprehensively evaluate uncertainty quantification strategies at two levels of granularity: fingerprint-level uncertainty over predicted molecular fingerprint bits, and retrieval-level uncertainty over candidate rankings. We compare scoring functions including first-order confidence measures, aleatoric and epistemic uncertainty estimates from second-order distributions, as well as distance-based measures in the latent space. All experiments are conducted on the MassSpecGym benchmark. Our analysis reveals that while fingerprint-level uncertainty scores are poor proxies for retrieval success, computationally inexpensive first-order confidence measures and retrieval-level aleatoric uncertainty achieve strong risk-coverage tradeoffs across evaluation settings. We demonstrate that by applying distribution-free risk control via generalization bounds, practitioners can specify a tolerable error rate and obtain a subset of annotations satisfying that constraint with high probability.

[484] Ranking Reasoning LLMs under Test-Time Scaling

Mohsen Hariri, Michael Hinczewski, Jing Ma, Vipin Chaudhary

Main category: cs.LG

TL;DR: Scorio is a library for ranking reasoning LLMs under test-time scaling using statistical methods like paired comparisons, IRT, and voting rules, validated on math benchmarks.

Details

Motivation: Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but there's a lack of formal methods for ranking models in this regime. Existing approaches need systematic evaluation and reliable statistical ranking methods.

Method: Introduces Scorio library implementing statistical ranking methods including paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Evaluates on 20 reasoning models across four Olympiad-style math benchmarks with up to 80 trials per prompt.

Result: Most full-trial rankings agree closely with Bayesian gold standard (mean Kendall’s τ_b = 0.93-0.95), with 19-34 methods recovering identical ordering. In single-trial regime, best methods reach τ_b ≈ 0.86. Using greedy decoding as empirical prior reduces variance by 16-52% at N=1 but can bias rankings when greedy and stochastic sampling disagree.

Conclusion: Identifies reliable ranking methods for both high- and low-budget test-time scaling. Scorio provides open-source tools for systematic model ranking under test-time scaling regimes.

Abstract: Test-time scaling evaluates reasoning LLMs by sampling multiple outputs per prompt, but ranking models in this regime remains underexplored. We formalize dense benchmark ranking under test-time scaling and introduce Scorio, a library that implements statistical ranking methods such as paired-comparison models, item response theory (IRT) models, voting rules, and graph- and spectral-based methods. Across $20$ reasoning models on four Olympiad-style math benchmarks (AIME'24, AIME'25, HMMT'25, and BrUMO'25; up to $N=80$ trials), most full-trial rankings agree closely with the Bayesian gold standard $\mathrm{Bayes}{\mathcal{U}}@80$ (mean Kendall’s $τ_b = 0.93$–$0.95$), and $19$–$34$ methods recover exactly the same ordering. In the single-trial regime, the best methods reach $τ_b \approx 0.86$. Using greedy decoding as an empirical prior ($\mathrm{Bayes}{\mathbf{R}_0}@N$) reduces variance at $N=1$ by $16$–$52%$, but can bias rankings when greedy and stochastic sampling disagree. These results identify reliable ranking methods for both high- and low-budget test-time scaling. We release Scorio as an open-source library at https://github.com/mohsenhariri/scorio.

[485] Neural Field Thermal Tomography: A Differentiable Physics Framework for Non-Destructive Evaluation

Tao Zhong, Yixun Hu, Dongzhe Zheng, Aditya Sood, Christine Allen-Blanchette

Main category: cs.LG

TL;DR: NeFTY is a differentiable physics framework for 3D thermal tomography that uses neural fields to parameterize material properties and enforces thermodynamic laws as hard constraints through a differentiable solver.

Details

Motivation: Traditional thermography methods use pixel-wise 1D approximations that neglect lateral heat diffusion, while Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness issues. There's a need for accurate 3D reconstruction of material properties from surface temperature measurements.

Method: NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. It uses a “discretize-then-optimize” paradigm with a differentiable physics solver that enforces thermodynamic laws as hard constraints while maintaining memory efficiency for high-resolution 3D tomography.

Result: Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baseline methods. The approach effectively mitigates spectral bias and ill-posedness inherent in inverse heat conduction problems.

Conclusion: NeFTY provides a novel differentiable physics framework for 3D thermal tomography that overcomes limitations of traditional methods and PINNs, enabling accurate recovery of subsurface defects at arbitrary scales through hard constraint enforcement of physical laws.

Abstract: We propose Neural Field Thermal Tomography (NeFTY), a differentiable physics framework for the quantitative 3D reconstruction of material properties from transient surface temperature measurements. While traditional thermography relies on pixel-wise 1D approximations that neglect lateral diffusion, and soft-constrained Physics-Informed Neural Networks (PINNs) often fail in transient diffusion scenarios due to gradient stiffness, NeFTY parameterizes the 3D diffusivity field as a continuous neural field optimized through a rigorous numerical solver. By leveraging a differentiable physics solver, our approach enforces thermodynamic laws as hard constraints while maintaining the memory efficiency required for high-resolution 3D tomography. Our discretize-then-optimize paradigm effectively mitigates the spectral bias and ill-posedness inherent in inverse heat conduction, enabling the recovery of subsurface defects at arbitrary scales. Experimental validation on synthetic data demonstrates that NeFTY significantly improves the accuracy of subsurface defect localization over baselines. Additional details at https://cab-lab-princeton.github.io/nefty/

[486] Bio-Inspired Self-Supervised Learning for Wrist-worn IMU Signals

Prithviraj Tarale, Kiet Chu, Abhishek Varghese, Kai-Chun Liu, Maxwell A Xu, Mohit Iyyer, Sunghoon I. Lee

Main category: cs.LG

TL;DR: A novel self-supervised learning approach for human activity recognition using wearable accelerometers, featuring a biologically-inspired tokenization strategy based on submovement theory to create movement segments as tokens for transformer pretraining.

Details

Motivation: Learning robust human-activity representations from wearable accelerometers is constrained by labeled data scarcity. Existing self-supervised approaches treat sensor streams as unstructured time series, overlooking the biological structure of human movement which is critical for effective HAR.

Method: Introduces a tokenization strategy grounded in submovement theory of motor control, treating movement segments (composed of finite sequences of submovements) as tokens. Pretrains a Transformer encoder via masked movement-segment reconstruction to model temporal dependencies beyond local waveform morphology.

Result: Pretrained on NHANES corpus (28k hours, 11k participants, 10M windows), the representations outperform strong wearable SSL baselines across six subject-disjoint HAR benchmarks and demonstrate stronger data efficiency in data-scarce settings.

Conclusion: The biologically-inspired tokenization approach enables more effective self-supervised learning for human activity recognition by leveraging the underlying structure of human movement, addressing data scarcity challenges in wearable sensing.

Abstract: Wearable accelerometers have enabled large-scale health and wellness monitoring, yet learning robust human-activity representations has been constrained by the scarcity of labeled data. While self-supervised learning offers a potential remedy, existing approaches treat sensor streams as unstructured time series, overlooking the underlying biological structure of human movement, a factor we argue is critical for effective Human Activity Recognition (HAR). We introduce a novel tokenization strategy grounded in the submovement theory of motor control, which posits that continuous wrist motion is composed of superposed elementary basis functions called submovements. We define our token as the movement segment, a unit of motion composed of a finite sequence of submovements that is readily extractable from wrist accelerometer signals. By treating these segments as tokens, we pretrain a Transformer encoder via masked movement-segment reconstruction to model the temporal dependencies of movement segments, shifting the learning focus beyond local waveform morphology. Pretrained on the NHANES corpus (approximately 28k hours; approximately 11k participants; approximately 10M windows), our representations outperform strong wearable SSL baselines across six subject-disjoint HAR benchmarks. Furthermore, they demonstrate stronger data efficiency in data-scarce settings. Code and pretrained weights will be made publicly available.

[487] FRIEND: Federated Learning for Joint Optimization of multi-RIS Configuration and Eavesdropper Intelligent Detection in B5G Networks

Maria Lamprini A. Bartsioka, Ioannis A. Bartsiokas, Anastasios K. Papazafeiropoulos, Maria A. Seimeni, Dimitra I. Kaklamani, Iakovos S. Venieris

Main category: cs.LG

TL;DR: Federated Learning-based malicious user detection framework for RIS-enhanced cell-free mmWave networks in IIoT, improving secrecy rates by 30% while preserving privacy.

Details

Motivation: As B5G systems evolve with cell-free mmWave and RIS for IIoT, securing these distributed environments against eavesdropping is challenging due to scalability and latency constraints of conventional security methods.

Method: Proposes a federated learning framework where edge devices collaboratively train a Deep Convolutional Neural Network on locally observed Channel State Information without raw data exchange. Incorporates RIS coordination and an early-exit mechanism for computational efficiency.

Result: Integration of FL and multi-RIS coordination improves secrecy rate by approximately 30% compared to baseline non-RIS-assisted methods while maintaining near-optimal detection accuracy.

Conclusion: Establishes a distributed, privacy-preserving approach to physical layer eavesdropping detection tailored for next-generation IIoT deployments in RIS-enhanced cell-free mmWave networks.

Abstract: As wireless systems evolve toward Beyond 5G (B5G), the adoption of cell-free (CF) millimeter-wave (mmWave) architectures combined with Reconfigurable Intelligent Surfaces (RIS) is emerging as a key enabler for ultra-reliable, high-capacity, scalable, and secure Industrial Internet of Things (IIoT) communications. However, safeguarding these complex and distributed environments against eavesdropping remains a critical challenge, particularly when conventional security mechanisms struggle to overcome scalability, and latency constraints. In this paper, a novel framework for detecting malicious users in RIS-enhanced cell-free mmWave networks using Federated Learning (FL) is presented. The envisioned setup features multiple access points (APs) operating without traditional cell boundaries, assisted by RIS nodes to dynamically shape the wireless propagation environment. Edge devices collaboratively train a Deep Convolutional Neural Network (DCNN) on locally observed Channel State Information (CSI), eliminating the need for raw data exchange. Moreover, an early-exit mechanism is incorporated in that model to jointly satisfy computational complexity requirements. Performance evaluation indicates that the integration of FL and multi-RIS coordination improves approximately 30% the achieved secrecy rate (SR) compared to baseline non-RIS-assisted methods while maintaining near-optimal detection accuracy levels. This work establishes a distributed, privacy-preserving approach to physical layer eavesdropping detection tailored for next-generation IIoT deployments.

[488] Federated Learning-driven Beam Management in LEO 6G Non-Terrestrial Networks

Maria Lamprini Bartsioka, Ioannis A. Bartsiokas, Athanasios D. Panagopoulos, Dimitra I. Kaklamani, Iakovos S. Venieris

Main category: cs.LG

TL;DR: FL-based beam selection in LEO satellite networks using HAPS, comparing MLP and GNN models for beam prediction accuracy.

Details

Motivation: Need efficient beam management in dynamic LEO NTNs, leveraging FL for distributed learning across orbital planes via HAPS infrastructure.

Method: Federated Learning approach with orbital planes as distributed learners using HAPS, evaluating MLP and GNN models on realistic channel/beamforming data.

Result: GNN outperforms MLP in beam prediction accuracy and stability, especially at low elevation angles, enabling lightweight intelligent beam management.

Conclusion: GNN-based FL enables effective beam selection for LEO NTNs, offering superior performance over traditional MLP approaches.

Abstract: Low Earth Orbit (LEO) Non-Terrestrial Networks (NTNs) require efficient beam management under dynamic propagation conditions. This work investigates Federated Learning (FL)-based beam selection in LEO satellite constellations, where orbital planes operate as distributed learners through the utilization of High-Altitude Platform Stations (HAPS). Two models, a Multi-Layer Perceptron (MLP) and a Graph Neural Network (GNN), are evaluated using realistic channel and beamforming data. Results demonstrate that GNN surpasses MLP in beam prediction accuracy and stability, particularly at low elevation angles, enabling lightweight and intelligent beam management for future NTN deployments.

[489] The Discrete Charm of the MLP: Binary Routing of Continuous Signals in Transformer Feed-Forward Layers

Peter Balogh

Main category: cs.LG

TL;DR: Transformer MLP layers implement binary routing of continuous signals using consensus architectures with default-ON neurons and exception handlers, creating functional switches that determine which tokens need nonlinear processing.

Details

Motivation: To understand the internal mechanisms of transformer MLP layers and how they process information, specifically investigating whether they implement binary routing decisions despite handling continuous signals.

Method: Analyzed GPT-2 Small (124M parameters) by examining neuron activations, identifying consensus architectures with default-ON neurons and exception handlers, performing cross-layer analysis, and validating routing functionality through ablation studies and comparison of binary vs. continuous features.

Result: Found specific binary routing switches with 93-98% mutual exclusivity, developmental arc across layers, functional validation showing 4x perplexity difference, and minimal information loss from binarization (79.2% vs 78.8% accuracy).

Conclusion: Transformer MLPs implement binary routing decisions about which tokens need nonlinear processing, explaining why smooth polynomial approximations fail and suggesting a routing characterization complements the piecewise-affine view of deep networks.

Abstract: We show that MLP layers in transformer language models perform binary routing of continuous signals: the decision of whether a token needs nonlinear processing is well-captured by binary neuron activations, even though the signals being routed are continuous. In GPT-2 Small (124M parameters), we find that specific neurons implement a consensus architecture – seven “default-ON” neurons and one exception handler (N2123 in Layer 11) that are 93-98% mutually exclusive – creating a binary routing switch. A cross-layer analysis reveals a developmental arc: early layers (L1-3) use single gateway neurons to route exceptions without consensus quorums; middle layers (L4-6) show diffuse processing with neither gateway nor consensus; and late layers (L7-11) crystallize full consensus/exception architectures with increasing quorum size (1 to 3 to 7 consensus neurons). Causal validation confirms the routing is functional: removing the MLP at consensus breakdown costs 43.3% perplexity, while at full consensus removing it costs only 10.1% – exceeding a 4x difference. Comparing binary vs. continuous features for the routing decision confirms that binarization loses essentially no information (79.2% vs. 78.8% accuracy), while continuous activations carry additional magnitude information (R^2 = 0.36 vs. 0.22). This binary routing structure explains why smooth polynomial approximation fails: cross-validated polynomial fits (degrees 2-7) never exceed R^2 = 0.06 for highly nonlinear layers. We propose that the well-established piecewise-affine characterization of deep networks can be complemented by a routing characterization: along the natural data manifold, the piecewise boundaries implement binary decisions about which tokens need nonlinear processing, routing continuous signals through qualitatively different computational paths.

[490] MCMC Informed Neural Emulators for Uncertainty Quantification in Dynamical Systems

Heikki Haario, Zhi-Song Liu, Martin Simon, Hendrik Weichel

Main category: cs.LG

TL;DR: Neural network surrogate for physical models with uncertainty quantification via MCMC sampling of model parameters as network input, enabling efficient emulation with same uncertainty as original model.

Details

Motivation: Traditional neural network surrogates for physical models struggle when exhaustive parameter sampling is computationally expensive or leads to unphysical parameter values. The paper addresses the challenge of incorporating uncertainty quantification without requiring accurate prior parameter distributions.

Method: Decouples uncertainty quantification from network architecture by introducing model-parameter distribution as input via Markov chain Monte Carlo (MCMC) sampling. Presents two approaches: 1) quantile emulator for prediction, and 2) novel autoencoder-based ODE network emulator that estimates different trajectory paths for different ODE parameters.

Result: The surrogate achieves the same uncertainty quantification as the underlying physical model with substantially reduced computation time. The approach is fully agnostic to neural network choice and includes mathematical analysis relating performance loss to measurable distribution mismatch.

Conclusion: The proposed method provides an efficient way to build neural network surrogates for physical models with proper uncertainty quantification, overcoming limitations of traditional parameter sampling approaches while maintaining model-agnostic flexibility.

Abstract: Neural networks are a commonly used approach to replace physical models with computationally cheap surrogates. Parametric uncertainty quantification can be included in training, assuming that an accurate prior distribution of the model parameters is available. Here we study the common opposite situation, where direct screening or random sampling of model parameters leads to exhaustive training times and evaluations at unphysical parameter values. Our solution is to decouple uncertainty quantification from network architecture. Instead of sampling network weights, we introduce the model-parameter distribution as an input to network training via Markov chain Monte Carlo (MCMC). In this way, the surrogate achieves the same uncertainty quantification as the underlying physical model, but with substantially reduced computation time. The approach is fully agnostic with respect to the neural network choice. In our examples, we present a quantile emulator for prediction and a novel autoencoder-based ODE network emulator that can flexibly estimate different trajectory paths corresponding to different ODE model parameters. Moreover, we present a mathematical analysis that provides a transparent way to relate potential performance loss to measurable distribution mismatch.

[491] Factorized Neural Implicit DMD for Parametric Dynamics

Siyuan Chen, Zhecheng Wang, Yixin Chen, Yue Chang, Peter Yichen Chen, Eitan Grinspun, Jonathan Panuelos

Main category: cs.LG

TL;DR: Physics-coded neural field learns Koopman operator spectral decomposition for stable long-term predictions of dynamical systems without explicit governing equations

Details

Motivation: Traditional numerical solvers for physical systems with high-dimensional state spaces and nonlinear dynamics are computationally expensive and ill-suited for real-time analysis and control, even when partial differential equations are available. There's a need for data-driven approaches that can learn parametric flows supporting long-horizon rollouts, generalization to unseen parameters, and spectral analysis.

Method: Proposes a physics-coded neural field parameterization of the Koopman operator’s spectral decomposition. Unlike physics-constrained neural fields (single solution surface) or neural operators (fixed time horizons), this model learns a factorized flow operator that decouples spatial modes and temporal evolution, exposing eigenvalues, modes, and stability of the underlying physical process.

Result: Demonstrates efficacy on a range of dynamics problems, showing accurate prediction of complex spatiotemporal phenomena while providing insights into system dynamic behavior through spectral analysis capabilities.

Conclusion: The physics-coded neural field approach enables stable long-term rollouts, interpolation across parameter spaces, and spectral analysis by learning the Koopman operator’s spectral decomposition, offering a powerful tool for modeling dynamical systems without explicit governing equations.

Abstract: A data-driven, model-free approach to modeling the temporal evolution of physical systems mitigates the need for explicit knowledge of the governing equations. Even when physical priors such as partial differential equations are available, such systems often reside in high-dimensional state spaces and exhibit nonlinear dynamics, making traditional numerical solvers computationally expensive and ill-suited for real-time analysis and control. Consider the problem of learning a parametric flow of a dynamical system: with an initial field and a set of physical parameters, we aim to predict the system’s evolution over time in a way that supports long-horizon rollouts, generalization to unseen parameters, and spectral analysis. We propose a physics-coded neural field parameterization of the Koopman operator’s spectral decomposition. Unlike a physics-constrained neural field, which fits a single solution surface, and neural operators, which directly approximate the solution operator at fixed time horizons, our model learns a factorized flow operator that decouples spatial modes and temporal evolution. This structure exposes underlying eigenvalues, modes, and stability of the underlying physical process to enable stable long-term rollouts, interpolation across parameter spaces, and spectral analysis. We demonstrate the efficacy of our method on a range of dynamics problems, showcasing its ability to accurately predict complex spatiotemporal phenomena while providing insights into the system’s dynamic behavior.

[492] Cross-Species Transfer Learning for Electrophysiology-to-Transcriptomics Mapping in Cortical GABAergic Interneurons

Theo Schwider, Ramin Ramezani

Main category: cs.LG

TL;DR: Reproduces and extends electrophysiology-to-transcriptomics framework using Allen Institute Patch-seq data from mouse and human cortex, focusing on GABAergic interneurons with attention-based BiLSTM models for cross-species transfer learning.

Details

Motivation: To replicate and extend the electrophysiology-to-transcriptomics framework for linking neuronal physiology to transcriptomic identity, particularly for GABAergic inhibitory interneurons across mouse and human species, enabling cross-species transfer learning.

Method: Used publicly available Allen Institute Patch-seq datasets from mouse visual cortex (3,699 neurons) and human neocortex (506 neurons). Applied standardized electrophysiological features with sparse PCA for class-level separation. Developed attention-based BiLSTM that operates directly on structured IPFX feature-family representation for interpretability. Evaluated cross-species transfer learning with pretraining on mouse data and fine-tuning on human data.

Result: Successfully reproduced major class-level separations in mouse data. Attention-based BiLSTM matched feature-engineered baselines in mouse data and provided feature-family-level interpretability. Cross-species transfer learning (mouse pretraining + human fine-tuning) improved human macro-F1 relative to human-only training baseline.

Conclusion: The study confirms reproducibility of the electrophysiology-to-transcriptomics pipeline, demonstrates sequence models can match feature-engineered baselines with interpretability, and shows mouse-to-human transfer learning provides measurable gains for human subclass prediction.

Abstract: Single-cell electrophysiological recordings provide a powerful window into neuronal functional diversity and offer an interpretable route for linking intrinsic physiology to transcriptomic identity. Here, we replicate and extend the electrophysiology-to-transcriptomics framework introduced by Gouwens et al. (2020) using publicly available Allen Institute Patch-seq datasets from both mouse and human cortex. We focus on GABAergic inhibitory interneurons to target a subclass structure (Lamp5, Pvalb, Sst, Vip) that is comparable and conserved across species. After quality control, we analyzed 3,699 mouse visual cortex neurons and 506 human neocortical neurons from neurosurgical resections. Using standardized electrophysiological features and sparse PCA, we reproduced the major class-level separations reported in the original mouse study. For supervised prediction, a class-balanced random forest provided a strong feature-engineered baseline in mouse data and a reduced but still informative baseline in human data. We then developed an attention-based BiLSTM that operates directly on the structured IPFX feature-family representation, avoiding sPCA and providing feature-family-level interpretability via learned attention weights. Finally, we evaluated a cross-species transfer setting in which the sequence model is pretrained on mouse data and fine-tuned on human data for an aligned 4-class task, improving human macro-F1 relative to a human-only training baseline. Together, these results confirm reproducibility of the Gouwens pipeline in mouse data, demonstrate that sequence models can match feature-engineered baselines, and show that mouse-to-human transfer learning can provide measurable gains for human subclass prediction.

[493] Leech Lattice Vector Quantization for Efficient LLM Compression

Tycho F. A. van der Ouderaa, Mart van Baalen, Paul Whatmough, Markus Nagel

Main category: cs.LG

TL;DR: Leech Lattice Vector Quantization (LLVQ) uses the optimal 24-dimensional Leech lattice for LLM quantization, achieving state-of-the-art compression without explicit codebook storage through efficient indexing and parallel dequantization.

Details

Motivation: Scalar quantization of LLMs faces information-theoretic limits, while vector quantization requires expensive codebook storage. The Leech lattice offers optimal high-dimensional packing properties that could overcome these limitations for practical LLM compression.

Method: Extends existing search algorithm based on extended Golay code to support: 1) indexing for bitstring conversion without materializing codebook, 2) angular search over union of Leech lattice shells, 3) fully-parallelizable dequantization kernel, resulting in LLVQ algorithm.

Result: LLVQ achieves state-of-the-art LLM quantization performance, outperforming recent methods like Quip#, QTIP, and PVQ, demonstrating the effectiveness of high-dimensional lattices for model compression.

Conclusion: High-dimensional lattices like the Leech lattice provide scalable, theoretically grounded model compression, with LLVQ offering practical implementation that avoids explicit codebook storage while delivering superior quantization performance.

Abstract: Scalar quantization of large language models (LLMs) is fundamentally limited by information-theoretic bounds. While vector quantization (VQ) overcomes these limits by encoding blocks of parameters jointly, practical implementations must avoid the need for expensive lookup mechanisms or other explicit codebook storage. Lattice approaches address this through highly structured and dense packing. This paper explores the Leech lattice, which, with its optimal sphere packing and kissing configurations at 24 dimensions, is the highest dimensional lattice known with such optimal properties. To make the Leech lattice usable for LLM quantization, we extend an existing search algorithm based on the extended Golay code construction, to i) support indexing, enabling conversion to and from bitstrings without materializing the codebook, ii) allow angular search over union of Leech lattice shells, iii) propose fully-parallelisable dequantization kernel. Together this yields a practical algorithm, namely Leech Lattice Vector Quantization (LLVQ). LLVQ delivers state-of-the-art LLM quantization performance, outperforming recent methods such as Quip#, QTIP, and PVQ. These results highlight the importance of high-dimensional lattices for scalable, theoretically grounded model compression.

[494] ForwardFlow: Simulation only statistical inference using deep learning

Stefan Böhringer

Main category: cs.LG

TL;DR: Unable to analyze paper 2603.10991 due to HTTP 429 error when fetching from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot draw conclusions as paper content could not be retrieved

Abstract: Failed to fetch summary for 2603.10991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[495] Efficient Bayesian Updates for Deep Active Learning via Laplace Approximations

Denis Huseljic, Marek Herde, Lukas Rauch, Paul Hahn, Zhixin Huang, Daniel Kottke, Stephan Vogt, Bernhard Sick

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2210.06112: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2210.06112&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[496] Disjunctive Branch-and-Bound for Certifiably Optimal Low-Rank Matrix Completion

Dimitris Bertsimas, Ryan Cory-Wright, Sean Lo, Jean Pauphilet

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2305.12292: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2305.12292&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[497] Communication-Efficient Multimodal Federated Learning: Joint Modality and Client Selection

Liangqi Yuan, Dong-Jun Han, Su Wang, Devesh Upadhyay, Christopher G. Brinton

Main category: cs.LG

TL;DR: Paper ID 2401.16685 could not be fetched due to HTTP 429 error (rate limiting), so no abstract or content is available for analysis.

Details

Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limiting.

Method: Unable to determine method as the paper content could not be retrieved due to API rate limiting.

Result: Unable to determine results as the paper content could not be retrieved due to API rate limiting.

Conclusion: Unable to draw conclusions as the paper content could not be retrieved due to API rate limiting.

Abstract: Failed to fetch summary for 2401.16685: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2401.16685&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[498] Mamba Neural Operator: Who Wins? Transformers vs. State-Space Models for PDEs

Chun-Wun Cheng, Jiahao Huang, Yi Zhang, Guang Yang, Carola-Bibiane Schönlieb, Angelica I. Aviles-Rivero

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2410.02113: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.02113&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[499] An Algorithm to perform Covariance-Adjusted Support Vector Classification in Non-Euclidean Spaces

Satyajeet Sahoo, Jhareswar Maiti

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2504.04371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.04371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[500] Panda: A pretrained forecast model for chaotic dynamics

Jeffrey Lai, Anthony Bao, William Gilpin

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2505.13755 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2505.13755: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.13755&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[501] CARTGen-IR: Synthetic Tabular Data Generation for Imbalanced Regression

António Pedro Pinheiro, Rita P. Ribeiro

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access restrictions

Method: Unable to determine method due to access restrictions

Result: Unable to determine results due to access restrictions

Conclusion: Unable to draw conclusions due to access restrictions

Abstract: Failed to fetch summary for 2506.02811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[502] Sequential-Parallel Duality in Prefix Scannable Models

Morris Yau, Sharut Gupta, Valerie Engelmayer, Kazuki Irie, Stefanie Jegelka, Jacob Andreas

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2506.10918: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.10918&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[503] Silhouette-Driven Instance-Weighted $k$-means

Aggelos Semoglou, Aristidis Likas, John Pavlopoulos

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to data retrieval error

Method: Unable to determine method due to data retrieval error

Result: Unable to determine results due to data retrieval error

Conclusion: Unable to determine conclusion due to data retrieval error

Abstract: Failed to fetch summary for 2506.12878: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12878&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[504] Order Optimal Regret Bounds for Sharpe Ratio Optimization under Thompson Sampling

Mohammad Taha Shah, Sabrina Khurshid, Gourab Ghatak

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2508.13749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[505] GDR-learners: Orthogonal Learning of Generative Models for Potential Outcomes

Valentyn Melnychuk, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2509.22953 suggests it’s from September 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to paper content.

Method: Cannot determine method without access to paper content.

Result: Cannot determine results without access to paper content.

Conclusion: Cannot draw conclusions without access to paper content.

Abstract: Failed to fetch summary for 2509.22953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[506] One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot determine conclusion as paper content is unavailable

Abstract: Failed to fetch summary for 2509.24483: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24483&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[507] Overlap-Adaptive Regularization for Conditional Average Treatment Effect Estimation

Valentyn Melnychuk, Dennis Frauen, Jonas Schweisthal, Stefan Feuerriegel

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2509.24962 suggests it’s from September 2025, but no content available for analysis.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2509.24962: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24962&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[508] Composer: A Search Framework for Hybrid Neural Architecture Design

Bilge Acun, Prasoon Sinha, Newsha Ardalani, Sangmin Bae, Alicia Golden, Chien-Yu Lin, Meghana Madhyastha, Fei Sun, Neeraja J. Yadwadkar, Carole-Jean Wu

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2510.00379: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00379&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[509] Communication Enables Cooperation in LLM Agents: A Comparison with Curriculum-Based Approaches

Hachem Madmoun, Salem Lahlou

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to draw conclusions due to access error

Abstract: Failed to fetch summary for 2510.05748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[510] Absolute indices for determining compactness, separability and number of clusters

Adil M. Bagirov, Ramiz M. Aliguliyev, Nargiz Sultanova, Sona Taheri

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to technical error in fetching paper content

Method: Unable to determine method due to technical error in fetching paper content

Result: Unable to determine results due to technical error in fetching paper content

Conclusion: Unable to draw conclusions due to technical error in fetching paper content

Abstract: Failed to fetch summary for 2510.13065: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13065&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[511] Revisiting Value Iteration: Unified Analysis of Discounted and Average-Reward Cases

Arsenii Mustafin, Xinyi Sheng, Dominik Baumann

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2510.23914: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.23914&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] Partially Equivariant Reinforcement Learning in Symmetry-Breaking Environments

Junwoo Chang, Minwoo Park, Joohwan Seo, Roberto Horowitz, Jongmin Lee, Jongeun Choi

Main category: cs.LG

TL;DR: Proposes Partially group-Invariant MDP (PI-MDP) framework and PE-DQN/PE-SAC algorithms that selectively apply group-invariant or standard Bellman backups to handle local symmetry-breaking in RL, improving sample efficiency and robustness.

Details

Motivation: Real-world RL environments rarely have perfect group symmetries due to local symmetry-breaking in dynamics, actuation limits, or reward design. Standard group-invariant approaches propagate errors from local symmetry-breaking across the entire state-action space, causing global value estimation errors.

Method: Introduces Partially group-Invariant MDP (PI-MDP) framework that selectively applies group-invariant Bellman backups where symmetry holds and standard backups where symmetry is broken. Develops practical algorithms: PE-DQN for discrete control and PE-SAC for continuous control that implement this selective symmetry exploitation.

Result: Experiments on Grid-World, locomotion, and manipulation benchmarks show PE-DQN and PE-SAC significantly outperform baseline methods, demonstrating improved sample efficiency and robustness to symmetry-breaking.

Conclusion: Selective exploitation of symmetries via the PI-MDP framework provides a principled approach to handle local symmetry-breaking in RL, combining benefits of equivariance with robustness, leading to more sample-efficient and generalizable algorithms.

Abstract: Group symmetries provide a powerful inductive bias for reinforcement learning (RL), enabling efficient generalization across symmetric states and actions via group-invariant Markov Decision Processes (MDPs). However, real-world environments almost never realize fully group-invariant MDPs; dynamics, actuation limits, and reward design usually break symmetries, often only locally. Under group-invariant Bellman backups for such cases, local symmetry-breaking introduces errors that propagate across the entire state-action space, resulting in global value estimation errors. To address this, we introduce Partially group-Invariant MDP (PI-MDP), which selectively applies group-invariant or standard Bellman backups depending on where symmetry holds. This framework mitigates error propagation from locally broken symmetries while maintaining the benefits of equivariance, thereby enhancing sample efficiency and generalizability. Building on this framework, we present practical RL algorithms – Partially Equivariant (PE)-DQN for discrete control and PE-SAC for continuous control – that combine the benefits of equivariance with robustness to symmetry-breaking. Experiments across Grid-World, locomotion, and manipulation benchmarks demonstrate that PE-DQN and PE-SAC significantly outperform baselines, highlighting the importance of selective symmetry exploitation for robust and sample-efficient RL. Project page: https://pranaboy72.github.io/perl_page/

[513] Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

Yedi Zhang, Andrew Saxe, Peter E. Latham

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.20607: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20607&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[514] Time series forecasting with Hahn Kolmogorov-Arnold networks

Md Zahidul Hasan, A. Ben Hamza, Nizar Bouguila

Main category: cs.LG

TL;DR: Paper 2601.18837: Unable to fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2601.18837: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18837&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[515] Position: Beyond Model-Centric Prediction – Agentic Time Series Forecasting

Mingyue Cheng, Xiaoyu Tao, Qi Liu, Ze Guo, Enhong Chen

Main category: cs.LG

TL;DR: Paper 2602.01776: Unable to fetch summary due to HTTP 429 error (rate limiting).

Details

Motivation: Cannot determine motivation due to inability to access paper content.

Method: Cannot determine method due to inability to access paper content.

Result: Cannot determine results due to inability to access paper content.

Conclusion: Cannot determine conclusion due to inability to access paper content.

Abstract: Failed to fetch summary for 2602.01776: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01776&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[516] Grounding Generated Videos in Feasible Plans via World Models

Christos Ziakas, Amir Bar, Alessandra Russo

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot determine conclusion without access to the paper content.

Abstract: Failed to fetch summary for 2602.01960: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01960&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[517] Expert-Data Alignment Governs Generation Quality in Decentralized Diffusion Models

Marcos Villagra, Bidhan Roy, Raihan Seraj, Zhiying Jiang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.02685: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.02685&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[518] BLITZRANK: Principled Zero-shot Ranking Agents with Tournament Graphs

Sheshansh Agrawal, Thien Hang Nguyen, Douwe Kiela

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to access error

Method: Unable to determine method due to access error

Result: Unable to determine results due to access error

Conclusion: Unable to determine conclusion due to access error

Abstract: Failed to fetch summary for 2602.05448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[519] SPAARS: Safer RL Policy Alignment through Abstract Exploration and Refined Exploitation of Action Space

Swaminathan S K, Aritra Hazra

Main category: cs.LG

TL;DR: SPAARS is a curriculum learning framework for offline-to-online RL that starts with safe latent-space exploration using CVAEs, then transitions to raw action space to bypass decoder bottlenecks, achieving better performance and sample efficiency.

Details

Motivation: Offline-to-online RL for robotics needs safe online exploration without deviating from offline data support. Existing CVAE-based methods suffer from exploitation gaps due to decoder reconstruction loss limitations.

Method: Curriculum learning framework with two phases: 1) initial latent-space exploration using CVAE for safety and sample efficiency, 2) transition to raw action space to bypass decoder bottlenecks. Two variants: CVAE-based (unordered pairs) and SPAARS-SUPE (with OPAL temporal skill pretraining).

Result: SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 vs 0.75 for SUPE with 5x better sample efficiency. Standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2, surpassing IQL baselines.

Conclusion: SPAARS effectively bridges offline-to-online RL by combining safe latent exploration with eventual raw-space control, providing theoretical guarantees and practical performance improvements across robotics domains.

Abstract: Offline-to-online reinforcement learning (RL) offers a promising paradigm for robotics by pre-training policies on safe, offline demonstrations and fine-tuning them via online interaction. However, a fundamental challenge remains: how to safely explore online without deviating from the behavioral support of the offline data? While recent methods leverage conditional variational autoencoders (CVAEs) to bound exploration within a latent space, they inherently suffer from an exploitation gap – a performance ceiling imposed by the decoder’s reconstruction loss. We introduce SPAARS, a curriculum learning framework that initially constrains exploration to the low-dimensional latent manifold for sample-efficient, safe behavioral improvement, then seamlessly transfers control to the raw action space, bypassing the decoder bottleneck. SPAARS has two instantiations: the CVAE-based variant requires only unordered (s,a) pairs and no trajectory segmentation; SPAARS-SUPE pairs SPAARS with OPAL temporal skill pretraining for stronger exploration structure at the cost of requiring trajectory chunks. We prove an upper bound on the exploitation gap using the Performance Difference Lemma, establish that latent-space policy gradients achieve provable variance reduction over raw-space exploration, and show that concurrent behavioral cloning during the latent phase directly controls curriculum transition stability. Empirically, SPAARS-SUPE achieves 0.825 normalized return on kitchen-mixed-v0 versus 0.75 for SUPE, with 5x better sample efficiency; standalone SPAARS achieves 92.7 and 102.9 normalized return on hopper-medium-v2 and walker2d-medium-v2 respectively, surpassing IQL baselines of 66.3 and 78.3 respectively, confirming the utility of the unordered-pair CVAE instantiation.

[520] Latent Poincaré Shaping for Agentic Reinforcement Learning

Hanchen Xia, Baoyou Chen, Zelin Zang, Yutang Ge, Guojiang Zhao, Siyu Zhu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2602.09375: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09375&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[521] LexiSafe: Offline Safe Reinforcement Learning with Lexicographic Safety-Reward Hierarchy

Hsin-Jung Yang, Zhanhong Jiang, Prajwal Koirala, Qisai Liu, Cody Fleming, Soumik Sarkar

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.

Details

Motivation: Cannot determine motivation without access to the paper content.

Method: Cannot determine method without access to the paper content.

Result: Cannot determine results without access to the paper content.

Conclusion: Cannot draw conclusions without access to the paper content.

Abstract: Failed to fetch summary for 2602.17312: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17312&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[522] Active Value Querying to Minimize Additive Error in Subadditive Set Function Learning

Martin Černý, David Sychrovský, Filip Úradník, Jakub Černý

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content could not be retrieved

Method: Cannot determine method as paper content could not be retrieved

Result: Cannot determine results as paper content could not be retrieved

Conclusion: Cannot determine conclusion as paper content could not be retrieved

Abstract: Failed to fetch summary for 2602.23529: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23529&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[523] Solving adversarial examples requires solving exponential misalignment

Alessandro Salvatore, Stanislav Fort, Surya Ganguli

Main category: cs.LG

TL;DR: Neural networks have perceptual manifolds with orders of magnitude higher dimensionality than human concepts, creating exponential misalignment that explains adversarial vulnerability.

Details

Motivation: To understand the mysterious origins of adversarial examples and why neural networks remain vulnerable to imperceptible perturbations that fool them but not humans.

Method: Define and analyze perceptual manifolds (PMs) - spaces of inputs confidently assigned to classes by networks. Compare dimensionalities of neural network PMs vs human concepts across 18 different networks with varying robust accuracy.

Result: Network PMs have orders of magnitude higher dimensionality than human concepts. Robust accuracy and distance to PMs are negatively correlated with PM dimension. Even robust networks remain exponentially misaligned, with only PMs approaching human concept dimensionality showing perceptual alignment.

Conclusion: High-dimensional machine perceptual manifolds create exponential misalignment with humans, explaining adversarial vulnerability. Dimensional alignment between machine and human PMs is essential for adversarial robustness.

Abstract: Adversarial attacks - input perturbations imperceptible to humans that fool neural networks - remain both a persistent failure mode in machine learning, and a phenomenon with mysterious origins. To shed light, we define and analyze a network’s perceptual manifold (PM) for a class concept as the space of all inputs confidently assigned to that class by the network. We find, strikingly, that the dimensionalities of neural network PMs are orders of magnitude higher than those of natural human concepts. Since volume typically grows exponentially with dimension, this suggests exponential misalignment between machines and humans, with exponentially many inputs confidently assigned to concepts by machines but not humans. Furthermore, this provides a natural geometric hypothesis for the origin of adversarial examples: because a network’s PM fills such a large region of input space, any input will be very close to any class concept’s PM. Our hypothesis thus suggests that adversarial robustness cannot be attained without dimensional alignment of machine and human PMs, and therefore makes strong predictions: both robust accuracy and distance to any PM should be negatively correlated with the PM dimension. We confirmed these predictions across 18 different networks of varying robust accuracy. Crucially, we find even the most robust networks are still exponentially misaligned, and only the few PMs whose dimensionality approaches that of human concepts exhibit alignment to human perception. Our results connect the fields of alignment and adversarial examples, and suggest the curse of high dimensionality of machine PMs is a major impediment to adversarial robustness.

[524] Stochastic Attention via Langevin Dynamics on the Modern Hopfield Energy

Abdulrahman Alswaidan, Jeffrey D. Varner

Main category: cs.LG

TL;DR: Stochastic attention reformulates attention as gradient descent on an energy function, enabling training-free sampling with temperature control for retrieval-to-generation transitions.

Details

Motivation: To provide a unified framework for attention mechanisms that can seamlessly transition between retrieval and generation modes without requiring additional training or model modifications.

Method: Reformulates attention as one step of gradient descent on a classical energy function, then uses Langevin sampling to create stochastic attention controlled by temperature. Derives closed-form entropy inflection condition to identify retrieval-to-generation transition temperature.

Result: Stochastic attention achieves 2.6× more novel and 2.0× more diverse samples than learned VAE on MNIST, and 6.9× lower amino acid composition divergence on protein sequences. Denoising diffusion baselines fail across all memory sizes tested.

Conclusion: Attention mechanisms inherently contain both retrieval and generative capabilities, which can be unlocked through temperature-controlled stochastic sampling without additional training or architectural changes.

Abstract: Attention heads retrieve: given a query, they return a softmax-weighted average of stored values. We show that this computation is one step of gradient descent on a classical energy function, and that Langevin sampling from the corresponding distribution yields stochastic attention: a training-free sampler controlled by a single temperature. Lowering the temperature gives exact retrieval; raising it gives open-ended generation. Because the energy gradient equals the attention map, no score network, training loop, or learned model is required. We derive a closed-form entropy inflection condition that identifies the retrieval-to-generation transition temperature for any memory geometry, with a scaling law $β^*!\sim!\sqrt{d}$ for random patterns. We validate on five domains (64 to 4,096 dimensions). On MNIST digit images, stochastic attention is $2.6{\times}$ more novel and $2.0{\times}$ more diverse than the best learned baseline (a VAE trained on the same patterns), while matching a Metropolis-corrected gold standard. On protein sequences from the Pfam RRM family, the generation regime achieves $6.9{\times}$ lower amino acid composition divergence than the VAE (KL $= 0.060$ vs.\ $0.416$) at matched novelty, demonstrating that the training-free score function preserves family-level fidelity that learned models lose. A denoising diffusion baseline (DDPM) fails across all memory sizes tested ($K = 100$ to $3{,}500$), producing samples indistinguishable from isotropic noise. The approach requires no architectural changes to the underlying attention mechanism.

[525] Equitable Multi-Task Learning for AI-RANs

Panayiotis Raptis, Fatih Aslan, George Iosifidis

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot determine conclusion without access to paper content

Abstract: Failed to fetch summary for 2603.08717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[526] Proxy-Guided Measurement Calibration

Saketh Vishnubhatla, Shu Wan, Andre Harrison, Adrienne Raglin, Huan Liu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions about paper content due to access limitations

Abstract: Failed to fetch summary for 2603.09288: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09288&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[527] Reconstructing Movement from Sparse Samples: Enhanced Spatio-Temporal Matching Strategies for Low-Frequency Data

Ali Yousefian, Arianna Burzacchi, Simone Vantini

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2603.09412: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09412&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[528] Task Aware Modulation Using Representation Learning for Upsaling of Terrestrial Carbon Fluxes

Aleksei Rozanov, Arvind Renganathan, Vipin Kumar

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper retrieval

Method: Unable to determine method due to failed paper retrieval

Result: Unable to determine results due to failed paper retrieval

Conclusion: Unable to draw conclusions due to failed paper retrieval

Abstract: Failed to fetch summary for 2603.09974: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.09974&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[529] EarthquakeNPP: A Benchmark for Earthquake Forecasting with Neural Point Processes

Samuel Stockman, Daniel Lawson, Maximilian Werner

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2410.08226 due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing paper content

Method: Unable to determine method due to missing paper content

Result: Unable to determine results due to missing paper content

Conclusion: Unable to determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2410.08226: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.08226&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[530] Losing dimensions: Geometric memorization in generative diffusion

Beatrice Achilli, Enrico Ventura, Gianluigi Silvestri, Bao Pham, Gabriel Raya, Dmitry Krotov, Carlo Lucibello, Luca Ambrogioni

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2410.08727: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.08727&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[531] Conditional Local Importance by Quantile Expectations

Kelvyn K. Bladen, Adele Cutler, D. Richard Cutler, Kevin R. Moon

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2411.08821: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.08821&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[532] A Novel Single-Layer Quantum Neural Network for Approximate SRBB-Based Unitary Synthesis

Giacomo Belli, Marco Mordacci, Michele Amoretti

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2412.03083: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.03083&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[533] Pairwise Comparisons without Stochastic Transitivity: Model, Theory and Applications

Sze Ming Lee, Yunxiao Chen

Main category: cs.LG

TL;DR: Paper ID 2501.07437: Unable to fetch abstract due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to inability to access paper content

Method: Cannot determine method due to inability to access paper content

Result: Cannot determine results due to inability to access paper content

Conclusion: Cannot determine conclusion due to inability to access paper content

Abstract: Failed to fetch summary for 2501.07437: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.07437&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[534] Universal Dynamics with Globally Controlled Analog Quantum Simulators

Hong-Ye Hu, Abigail McClain Gomez, Liyuan Chen, Aaron Trowbridge, Andy J. Goldschmidt, Zachary Manchester, Frederic T. Chong, Arthur Jaffe, Susanne F. Yelin

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2508.19075: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.19075&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[535] Tensor Train Completion from Fiberwise Observations Along a Single Mode

Shakir Showkat Sofi, Lieven De Lathauwer

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2509.18149 appears to be from September 2025, suggesting it’s a recent multimodal or AI-related work.

Details

Motivation: Cannot determine motivation without access to paper content. Based on the arXiv ID format (2509.18149), this appears to be a recent paper from September 2025, likely in the AI/multimodal domain.

Method: Method unknown due to HTTP 429 error preventing access to paper details. The arXiv API rate limiting suggests high demand for this paper.

Result: Results cannot be determined without access to the paper content. The HTTP 429 error indicates the arXiv API is rate-limited, possibly due to high interest in this paper.

Conclusion: Unable to analyze paper due to technical limitations. The arXiv ID suggests this is a recent (September 2025) paper that may be relevant to multimodal AI research.

Abstract: Failed to fetch summary for 2509.18149: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18149&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[536] Zero-Shot Transferable Solution Method for Parametric Optimal Control Problems

Xingjian Li, Kelvin Kan, Deepanshu Verma, Krishna Kumar, Stanley Osher, Ján Drgoňa

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2509.18404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[537] Empirical PAC-Bayes Bounds for Markov Chains

Vahe Karagulyan, Pierre Alquier

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable due to HTTP 429 error

Result: No results available due to API rate limiting preventing access to paper information

Conclusion: Cannot provide conclusion as the paper content could not be retrieved due to rate limiting issues

Abstract: Failed to fetch summary for 2509.20985: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20985&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[538] Resource Allocation in Hybrid Radio-Optical IoT Networks using GNN with Multi-task Learning

Aymen Hamrouni, Sofie Pollin, Hazem Sallouha

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation without access to paper content

Method: Cannot determine method without access to paper content

Result: Cannot determine results without access to paper content

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2511.07428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[539] A scalable and real-time neural decoder for topological quantum codes

Andrew W. Senior, Thomas Edlich, Francisco J.H. Heras, Lei M. Zhang, Oscar Higgott, James S. Spencer, Taylor Applebaum, Sam Blackwell, Justin Ledford, Akvilė Žemgulytė, Augustin Žídek, Noah Shutty, Andrew Cowie, Yin Li, George Holland, Peter Brooks, Charlie Beattie, Michael Newman, Alex Davies, Cody Jones, Sergio Boixo, Hartmut Neven, Pushmeet Kohli, Johannes Bausch

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions as paper content is unavailable

Abstract: Failed to fetch summary for 2512.07737: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07737&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Federico Ottomano, Yingzhen Li, Alex M. Ganose

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation as paper content is unavailable

Method: Cannot determine method as paper content is unavailable

Result: Cannot determine results as paper content is unavailable

Conclusion: Cannot draw conclusions without access to paper content

Abstract: Failed to fetch summary for 2512.19733: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.19733&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[541] Sampling via Stochastic Interpolants by Langevin-based Velocity and Initialization Estimation in Flow ODEs

Chenguang Duan, Yuling Jiao, Gabriele Steidl, Christian Wald, Jerry Zhijian Yang, Ruizhe Zhang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed API request

Method: Unable to determine method due to failed API request

Result: Unable to determine results due to failed API request

Conclusion: Unable to determine conclusion due to failed API request

Abstract: Failed to fetch summary for 2601.08527: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08527&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[542] Error Analysis of Bayesian Inverse Problems with Generative Priors

Bamdad Hosseini, Ziqi Huang

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to failed paper fetch

Method: Unable to determine method due to failed paper fetch

Result: Unable to determine results due to failed paper fetch

Conclusion: Unable to determine conclusion due to failed paper fetch

Abstract: Failed to fetch summary for 2601.17374: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17374&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[543] Singular Bayesian Neural Networks

Mame Diarra Toure, David A. Stephens

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: Cannot determine motivation due to API access limitations

Method: Cannot determine method due to API access limitations

Result: Cannot determine results due to API access limitations

Conclusion: Cannot draw conclusions due to API access limitations

Abstract: Failed to fetch summary for 2602.00387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[544] Emergence of Distortions in High-Dimensional Guided Diffusion Models

Enrico Ventura, Beatrice Achilli, Luca Ambrogioni, Carlo Lucibello

Main category: cs.LG

TL;DR: Paper ID 2602.00716 could not be fetched due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to fetch failure

Method: Unable to determine method due to fetch failure

Result: Unable to determine results due to fetch failure

Conclusion: Unable to determine conclusion due to fetch failure

Abstract: Failed to fetch summary for 2602.00716: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00716&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[545] Universality of General Spiked Tensor Models

Yanjin Xiang, Zhihua Zhang

Main category: cs.LG

TL;DR: Paper 2602.04472: Could not fetch summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine motivation due to missing abstract

Method: Unable to determine method due to missing abstract

Result: Unable to determine results due to missing abstract

Conclusion: Unable to determine conclusion due to missing abstract

Abstract: Failed to fetch summary for 2602.04472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.04472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[546] Benchmarking Graph Neural Networks in Solving Hard Constraint Satisfaction Problems

Geri Skenderi, Lorenzo Buffoni, Francesco D’Amico, David Machado, Raffaele Marino, Matteo Negri, Federico Ricci-Tersenghi, Carlo Lucibello, Maria Chiara Angelini

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to failed paper retrieval

Method: Cannot determine method due to failed paper retrieval

Result: Cannot determine results due to failed paper retrieval

Conclusion: Cannot determine conclusion due to failed paper retrieval

Abstract: Failed to fetch summary for 2602.18419: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18419&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[547] Micro-Diffusion Compression - Binary Tree Tweedie Denoising for Online Probability Estimation

Roberto Tacconelli

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to missing paper content

Method: Cannot determine method due to missing paper content

Result: Cannot determine results due to missing paper content

Conclusion: Cannot determine conclusion due to missing paper content

Abstract: Failed to fetch summary for 2603.08771: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08771&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.MA

[548] LLMGreenRec: LLM-Based Multi-Agent Recommender System for Sustainable E-Commerce

Hao N. Nguyen, Hieu M. Nguyen, Son Van Nguyen, Nguyen Thi Hanh

Main category: cs.MA

TL;DR: LLMGreenRec: A multi-agent LLM framework for sustainable e-commerce recommendations that infers green-oriented user intents and promotes eco-friendly products while reducing digital carbon footprint.

Details

Motivation: Traditional session-based recommender systems focus on short-term conversions and fail to capture nuanced user intents for sustainable choices, creating a gap between green intentions and actions. There's also a need to minimize the digital carbon footprint of recommender systems themselves.

Method: Introduces LLMGreenRec, a multi-agent framework leveraging Large Language Models. Specialized agents collaboratively analyze user interactions and use iterative prompt refinement to deduce green-oriented user intents, then prioritize eco-friendly product recommendations while reducing unnecessary interactions.

Result: Extensive experiments on benchmark datasets validate LLMGreenRec’s effectiveness in recommending sustainable products. The framework successfully bridges the gap between green intentions and actions while reducing energy consumption.

Conclusion: LLMGreenRec provides a robust solution for promoting sustainable consumption in e-commerce through intent-driven recommendations, fostering a responsible digital economy while addressing environmental concerns.

Abstract: Rising environmental awareness in e-commerce necessitates recommender systems that not only guide users to sustainable products but also minimize their own digital carbon footprints. Traditional session-based systems, optimized for short-term conversions, often fail to capture nuanced user intents for eco-friendly choices, perpetuating a gap between green intentions and actions. To tackle this, we introduce LLMGreenRec, a novel multi-agent framework that leverages Large Language Models (LLMs) to promote sustainable consumption. Through collaborative analysis of user interactions and iterative prompt refinement, LLMGreenRec’s specialized agents deduce green-oriented user intents and prioritize eco-friendly product recommendations. Notably, this intent-driven approach also reduces unnecessary interactions and energy consumption. Extensive experiments on benchmark datasets validate LLMGreenRec’s effectiveness in recommending sustainable products, demonstrating a robust solution that fosters a responsible digital economy.

[549] The Coordination Gap: Alternation Metrics for Temporal Dynamics in Multi-Agent Battle of the Exes

Nikolaos Al. Papadopoulos, Konstantinos Psannis

Main category: cs.MA

TL;DR: The paper introduces temporally-sensitive metrics for evaluating multi-agent coordination, showing that conventional metrics can be misleading when assessing temporal coordination quality.

Details

Motivation: Existing metrics for multi-agent coordination are temporally blind and fail to distinguish structured coordination patterns from random or monopolistic behaviors, especially as the number of agents grows.

Method: Proposed Perfect Alternation as a reference coordination regime and introduced six novel Alternation (ALT) metrics. Used Q-learning agents as a diagnostic baseline and compared against random-policy null processes in a BoE-derived multi-agent variant of Battle of the Exes formalized as a Markov game.

Result: Learned policies showed deceptively high traditional metrics (reward fairness often >0.9) but performed up to 81% below random baselines under ALT metrics, with deficits present in two-agent case and intensifying as n grows.

Conclusion: High aggregate payoffs can coexist with poor temporal coordination, conventional metrics may severely mischaracterize emergent dynamics, and temporally-aware observables are essential for analyzing coordination in multi-agent games.

Abstract: Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation, a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.

cs.MM

[550] AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

Yunsheng Wang, Yuntao Shou, Yilong Tan, Wei Ai, Tao Meng, Keqin Li

Main category: cs.MM

TL;DR: AMB-DSGDN: A multimodal emotion recognition framework using differential graph attention and adaptive modality balancing to capture emotional dependencies while filtering noise and preventing modality dominance.

Details

Motivation: Existing multimodal emotion recognition approaches struggle with filtering redundant/noisy signals and suffer from dominant modalities overwhelming the fusion process, suppressing complementary contributions from non-dominant modalities like speech and vision.

Method: Constructs modality-specific subgraphs (text, speech, vision) with intra-speaker and inter-speaker graphs, uses differential graph attention to compute discrepancy between attention maps to cancel shared noise, and employs adaptive modality balancing with dropout probabilities based on each modality’s relative contribution.

Result: The paper claims the method yields purer and more discriminative emotional representations by filtering noise while retaining modality-specific signals, and prevents dominant modalities from suppressing others through adaptive balancing.

Conclusion: AMB-DSGDN addresses key limitations in multimodal emotion recognition by better modeling emotional dependencies and learning balanced multimodal representations through differential attention and adaptive modality balancing.

Abstract: Multimodal dialogue emotion recognition captures emotional cues by fusing text, visual, and audio modalities. However, existing approaches still suffer from notable limitations in modeling emotional dependencies and learning multimodal representations. On the one hand, they are unable to effectively filter out redundant or noisy signals within multimodal features, which hinders the accurate capture of the dynamic evolution of emotional states across and within speakers. On the other hand, during multimodal feature learning, dominant modalities tend to overwhelm the fusion process, thereby suppressing the complementary contributions of non-dominant modalities such as speech and vision, ultimately constraining the overall recognition performance. To address these challenges, we propose an Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network (AMB-DSGDN). Concretely, we first construct modality-specific subgraphs for text, speech, and vision, where each modality contains intra-speaker and inter-speaker graphs to capture both self-continuity and cross-speaker emotional dependencies. On top of these subgraphs, we introduce a differential graph attention mechanism, which computes the discrepancy between two sets of attention maps. By explicitly contrasting these attention distributions, the mechanism cancels out shared noise patterns while retaining modality-specific and context-relevant signals, thereby yielding purer and more discriminative emotional representations. In addition, we design an adaptive modality balancing mechanism, which estimates a dropout probability for each modality according to its relative contribution in emotion modeling.

Dongxu Zhang, Yiding Sun, Cheng Tan, Wenbiao Yan, Ning Yang, Jihua Zhu, Haijun Zhang

Main category: cs.MM

TL;DR: V-Skip addresses latency issues in multimodal CoT reasoning by introducing visual-anchored token pruning that prevents visual amnesia, achieving 2.9× speedup with minimal accuracy loss.

Details

Motivation: Current CoT reasoning in MLLMs suffers from high latency due to autoregressive nature. Existing token compression methods fail in multimodal contexts by applying text-centric metrics that cause visual amnesia - pruning linguistically redundant but visually important tokens.

Method: V-Skip reformulates token pruning as a Visual-Anchored Information Bottleneck optimization problem. It uses a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow to preserve visually salient anchors.

Result: Achieves 2.9× speedup with negligible accuracy loss. Preserves fine-grained visual details and outperforms other baselines by over 30% on DocVQA. Tested on Qwen2-VL and Llama-3.2 families.

Conclusion: V-Skip effectively addresses the latency bottleneck in multimodal CoT reasoning while preventing visual amnesia, making it a practical solution for efficient multimodal reasoning.

Abstract: While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines over 30% on the DocVQA.

eess.AS

[552] Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

Elizaveta Kostenok, Mathieu Salzmann, Milos Cernak

Main category: eess.AS

TL;DR: Novel post-training method adapts Audio Large Language Model for multidimensional speech quality assessment using calibration and reinforcement learning with dimension-specific rewards

Details

Motivation: Current speech quality assessment relies on Mean Opinion Scores (MOS) which lack explainability; need to analyze underlying perceptual dimensions and provide detailed artifact detection

Result: Achieves state-of-the-art 0.71 mean PCC score on QualiSpeech benchmark, 13% improvement in MOS prediction, and substantial advances in pinpointing and classifying audio artifacts temporally

Conclusion: The method successfully enables explainable multidimensional speech quality assessment with improved accuracy and temporal artifact localization through tailored Audio LLM adaptation

[553] Speech Codec Probing from Semantic and Phonetic Perspectives

Xuan Shi, Chang Zeng, Tiantian Feng, Shih-Heng Wang, Jianbo Ma, Shrikanth Narayanan

Main category: eess.AS

TL;DR: Analysis shows current speech tokenizers capture phonetic rather than semantic information, revealing a mismatch with text semantics that affects multimodal LLM performance.

Details

Result: Current tokenizers primarily capture phonetic rather than lexical-semantic structure, revealing a fundamental mismatch between speech and text representations.

Conclusion: The findings provide practical implications for designing next-generation speech tokenization methods that better align with text semantics for improved multimodal LLM performance.

[554] G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition

Jing Peng, Ziyi Chen, Haoyu Li, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai Wang

Main category: eess.AS

TL;DR: G-STAR: End-to-end timestamped speaker-attributed ASR system for long-form multi-party speech with overlap, combining time-aware speaker tracking with Speech-LLM transcription backbone.

Details

Motivation: Need for timestamped speaker-attributed ASR that preserves meeting-level speaker identity consistency across chunks while handling overlap, addressing limitations of previous Speech-LLM systems that prioritize either local diarization or global labeling but lack fine-grained temporal boundaries or robust cross-chunk identity linking.

Method: Proposes G-STAR system coupling time-aware speaker-tracking module with Speech-LLM transcription backbone. Tracker provides structured speaker cues with temporal grounding, LLM generates attributed text conditioned on these cues. Supports both component-wise optimization and joint end-to-end training for flexible learning under heterogeneous supervision and domain shift.

Result: Experiments analyze cue fusion, local versus long-context trade-offs, and hierarchical objectives. System addresses challenges of timestamped speaker-attributed ASR for long-form multi-party speech with overlap.

Conclusion: G-STAR provides an end-to-end solution for timestamped speaker-attributed ASR that maintains speaker identity consistency across chunks while producing time-stamped, speaker-labeled transcripts, addressing limitations of previous approaches.

Abstract: We study timestamped speaker-attributed ASR for long-form, multi-party speech with overlap, where chunk-wise inference must preserve meeting-level speaker identity consistency while producing time-stamped, speaker-labeled transcripts. Previous Speech-LLM systems tend to prioritize either local diarization or global labeling, but often lack the ability to capture fine-grained temporal boundaries or robust cross-chunk identity linking. We propose G-STAR, an end-to-end system that couples a time-aware speaker-tracking module with a Speech-LLM transcription backbone. The tracker provides structured speaker cues with temporal grounding, and the LLM generates attributed text conditioned on these cues. G-STAR supports both component-wise optimization and joint end-to-end training, enabling flexible learning under heterogeneous supervision and domain shift. Experiments analyze cue fusion, local versus long-context trade-offs and hierarchical objectives.

[555] FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu

Main category: eess.AS

TL;DR: FireRedASR2S is an industrial-grade all-in-one speech recognition system integrating ASR, VAD, language identification, and punctuation prediction modules, achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: To create a comprehensive, high-performance speech processing system that integrates multiple essential modules (ASR, VAD, LID, punctuation) into a unified pipeline for industrial applications, addressing the need for accurate multilingual and multi-dialect speech recognition.

Method: Developed four specialized modules: 1) FireRedASR2 with LLM (8B+ parameters) and AED (1B+ parameters) variants for speech/singing transcription, 2) ultra-lightweight DFSMN-based VAD module (0.6M parameters), 3) Encoder-Decoder LID supporting 100+ languages, and 4) BERT-style punctuation prediction for Chinese and English.

Result: Achieved SOTA performance across all modules: ASR (2.89% CER on Mandarin, 11.55% on dialects), VAD (97.57% F1, 99.60% AUC-ROC), LID (97.18% accuracy on 82 languages), and punctuation (78.90% F1 vs 62.77% baseline). Outperformed competitive systems like Doubao-ASR, Qwen3-ASR, Whisper, and SpeechBrain.

Conclusion: FireRedASR2S represents a comprehensive, high-performance speech processing system that integrates multiple essential modules into a unified pipeline, achieving state-of-the-art results across various benchmarks while supporting multilingual and multi-dialect applications.

Abstract: We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All modules achieve SOTA performance on the evaluated benchmarks: FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves 2.89% average CER on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialects and accents benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR. FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD. FireRedLID: An Encoder-Decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain. FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%). To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.

[556] Geo-ATBench: A Benchmark for Geospatial Audio Tagging with Geospatial Semantic Context

Yuanbo Hou, Yanru Wu, Qiaoqiao Ren, Shengchen Li, Stephen Roberts, Dick Botteldooren

Main category: eess.AS

TL;DR: Geo-AT introduces geospatial semantic context to improve audio tagging by reducing ambiguity in acoustically similar sounds through location-based environmental priors.

Details

Motivation: Traditional audio-only recognition struggles with acoustically similar events that are difficult to separate from waveforms alone. Geospatial semantic context (from GIS data like POIs) provides location-tied environmental priors that can help disambiguate these confounded sounds.

Method: Proposes Geo-AT task for multi-label sound event tagging conditioned on GSC alongside audio. Introduces Geo-ATBench dataset (10.71 hours, 28 event categories, 11 semantic context categories) and GeoFusion-AT framework evaluating feature-, representation-, and decision-level fusion on audio backbones.

Result: Incorporating GSC improves audio tagging performance, especially on acoustically confounded labels. No significant difference between model performance on Geo-ATBench labels and aggregated human labels in listening study with 10 participants on 579 samples.

Conclusion: Geospatial semantics provide effective priors beyond audio alone for sound event recognition. The Geo-AT task, benchmark, and fusion framework establish foundation for studying audio tagging with geospatial context in CASA community.

Abstract: Environmental sound understanding in computational auditory scene analysis (CASA) is often formulated as an audio-only recognition problem. This formulation leaves a persistent drawback in multi-label audio tagging (AT): acoustic similarity can make certain events difficult to separate from waveforms alone. In such cases, disambiguating cues often lie outside the waveform. Geospatial semantic context (GSC), derived from geographic information system data, e.g., points of interest (POI), provides location-tied environmental priors that can help reduce this ambiguity. A systematic study of this direction is enabled through the proposed geospatial audio tagging (Geo-AT) task, which conditions multi-label sound event tagging on GSC alongside audio. To benchmark Geo-AT, Geo-ATBench is introduced as a polyphonic audio benchmark with geographical annotations, containing 10.71 hours of audio across 28 event categories; each clip is paired with a GSC representation from 11 semantic context categories. GeoFusion-AT is proposed as a unified geo-audio fusion framework that evaluates feature-, representation-, and decision-level fusion on representative audio backbones, with audio- and GSC-only baselines. Results show that incorporating GSC improves AT performance, especially on acoustically confounded labels, indicating geospatial semantics provide effective priors beyond audio alone. A crowdsourced listening study with 10 participants on 579 samples shows that there is no significant difference in performance between models on Geo-ATBench labels and aggregated human labels, supporting Geo-ATBench as a human-aligned benchmark. The Geo-AT task, benchmark Geo-ATBench, and reproducible geo-audio fusion framework GeoFusion-AT provide a foundation for studying AT with geospatial semantic context within the CASA community. Dataset, code, models are on homepage (https://github.com/WuYanru2002/Geo-ATBench).

[557] MOS-Bias: From Hidden Gender Bias to Gender-Aware Speech Quality Assessment

Wenze Ren, Yi-Cheng Lin, Wen-Chin Huang, Erica Cooper, Ryandhimas E. Zezario, Hsin-Min Wang, Hung-yi Lee, Yu Tsao

Main category: eess.AS

TL;DR: First systematic analysis of gender bias in speech quality assessment (MOS), revealing male listeners consistently give higher scores than females, especially for low-quality speech, and proposing gender-aware models to address this bias.

Details

Motivation: While MOS is the standard metric for speech quality assessment, biases in human annotations remain underexplored. The paper aims to systematically analyze gender bias in MOS ratings and understand how it affects automated speech quality assessment models.

Method: Conducted systematic analysis of gender bias in MOS ratings, revealing quality-dependent scoring patterns. Proposed gender-aware model that learns gender-specific scoring patterns through abstracting binary group embeddings to improve prediction accuracy.

Result: Found that male listeners consistently assign higher MOS scores than female listeners, with the gap most pronounced in low-quality speech and diminishing as quality improves. Automated MOS models trained on aggregated labels exhibit predictions skewed toward male standards. Gender-aware model improves overall and gender-specific prediction accuracy.

Conclusion: Gender bias in MOS constitutes a systematic, learnable pattern that demands attention in equitable speech evaluation. Quality-dependent bias structure is difficult to eliminate through simple calibration, requiring specialized approaches like gender-aware modeling.

Abstract: The Mean Opinion Score (MOS) serves as the standard metric for speech quality assessment, yet biases in human annotations remain underexplored. We conduct the first systematic analysis of gender bias in MOS, revealing that male listeners consistently assign higher scores than female listeners–a gap that is most pronounced in low-quality speech and gradually diminishes as quality improves. This quality-dependent structure proves difficult to eliminate through simple calibration. We further demonstrate that automated MOS models trained on aggregated labels exhibit predictions skewed toward male standards of perception. To address this, we propose a gender-aware model that learns gender-specific scoring patterns through abstracting binary group embeddings, thereby improving overall and gender-specific prediction accuracy. This study establishes that gender bias in MOS constitutes a systematic, learnable pattern demanding attention in equitable speech evaluation.

[558] Robust Audio-Visual Target Speaker Extraction with Emotion-Aware Multiple Enrollment Fusion

Zhan Jin, Bang Zeng, Peijun Yang, Jiarong Du, Wei Ju, Yao Tian, Juan Liu, Ming Li

Main category: eess.AS

TL;DR: AVTSE robustness study shows training with high modality missing rates improves performance stability when test-time modalities are missing, with face image + lip features fusion achieving best balance.

Details

Motivation: Real-world AVTSE applications face intermittent signal loss, especially for frame-level cues like lip motion. Current multimodal fusion approaches degrade sharply when encountering unseen modality missing during testing.

Method: Systematic investigation of multi-enrollment fusion robustness under varying degrees of modality missing. Training with different missing rates and testing under various modality missing conditions.

Result: Training with high missing rate dramatically enhances robustness, maintaining stable performance even under severe test-time modality missing. Fusing one frame of face image with frame-level lip features achieves both strong performance and robustness.

Conclusion: Robust AVTSE requires training strategies that account for modality missing. The complementary fusion of face image and lip features provides optimal balance between performance and robustness for real-world applications.

Abstract: Audio-Visual Target Speaker Extraction (AVTSE) is crucial for cocktail party scenarios. Leveraging multiple cues –such as utterance-level speaker embeddings or steady face images, and frame-level lip motion or facial expression features –can significantly improve performance. However, real-world applications often suffer from intermittent signal loss, especially for frame-level cues. This paper systematically investigates the robustness of multi-enrollment fusion under varying degrees of modality missing. Results show that while full multimodal fusion excels under ideal conditions, its performance degrades sharply when encountering unseen modalities missing during the testing. Crucially, training with a high missing rate dramatically enhances robustness, maintaining stable performance even under severe test-time modality missing. We demonstrate that fusing the complementary one frame of face image with frame-level lip features achieves both strong performance and robustness for the AVTSE task. The model and codes are shared.

[559] HyWA: Hypernetwork Weight Adapting Personalized Voice Activity Detection

Mahsa Ghazvini Nejad, Hamed Jafarzadeh Asl, Amin Edraki, Mohammadreza Sadeghi, Masoud Asgharian, Yuanhao Yu, Vahid Partovi Nia

Main category: eess.AS

TL;DR: HyWA introduces a hypernetwork-based approach for personalized voice activity detection that generates speaker-specific weights for selected layers of a standard VAD model, outperforming existing speaker-conditioning methods.

Details

Motivation: Existing speaker-conditioning methods for personalized VAD typically modify inputs or activations, but there's a need for more effective approaches that improve performance while maintaining deployment efficiency.

Method: HyWA uses a hypernetwork to generate personalized weights for selected layers of a standard VAD model, enabling speaker-specific adaptation without changing the core architecture.

Result: HyWA consistently outperforms baseline speaker-conditioning techniques, showing improvements in mean average precision while facilitating deployment by reusing the same VAD architecture.

Conclusion: The hypernetwork-based approach represents an effective alternative to existing speaker-conditioning methods for personalized VAD, offering both performance gains and practical deployment advantages.

Abstract: Personalized Voice Activity Detection (PVAD) systems activate only in response to a specific target speaker. Speaker-conditioning methods are employed to inject information about the target speaker into a VAD pipeline, to achieve personalization. Existing speaker-conditioning methods typically modify the inputs or activations of a VAD model. We propose an alternative perspective to speaker conditioning. Our approach, HyWA, employs a hypernetwork to generate personalized weights for a few selected layers of a standard VAD model. We evaluate HyWA against multiple baseline speaker-conditioning techniques using a fixed backbone VAD. Our comparison shows consistent improvements in PVAD performance. This new approach improves the current speaker-conditioning techniques in two ways: i) increases the mean average precision, ii) facilitates deployment by reusing the same VAD architecture.

[560] Multi-View Based Audio Visual Target Speaker Extraction

Peijun Yang, Zhan Jin, Juan Liu, Ming Li

Main category: eess.AS

TL;DR: MVTF is a novel framework for Audio-Visual Target Speaker Extraction that leverages multi-view lip videos during training to improve single-view performance, using tensor fusion to model cross-view correlations.

Details

Motivation: Existing AVTSE methods rely exclusively on frontal-view videos, limiting robustness in real-world scenarios where non-frontal views are prevalent. These alternative visual perspectives contain complementary articulatory information that could enhance speech extraction.

Method: Proposes Multi-View Tensor Fusion (MVTF) that transforms multi-view learning into single-view performance gains. During training, uses synchronized multi-perspective lip videos to learn cross-view correlations through pairwise outer products that explicitly model multiplicative interactions between different views of input lip embeddings. At inference, supports both single-view and multi-view inputs.

Result: Experimental results show that with single-view inputs, the framework leverages multi-view knowledge to achieve significant performance gains. In multi-view mode, it further improves overall performance and enhances robustness.

Conclusion: MVTF effectively addresses the limitation of frontal-only AVTSE by incorporating multi-view visual information, improving both single-view performance and multi-view robustness through cross-view correlation learning.

Abstract: Audio-Visual Target Speaker Extraction (AVTSE) aims to separate a target speaker’s voice from a mixed audio signal using the corresponding visual cues. While most existing AVTSE methods rely exclusively on frontal-view videos, this limitation restricts their robustness in real-world scenarios where non-frontal views are prevalent. Such visual perspectives often contain complementary articulatory information that could enhance speech extraction. In this work, we propose Multi-View Tensor Fusion (MVTF), a novel framework that transforms multi-view learning into single-view performance gains. During the training stage, we leverage synchronized multi-perspective lip videos to learn cross-view correlations through MVTF, where pairwise outer products explicitly model multiplicative interactions between different views of input lip embeddings. At the inference stage, the system supports both single-view and multi-view inputs. Experimental results show that in the single-view inputs, our framework leverages multi-view knowledge to achieve significant performance gains, while in the multi-view mode, it further improves overall performance and enhances the robustness. Our demo, code and data are available at https://anonymous.4open.science/w/MVTF-Gridnet-209C/

eess.IV

[561] ARCHE: Autoregressive Residual Compression with Hyperprior and Excitation

Sofia Iliopoulou, Dimitris Ampeliotis, Athanassios Skodras

Main category: eess.IV

TL;DR: ARCHE is an efficient end-to-end learned image compression framework that achieves state-of-the-art rate-distortion performance through unified hierarchical, spatial, and channel-based priors without using recurrent or transformer components.

Details

Motivation: Existing learned image compression methods often achieve high efficiency at the cost of increased computational complexity and limited parallelism. There's a need for frameworks that balance modeling accuracy with computational efficiency for practical deployment.

Method: ARCHE unifies hierarchical, spatial, and channel-based priors within a single probabilistic framework using adaptive feature recalibration and residual refinement. It captures global and local dependencies in latent representations without relying on recurrent or transformer components.

Result: ARCHE reduces BD-Rate by 48% vs Balle et al., 30% vs Minnen & Singh, and 5% vs VVC Intra on Kodak dataset. It maintains computational efficiency with 95M parameters and 222ms/image runtime, producing sharper textures and better color fidelity at low bit rates.

Conclusion: Accurate entropy modeling for image compression can be achieved through efficient convolutional designs without complex recurrent/transformer architectures, making learned compression practical for deployment while maintaining state-of-the-art performance.

Abstract: Recent progress in learning-based image compression has demonstrated that end-to-end optimization can substantially outperform traditional codecs by jointly learning compact latent representations and probabilistic entropy models. However, many existing approaches achieve high rate-distortion efficiency at the expense of increased computational cost and limited parallelism. This paper presents ARCHE - Autoregressive Residual Compression with Hyperprior and Excitation, an end-to-end learned image compression framework that balances modeling accuracy and computational efficiency. The proposed architecture unifies hierarchical, spatial, and channel-based priors within a single probabilistic framework, capturing both global and local dependencies in the latent representation of the image, while employing adaptive feature recalibration and residual refinement to enhance latent representation quality. Without relying on recurrent or transformer-based components, ARCHE attains state-of-the-art rate-distortion efficiency: it reduces the BD-Rate by approximately 48% relative to the commonly used benchmark model of Balle et al., 30% relative to the channel-wise autoregressive model of Minnen & Singh and 5% against the VVC Intra codec on the Kodak benchmark dataset. The framework maintains computational efficiency with 95M parameters and 222ms running time per image. Visual comparisons confirm sharper textures and improved color fidelity, particularly at lower bit rates, demonstrating that accurate entropy modeling can be achieved through efficient convolutional designs suitable for practical deployment.

[562] Semantic Satellite Communications for Synchronized Audiovisual Reconstruction

Fangyu Liu, Peiwen Jiang, Wenjin Wang, Chao-Kai Wen, Xiao Li, Shi Jin

Main category: eess.IV

TL;DR: Adaptive multimodal semantic transmission system for satellite communications that dynamically switches between video-driven audio generation and audio-driven video generation to reduce bandwidth while maintaining audiovisual synchronization.

Details

Motivation: Satellite communications struggle with high-fidelity synchronized audiovisual services due to fluctuating channel conditions, limited bandwidth, and long propagation delays. Conventional schemes fail to maintain cross-modal coherence under these constraints.

Method: Dual-stream generative architecture that flexibly switches between video-driven audio generation and audio-driven video generation, transmitting only the most important modality while using cross-modal generation to recover the other. Includes dynamic keyframe update mechanism and LLM-based decision module with satellite-specific knowledge to adapt to wireless scenarios and user requirements.

Result: Significantly reduces bandwidth consumption while achieving high-fidelity audiovisual synchronization, improving transmission efficiency and robustness in challenging satellite scenarios.

Conclusion: The proposed adaptive multimodal semantic transmission system effectively addresses satellite communication bottlenecks by dynamically optimizing cross-modal generation and transmission strategies based on real-time conditions.

Abstract: Satellite communications face severe bottlenecks in supporting high-fidelity synchronized audiovisual services, as conventional schemes struggle with cross-modal coherence under fluctuating channel conditions, limited bandwidth, and long propagation delays. To address these limitations, this paper proposes an adaptive multimodal semantic transmission system tailored for satellite scenarios, aiming for high-quality synchronized audiovisual reconstruction under bandwidth constraints. Unlike static schemes with fixed modal priorities, our framework features a dual-stream generative architecture that flexibly switches between video-driven audio generation and audio-driven video generation. This allows the system to dynamically decouple semantics, transmitting only the most important modality while employing cross-modal generation to recover the other. To balance reconstruction quality and transmission overhead, a dynamic keyframe update mechanism adaptively maintains the shared knowledge base according to wireless scenarios and user requirements. Furthermore, a large language model based decision module is introduced to enhance system adaptability. By integrating satellite-specific knowledge, this module jointly considers task requirements and channel factors such as weather-induced fading to proactively adjust transmission paths and generation workflows. Simulation results demonstrate that the proposed system significantly reduces bandwidth consumption while achieving high-fidelity audiovisual synchronization, improving transmission efficiency and robustness in challenging satellite scenarios.

[563] Regularizing INR with diffusion prior self-supervised 3D reconstruction of neutron computed tomography data

Maliha Hossain, Haley Duba-Sullivan, Amirkoushyar Ziabari

Main category: eess.IV

TL;DR: Diffusive INR (DINR) combines diffusion priors with implicit neural representations for high-quality sparse-view CT reconstruction, achieving superior performance on concrete microstructure imaging.

Details

Motivation: Traditional CT reconstruction methods suffer substantial degradation with sparse views. The paper aims to develop a framework that can achieve high-quality reconstruction from limited data by combining the strengths of generative diffusion priors and implicit neural representations.

Method: Proposes Diffusive INR (DINR), a framework that regularizes implicit neural representations (INRs) using generative diffusion priors for computed tomography inversion. The method is pretrained purely on synthetic data and applied to sparse-view neutron CT reconstruction.

Result: DINR delivers superior performance compared to state-of-the-art sparse-view reconstruction techniques, reduces reconstruction artifacts, and achieves gains in PSNR and SSIM metrics. It enables accurate micro-structural characterization even under extreme data limitations.

Conclusion: The combination of diffusion priors with implicit neural representations provides an effective framework for high-quality CT reconstruction from sparse views, demonstrating strong performance on both simulated and experimentally obtained observations of concrete microstructures.

Abstract: Recently, generative diffusion priors have made huge strides as inverse problem solvers, including the ability to be adapted for inference on out-of-distribution data. Concurrently, implicit neural representations (INRs) have emerged as fast and lightweight inverse imaging solvers that are amenable to hybrid approaches that combine learned priors with traditional inverse problem formulations. In this paper, we present a diffusive computed tomography (CT) inversion framework for regularizing INRs called Diffusive INR (DINR), designed to enable high-quality reconstruction from sparse-view neutron CT. Pretrained purely on synthetic data, DINR is evaluated on simulated and experimentally obtained observations of concrete microstructures, where traditional reconstruction methods suffer substantial degradation when the number of views is reduced. Our approach delivers superior performance, reduces reconstruction artifacts, and achieves gains in PSNR and SSIM, enabling accurate micro-structural characterization even under extreme data limitations compared to state-of-the-art sparse-view reconstruction techniques.

[564] Segmentation of Retinal Low-Cost Optical Coherence Tomography Images using Deep Learning

Timo Kepp, Helge Sudkamp, Claus von der Burchard, Hendrik Schenke, Peter Koch, Gereon Hüttmann, Johann Roider, Mattias P. Heinrich, Heinz Handels

Main category: eess.IV

TL;DR: Deep learning approach for segmenting retinal scans from a low-cost home OCT system to enable automated detection of AMD biomarkers for personalized treatment monitoring.

Details

Motivation: Current AMD treatment monitoring is insufficient due to non-personalized frequency and lack of home monitoring solutions. Automated computer-aided diagnosis is needed for home OCT systems to detect pathological changes using OCT-based biomarkers.

Method: Uses CNN to segment total retina and pigment epithelial detachments (PED) from self-examination low-cost full-field OCT scans, with convolutional denoising autoencoder refinement to correct segmentation errors from artifacts.

Result: CNN-based approach achieves high accuracy for retina segmentation, but PED segmentation proves challenging. CDAE refinement successfully corrects segmentation errors caused by OCT image artifacts.

Conclusion: Deep learning methods show promise for automated segmentation of home OCT retinal scans, enabling computer-aided diagnosis for AMD monitoring, though PED segmentation remains challenging and requires further improvement.

Abstract: The treatment of age-related macular degeneration (AMD) requires continuous eye exams using optical coherence tomography (OCT). The need for treatment is determined by the presence or change of disease-specific OCT-based biomarkers. Therefore, the monitoring frequency has a significant influence on the success of AMD therapy. However, the monitoring frequency of current treatment schemes is not individually adapted to the patient and therefore often insufficient. While a higher monitoring frequency would have a positive effect on the success of treatment, in practice it can only be achieved with a home monitoring solution. One of the key requirements of a home monitoring OCT system is a computer-aided diagnosis to automatically detect and quantify pathological changes using specific OCT-based biomarkers. In this paper, for the first time, retinal scans of a novel self-examination low-cost full-field OCT (SELF-OCT) are segmented using a deep learning-based approach. A convolutional neural network (CNN) is utilized to segment the total retina as well as pigment epithelial detachments (PED). It is shown that the CNN-based approach can segment the retina with high accuracy, whereas the segmentation of the PED proves to be challenging. In addition, a convolutional denoising autoencoder (CDAE) refines the CNN prediction, which has previously learned retinal shape information. It is shown that the CDAE refinement can correct segmentation errors caused by artifacts in the OCT image.

[565] Enhancing Brain Source Reconstruction by Initializing 3D Neural Networks with Physical Inverse Solutions

Marco Morik, Ali Hashemi, Klaus-Robert Müller, Stefan Haufe, Shinichi Nakajima

Main category: eess.IV

TL;DR: 3D-PIUNet: A hybrid deep learning approach for EEG source localization that combines physics-informed initialization with 3D convolutional U-Net refinement for improved spatial accuracy.

Details

Motivation: EEG source localization is challenging due to its ill-posed nature. Traditional methods use manual priors and lack flexibility, while deep learning approaches often ignore physical constraints. There's a need for methods that integrate both data-driven learning and physical principles.

Method: Hybrid approach combining physics-informed initialization (pseudo inverse mapping) with 3D convolutional U-Net refinement. Uses simulated pseudo-realistic brain source data covering various source distributions for training. Views brain as 3D volume to capture spatial dependencies.

Result: Significantly improves spatial accuracy over both traditional and end-to-end data-driven methods. Successfully validated on real EEG data from visual tasks, identifying visual cortex and reconstructing expected temporal behavior.

Conclusion: 3D-PIUNet effectively integrates traditional and deep learning techniques for EEG source localization, demonstrating practical applicability and superior performance through physics-informed initialization and data-driven refinement.

Abstract: Reconstructing brain sources is a fundamental challenge in neuroscience, crucial for understanding brain function and dysfunction. Electroencephalography (EEG) signals have a high temporal resolution. However, identifying the correct spatial location of brain sources from these signals remains difficult due to the ill-posed structure of the problem. Traditional methods predominantly rely on manually crafted priors, missing the flexibility of data-driven learning, while recent deep learning approaches focus on end-to-end learning, typically using the physical information of the forward model only for generating training data. We propose the novel hybrid method 3D-PIUNet for EEG source localization that effectively integrates the strengths of traditional and deep learning techniques. 3D-PIUNet starts from an initial physics-informed estimate by using the pseudo inverse to map from measurements to source space. Secondly, by viewing the brain as a 3D volume, we use a 3D convolutional U-Net to capture spatial dependencies and refine the solution according to the learned data prior. Training the model relies on simulated pseudo-realistic brain source data, covering different source distributions. Trained on this data, our model significantly improves spatial accuracy, demonstrating superior performance over both traditional and end-to-end data-driven methods. Additionally, we validate our findings with real EEG data from a visual task, where 3D-PIUNet successfully identifies the visual cortex and reconstructs the expected temporal behavior, thereby showcasing its practical applicability.

[566] GOUHFI 2.0: A Next-Generation Toolbox for Brain Segmentation and Cortex Parcellation at Ultra-High Field MRI

Marc-Antoine Fortin, Anne Louise Kristoffersen, Paal Erik Goa

Main category: eess.IV

TL;DR: GOUHFI 2.0 is an updated deep learning toolbox for automatic brain segmentation and cortical parcellation in Ultra-High Field MRI, addressing challenges of signal inhomogeneities and limited UHF-optimized tools.

Details

Motivation: Ultra-High Field MRI faces challenges in automatic brain segmentation and cortical parcellation due to signal inhomogeneities, heterogeneous contrasts/resolutions, and limited UHF-optimized tools, restricting quantitative analyses.

Method: Two independently trained 3D U-Net segmentation tasks: 1) whole-brain segmentation into 35 labels using domain-randomization strategy and 238-subject training data, 2) cortical parcellation into 62 DKT protocol labels using same training data.

Result: Improved segmentation accuracy over original toolbox, particularly in heterogeneous cohorts; produced reliable cortical parcellations; integrated volumetry pipeline yielded results consistent with standard workflows.

Conclusion: GOUHFI 2.0 provides comprehensive solution for brain segmentation, parcellation and volumetry across field strengths, constituting first deep-learning toolbox enabling robust cortical parcellation at UHF-MRI.

Abstract: Ultra-High Field MRI (UHF-MRI) is increasingly used in large-scale neuroimaging studies, yet automatic brain segmentation and cortical parcellation remain challenging due to signal inhomogeneities, heterogeneous contrasts and resolutions, and the limited availability of tools optimized for UHF data. Standard software packages such as FastSurferVINN and SynthSeg+ often yield suboptimal results when applied directly to UHF images, thereby restricting region-based quantitative analyses. To address this need, we introduce GOUHFI 2.0, an updated implementation of GOUHFI that incorporates increased training data variability and additional functionalities, including cortical parcellation and volumetry. GOUHFI 2.0 preserves the contrast- and resolution-agnostic design of the original toolbox while introducing two independently trained 3D U-Net segmentation tasks. The first performs whole-brain segmentation into 35 labels across contrasts, resolutions, field strengths and populations, using a domain-randomization strategy and a training dataset of 238 subjects. Using the same training data, the second network performs cortical parcellation into 62 labels following the Desikan-Killiany-Tourville (DKT) protocol. Across multiple datasets, GOUHFI 2.0 demonstrated improved segmentation accuracy relative to the original toolbox, particularly in heterogeneous cohorts, and produced reliable cortical parcellations. In addition, the integrated volumetry pipeline yielded results consistent with standard volumetric workflows. Overall, GOUHFI 2.0 provides a comprehensive solution for brain segmentation, parcellation and volumetry across field strengths, and constitutes the first deep-learning toolbox enabling robust cortical parcellation at UHF-MRI.

Editor’s Picks

[1] Calibration-Reasoning Framework for Descriptive Speech Quality Assessment

[2] Speech Codec Probing from Semantic and Phonetic Perspectives

[3] ID-LoRA: Identity-Driven Audio-Video Personalization with In-Context LoRA

Today’s Research Highlights

Table of Contents

cs.CL

[1] GhazalBench: Usage-Grounded Evaluation of LLMs on Persian Ghazals

[2] Large Language Models and Book Summarization: Reading or Remembering, Which Is Better?

[3] AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic

[4] An Efficient Hybrid Deep Learning Approach for Detecting Online Abusive Language

[5] The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

[6] Quantifying Hallucinations in Language Language Models on Medical Textbooks

[7] Evolving Demonstration Optimization for Chain-of-Thought Feature Transformation

[8] Causally Grounded Mechanistic Interpretability for LLMs with Faithful Natural-Language Explanations

[9] The System Hallucination Scale (SHS): A Minimal yet Effective Human-Centered Instrument for Evaluating Hallucination-Related Behavior in Large Language Models

[10] A Two-Stage Architecture for NDA Analysis: LLM-based Segmentation and Transformer-based Clause Classification

[11] PoultryLeX-Net: Domain-Adaptive Dual-Stream Transformer Architecture for Large-Scale Poultry Stakeholder Modeling

[12] Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors

[13] TAMUSA-Chat: A Domain-Adapted Large Language Model Conversational System for Research and Responsible Deployment

[14] CEI: A Benchmark for Evaluating Pragmatic Reasoning in Language Models

[15] Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives

[16] Context Over Compute Human-in-the-Loop Outperforms Iterative Chain-of-Thought Prompting in Interview Answer Quality

[17] There Are No Silly Questions: Evaluation of Offline LLM Capabilities from a Turkish Perspective

[18] Empathy Is Not What Changed: Clinical Assessment of Psychological Safety Across GPT Model Generations

[19] Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

[20] A Retrieval-Augmented Language Assistant for Unmanned Aircraft Safety Assessment and Regulatory Compliance

[21] Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America

[22] Beyond the Prompt in Large Language Models: Comprehension, In-Context Learning, and Chain-of-Thought

[23] SpreadsheetArena: Decomposing Preference in LLM Generation of Spreadsheet Workbooks

[24] Probing the Limits of the Lie Detector Approach to LLM Deception

[25] SENS-ASR: Semantic Embedding injection in Neural-transducer for Streaming Automatic Speech Recognition

[26] Fine-Tune, Don’t Prompt, Your Language Model to Identify Biased Language in Clinical Notes

[27] GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification

[28] Adaptive Engram Memory System for Indonesian Language Model: Generative AI Based on TOBA LM for Batak and Minang Language

[29] FERRET: Framework for Expansion Reliant Red Teaming

[30] GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification

[31] Measuring and Eliminating Refusals in Military Large Language Models

[32] Gemma Needs Help: Investigating and Mitigating Emotional Instability in LLMs

[33] Evaluating Progress in Graph Foundation Models: A Comprehensive Benchmark and New Insights

[34] A Principle-Driven Adaptive Policy for Group Cognitive Stimulation Dialogue for Elderly with Cognitive Impairment

[35] TriageSim: A Conversational Emergency Triage Simulation Framework from Structured Electronic Health Records

[36] The Prediction-Measurement Gap: Toward Meaning Representations as Scientific Instruments

[37] The Generation-Recognition Asymmetry: Six Dimensions of a Fundamental Divide in Formal Language Theory

[38] Reason and Verify: A Framework for Faithful Retrieval-Augmented Generation

[39] Lost in Backpropagation: The LM Head is a Gradient Bottleneck

[40] OpenClaw-RL: Train Any Agent Simply by Talking

[41] Adaptive Activation Cancellation for Hallucination Mitigation in Large Language Models

[42] ViDia2Std: A Parallel Corpus and Methods for Low-Resource Vietnamese Dialect-to-Standard Translation

[43] Sabiá-4 Technical Report

[44] S-GRADES – Studying Generalization of Student Response Assessments in Diverse Evaluative Settings

[45] GR-SAP: Generative Replay for Safety Alignment Preservation during Fine-Tuning

[46] Is this Idea Novel? An Automated Benchmark for Judgment of Research Ideas

[47] Large language models can disambiguate opioid slang on social media

[48] Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck

[49] Dynamic Knowledge Fusion for Multi-Domain Dialogue State Tracking

[50] Aligning Large Language Models with Searcher Preferences

[51] Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

[52] PEEM: Prompt Engineering Evaluation Metrics for Interpretable Joint Evaluation of Prompts and Responses

[53] Human-AI Co-reasoning for Clinical Diagnosis with Evidence-Integrated Language Agent

[54] VERI-DPO: Evidence-Aware Alignment for Clinical Summarization via Claim Verification and Direct Preference Optimization

[55] Safe and Scalable Web Agent Learning via Recreated Websites

[56] AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations

[57] Automatic End-to-End Data Integration using Large Language Models

[58] End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering

[59] MUNIChus: Multilingual News Image Captioning Benchmark

[60] Disentangling Similarity and Relatedness in Topic Models

[61] Making Bielik LLM Reason (Better): A Field Report

[62] Prism-$Δ$: Differential Subspace Steering for Prompt Highlighting in Large Language Models

[63] HeartAgent: An Autonomous Agent System for Explainable Differential Diagnosis in Cardiology

[64] mAceReason-Math: A Dataset of High-Quality Multilingual Math Problems Ready For RLVR

[65] Word Recovery in Large Language Models Enables Character-Level Tokenization Robustness

[66] Large Language Models as Annotators for Machine Translation Quality Estimation

[67] Interpretable Chinese Metaphor Identification via LLM-Assisted MIPVU Rule Script Generation: A Comparative Protocol Study

[68] LuxBorrow: From Pompier to Pompjee, Tracing Borrowing in Luxembourgish

[69] Multilingual Reasoning Gym: Multilingual Scaling of Procedural Reasoning Environments

[70] PivotAttack: Rethinking the Search Trajectory in Hard-Label Text Attacks via Pivot Words

[71] SiDiaC-v.2.0: Sinhala Diachronic Corpus Version 2.0

[72] An Extreme Multi-label Text Classification (XMTC) Library Dataset: What if we took “Use of Practical AI in Digital Libraries” seriously?

[73] From Images to Words: Efficient Cross-Modal Knowledge Distillation to Language Models from Black-box Teachers