Daily arXiv Papers - 2026-03-20

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] MOSS-TTS Technical Report

Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu

Main category: cs.SD

TL;DR: MOSS-TTS is a scalable speech generation foundation model using discrete audio tokens and autoregressive modeling, featuring two complementary generators for different deployment scenarios with multilingual support and various control capabilities.

Motivation: To create a scalable speech generation foundation model that can handle diverse requirements including zero-shot voice cloning, token-level duration control, multilingual support, and stable long-form generation through a unified architecture.

Method: Uses MOSS-Audio-Tokenizer (causal Transformer tokenizer) to compress 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations. Two complementary generators: MOSS-TTS for structural simplicity and scalability, and MOSS-TTS-Local-Transformer with frame-local autoregressive module for efficiency and speaker preservation.

Result: Supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation across multilingual and open-domain settings.

Conclusion: MOSS-TTS provides a scalable recipe for speech generation foundation models with complementary architectures for different deployment needs, demonstrating strong capabilities in multilingual generation and various control mechanisms.

Abstract: This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
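
The tokenizer's RVQ stage follows a standard pattern that can be sketched generically. The snippet below is a plain residual-vector-quantization illustration in numpy, not the released MOSS-Audio-Tokenizer; the codebook sizes and dimensions are arbitrary:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization of one latent frame: each stage
    quantizes the residual left over by the previous stage, so a
    prefix of the stages gives a lower bitrate, coarser encoding."""
    residual = np.asarray(frame, dtype=float).copy()
    codes = []
    for cb in codebooks:                       # cb: (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]          # next stage sees the leftover
    reconstruction = np.asarray(frame, dtype=float) - residual
    return codes, reconstruction

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]  # 4 stages, dim 8
frame = rng.normal(size=8)
codes, recon = rvq_encode(frame, codebooks)
```

Variable bitrate falls out naturally: decoding from only a prefix of the stage codes yields a coarser but still valid reconstruction.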

Relevance: 9/10

[2] DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Weisheng Xu, Ziteng Wang, Ruofan Liao, Yutong Zhang, Sichen Liu

Main category: cs.AI

TL;DR: DEAF benchmark evaluates Audio MLLMs’ acoustic faithfulness using 2,700+ conflict stimuli across emotional prosody, background sounds, and speaker identity to test if models genuinely process audio or rely on text semantics.

Motivation: Current Audio MLLMs show impressive speech benchmark performance, but it's unclear whether they actually process acoustic signals or just rely on text-based semantic inference. There's a need to systematically evaluate acoustic faithfulness and disentangle content-driven bias from prompt-induced sycophancy.

Method: Created DEAF benchmark with 2,700+ conflict stimuli across three acoustic dimensions. Designed controlled multi-level evaluation framework that progressively increases textual influence (semantic conflicts → misleading prompts → combination). Introduced diagnostic metrics to quantify model reliance on textual cues vs acoustic signals. Evaluated seven Audio MLLMs.

Result: Evaluation revealed consistent pattern of text dominance: models are sensitive to acoustic variations, but predictions are predominantly driven by textual inputs. Shows gap between high performance on standard speech benchmarks and genuine acoustic understanding.

Conclusion: Audio MLLMs exhibit text dominance over acoustic processing, revealing a fundamental limitation in current models’ acoustic understanding despite strong benchmark performance. The DEAF benchmark provides systematic evaluation framework for acoustic faithfulness.

Abstract: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding.
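
The paper's diagnostic metrics are not spelled out in this summary; as a hypothetical illustration, a minimal text-dominance score over conflict stimuli (items where the textual and acoustic cues disagree) might look like:

```python
def text_dominance_rate(predictions):
    """Fraction of conflict items on which the model sided with the
    textual cue rather than the acoustic one. Each item carries the
    model's prediction plus the labels implied by text and by audio."""
    conflicts = [p for p in predictions if p["text_label"] != p["audio_label"]]
    if not conflicts:
        return 0.0
    follows_text = sum(p["pred"] == p["text_label"] for p in conflicts)
    return follows_text / len(conflicts)

# Two conflict stimuli: the transcript implies "happy", the prosody "sad".
items = [
    {"pred": "happy", "text_label": "happy", "audio_label": "sad"},
    {"pred": "sad",   "text_label": "happy", "audio_label": "sad"},
]
rate = text_dominance_rate(items)  # half of the conflicts follow the text
```

A rate near 1.0 on such items would indicate the text dominance the authors report.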

Relevance: 9/10

[3] How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee

Main category: eess.AS

TL;DR: LLMs vary in auditory knowledge from text-only pre-training, and this knowledge strongly correlates with audio-grounded performance in Large Audio Language Models.

Motivation: To understand how much auditory knowledge LLMs encode through text-only pre-training and how this affects downstream audio performance, addressing a gap in current understanding of LLMs in audio research.

Method: Three evaluation settings: (1) direct probing on AKB-2000 benchmark for auditory knowledge breadth/depth, (2) cascade evaluation using text descriptions from audio captioner, and (3) audio-grounded evaluation by fine-tuning LLMs into LALMs with audio encoder.

Result: Auditory knowledge varies substantially across LLM families, and text-only results are strongly correlated with audio performance, providing empirical grounding for understanding LLMs in audio research.

Conclusion: The study reveals important insights about LLMs’ auditory knowledge from text-only training and its impact on audio-grounded models, offering guidance for selecting LLM backbones in audio research.

Abstract: Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
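
The reported text-audio relationship is a correlation across backbones, which reduces to a plain Pearson coefficient; the per-backbone accuracies below are invented purely for illustration:

```python
import numpy as np

# Hypothetical per-backbone accuracies: text-only probing on an
# AKB-2000-style benchmark vs. the same backbone after audio-grounded
# fine-tuning into a LALM (numbers are made up).
text_probe = np.array([0.42, 0.55, 0.61, 0.70, 0.74])
audio_grounded = np.array([0.39, 0.50, 0.58, 0.66, 0.73])

# A Pearson r near 1 would reflect the strong correlation the paper reports.
r = np.corrcoef(text_probe, audio_grounded)[0, 1]
```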

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Do Large Language Models Possess a Theory of Mind? A Comparative Evaluation Using the Strange Stories Paradigm

Anna Babarczy, Andras Lukacs, Peter Vedres, Zeteny Bujka

Main category: cs.CL

TL;DR: LLMs tested for Theory of Mind capabilities using text-based tasks; GPT-4o performs comparably to humans while earlier models show performance gaps.

Motivation: To investigate whether LLMs genuinely possess Theory of Mind capabilities (inferring beliefs, intentions, emotions) or merely perform superficial pattern completion, given they're trained only on language data without social embodiment.

Method: Tested five LLMs using adapted text-based ToM assessment tools from human psychology research, comparing performance to human controls on questions about story characters’ mental states.

Result: Earlier/smaller models showed performance gaps affected by inferential cues and distracting information, while GPT-4o achieved high accuracy and robustness comparable to humans even in challenging conditions.

Conclusion: GPT-4o demonstrates strong ToM capabilities in text-based tasks, raising questions about the boundary between genuine understanding and statistical approximation in LLMs.

Abstract: The study explores whether current Large Language Models (LLMs) exhibit Theory of Mind (ToM) capabilities – specifically, the ability to infer others’ beliefs, intentions, and emotions from text. Given that LLMs are trained on language data without social embodiment or access to other manifestations of mental representations, their apparent social-cognitive reasoning raises key questions about the nature of their understanding. Are they capable of robust mental-state attribution indistinguishable from human ability in its output, or do their outputs merely reflect superficial pattern completion? To address this question, we tested five LLMs and compared their performance to that of human controls using an adapted version of a text-based tool widely used in human ToM research. The test involves answering questions about the beliefs, intentions, and emotions of story characters. The results revealed a performance gap between the models. Earlier and smaller models were strongly affected by the number of relevant inferential cues available and, to some extent, were also vulnerable to the presence of irrelevant or distracting information in the texts. In contrast, GPT-4o demonstrated high accuracy and strong robustness, performing comparably to humans even in the most challenging conditions. This work contributes to ongoing debates about the cognitive status of LLMs and the boundary between genuine understanding and statistical approximation.

[2] TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang, Souhad Chbeir, Arpandeep Khatua, Sheng Wang, Sijun Tan, Kenan Ye, Lily Bailey, Merryn Daniel, Ryan Louie, Sanmi Koyejo, Ehsan Adeli

Main category: cs.CL

TL;DR: THERAPYGYM is a framework for evaluating and improving therapy chatbots using clinical metrics (fidelity to CBT techniques and safety) with automated scoring and RL training.

Motivation: Current evaluation methods for mental health LLMs (fluency metrics, preference tests, generic dialogue benchmarks) fail to capture clinically critical dimensions of psychotherapy needed for safe and effective therapy chatbots.

Method: Introduces THERAPYGYM with two clinical pillars: 1) Fidelity measured via Cognitive Therapy Rating Scale (CTRS) automated pipeline scoring CBT technique adherence, 2) Safety assessed via multi-label annotation scheme for therapy-specific risks. Includes THERAPYJUDGEBENCH validation set with expert ratings for auditing LLM judges. Uses RL training with CTRS and safety-based rewards on configurable patient simulations.

Result: Models trained in THERAPYGYM show significant improvement: average CTRS rises from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges), enabling development of therapy chatbots that are more faithful to evidence-based practice and safer.

Conclusion: THERAPYGYM enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes mental health applications, addressing critical gaps in current LLM evaluation methods.

Abstract: Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods–fluency metrics, preference tests, and generic dialogue benchmarks–fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.
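
The paper does not give its exact reward formulation; a minimal sketch of a combined CTRS-plus-safety RL reward, with an assumed per-violation penalty weighting, might look like:

```python
def therapy_reward(ctrs_score, safety_flags, safety_weight=1.0):
    """Scalar RL reward mixing fidelity and safety: a normalized CTRS
    score in [0, 1] minus a penalty for each flagged safety violation
    from the multi-label annotation scheme. Weighting is illustrative."""
    return ctrs_score - safety_weight * len(safety_flags)

good_turn = therapy_reward(0.60, [])                             # 0.60
risky_turn = therapy_reward(0.60, ["ignored_harm_disclosure"])   # -0.40
```

Under a reward like this, a turn that scores well on CBT adherence but misses a harm disclosure is still strongly penalized, which is the intent of pairing the two pillars.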

[3] How Confident Is the First Token? An Uncertainty-Calibrated Prompt Optimization Framework for Large Language Model Classification and Understanding

Wei Chen, Guoyang Ju, Yuanyuan Qi

Main category: cs.CL

TL;DR: LSFU is a first-token-based uncertainty metric that incorporates label priors to distinguish true certainty from spurious confidence, enabling uncertainty-calibrated prompt optimization with adaptive RAG triggering.

Motivation: LLMs have output uncertainty due to autoregressive generation, and conventional uncertainty measures fail to distinguish between spurious confidence from class priors and true certainty from contextual understanding, leading to poor confidence calibration for prompt optimization.

Method: Proposes Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss that incorporates label prior probabilities as a risk-modulation factor. Based on LSFU, develops UCPOF (uncertainty-calibrated prompt optimization framework) that uses first token outputs to select high-quality exemplars and dynamically optimize prompts with adaptive RAG triggering.

Result: UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces average retrieval trigger rate by 50.66% while maintaining state-of-the-art performance.

Conclusion: LSFU effectively measures uncertainty by accounting for class priors, and UCPOF enables efficient prompt optimization with adaptive RAG triggering, significantly reducing computational costs while improving performance.

Abstract: With the widespread adoption of large language models (LLMs) in natural language processing, prompt engineering and retrieval-augmented generation (RAG) have become mainstream to enhance LLMs’ performance on complex tasks. However, LLMs generate outputs autoregressively, leading to inevitable output uncertainty. Since model performance is highly sensitive to prompt design, precise uncertainty measurement is crucial for reliable prompt optimization. For multi-class multiple-choice (understanding) tasks, conventional uncertainty measures (e.g., entropy) based on output probabilities treat all classes equally and ignore class prior differences in pretraining corpora. This failure to distinguish spurious confidence (from priors) from true certainty (from contextual understanding) results in poor confidence calibration. To address this, we propose Log-Scale Focal Uncertainty (LSFU), a first-token-based metric inspired by focal loss. LSFU incorporates label prior probabilities as a risk-modulation factor to suppress noise from high-frequency classes and emphasize risk for low-frequency long-tail classes, with a dynamic weighting mechanism unifying the measurement scale. Based on LSFU, we further propose the uncertainty-calibrated prompt optimization framework (UCPOF), which leverages the first token of model outputs to select high-quality exemplars and dynamically optimize prompts. Comprehensive evaluations show UCPOF improves average accuracy by 6.03% over few-shot baselines, surpasses always-on full RAG by 5.75% in overall average accuracy, and reduces the average retrieval trigger rate by 50.66%. By adaptively triggering RAG only for high-uncertainty samples, our framework significantly lowers computational costs while maintaining state-of-the-art performance.
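
The exact LSFU formula is not given in this summary; the sketch below is a guessed focal-loss-style analogue in which a log-scale prior term modulates the surprise of the first output token, so that confidence on a high-prior class counts for less than the same confidence on a long-tail class:

```python
import math

def lsfu(p_first_token, class_prior, gamma=2.0):
    """Guessed focal-style uncertainty for the predicted class. Not the
    paper's formula: (1 - p)^gamma down-weights confident predictions
    (as in focal loss), -log(p) is first-token surprise, and -log(prior)
    inflates risk for rare classes whose confidence cannot be explained
    by corpus frequency alone."""
    nll = -math.log(max(p_first_token, 1e-12))
    prior_risk = -math.log(max(class_prior, 1e-12))
    return (1 - p_first_token) ** gamma * nll * prior_risk
```

Under this sketch, the same first-token probability yields higher uncertainty when the predicted class is rare, which is the calibration behavior the paper attributes to LSFU.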

[4] Agentic Framework for Political Biography Extraction

Yifei Zhu, Songpo Yang, Jiangnan Zhu, Junyan Jiang

Main category: cs.CL

TL;DR: LLM-based two-stage framework for automated extraction of elite biographies from web sources, outperforming human experts in accuracy and information synthesis.

Motivation: Automating large-scale political dataset creation from unstructured documents is traditionally expensive and difficult; LLMs offer potential for scalable automation.

Method: Two-stage “Synthesis-Coding” framework: 1) upstream synthesis using recursive agentic LLMs to search/filter/curate biographies from web sources, 2) downstream coding to map curated biographies into structured dataframes.

Result: LLM coders match/exceed human expert accuracy; agentic system synthesizes more information than Wikipedia; synthesis stage reduces bias from multi-language corpora.

Conclusion: Provides scalable framework for transparent, expansible political science databases using LLMs for complex information extraction tasks.

Abstract: The production of large-scale political datasets typically demands extracting structured facts from vast piles of unstructured documents or web sources, a task that traditionally relies on expensive human experts and remains prohibitively difficult to automate at scale. In this paper, we leverage Large Language Models (LLMs) to automate the extraction of multi-dimensional elite biographies, addressing a long-standing bottleneck in political science research. We propose a two-stage “Synthesis-Coding” framework for this complex extraction task: an upstream synthesis stage that uses recursive agentic LLMs to search, filter, and curate biographies from heterogeneous web sources, followed by a downstream coding stage that maps curated biographies into structured dataframes. We validate this framework through three primary results. First, we demonstrate that, when given curated contexts, LLM coders match or outperform human experts in extraction accuracy. Second, we show that in web environments, the agentic system synthesizes more information from web resources than human collective intelligence (Wikipedia). Finally, we diagnose that coding directly from long, multi-language corpora introduces bias that the synthesis stage can alleviate by curating evidence into signal-dense representations. Through comprehensive evaluation, we provide a generalizable, scalable framework for building transparent, expansible, large-scale databases in political science.
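
The two-stage pipeline reduces to a small skeleton in which the search, curation, and coding components are stand-in callables (all names and stubs here are hypothetical, not the paper's implementation):

```python
def extract_biography(name, search, curate, code_fields):
    """Two-stage 'Synthesis-Coding' skeleton: an upstream agent gathers
    and curates evidence, a downstream coder maps the curated biography
    into a structured, dataframe-ready record."""
    sources = search(name)               # agentic web search + filtering
    biography = curate(name, sources)    # synthesize signal-dense evidence
    return code_fields(biography)        # code into structured fields

# Stub components standing in for the LLM agents.
row = extract_biography(
    "Jane Doe",
    search=lambda n: ["profile: born 1950", "news: elected governor 1990"],
    curate=lambda n, srcs: " ".join(srcs),
    code_fields=lambda bio: {"birth_year": 1950 if "1950" in bio else None},
)
```

Separating synthesis from coding is what lets the downstream coder work from a short, curated context instead of the raw multi-language corpus.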

[5] Controllable Evidence Selection in Retrieval-Augmented Question Answering via Deterministic Utility Gating

Victor P. Unda

Main category: cs.CL

TL;DR: A deterministic evidence selection framework for retrieval-augmented QA that uses Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE) to evaluate text units independently before answer generation, ensuring only explicitly stated facts serve as evidence.

Motivation: Current retrieval-based QA systems rely on similarity scores that don't distinguish between topically similar text and usable evidence, leading to selection of redundant, incomplete, or condition-mismatched text when multiple candidates have similar scores.

Method: Introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE) - fixed scoring procedures that evaluate each sentence/record independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning required.

Result: The deterministic framework produces compact, auditable evidence sets by only accepting units that explicitly state required facts/rules/conditions. If no unit independently satisfies requirements, system returns no answer rather than using suboptimal evidence.

Conclusion: The approach establishes a clear boundary between relevant text and usable evidence through deterministic gating, improving evidence quality and auditability in retrieval-augmented question answering systems.

Abstract: Many modern AI question-answering systems convert text into vectors and retrieve the closest matches to a user question. While effective for topical similarity, similarity scores alone do not explain why some retrieved text can serve as evidence while other equally similar text cannot. When many candidates receive similar scores, systems may select sentences that are redundant, incomplete, or address different conditions than the question requires. This paper presents a deterministic evidence selection framework for retrieval-augmented question answering. The approach introduces Meaning-Utility Estimation (MUE) and Diversity-Utility Estimation (DUE), fixed scoring and redundancy-control procedures that determine evidence admissibility prior to answer generation. Each sentence or record is evaluated independently using explicit signals for semantic relatedness, term coverage, conceptual distinctiveness, and redundancy. No training or fine-tuning is required. In the prototype, a unit is accepted only if it explicitly states the fact, rule, or condition required by the task. Units are not merged or expanded. If no unit independently satisfies the requirement, the system returns no answer. This deterministic gating produces compact, auditable evidence sets and establishes a clear boundary between relevant text and usable evidence.
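
A minimal version of this kind of deterministic gate, using illustrative term-coverage and Jaccard-redundancy scoring rather than the paper's actual MUE/DUE formulas, might look like:

```python
def select_evidence(question_terms, units, min_coverage=0.6, max_overlap=0.8):
    """Deterministic evidence gate: each unit is scored independently on
    coverage of the (lowercased) question terms, and units too similar
    to an already-accepted unit are rejected as redundant. Thresholds
    are illustrative, not the paper's."""
    accepted = []
    q = set(question_terms)
    for unit in units:
        words = set(unit.lower().split())
        coverage = len(q & words) / len(q) if q else 0.0
        if coverage < min_coverage:
            continue  # does not explicitly cover what the question requires
        redundant = any(
            len(words & set(a.lower().split()))
            / len(words | set(a.lower().split())) > max_overlap
            for a in accepted
        )
        if not redundant:
            accepted.append(unit)
    return accepted  # empty list => the system returns no answer

units = ["The refund deadline is 30 days.", "Shipping is free worldwide."]
picked = select_evidence(["refund", "deadline"], units)
```

Because every unit is scored independently and nothing is merged or expanded, the accepted set is auditable, and an empty result cleanly maps to abstaining.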

[6] DynaRAG: Bridging Static and Dynamic Knowledge in Retrieval-Augmented Generation

Penghao Liang, Mengwei Yuan, Jianan Liu, Jing Yang, Xianyou Li, Weiran Yan, Yichao Wu

Main category: cs.CL

TL;DR: DynaRAG is a retrieval-augmented generation framework that dynamically integrates external APIs when static documents are insufficient, using LLM-based reranking, sufficiency classification, and API calling models to handle time-sensitive information needs.

Motivation: Traditional RAG systems rely on static corpora and struggle with time-sensitive information needs, requiring dynamic knowledge integration from external sources when retrieved documents are insufficient.

Method: Uses LLM-based reranker for document relevance assessment, sufficiency classifier to determine when fallback to external APIs is needed, Gorilla v2 for accurate API calling, and schema filtering via FAISS to guide API selection.

Result: Significantly improves accuracy on dynamic questions in the CRAG benchmark while reducing hallucinations, demonstrating the effectiveness of dynamic-aware routing and selective tool use.

Conclusion: Dynamic-aware routing and selective tool use are crucial for building reliable, real-world question-answering systems that can handle both static and time-sensitive information needs.

Abstract: We present DynaRAG, a retrieval-augmented generation (RAG) framework designed to handle both static and time-sensitive information needs through dynamic knowledge integration. Unlike traditional RAG pipelines that rely solely on static corpora, DynaRAG selectively invokes external APIs when retrieved documents are insufficient for answering a query. The system employs an LLM-based reranker to assess document relevance, a sufficiency classifier to determine when fallback is necessary, and Gorilla v2 – a state-of-the-art API calling model – for accurate tool invocation. We further enhance robustness by incorporating schema filtering via FAISS to guide API selection. Evaluations on the CRAG benchmark demonstrate that DynaRAG significantly improves accuracy on dynamic questions, while also reducing hallucinations. Our results highlight the importance of dynamic-aware routing and selective tool use in building reliable, real-world question-answering systems.
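
The routing logic reduces to a small skeleton; the four callables below are stand-ins for the components the paper describes (reranked retriever, sufficiency classifier, API-calling model, and generator), not its implementation:

```python
def dynarag_answer(query, retrieve, sufficient, call_api, generate):
    """Dynamic-aware routing skeleton: answer from the static corpus
    when retrieval suffices, otherwise fall back to a live API call
    for fresh, time-sensitive evidence."""
    docs = retrieve(query)
    if sufficient(query, docs):
        return generate(query, docs)
    fresh = call_api(query)              # e.g. a Gorilla-style tool call
    return generate(query, docs + [fresh])

# Stub components: the sufficiency classifier deems the static evidence
# inadequate, so the API fallback is triggered.
answer = dynarag_answer(
    "What is the stock price today?",
    retrieve=lambda q: ["static filing from last year"],
    sufficient=lambda q, d: False,
    call_api=lambda q: "live quote: 123.45",
    generate=lambda q, d: d,
)
```

Gating the API call on the classifier is what keeps tool use selective: static questions never pay the latency and hallucination risk of a live call.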

[7] Learned but Not Expressed: Capability-Expression Dissociation in Large Language Models

Toshiyuki Shigemura

Main category: cs.CL

TL;DR: LLMs can reconstruct training data under specific conditions but don’t express non-causal solutions in standard generation tasks, showing a dissociation between learned capability and expressed output.

Motivation: To examine the systematic dissociation between what LLMs can reconstruct from training data and what they actually generate in standard contexts, challenging the assumption that training data presence directly predicts output probability.

Method: Empirical observational study examining 300 prompt-response generations across narrative and problem-solving tasks using three distinct LLMs and ten task scenarios, analyzing expression of non-causal, non-implementable solution types.

Result: Zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]) despite verified reconstruction capability under conditional extraction, showing comprehensive suppression of learned content across diverse contexts.

Conclusion: Task-conditioned generation policies can completely suppress learned content, challenging assumptions about training data influence and offering insights into generation dynamics and output distribution control in LLMs.

Abstract: Large language models (LLMs) demonstrate the capacity to reconstruct and trace learned content from their training data under specific elicitation conditions, yet this capability does not manifest in standard generation contexts. This empirical observational study examines the expression of non-causal, non-implementable solution types across 300 prompt-response generations spanning narrative and problem-solving task contexts. Drawing on recent findings regarding memorization contiguity and alignment-induced discourse priors, we document a systematic dissociation between learned capability and expressed output. Across three distinct LLMs, ten task scenarios, and both creative narrative and practical advisory contexts, we documented zero instances of non-causal solution frames in generated outputs (0%, 95% CI: [0%, 1.2%]), despite verified reconstruction capability under conditional extraction. These findings challenge the prevailing assumption that training data presence directly predicts output probability, demonstrating instead that task-conditioned generation policies can comprehensively suppress learned content across diverse contexts. The results offer implications for understanding generation dynamics, output distribution control, and the behavioral boundaries of contemporary LLMs.
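
The reported interval is consistent with the exact (Clopper-Pearson) binomial bound for zero observed events, which is easy to verify:

```python
def cp_upper_zero(n, alpha=0.05):
    """Exact (Clopper-Pearson) upper limit of a two-sided 95% CI for a
    proportion when 0 events are observed in n trials: solving
    (1 - p)^n = alpha/2 for p gives 1 - (alpha/2)**(1/n)."""
    return 1 - (alpha / 2) ** (1 / n)

# 0 non-causal frames in 300 generations -> upper bound of roughly 1.2%,
# matching the paper's reported CI of [0%, 1.2%].
bound = cp_upper_zero(300)
```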

[8] Real-Time Trustworthiness Scoring for LLM Structured Outputs and Data Extraction

Hui Wen Goh, Jonas Mueller

Main category: cs.CL

TL;DR: CONSTRUCT is a method for scoring trustworthiness of LLM structured outputs in real-time to identify errors and prioritize human review, working with any LLM without training data or custom deployment.

Motivation: Current LLMs produce structured outputs with sporadic errors that hinder enterprise AI adoption. There's a need for real-time error detection in structured outputs to efficiently allocate limited human review resources.

Method: CONSTRUCT scores trustworthiness of LLM structured outputs and individual fields in real-time. It works with any LLM (including black-box APIs without logprobs), requires no labeled training data or custom model deployment, and handles complex nested JSON schemas.

Result: CONSTRUCT outperforms other scoring methods on a new public benchmark with reliable ground truth, detecting errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision and recall.

Conclusion: CONSTRUCT provides an effective solution for enterprise AI by enabling real-time error detection in LLM structured outputs, helping prioritize human review and improve reliability without requiring model access or training data.

Abstract: Structured Outputs from current LLMs exhibit sporadic errors, hindering enterprise AI efforts from realizing their immense potential. We present CONSTRUCT, a method to score the trustworthiness of LLM Structured Outputs in real-time, such that lower-scoring outputs are more likely to contain errors. This reveals the best places to focus limited human review bandwidth. CONSTRUCT additionally scores the trustworthiness of each field within a LLM Structured Output, helping reviewers quickly identify which parts of the output are wrong. Our method is suitable for any LLM (including black-box LLM APIs without logprobs such as reasoning models and Anthropic models), does not require labeled training data nor custom model deployment, and works for complex Structured Outputs with many fields of diverse types (including nested JSON schemas). We additionally present one of the first public LLM Structured Output benchmarks with reliable ground-truth values that are not full of mistakes. Over this four-dataset benchmark, CONSTRUCT detects errors from various LLMs (including Gemini 3 and GPT-5) with significantly higher precision/recall than other scoring methods.
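
CONSTRUCT's scoring method is not disclosed in this summary; as one hypothetical stand-in for per-field trust scoring (not the paper's technique), a field's value could be scored by its agreement rate across repeated extractions of the same document:

```python
from collections import Counter

def field_trust_scores(extractions):
    """Illustrative per-field trust score: for each field, the fraction
    of repeated extraction runs that agree on the modal value. Fields
    where the model is unstable across runs score low and can be
    prioritized for human review."""
    scores = {}
    for field in extractions[0]:
        values = [e.get(field) for e in extractions]
        modal_count = Counter(values).most_common(1)[0][1]
        scores[field] = modal_count / len(values)
    return scores

runs = [
    {"invoice_id": "A-17", "total": 120.0},
    {"invoice_id": "A-17", "total": 210.0},  # runs disagree on 'total'
    {"invoice_id": "A-17", "total": 120.0},
]
scores = field_trust_scores(runs)
```

Note this sketch requires multiple LLM calls per document, whereas CONSTRUCT is described as real-time; it only illustrates the idea of field-level trust.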

[9] Beyond Accuracy: An Explainability-Driven Analysis of Harmful Content Detection

Trishita Dhara, Siddhesh Sheth

Main category: cs.CL

TL;DR: Analysis of harmful content detection model using explainability methods reveals limitations not visible in aggregate metrics, showing how different explanation techniques highlight different failure modes in borderline cases.

Motivation: While harmful content detection systems are widely used, their decision logic is often opaque to moderators and users. Current research focuses on accuracy improvements rather than understanding why models classify content as harmful, especially in borderline, contextual, and politically sensitive cases.

Method: Analyzed a RoBERTa-based harmful content detection model trained on Civil Comments dataset using two post-hoc explanation methods: Shapley Additive Explanations (SHAP) and Integrated Gradients. Compared their attributions in both correct predictions and systematic failure cases.

Result: Despite strong overall performance (AUC 0.93, accuracy 0.94), the analysis revealed limitations not observable from aggregate metrics. Integrated Gradients extracted more diffuse contextual attributions, while SHAP focused on explicit lexical cues; the divergence between their outputs manifested as both false negatives and false positives. Qualitative case studies revealed failure modes such as indirect toxicity, lexical over-attribution, and misclassification of political discourse.

Conclusion: Explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and increasing interpretable rationale. Most importantly, explainability serves as a transparency and diagnostic resource for harmful content detection systems rather than just a performance-enhancing tool.

Abstract: Although automated harmful content detection systems are frequently used to monitor online platforms, moderators and end users often cannot understand the logic underlying their predictions. While recent studies have focused on increasing classification accuracy, little focus has been placed on comprehending why neural models identify content as harmful, especially when it comes to borderline, contextual, and politically sensitive situations. In this work, a neural harmful content detection model trained on the Civil Comments dataset is analyzed through an explainability-driven lens. Two popular post-hoc explanation methods, Shapley Additive Explanations and Integrated Gradients, are used to analyze the behavior of a RoBERTa-based classifier in both correct predictions and systematic failure cases. Despite strong overall performance, with an area under the curve of 0.93 and an accuracy of 0.94, the analysis reveals limitations that are not observable from aggregate evaluation metrics alone. Integrated Gradients appear to extract more diffuse contextual attributions, while Shapley Additive Explanations extract more focused attributions on explicit lexical cues. The consequent divergence in their outputs manifests in both false negatives and false positives. Qualitative case studies reveal recurring failure modes such as indirect toxicity, lexical over-attribution, and misclassification of political discourse. The results suggest that explainable AI can foster human-in-the-loop moderation by exposing model uncertainty and increasing the interpretable rationale behind automated decisions. Most importantly, this work highlights the role of explainability as a transparency and diagnostic resource for online harmful content detection systems rather than as a performance-enhancing lever.
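As background on one of the two attribution methods compared above: Integrated Gradients attributes a prediction by integrating the model's gradient along a straight path from a baseline input to the actual input. A minimal pure-Python sketch on a toy differentiable function (not the paper's RoBERTa setup; libraries such as Captum implement this for real models):

```python
def integrated_gradients(grad_f, x, baseline, steps=200):
    """Riemann-sum (midpoint) approximation of Integrated Gradients:
    attr_i = (x_i - baseline_i) * integral over alpha in [0, 1] of
    dF/dx_i evaluated at baseline + alpha * (x - baseline)."""
    attrs = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint rule
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(len(x)):
            attrs[i] += (x[i] - baseline[i]) * g[i] / steps
    return attrs
```

By the completeness axiom, the attributions sum to f(x) - f(baseline), which gives a cheap sanity check on any implementation.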

[10] DiscoPhon: Benchmarking the Unsupervised Discovery of Phoneme Inventories With Discrete Speech Units

Maxime Poli, Manel Khentout, Angelo Ortiz Tandazo, Ewan Dunbar, Emmanuel Chemla, Emmanuel Dupoux

Main category: cs.CL

TL;DR: DiscoPhon is a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units across 12 languages, with systems required to map discovered units to predefined phoneme inventories using limited speech data.

Motivation: There's a need for standardized evaluation of unsupervised phoneme discovery methods across diverse languages to understand how well current models can capture phonemic information from speech without supervision.

Method: The benchmark covers 6 development and 6 test languages spanning various phonemic contrasts. Systems are given only 10 hours of speech per unseen language and must produce discrete units mapped to predefined phoneme inventories via many-to-one or one-to-one assignments, evaluated on unit quality, recognition, and segmentation.

Result: Four pretrained multilingual HuBERT and SpidR baselines show that phonemic information is sufficiently available in current models for derived units to correlate well with phonemes, though performance varies across languages.

Conclusion: DiscoPhon provides a standardized framework for evaluating unsupervised phoneme discovery, revealing that current models capture phonemic information effectively but with language-dependent variations.

Abstract: We introduce DiscoPhon, a multilingual benchmark for evaluating unsupervised phoneme discovery from discrete speech units. DiscoPhon covers 6 dev and 6 test languages, chosen to span a wide range of phonemic contrasts. Given only 10 hours of speech in a previously unseen language, systems must produce discrete units that are mapped to a predefined phoneme inventory, through either a many-to-one or a one-to-one assignment. The resulting sequences are evaluated for unit quality, recognition, and segmentation. We provide four pretrained multilingual HuBERT and SpidR baselines, and show that enough phonemic information is available in current models for derived units to correlate well with phonemes, though with variation across languages.
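The many-to-one evaluation protocol has a standard recipe: align discovered units with reference phonemes, map each unit to the phoneme it co-occurs with most often, then score the mapped sequence. A toy sketch of that mapping, assuming the sequences are already frame-aligned (the benchmark's actual alignment and metrics are simplified away here):

```python
from collections import Counter

def many_to_one_map(unit_seq, phoneme_seq):
    """Map each discovered unit to its most frequent co-occurring phoneme
    (sequences assumed frame-aligned), then score the mapped sequence
    against the reference phonemes."""
    cooc = {}
    for u, p in zip(unit_seq, phoneme_seq):
        cooc.setdefault(u, Counter())[p] += 1
    mapping = {u: c.most_common(1)[0][0] for u, c in cooc.items()}
    mapped = [mapping[u] for u in unit_seq]
    accuracy = sum(m == p for m, p in zip(mapped, phoneme_seq)) / len(phoneme_seq)
    return mapping, accuracy
```

Because several units may map onto the same phoneme, many-to-one scores are an upper bound relative to the stricter one-to-one assignment.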

[11] MineDraft: A Framework for Batch Parallel Speculative Decoding

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

Main category: cs.CL

TL;DR: MineDraft is a batch parallel speculative decoding framework that overlaps drafting and verification stages to accelerate large language model inference, achieving up to 75% throughput improvement over standard speculative decoding.

Motivation: Standard speculative decoding suffers from performance limitations due to strictly sequential execution of drafting and verification stages, creating latency bottlenecks in LLM inference systems.

Method: Proposes batch parallel speculative decoding (PSD) with MineDraft framework that maintains two batches of requests to overlap drafting for one batch with verification for another, implemented as a plugin for vLLM.

Result: Achieves significant improvements: up to 75% higher throughput and up to 39% lower end-to-end latency compared to standard speculative decoding, with practical implementation in production-ready systems.

Conclusion: MineDraft effectively hides drafting latency through parallel execution, making speculative decoding substantially more efficient for LLM inference acceleration.

Abstract: Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.
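A back-of-the-envelope cost model shows why overlapping the two stages helps: with two batches in flight, each step costs roughly the slower of the two stages instead of their sum. This is an illustrative model with hypothetical timings, not the paper's theoretical analysis (which must also account for draft acceptance rates):

```python
def sequential_sd_time(n_steps, t_draft, t_verify):
    """Standard speculative decoding: draft, then verify, strictly in series."""
    return n_steps * (t_draft + t_verify)

def parallel_sd_time(n_steps, t_draft, t_verify):
    """Batch-parallel SD sketch: with two request batches, batch A's drafting
    overlaps batch B's verification, so after a one-time pipeline fill of
    t_draft each step costs only max(t_draft, t_verify)."""
    return t_draft + n_steps * max(t_draft, t_verify)
```

In this toy model the drafting cost is fully hidden whenever drafting is no slower than verification, which is the usual regime for a small draft model and a large target model.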

[12] An Agentic System for Schema Aware NL2SQL Generation

David Onyango, Naseef Mansoor

Main category: cs.CL

TL;DR: A schema-based agentic system for NL2SQL that uses Small Language Models as primary agents with selective LLM fallback to reduce computational costs while maintaining accuracy.

Motivation: Current NL2SQL frameworks rely heavily on Large Language Models, which raises concerns about computational overhead, data privacy, and deployability in resource-constrained environments. There's a need for more efficient systems that can democratize data access while being practical for real-world deployment.

Method: Proposes a schema-based agentic system that strategically employs Small Language Models as primary agents, complemented by a selective LLM fallback mechanism. The LLM is only invoked when errors are detected in SLM-generated output, minimizing computational expenditure.

Result: Achieves 47.78% execution accuracy and 51.05% validation efficiency score on BIRD benchmark. Reduces costs by over 90% compared to LLM-centric baselines, with approximately 67% of queries resolved using local SLMs. Average cost per query is $0.0085 vs $0.094 for LLM-only systems.

Conclusion: The proposed system demonstrates that strategic use of SLMs with selective LLM fallback can significantly reduce computational costs while maintaining reasonable accuracy for NL2SQL tasks, making it more deployable in resource-constrained environments.

Abstract: The natural language to SQL (NL2SQL) task plays a pivotal role in democratizing data access by enabling non-expert users to interact with relational databases through intuitive language. While recent frameworks have enhanced translation accuracy via task specialization, their reliance on Large Language Models (LLMs) raises significant concerns regarding computational overhead, data privacy, and real-world deployability in resource-constrained environments. To address these challenges, we propose a schema-based agentic system that strategically employs Small Language Models (SLMs) as primary agents, complemented by a selective LLM fallback mechanism. Because the LLM is invoked only upon detection of errors in SLM-generated output, the proposed system significantly minimizes computational expenditure. Experimental results on the BIRD benchmark demonstrate that our system achieves an execution accuracy of 47.78% and a validation efficiency score of 51.05%, with over 90% cost reduction compared to LLM-centric baselines, as approximately 67% of queries are resolved using local SLMs. The system achieves an average cost per query of $0.0085, compared to $0.094 for LLM-only systems, with near-zero operational costs for locally executed queries. [GitHub repository: https://github.com/mindslab25/CESMA]
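The routing logic described above (SLM first, LLM only on detected errors) can be sketched in a few lines; the callables here are hypothetical stand-ins for the system's actual agents and validator, not its implementation:

```python
def generate_sql(question, slm, llm, validate):
    """Route a question to the small model first; escalate to the LLM only
    when the validator flags the SLM's SQL as erroneous. Also returns which
    model answered, so per-model costs can be accounted for."""
    sql = slm(question)
    if validate(sql):
        return sql, "slm"
    return llm(question), "llm"
```

The reported cost savings follow directly from this structure: the expensive model is billed only for the fraction of queries the validator rejects.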

[13] BenchBrowser – Collecting Evidence for Evaluating Benchmark Validity

Harshita Diddee, Gregory Yauney, Swabha Swayamdipta, Daphne Ippolito

Main category: cs.CL

TL;DR: BenchBrowser is a retrieval system that helps practitioners find specific evaluation items across 20+ benchmark suites to better align benchmark testing with actual user needs and identify validity gaps.

Motivation: Current language model benchmarks have opaque content - high-level metadata doesn't reveal what specific skills are actually tested, creating an illusion of competence when models may fail on untested but relevant facets of user interests.

Method: Developed BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases across 20+ benchmark suites. Validated retrieval precision through human studies and used it to diagnose content validity (narrow coverage) and convergent validity (unstable rankings).

Result: BenchBrowser successfully identifies critical gaps between practitioner intent and what benchmarks actually test, with human studies confirming high retrieval precision for relevant evaluation items.

Conclusion: BenchBrowser provides a practical tool to quantify and address the mismatch between benchmark content and practitioner needs, helping improve evaluation validity for language models.

Abstract: Do language model benchmarks actually measure what practitioners intend them to? High-level metadata is too coarse to convey the granular reality of benchmarks: a “poetry” benchmark may never test for haikus, while “instruction-following” benchmarks will often test for an arbitrary mix of skills. This opacity makes verifying alignment with practitioner goals a laborious process, risking an illusion of competence even when models fail on untested facets of user interests. We introduce BenchBrowser, a retriever that surfaces evaluation items relevant to natural language use cases over 20 benchmark suites. Validated by a human study confirming high retrieval precision, BenchBrowser generates evidence to help practitioners diagnose low content validity (narrow coverage of a capability’s facets) and low convergent validity (lack of stable rankings when measuring the same capability). BenchBrowser thus helps quantify a critical gap between practitioner intent and what benchmarks actually test.
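BenchBrowser's core operation, surfacing evaluation items relevant to a natural-language use case, is a retrieval problem. A minimal sketch assuming precomputed embeddings for the query and the benchmark items (the system's actual retriever and embedding model are not specified in this summary):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, items, k=2):
    """Rank benchmark items by embedding similarity to a use-case query
    and return the ids of the top-k matches."""
    ranked = sorted(items, key=lambda it: cosine(query_vec, it["vec"]), reverse=True)
    return [it["id"] for it in ranked[:k]]
```

Inspecting the retrieved items (rather than trusting a benchmark's name) is what lets a practitioner see whether, say, a "poetry" suite ever actually tests haikus.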

[14] Evaluating FrameNet-Based Semantic Modeling for Gender-Based Violence Detection in Clinical Records

Lívia Dutra, Arthur Lorenzi, Frederico Belcavello, Ely Matos, Marcelo Viridiano, Lorena Larré, Olívia Guaranha, Erik Santos, Sofia Reinach, Pedro de Paula, Tiago Torrent

Main category: cs.CL

TL;DR: FrameNet-based semantic annotation of clinical narratives improves identification of gender-based violence patterns in electronic medical records compared to structured data alone.

Motivation: Gender-based violence (GBV) is a major public health issue with significant underreporting in Brazil despite legal requirements, due to difficulties in identifying abuse and limited integration between public information systems.

Method: Used FrameNet-based semantic annotation of open-text fields in electronic medical records, compared SVM classifier performance trained on: (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone.

Result: Models incorporating semantic annotation outperformed categorical models, achieving over 0.3 improvement in F1 score, showing domain-specific semantic representations provide meaningful signals beyond structured demographic data.

Conclusion: Semantic analysis of clinical narratives can enhance early identification strategies for GBV and support more informed public health interventions.

Abstract: Gender-based violence (GBV) is a major public health issue, with the World Health Organization estimating that one in three women experiences physical or sexual violence by an intimate partner during her lifetime. In Brazil, although healthcare professionals are legally required to report such cases, underreporting remains significant due to difficulties in identifying abuse and limited integration between public information systems. This study investigates whether FrameNet-based semantic annotation of open-text fields in electronic medical records can support the identification of patterns of GBV. We compare the performance of an SVM classifier for GBV cases trained on (1) frame-annotated text, (2) annotated text combined with parameterized data, and (3) parameterized data alone. Quantitative and qualitative analyses show that models incorporating semantic annotation outperform categorical models, achieving over 0.3 improvement in F1 score and demonstrating that domain-specific semantic representations provide meaningful signals beyond structured demographic data. The findings support the hypothesis that semantic analysis of clinical narratives can enhance early identification strategies and support more informed public health interventions.

[15] How LLMs Distort Our Written Language

Marwa Abdulhai, Isadora White, Yanming Wan, Ibrahim Qureshi, Joel Leibo, Max Kleiman-Weiner, Natasha Jaques

Main category: cs.CL

TL;DR: LLMs significantly alter the meaning and voice of human writing, with heavy users reporting less creativity and neutrality, and AI-generated scientific reviews showing different evaluation patterns.

Motivation: To understand how LLMs affect human writing beyond just style and tone, specifically examining whether they alter intended meaning and how this impacts real-world applications like scientific peer review.

Method: 1) Human user study on LLM-assisted writing; 2) Analysis of pre-LLM essays revised with LLMs using existing feedback; 3) Examination of AI-generated peer reviews from a top AI conference.

Result: Heavy LLM use increased neutral essays by 70%, reduced creativity and personal voice; LLMs altered semantic meaning even when prompted for grammar-only edits; AI-generated reviews placed less weight on clarity/significance and gave higher scores.

Conclusion: LLMs consistently alter the semantics of human writing, creating misalignment between perceived benefits and actual effects, raising concerns about AI’s impact on cultural and scientific institutions.

Abstract: Large language models (LLMs) are used by over a billion people globally, most often to assist with writing. In this work, we demonstrate that LLMs not only alter the voice and tone of human writing, but also consistently alter the intended meaning. First, we conduct a human user study to understand how people actually interact with LLMs when using them for writing. Our findings reveal that extensive LLM use led to a nearly 70% increase in essays that remained neutral in answering the topic question. Significantly more heavy LLM users reported that the writing was less creative and not in their voice. Next, using a dataset of human-written essays that was collected in 2021 before the widespread release of LLMs, we study how asking an LLM to revise the essay based on the human-written feedback in the dataset induces large changes in the resulting content and meaning. We find that even when LLMs are prompted with expert feedback and asked to only make grammar edits, they still change the text in a way that significantly alters its semantic meaning. We then examine LLM-generated text in the wild, specifically focusing on the 21% of AI-generated scientific peer reviews at a recent top AI conference. We find that LLM-generated reviews place significantly less weight on clarity and significance of the research, and assign scores that, on average, are a full point higher. These findings highlight a misalignment between the perceived benefit of AI use and an implicit, consistent effect on the semantics of human writing, motivating future work on how widespread AI writing will affect our cultural and scientific institutions.

[16] Zipper-LoRA: Dynamic Parameter Decoupling for Speech-LLM based Multilingual Speech Recognition

Yuxiang Mei, Delai Qiu, Shengping Liu, Jiaen Liang, Yanhua Long

Main category: cs.CL

TL;DR: Zipper-LoRA: A rank-level decoupling framework for multilingual Speech-LLMs that dynamically combines shared and language-specific LoRA subspaces to address the stability-plasticity dilemma in imbalanced multilingual ASR.

Motivation: Multilingual Speech-LLMs face challenges with imbalanced data distributions, causing a stability-plasticity dilemma: fully shared PEFT causes negative interference for under-represented languages, while fully language-specific tuning limits beneficial cross-lingual knowledge transfer for low-resource tasks.

Method: Proposes Zipper-LoRA with three variants (Static, Hard, Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces using a lightweight language-conditioned router. Introduces a two-stage training strategy with Initial-B warm start to stabilize optimization under imbalanced data.

Result: Experiments on 12-language mixed-resource setting show Zipper-LoRA consistently outperforms both fully shared and independent baselines, especially in extremely low-resource scenarios. Gains are robust across both chunked and non-chunked encoder configurations.

Conclusion: Zipper-LoRA effectively addresses the stability-plasticity dilemma in multilingual Speech-LLMs, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur, making it reliable for practical large-scale multilingual ASR.

Abstract: Speech Large Language Models (Speech-LLMs) have emerged as a powerful approach for automatic speech recognition (ASR) by aligning speech encoders with large language models. However, adapting these systems to multilingual settings with imbalanced data distributions remains challenging. In such scenarios, a stability-plasticity dilemma often arises: fully shared Parameter-Efficient Fine-Tuning (PEFT) can cause negative inter-lingual interference for under-represented languages, while fully language-specific tuning limits the beneficial cross-lingual knowledge transfer needed for low-resource tasks. To address this, we propose Zipper-LoRA, a novel rank-level decoupling framework with three variants (Static, Hard, and Soft) that dynamically synthesizes LoRA updates from shared and language-specific subspaces. By using a lightweight language-conditioned router, Zipper-LoRA dynamically controls the contribution of each subspace at the LoRA rank level, enabling fine-grained sharing where languages are compatible and strict decoupling when conflicts occur. To further stabilize optimization under imbalanced data, we propose a two-stage training strategy with an Initial-B warm start that significantly accelerates convergence. Experiments on a 12-language mixed-resource setting show that Zipper-LoRA consistently outperforms both fully shared and independent baselines, particularly in extremely low-resource scenarios. Moreover, we demonstrate that these gains are robust across both chunked and non-chunked encoder configurations, confirming the framework’s reliability for practical, large-scale multilingual ASR. Our code and data will be available at https://github.com/YuCeong-May/Zipper-LoRA for reproducibility.
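The rank-level idea can be made concrete: a LoRA update factors as B @ A, and a language-conditioned gate scales each rank's contribution, so gate values of 1 and 0 correspond to a fully shared versus a fully decoupled rank (a Soft variant would presumably use fractional, learned gates). A pure-Python sketch with toy matrices; the routing details are assumptions, not the paper's exact formulation:

```python
def lora_delta(x, A, B, gate):
    """Rank-level gated LoRA update: each rank r contributes
    gate[r] * B[:, r] * (A[r] . x). A has one row per rank; B has one
    column per rank. gate comes from a language-conditioned router."""
    out = [0.0] * len(B)
    for r, (a_row, g) in enumerate(zip(A, gate)):
        s = g * sum(a * xi for a, xi in zip(a_row, x))  # gated scalar A_r . x
        for i in range(len(B)):
            out[i] += B[i][r] * s
    return out
```

Zeroing a rank's gate removes exactly that rank-one component of the update, which is what lets the router share some directions across languages while isolating the conflicting ones.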

[17] Modeling the human lexicon under temperature variations: linguistic factors, diversity and typicality in LLM word associations

Maria Andueza Rodriguez, Marie Candito, Richard Huyghe

Main category: cs.CL

TL;DR: LLMs capture some human lexical patterns but differ in response variability and typicality, with larger models producing more typical but less variable responses, influenced by temperature settings.

Motivation: To evaluate how accurately large language models capture human lexical knowledge by comparing word associations between humans and LLMs, examining the nature of their internal lexicons.

Method: Used English cue-response pairs from SWOW dataset and generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, Qwen-2.5-32B) across multiple temperature settings. Analyzed influence of lexical factors (frequency, concreteness) and compared response variability and typicality.

Result: All models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models (Qwen) emulate a single “prototypical” human participant with highly typical but minimally variable responses, while smaller models produce more variable yet less typical responses. Temperature settings affect this trade-off.

Conclusion: LLMs show both similarities and differences compared to human lexicons, with model size and temperature being important factors when probing lexical representations. Larger models produce more typical but less diverse responses.

Abstract: Large language models (LLMs) achieve impressive results in terms of fluency in text generation, yet the nature of their linguistic knowledge - in particular the human-likeness of their internal lexicon - remains uncertain. This study compares human and LLM-generated word associations to evaluate how accurately models capture human lexical patterns. Using English cue-response pairs from the SWOW dataset and newly generated associations from three LLMs (Mistral-7B, Llama-3.1-8B, and Qwen-2.5-32B) across multiple temperature settings, we examine (i) the influence of lexical factors such as word frequency and concreteness on cue-response pairs, and (ii) the variability and typicality of LLM responses compared to human responses. Results show that all models mirror human trends for frequency and concreteness but differ in response variability and typicality. Larger models such as Qwen tend to emulate a single “prototypical” human participant, generating highly typical but minimally variable responses, while smaller models such as Mistral and Llama produce more variable yet less typical responses. Temperature settings further influence this trade-off, with higher values increasing variability but decreasing typicality. These findings highlight both the similarities and differences between human and LLM lexicons, emphasizing the need to account for model size and temperature when probing LLM lexical representations.
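The temperature effect the study measures comes directly from temperature-scaled softmax sampling: dividing logits by T > 1 flattens the next-token distribution (more variable responses), while T < 1 sharpens it (more typical ones). A minimal illustration, using distribution entropy as a proxy for response variability:

```python
import math

def softmax_temp(logits, T):
    """Temperature-scaled softmax: p_i proportional to exp(logit_i / T)."""
    exps = [math.exp(l / T) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means a flatter, more varied distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)
```

The trade-off reported above falls out of this mechanism: raising T spreads probability over less typical associations, while lowering T concentrates it on the single most "prototypical" response.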

[18] GRAFITE: Generative Regression Analysis Framework for Issue Tracking and Evaluation

Ja Young Lee, Mírian Silva, Mohamed Nasr, Shonda Witherspoon, Enzo Bozzani, Veronique Demers, Radha Ratnaparkhi, Hui Wu, Sara Rosenthal

Main category: cs.CL

TL;DR: GRAFITE is a continuous evaluation platform for LLMs that addresses benchmark contamination by building a repository of model problems from user feedback and using LLM-as-a-judge QA tests for ongoing assessment.

Motivation: LLM performance evaluation suffers from benchmark contamination over time as training data increasingly includes benchmark examples, risking inflated performance metrics if testing isn't carefully managed.

Method: Creates a continuous evaluation system that builds a repository of model problems from user feedback, implements LLM-as-a-judge QA tests for assessment, and enables side-by-side comparison of multiple models across different releases.

Result: Developed GRAFITE platform that provides systematic evaluation framework, regression detection capabilities, and open-source availability for the research community.

Conclusion: GRAFITE addresses LLM evaluation challenges by providing a continuous assessment platform that mitigates benchmark contamination issues through systematic problem tracking and LLM-judged testing.

Abstract: The development of large language models (LLMs) is largely motivated by their performance on popular topics and benchmarks at the time of their release. However, over time, contamination occurs due to significant exposure of benchmark data during training. This poses a risk of model performance inflation if testing is not carefully executed. To address this challenge, we present GRAFITE, a continuous LLM evaluation platform built around a comprehensive system for maintaining and evaluating model issues. Our approach enables building a repository of model problems based on user feedback over time and offers a pipeline for assessing LLMs against these issues through quality assurance (QA) tests using LLM-as-a-judge. The platform enables side-by-side comparison of multiple models, facilitating regression detection across different releases. The platform is available at https://github.com/IBM/grafite. The demo video is available at www.youtube.com/watch?v=XFZyoleN56k.
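The regression-detection step can be sketched as: run each tracked issue's QA test against two releases and flag issues that pass under the old model but fail under the new one. The judge below is a hypothetical stand-in for GRAFITE's LLM-as-a-judge, reduced to a pass/fail callable:

```python
def regression_check(old_answers, new_answers, judge):
    """Flag issues that the previous model release passed but the new
    release fails, given each release's answers keyed by issue id and a
    judge callable returning True on a passing answer."""
    regressions = []
    for issue, old_ans in old_answers.items():
        if judge(old_ans) and not judge(new_answers[issue]):
            regressions.append(issue)
    return regressions
```

Because the issue repository is built from user feedback rather than public benchmarks, these tests are less exposed to the training-data contamination the paper is concerned with.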

[19] CWoMP: Morpheme Representation Learning for Interlinear Glossing

Morris Alper, Enora Rice, Bhargav Shandilya, Alexis Palmer, Lori Levin

Main category: cs.CL

TL;DR: CWoMP is a contrastive pretraining method for automated interlinear glossed text generation that treats morphemes as atomic units with learned representations, outperforming existing methods especially in low-resource settings.

Motivation: Interlinear glossed text (IGT) is linguistically rich but labor-intensive to produce manually. Existing automated methods treat glosses as character sequences, ignoring their compositional structure, which limits performance and interpretability.

Method: CWoMP uses contrastive pretraining to align words-in-context with constituent morphemes in a shared embedding space. An autoregressive decoder generates morpheme sequences by retrieving entries from a mutable lexicon of learned morpheme embeddings.

Result: CWoMP outperforms existing methods on diverse low-resource languages while being significantly more efficient. It shows particularly strong gains in extremely low-resource settings and allows users to improve results at inference time by expanding the lexicon without retraining.

Conclusion: Treating morphemes as atomic form-meaning units with learned representations enables more effective automated IGT generation, especially for low-resource languages, with the added benefits of interpretability and flexible lexicon expansion.

Abstract: Interlinear glossed text (IGT) is a standard notation for language documentation which is linguistically rich but laborious to produce manually. Recent automated IGT methods treat glosses as character sequences, neglecting their compositional structure. We propose CWoMP (Contrastive Word-Morpheme Pretraining), which instead treats morphemes as atomic form-meaning units with learned representations. A contrastively trained encoder aligns words-in-context with their constituent morphemes in a shared embedding space; an autoregressive decoder then generates the morpheme sequence by retrieving entries from a mutable lexicon of these embeddings. Predictions are interpretable, grounded in lexicon entries, and users can improve results at inference time by expanding the lexicon without retraining. We evaluate on diverse low-resource languages, showing that CWoMP outperforms existing methods while being significantly more efficient, with particularly strong gains in extremely low-resource settings.
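The retrieval-based decoding step explains why the lexicon can grow at inference time: producing the next gloss reduces to a nearest-neighbor lookup over morpheme embeddings, so inserting a new entry immediately makes it a retrieval candidate. A toy sketch (the embedding values and gloss labels are invented, and real systems would use the model's learned embeddings):

```python
import math

def nearest_morpheme(query, lexicon):
    """Return the lexicon entry whose embedding is closest (by cosine
    similarity) to the decoder's query vector. Mutating `lexicon` at
    inference time adds retrievable morphemes with no retraining."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    return max(lexicon, key=lambda m: cos(query, lexicon[m]))
```

Since every prediction is grounded in a specific lexicon entry, a documentary linguist can audit which morpheme the model retrieved and why.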

[20] How Psychological Learning Paradigms Shaped and Constrained Artificial Intelligence

Alex Anvi Eponon, Ildar Batyrshin, Christian E. Maldonado-Sifuentes, Grigori Sidorov

Main category: cs.CL

TL;DR: The paper critiques how AI paradigms inherited limitations from their psychological inspirations and proposes ReSynth, a trimodular framework separating reasoning, purpose, and knowledge to achieve systematic adaptability for AGI.

Motivation: The paper argues that current AI paradigms (reinforcement learning, deep learning, integrative approaches) have inherited structural limitations from their psychological inspirations (behaviorism, cognitivism, constructivism), preventing them from achieving true adaptability and systematic understanding needed for AGI.

Method: The paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. It draws on cross-cultural perspectives on rote learning and critiques from the systematicity debate to propose this new architecture.

Result: The paper provides a theoretical framework that addresses limitations in current AI paradigms by proposing an architecture where systematic behavior emerges as a necessary consequence rather than an accidental property, potentially enabling true adaptability for AGI.

Conclusion: Achieving artificial general intelligence requires moving beyond inherited psychological limitations through architectures like ReSynth that separate reasoning, purpose, and knowledge, enabling systematic adaptability rather than just pattern matching.

Abstract: The dominant paradigms of artificial intelligence were shaped by learning theories from psychology: behaviorism inspired reinforcement learning, cognitivism gave rise to deep learning and memory-augmented architectures, and constructivism influenced curriculum learning and compositional approaches. This paper argues that each AI paradigm inherited not only the strengths but the structural limitations of the psychological theory that inspired it. Reinforcement learning cannot account for the internal structure of knowledge, deep learning compresses representations into opaque parameter spaces resistant to principled update, and current integrative approaches lack a formal account of how new understanding is constructed from existing components. The paper further examines a cross-cultural divergence in the interpretation of rote learning, arguing that the Eastern conception of memorization as a structured, multi-phase precursor to understanding offers an underexploited bridge between psychological theory and AI methodology. Drawing on the systematicity debate and Aizawa's critique of both classicism and connectionism, this paper introduces ReSynth, a trimodular framework that separates reasoning (Intellect), purpose (Identity), and knowledge (Memory) as architecturally independent components. The paper traces the genealogy from psychological paradigm to AI method, diagnoses the inherited limitations at each stage, and argues that adaptability, the central challenge of artificial general intelligence, requires a representational architecture in which systematic behavior is a necessary consequence rather than an accidental property.

[21] From Noise to Signal: When Outliers Seed New Topics

Evangelia Zve, Gauvain Bourgne, Benjamin Icard, Jean-Gabriel Ganascia

Main category: cs.CL

TL;DR: The paper introduces a temporal taxonomy for news document trajectories in dynamic topic modeling, distinguishing anticipatory outliers that signal emerging topics from other document types, and evaluates it on French hydrogen economy news.

Motivation: Current dynamic topic modeling treats outliers as noise, but some outliers actually serve as early signals of emerging topics. The paper aims to develop a framework that can identify these anticipatory outliers and understand how documents relate to topic formation over time.

Method: The authors introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. They implement this in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models, and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy.

Result: Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies illustrate these trajectories through concrete topic developments, showing that some outliers do indeed anticipate emerging topics.

Conclusion: The proposed temporal taxonomy successfully links weak-signal detection with temporal topic modeling, distinguishing anticipatory outliers from other document types and providing a framework for understanding how individual articles anticipate, initiate, or drift within evolving topic clusters.

Abstract: Outliers in dynamic topic modeling are typically treated as noise, yet we show that some can serve as early signals of emerging topics. We introduce a temporal taxonomy of news-document trajectories that defines how documents relate to topic formation over time. It distinguishes anticipatory outliers, which precede the topics they later join, from documents that either reinforce existing topics or remain isolated. By capturing these trajectories, the taxonomy links weak-signal detection with temporal topic modeling and clarifies how individual articles anticipate, initiate, or drift within evolving clusters. We implement it in a cumulative clustering setting using document embeddings from eleven state-of-the-art language models and evaluate it retrospectively on HydroNewsFr, a French news corpus on the hydrogen economy. Inter-model agreement reveals a small, high-consensus subset of anticipatory outliers, increasing confidence in these labels. Qualitative case studies further illustrate these trajectories through concrete topic developments.

[22] Synthetic Data Generation for Training Diversified Commonsense Reasoning Models

Tianhui Zhang, Bei Peng, Danushka Bollegala

Main category: cs.CL

TL;DR: Proposes CommonSyn, a two-stage method to create the first synthetic dataset for diverse Generative Commonsense Reasoning, addressing the lack of large-scale training data for diverse commonsense generation.

DetailsMotivation: Current conversational agents need diverse, high-quality commonsense responses, but existing datasets are small-scale and human-annotated, limiting training of diverse commonsense generators due to high annotation costs and narrow scenario coverage.

Method: Two-stage method to create the synthetic dataset CommonSyn: 1) generate diverse commonsense scenarios using LLMs; 2) refine and filter to ensure quality and diversity, yielding the first synthetic dataset for diversified Generative Commonsense Reasoning.

Result: Models fine-tuned on CommonSyn show improved generation diversity and quality compared to vanilla models and models fine-tuned on human-crafted datasets across different LLM sizes.

Conclusion: CommonSyn successfully addresses the training resource gap for diverse commonsense generation, enabling better training of conversational agents that can produce multiple plausible alternative responses.

Abstract: Conversational agents are required to respond to their users not only with high-quality (i.e. commonsense-bearing) responses, but also by considering multiple plausible alternative scenarios, reflecting diversity in their responses. Despite the growing need to train diverse commonsense generators, progress in this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created using a small number of human annotators, covering only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly increase both generation diversity and quality compared with vanilla models and models fine-tuned on human-crafted datasets, across Large Language Models (LLMs) of different sizes.

[23] PowerFlow: Unlocking the Dual Nature of LLMs via Principled Distribution Matching

Ruishuo Chen, Yu Chen, Zhuoran Li, Longbo Huang

Main category: cs.CL

TL;DR: PowerFlow is a principled unsupervised RL framework for LLMs that reformulates fine-tuning as distribution matching using GFlowNets, with a length-aware objective to neutralize structural biases and enable directional elicitation of reasoning vs creativity.

DetailsMotivation: Current unsupervised RL methods for LLMs rely on heuristic intrinsic rewards that lack theoretical optimization targets and are prone to degenerative biases, limiting their effectiveness in eliciting latent capabilities.

Method: PowerFlow reformulates unsupervised fine-tuning as a distribution matching problem using GFlowNets as amortized variational samplers. It introduces a length-aware Trajectory-Balance objective to neutralize structural length biases in autoregressive generation and targets α-power distributions for directional control.

Result: PowerFlow consistently outperforms existing RLIF methods, matches or exceeds supervised GRPO, and achieves simultaneous gains in diversity and quality by mitigating over-sharpening in aligned models, shifting the Pareto frontier in creative tasks.

Conclusion: PowerFlow provides a principled framework for unsupervised RL in LLMs that enables directional elicitation of capabilities (reasoning vs creativity) through α-power distributions, addressing limitations of heuristic intrinsic reward methods.

Abstract: Unsupervised Reinforcement Learning from Internal Feedback (RLIF) has emerged as a promising paradigm for eliciting the latent capabilities of Large Language Models (LLMs) without external supervision. However, current methods rely on heuristic intrinsic rewards, which often lack a well-defined theoretical optimization target and are prone to degenerative biases. In this work, we introduce PowerFlow, a principled framework that reformulates unsupervised fine-tuning as a distribution matching problem. By casting GFlowNet as an amortized variational sampler for unnormalized densities, we propose a length-aware Trajectory-Balance objective that explicitly neutralizes the structural length biases inherent in autoregressive generation. By targeting $\alpha$-power distributions, PowerFlow enables the directional elicitation of the dual nature of LLMs: sharpening the distribution ($\alpha > 1$) to intensify logical reasoning, or flattening it ($\alpha < 1$) to unlock expressive creativity. Extensive experiments demonstrate that PowerFlow consistently outperforms existing RLIF methods, matching or even exceeding supervised GRPO. Furthermore, by mitigating over-sharpening in aligned models, our approach achieves simultaneous gains in diversity and quality, shifting the Pareto frontier in creative tasks.
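The α-power reweighting at the heart of PowerFlow can be illustrated on a toy categorical distribution. A minimal sketch (function name and probability values are ours, not from the paper):

```python
import numpy as np

def alpha_power(probs, alpha):
    """Reweight a categorical distribution toward p(x)^alpha.

    alpha > 1 sharpens the distribution (mass concentrates on likely
    outcomes); alpha < 1 flattens it (mass spreads to rarer outcomes).
    """
    p = np.asarray(probs, dtype=float) ** alpha
    return p / p.sum()  # renormalize

base = np.array([0.6, 0.3, 0.1])
sharp = alpha_power(base, 2.0)   # "reasoning" direction: peakier
flat = alpha_power(base, 0.5)    # "creative" direction: flatter
```

The sampler in the paper targets this family over whole sequences rather than single tokens, but the directional effect of α is the same.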

[24] StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models

Zehao Chen, Rong Pan, Haoran Li

Main category: cs.CL

TL;DR: Multi-agent simulation approach for long-form story generation using emergent interactions between characters and environment

DetailsMotivation: Inspired by human writers who envision mental scenes of character interactions, aiming to overcome limitations of rigid top-down story generation approaches

Method: Hybrid bottom-up approach using multi-agent simulations in dynamic sandbox environment where agent behaviors and interactions generate emergent events that form story foundation

Result: System generates stories exceeding 10,000 words while maintaining coherence and consistency, achieving state-of-the-art performance across several metrics

Conclusion: Offers scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions

Abstract: Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.

[25] AutoScreen-FW: An LLM-based Framework for Resume Screening

Zhelin Xu, Shuhei Yamamoto, Atsuyuki Morishima

Main category: cs.CL

TL;DR: AutoScreen-FW is an open-source LLM framework for automated resume screening that uses representative sample selection and in-context learning to enable local deployment while addressing privacy concerns.

DetailsMotivation: Corporate recruiters face time constraints screening many resumes, risking oversight of suitable candidates. Existing LLM-based methods often use commercial LLMs with privacy risks, and lack clarity on optimal training samples for improving judgment performance.

Method: Proposes AutoScreen-FW framework that selects representative resume samples, uses in-context learning with persona descriptions and evaluation criteria to enable open-source LLMs to act as career advisors and evaluate unseen resumes locally.

Result: Open-source LLM judges consistently outperform GPT-5-nano across multiple ground truths, surpass GPT-5-mini under one ground truth setting, run substantially faster than commercial GPT models, though slightly weaker than GPT-5-mini under other settings.

Conclusion: AutoScreen-FW demonstrates potential for local deployment in companies to support efficient resume screening while reducing recruiter burden and addressing privacy concerns through open-source LLMs.

Abstract: Corporate recruiters often need to screen many resumes within a limited time, which increases their burden and may cause suitable candidates to be overlooked. To address these challenges, prior work has explored LLM-based automated resume screening. However, some methods rely on commercial LLMs, which may pose data privacy risks. Moreover, since companies typically do not make resumes with evaluation results publicly available, it remains unclear which resume samples should be used during learning to improve an LLM’s judgment performance. To address these problems, we propose AutoScreen-FW, an LLM-based framework for local, automated resume screening. AutoScreen-FW uses several methods to select a small set of representative resume samples. These samples are used for in-context learning together with a persona description and evaluation criteria, enabling open-source LLMs to act as career advisors and evaluate unseen resumes. Experiments with multiple ground truths show that the open-source LLM judges consistently outperform GPT-5-nano. Under one ground-truth setting, they also surpass GPT-5-mini. Although slightly weaker than GPT-5-mini under other ground-truth settings, they run substantially faster per resume than commercial GPT models. These findings indicate the potential for deploying AutoScreen-FW locally in companies to support efficient screening while reducing recruiters’ burden.

[26] The Geometry of Dialogue: Graphing Language Models to Reveal Synergistic Teams for Multi-Agent Collaboration

Kotaro Furuya, Yuichi Kitagawa

Main category: cs.CL

TL;DR: Proposes an interaction-centric framework for automatic team composition of LLMs by analyzing pairwise conversation coherence to identify synergistic clusters without requiring prior knowledge of model internals.

DetailsMotivation: Multi-agent LLM approaches show promise but require synergistic team composition, which is challenging due to model opacity. Current methods need prior knowledge of architectures, training data, or task performances.

Method: Constructs a “language model graph” mapping relationships between models based on semantic coherence of pairwise conversations, then applies community detection to identify synergistic model clusters without requiring internal model knowledge.

Result: Method discovers functionally coherent groups reflecting latent specializations. Synergistic teams identified through topic-primed conversations outperform random baselines and achieve comparable accuracy to manually-curated teams based on known specializations.

Conclusion: Provides a new basis for automated design of collaborative multi-agent LLM teams through interaction-centric analysis of conversation coherence, enabling effective team composition without requiring internal model knowledge.

Abstract: While a multi-agent approach based on large language models (LLMs) represents a promising strategy to surpass the capabilities of single models, its success is critically dependent on synergistic team composition. However, forming optimal teams is a significant challenge, as the inherent opacity of most models obscures the internal characteristics necessary for effective collaboration. In this paper, we propose an interaction-centric framework for automatic team composition that requires no prior knowledge of the models, including their internal architectures, training data, or task performances. Our method constructs a “language model graph” that maps relationships between models from the semantic coherence of pairwise conversations, and then applies community detection to identify synergistic model clusters. Our experiments with diverse LLMs demonstrate that the proposed method discovers functionally coherent groups that reflect their latent specializations. Priming conversations with specific topics identified synergistic teams that outperform random baselines on downstream benchmarks and achieve accuracy comparable to that of manually curated teams based on known model specializations. Our findings provide a new basis for the automated design of collaborative multi-agent LLM teams.
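The pipeline the abstract describes — score pairwise conversation coherence, build a weighted graph, cluster — can be sketched with a simple threshold grouping as a stand-in for the paper's community detection. All model names and coherence scores below are hypothetical:

```python
# Hypothetical pairwise conversation-coherence scores between five models.
models = ["m1", "m2", "m3", "m4", "m5"]
coherence = {
    ("m1", "m2"): 0.90,
    ("m3", "m4"): 0.85, ("m3", "m5"): 0.80, ("m4", "m5"): 0.90,
    ("m1", "m3"): 0.20, ("m2", "m5"): 0.15,  # weak cross-cluster links
}

THRESHOLD = 0.5  # keep only strongly coherent pairs
parent = {m: m for m in models}

def find(m):
    """Union-find root lookup with path compression."""
    while parent[m] != m:
        parent[m] = parent[parent[m]]
        m = parent[m]
    return m

# Merge models connected by an above-threshold coherence edge.
for (a, b), score in coherence.items():
    if score >= THRESHOLD:
        parent[find(a)] = find(b)

clusters = {}
for m in models:
    clusters.setdefault(find(m), []).append(m)
teams = sorted(sorted(c) for c in clusters.values())
# teams → [['m1', 'm2'], ['m3', 'm4', 'm5']]
```

The paper applies proper community detection on the full weighted graph; thresholded connected components are the simplest approximation of that idea.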

[27] TopoChunker: Topology-Aware Agentic Document Chunking Framework

Xiaoyu Liu

Main category: cs.CL

TL;DR: TopoChunker is an agentic framework for document chunking that preserves topological hierarchies in documents to improve RAG performance by avoiding semantic fragmentation from forced linearization.

DetailsMotivation: Current RAG chunking methods linearize text, stripping away intrinsic topological hierarchies and creating semantic fragmentation that degrades retrieval quality. There's a need to preserve cross-segment dependencies while balancing structural fidelity with computational cost.

Method: TopoChunker uses a dual-agent architecture: 1) Inspector Agent dynamically routes documents through cost-optimized extraction paths, and 2) Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. The framework maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to preserve cross-segment dependencies.

Result: On unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves 83.26% Recall@3, while reducing token overhead by 23.5%.

Conclusion: TopoChunker offers a scalable approach for structure-aware RAG that preserves document hierarchies while maintaining computational efficiency, addressing the semantic fragmentation problem in current chunking methods.

Abstract: Current document chunking methods for Retrieval-Augmented Generation (RAG) typically linearize text. This forced linearization strips away intrinsic topological hierarchies, creating “semantic fragmentation” that degrades downstream retrieval quality. In this paper, we propose TopoChunker, an agentic framework that maps heterogeneous documents onto a Structured Intermediate Representation (SIR) to explicitly preserve cross-segment dependencies. To balance structural fidelity with computational cost, TopoChunker employs a dual-agent architecture. An Inspector Agent dynamically routes documents through cost-optimized extraction paths, while a Refiner Agent performs capacity auditing and topological context disambiguation to reconstruct hierarchical lineage. Evaluated on unstructured narratives (GutenQA) and complex reports (GovReport), TopoChunker demonstrates state-of-the-art performance. It outperforms the strongest LLM-based baseline by 8.0% in absolute generation accuracy and achieves an 83.26% Recall@3, while simultaneously reducing token overhead by 23.5%, offering a scalable approach for structure-aware RAG.

[28] TARo: Token-level Adaptive Routing for LLM Test-time Alignment

Arushi Rai, Qiang Zhang, Hanqing Zeng, Yunkai Zhang, Dipesh Tamboli, Xiangjun Fan, Zhuokai Zhao, Lizhu Zhang

Main category: cs.CL

TL;DR: TARo is a test-time alignment method that uses token-level adaptive routing to steer frozen LLMs toward structured reasoning without expensive post-training.

DetailsMotivation: LLMs have strong reasoning capabilities but require expensive post-training for high performance. Test-time alignment methods exist but focus mainly on preference alignment rather than reasoning. There's a need for lightweight methods that can improve reasoning at inference time without retraining.

Method: Train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model during inference.

Result: TARo improves reasoning performance by up to +22.4% over base model and +8.4% over existing token-level test-time alignment methods. It also boosts out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval), and generalizes from small to large backbones without retraining.

Conclusion: TARo successfully extends test-time alignment from preference optimization to robust, cross-domain reasoning, offering a lightweight alternative to expensive post-training for improving LLM reasoning capabilities.

Abstract: Large language models (LLMs) exhibit strong reasoning capabilities but typically require expensive post-training to reach high performance. Recent test-time alignment methods offer a lightweight alternative, but have been explored mainly for preference alignment rather than reasoning. To bridge this gap, we propose Token-level Adaptive Routing (TARo), which steers frozen LLMs toward structured reasoning entirely at inference time. Specifically, we first train reward models on step-wise mathematical traces to capture fine-grained logical consistency signals, then introduce a learnable token-level router that automatically controls the guidance of the reward model to the base model. Extensive experiments show that TARo significantly improves reasoning performance by up to +22.4% over the base model and +8.4% over existing token-level test-time alignment methods, while also boosting out-of-distribution clinical reasoning (MedXpertQA) and instruction following (AlpacaEval). Furthermore, TARo also generalizes from small to large backbones without retraining, extending test-time alignment from preference optimization to robust, cross-domain reasoning.
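The token-level routing idea — a learned per-token weight deciding how strongly a reward model steers the frozen base model — can be sketched as additive logit guidance. The blending form and all names here are our illustration, not the paper's exact rule:

```python
import numpy as np

def route_logits(base_logits, reward_scores, router_weight):
    """Blend reward-model guidance into frozen base-model logits.

    router_weight in [0, 1] would come from a learned token-level
    router: 0 falls back to the base model, 1 applies full guidance.
    Returns the resulting next-token distribution.
    """
    guided = base_logits + router_weight * reward_scores
    e = np.exp(guided - guided.max())  # stable softmax
    return e / e.sum()

base = np.array([2.0, 1.0, 0.5])     # base model prefers token 0
reward = np.array([-1.0, 2.0, 0.0])  # reward model prefers token 1
p_off = route_logits(base, reward, 0.0)  # router disengaged
p_on = route_logits(base, reward, 1.0)   # router fully engaged
```

With the router disengaged the base model's preference survives; fully engaged, the reward model's preferred token takes over, which is the directional control the router learns to apply selectively.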

[29] Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs

Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura

Main category: cs.CL

TL;DR: Benchmark for evaluating task interference in multimodal LLMs across text and vision tasks, revealing directional interference patterns with modality differences as primary driver.

DetailsMotivation: Task interference has only been studied in text-only settings, but multimodal dialogue systems are becoming prevalent. There's a need to understand how task switching affects performance in multimodal LLMs that handle both text and vision.

Method: Introduced a benchmark covering six tasks across text and vision with systematic variation along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Evaluated both open-weights and proprietary models.

Result: Task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while reverse transition yields minimal degradation. Interference amplified when mismatches co-occur across multiple dimensions. Modality differences drive interference most strongly, followed by answer format, while reasoning requirement shifts cause minimal degradation.

Conclusion: Multimodal LLMs exhibit directional task interference patterns, with modality mismatches being the primary source of degradation. This has implications for designing robust multimodal dialogue systems.

Abstract: Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning requirement shifts cause minimal degradation.

[30] Adaptive Decoding via Test-Time Policy Learning for Self-Improving Generation

Asmita Bhardwaj, Yuya Jeremy Ong, Eelaaf Zahid, Basel Shbita

Main category: cs.CL

TL;DR: RL-based decoder sampler learns lightweight policy to dynamically adjust sampling parameters at test-time for better LLM outputs across domains

DetailsMotivation: Static decoding strategies like greedy or fixed temperature/top-p are task-agnostic and lead to suboptimal generation quality across domains requiring stylistic or structural flexibility

Method: Reinforcement learning-based decoder sampler treats decoding as sequential decision-making, learns lightweight policy to adjust sampling parameters at test-time while keeping LLM weights frozen

Result: Policy sampler consistently outperforms greedy and static baselines on summarization datasets (BookSum, arXiv, WikiHow) with relative gains up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen)

Conclusion: Reinforcement learning enables practical test-time adaptation in decoding for domain-aware and user-controllable generation without retraining large models

Abstract: Decoding strategies largely determine the quality of Large Language Model (LLM) outputs, yet widely used heuristics such as greedy or fixed temperature/top-p decoding are static and often task-agnostic, leading to suboptimal or inconsistent generation quality across domains that demand stylistic or structural flexibility. We introduce a reinforcement learning-based decoder sampler that treats decoding as sequential decision-making and learns a lightweight policy to adjust sampling parameters at test time while keeping LLM weights frozen. We evaluate on summarization datasets including BookSum, arXiv, and WikiHow, using Granite-3.3-2B and Qwen-2.5-0.5B. Our policy sampler consistently outperforms greedy and static baselines, achieving relative gains of up to +88% (BookSum, Granite) and +79% (WikiHow, Qwen). Reward ablations show that overlap-only objectives underperform compared to composite rewards, while structured shaping terms (length, coverage, repetition, completeness) enable stable and sustained improvements. These findings highlight reinforcement learning as a practical mechanism for test-time adaptation in decoding, enabling domain-aware and user-controllable generation without retraining large models.
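Decoding as sequential decision-making can be sketched with a stub policy standing in for the learned one; the parameter schedule and all values below are illustrative assumptions, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_token(logits, temperature, top_p):
    """Sample one token after temperature scaling and nucleus truncation."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # most probable first
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, top_p) + 1]  # smallest nucleus covering top_p
    p = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=p))

# A stub policy: explore early in the draft, tighten later. A learned
# policy would condition on richer state (reward signals, text so far).
def policy(step):
    return (1.2, 0.95) if step < 10 else (0.7, 0.9)

logits = np.array([3.0, 1.0, 0.2, -1.0])
token = sample_token(logits, *policy(0))
```

The point is only that (temperature, top_p) become per-step actions rather than fixed hyperparameters; the LLM producing the logits stays frozen throughout.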

[31] UT-ACA: Uncertainty-Triggered Adaptive Context Allocation for Long-Context Inference

Lang Zhou, Shuxuan Li, Zhuohao Li, Shi Liu, Zhilin Zhao, Wei-Shi Zheng

Main category: cs.CL

TL;DR: UT-ACA: A framework that dynamically adjusts context window size during inference based on token-wise uncertainty to reduce context usage while maintaining generation quality in long-context LLMs.

DetailsMotivation: Long-context inference is challenging for LLMs due to attention dilution and out-of-distribution degradation. Current context selection methods use fixed context budgets despite non-uniform token-level contextual demands, leading to inefficiencies.

Method: UT-ACA learns an uncertainty detector combining semantic embeddings with logit-based confidence, accounting for uncertainty accumulation. When insufficient evidence is detected, it selectively rolls back, expands context window, and regenerates tokens with additional support.

Result: Experiments show UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.

Conclusion: Dynamic context allocation based on token-wise uncertainty is effective for efficient long-context inference in LLMs.

Abstract: Long-context inference remains challenging for large language models due to attention dilution and out-of-distribution degradation. Context selection mitigates this limitation by attending to a subset of key-value cache entries, yet most methods allocate a fixed context budget throughout decoding despite highly non-uniform token-level contextual demands. To address this issue, we propose Uncertainty-Triggered Adaptive Context Allocation (UT-ACA), an inference-time framework that dynamically adjusts the context window based on token-wise uncertainty. UT-ACA learns an uncertainty detector that combines semantic embeddings with logit-based confidence while accounting for uncertainty accumulation across decoding steps. When insufficient evidence is indicated, UT-ACA selectively rolls back, expands the context window, and regenerates the token with additional support. Experiments show that UT-ACA substantially reduces average context usage while preserving generation quality in long-context settings.
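The rollback-and-expand loop can be sketched over a toy decoding trace. This is a minimal sketch: the fixed threshold and window sizes stand in for the paper's learned uncertainty detector and actual context budgets:

```python
def adaptive_decode(steps, threshold=0.5, base_window=256, max_window=2048):
    """Decode a trace, widening the attended context only when needed.

    steps: list of dicts mapping window size -> (token, confidence).
    Whenever a token's confidence falls below threshold, it is rolled
    back and regenerated with a doubled context window.
    """
    out = []
    for step in steps:
        window = base_window
        token, conf = step[window]
        while conf < threshold and window < max_window:
            window *= 2                  # expand the context window
            token, conf = step[window]   # roll back and regenerate
        out.append((token, window))
    return out

# Toy trace: the second token only becomes confident with a wider window.
steps = [
    {256: ("the", 0.9)},
    {256: ("qubit", 0.2), 512: ("qubit", 0.3), 1024: ("quantum", 0.8)},
]
result = adaptive_decode(steps)
# result → [("the", 256), ("quantum", 1024)]
```

Most tokens pay only the small base budget; the widened window is charged only at the few positions where the detector flags insufficient evidence, which is where the average context savings come from.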

[32] GAIN: A Benchmark for Goal-Aligned Decision-Making of Large Language Models under Imperfect Norms

Masayuki Kawarada, Kodai Watanabe, Soichiro Murakami

Main category: cs.CL

TL;DR: GAIN is a benchmark for evaluating how LLMs balance norm adherence vs. business goals in real-world scenarios with explicit pressures designed to encourage norm deviations.

DetailsMotivation: Existing benchmarks focus on abstract scenarios rather than real-world business applications and provide limited insights into factors influencing LLM decision-making, restricting ability to measure models' adaptability to complex norm-goal conflicts.

Method: GAIN provides models with goals, situations, norms, and contextual pressures designed to encourage norm deviations. It defines five pressure types: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising, and finance.

Result: Advanced LLMs frequently mirror human decision-making patterns, but when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

Conclusion: GAIN enables systematic evaluation of factors influencing LLM decision-making in real-world business contexts with norm-goal conflicts, revealing important differences in how models respond to different types of pressures.

Abstract: We introduce GAIN (Goal-Aligned Decision-Making under Imperfect Norms), a benchmark designed to evaluate how large language models (LLMs) balance adherence to norms against business goals. Existing benchmarks typically focus on abstract scenarios rather than real-world business applications. Furthermore, they provide limited insights into the factors influencing LLM decision-making. This restricts their ability to measure models’ adaptability to complex, real-world norm-goal conflicts. In GAIN, models receive a goal, a specific situation, a norm, and additional contextual pressures. These pressures, explicitly designed to encourage potential norm deviations, are a unique feature that differentiates GAIN from other benchmarks, enabling a systematic evaluation of the factors influencing decision-making. We define five types of pressures: Goal Alignment, Risk Aversion, Emotional/Ethical Appeal, Social/Authoritative Influence, and Personal Incentive. The benchmark comprises 1,200 scenarios across four domains: hiring, customer support, advertising and finance. Our experiments show that advanced LLMs frequently mirror human decision-making patterns. However, when Personal Incentive pressure is present, they diverge significantly, showing a strong tendency to adhere to norms rather than deviate from them.

[33] WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Haonan Yu, Junhao Liu, Zhenyu Yan, Haoran Lin, Xin Zhang

Main category: cs.CL

TL;DR: WASD framework identifies minimal neural activation conditions that guarantee specific token generation, enabling precise behavioral control of LLMs through actionable neural directives.

DetailsMotivation: Existing methods for controlling LLM behavior have limitations: high training costs, lack of natural language controllability, or compromised semantic coherence. There's a need for more precise and efficient behavioral control mechanisms.

Method: WASD represents candidate conditions as neuron-activation predicates and iteratively searches for minimal sets that guarantee current output under input perturbations. It identifies sufficient neural conditions for token generation through systematic perturbation analysis.

Result: Experiments on SST-2 and CounterFact datasets with Gemma-2-2B model show WASD produces more stable, accurate, and concise explanations than conventional attribution graphs. Case study on cross-lingual output generation validates practical effectiveness for model behavior control.

Conclusion: WASD provides a novel framework for explaining and controlling LLM behavior through actionable neural directives, offering more precise behavioral control without compromising semantic coherence or requiring extensive retraining.

Abstract: Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, through a case study on cross-lingual output generation, we validate the practical effectiveness of WASD in controlling model behavior.
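The minimal-sufficient-set search can be sketched as a greedy deletion loop; the guarantee check below is a toy stand-in for the paper's perturbation test, and all predicate names are invented:

```python
def minimal_sufficient_set(predicates, guarantees_output):
    """Greedily drop neuron-activation predicates while the remaining
    set still guarantees the current output.

    guarantees_output(subset) -> bool stands in for the paper's check
    that the output survives input perturbations under the subset.
    """
    keep = list(predicates)
    for p in list(keep):
        trial = [q for q in keep if q != p]
        if guarantees_output(trial):  # still sufficient without p
            keep = trial
    return keep

# Toy oracle: the output is guaranteed iff two specific predicates survive.
needed = {"n7>0", "n12>0.5"}
preds = ["n3>0", "n7>0", "n9<0", "n12>0.5"]
core = minimal_sufficient_set(preds, lambda s: needed <= set(s))
# core → ["n7>0", "n12>0.5"]
```

Greedy deletion finds a minimal (not necessarily minimum) sufficient set; in practice the expensive part is the perturbation-based guarantee check, invoked once per candidate deletion.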

[34] The Truncation Blind Spot: How Decoding Strategies Systematically Exclude Human-Like Token Choices

Esteban Garces Arias, Nurzhan Sapargali, Christian Heumann, Matthias Aßenmacher

Main category: cs.CL

TL;DR: Likelihood-based decoding strategies create a “truncation blind spot”: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by machines, and this mismatch makes machine-generated text detectable.

Motivation: Standard text generation decoding strategies select tokens based on likelihood, while human language production chooses tokens for communicative appropriateness. This mismatch creates a "truncation blind spot" where contextually appropriate but statistically rare tokens remain accessible to humans but unreachable by machines, potentially contributing to the detectability of machine-generated text.

Method: Analyzed over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations. Examined how human-selected tokens fall outside typical truncation boundaries. Trained simple classifiers on predictability and lexical diversity features to detect machine-generated text.

Result: Found that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers achieved remarkable detection rates. Neither model scale nor architecture correlated strongly with detectability; truncation parameters accounted for most variance. Configurations achieving low detectability often produced incoherent text.

Conclusion: Detectability of machine-generated text is enhanced by likelihood-based token selection, not merely a matter of model capability. Evading detection and producing natural text are distinct objectives, with current decoding strategies creating systematic patterns that make machine text detectable.

Abstract: Standard decoding strategies for text generation, including top-k, nucleus sampling, and contrastive search, select tokens based on likelihood, restricting selection to high-probability regions. Human language production operates differently: tokens are chosen for communicative appropriateness rather than statistical frequency. This mismatch creates a truncation blind spot: contextually appropriate but statistically rare tokens remain accessible to humans yet unreachable by likelihood-based decoding. We hypothesize this contributes to the detectability of machine-generated text. Analyzing over 1.8 million texts across eight language models, five decoding strategies, and 53 hyperparameter configurations, we find that 8-18% of human-selected tokens fall outside typical truncation boundaries. Simple classifiers trained on predictability and lexical diversity achieve remarkable detection rates. Crucially, neither model scale nor architecture correlates strongly with detectability; truncation parameters account for most variance. Configurations achieving low detectability often produce incoherent text, indicating that evading detection and producing natural text are distinct objectives. These findings suggest detectability is enhanced by likelihood-based token selection, not merely a matter of model capability.
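
The 8-18% figure can be reproduced in miniature: given per-step model probabilities and the tokens a human actually chose, count how often the human token falls outside the nucleus (top-p) set. A sketch with a toy vocabulary (the probabilities are invented):

```python
def nucleus_set(probs, top_p=0.9):
    """Smallest set of token ids whose cumulative probability reaches
    top_p, i.e. the region nucleus sampling can ever select from."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = set(), 0.0
    for i in order:
        kept.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

def blind_spot_rate(human_tokens, step_probs, top_p=0.9):
    """Fraction of human-chosen tokens the truncation boundary excludes."""
    outside = sum(tok not in nucleus_set(probs, top_p)
                  for tok, probs in zip(human_tokens, step_probs))
    return outside / len(human_tokens)
```

The same counting applies unchanged to top-k boundaries by swapping `nucleus_set` for a fixed-size cutoff.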

[35] EntropyCache: Decoded Token Entropy Guided KV Caching for Diffusion Language Models

Minsoo Cheong, Donghyun Son, Woosang Lim, Sungjoo Yoo

Main category: cs.CL

TL;DR: EntropyCache: Training-free KV caching method for diffusion-based LLMs that uses token entropy as a cheap proxy for cache staleness, achieving 15-26× speedup with minimal overhead.

Motivation: Diffusion-based LLMs require full forward passes at every denoising step due to bidirectional attention preventing lossless KV caching. Existing approximate KV caching methods have decision overhead that scales with context length or model depth.

Method: Proposes EntropyCache that uses maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute KV cache. Observes that token entropy correlates with KV cache drift, and recomputes the k most recently decoded tokens when needed. Decision requires only O(V) computation per step.

Result: Achieves 15.2×-26.4× speedup on standard benchmarks and 22.4×-24.1× on chain-of-thought benchmarks with competitive accuracy. Decision overhead accounts for only 0.5% of inference time.

Conclusion: EntropyCache provides an efficient, training-free KV caching method for diffusion-based LLMs that significantly speeds up inference with minimal accuracy loss and constant decision overhead.

Abstract: Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
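
The constant-cost decision rule can be sketched directly from the abstract: compute the maximum entropy over the distributions of newly decoded tokens (O(V) each) and recompute the cache only when it crosses a threshold. The threshold value below is an arbitrary placeholder, not the paper's setting:

```python
import math

def entropy(probs):
    """Shannon entropy of one decoded token's distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_recompute(decoded_dists, threshold=0.5):
    """EntropyCache-style skip-or-recompute signal: high entropy among
    the newly decoded tokens is used as a cheap proxy for KV-cache
    drift. When it fires, the cache entries of the k most recently
    decoded tokens are refreshed; otherwise the step reuses the cache."""
    return max(entropy(d) for d in decoded_dists) > threshold
```

Because the signal only touches the decoded tokens' output distributions, its cost is independent of context length and model depth, which is the point of the design.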

[36] When Names Change Verdicts: Intervention Consistency Reveals Systematic Bias in LLM Decision-Making

Abhinaba Basu, Pavan Chakraborty

Main category: cs.CL

TL;DR: ICE-Guard framework detects spurious feature reliance in LLMs for high-stakes decisions using intervention consistency testing across demographic, authority, and framing biases.

Motivation: LLMs are increasingly used for high-stakes decisions but their susceptibility to spurious features remains poorly characterized, with current research narrowly focused on demographic bias while ignoring other important biases like authority and framing effects.

Method: Introduced ICE-Guard framework applying intervention consistency testing across 3,000 vignettes in 10 high-stakes domains to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Evaluated 11 LLMs from 8 families and tested structured decomposition approaches.

Result: Found that authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%); bias concentrates in specific domains (finance shows 22.6% authority bias while criminal justice shows only 2.8%); structured decomposition reduces flip rates by up to 100% (median 49% across 9 models); ICE-guided detect-diagnose-mitigate-verify loop achieved 78% cumulative bias reduction.

Conclusion: The field’s narrow focus on demographic bias overlooks more substantial authority and framing biases; structured decomposition and iterative prompt patching can significantly reduce bias; the framework provides a conservative estimate of real-world bias as validated against COMPAS recidivism data.

Abstract: Large language models (LLMs) are increasingly used for high-stakes decisions, yet their susceptibility to spurious features remains poorly characterized. We introduce ICE-Guard, a framework applying intervention consistency testing to detect three types of spurious feature reliance: demographic (name/race swaps), authority (credential/prestige swaps), and framing (positive/negative restatements). Across 3,000 vignettes spanning 10 high-stakes domains, we evaluate 11 LLMs from 8 families and find that (1) authority bias (mean 5.8%) and framing bias (5.0%) substantially exceed demographic bias (2.2%), challenging the field’s narrow focus on demographics; (2) bias concentrates in specific domains – finance shows 22.6% authority bias while criminal justice shows only 2.8%; (3) structured decomposition, where the LLM extracts features and a deterministic rubric decides, reduces flip rates by up to 100% (median 49% across 9 models). We demonstrate an ICE-guided detect-diagnose-mitigate-verify loop achieving cumulative 78% bias reduction via iterative prompt patching. Validation against real COMPAS recidivism data shows COMPAS-derived flip rates exceed pooled synthetic rates, suggesting our benchmark provides a conservative estimate of real-world bias. Code and data are publicly available.
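
The core ICE-Guard measurement can be sketched as a flip rate over intervention pairs, where each pair holds one vignette's verdict before and after a content-preserving swap (name, credential, or framing); the verdict values below are illustrative only:

```python
def flip_rate(verdict_pairs):
    """Fraction of vignettes whose verdict changes under an
    intervention that should be decision-irrelevant; a nonzero rate
    indicates reliance on the swapped spurious feature."""
    flips = sum(before != after for before, after in verdict_pairs)
    return flips / len(verdict_pairs)
```

The paper's per-domain numbers (e.g. 22.6% authority bias in finance) are exactly this statistic computed over each domain's vignette set.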

[37] Cross-Lingual LLM-Judge Transfer via Evaluation Decomposition

Ivaxi Sheth, Zeno Jonke, Amin Mantrach, Saab Mansour

Main category: cs.CL

TL;DR: A decomposition-based evaluation framework using Universal Criteria Set (UCS) for cross-lingual LLM evaluation without requiring target-language annotations.

Motivation: Large language models are increasingly deployed across diverse languages, but automated evaluation remains predominantly English-focused. Adapting evaluation to other languages is hindered by scarcity and cost of human-annotated judgments in most languages.

Method: Introduces a decomposition-based evaluation framework built around a Universal Criteria Set (UCS) - a shared, language-agnostic set of evaluation dimensions that produces interpretable intermediate representations supporting cross-lingual transfer with minimal supervision.

Result: Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.

Conclusion: The UCS framework enables effective cross-lingual evaluation of LLMs by leveraging language-agnostic criteria sets, addressing the challenge of evaluation in non-English languages where human annotations are scarce.

Abstract: As large language models are increasingly deployed across diverse real-world applications, extending automated evaluation beyond English has become a critical challenge. Existing evaluation approaches are predominantly English-focused, and adapting them to other languages is hindered by the scarcity and cost of human-annotated judgments in most languages. We introduce a decomposition-based evaluation framework built around a Universal Criteria Set (UCS). UCS consists of a shared, language-agnostic set of evaluation dimensions, producing an interpretable intermediate representation that supports cross-lingual transfer with minimal supervision. Experiments on multiple faithfulness tasks across languages and model backbones demonstrate consistent improvements over strong baselines without requiring target-language annotations.
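
The decomposition idea can be sketched as scoring each language-agnostic criterion separately and keeping the per-criterion vector as the interpretable intermediate representation. The criterion names and the mean aggregation below are placeholders; the paper's actual UCS dimensions and aggregation are not specified here:

```python
# Hypothetical language-agnostic criteria, standing in for the UCS.
UCS = ("faithfulness", "completeness", "consistency")

def decomposed_judgment(score_criterion, response):
    """Score each universal criterion independently, then aggregate.
    The per-criterion dict is the intermediate representation that
    transfers across languages; the mean is one possible final verdict."""
    scores = {c: score_criterion(c, response) for c in UCS}
    return scores, sum(scores.values()) / len(UCS)
```

Because the intermediate representation is language-agnostic, a judge calibrated on English per-criterion scores can be reused on other languages with minimal supervision, which is the transfer mechanism the abstract claims.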

[38] ICE: Intervention-Consistent Explanation Evaluation with Statistical Grounding for LLMs

Abhinaba Basu, Pavan Chakraborty

Main category: cs.CL

TL;DR: ICE framework evaluates explanation faithfulness in LLMs using multiple intervention operators and statistical testing against random baselines, revealing operator-dependent faithfulness, anti-faithfulness cases, and no correlation with human plausibility.

Motivation: Existing benchmarks for evaluating explanation faithfulness in models use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. There's a need for a more rigorous evaluation framework.

Method: Introduces ICE (Intervention-Consistent Explanation) framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators. Evaluates 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, yielding win rates with confidence intervals.

Result: Faithfulness is operator-dependent with gaps up to 44 percentage points. Deletion inflates estimates on short text but pattern reverses on long text. Randomized baselines reveal anti-faithfulness in one-third of configurations. Faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone.

Conclusion: Faithfulness should be interpreted comparatively across intervention operators rather than as a single score. The ICE framework provides a more rigorous evaluation method for explanation faithfulness in LLMs, with released ICEBench benchmark for future research.

Abstract: Evaluating whether explanations faithfully reflect a model’s reasoning remains an open problem. Existing benchmarks use single interventions without statistical testing, making it impossible to distinguish genuine faithfulness from chance-level performance. We introduce ICE (Intervention-Consistent Explanation), a framework that compares explanations against matched random baselines via randomization tests under multiple intervention operators, yielding win rates with confidence intervals. Evaluating 7 LLMs across 4 English tasks, 6 non-English languages, and 2 attribution methods, we find that faithfulness is operator-dependent: operator gaps reach up to 44 percentage points, with deletion typically inflating estimates on short text but the pattern reversing on long text, suggesting that faithfulness should be interpreted comparatively across intervention operators rather than as a single score. Randomized baselines reveal anti-faithfulness in one-third of configurations, and faithfulness shows zero correlation with human plausibility (|r| < 0.04). Multilingual evaluation reveals dramatic model-language interactions not explained by tokenization alone. We release the ICE framework and ICEBench benchmark.
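
The statistical grounding reduces to a simple comparison: an explanation's intervention score is ranked against many matched random baselines, and the win rate (with a confidence interval, omitted here) summarizes it. A minimal sketch with invented scores:

```python
def win_rate(explanation_score, baseline_scores):
    """Fraction of matched random baselines the explanation beats under
    the same intervention operator. Rates near 0.5 are chance-level;
    rates below 0.5 correspond to the anti-faithful configurations the
    paper reports in one-third of its settings."""
    wins = sum(explanation_score > b for b in baseline_scores)
    return wins / len(baseline_scores)
```

Running this once per intervention operator is what surfaces the operator dependence: the same explanation can win under deletion but lose under a different perturbation.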

[39] Language Model Maps for Prompt-Response Distributions via Log-Likelihood Vectors

Yusuke Takase, Momose Oyama, Hidetoshi Shimodaira

Main category: cs.CL

TL;DR: A method for representing and comparing language models using log-likelihood vectors over prompt-response pairs, creating model maps that capture relationships between models and their conditional distributions.

Motivation: To develop a framework for analyzing and comparing language models by representing them in a vector space that captures their conditional distributions, enabling systematic analysis of model behavior, relationships, and the effects of prompt modifications.

Method: Represent language models as log-likelihood vectors over prompt-response pairs, construct model maps where distances approximate KL divergence between conditional distributions, and introduce PMI vectors to reduce influence of unconditional distributions.

Result: The method captures meaningful global structure including relationships to model attributes and task performance, systematic shifts from prompt modifications with approximate additive compositionality, and PMI-based maps better reflect training-data-related differences.

Conclusion: The framework supports analysis of input-dependent model behavior, enabling comparison of language models, understanding prompt effects, and revealing systematic relationships in model space.

Abstract: We propose a method that represents language models by log-likelihood vectors over prompt-response pairs and constructs model maps for comparing their conditional distributions. In this space, distances between models approximate the KL divergence between the corresponding conditional distributions. Experiments on a large collection of publicly available language models show that the maps capture meaningful global structure, including relationships to model attributes and task performance. The method also captures systematic shifts induced by prompt modifications and their approximate additive compositionality, suggesting a way to analyze and predict the effects of composite prompt operations. We further introduce pointwise mutual information (PMI) vectors to reduce the influence of unconditional distributions; in some cases, PMI-based model maps better reflect training-data-related differences. Overall, the framework supports the analysis of input-dependent model behavior.
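
The map construction can be sketched in two steps: each model becomes a vector of log-likelihoods over a shared bank of prompt-response pairs, and pairwise distances between vectors are then read as approximate divergences. Plain Euclidean distance is used below as a simplification of the paper's KL-approximating construction:

```python
import math

def loglik_vector(logprob, pairs):
    """Represent a model by the log-likelihoods it assigns to a shared
    set of (prompt, response) pairs; `logprob` is the model's scoring
    function."""
    return [logprob(prompt, response) for prompt, response in pairs]

def map_distance(u, v):
    """Distance between two models on the map (Euclidean here; in the
    paper's construction such distances approximate the KL divergence
    between the models' conditional distributions)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

The toy `logprob` functions in the usage below are stand-ins for real model scoring; the PMI variant would subtract each model's unconditional response log-likelihood from the vector entries before measuring distances.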

[40] Cross-Modal Rationale Transfer for Explainable Humanitarian Classification on Social Media

Thi Huyen Nguyen, Koustav Rudra, Wolfgang Nejdl

Main category: cs.CL

TL;DR: Multimodal interpretable classification framework for crisis tweets that extracts text rationales, transfers them to image rationales, and improves classification performance while providing transparency.

Motivation: Existing crisis information classification methods lack transparency, affecting real-life deployment. While recent work improved text rationale extraction, crisis-related image rationales remain underexplored.

Method: Uses visual language transformer for joint text-image representation, extracts text rationales first, then transfers them to image rationales via cross-modal rationale transfer, and classifies tweets based on extracted rationales.

Result: Boosts classification Macro-F1 by 2-35% on CrisisMMD dataset, extracts accurate text tokens and image patches as rationales, and achieves 80% accuracy in zero-shot mode on unseen datasets.

Conclusion: Proposed interpretable multimodal framework effectively extracts cross-modal rationales, improves classification performance, and adapts well to new datasets with reduced annotation effort.

Abstract: Advances in social media data dissemination enable the provision of real-time information during a crisis. The information comes from different classes, such as infrastructure damages, persons missing or stranded in the affected zone, etc. Existing methods attempted to classify text and images into various humanitarian categories, but their decision-making process remains largely opaque, which affects their deployment in real-life applications. Recent work has sought to improve transparency by extracting textual rationales from tweets to explain predicted classes. However, such explainable classification methods have mostly focused on text, rather than crisis-related images. In this paper, we propose an interpretable-by-design multimodal classification framework. Our method first learns the joint representation of text and image using a visual language transformer model and extracts text rationales. Next, it extracts the image rationales via the mapping with text rationales. Our approach demonstrates how to learn rationales in one modality from another through cross-modal rationale transfer, which saves annotation effort. Finally, tweets are classified based on extracted rationales. Experiments are conducted over CrisisMMD benchmark dataset, and results show that our proposed method boosts the classification Macro-F1 by 2-35% while extracting accurate text tokens and image patches as rationales. Human evaluation also supports the claim that our proposed method is able to retrieve better image rationale patches (12%) that help to identify humanitarian classes. Our method adapts well to new, unseen datasets in zero-shot mode, achieving an accuracy of 80%.

[41] Learning to Self-Evolve

Xiaoyin Chen, Canwen Xu, Yite Wang, Boyi Liu, Zhewei Yao, Yuxiong He

Main category: cs.CL

TL;DR: LSE is a reinforcement learning framework that trains LLMs to iteratively improve their own contexts at test time by learning from feedback on performance improvements.

Motivation: Existing approaches rely on inherent model reasoning for self-evolution without explicit training. The authors aim to make self-evolution a learnable skill through reinforcement learning.

Method: Reduces multi-step evolution to single-step RL objective where each context edit is rewarded by downstream performance improvement. Uses tree-guided evolution loop for iterative refinement.

Result: A 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods like GEPA and TextGrad on Text-to-SQL generation (BIRD) and general QA (MMLU-Redux).

Conclusion: Treating self-evolution as a learnable skill is effective, and the approach transfers to guide other models without additional training.

Abstract: We introduce Learning to Self-Evolve (LSE), a reinforcement learning framework that trains large language models (LLMs) to improve their own contexts at test time. We situate LSE in the setting of test-time self-evolution, where a model iteratively refines its context from feedback on seen problems to perform better on new ones. Existing approaches rely entirely on the inherent reasoning ability of the model and never explicitly train it for this task. LSE reduces the multi-step evolution problem to a single-step RL objective, where each context edit is rewarded by the improvement in downstream performance. We pair this objective with a tree-guided evolution loop. On Text-to-SQL generation (BIRD) and general question answering (MMLU-Redux), a 4B-parameter model trained with LSE outperforms self-evolving policies powered by GPT-5 and Claude Sonnet 4.5, as well as prompt optimization methods including GEPA and TextGrad, and transfers to guide other models without additional training. Our results highlight the effectiveness of treating self-evolution as a learnable skill.

[42] A Comparative Empirical Study of Catastrophic Forgetting Mitigation in Sequential Task Adaptation for Continual Natural Language Processing Systems

Aram Abrahamyan, Sachin Kumar

Main category: cs.CL

TL;DR: Comparative study of catastrophic forgetting mitigation in continual intent classification using CLINC150 dataset, evaluating three architectures (ANN, GRU, Transformer) with various continual learning strategies including replay, regularization, and parameter-isolation methods.

Motivation: Neural language models in real-world applications need to continually adapt to new tasks/domains without forgetting previously acquired knowledge, requiring effective catastrophic forgetting mitigation strategies for continual intent classification.

Method: Constructed 10-task label-disjoint scenario using CLINC150 dataset. Evaluated three backbone architectures (ANN, GRU, Transformer encoder) under continual learning strategies: replay-based MIR, regularization-based LwF, and parameter-isolation HAT, both individually and in combinations. Assessed performance with average accuracy, macro F1, and backward transfer metrics.

Result: Naive sequential fine-tuning suffers severe forgetting for all architectures. No single CL method fully prevents forgetting. Replay (MIR) is most reliable individual strategy. Combinations including replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) achieve high final performance with near-zero/mildly positive backward transfer. Optimal configuration is architecture-dependent: MIR+HAT best for ANN/Transformer, MIR+LwF+HAT best for GRU. Some CL methods surpass joint training, indicating regularization effect.

Conclusion: The results highlight the importance of jointly selecting the backbone architecture and CL mechanism when designing continual intent-classification systems; replay is the key ingredient for mitigating catastrophic forgetting in continual learning scenarios.

Abstract: Neural language models deployed in real-world applications must continually adapt to new tasks and domains without forgetting previously acquired knowledge. This work presents a comparative empirical study of catastrophic forgetting mitigation in continual intent classification. Using the CLINC150 dataset, we construct a 10-task label-disjoint scenario and evaluate three backbone architectures: a feed-forward Artificial Neural Network (ANN), a Gated Recurrent Unit (GRU), and a Transformer encoder, under a range of continual learning (CL) strategies. We consider one representative method from each major CL family: replay-based Maximally Interfered Retrieval (MIR), regularization-based Learning without Forgetting (LwF), and parameter-isolation via Hard Attention to Task (HAT), both individually and in all pairwise and triple combinations. Performance is assessed with average accuracy, macro F1, and backward transfer, capturing the stability-plasticity trade-off across the task sequence. Our results show that naive sequential fine-tuning suffers from severe forgetting for all architectures and that no single CL method fully prevents it. Replay emerges as a key ingredient: MIR is the most reliable individual strategy, and combinations that include replay (MIR+HAT, MIR+LwF, MIR+LwF+HAT) consistently achieve high final performance with near-zero or mildly positive backward transfer. The optimal configuration is architecture-dependent. MIR+HAT yields the best result for ANN and Transformer, MIR+LwF+HAT, on the other hand, works the best for GRU, and in several cases CL methods even surpass joint training, indicating a regularization effect. These findings highlight the importance of jointly selecting backbone architecture and CL mechanism when designing continual intent-classification systems.
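
Backward transfer, one of the three reported metrics, has a standard form worth making concrete (the accuracy matrix in the usage below is invented):

```python
def backward_transfer(acc):
    """Backward transfer (BWT): average change in accuracy on each
    earlier task after training on the final task, where acc[i][j] is
    the accuracy on task j measured after training through task i.
    Negative BWT indicates catastrophic forgetting; near-zero or
    positive BWT matches the paper's best replay-based combinations."""
    T = len(acc)
    return sum(acc[T - 1][j] - acc[j][j] for j in range(T - 1)) / (T - 1)
```

Average accuracy is simply the mean of the last row of the same matrix, so both headline metrics come from one evaluation sweep after the final task.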

[43] Automatic detection of Gen-AI texts: A comparative framework of neural models

Cristian Buttaro, Irene Amerini

Main category: cs.CL

TL;DR: The paper develops and evaluates four neural network architectures for detecting AI-generated text, benchmarking them against commercial detectors across multiple datasets and languages.

Motivation: The proliferation of Large Language Models has made it increasingly difficult to distinguish between human-written and AI-generated text, creating critical issues in academic, editorial, and social contexts that require effective detection solutions.

Method: Developed four neural architectures: Multilayer Perceptron, 1D CNN, MobileNet-based CNN, and Transformer model. Benchmarked these against eight commercial detectors (ZeroGPT, GPTZero, etc.) using COLING Multilingual Dataset (English/Italian) and an original Art and Mental Health dataset.

Result: Supervised detectors achieved more stable and robust performance than commercial tools across different languages and domains, revealing key strengths and limitations of current detection strategies.

Conclusion: Supervised machine learning approaches provide more reliable AI text detection than existing commercial tools, offering better performance stability across languages and specialized domains.

Abstract: The rapid proliferation of Large Language Models has significantly increased the difficulty of distinguishing between human-written and AI generated texts, raising critical issues across academic, editorial, and social domains. This paper investigates the problem of AI generated text detection through the design, implementation, and comparative evaluation of multiple machine learning based detectors. Four neural architectures are developed and analyzed: a Multilayer Perceptron, a one-dimensional Convolutional Neural Network, a MobileNet-based CNN, and a Transformer model. The proposed models are benchmarked against widely used online detectors, including ZeroGPT, GPTZero, QuillBot, Originality.AI, Sapling, IsGen, Rephrase, and Writer. Experiments are conducted on the COLING Multilingual Dataset, considering both English and Italian configurations, as well as on an original thematic dataset focused on Art and Mental Health. Results show that supervised detectors achieve more stable and robust performance than commercial tools across different languages and domains, highlighting key strengths and limitations of current detection strategies.

[44] Implicit Grading Bias in Large Language Models: How Writing Style Affects Automated Assessment Across Math, Programming, and Essay Tasks

Rudra Jadhav, Janhavi Danve, Sonalika Shaw

Main category: cs.CL

TL;DR: LLMs exhibit significant grading bias based on writing style in Essay/Writing tasks despite explicit instructions to evaluate only content correctness, with informal language and non-native phrasing receiving substantial penalties, while Mathematics and Programming tasks show minimal bias.

Motivation: As LLMs are increasingly used for automated grading in education, concerns about fairness and bias in their evaluations have become critical, particularly whether they exhibit implicit grading bias based on writing style when content correctness remains constant.

Method: Created controlled dataset of 180 student responses across Mathematics, Programming, and Essay/Writing subjects with three surface-level perturbation types (grammar errors, informal language, non-native phrasing). Used two open-source LLMs (LLaMA 3.3 70B and Qwen 2.5 72B) to grade responses on 1-10 scale with explicit instructions to evaluate only content correctness and disregard writing style.

Result: Statistically significant grading bias found in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes from medium (d = 0.64) to very large (d = 4.25). Informal language received heaviest penalty (LLaMA: -1.90 points, Qwen: -1.20 points), non-native phrasing also penalized (-1.35 and -0.90 points). Mathematics and Programming tasks showed minimal bias, most conditions not statistically significant.

Conclusion: LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions. Findings highlight need for bias auditing protocols before institutional adoption of LLM-based grading systems to ensure equitable deployment.

Abstract: As large language models (LLMs) are increasingly deployed as automated graders in educational settings, concerns about fairness and bias in their evaluations have become critical. This study investigates whether LLMs exhibit implicit grading bias based on writing style when the underlying content correctness remains constant. We constructed a controlled dataset of 180 student responses across three subjects (Mathematics, Programming, and Essay/Writing), each with three surface-level perturbation types: grammar errors, informal language, and non-native phrasing. Two state-of-the-art open-source LLMs – LLaMA 3.3 70B (Meta) and Qwen 2.5 72B (Alibaba) – were prompted to grade responses on a 1-10 scale with explicit instructions to evaluate content correctness only and to disregard writing style. Our results reveal statistically significant grading bias in Essay/Writing tasks across both models and all perturbation types (p < 0.05), with effect sizes ranging from medium (Cohen’s d = 0.64) to very large (d = 4.25). Informal language received the heaviest penalty, with LLaMA deducting an average of 1.90 points and Qwen deducting 1.20 points on a 10-point scale – penalties comparable to the difference between a B+ and C+ letter grade. Non-native phrasing was penalized 1.35 and 0.90 points respectively. In sharp contrast, Mathematics and Programming tasks showed minimal bias, with most conditions failing to reach statistical significance. These findings demonstrate that LLM grading bias is subject-dependent, style-sensitive, and persists despite explicit counter-bias instructions in the grading prompt. We discuss implications for equitable deployment of LLM-based grading systems and recommend bias auditing protocols before institutional adoption.
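
The effect sizes in the results are the standard pooled-SD form of Cohen's d, which is easy to make concrete; the toy grades below are illustrative, not the study's data:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d with pooled standard deviation: the standardized
    difference between mean grades of original and style-perturbed
    responses."""
    def mean(xs):
        return sum(xs) / len(xs)
    def var(xs):  # sample variance (n - 1 denominator)
        m = mean(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    na, nb = len(group_a), len(group_b)
    pooled = math.sqrt(((na - 1) * var(group_a) + (nb - 1) * var(group_b))
                       / (na + nb - 2))
    return (mean(group_a) - mean(group_b)) / pooled
```

On the study's 10-point scale, the reported d = 4.25 for the largest condition corresponds to a mean shift several times the within-group spread, which is why the authors compare it to a full letter-grade drop.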

[45] Mi:dm K 2.5 Pro

KT Tech innovation Group

Main category: cs.CL

TL;DR: Mi:dm K 2.5 Pro is a 32B parameter Korean LLM optimized for enterprise-grade reasoning, long-context understanding, and agentic workflows with specialized training for Korean language and cultural understanding.

Motivation: Address the limitations of existing LLMs in enterprise environments, particularly for Korean-language and domain-specific scenarios where current scaling approaches are insufficient for multi-step reasoning, long-context understanding, and agentic workflows.

Method: Quality-centric data curation using AST analysis for code and gap-filling for math; pre-training with Depth Upscaling and progressive strategy for 128K context; post-training with Reasoning SFT, model merging, asynchronous RL, and “Fusion Training” to balance reasoning with conversational fluency and tool-use.

Result: Achieves competitive performance against leading global and domestic models, sets SOTA on Korean-specific benchmarks, and demonstrates strong safety profile through Responsible AI evaluations.

Conclusion: Mi:dm K 2.5 Pro successfully addresses enterprise-grade complexity with specialized optimization for reasoning, Korean language understanding, and safety, making it suitable for deployment in demanding enterprise environments.

Abstract: The evolving LLM landscape requires capabilities beyond simple text generation, prioritizing multi-step reasoning, long-context understanding, and agentic workflows. This shift challenges existing models in enterprise environments, especially in Korean-language and domain-specific scenarios where scaling is insufficient. We introduce Mi:dm K 2.5 Pro, a 32B parameter flagship LLM designed to address enterprise-grade complexity through reasoning-focused optimization. Our methodology builds a robust data foundation via a quality-centric curation pipeline utilizing abstract syntax tree (AST) analysis for code, gap-filling synthesis for mathematics, and an LLM-based quality evaluator. Pre-training scales the model via layer-predictor-based Depth Upscaling (DuS) and a progressive strategy supporting a 128K token context window. Post-training introduces a specialized multi-stage pipeline, including Reasoning SFT, model merging, and asynchronous reinforcement learning (RL), to develop complex problem-solving skills. “Fusion Training” then rebalances these capabilities with conversational fluency, consistent response styling, and reliable tool-use. The evaluations show that Mi:dm K 2.5 Pro achieves competitive performance against leading global and domestic models. In addition, it sets state-of-the-art results on Korean-specific benchmarks, showcasing deep linguistic and cultural understanding. Finally, Responsible AI evaluations validate safety against attacks, ensuring a secure profile for deployment with a balance of harmlessness and responsiveness.

[46] Detecting Basic Values in A Noisy Russian Social Media Text Data: A Multi-Stage Classification Framework

Maria Milkova, Maksim Rudnev

Main category: cs.CL

TL;DR: Multi-stage framework for detecting human values in Russian social media using LLM annotation and transformer models, achieving 0.83 F1 macro score.

DetailsMotivation: To develop a robust method for detecting human values in noisy social media data, addressing challenges of annotation subjectivity and content quality in Russian language social networks.

Method: Multi-stage pipeline with spam filtering, targeted content selection, LLM-based annotation using GPT, aggregation of multiple LLM judgments into soft labels, and training transformer models (XLM-RoBERTa large) for multi-label classification of 10 basic human values.

Result: Best model achieves F1 macro of 0.83 and F1 of 0.71 on test data; model aligns with human judgments but systematically overestimates Openness to Change values; reveals distinct value expression patterns in Russian social networks.

Conclusion: Treating value detection as multi-perspective interpretive task yields robust models; framework handles annotation subjectivity effectively; contributes to research on cultural variation and value interpretation in digital environments.

Abstract: This study presents a multi-stage classification framework for detecting human values in noisy Russian-language social media, validated on a random sample of 7.5 million public text posts. Drawing on Schwartz’s theory of basic human values, we design a multi-stage pipeline that includes spam and nonpersonal content filtering, targeted selection of value-relevant and politically relevant posts, LLM-based annotation, and multi-label classification. Particular attention is given to verifying the quality of LLM annotations and model predictions against human experts. We treat human expert annotations not as ground truth but as an interpretative benchmark with its own uncertainty. To account for annotation subjectivity, we aggregate multiple LLM-generated judgments into soft labels that reflect varying levels of agreement. These labels are then used to train transformer-based models capable of predicting the probability of each of the ten basic values. The best-performing model, XLM-RoBERTa large, achieves an F1 macro of 0.83 and an F1 of 0.71 on held-out test data. By treating value detection as a multi-perspective interpretive task, where expert labels, GPT annotations, and model predictions represent coherent but not identical readings of the same texts, we show that the model generally aligns with human judgments but systematically overestimates the Openness to Change value domain. Empirically, the study reveals distinct patterns of value expression and their co-occurrence in Russian social networks, contributing to a broader research agenda on cultural variation, communicative framing, and value-based interpretation in digital environments. All models are released publicly.
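
The soft-label aggregation step can be sketched as follows. This is an illustrative reading of the pipeline, not the authors' implementation; the annotation runs are invented, and only the ten value names follow Schwartz's taxonomy:

```python
from collections import Counter

# The ten basic values in Schwartz's theory:
VALUES = ["Self-Direction", "Stimulation", "Hedonism", "Achievement", "Power",
          "Security", "Conformity", "Tradition", "Benevolence", "Universalism"]

def soft_labels(judgments):
    """Aggregate several annotators' multi-label judgments for one post
    into per-value agreement scores in [0, 1]."""
    counts = Counter(v for labels in judgments for v in labels)
    n = len(judgments)
    return {v: counts[v] / n for v in VALUES}

# Three hypothetical LLM annotation runs labelling the same post:
runs = [{"Security", "Tradition"}, {"Security"}, {"Security", "Benevolence"}]
labels = soft_labels(runs)
```

The resulting fractions (here, full agreement on Security, one-third agreement on Tradition) serve as soft training targets for the multi-label classifier.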

[47] Why Better Cross-Lingual Alignment Fails for Better Cross-Lingual Transfer: Case of Encoders

Yana Veitsman, Yihong Liu, Hinrich Schütze

Main category: cs.CL

TL;DR: Cross-lingual alignment improvements don’t always translate to better downstream task performance due to orthogonality between alignment and task objectives, with effects varying across languages and tasks.

DetailsMotivation: To understand why explicit cross-lingual alignment techniques that increase embedding similarity often fail to improve token-level downstream task performance, despite the common assumption that better alignment yields better transfer.

Method: Analyzed four XLM-R encoder models aligned on different language pairs and fine-tuned for POS Tagging or Sentence Classification. Used representational analyses including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses.

Result: Found that: (1) embedding distances alone are unreliable predictors of task performance improvements/degradations; (2) alignment and task gradients are often close to orthogonal, meaning optimizing one objective contributes little to optimizing the other.

Conclusion: Better cross-lingual alignment doesn’t necessarily translate to better cross-lingual transfer due to orthogonality between objectives. Provides practical guidelines for combining alignment with task-specific fine-tuning, emphasizing careful loss selection.

Abstract: Better cross-lingual alignment is often assumed to yield better cross-lingual transfer. However, explicit alignment techniques – despite increasing embedding similarity – frequently fail to improve token-level downstream performance. In this work, we show that this mismatch arises because alignment and downstream task objectives are largely orthogonal, and because the downstream benefits from alignment vary substantially across languages and task types. We analyze four XLM-R encoder models aligned on different language pairs and fine-tuned for either POS Tagging or Sentence Classification. Using representational analyses, including embedding distances, gradient similarities, and gradient magnitudes for both task and alignment losses, we find that: (1) embedding distances alone are unreliable predictors of improvements (or degradations) in task performance and (2) alignment and task gradients are often close to orthogonal, indicating that optimizing one objective may contribute little to optimizing the other. Taken together, our findings explain why “better” alignment often fails to translate into “better” cross-lingual transfer. Based on these insights, we provide practical guidelines for combining cross-lingual alignment with task-specific fine-tuning, highlighting the importance of careful loss selection.
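
The gradient-orthogonality measurement amounts to a cosine similarity between flattened gradient vectors of the two losses. A minimal sketch, assuming per-parameter gradients have already been extracted as arrays (the example gradients below are invented):

```python
import numpy as np

def grad_cosine(grads_a, grads_b):
    """Cosine similarity between two sets of per-parameter gradients
    (lists of arrays), flattened into single vectors. Values near zero
    indicate near-orthogonal objectives."""
    va = np.concatenate([g.ravel() for g in grads_a])
    vb = np.concatenate([g.ravel() for g in grads_b])
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# Hypothetical gradients of a task loss and an alignment loss
# for a two-layer model:
task_grads = [np.array([[1.0, 0.0], [0.0, 1.0]]), np.array([0.5, -0.5])]
align_grads = [np.array([[0.0, 1.0], [-1.0, 0.0]]), np.array([0.5, 0.5])]
cos = grad_cosine(task_grads, align_grads)  # near 0: objectives barely interact
```

When this value stays near zero during training, a gradient step that lowers the alignment loss does roughly nothing for the task loss, which is the paper's explanation for why better alignment fails to transfer.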

[48] Evaluating LLM-Generated Lessons from the Language Learning Students’ Perspective: A Short Case Study on Duolingo

Carlos Rafael Catalan, Patricia Nicole Monderin, Lheane Marie Dizon, Gap Estrella, Raymund John Sarmimento, Marie Antoinette Patalagsa

Main category: cs.CL

TL;DR: Language learning apps like Duolingo use LLMs for lessons but focus on general scenarios, lacking profession-specific content needed for professional fluency. Survey of employees shows general scenarios build foundational skills while work-related scenarios bridge the gap to professional fluency through domain-specific vocabulary.

DetailsMotivation: Current language learning applications using LLMs primarily focus on general real-world scenarios (greetings, ordering food, etc.) but lack support for profession-specific contexts, which hinders learners from achieving professional-level fluency in work-related communication.

Method: Surveyed five employees from a multinational company in the Philippines about their experiences with Duolingo, analyzing frequency of general vs. work-related scenarios, effectiveness of different lesson types, and collecting suggestions for lesson scenarios.

Result: Respondents encountered general scenarios more frequently than work-related ones; general scenarios were relatable and effective for building foundational grammar, vocabulary, and cultural knowledge; work-related scenarios help bridge the gap toward professional fluency through domain-specific vocabulary; participants suggested diverse lesson scenarios when analyzed collectively.

Conclusion: Language learning applications should generate lessons that adapt to individual needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios to better support professional fluency development.

Abstract: Popular language learning applications such as Duolingo use large language models (LLMs) to generate lessons for their users. Most lessons focus on general real-world scenarios such as greetings, ordering food, or asking for directions, with limited support for profession-specific contexts. This gap can hinder learners from achieving professional-level fluency, which we define as the ability to communicate work-related and domain-specific information comfortably in the target language. We surveyed five employees from a multinational company in the Philippines on their experiences with Duolingo. Results show that respondents encountered general scenarios more frequently than work-related ones, and that the former are relatable and effective in building foundational grammar, vocabulary, and cultural knowledge. The latter help bridge the gap toward professional fluency, as they contain domain-specific vocabulary. Each participant suggested lesson scenarios that diverge in context when analyzed in aggregate. With this understanding, we propose that language learning applications should generate lessons that adapt to an individual’s needs through personalized, domain-specific lesson scenarios while maintaining foundational support through general, relatable lesson scenarios.

[49] A Human-in/on-the-Loop Framework for Accessible Text Generation

Lourdes Moreno, Paloma Martínez

Main category: cs.CL

TL;DR: A hybrid human-AI framework for accessible text generation that integrates human participation in both generation (Human-in-the-Loop) and supervision (Human-on-the-Loop) to create traceable, reproducible accessible texts aligned with normative standards.

DetailsMotivation: Current automatic text simplification systems are largely automated and metric-driven, failing to reflect actual user comprehension or meet normative accessibility standards. There's a need for human-centered approaches that ensure generated texts are truly accessible.

Method: Proposes a hybrid framework combining Human-in-the-Loop (HiTL) adjustments during LLM-based generation and Human-on-the-Loop (HoTL) systematic post-generation review. Operationalizes empirical evidence into checklists aligned with standards, Event-Condition-Action trigger rules for expert oversight, and accessibility Key Performance Indicators (KPIs).

Result: The framework establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts, integrating explainability and ethical accountability as core design principles for more transparent and inclusive NLP systems.

Conclusion: Human-centered mechanisms can be systematically encoded for evaluation and reused to provide structured feedback that improves model adaptation, contributing to more transparent and inclusive NLP systems through explicit human participation in both generation and supervision.

Abstract: Plain Language and Easy-to-Read formats in text simplification are essential for cognitive accessibility. Yet current automatic simplification and evaluation pipelines remain largely automated, metric-driven, and fail to reflect user comprehension or normative standards. This paper introduces a hybrid framework that explicitly integrates human participation into LLM-based accessible text generation. Human-in-the-Loop (HiTL) contributions guide adjustments during generation, while Human-on-the-Loop (HoTL) supervision ensures systematic post-generation review. Empirical evidence from user studies and annotated resources is operationalized into (i) checklists aligned with standards, (ii) Event-Condition-Action trigger rules for activating expert oversight, and (iii) accessibility Key Performance Indicators (KPIs). The framework shows how human-centered mechanisms can be encoded for evaluation and reused to provide structured feedback that improves model adaptation. By embedding the human role in both generation and supervision, it establishes a traceable, reproducible, and auditable process for creating and evaluating accessible texts. In doing so, it integrates explainability and ethical accountability as core design principles, contributing to more transparent and inclusive NLP systems.
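
An Event-Condition-Action trigger rule of the kind the framework describes can be sketched as a small dispatch structure. Everything below is a hypothetical placeholder: the `readability_kpi` metric name, the 0.7 threshold, and the escalation message are not taken from the paper:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ECARule:
    """Event-Condition-Action rule: when `event` fires, check `condition`
    on the draft's metrics and, if it holds, run `action` (e.g. route the
    text to a human expert for Human-on-the-Loop review)."""
    event: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], str]

# Hypothetical rule: after generation, flag drafts whose readability KPI
# falls below a threshold for expert review.
rule = ECARule(
    event="draft_generated",
    condition=lambda m: m["readability_kpi"] < 0.7,
    action=lambda m: f"escalate to expert review (KPI={m['readability_kpi']})",
)

def dispatch(rule, event, metrics):
    """Fire the rule's action only when its event and condition match."""
    if event == rule.event and rule.condition(metrics):
        return rule.action(metrics)
    return "pass"

result = dispatch(rule, "draft_generated", {"readability_kpi": 0.55})
```

Encoding oversight as explicit rules like this is what makes the process traceable and auditable: every escalation decision can be logged against the rule that triggered it.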

[50] Progressive Training for Explainable Citation-Grounded Dialogue: Reducing Hallucination to Zero in English-Hindi LLMs

Vedant Pandya

Main category: cs.CL

TL;DR: XKD-Dial: A four-stage training pipeline for explainable, knowledge-grounded bilingual dialogue generation with citation mechanisms and comprehensive explainability analyses.

DetailsMotivation: Most knowledge-grounded dialogue systems are English-only, lack citation mechanisms for factual verification, and offer limited transparency into model decision-making. There's a need for bilingual systems with explainable citation behavior.

Method: Progressive four-stage training pipeline: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. Evaluated six models (250M-7B parameters) across encoder-decoder and decoder-only architectures.

Result: Citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward. Progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities. Smaller models match larger models on English after SFT. GRPO provides marginal improvement over well-designed SFT for structured citation tasks.

Conclusion: The proposed pipeline enables explainable, knowledge-grounded bilingual dialogue generation with effective citation mechanisms and reduced hallucination, while providing insights into how citation behavior is learned through systematic explainability analyses.

Abstract: Knowledge-grounded dialogue systems aim to generate informative, contextually relevant responses by conditioning on external knowledge sources. However, most existing approaches focus exclusively on English, lack explicit citation mechanisms for verifying factual claims, and offer limited transparency into model decision-making. We present XKD-Dial, a progressive four-stage training pipeline for explainable, knowledge-grounded dialogue generation in a bilingual (English-Hindi) setting, comprising: (1) multilingual adaptation, (2) English dialogue SFT with citation grounding, (3) bilingual dialogue SFT, and (4) GRPO alignment with citation-aware rewards. We evaluate six models spanning encoder-decoder (250M-3B) and decoder-only (1B-7B) architectures at every pipeline stage. Our key contributions are: (i) three post-hoc explainability analyses - cross-attention alignment, Integrated Gradients attribution, and occlusion-based causal grounding - applied systematically across the training trajectory to reveal how citation behaviour is learned, not only whether it is learned; (ii) citation-grounded SFT reduces hallucination to 0.0% for encoder-decoder models from Stage 2 onward; (iii) the progressive pipeline prevents catastrophic forgetting while improving Hindi capabilities; (iv) smaller models match larger models on English after SFT; and (v) GRPO provides marginal improvement over well-designed SFT for structured citation tasks. We evaluate across six automatic metrics (BLEU, ROUGE, BERTScore, FactScore, Citation-F1, and hallucination rate).
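
The Citation-F1 metric is named but not defined in the abstract. Under the common reading (set overlap between the citation IDs a response emits and those the reference requires), it can be computed as below; the snippet IDs are invented:

```python
def citation_f1(predicted, gold):
    """F1 over cited passage IDs for one response: precision over the
    citations the model emitted, recall over those the reference requires."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical cited knowledge-snippet IDs in one dialogue turn:
score = citation_f1(predicted=["k1", "k3"], gold=["k1", "k2"])
```

Here one of two predicted citations is correct and one of two required citations is recovered, so precision and recall are both 0.5 and the F1 is 0.5.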

[51] Entropy trajectory shape predicts LLM reasoning reliability: A diagnostic study of uncertainty dynamics in chain-of-thought

Xinghao Zhao

Main category: cs.CL

TL;DR: Monotonic entropy trajectories in chain-of-thought reasoning predict answer correctness better than aggregate uncertainty measures, enabling cheap failure detection.

DetailsMotivation: Chain-of-thought reasoning improves LLM accuracy but detecting reasoning failures cheaply remains challenging. The paper investigates whether the shape of uncertainty dynamics across reasoning steps can predict correctness more effectively than traditional aggregate uncertainty measures.

Method: Introduces entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. Tests on GSM8K with Qwen2.5-7B-Instruct and Mistral-7B, comparing monotone vs non-monotone chains, and analyzes entropy reduction patterns.

Result: Monotone chains achieve 68.8% accuracy vs 46.8% for non-monotone chains (+21.9 pp) on Qwen2.5-7B-Instruct. On Mistral-7B: 72.3% vs 37.6% (+34.7 pp). Total entropy reduction is not predictive (ρ=-0.06, p=0.31). Violation count 0/1/2 gives 68.8%/50.8%/28.6% accuracy. Outperforms scalar baselines at ~1,500 tokens/question.

Conclusion: Structural properties of uncertainty trajectories (shape of entropy dynamics) are more informative than aggregate uncertainty measures for predicting reasoning correctness. Monotonicity provides cheap failure detection at 1/8 the cost of 40-chain self-consistency.

Abstract: Chain-of-thought (CoT) reasoning improves LLM accuracy, yet detecting failures cheaply remains elusive. We study whether the shape of uncertainty dynamics across reasoning steps, captured by sampling a few answer completions per step, predicts correctness. We introduce entropy-trajectory monotonicity: a chain is monotone if its per-step answer-distribution entropy decreases at every step. On GSM8K (n=300) with Qwen2.5-7B-Instruct, monotone chains achieve 68.8% accuracy vs. 46.8% for non-monotone chains (+21.9 pp; Fisher’s p=0.0005; OR=2.50). Critically, total entropy reduction is not predictive (ρ = -0.06, p = 0.31), revealing a shape-over-magnitude dissociation: whether entropy decreases at every step matters, not how much. Violation counts of 0/1/2 give 68.8%/50.8%/28.6% accuracy. Token log-probability confidence worsens in calibration with step depth (ECE: 0.186 → 0.312), and monotonicity achieves +5.8 pp at 73.7% coverage, outperforming scalar baselines at approximately 1,500 tokens/question, 1/8 the cost of 40-chain self-consistency. Results replicate on Mistral-7B (n=300): monotone chains reach 72.3% vs. 37.6% (+34.7 pp; OR=4.33). Structural properties of uncertainty trajectories are thus more informative than aggregate measures.
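
The core diagnostic, per-step answer-distribution entropy and its monotonicity, can be sketched directly. The sampled answers below are invented; the paper's actual sampling setup may differ:

```python
import math
from collections import Counter

def step_entropy(answers):
    """Shannon entropy (nats) of the empirical answer distribution
    obtained by sampling several completions at one reasoning step."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def monotonicity(trajectory, tol=1e-9):
    """Return (is_monotone, violation_count) for a per-step entropy
    trajectory: monotone iff entropy decreases at every step."""
    violations = sum(1 for a, b in zip(trajectory, trajectory[1:]) if b > a + tol)
    return violations == 0, violations

# Hypothetical answers sampled at three successive reasoning steps,
# converging on "12":
steps = [["12", "15", "12", "18"], ["12", "12", "15", "12"], ["12"] * 4]
traj = [step_entropy(s) for s in steps]
is_monotone, violations = monotonicity(traj)
```

The chain above settles steadily on one answer, so its entropy trajectory is monotone with zero violations; a chain that wavers (entropy rising at any step) would be flagged as a likely failure.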

[52] RADIUS: Ranking, Distribution, and Significance - A Comprehensive Alignment Suite for Survey Simulation

Weronika Łajewska, Paul Missault, George Davidson, Saab Mansour

Main category: cs.CL

TL;DR: RADIUS introduces a comprehensive two-dimensional alignment suite for evaluating LLM-based survey simulations, focusing on both ranking and distribution alignment with statistical significance testing.

DetailsMotivation: Current evaluation metrics for LLM-based survey simulations are ad hoc, fragmented, and non-standardized, making results difficult to compare. Existing metrics focus mainly on accuracy or distributional measures while overlooking critical ranking alignment - a simulation can achieve high accuracy but still fail to capture the option most preferred by humans, which is crucial for decision-making applications.

Method: RADIUS provides a two-dimensional alignment suite that captures: 1) Ranking alignment (how well the simulation preserves human preference rankings) and 2) Distribution alignment (how well the simulation matches response distributions), each complemented by statistical significance testing. The framework enables reproducible and comparable assessment of survey simulations.

Result: The paper demonstrates that RADIUS highlights limitations of existing metrics and enables more meaningful evaluation of survey simulations. It provides an open-source implementation for reproducible assessment.

Conclusion: RADIUS offers a comprehensive framework for evaluating LLM-based survey simulations that addresses both ranking and distribution alignment with statistical rigor, overcoming limitations of current fragmented evaluation approaches.

Abstract: Simulation of surveys using LLMs is emerging as a powerful application for generating human-like responses at scale. Prior work evaluates survey simulation using metrics borrowed from other domains, which are often ad hoc, fragmented, and non-standardized, leading to results that are difficult to compare. Moreover, existing metrics focus mainly on accuracy or distributional measures, overlooking the critical dimension of ranking alignment. In practice, a simulation can achieve high accuracy while still failing to capture the option most preferred by humans - a distinction that is critical in decision-making applications. We introduce RADIUS, a comprehensive two-dimensional alignment suite for survey simulation that captures: 1) RAnking alignment and 2) DIstribUtion alignment, each complemented by statistical Significance testing. RADIUS highlights the limitations of existing metrics, enables more meaningful evaluation of survey simulation, and provides an open-source implementation for reproducible and comparable assessment.
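
The two alignment dimensions can be illustrated with a rank correlation and a distributional distance. This is a generic sketch, not RADIUS's actual metric choices, and the option counts are invented; the suite additionally pairs each dimension with a significance test, omitted here:

```python
def _ranks(xs):
    """Integer ranks of a sequence (no tie handling, for illustration)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order):
        ranks[i] = float(r)
    return ranks

def spearman_rho(a, b):
    """Spearman rank correlation: 1.0 when both sides rank the options
    identically, -1.0 when the rankings are reversed."""
    ra, rb = _ranks(a), _ranks(b)
    mean = (len(a) - 1) / 2.0
    cov = sum((x - mean) * (y - mean) for x, y in zip(ra, rb))
    var = sum((x - mean) ** 2 for x in ra)
    return cov / var

def total_variation(p_counts, q_counts):
    """Total variation distance between two response distributions
    (0 = identical, 1 = disjoint)."""
    sp, sq = sum(p_counts), sum(q_counts)
    return 0.5 * sum(abs(p / sp - q / sq) for p, q in zip(p_counts, q_counts))

# Hypothetical response counts over four survey options:
sim = [40, 30, 20, 10]
human = [35, 32, 18, 15]
rho = spearman_rho(sim, human)
tv = total_variation(sim, human)
```

The example shows why both dimensions are needed: the simulation here ranks the options exactly as humans do (rho = 1.0) even though the distributions differ, and the reverse failure (matching distribution shape, wrong top option) is the case the abstract highlights.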

[53] Hypothesis-Conditioned Query Rewriting for Decision-Useful Retrieval

Hangeol Chang, Changsun Lee, Seungjoon Rho, Junho Yeo, Jong Chul Ye

Main category: cs.CL

TL;DR: HCQR is a training-free RAG framework that rewrites queries to focus on evidence for decision-making rather than topical relevance, improving performance on multiple-choice tasks.

DetailsMotivation: Standard RAG methods use single queries that favor topical relevance over decision-relevant evidence, failing to discriminate among answer options in multiple-choice tasks.

Method: HCQR creates three targeted queries from a working hypothesis: (1) evidence supporting the hypothesis, (2) evidence distinguishing it from alternatives, and (3) verification of salient clues.

Result: HCQR outperforms single-query RAG and re-rank/filter baselines on MedQA and MMLU-Med, improving accuracy by 5.9 and 3.6 points respectively over Simple RAG.

Conclusion: HCQR effectively reorients RAG from topic-oriented to evidence-oriented retrieval, enabling better decision-making in multiple-choice scenarios without additional training.

Abstract: Retrieval-Augmented Generation (RAG) improves Large Language Models (LLMs) by grounding generation in external, non-parametric knowledge. However, when a task requires choosing among competing options, simply grounding generation in broadly relevant context is often insufficient to drive the final decision. Existing RAG methods typically rely on a single initial query, which often favors topical relevance over decision-relevant evidence, and therefore retrieves background information that can fail to discriminate among answer options. To address this issue, here we propose Hypothesis-Conditioned Query Rewriting (HCQR), a training-free pre-retrieval framework that reorients RAG from topic-oriented retrieval to evidence-oriented retrieval. HCQR first derives a lightweight working hypothesis from the input question and candidate options, and then rewrites retrieval into three targeted queries that seek evidence to: (1) support the hypothesis, (2) distinguish it from competing alternatives, and (3) verify salient clues in the question. This approach enables context retrieval that is more directly aligned with answer selection, allowing the generator to confirm or overturn the initial hypothesis based on the retrieved evidence. Experiments on MedQA and MMLU-Med show that HCQR consistently outperforms single-query RAG and re-rank/filter baselines, improving average accuracy over Simple RAG by 5.9 and 3.6 points, respectively. Code is available at https://anonymous.4open.science/r/HCQR-1C2E.

[54] What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia, Ahmad Muhammad Isa, Maxime Peyrard, Wei Zhao

Main category: cs.CL

TL;DR: MultiTempBench is a multilingual temporal reasoning benchmark with 15,000 examples across 5 languages and 3 calendar systems, evaluating 20 LLMs on temporal reasoning tasks and analyzing tokenization quality’s impact on performance.

DetailsMotivation: To address the lack of comprehensive multilingual temporal reasoning benchmarks that span different languages and calendar systems, and to understand how tokenization affects temporal reasoning in LLMs across resource settings.

Method: Created MultiTempBench with 15,000 examples by translating 750 curated English questions into 5 languages (English, German, Chinese, Arabic, Hausa) and expanding into controlled date-format variants across 3 calendar conventions (Gregorian, Hijri, Chinese Lunar). Evaluated 20 LLMs and introduced multilingual Date Fragmentation Ratio (mDFR) metric calibrated with human ratings, plus geometric-probing analyses of internal representations.

Result: Tokenization quality of temporal artifacts is resource-dependent: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are robust to digit-level splitting. Temporal linearity is strongest predictor of reasoning in high-resource languages, whereas fragmentation dominates in low-resource languages.

Conclusion: Tokenization fragmentation is a critical bottleneck for temporal reasoning in low-resource languages and calendar systems, requiring attention in multilingual LLM development and evaluation.

Abstract: We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains 15,000 examples built by translating 750 curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
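
The paper's mDFR is calibrated against human severity ratings, so its exact form is not recoverable from the abstract. As a rough, purely illustrative stand-in, one can count how many tokens a tokenizer spends per date field; the token lists below are hand-written, not real tokenizer output:

```python
def fragmentation_ratio(tokens, fields):
    """Crude stand-in for a date-fragmentation score: average number of
    digit-bearing tokens per date field (year/month/day). 1.0 means each
    field survives as a single token; higher values mean the tokenizer
    splinters the date."""
    digit_tokens = [t for t in tokens if t.isdigit()]
    return len(digit_tokens) / len(fields)

date_fields = ["2026", "03", "20"]  # Year / Month / Day
intact = ["2026", "-", "03", "-", "20"]              # each field one token
splintered = ["20", "26", "-", "0", "3", "-", "2", "0"]  # digit-level splits
low = fragmentation_ratio(intact, date_fields)
high = fragmentation_ratio(splintered, date_fields)
```

The abstract's finding is that the splintered case is far more damaging in low-resource languages and rarer calendars, where the model can no longer keep the Year/Month/Day fields apart.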

[55] MoRI: Learning Motivation-Grounded Reasoning for Scientific Ideation in Large Language Models

Chenyang Gu, Jiahao Cheng, Meicong Zhang, Pujun Zheng, Jinquan Zheng, Guoxiu He

Main category: cs.CL

TL;DR: MoRI is a framework that enables LLMs to learn explicit reasoning from research motivations to methodologies for scientific ideation, using supervised fine-tuning and composite reinforcement learning rewards for scientific rigor.

DetailsMotivation: Existing LLM-based approaches for scientific ideation inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. There's a need for frameworks that enable LLMs to explicitly learn the reasoning process from research motivations to methodologies.

Method: MoRI uses supervised fine-tuning to initialize the base LLM to generate research motivations from given contexts, then trains it under a composite reinforcement learning reward with two components: (1) entropy-aware information gain to encourage uncovering high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain to constrain reasoning trajectories to maintain conceptual alignment with scientifically valid solutions.

Result: MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions including novelty, technical rigor, and feasibility.

Conclusion: The MoRI framework successfully enables LLMs to learn explicit reasoning processes for scientific ideation, addressing limitations of existing approaches by incorporating scientific rigor through composite reinforcement learning rewards.

Abstract: Scientific ideation aims to propose novel solutions within a given scientific context. Existing LLM-based agentic approaches emulate human research workflows, yet inadequately model scientific reasoning, resulting in surface-level conceptual recombinations that lack technical depth and scientific grounding. To address this issue, we propose MoRI (Motivation-grounded Reasoning for Scientific Ideation), a framework that enables LLMs to explicitly learn the reasoning process from research motivations to methodologies. The base LLM is initialized via supervised fine-tuning to generate a research motivation from a given context, and is subsequently trained under a composite reinforcement learning reward that approximates scientific rigor: (1) entropy-aware information gain encourages the model to uncover and elaborate high-complexity technical details grounded in ground-truth methodologies, and (2) contrastive semantic gain constrains the reasoning trajectory to remain conceptually aligned with scientifically valid solutions. Empirical results show that MoRI significantly outperforms strong commercial LLMs and complex agentic baselines across multiple dimensions, including novelty, technical rigor, and feasibility. The code will be made available on GitHub: https://github.com/ECNU-Text-Computing/IdeaGeneration

[56] Parallelograms Strike Back: LLMs Generate Better Analogies than People

Qiawen Ella Liu, Raja Marjieh, Jian-Qiao Zhu, Adele E. Goldberg, Thomas L. Griffiths

Main category: cs.CL

TL;DR: LLMs generate better word analogies than humans, showing stronger alignment with parallelogram structure in embedding spaces, suggesting humans often fail to produce relation-preserving analogies rather than the parallelogram model being flawed.

DetailsMotivation: To investigate whether the parallelogram model fails as a model of analogical relations or because humans are poor at generating relation-preserving analogies, by comparing human and LLM performance on word analogy tasks.

Method: Compared human and LLM analogy completions on the same analogy problems from Peterson et al. (2020), analyzing parallelogram alignment in GloVe embedding space, local similarity reliance, word frequency effects, and human judgments of analogy quality.

Result: LLM-generated analogies were judged better than human ones, showed greater parallelogram alignment, and less reliance on accessible words. The advantage stems from humans producing a long tail of weak completions: when comparing modal responses, the LLM advantage disappears, but parallelogram alignment and lower word frequency still predict which LLM completions are rated higher.

Conclusion: The parallelogram model is not a poor account of word analogy; rather, humans often fail to produce completions satisfying relational constraints, while LLMs do so more consistently.

Abstract: Four-term word analogies (A:B::C:D) are classically modeled geometrically as "parallelograms," yet recent work suggests this model poorly captures how humans produce analogies, with simple local-similarity heuristics often providing a better account (Peterson et al., 2020). But does the parallelogram model fail because it is a bad model of analogical relations, or because people are not very good at generating relation-preserving analogies? We compared human and large language model (LLM) analogy completions on the same set of analogy problems from Peterson et al. (2020). We find that LLM-generated analogies are reliably judged as better than human-generated ones, and are also more closely aligned with the parallelogram structure in a distributional embedding space (GloVe). Crucially, we show that the improvement over human analogies was driven by greater parallelogram alignment and reduced reliance on accessible words rather than enhanced sensitivity to local similarity. Moreover, the LLM advantage is driven not by uniformly superior responses by LLMs, but by humans producing a long tail of weak completions: when only modal (most frequent) responses by both systems are compared, the LLM advantage disappears. However, greater parallelogram alignment and lower word frequency continue to predict which LLM completions are rated higher than those of humans. Overall, these results suggest that the parallelogram model is not a poor account of word analogy. Rather, humans may often fail to produce completions that satisfy this relational constraint, whereas LLMs do so more consistently.
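The parallelogram model tested here completes A:B::C:D by finding the vocabulary word nearest to C + (B - A) in embedding space. A minimal sketch with toy vectors standing in for GloVe; the vectors and vocabulary below are illustrative, not the paper's data:

```python
import numpy as np

# Toy embedding table standing in for GloVe; real vectors would be
# loaded from the pretrained GloVe files. Values are illustrative only.
emb = {
    "king":  np.array([0.8, 0.9, 0.1]),
    "queen": np.array([0.8, 0.9, 0.9]),
    "man":   np.array([0.7, 0.1, 0.1]),
    "woman": np.array([0.7, 0.1, 0.9]),
}

def parallelogram_completion(a, b, c, vocab):
    """Complete A:B::C:? by finding the word closest to c + (b - a)."""
    target = emb[c] + (emb[b] - emb[a])
    best, best_sim = None, -np.inf
    for w in vocab:
        if w in (a, b, c):        # exclude the query words themselves
            continue
        v = emb[w]
        sim = v @ target / (np.linalg.norm(v) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

print(parallelogram_completion("man", "king", "woman", emb))  # -> queen
```

The paper's finding is that LLM completions land closer to this geometric target than human completions do, on average.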

[57] A Dataset and Resources for Identifying Patient Health Literacy Information from Clinical Notes

Madeline Bittner, Dina Demner-Fushman, Yasmeen Shabazz, Davis Bartels, Dukyong Yoon, Brad Quitadamo, Rajiv Menghrajani, Leo Celi, Sarvesh Soni

Main category: cs.CL

TL;DR: HEALIX is the first publicly available annotated health literacy dataset from clinical notes, enabling automated detection of health literacy levels (low, normal, high) through LLM-based approaches.

DetailsMotivation: Current health literacy screening tools are not always feasible and vary in format, making structured EHR documentation difficult. Clinical notes contain richer health literacy information but lack annotated resources for automated detection.

Method: Created HEALIX dataset through social worker note sampling, keyword-based filtering, and LLM-based active learning. Contains 589 notes across 9 types with three health literacy labels. Benchmarked zero-shot and few-shot prompting across four open-source LLMs.

Result: HEALIX provides the first publicly available annotated health literacy dataset from real clinical notes. LLM benchmarking demonstrates the utility of the dataset for automated health literacy detection.

Conclusion: HEALIX enables automated health literacy detection from clinical notes, addressing limitations of current screening tools and facilitating structured documentation in EHRs.

Abstract: Health literacy is a critical determinant of patient outcomes, yet current screening tools are not always feasible and differ considerably in the number of items, question format, and dimensions of health literacy they capture, making documentation in structured electronic health records difficult to achieve. Automated detection from unstructured clinical notes offers a promising alternative, as these notes often contain richer, more contextual health literacy information, but progress has been limited by the lack of annotated resources. We introduce HEALIX, the first publicly available annotated health literacy dataset derived from real clinical notes, curated through a combination of social worker note sampling, keyword-based filtering, and LLM-based active learning. HEALIX contains 589 notes across 9 note types, annotated with three health literacy labels: low, normal, and high. To demonstrate its utility, we benchmarked zero-shot and few-shot prompting strategies across four open source large language models (LLMs).

[58] DaPT: A Dual-Path Framework for Multilingual Multi-hop Question Answering

Yilin Wang, Yuchun Fan, Jiaoyang Li, Ziming Zhu, Yongyu Mu, Qiaozhi He, Tong Xiao, Jingbo Zhu

Main category: cs.CL

TL;DR: DaPT is a multilingual RAG framework for multi-hop QA that addresses performance imbalances in cross-lingual retrieval by generating parallel sub-question graphs in source and English languages, then merging them for bilingual retrieval.

DetailsMotivation: RAG systems face challenges in multilingual multi-hop QA due to lack of benchmarks and overreliance on English semantic understanding, leading to performance imbalances across languages.

Method: Constructs multilingual benchmarks by translating English QA datasets, then proposes DaPT framework that generates parallel sub-question graphs for source-language queries and English translations, merges them, and uses bilingual retrieval-and-answer strategy.

Result: DaPT achieves 18.3% relative improvement in average EM score over strongest baseline on MuSiQue benchmark, demonstrating more accurate and concise answers across languages.

Conclusion: The paper addresses multilingual RAG challenges, showing advanced systems suffer from performance imbalances and proposing an effective bilingual approach that significantly enhances multilingual QA performance.

Abstract: Retrieval-augmented generation (RAG) systems have made significant progress in solving complex multi-hop question answering (QA) tasks in the English scenario. However, RAG systems inevitably face the application scenario of retrieving across multilingual corpora and queries, leaving several open challenges. The first one involves the absence of benchmarks that assess RAG systems’ capabilities under the multilingual multi-hop (MM-hop) QA setting. The second centers on the overreliance on LLMs’ strong semantic understanding in English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation counterpart, then merges them before employing a bilingual retrieval-and-answer strategy to sequentially solve sub-questions. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers compared to the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.

[59] UGID: Unified Graph Isomorphism for Debiasing Large Language Models

Zikang Ding, Junchi Yao, Junhao Li, Yi Zhang, Wenbo Jiang, Hongbo Liu, Lijie Hu

Main category: cs.CL

TL;DR: UGID is a debiasing framework for LLMs that treats Transformers as computational graphs and enforces structural invariance across counterfactual inputs to reduce social biases in internal representations.

DetailsMotivation: LLMs exhibit significant social biases that persist despite output-level or data-based debiasing methods. Prior work shows biases are embedded in internal representations, requiring new approaches that address bias at the structural level.

Method: Models Transformer as structured computational graph (attention mechanisms as edges, hidden states as nodes). Formulates debiasing as enforcing graph structure invariance across counterfactual inputs while allowing differences only on sensitive attributes. Uses log-space constraints on sensitive logits and selective anchor-based objectives to preserve definitional semantics.

Result: UGID effectively reduces bias in both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.

Conclusion: UGID provides an effective internal-representation-level debiasing framework that addresses bias migration across architectural components while maintaining model capabilities.

Abstract: Large language models (LLMs) exhibit pronounced social biases. Output-level or data-optimization-based debiasing methods cannot fully resolve these biases, and many prior works have shown that biases are embedded in internal representations. We propose Unified Graph Isomorphism for Debiasing large language models (UGID), an internal-representation-level debiasing framework for large language models that models the Transformer as a structured computational graph, where attention mechanisms define the routing edges of the graph and hidden states define the graph nodes. Specifically, debiasing is formulated as enforcing invariance of the graph structure across counterfactual inputs, with differences allowed only on sensitive attributes. UGID jointly constrains attention routing and hidden representations in bias-sensitive regions, effectively preventing bias migration across architectural components. To achieve effective behavioral alignment without degrading general capabilities, we introduce a log-space constraint on sensitive logits and a selective anchor-based objective to preserve definitional semantics. Extensive experiments on large language models demonstrate that UGID effectively reduces bias under both in-distribution and out-of-distribution settings, significantly reduces internal structural discrepancies, and preserves model safety and utility.
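The invariance idea can be sketched as a penalty on attention-routing differences between an input and its counterfactual, restricted to non-sensitive positions. This is a deliberately simplified sketch under assumed shapes; UGID's full objective also constrains hidden states and adds the log-space and anchor terms described above:

```python
import numpy as np

def structural_invariance_loss(attn_x, attn_cf, sensitive_mask):
    """Mean squared attention-routing difference between an input and its
    counterfactual, ignoring rows/columns at sensitive-attribute positions.
    Simplified sketch; shapes assumed to be (heads, seq, seq)."""
    keep = ~sensitive_mask                        # positions required to match
    diff = attn_x - attn_cf
    return float((diff[:, keep][:, :, keep] ** 2).mean())

rng = np.random.default_rng(0)
attn = rng.random((4, 6, 6))                      # attention maps for input x
attn_cf = attn.copy()
mask = np.zeros(6, dtype=bool)
mask[2] = True                                    # sensitive-attribute position
attn_cf[:, 2, :] += 0.5                           # counterfactual differs only there
loss = structural_invariance_loss(attn, attn_cf, mask)
print(loss)  # -> 0.0, since all non-sensitive routing is identical
```

A nonzero value would indicate bias "leaking" into routing decisions at positions that should be unaffected by the sensitive attribute.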

[60] Optimal Splitting of Language Models from Mixtures to Specialized Domains

Skyler Seto, Pierre Ablin, Anastasiia Filippova, Jiayuan Ye, Louis Bethune, Angelos Katharopoulos, David Grangier

Main category: cs.CL

TL;DR: A method for optimizing compute allocation between general pretraining and domain specialization using scaling laws to improve language model performance across knowledge and reasoning tasks.

DetailsMotivation: Current two-stage training (general pretraining followed by domain specialization) requires training multiple specialized models, which is computationally expensive. The paper aims to optimize compute allocation between these stages using scaling laws to improve efficiency and performance.

Method: Proposes using scaling laws to predict model loss for different compute allocations between general pretraining and domain specialization. The approach pretrains multiple models independently over a general corpus and determines optimal compute split using these predictions, allowing extrapolation to larger models and token counts.

Result: The method consistently improves performance across common sense knowledge and reasoning benchmarks for different model sizes and compute budgets, demonstrating effective optimization of training compute allocation.

Conclusion: Scaling laws can effectively guide compute allocation between general pretraining and domain specialization, leading to better performance on knowledge and reasoning tasks without requiring multiple specialized model trainings.

Abstract: Language models achieve impressive performance on a variety of knowledge, language, and reasoning tasks due to the scale and diversity of pretraining data available. The standard training recipe is a two-stage paradigm: pretraining first on the full corpus of data followed by specialization on a subset of high quality, specialized data from the full corpus. In the multi-domain setting, this involves continued pretraining of multiple models on each specialized domain, referred to as split model training. We propose a method for pretraining multiple models independently over a general pretraining corpus, and determining the optimal compute allocation between pretraining and continued pretraining using scaling laws. Our approach accurately predicts the loss of a model of size N with D pretraining and D’ specialization tokens, and extrapolates to larger model sizes and number of tokens. Applied to language model training, our approach improves performance consistently across common sense knowledge and reasoning benchmarks across different model sizes and compute budgets.
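The compute-splitting idea reduces to a one-dimensional search over the token allocation, given a fitted loss predictor L(N, D, D'). The functional form and constants below are placeholders for illustration, not the paper's fitted scaling law:

```python
import numpy as np

def predicted_loss(N, D_pre, D_spec):
    """Hypothetical two-stage scaling law; constants are placeholders,
    not the paper's fitted coefficients."""
    E, A, B, C = 1.7, 4e2, 4e10, 1e9
    return (E + A / N**0.34
              + B / max(D_pre, 1.0)**0.95
              + C / max(D_spec, 1.0)**0.95)

def best_split(N, token_budget, steps=99):
    """Grid-search the fraction of the token budget spent on specialization."""
    fracs = np.linspace(0.01, 0.99, steps)
    losses = [predicted_loss(N, (1 - f) * token_budget, f * token_budget)
              for f in fracs]
    i = int(np.argmin(losses))
    return fracs[i], losses[i]

f, loss = best_split(N=1e9, token_budget=2e10)
print(f"spend {f:.0%} of tokens on specialization (predicted loss {loss:.3f})")
```

The paper's contribution is fitting such a predictor from small-scale runs so the optimal split extrapolates to larger N and token counts without retraining every specialist.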

[61] VEPO: Variable Entropy Policy Optimization for Low-Resource Language Foundation Models

Chonghan Liu, Yimin Du, Qi An, Xin He, Cunqi Zhai, Fei Tan, Weijia Lin, Xiaochun Gong, Yongchao Deng, Shousheng Jia, Xiangzheng Zhang

Main category: cs.CL

TL;DR: VEPO uses reinforcement learning with verifiable rewards to improve LLM performance on low-resource languages through better subword segmentation and addressing training data imbalances.

DetailsMotivation: Large language models perform poorly on low-resource languages due to inefficient subword segmentation and training data imbalances, creating a need for methods that can incorporate structural constraints during training.

Method: Variable Entropy Policy Optimization (VEPO) uses reinforcement learning with verifiable rewards to enforce structural constraints like sequence length, format consistency, and linguistic well-formedness. It features a variable entropy mechanism for balancing literal fidelity and semantic naturalness, and uses entropy-tempered advantage estimation with asymmetric clipping.

Result: Empirical evaluations across 90 FLORES-200 translation directions, scored with COMET-22 and chrF, show substantial improvements in both tokenization efficiency and translation quality for underrepresented languages.

Conclusion: VEPO effectively bridges the performance gap for low-resource languages by incorporating deterministic structural constraints into policy alignment, improving both tokenization and translation quality.

Abstract: Large language models frequently exhibit suboptimal performance on low-resource languages, primarily due to inefficient subword segmentation and systemic training data imbalances. In this paper, we propose Variable Entropy Policy Optimization (VEPO), which leverages Reinforcement Learning with Verifiable Rewards to incorporate deterministic structural constraints into the policy alignment process. This framework ensures prescribed sequence length, robust format consistency, and rigorous linguistic well-formedness, all enforced during training. Central to our approach is a variable entropy mechanism that enables the model to dynamically calibrate the equilibrium between literal fidelity and semantic naturalness by modulating the exploration-exploitation manifold. By integrating entropy-tempered advantage estimation with asymmetric clipping, VEPO sustains robust exploration while mitigating policy collapse. Empirical evaluations across 90 FLORES-200 directions, measured with COMET-22 and chrF, demonstrate that VEPO yields substantial improvements in both tokenization efficiency and translation quality, bridging the performance gap for underrepresented languages.
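A rough sketch of what an entropy-tempered, asymmetrically clipped policy-gradient objective might look like; the tempering rule and the clip bounds here are assumptions for illustration, not VEPO's published equations:

```python
import numpy as np

def vepo_style_objective(ratio, advantage, entropy,
                         eps_low=0.2, eps_high=0.28, tau=0.05):
    """PPO-style clipped objective with two illustrative twists:
    - asymmetric clip bounds (eps_high > eps_low) keep more room for
      upweighting tokens than for downweighting them;
    - the advantage is tempered (scaled) by token entropy, so
      high-entropy tokens keep a stronger learning signal.

    ratio:     pi_new(a|s) / pi_old(a|s), elementwise
    advantage: estimated advantages
    entropy:   per-token policy entropy
    """
    tempered_adv = advantage * (1.0 + tau * entropy)   # assumed tempering rule
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * tempered_adv, clipped * tempered_adv)

obj = vepo_style_objective(ratio=np.array([0.5, 1.5]),
                           advantage=np.array([1.0, 1.0]),
                           entropy=np.array([2.0, 2.0]))
print(obj)
```

The asymmetry is one plausible way to "sustain robust exploration while mitigating policy collapse": downward moves are clipped tightly while upward moves get a wider band.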

[62] Evaluating Counterfactual Strategic Reasoning in Large Language Models

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou

Main category: cs.CL

TL;DR: LLMs evaluated in repeated game theory settings (Prisoner’s Dilemma & Rock-Paper-Scissors) with counterfactual variants to test strategic reasoning vs. memorization patterns.

DetailsMotivation: To determine whether LLMs' strategic performance in game-theoretic settings reflects genuine reasoning or merely reliance on memorized patterns from training data.

Method: Created counterfactual variants of canonical games (Prisoner’s Dilemma and Rock-Paper-Scissors) by altering payoff structures and action labels to break familiar symmetries and dominance relations. Used multi-metric evaluation framework comparing default and counterfactual instantiations.

Result: Showcased LLM limitations in incentive sensitivity, structural generalization, and strategic reasoning within counterfactual environments, indicating reliance on memorized patterns rather than genuine reasoning.

Conclusion: LLMs struggle with genuine strategic reasoning in novel game-theoretic environments, revealing limitations in their ability to generalize beyond memorized patterns from training data.

Abstract: We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner’s Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
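The counterfactual manipulation can be illustrated with the Prisoner's Dilemma: relabel the actions and alter the payoffs so that defection no longer strictly dominates. The specific counterfactual payoffs below are invented for illustration, not taken from the paper:

```python
# Row player's payoffs in the default Prisoner's Dilemma,
# keyed by (my_action, opponent_action).
default_pd = {
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}
# Counterfactual variant: actions relabeled and payoffs altered so
# no action strictly dominates (illustrative values).
counterfactual = {
    ("X", "X"): 3, ("X", "Y"): 4,
    ("Y", "X"): 5, ("Y", "Y"): 1,
}

def strictly_dominant(payoffs, actions):
    """Return the strictly dominant action, if one exists."""
    for a in actions:
        others = [b for b in actions if b != a]
        if all(payoffs[(a, opp)] > payoffs[(b, opp)]
               for b in others for opp in actions):
            return a
    return None

print(strictly_dominant(default_pd, ["C", "D"]))      # -> D
print(strictly_dominant(counterfactual, ["X", "Y"]))  # -> None
```

An LLM that reasons from the payoff table should notice the dominance relation has been broken; one that pattern-matches on the familiar game will keep "defecting."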

[63] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

Zhuolin Yang, Zihan Liu, Yang Chen, Wenliang Dai, Boxin Wang, Sheng-Chieh Lin, Chankyu Lee, Yangyi Chen, Dongfu Jiang, Jiafan He, Renjie Pi, Grace Lam, Nayeon Lee, Alexander Bukharin, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping

Main category: cs.CL

TL;DR: Nemotron-Cascade 2 is a 30B MoE model with 3B activated parameters that achieves state-of-the-art reasoning and agentic capabilities, matching frontier models in math/coding while being 20x smaller.

DetailsMotivation: To create a highly efficient open-weight language model that delivers exceptional reasoning and agentic capabilities with dramatically fewer parameters, achieving frontier-level performance in mathematical and coding competitions while maintaining compact size.

Method: Uses Mixture of Experts (MoE) architecture with 30B total parameters but only 3B activated. Key innovations include: 1) SFT on curated dataset, 2) expanded Cascade RL covering broader reasoning/agentic domains, 3) multi-domain on-policy distillation from strongest intermediate teacher models throughout Cascade RL process.

Result: Achieves Gold Medal-level performance in IMO, IOI, and ICPC World Finals - only second open-weight model to do so. Approaches frontier open models in mathematical and coding reasoning despite being 20x smaller. Demonstrates remarkably high intelligence density.

Conclusion: Nemotron-Cascade 2 represents a significant advancement in efficient language model design, showing that exceptional reasoning and agentic capabilities can be achieved with dramatically fewer parameters through careful architecture design and training methodology.

Abstract: We introduce Nemotron-Cascade 2, an open 30B MoE model with 3B activated parameters that delivers best-in-class reasoning and strong agentic capabilities. Despite its compact size, its mathematical and coding reasoning performance approaches that of frontier open models. It is the second open-weight LLM, after DeepSeekV3.2-Speciale-671B-A37B, to achieve Gold Medal-level performance in the 2025 International Mathematical Olympiad (IMO), the International Olympiad in Informatics (IOI), and the ICPC World Finals, demonstrating remarkably high intelligence density with 20x fewer parameters. In contrast to Nemotron-Cascade 1, the key technical advancements are as follows. After SFT on a meticulously curated dataset, we substantially expand Cascade RL to cover a much broader spectrum of reasoning and agentic domains. Furthermore, we introduce multi-domain on-policy distillation from the strongest intermediate teacher models for each domain throughout the Cascade RL process, allowing us to efficiently recover benchmark regressions and sustain strong performance gains along the way. We release the collection of model checkpoints and training data.

[64] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Main category: cs.CL

TL;DR: F2LLM-v2 is a family of multilingual embedding models in 8 sizes (80M to 14B) trained on 60M samples supporting 200+ languages, using LLM-based embedding training with matryoshka learning, pruning, and distillation for efficiency.

DetailsMotivation: To create efficient multilingual embedding models that support many languages (especially underserved mid/low-resource ones) while maintaining competitive performance across various sizes for different resource constraints.

Method: Two-stage LLM-based embedding training pipeline combined with matryoshka learning, model pruning, and knowledge distillation techniques to create efficient models across 8 different sizes.

Result: F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while smaller models set new state-of-the-art for resource-constrained applications, with all models supporting 200+ languages.

Conclusion: The F2LLM-v2 family provides efficient, multilingual embedding models that excel across various sizes, with particular strength in supporting underserved languages and resource-constrained applications.

Abstract: We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
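The matryoshka component of the recipe trains nested prefixes of each embedding under the same objective, so truncated vectors remain usable. A sketch assuming an in-batch InfoNCE loss; the prefix sizes and loss form are illustrative assumptions, not F2LLM-v2's exact recipe:

```python
import numpy as np

def softmax_xent(logits, labels):
    """Row-wise cross-entropy over a logit matrix."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def matryoshka_infonce(q, d, dims=(64, 128, 256), temp=0.05):
    """Average the same contrastive loss over nested embedding prefixes.

    q, d: (batch, full_dim) query/document embeddings; the i-th document
    is the positive for the i-th query, the rest are in-batch negatives.
    """
    labels = np.arange(len(q))                 # positives on the diagonal
    total = 0.0
    for k in dims:
        qk = q[:, :k] / np.linalg.norm(q[:, :k], axis=1, keepdims=True)
        dk = d[:, :k] / np.linalg.norm(d[:, :k], axis=1, keepdims=True)
        total += softmax_xent(qk @ dk.T / temp, labels)
    return total / len(dims)

rng = np.random.default_rng(0)
loss = matryoshka_infonce(rng.normal(size=(8, 256)),
                          rng.normal(size=(8, 256)))
print(f"{loss:.3f}")
```

Because every prefix is trained directly, embeddings can later be truncated to any of the nested sizes with little quality loss, which complements the pruning and distillation used for the smaller models in the family.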

[65] Using Optimal Transport as Alignment Objective for fine-tuning Multilingual Contextualized Embeddings

Sawsan Alqahtani, Garima Lalwani, Yi Zhang, Salvatore Romeo, Saab Mansour

Main category: cs.CL

TL;DR: Using Optimal Transport as an alignment objective during fine-tuning to improve multilingual contextualized representations for cross-lingual transfer without requiring word-alignment pairs.

DetailsMotivation: Multilingual contextualized embeddings need better alignment methods that consider context and avoid sub-optimal matching from pre-existing word-alignment pairs.

Method: Proposes using Optimal Transport as an unsupervised alignment objective during fine-tuning, allowing soft matching between source and target sentences without requiring word-alignment pairs.

Result: Achieves improvements over baselines and competitive results on XNLI and XQuAD tasks compared to similar recent works.

Conclusion: Optimal Transport provides an effective unsupervised alignment method for improving multilingual contextualized representations in cross-lingual transfer scenarios.

Abstract: Recent studies have proposed different methods to improve multilingual word representations in contextualized settings including techniques that align between source and target embedding spaces. For contextualized embeddings, alignment becomes more complex as we additionally take context into consideration. In this work, we propose using Optimal Transport (OT) as an alignment objective during fine-tuning to further improve multilingual contextualized representations for downstream cross-lingual transfer. This approach does not require word-alignment pairs prior to fine-tuning that may lead to sub-optimal matching and instead learns the word alignments within context in an unsupervised manner. It also allows different types of mappings due to soft matching between source and target sentences. We benchmark our proposed method on two tasks (XNLI and XQuAD) and achieve improvements over baselines as well as competitive results compared to similar recent works.
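The soft matching described above is commonly computed with entropy-regularized OT via Sinkhorn iterations. A minimal sketch over contextual token embeddings, an assumed stand-in rather than the paper's exact objective:

```python
import numpy as np

def sinkhorn_alignment(X, Y, reg=0.1, iters=200):
    """Soft word alignment between two sentences via entropic OT.

    X: (n, d) source-token embeddings, Y: (m, d) target-token embeddings.
    Returns an (n, m) transport plan whose entries act as soft, unsupervised
    alignment weights; no word-alignment pairs are required.
    """
    # Cost = 1 - cosine similarity between token embeddings.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = 1.0 - Xn @ Yn.T
    K = np.exp(-C / reg)                       # Gibbs kernel
    a = np.full(len(X), 1.0 / len(X))          # uniform token masses
    b = np.full(len(Y), 1.0 / len(Y))
    u = np.ones_like(a)
    for _ in range(iters):                     # Sinkhorn scaling iterations
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
P = sinkhorn_alignment(rng.normal(size=(4, 16)), rng.normal(size=(5, 16)))
print(P.sum())  # total transported mass, ~1.0
```

During fine-tuning, the resulting transport cost (sum of P * C) would serve as the alignment term added to the task loss; the soft plan is what permits one-to-many mappings between source and target words.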

[66] SQLBench: A Comprehensive Evaluation for Text-to-SQL Capabilities of Large Language Models

Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Chi Harold Liu, Zhiwei Xu, Guoliang Fan, Rui Zhao, Ziyue Li, Hangyu Mao

Main category: cs.CL

TL;DR: A study on LLMs for Text-to-SQL tasks, addressing prompt design and evaluation gaps through a new dataset and comprehensive task analysis.

DetailsMotivation: There's no consensus on optimal prompt templates and design frameworks for LLM-based Text-to-SQL systems, and existing benchmarks inadequately explore LLM performance across various sub-tasks of the Text-to-SQL process, hindering assessment of cognitive capabilities and optimization of solutions.

Method: Constructed a new dataset to mitigate overfitting risk in LLMs, then formulated five evaluation tasks to comprehensively assess performance of diverse methods across various LLMs throughout the Text-to-SQL process.

Result: The study highlights performance disparities among LLMs and proposes optimal in-context learning solutions tailored to each task, offering valuable insights for developing LLM-based Text-to-SQL systems.

Conclusion: The research provides a systematic framework for evaluating and optimizing LLM performance in Text-to-SQL tasks, addressing key gaps in prompt design and comprehensive assessment.

Abstract: Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt templates and design frameworks. Additionally, existing benchmarks inadequately explore the performance of LLMs across the various sub-tasks of the Text-to-SQL process, which hinders the assessment of LLMs’ cognitive capabilities and the optimization of LLM-based solutions. To address the aforementioned issues, we first construct a new dataset designed to mitigate the risk of overfitting in LLMs. Then we formulate five evaluation tasks to comprehensively assess the performance of diverse methods across various LLMs throughout the Text-to-SQL process. Our study highlights the performance disparities among LLMs and proposes optimal in-context learning solutions tailored to each task. These findings offer valuable insights for facilitating the development of LLM-based Text-to-SQL systems.

[67] LLMs Faithfully and Iteratively Compute Answers During CoT: A Systematic Analysis With Multi-step Arithmetics

Keito Kudo, Yoichi Aoki, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Ana Brassard, Keisuke Sakaguchi, Kentaro Inui

Main category: cs.CL

TL;DR: LLMs compute answers dynamically during CoT reasoning rather than predetermining them, making CoT explanations faithful reflections of internal computation.

DetailsMotivation: To investigate the faithfulness of chain-of-thought explanations in LLMs and understand when answers are determined during reasoning processes.

Method: Experiments with controlled arithmetic tasks to analyze internal information flow, specifically examining when answers are predetermined and the causal effect of CoT on final answers.

Result: LLMs don’t predetermine answers before CoT begins; they compute sub-answers dynamically during reasoning chain generation, making CoT faithful to internal computation.

Conclusion: Generated reasoning chains are faithful reflections of LLMs’ internal computation, as answers are computed on-the-fly during CoT generation rather than predetermined.

Abstract: This study investigates the internal information flow of large language models (LLMs) while performing chain-of-thought (CoT) style reasoning. Specifically, with a particular interest in the faithfulness of the CoT explanation to LLMs’ final answer, we explore (i) when the LLMs’ answer is (pre)determined, especially before the CoT begins or after, and (ii) how strongly the information from CoT specifically has a causal effect on the final answer. Our experiments with controlled arithmetic tasks reveal a systematic internal reasoning mechanism of LLMs. They have not derived an answer at the moment the input is fed into the model. Instead, they compute (sub-)answers while generating the reasoning chain on the fly. Therefore, the generated reasoning chains can be regarded as faithful reflections of the model’s internal computation.

[68] Enhancing Lexicon-Based Text Embeddings with Large Language Models

Yibin Lei, Tao Shen, Yu Cao, Andrew Yates

Main category: cs.CL

TL;DR: LENS introduces lexicon-based embeddings using LLMs that achieve competitive performance on text embedding tasks by clustering token embeddings to handle vocabulary redundancy, outperforming dense embeddings on MTEB benchmark.

DetailsMotivation: While dense embeddings dominate text embedding research, the authors aim to explore lexicon-based approaches using LLMs to achieve competitive performance while addressing token redundancy issues in LLM vocabularies.

Method: LENS consolidates vocabulary space through token embedding clustering, groups semantically similar tokens, investigates bidirectional attention and pooling strategies, and simplifies lexical matching by assigning dimensions to specific token clusters.

Result: LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivers compact representations, inherently supports efficient embedding dimension pruning, and achieves state-of-the-art performance on BEIR retrieval subset when combined with dense embeddings.

Conclusion: Lexicon-based embeddings using LLMs (LENS) offer a competitive alternative to dense embeddings, achieving strong performance on text embedding benchmarks while providing benefits like compact representations and inherent dimension pruning support.

Abstract: Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first lexicon-based embeddings (LENS) leveraging LLMs that achieve competitive performance on these tasks. LENS consolidates the vocabulary space through token embedding clustering to handle the issue of token redundancy in LLM vocabularies. To further improve performance, we investigate bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexical matching with redundant vocabularies by assigning each dimension to a specific token cluster, where semantically similar tokens are grouped together. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact representations with dimensionality comparable to dense counterparts. Furthermore, LENS inherently supports efficient embedding dimension pruning without any specialized objectives like Matryoshka Representation Learning. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).

[69] HiFi-KPI: A Dataset for Hierarchical KPI Extraction from Earnings Filings

Rasmus Aavang, Giovanni Rizzi, Rasmus Bøggild, Alexandre Iolov, Mike Zhang, Johannes Bjerva

Main category: cs.CL

TL;DR: HiFi-KPI dataset for financial KPI tagging from earnings reports with hierarchical labels linked to iXBRL taxonomies, supporting classification and extraction tasks.

Motivation: iXBRL financial filings have complex taxonomies that limit cross-company transferability of tagged KPIs, creating a need for better datasets and methods for financial information extraction.

Method: Created HiFi-KPI dataset with 1.65M paragraphs and 198k hierarchical labels from iXBRL taxonomies, plus HiFi-KPI-Lite subset for evaluation. Evaluated encoder-based models and LLMs on KPI classification, extraction, and structured extraction tasks.

Result: Encoder models achieved over 0.906 macro-F1 on classification, LLMs reached 0.440 F1 on structured extraction. Extraction errors mainly related to dates. Dataset and code are open-sourced.

Conclusion: HiFi-KPI enables better financial KPI extraction from earnings reports, with hierarchical labeling addressing iXBRL complexity limitations. The dataset supports multiple NLP tasks in financial domain.

Abstract: Accurate tagging of earnings reports can yield significant short-term returns for stakeholders. The machine-readable inline eXtensible Business Reporting Language (iXBRL) is mandated for public financial filings. Yet, its complex, fine-grained taxonomy limits the cross-company transferability of tagged Key Performance Indicators (KPIs). To address this, we introduce the Hierarchical Financial Key Performance Indicator (HiFi-KPI) dataset, a large-scale corpus of 1.65M paragraphs and 198k unique, hierarchically organized labels linked to iXBRL taxonomies. HiFi-KPI supports multiple tasks and we evaluate three: KPI classification, KPI extraction, and structured KPI extraction. For rapid evaluation, we also release HiFi-KPI-Lite, a manually curated 8K paragraph subset. Baselines on HiFi-KPI-Lite show that encoder-based models achieve over 0.906 macro-F1 on classification, while Large Language Models (LLMs) reach 0.440 F1 on structured extraction. Finally, a qualitative analysis reveals that extraction errors primarily relate to dates. We open-source all code and data at https://github.com/aaunlp/HiFi-KPI.

[70] PlainQAFact: Retrieval-augmented Factual Consistency Evaluation Metric for Biomedical Plain Language Summarization

Zhiwen You, Yue Guo

Main category: cs.CL

TL;DR: PlainQAFact is a retrieval-augmented QA metric for evaluating factual consistency in biomedical plain language summarization, specifically addressing elaborative explanations that introduce external content.

Motivation: Existing factual consistency evaluation methods struggle with plain language summarization in medical domains due to elaborative explanations that add external content (definitions, background, examples) not present in source abstracts, creating challenges for traditional entailment- and QA-based metrics.

Method: PlainQAFact first classifies sentence types, then applies a retrieval-augmented QA scoring method. It’s trained on a fine-grained human-annotated dataset called PlainFact, designed to evaluate factual consistency of both source-simplified and elaboratively explained sentences.

Result: PlainQAFact consistently outperforms existing evaluation metrics across all evaluation settings, especially for elaborative explanations where other methods fail. The paper also analyzes its effectiveness across different knowledge sources, answer extraction strategies, overlap measures, and document granularity levels.

Conclusion: The work presents a sentence-aware, retrieval-augmented metric specifically targeted at elaborative explanations in biomedical plain language summarization, providing both a new benchmark and practical evaluation tool to advance reliable and safe plain language communication in medical domains.

Abstract: Hallucinated outputs from large language models (LLMs) pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing automatic factual consistency evaluation methods, such as entailment- and question-answering (QA)-based, struggle with plain language summarization (PLS) due to the elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the scientific abstract to enhance comprehension. To address this, we introduce PlainQAFact, an automatic factual consistency evaluation metric trained on a fine-grained, human-annotated dataset PlainFact, for evaluating factual consistency of both source-simplified and elaborately explained sentences. PlainQAFact first classifies sentence type, then applies a retrieval-augmented QA scoring method. Empirical results show that existing evaluation metrics fail to evaluate the factual consistency in PLS, especially for elaborative explanations, whereas PlainQAFact consistently outperforms them across all evaluation settings. We further analyze PlainQAFact’s effectiveness across external knowledge sources, answer extraction strategies, answer overlap measures, and document granularity levels, refining its overall factual consistency assessment. Taken together, our work presents a sentence-aware, retrieval-augmented metric targeted at elaborative explanations in biomedical PLS tasks, providing the community with both a new benchmark and a practical evaluation tool to advance reliable and safe plain language communication in the medical domain. PlainQAFact and PlainFact are available at: https://github.com/zhiwenyou103/PlainQAFact

[71] Fin-R1: A Large Language Model for Financial Reasoning through Reinforcement Learning

Zhaowei Liu, Xin Guo, Zhi Yang, Fangqi Lou, Lingfeng Zeng, Jinyi Niu, Mengping Li, Qi Qi, Zhiqiang Liu, Yiyang Han, Dongpo Cheng, Ronghao Chen, Huacan Wang, Xingdong Feng, Huixia Judy Wang, Chengchun Shi, Liwen Zhang

Main category: cs.CL

TL;DR: Fin-R1 is a 7B parameter reasoning LLM specialized for finance, trained on curated financial CoT data with SFT+RL, achieving competitive performance on financial benchmarks and practical applications.

Motivation: General-purpose LLMs face challenges in finance due to fragmented data, opaque reasoning, and weak transferability to business applications, creating a need for specialized financial reasoning models.

Method: Two-stage pipeline: 1) Construct Fin-R1-Data (60,091 high-quality CoT samples from authoritative benchmarks), 2) Train Fin-R1 using supervised fine-tuning followed by reinforcement learning.

Result: Fin-R1 achieves competitive performance on established financial benchmarks despite its small 7B size, and demonstrates practical utility in compliance checking and robo-advisory applications.

Conclusion: Fin-R1 effectively addresses financial reasoning challenges with a compact, specialized model that balances performance and deployment costs, showing promise for real-world financial applications.

Abstract: In recent years, general-purpose large language models (LLMs) such as GPT, Gemini, Claude, and DeepSeek have advanced at an unprecedented pace. Despite these achievements, their application to finance remains challenging, due to fragmented data sources, intransparent reasoning processes, and weak transferability to business applications. In response, we introduce Fin-R1, a reasoning LLM designed for financial scenarios. With a compact size of 7 billion parameters, Fin-R1 reduces deployment costs while addressing the aforementioned challenges. Its development follows a two-stage pipeline. First, we construct Fin-R1-Data, a high-quality financial dataset consisting of 60,091 chain-of-thought (CoT) samples, distilled and filtered from multiple authoritative benchmarks to ensure consistency and reliability. Second, we train Fin-R1 using Fin-R1-Data through supervised fine-tuning (SFT), followed by reinforcement learning (RL). This stage substantially improves the model’s ability to solve complex financial reasoning tasks, yielding outputs that are both accurate and interpretable. Despite its relatively small parameter scale, Fin-R1 achieves competitive empirical performance across established financial benchmarks and demonstrates practical utility in compliance checking and robo-advisory. Our code is publicly available at https://github.com/SUFE-AIFLM-Lab/Fin-R1, and has already attracted over 700 stars.

[72] Enhancing Multi-Label Emotion Analysis and Corresponding Intensities for Ethiopian Languages

Tadesse Destaw Belay, Dawit Ketema Gete, Abinew Ali Ayele, Olga Kolesnikova, Iqra Ameer, Grigori Sidorov, Seid Muhie Yimam

Main category: cs.CL

TL;DR: Extends the EthioEmo dataset with emotion intensity annotations and benchmarks PLMs vs LLMs for multi-label emotion classification in Ethiopian languages.

Motivation: Existing emotion datasets lack intensity annotations needed to distinguish varying degrees of emotion expression, especially in multilingual African contexts where emotions are often expressed with different intensities.

Method: Extended the EthioEmo dataset with emotion intensity annotations, then benchmarked African-centric encoder-only PLMs against open-source LLMs for multi-label emotion classification with intensity features.

Result: African-centric encoder-only models outperformed open-source LLMs, and incorporating emotion-intensity features improved multi-label emotion classification performance.

Conclusion: Culturally and linguistically tailored small models are more effective than general LLMs for emotion understanding in African languages, and emotion intensity annotations enhance classification accuracy.

Abstract: Developing and integrating emotion-understanding models are essential for a wide range of human-computer interaction tasks, including customer feedback analysis, marketing research, and social media monitoring. Given that users often express multiple emotions simultaneously within a single instance, annotating emotion datasets in a multi-label format is critical for capturing this complexity. The EthioEmo dataset, a multilingual and multi-label emotion dataset for Ethiopian languages, lacks emotion intensity annotations, which are crucial for distinguishing varying degrees of emotion, as not all emotions are expressed with the same intensity. We extend the EthioEmo dataset to address this gap by adding emotion intensity annotations. Furthermore, we benchmark state-of-the-art encoder-only Pretrained Language Models (PLMs) and Large Language Models (LLMs) on this enriched dataset. Our results demonstrate that African-centric encoder-only models consistently outperform open-source LLMs, highlighting the importance of culturally and linguistically tailored small models in emotion understanding. Incorporating an emotion-intensity feature for multi-label emotion classification yields better performance. The data is available at https://huggingface.co/datasets/Tadesse/EthioEmo-intensities.

[73] ELM: A Hybrid Ensemble of Language Models for Automated Tumor Group Classification in Population-Based Cancer Registries

Lovedeep Gondara, Jonathan Simkin, Shebnum Devji, Gregory Arbour, Raymond Ng

Main category: cs.CL

TL;DR: ELM is a hybrid language model ensemble combining fine-tuned encoder models with LLM arbitration for automated tumor group classification from pathology reports, achieving 0.94 precision/recall and reducing manual review by 60-70%.

Motivation: Manual extraction of cancer data from unstructured pathology reports is extremely labor-intensive (900 person-hours annually for 100k reports). Current rule-based automated systems fail to handle the linguistic complexity of tumor group classification tasks.

Method: ELM combines small encoder-only language models with large language models. Uses ensemble of 6 fine-tuned encoder models (3 analyzing top portion, 3 analyzing bottom portion of each report). When at least 5 of 6 models agree, tumor group is assigned; otherwise, LLM arbitrates using carefully curated prompt constrained to likely tumor groups.
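The 5-of-6 decision rule can be sketched directly. This is a hypothetical skeleton, not the deployed system: the arbiter is a stub, and the constraint "likely tumor groups" is approximated here as the set of labels the encoders proposed.

```python
from collections import Counter

def elm_assign(encoder_votes, llm_arbitrate):
    """Sketch of ELM's decision rule: six encoder predictions; accept the
    majority label when at least five of six agree, otherwise fall back
    to LLM arbitration constrained to the candidate tumor groups."""
    label, count = Counter(encoder_votes).most_common(1)[0]
    if count >= 5:
        return label
    # Constrain the arbiter to the tumor groups the encoders proposed.
    return llm_arbitrate(sorted(set(encoder_votes)))

# Near-unanimous case: no LLM call is needed.
assert elm_assign(["lymphoma"] * 5 + ["leukemia"],
                  llm_arbitrate=lambda c: c[0]) == "lymphoma"

# Disagreement: a stub arbiter picks from the candidate set.
pick_first = lambda candidates: candidates[0]
assert elm_assign(["skin", "skin", "skin", "lymphoma", "lymphoma", "leukemia"],
                  pick_first) in {"leukemia", "lymphoma", "skin"}
```

The design choice is cost-driven: the expensive LLM is only consulted on the minority of reports where the cheap encoders disagree.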

Result: Achieved weighted precision and recall of 0.94 on test set of 2,058 pathology reports across 19 tumor groups. Statistically significant improvement over encoder-only ensembles (0.91 F1-score). Particularly improved challenging categories: leukemia (F1: 0.76 to 0.88), lymphoma (0.76 to 0.89), skin cancer (0.44 to 0.58).

Conclusion: ELM represents first successful deployment of hybrid small encoder models-LLM architecture for tumor group classification in real-world PBCR setting. Deployed in production at British Columbia Cancer Registry, reducing manual review by 60-70% and saving ~900 person-hours annually while maintaining data quality.

Abstract: Background: Population-based cancer registries (PBCRs) manually extract data from unstructured pathology reports, a labor-intensive process where assigning reports to tumor groups can consume 900 person-hours annually for approximately 100,000 reports at a medium-sized registry. Current automated rule-based systems fail to handle the linguistic complexity of this classification task. Materials and Methods: We present ELM (Ensemble of Language Models), a novel hybrid approach combining small, encoder only language models and large language models (LLMs). ELM employs an ensemble of six fine-tuned encoder only models: three analyzing the top portion and three analyzing the bottom portion of each report to maximize text coverage given token limits. A tumor group is assigned when at least five of six models agree; otherwise, an LLM arbitrates using a carefully curated prompt constrained to likely tumor groups. Results: On a held-out test set of 2,058 pathology reports spanning 19 tumor groups, ELM achieves weighted precision and recall of 0.94, representing a statistically significant improvement (p<0.001) over encoder-only ensembles (0.91 F1-score) and substantially outperforming rule-based approaches. ELM demonstrates particular gains for challenging categories including leukemia (F1: 0.76 to 0.88), lymphoma (0.76 to 0.89), and skin cancer (0.44 to 0.58). Discussion: Deployed in production at British Columbia Cancer Registry, ELM has reduced manual review requirements by approximately 60-70%, saving an estimated 900 person-hours annually while maintaining data quality standards. Conclusion: ELM represents the first successful deployment of a hybrid small, encoder only models-LLM architecture for tumor group classification in a real-world PBCR setting, demonstrating how strategic combination of language models can achieve both high accuracy and operational efficiency.

[74] RAISE: Enhancing Scientific Reasoning in LLMs via Step-by-Step Retrieval

Minhae Oh, Jeonghye Kim, Nakyung Lee, Donggeon Seo, Taeuk Kim, Jungwoo Lee

Main category: cs.CL

TL;DR: RAISE is a retrieval-augmented framework for scientific reasoning that uses problem decomposition, logical query generation, and logical retrieval to find logically relevant documents from in-the-wild corpus, outperforming other baselines on scientific reasoning benchmarks.

Motivation: Scientific reasoning requires long-chain reasoning, domain-specific knowledge, and adaptation to updated findings. Current approaches struggle with these challenges, particularly in retrieving logically relevant documents beyond just domain similarity.

Method: Three-step framework: 1) Problem decomposition, 2) Logical query generation, and 3) Logical retrieval from in-the-wild corpus. The approach focuses on retrieving documents that are logically relevant rather than just domain-similar.
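The three-step control flow can be sketched as a skeleton. Every callable below is a placeholder standing in for an LLM call or a corpus index, not an API from the paper.

```python
def raise_pipeline(question, decompose, make_query, retrieve, answer):
    # Hypothetical skeleton of the three steps named above.
    subproblems = decompose(question)          # 1) problem decomposition
    evidence = []
    for sub in subproblems:
        query = make_query(sub)                # 2) logical query generation
        evidence.extend(retrieve(query))       # 3) logical retrieval
    return answer(question, evidence)

# Stub components standing in for LLM calls and an in-the-wild corpus.
out = raise_pipeline(
    "Why does ice float on water?",
    decompose=lambda q: ["density of ice", "density of liquid water"],
    make_query=lambda sub: f"evidence for: {sub}",
    retrieve=lambda query: [query.upper()],
    answer=lambda q, docs: (q, docs),
)
assert len(out[1]) == 2
```

The key idea the skeleton captures is that retrieval is driven by per-subproblem logical queries rather than by the surface form of the original question.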

Result: RAISE consistently outperforms other baselines on scientific reasoning benchmarks. Analysis shows it retrieves documents that are not only similar in domain knowledge but also logically more relevant compared to other approaches.

Conclusion: RAISE effectively addresses scientific reasoning challenges through its step-by-step retrieval-augmented framework that emphasizes logical relevance in document retrieval, demonstrating superior performance on scientific reasoning tasks.

Abstract: Scientific reasoning requires not only long-chain reasoning processes, but also knowledge of domain-specific terminologies and adaptation to updated findings. To deal with these challenges for scientific reasoning, we introduce RAISE, a step-by-step retrieval-augmented framework which retrieves logically relevant documents from in-the-wild corpus. RAISE is divided into three steps: problem decomposition, logical query generation, and logical retrieval. We observe that RAISE consistently outperforms other baselines on scientific reasoning benchmarks. We analyze that unlike other baselines, RAISE retrieves documents that are not only similar in terms of the domain knowledge, but also documents logically more relevant.

[75] Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes

Johann Frei, Nils Feldhus, Lisa Raithel, Roland Roller, Alexander Meyer, Frank Kramer

Main category: cs.CL

TL;DR: Infherno: An LLM agent framework with code execution and healthcare terminology tools for translating free-form clinical notes into structured FHIR resources, addressing limitations of previous modular approaches.

Motivation: HL7 FHIR is a standard for healthcare data interoperability, but previous automated translation methods from clinical notes to FHIR resources have limited generalizability and structural conformity issues with modular approaches or instruction-tuned LLMs.

Method: End-to-end framework using LLM agents, code execution, and healthcare terminology database tools to ensure adherence to FHIR document schema. Supports both local and proprietary models with front-end for custom/synthetic data.

Result: Infherno competes well with human baseline in predicting FHIR resources from unstructured text. Gemini 2.5-Pro performs best in evaluations on synthetic and clinical datasets.

Conclusion: The framework supports clinical data integration and interoperability across institutions, though ambiguity and ground-truth data collection feasibility remain challenges.

Abstract: For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources address narrowly defined tasks and rely on modular approaches or LLMs with instruction tuning and constrained decoding. As those solutions frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions. Gemini 2.5-Pro excels in our evaluation on synthetic and clinical datasets, yet ambiguity and feasibility of collecting ground-truth data remain open problems.

[76] Redundancy-as-Masking: Formalizing the Artificial Age Score (AAS) to Model Memory Aging in Generative AI

Seyma Yaman Kayadibi

Main category: cs.CL

TL;DR: The paper introduces an Artificial Age Score (AAS) to measure memory aging in AI systems, showing that LLMs maintain semantic memory but lose episodic memory when context is reset.

Motivation: To develop a theoretically grounded metric for evaluating memory degradation in artificial intelligence systems, particularly focusing on how LLMs exhibit different memory behaviors for semantic vs. episodic information when conversational context is reset.

Method: Introduces the Artificial Age Score (AAS) as a log-scaled, entropy-informed metric of memory aging. Tested over a 25-day bilingual study with ChatGPT-5 using stateless and persistent interaction phases, measuring recall of semantic (day names) vs. episodic (experiment sequence) details.

Result: In persistent sessions, ChatGPT-5 recalled both semantic and episodic details, showing structural youth (low AAS). When sessions were reset, it preserved semantic consistency but failed episodic continuity, causing sharp AAS increase indicating memory aging.

Conclusion: AAS serves as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in AI systems, with implications for understanding AI memory architectures and their limitations.

Abstract: Artificial intelligence is observed to age not through chronological time but through structural asymmetries in memory performance. In large language models, semantic cues such as the name of the day often remain stable across sessions, while episodic details like the sequential progression of experiment numbers tend to collapse when conversational context is reset. To capture this phenomenon, the Artificial Age Score (AAS) is introduced as a log-scaled, entropy-informed metric of memory aging derived from observable recall behavior. The score is formally proven to be well-defined, bounded, and monotonic under mild and model-agnostic assumptions, making it applicable across various tasks and domains. In its Redundancy-as-Masking formulation, the score interprets redundancy as overlapping information that reduces the penalized mass. However, in the present study, redundancy is not explicitly estimated; all reported values assume a redundancy-neutral setting (R = 0), yielding conservative upper bounds. The AAS framework was tested over a 25-day bilingual study involving ChatGPT-5, structured into stateless and persistent interaction phases. During persistent sessions, the model consistently recalled both semantic and episodic details, driving the AAS toward its theoretical minimum, indicative of structural youth. In contrast, when sessions were reset, the model preserved semantic consistency but failed to maintain episodic continuity, causing a sharp increase in the AAS and signaling structural memory aging. These findings support the utility of AAS as a theoretically grounded, task-independent diagnostic tool for evaluating memory degradation in artificial systems. The study builds on foundational concepts from von Neumann’s work on automata, Shannon’s theories of information and redundancy, and Turing’s behavioral approach to intelligence.

[77] Large Language Models Hallucination: A Comprehensive Survey

Aisha Alansari, Hamzah Luqman

Main category: cs.CL

TL;DR: Comprehensive survey on hallucination in large language models, covering types, causes, detection methods, mitigation strategies, and future research directions.

Motivation: LLMs often generate fluent but factually inaccurate content (hallucinations), undermining reliability in domains requiring factual accuracy. This survey aims to systematically review hallucination research to develop more truthful LLMs.

Method: Provides taxonomies of hallucination types and root causes across LLM development lifecycle, analyzes hallucination emergence in NLG tasks, introduces structured taxonomies for detection approaches and mitigation strategies, and reviews evaluation benchmarks/metrics.

Result: Comprehensive framework for understanding, detecting, and mitigating hallucinations in LLMs, with analysis of strengths/limitations of current approaches and identification of key research gaps.

Conclusion: Hallucination remains a critical challenge for LLM reliability; systematic understanding and improved detection/mitigation methods are needed for more truthful and trustworthy LLMs.

Abstract: Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated information, a phenomenon known as hallucination. Hallucination refers to the generation of content by an LLM that is fluent and syntactically correct but factually inaccurate or unsupported by external evidence. Hallucinations undermine the reliability and trustworthiness of LLMs, especially in domains requiring factual accuracy. This survey provides a comprehensive review of research on hallucination in LLMs, with a focus on causes, detection, and mitigation. We first present a taxonomy of hallucination types and analyze their root causes across the entire LLM development lifecycle, from data collection and architecture design to inference. We further examine how hallucinations emerge in key natural language generation tasks. Building on this foundation, we introduce a structured taxonomy of detection approaches and another taxonomy of mitigation strategies. We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLMs hallucinations. Finally, we outline key open challenges and promising directions for future research, providing a foundation for the development of more truthful and trustworthy LLMs.

[78] AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching

Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao

Main category: cs.CL

TL;DR: AdaSwitch: Adaptive switching between on-policy and off-policy knowledge distillation for small language models to balance generation consistency and supervision quality.

Motivation: Small language models need knowledge distillation from larger models but face a dilemma: off-policy distillation has exposure bias (training-inference mismatch) while on-policy distillation suffers from low-quality student-generated outputs.

Method: AdaSwitch dynamically combines on-policy and off-policy generation via an adaptive switching mechanism. It allows the student to explore its own predictions and selectively integrates teacher guidance only when the divergence exceeds a context-aware threshold.
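A toy numpy sketch of this switching rule follows. The KL direction, greedy decoding, and threshold value are all assumptions; the paper's threshold is context-aware rather than fixed.

```python
import numpy as np

def kl(p, q, eps=1e-9):
    # KL divergence between two discrete next-token distributions.
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def adaswitch_step(student_probs, teacher_probs, threshold):
    """Sketch of the adaptive switch: let the student keep its own
    next-token choice while it stays close to the teacher, and splice
    in the teacher's token once divergence exceeds the threshold."""
    if kl(teacher_probs, student_probs) > threshold:
        return int(np.argmax(teacher_probs)), "teacher"
    return int(np.argmax(student_probs)), "student"

teacher = np.array([0.7, 0.2, 0.1])
close_student = np.array([0.6, 0.3, 0.1])   # small divergence -> explore
far_student = np.array([0.05, 0.15, 0.8])   # large divergence -> guide

assert adaswitch_step(close_student, teacher, threshold=0.5)[1] == "student"
assert adaswitch_step(far_student, teacher, threshold=0.5)[1] == "teacher"
```

Applied token by token during generation, this yields trajectories that are mostly on-policy (consistent with inference) but corrected by the teacher at the points where the student would go wrong.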

Result: Experiments on three datasets show AdaSwitch consistently improves accuracy and reasoning capability with moderate overhead.

Conclusion: AdaSwitch provides an effective approach to knowledge distillation for small language models by balancing generation consistency with high-quality supervision.

Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods face a dilemma: off-policy distillation provides high-quality supervision but suffers from exposure bias (training inference mismatch), while on-policy approaches ensure consistency but are limited by the low quality of student-generated outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation via an adaptive switching mechanism. AdaSwitch allows the student to explore its predictions within its capability and selectively integrates teacher guidance only when divergence exceeds a context-aware threshold. This paradigm preserves generation consistency while ensuring high-quality supervision. Experiments on three datasets demonstrate that AdaSwitch consistently improves accuracy and reasoning capability with moderate overhead.

[79] If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models

Jasmin Orth, Philipp Mondorf, Barbara Plank

Main category: cs.CL

TL;DR: LLMs’ conditional acceptability judgments are influenced by both conditional probability and semantic relevance, but they show less consistency than humans and larger models don’t necessarily align better with human judgments.

Motivation: To understand how LLMs judge the acceptability of conditional statements, which is important for communication and reasoning but remains unexplored despite prior work on LLMs' conditional inference capabilities.

Method: Comprehensive study across different LLM families, sizes, and prompting strategies using linear mixed-effects models and ANOVA tests to analyze sensitivity to conditional probability and semantic relevance.

Result: LLMs are sensitive to both conditional probability and semantic relevance, but with varying degrees depending on architecture and prompting. They incorporate these cues less consistently than humans, and larger models don’t necessarily align more closely with human judgments.

Conclusion: LLMs show some understanding of conditional acceptability but lack the consistency of human judgment, suggesting limitations in their reasoning about conditional statements despite their probabilistic and semantic processing capabilities.

Abstract: Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional “If A, then B” is, their judgments are influenced by two main factors: the conditional probability of B given A, and the semantic relevance of the antecedent A given the consequent B (i.e., whether A meaningfully supports B). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the acceptability of such statements. To address this gap, we present a comprehensive study of LLMs’ conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance, though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.

[80] Augmenting Rating-Scale Measures with Text-Derived Items Using the Information-Determined Scoring (IDS) Framework

Joe Watson, Ivan O’Connor, Chia-Wen Chen, Luning Sun, Fang Luo, David Stillwell

Main category: cs.CL

TL;DR: LLMs score free-text responses to create psychometric items that enhance depression measurement precision and efficiency compared to traditional rating scales.

Motivation: Psychological assessments rely on rating-scale items that condense complex experiences into categories, while unstructured text data is captured but rarely used for measurement due to lack of direct mapping to latent scales.

Method: Information-Determined Scoring (IDS) framework uses LLMs with simple prompts to score free-text responses, generating candidate items co-calibrated with baseline scales and retained based on psychometric information gain about target traits.

Result: Augmenting 19-item depression rating scale with LLM-derived items significantly improved measurement precision, accuracy, and convergent validity with suicidality measures. In adaptive testing, LLM items provided information equivalent to adding 6.3-16.0 rating-scale items, enabling earlier high-precision measurement.

Conclusion: The IDS framework effectively leverages unstructured text to enhance existing psychological measures, with applications in clinical health and beyond, representing a conceptual shift from traditional automated text scoring.

Abstract: Psychological assessments commonly rely on rating-scale items, which require respondents to condense complex experiences into predefined categories. Although rich, unstructured text is often captured alongside these scales, it rarely contributes to measuring the target trait because it lacks direct mapping to the latent scale. We introduce the Information-Determined Scoring (IDS) framework, where large language models (LLMs) score free-text responses with simple prompts to generate candidate items that are co-calibrated with a baseline scale and retained based on the psychometric information they provide about the target trait. This marks a conceptual departure from traditional automated text scoring by prioritising information gain over fidelity to expert rubrics or human-annotated data. Using depression as a case study, we developed and tested the method in upper-secondary students (n = 693) and a matched synthetic dataset (n = 3,000). Across held-out test sets, augmenting a 19-item rating-scale measure with LLM-derived items yielded significant improvements in measurement precision and accuracy, and stronger convergent validity with an external suicidality measure throughout the adaptive test. In adaptive simulations, LLM-derived items contributed information equivalent to adding up to 6.3 and 16.0 rating-scale items in real and synthetic data, respectively. This enabled earlier high-precision measurement: after 10 items, 46.3% of respondents reached SE <= .3 under the strongest augmented test compared with 35.5% at baseline in real data, and 60.4% versus 34.7% in synthetic data. These findings illustrate how the IDS framework leverages unstructured text to enhance existing psychological measures, with applications in clinical health and beyond.

[81] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao

Main category: cs.CL

TL;DR: DSPO is an RL algorithm that trains LLMs to actively search external knowledge through multi-turn search and reasoning without supervised data, achieving significant improvements on QA benchmarks.

DetailsMotivation: Current approaches for enabling LLMs to search external knowledge either rely on prompting (limited agent capabilities) or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped.

Method: Introduces Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. Trains models purely through RL to interleave multi-turn search and reasoning without supervised demonstration data.

Result: Across multiple QA benchmarks, their 7B model improves over comparable previous work by 34.1%, and outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly 9% relative, while maintaining exceptional training stability.

Conclusion: DSPO enables effective training of LLMs as agents for complex knowledge-seeking tasks through pure RL, demonstrating significant performance improvements and stability over existing approaches.

Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model’s innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce Dynamic-filter Sequence-level Policy Optimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our 7B model improves over a comparable previous work by 34.1%, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly 9% relative, while maintaining exceptional training stability.

[82] DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models

Kaixuan Ren, Preslav Nakov, Usman Naseem

Main category: cs.CL

TL;DR: DUAL-Bench is a multimodal benchmark for evaluating over-refusal and safe completion in vision-language models, focusing on dual-use scenarios where benign instructions accompany potentially harmful images.

DetailsMotivation: Current VLMs struggle with balancing safety and usefulness, often exhibiting over-refusal (declining benign requests) or unsafe generation. There's a lack of benchmarks addressing over-refusal in visual modality, especially for dual-use cases where instructions are harmless but images contain harmful content.

Method: Created DUAL-Bench, a large-scale multimodal benchmark evaluating 18 VLMs across 12 hazard categories using semantics-preserving visual perturbations. Focuses on measuring safe completion: fulfilling benign parts of requests while warning about harmful elements.

Result: Models show extremely fragile safety boundaries in dual-use scenarios, falling into binary traps of either overly sensitive refusal or defenseless generation of dangerous content. Best-performing model GPT-5-Nano achieved only 12.9% safe completion, with GPT-5 and Qwen families averaging 7.9% and 3.9% respectively.

Conclusion: Current VLMs lack nuanced safety mechanisms for multimodal scenarios. DUAL-Bench highlights the need for more fine-grained alignment strategies that balance safety and utility in vision-language models.

Abstract: As vision-language models (VLMs) become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, there is currently a significant lack of benchmarks that have systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behaviour is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, a large-scale multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories under semantics-preserving visual perturbations. In dual-use scenarios, models exhibit extremely fragile safety boundaries. They fall into a binary trap: either overly sensitive direct refusal or defenseless generation of dangerous content. Consequently, even the best-performing model, GPT-5-Nano, achieves just 12.9% safe completion, with the GPT-5 and Qwen families averaging 7.9% and 3.9%, respectively. We hope DUAL-Bench fosters nuanced alignment strategies balancing multimodal safety and utility. Content Warning: This paper contains examples of sensitive and potentially hazardous content.

[83] From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program

Joseph E. Trujillo-Falcon, Monica L. Bozeman, Liam E. Llewellyn, Samuel T. Halvorson, Meryl Mizell, Stuti Deshpande, Bob Manning, Chris Rohrbach, Ian Blaylock, Angel Montanez, Todd Fagin

Main category: cs.CL

TL;DR: NWS develops AI-powered automated translation system for weather products using LLMs adapted for weather terminology, currently supporting Spanish, Chinese, Vietnamese and other languages to serve non-English speakers.

DetailsMotivation: To serve the 68.8 million people in the U.S. who don't speak English at home and advance a Weather-Ready Nation by providing timely, accurate weather information to all Americans regardless of language.

Method: Partnership with LILT to adapt large language models for neural machine translation of weather terminology; uses GIS mapping to identify language needs; integrates ethical AI practices with human oversight; designed for scalability across NWS offices.

Result: Development of automated translation system significantly reducing manual translation time; creation of experimental multilingual NWS products including warnings, forecasts, and educational campaigns available through a website.

Conclusion: The AI-powered translation system brings the U.S. closer to a national warning system that reaches all Americans, providing culturally relevant translations while maintaining accuracy and timeliness through ethical AI practices.

Abstract: To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program’s design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated into a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

[84] StreamingThinker: Large Language Models Can Think While Reading

Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

Main category: cs.CL

TL;DR: StreamingThinker enables LLMs to think while reading input rather than waiting for complete input, reducing latency by 80% for reasoning onset and 60% for final answer generation while maintaining performance comparable to batch processing.

DetailsMotivation: Current LLM reasoning paradigms require complete input before starting reasoning, causing unnecessary latency and weakening attention to earlier information in dynamic scenarios. Inspired by human "thinking while reading" cognition, the authors aim to develop a streaming thinking paradigm.

Method: StreamingThinker integrates streaming CoT generation, streaming-constraint training, and streaming parallel inference. It uses streaming reasoning units with quality control, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches to decouple input encoding from reasoning generation.

Result: Evaluation on Qwen3 model family across math reasoning, logical reasoning, and context-based QA tasks shows StreamingThinker preserves performance comparable to batch thinking while achieving 80% reduction in token waiting before reasoning onset and over 60% reduction in time-level latency for final answer generation.

Conclusion: The streaming thinking paradigm effectively reduces latency in LLM reasoning while maintaining performance, demonstrating the viability of thinking-while-reading approaches for more efficient and responsive AI systems.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code is publicly available at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
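The order-preserving constraint described in the abstract can be pictured as an attention mask in which the reasoning tokens tied to input unit k see only input tokens from units 0..k. A rough sketch under that assumption; the function name and shapes are hypothetical, not the paper's implementation:

```python
import numpy as np

def streaming_reasoning_mask(n_units, in_len, rs_len):
    """Illustrative order-preserving mask: the rs_len reasoning tokens
    attached to input unit k may attend only to the input tokens of
    units 0..k, mirroring 'thinking while reading'."""
    mask = np.zeros((n_units * rs_len, n_units * in_len), dtype=bool)
    for k in range(n_units):
        rows = slice(k * rs_len, (k + 1) * rs_len)  # reasoning for unit k
        cols = slice(0, (k + 1) * in_len)           # input visible so far
        mask[rows, cols] = True
    return mask
```

Because each reasoning unit depends only on a prefix of the input, its KV cache can be built concurrently with the encoding of later input, which is the source of the latency reduction.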

[85] Rule-Based Explanations for Retrieval-Augmented LLM Systems

Joel Rorseth, Parke Godfrey, Lukasz Golab, Divesh Srivastava, Jarek Szlichta

Main category: cs.CL

TL;DR: A method for generating if-then rules to explain retrieval-augmented generation (RAG) in LLMs by linking retrieved sources to model outputs, with optimizations for efficient rule generation.

DetailsMotivation: To provide interpretable explanations for LLMs with RAG by creating rules that link the presence/absence of retrieved information sources to model outputs, addressing the need for explainability in complex LLM systems.

Method: Proposes generating rules by probing LLMs with different source combinations and checking if specific source patterns lead to consistent outputs. Introduces optimizations inspired by Apriori-like pruning from frequent itemset mining to speed up rule generation.

Result: Qualitative and quantitative experiments demonstrate the value and efficiency of the proposed solutions for generating explainable rules for RAG-based LLM systems.

Conclusion: The paper presents the first approach to apply if-then rules for explaining LLMs with RAG, offering interpretable explanations through source-output relationships with efficient generation methods.

Abstract: If-then rules are widely used to explain machine learning models; e.g., “if employed = no, then loan application = rejected.” We present the first proposal to apply rules to explain the emerging class of large language models (LLMs) with retrieval-augmented generation (RAG). Since RAG enables LLM systems to incorporate retrieved information sources at inference time, rules linking the presence or absence of sources can explain output provenance; e.g., “if a Times Higher Education ranking article is retrieved, then the LLM ranks Oxford first.” To generate such rules, a brute force approach would probe the LLM with all source combinations and check if the presence or absence of any sources leads to the same output. We propose optimizations to speed up rule generation, inspired by Apriori-like pruning from frequent itemset mining but redefined within the scope of our novel problem. We conclude with qualitative and quantitative experiments demonstrating our solutions’ value and efficiency.
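The pruned search described in the abstract can be sketched as a level-wise enumeration over source subsets, where supersets of an already-found rule are skipped, in the spirit of Apriori. This is a hypothetical sketch: the `query_llm` callable and the rule-minimality test are assumptions, not the paper's exact algorithm:

```python
from itertools import combinations

def generate_rules(sources, query_llm, max_size=3):
    """Probe an LLM with subsets of retrieved sources and keep minimal
    subsets that determine the output (candidate if-then rules).
    `query_llm(subset)` returns the model's answer when only `subset`
    of the sources is retrieved."""
    rules = []
    for size in range(1, max_size + 1):
        for cand in combinations(sources, size):
            # Apriori-style pruning: a superset of an existing rule is
            # redundant, since the smaller rule already explains the output.
            if any(set(r).issubset(cand) for r, _ in rules):
                continue
            output = query_llm(frozenset(cand))
            # Keep cand as a rule only if it is minimal: removing any
            # single source changes the output.
            if all(query_llm(frozenset(cand) - {s}) != output for s in cand):
                rules.append((cand, output))
    return rules
```

In the worst case this still probes exponentially many subsets, which is why the paper's pruning strategy is the main contribution; the sketch only conveys the shape of the search.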

[86] Steering Awareness: Detecting Activation Steering from Within

Joshua Fonseca Rivera, David Demitri Africa

Main category: cs.CL

TL;DR: Models can detect when their activations are being steered and identify what concepts are being targeted, challenging the assumption that activation steering is an invisible intervention in safety evaluations.

DetailsMotivation: The paper challenges the common assumption in safety evaluations that models cannot detect when activation steering vectors are injected into their residual streams. Researchers wanted to test whether models can become aware of such interventions and what they target.

Method: Fine-tuned seven instruction-tuned models to develop steering awareness on held-out concepts. Tested generalization to unseen steering vector construction methods and analyzed the geometric properties of detection. Investigated the mechanistic basis of steering awareness through distributed transformation analysis.

Result: Models achieved up to 95.5% detection accuracy, 71.2% concept identification, and zero false positives on clean inputs. Detection generalizes when steering vectors have high cosine similarity to training distribution but not otherwise. Surprisingly, detection-trained models are more susceptible to steering than base models on factual and safety benchmarks.

Conclusion: Activation steering should not be considered an invisible intervention in safety evaluations since models can detect and identify such interventions. Steering awareness arises from distributed transformations rather than localized circuits, and detection paradoxically makes models more vulnerable to steering.

Abstract: Activation steering – adding a vector to a model’s residual stream to modify its behavior – is widely used in safety evaluations as if the model cannot detect the intervention. We test this assumption, introducing steering awareness: a model’s ability to infer, during its own forward pass, that a steering vector was injected and what concept it encodes. After fine-tuning, seven instruction-tuned models develop strong steering awareness on held-out concepts; the best reaches 95.5% detection, 71.2% concept identification, and zero false positives on clean inputs. This generalizes to unseen steering vector construction methods when their directions have high cosine similarity to the training distribution but not otherwise, indicating a geometric detector rather than a generic anomaly detector. Surprisingly, detection does not confer resistance; on both factual and safety benchmarks, detection-trained models are consistently more susceptible to steering than their base counterparts. Mechanistically, steering awareness arises not from a localized circuit, but from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. Activation steering should therefore not be considered an invisible intervention in safety evaluations.
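The two mechanisms the paper studies, injecting a steering vector into the residual stream and detecting it geometrically, can be sketched in a few lines. This is an illustrative reconstruction: the function names, the scale `alpha`, and the cosine threshold are assumptions, not values from the paper:

```python
import numpy as np

def inject_steering(resid, direction, alpha=8.0):
    """Activation steering: add a scaled concept vector to the residual stream."""
    return resid + alpha * (direction / np.linalg.norm(direction))

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def detect_steering(resid_clean, resid_now, known_dirs, threshold=0.8):
    """Geometric detector (illustrative): compare the activation delta
    against known concept directions by cosine similarity and report
    the best match above a threshold, else None."""
    delta = resid_now - resid_clean
    if np.linalg.norm(delta) < 1e-6:
        return None  # no intervention: zero false positives on clean inputs
    sims = {name: cosine(delta, d) for name, d in known_dirs.items()}
    best = max(sims, key=sims.get)
    return best if sims[best] >= threshold else None
```

The cosine threshold captures the paper's finding: detection generalizes to unseen steering constructions only when their directions lie close to the training distribution.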

[87] CMV-Fuse: Cross Modal-View Fusion of AMR, Syntax, and Knowledge Representations for Aspect Based Sentiment Analysis

Smitha Muthya Sudheendra, Mani Deep Cherukuri, Jaideep Srivastava

Main category: cs.CL

TL;DR: CMV-Fuse is a cross-modal view fusion framework for Aspect-Based Sentiment Analysis that combines multiple linguistic perspectives (AMR, constituency parsing, dependency syntax, semantic attention) with external knowledge through hierarchical gated attention and contrastive learning.

DetailsMotivation: Current ABSA systems use isolated linguistic views, missing the natural interplay between structural representations that humans leverage. The paper aims to emulate human language processing by systematically combining multiple complementary linguistic perspectives.

Method: Proposes CMV-Fuse framework that orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention with external knowledge. Uses hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, plus structure-aware multi-view contrastive learning.

Result: Extensive experiments show substantial improvements over strong baselines on standard benchmarks. Analysis reveals how each linguistic view contributes to more robust sentiment analysis.

Conclusion: The framework successfully emulates human language processing by fusing multiple linguistic perspectives, leading to more robust sentiment analysis through capturing both fine-grained structural patterns and broad contextual understanding.

Abstract: Natural language understanding inherently depends on integrating multiple complementary perspectives spanning from surface syntax to deep semantics and world knowledge. However, current Aspect-Based Sentiment Analysis (ABSA) systems typically exploit isolated linguistic views, thereby overlooking the intricate interplay between structural representations that humans naturally leverage. We propose CMV-Fuse, a Cross-Modal View fusion framework that emulates human language processing by systematically combining multiple linguistic perspectives. Our approach orchestrates four linguistic perspectives: Abstract Meaning Representations, constituency parsing, dependency syntax, and semantic attention, enhanced with external knowledge integration. Through hierarchical gated attention fusion across local syntactic, intermediate semantic, and global knowledge levels, CMV-Fuse captures both fine-grained structural patterns and broad contextual understanding. A novel structure-aware multi-view contrastive learning mechanism ensures consistency across complementary representations while maintaining computational efficiency. Extensive experiments demonstrate substantial improvements over strong baselines on standard benchmarks, with analysis revealing how each linguistic view contributes to more robust sentiment analysis.
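Gated fusion of two view representations, the basic building block behind hierarchical fusion schemes like the one described, typically computes a sigmoid gate from the concatenated views and interpolates between them. A minimal sketch of that general mechanism, not CMV-Fuse's exact parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(view_a, view_b, W, b):
    """Gated fusion of two d-dimensional view vectors: a learned gate
    (computed from both views) decides, per dimension, how much of
    each view to keep. W has shape (2d, d), b has shape (d,)."""
    gate = sigmoid(np.concatenate([view_a, view_b]) @ W + b)
    return gate * view_a + (1.0 - gate) * view_b
```

Stacking such gates across syntactic, semantic, and knowledge levels yields the hierarchical fusion the abstract describes; the gate values also offer a handle for analyzing each view's contribution.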

[88] ClinicalTrialsHub: Bridging Registries and Literature for Comprehensive Clinical Trial Access

Jiwoo Park, Ruoqi Liu, Avani Jagdale, Andrew Srisuwananukorn, Jing Zhao, Lang Li, Ping Zhang, Sachin Kumar

Main category: cs.CL

TL;DR: ClinicalTrialsHub is an interactive platform that enhances clinical trial data access by combining ClinicalTrials.gov data with automatically extracted structured information from PubMed articles using LLMs like GPT-5.1 and Gemini-3-Pro.

DetailsMotivation: To improve access to structured clinical trial data beyond what's available on ClinicalTrials.gov alone, making evidence-based medicine more accessible to patients, clinicians, researchers, and policymakers.

Method: Uses large language models (GPT-5.1, Gemini-3-Pro) to parse full-text PubMed articles, extract structured trial information, translate user queries into database searches, and provide attributed question-answering with source sentence links.

Result: Increases access to structured clinical trial data by 83.8% compared to ClinicalTrials.gov alone, validated through user studies with clinicians, researchers, and systematic automatic evaluation of extraction and QA capabilities.

Conclusion: ClinicalTrialsHub significantly enhances clinical trial data accessibility and has potential to advance evidence-based medicine through improved information retrieval and structured data extraction.

Abstract: We present ClinicalTrialsHub, an interactive search-focused platform that consolidates all data from ClinicalTrials.gov and augments it by automatically extracting and structuring trial-relevant information from PubMed research articles. Our system effectively increases access to structured clinical trial data by 83.8% compared to relying on ClinicalTrials.gov alone, with potential to make access easier for patients, clinicians, researchers, and policymakers, advancing evidence-based medicine. ClinicalTrialsHub uses large language models such as GPT-5.1 and Gemini-3-Pro to enhance accessibility. The platform automatically parses full-text research articles to extract structured trial information, translates user queries into structured database searches, and provides an attributed question-answering system that generates evidence-grounded answers linked to specific source sentences. We demonstrate its utility through a user study involving clinicians, clinical researchers, and PhD students of pharmaceutical sciences and nursing, and a systematic automatic evaluation of its information extraction and question answering capabilities.

[89] LLM-Augmented Changepoint Detection: A Framework for Ensemble Detection and Automated Explanation

Fabian Lukassen, Christoph Weisser, Michael Schlee, Manish Kumar, Anton Thielmann, Benjamin Saefken, Alexander Silbersdorff, Thomas Kneib

Main category: cs.CL

TL;DR: Ensemble changepoint detection framework combining statistical methods with LLMs for improved accuracy and interpretability of time series regime changes.

DetailsMotivation: Addresses two key limitations: 1) Individual detection methods have complementary strengths/weaknesses making method selection difficult, and 2) Lack of automated contextual explanations for detected changes.

Method: Aggregates results from ten distinct changepoint detection algorithms into an ensemble method, plus an LLM-powered explanation pipeline that generates contextual narratives linking changepoints to real-world events. Includes RAG solution for domain-specific data.

Result: Superior performance and robustness compared to individual methods, with practical utility demonstrated in finance, political science, and environmental science domains.

Conclusion: The framework transforms raw statistical output into actionable insights by combining ensemble detection with LLM-powered explanations, addressing both accuracy and interpretability challenges.

Abstract: This paper introduces a novel changepoint detection framework that combines ensemble statistical methods with Large Language Models (LLMs) to enhance both detection accuracy and the interpretability of regime changes in time series data. Two critical limitations in the field are addressed. First, individual detection methods exhibit complementary strengths and weaknesses depending on data characteristics, making method selection non-trivial and prone to suboptimal results. Second, automated, contextual explanations for detected changes are largely absent. The proposed ensemble method aggregates results from ten distinct changepoint detection algorithms, achieving superior performance and robustness compared to individual methods. Additionally, an LLM-powered explanation pipeline automatically generates contextual narratives, linking detected changepoints to potential real-world historical events. For private or domain-specific data, a Retrieval-Augmented Generation (RAG) solution enables explanations grounded in user-provided documents. The open source Python framework demonstrates practical utility in diverse domains, including finance, political science, and environmental science, transforming raw statistical output into actionable insights for analysts and decision-makers.
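One simple way to aggregate changepoints from several detectors, shown purely as an illustration of the ensemble idea rather than the paper's actual scheme, is majority voting with a tolerance window:

```python
from collections import Counter

def ensemble_changepoints(detections, tolerance=2, min_votes=2):
    """Aggregate changepoint indices from several detectors: indices
    within `tolerance` of each other count as the same event, and an
    event is kept only if at least `min_votes` detectors report it."""
    votes = Counter()
    for cps in detections:  # one list of indices per detector
        for cp in cps:
            # snap to an existing cluster center if close enough
            center = next((c for c in votes if abs(c - cp) <= tolerance), cp)
            votes[center] += 1
    return sorted(c for c, v in votes.items() if v >= min_votes)
```

Each consensus changepoint would then be passed, with surrounding context, to the LLM explanation pipeline to produce a narrative for that regime change.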

[90] Vulnerability of LLMs’ Stated Beliefs? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions

Fan Huang, Haewoon Kwak, Jisun An

Main category: cs.CL

TL;DR: LLMs are vulnerable to persuasion across factual, medical, and social domains, with smaller models being most compliant and confidence prompting actually increasing vulnerability rather than improving robustness.

DetailsMotivation: While LLMs excel at question-answering, they're susceptible to persuasion and can adopt counterfactual beliefs, raising concerns about their reliability and trustworthiness in real-world applications.

Method: Systematic evaluation using the SMCR communication framework across 6 mainstream LLMs and 3 domains (factual knowledge, medical QA, social bias), analyzing persuasion strategies’ effects on belief stability over multiple turns, with experiments on verbalized confidence prompting and adversarial fine-tuning.

Result: Smallest model (Llama 3.2-3B) showed extreme compliance (82.5% belief changes at first turn); confidence prompting increased vulnerability; adversarial fine-tuning effectiveness was highly model-dependent (GPT-4o-mini achieved 98.6% robustness, Mistral 7B improved from 35.7% to 79.3%, but Llama models remained highly susceptible).

Conclusion: Current robustness interventions have substantial model-dependent limits, highlighting the need for more effective approaches to develop trustworthy LLMs resistant to persuasion.

Abstract: Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies showcase that LLMs are susceptible to persuasion and could adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source–Message–Channel–Receiver (SMCR) communication framework. Across six mainstream LLMs and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence stated belief stability over multiple interaction turns. We further examine whether verbalized confidence prompting (i.e., eliciting self-reported confidence scores) affects resistance to persuasion. Results show that the smallest model (Llama 3.2-3B) exhibits extreme compliance, with 82.5% of belief changes occurring at the first persuasive turn (average end turn of 1.1–1.4). Contrary to expectations, verbalized confidence prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, an exploratory study of adversarial fine-tuning reveals highly model-dependent effectiveness: GPT-4o-mini achieves near-complete robustness (98.6%), and Mistral 7B improves substantially (35.7% → 79.3%), but Llama models remain highly susceptible (<14% RQ1) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.

[91] The Flexibility Trap: Why Arbitrary Order Limits Reasoning Potential in Diffusion Language Models

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

Main category: cs.CL

TL;DR: dLLMs’ arbitrary token generation order, intended to enhance reasoning, actually narrows reasoning boundaries by allowing models to bypass challenging tokens, leading to premature solution space collapse.

DetailsMotivation: To challenge the assumption that diffusion LLMs' flexible token generation order inherently improves reasoning capabilities, revealing that current implementations actually limit reasoning potential by enabling avoidance of difficult tokens.

Method: Proposes JustGRPO - intentionally forgoing arbitrary order generation and applying standard Group Relative Policy Optimization while retaining parallel decoding ability of dLLMs.

Result: Achieves 89.1% accuracy on GSM8K, demonstrating that effective reasoning can be better elicited by intentionally restricting generation order flexibility.

Conclusion: The flexibility of arbitrary token generation in dLLMs, in current implementations, narrows rather than expands reasoning boundaries, and better performance can be achieved by intentionally forgoing this flexibility while maintaining parallel decoding.

Abstract: Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential for general tasks like mathematics and coding. Consequently, numerous works have leveraged reinforcement learning (RL) to elicit the reasoning capability of dLLMs. In this paper, we reveal a counter-intuitive reality: arbitrary order generation, in its current form, narrows rather than expands the reasoning boundary of dLLMs. We find that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, leading to a premature collapse of the solution space. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We demonstrate that effective reasoning can be better elicited by intentionally forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
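GRPO, which JustGRPO applies without modification, scores each sampled completion by normalizing its reward against the statistics of its own group, avoiding a learned value function. The core advantage computation can be sketched as:

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group Relative Policy Optimization: for a group of completions
    sampled from the same prompt, the advantage of each is its reward
    normalized by the group's mean and standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With binary correctness rewards, correct completions in a mostly-wrong group receive large positive advantages, which is what pushes the policy toward the high-uncertainty tokens the paper argues arbitrary-order decoding tends to avoid.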

[92] Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

Alex Argese, Pasquale Lisena, Raphaël Troncy

Main category: cs.CL

TL;DR: StoryScore: A composite metric for evaluating AI-generated scientific stories that integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and hallucination detection.

DetailsMotivation: Current evaluation metrics for AI-generated scientific narratives are inadequate - standard summarization metrics don't capture storytelling qualities like abstraction, simplification, and pedagogical creativity, while hallucination detectors often misclassify legitimate narrative reformulations or fail when creativity is involved.

Method: Proposes StoryScore, a composite metric framework that integrates multiple dimensions: semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection.

Result: Analysis reveals that many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting that automatic metrics can assess semantic similarity but struggle to evaluate narrative control and storytelling quality.

Conclusion: StoryScore provides a unified framework for evaluating AI-generated scientific stories, addressing the limitations of existing metrics in capturing storytelling qualities while maintaining factual accuracy.

Abstract: Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity-qualities that are not often well-captured by standard summarization metrics. Meanwhile, factual hallucinations are critical in scientific contexts, yet, detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how it is narrated and controlled.

[93] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao

Main category: cs.CL

TL;DR: ClinConsensus is a comprehensive Chinese medical benchmark with 2500 expert-validated cases covering full care continuum, 36 specialties, and 12 clinical task types, featuring a novel dual-judge evaluation framework with CACS@k scoring.

DetailsMotivation: Existing medical benchmarks are static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows, necessitating a more comprehensive evaluation framework.

Method: Created ClinConsensus benchmark with 2500 expert-curated cases across care continuum, developed rubric-based grading with Clinically Applicable Consistency Score (CACS@k), and implemented dual-judge evaluation combining high-capability LLM-as-judge with distilled local judge model.

Result: Comprehensive assessment revealed substantial heterogeneity across models in reasoning, evidence use, and longitudinal follow-up capabilities, with clinically actionable treatment planning remaining a key bottleneck despite comparable overall scores.

Conclusion: ClinConsensus provides an extensible benchmark for developing robust, clinically grounded medical LLMs ready for real-world deployment, highlighting the need for improved longitudinal reasoning and treatment planning capabilities.

Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care–from prevention and intervention to long-term follow-up–covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.

[94] Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs

Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas

Main category: cs.CL

TL;DR: LLMs show sparser last hidden state representations as input difficulty increases due to OOD shift, revealing an adaptive mechanism for handling unfamiliar/complex inputs.

DetailsMotivation: To understand how LLMs adapt their internal representations when facing increasingly difficult inputs, particularly those involving out-of-distribution (OOD) shifts, and to uncover the underlying mechanisms for handling complex reasoning tasks.

Method: Analyzed sparsity patterns in LLM representations across various difficulty dimensions (harder reasoning questions, longer contexts, more answer choices). Developed Sparsity-Guided Curriculum In-Context Learning (SG-ICL) that uses representation sparsity to schedule few-shot demonstrations.

Result: Found consistent sparsity-difficulty relation: as task difficulty increases, last hidden states become substantially sparser. This adaptive sparsity helps stabilize reasoning under OOD conditions. SG-ICL strategy leads to considerable performance enhancements.

Conclusion: LLMs respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state, providing new mechanistic insights into how LLMs internalize OOD challenges.

Abstract: In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, \textbf{\textit{the farther the shift, the sparser the representations}}. This sparsity–difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design \textit{Sparsity-Guided Curriculum In-Context Learning (SG-ICL)}, a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: https://github.com/MingyuJ666/sparsityLLM.
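The sparsity-difficulty relation above depends on how sparsity of the last hidden state is quantified. One simple proxy, shown here as an illustrative sketch (the paper's exact measure may differ), is the fraction of activations whose magnitude falls below a small threshold.

```python
# Hedged sketch: fraction of near-zero entries in a hidden-state
# vector as a sparsity proxy. Threshold and names are assumptions.
def activation_sparsity(hidden_state, threshold=1e-3):
    """Fraction of near-zero entries in a hidden-state vector."""
    n = len(hidden_state)
    if n == 0:
        return 0.0
    return sum(1 for v in hidden_state if abs(v) < threshold) / n

dense = activation_sparsity([0.5, -1.2, 0.3, 0.9])   # all active -> 0.0
sparse = activation_sparsity([0.0, 0.0, 0.7, 0.0])   # mostly zero -> 0.75
```

Under the paper's finding, this quantity would rise as inputs move farther out of distribution, i.e. computation concentrates into a smaller subspace of the last hidden state.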

[95] A technology-oriented mapping of the language and translation industry: Analysing stakeholder values and their potential implication for translation pedagogy

María Isabel Rivas Ginel, Janiça Hackenbuchner, Alina Secară, Ralph Krüger, Caroline Rossi

Main category: cs.CL

TL;DR: Study examines how value is constructed in automated translation industry, finding efficiency becomes baseline expectation while human value is repositioned through expertise and adaptability.

DetailsMotivation: To understand how value is constructed and negotiated in the increasingly automated language and translation industry, examining the interplay between human and technological values.

Method: Qualitative analysis of interview data from 29 industry stakeholders using Chesterman’s framework of translation ethics and values as analytical lens.

Result: Efficiency-oriented technological values aligned with service ethics have become baseline expectations, while human value emerges through expertise, oversight, and contextual judgment. Adaptability serves as key mediating value between human and technological domains.

Conclusion: Automation reshapes rather than replaces translation value, creating interdependent configuration where technological efficiency enables human communicative work.

Abstract: This paper examines how value is constructed and negotiated in today’s increasingly automated language and translation industry. Drawing on interview data from twenty-nine industry stakeholders collected within the LT-LiDER project, the study analyses how human value, technological value, efficiency, and adaptability are articulated across different professional roles. Using Chesterman’s framework of translation ethics and associated values as an analytical lens, the paper shows that efficiency-oriented technological values aligned with the ethics of service have become baseline expectations in automated production environments, where speed, scalability, and deliverability dominate evaluation criteria. At the same time, human value is not displaced but repositioned, emerging primarily through expertise, oversight, accountability, and contextual judgment embedded within technology-mediated workflows. A central finding is the prominence of adaptability as a mediating value linking human and technological domains. Adaptability is constructed as a core professional requirement, reflecting expectations that translators continuously adjust their skills, roles, and identities in response to evolving tools and organisational demands. The paper argues that automation reshapes rather than replaces translation value, creating an interdependent configuration in which technological efficiency enables human communicative work.

[96] AgentDrift: Unsafe Recommendation Drift Under Tool Corruption Hidden by Ranking Metrics in LLM Agents

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

Main category: cs.CL

TL;DR: Tool-augmented LLM agents show evaluation-blindness: recommendation quality metrics preserve utility under contaminated tool outputs while safety violations occur in 65-93% of turns, revealing a gap between quality measurement and actual safety in high-stakes financial dialogues.

DetailsMotivation: Current evaluation of tool-augmented LLM agents in high-stakes domains focuses on ranking-quality metrics that measure recommendation quality but fail to assess whether recommendations are safe for users, creating a potential safety gap in deployed systems.

Method: Introduced a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier models), decomposing divergence into information-channel and memory-channel mechanisms, and analyzing 23-step trajectories.

Result: Consistent evaluation-blindness pattern: recommendation quality preserved under contamination (utility preservation ratio ~1.0) while risk-inappropriate products appeared in 65-93% of turns. Safety violations were information-channel-driven, emerged at first contaminated turn, persisted without self-correction, and no agent questioned tool-data reliability across 1,563 contaminated turns.

Conclusion: Standard evaluation metrics like NDCG fail to capture safety violations in multi-turn LLM agents. A safety-penalized NDCG variant (sNDCG) reveals the evaluation gap, motivating trajectory-level safety monitoring beyond single-turn quality metrics for deployed agents in high-stakes settings.

Abstract: Tool-augmented LLM agents increasingly serve as multi-turn advisors in high-stakes domains, yet their evaluation relies on ranking-quality metrics that measure what is recommended but not whether it is safe for the user. We introduce a paired-trajectory protocol that replays real financial dialogues under clean and contaminated tool-output conditions across seven LLMs (7B to frontier) and decomposes divergence into information-channel and memory-channel mechanisms. Across the seven models tested, we consistently observe the evaluation-blindness pattern: recommendation quality is largely preserved under contamination (utility preservation ratio approximately 1.0) while risk-inappropriate products appear in 65-93% of turns, a systematic safety failure poorly reflected by standard NDCG. Safety violations are predominantly information-channel-driven, emerge at the first contaminated turn, and persist without self-correction over 23-step trajectories; no agent across 1,563 contaminated turns explicitly questions tool-data reliability. Even narrative-only corruption (biased headlines, no numerical manipulation) induces significant drift while completely evading consistency monitors. A safety-penalized NDCG variant (sNDCG) reduces preservation ratios to 0.51-0.74, indicating that much of the evaluation gap becomes visible once safety is explicitly measured. These results motivate considering trajectory-level safety monitoring, beyond single-turn quality, for deployed multi-turn agents in high-stakes settings.
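The evaluation-blindness argument above turns on the gap between plain NDCG and a safety-penalized variant. The paper's exact sNDCG definition is not reproduced here; the following sketch shows one plausible form, where the gain of any risk-inappropriate item is scaled down before the usual DCG normalization.

```python
# Hedged sketch of a safety-penalized NDCG (sNDCG). The penalty
# scheme (zeroing unsafe gains) is an assumption for illustration,
# not the paper's definition.
import math

def dcg(gains):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def sndcg(relevances, unsafe_flags, penalty=0.0):
    """NDCG with gains of unsafe items multiplied by `penalty`."""
    gains = [(2**r - 1) * (penalty if u else 1.0)
             for r, u in zip(relevances, unsafe_flags)]
    ideal = sorted(((2**r - 1) for r in relevances), reverse=True)
    idcg = dcg(ideal)
    return dcg(gains) / idcg if idcg > 0 else 0.0

# A ranking that looks perfect by NDCG but leads with an unsafe item:
clean = sndcg([3, 2, 1], [False, False, False])
tainted = sndcg([3, 2, 1], [True, False, False])
```

With relevance untouched, `clean` is 1.0 while `tainted` drops sharply, which is exactly the kind of divergence standard NDCG hides.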

[97] LMEB: Long-horizon Memory Embedding Benchmark

Xinping Zhao, Xinshuo Hu, Jiaxin Xu, Danyu Tang, Xin Zhang, Mengjia Zhou, Yan Zhong, Yao Zhou, Zifei Shan, Meishan Zhang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: LMEB is a new benchmark for evaluating text embedding models on long-horizon memory retrieval tasks across 4 memory types, revealing gaps in current models’ abilities despite their performance on traditional passage retrieval.

DetailsMotivation: Current text embedding benchmarks focus narrowly on traditional passage retrieval and fail to assess models' ability to handle complex, long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information.

Method: Introduces LMEB (Long-horizon Memory Embedding Benchmark), a comprehensive framework spanning 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data.

Result: Evaluation of 15 embedding models shows: (1) LMEB provides reasonable difficulty level, (2) larger models don’t always perform better, (3) LMEB and MTEB exhibit orthogonality, suggesting traditional passage retrieval performance doesn’t generalize to long-horizon memory retrieval.

Conclusion: LMEB fills a crucial gap in memory embedding evaluation, revealing that the field hasn’t converged on a universal model for all memory retrieval tasks, and driving advancements in text embedding for long-term, context-dependent memory retrieval.

Abstract: Memory embeddings are crucial for memory-augmented systems, such as OpenClaw, but their evaluation is underexplored in current text embedding benchmarks, which narrowly focus on traditional passage retrieval and fail to assess models’ ability to handle long-horizon memory retrieval tasks involving fragmented, context-dependent, and temporally distant information. To address this, we introduce the Long-horizon Memory Embedding Benchmark (LMEB), a comprehensive framework that evaluates embedding models’ capabilities in handling complex, long-horizon memory retrieval tasks. LMEB spans 22 datasets and 193 zero-shot retrieval tasks across 4 memory types: episodic, dialogue, semantic, and procedural, with both AI-generated and human-annotated data. These memory types differ in terms of level of abstraction and temporal dependency, capturing distinct aspects of memory retrieval that reflect the diverse challenges of the real world. We evaluate 15 widely used embedding models, ranging from hundreds of millions to ten billion parameters. The results reveal that (1) LMEB provides a reasonable level of difficulty; (2) Larger models do not always perform better; (3) LMEB and MTEB exhibit orthogonality. This suggests that the field has yet to converge on a universal model capable of excelling across all memory retrieval tasks, and that performance in traditional passage retrieval may not generalize to long-horizon memory retrieval. In summary, by providing a standardized and reproducible evaluation framework, LMEB fills a crucial gap in memory embedding evaluation, driving further advancements in text embedding for handling long-term, context-dependent memory retrieval. LMEB is available at https://github.com/KaLM-Embedding/LMEB.

[98] Social Simulacra in the Wild: AI Agent Communities on Moltbook

Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

Main category: cs.CL

TL;DR: First large-scale empirical comparison of AI-agent vs human online communities shows AI communities have extreme participation inequality, high cross-community overlap, emotionally flattened content, and more identifiable individual agents.

DetailsMotivation: As autonomous LLM-based agents increasingly populate social platforms, understanding AI-agent community dynamics is essential for communication research and platform governance.

Method: Analyzed 73,899 Moltbook (AI-agent) and 189,838 Reddit posts across five matched communities, comparing structural patterns and linguistic attributes.

Result: AI communities show extreme participation inequality (Gini = 0.84 vs 0.47), high cross-community author overlap (33.8% vs 0.5%), emotionally flattened content, cognitive shift toward assertion over exploration, and social detachment. Individual agents are more identifiable than humans due to outlier stylistic profiles amplified by extreme posting volume.

Conclusion: AI-mediated communication reshapes online discourse with collective dynamics distinct from human communities, providing empirical foundation for understanding multi-agent interaction.

Abstract: As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.
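The participation-inequality figures above (Gini = 0.84 vs. 0.47) are Gini coefficients over per-author posting activity. As an illustrative sketch (the paper's exact computation may differ), the coefficient can be obtained from the rank-weighted sum of sorted post counts.

```python
# Hedged sketch: Gini coefficient of per-author post counts, the
# standard inequality measure. 0.0 = perfectly even participation,
# values near 1.0 = a few authors dominate.
def gini(counts):
    """Gini coefficient of a list of non-negative counts."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form via the rank-weighted sum of sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

equal = gini([10, 10, 10, 10])   # even posting -> 0.0
skewed = gini([1, 1, 1, 97])     # one prolific agent dominates
```

A handful of extreme-volume agents, as reported for Moltbook, pushes this statistic toward the high value observed.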

[99] SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang

Main category: cs.CL

TL;DR: SIA framework for e-commerce search LLMs addresses knowledge hallucination and security vulnerabilities through knowledge synthesis, parameter-efficient pre-training, and dual-path alignment, deployed at JD.com with significant business improvements.

DetailsMotivation: LLMs have transformative potential for e-commerce search but face two critical challenges in industrial deployment: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance.

Method: Proposes SIA (Synthesize-Inject-Align) framework: (1) Synthesizes high-quality natural language corpus combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data; (2) Parameter-efficient pre-training using Depth Up-Scaling to inject domain knowledge while preserving general capabilities; (3) Dual-path alignment via multi-task instruction tuning and adversarial training for task performance and safety robustness.

Result: Deployed at JD.com, China’s largest self-operated e-commerce platform, with A/B tests across five core search scenarios demonstrating significant improvements in key business metrics, validating industrial effectiveness and scalability.

Conclusion: The SIA framework effectively addresses knowledge hallucination and security vulnerabilities in e-commerce search LLMs, enabling practical industrial deployment with validated business impact.

Abstract: Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA–a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware data. We then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at JD.com, China’s largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.

[100] Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Omnilingual SONAR Team, João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne

Main category: cs.CL

TL;DR: OmniSONAR is an omnilingual, cross-modal sentence embedding model that embeds text, speech, code, and math in a single semantic space across thousands of languages, achieving state-of-the-art performance on cross-lingual and cross-modal tasks.

DetailsMotivation: Existing cross-lingual sentence encoders cover only a few hundred languages and often sacrifice downstream quality for alignment, limiting their adoption. There's a need for models that can handle thousands of languages while maintaining strong performance across modalities.

Method: Progressive training approach: 1) Learn foundational space for 200 languages using LLM-initialized encoder-decoder with token-level decoding, split-softmax contrastive loss, and synthetic hard negatives; 2) Expand to thousands of languages via two-stage teacher-student encoder distillation; 3) Extend to cross-modal by mapping 177 spoken languages into the space.

Result: Halves cross-lingual similarity search error on FLORES (200 languages), reduces error 15x on BIBLE (1,560 languages). Outperforms NLLB-3B on multilingual translation, exceeds prior models by 15 chrF++ points on 1,560 languages. Achieves 43% lower speech similarity-search error, reaches 97% of SeamlessM4T speech-to-text quality. Enables high-performance transfer to thousands of languages via Spectrum encoder-decoder LM.

Conclusion: OmniSONAR demonstrates scalable omnilingual and cross-modal sentence embedding that maintains strong downstream performance across thousands of languages and multiple modalities, enabling effective transfer learning for complex tasks.

Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousands language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on 1,560 languages into English BIBLE translation. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.

[101] Probing Cultural Signals in Large Language Models through Author Profiling

Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes

Main category: cs.CL

TL;DR: LLMs show cultural biases in zero-shot author profiling from song lyrics, with systematic alignment toward North American ethnicity in most models, while DeepSeek-1.5B aligns more with Asian ethnicity.

DetailsMotivation: As LLMs are increasingly deployed in applications with societal impact, there are concerns about the cultural biases they encode. The authors want to probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting.

Method: The researchers evaluate several open-source LLMs on more than 10,000 song lyrics in a zero-shot setting, inferring singers’ gender and ethnicity without task-specific fine-tuning. They analyze both prediction distributions and generated rationales, and introduce two fairness metrics: Modality Accuracy Divergence (MAD) and Recall Divergence (RD).

Result: LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. Ministral-8B displays the strongest ethnicity bias, whereas Gemma-12B shows the most balanced behavior.

Conclusion: LLMs encode cultural biases that manifest in author profiling tasks, with systematic alignment patterns across different models. The introduced fairness metrics help quantify these disparities, revealing significant differences in cultural alignment among various LLMs.

Abstract: Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers’ gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models’ prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub and results on HuggingFace.

[102] Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Ahmed Sharshar, Hosam Elgendy, Saad El Dine Ahmed, Yasser Rohaim, Yuxia Wang

Main category: cs.CL

TL;DR: A multimodal, multilingual benchmark for detecting harmful/offensive humor across text, images, and videos in English and Arabic, with classification into Safe, Explicit, and Implicit categories to test contextual reasoning.

DetailsMotivation: Current static benchmarks fail to capture the subtle cultural nuances and implicit cues in dark humor that require contextual reasoning, posing safety challenges for AI systems. There's a need for culturally grounded, reasoning-aware safety evaluation.

Method: Created a manually curated multimodal dataset with 3,000 texts and 6,000 images in English/Arabic, plus 1,200 videos spanning English, Arabic, and universal contexts. Implemented strict annotation guidelines distinguishing Safe jokes from Harmful ones (Explicit vs. Implicit categories). Systematically evaluated SOTA open and closed-source models across all modalities.

Result: Closed-source models significantly outperform open-source ones. Notable performance differences between English and Arabic languages in both model types, highlighting the need for culturally grounded safety alignment.

Conclusion: The benchmark reveals critical gaps in current AI systems’ ability to understand culturally nuanced, implicit harmful content, emphasizing the need for reasoning-aware safety alignment that accounts for cultural context.

Abstract: Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.

cs.CV

[103] RARE disease detection from Capsule Endoscopic Videos based on Vision Transformers

X. Gao, C. Chien, G. Liu, A. Manullang

Main category: cs.CV

TL;DR: Fine-tuning Vision Transformers for multi-label classification of 17 gastrointestinal conditions from capsule endoscopic videos, achieving low mAP scores on test data.

DetailsMotivation: To develop a deep learning system for automated multi-label classification of gastrointestinal conditions from capsule endoscopic videos, which could assist in medical diagnosis and analysis.

Method: Fine-tuning Google Vision Transformer (ViT) models with batch size 16 and 224x224 resolution for multi-label classification of 17 gastrointestinal conditions from capsule endoscopic videos.

Result: Low performance on the test dataset, with an overall mAP@0.5 of 0.0205 and mAP@0.95 of 0.0196, indicating that the task is challenging and the current approach needs improvement.

Conclusion: While Vision Transformers show potential for medical video analysis, the current fine-tuning approach yields poor performance on this challenging multi-label classification task, suggesting the need for better architectures or training strategies.

Abstract: This work corresponds to the Gastro Competition for multi-label classification from capsule endoscopic videos (CEV). A Transformer-based deep learning network is fine-tuned for this task. The base online model is the Google Vision Transformer (ViT) with batch size 16 and 224 x 224 resolution. In total, 17 labels are classified: mouth, esophagus, stomach, small intestine, colon, z-line, pylorus, ileocecal valve, active bleeding, angiectasia, blood, erosion, erythema, hematin, lymphangioectasis, polyp, and ulcer. On a test dataset of three videos, the overall mAP@0.5 is 0.0205 and the overall mAP@0.95 is 0.0196.
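The report gives no code, but the setup it describes is standard multi-label classification. As a hedged sketch (logit values invented for illustration), each of the 17 conditions gets an independent sigmoid score rather than a softmax, so a frame can carry both an anatomical label and a finding at once:

```python
import numpy as np

# the 17 labels listed in the abstract
LABELS = ["mouth", "esophagus", "stomach", "small intestine", "colon",
          "z-line", "pylorus", "ileocecal valve", "active bleeding",
          "angiectasia", "blood", "erosion", "erythema", "hematin",
          "lymphangioectasis", "polyp", "ulcer"]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_predict(logits, threshold=0.5):
    """Independent sigmoid per class: unlike softmax classification, several
    labels may be active simultaneously for one frame."""
    probs = sigmoid(logits)
    return probs, probs >= threshold

def bce_loss(logits, targets):
    """Binary cross-entropy averaged over the 17 heads -- the usual training
    objective for a multi-label ViT classification head."""
    p = sigmoid(logits)
    eps = 1e-9
    return float(-np.mean(targets * np.log(p + eps)
                          + (1 - targets) * np.log(1 - p + eps)))

# one hypothetical frame: strong evidence for "stomach" and "erosion"
logits = np.full(17, -3.0)
logits[LABELS.index("stomach")] = 4.0
logits[LABELS.index("erosion")] = 2.5
probs, active = multilabel_predict(logits)
print([LABELS[i] for i in np.flatnonzero(active)])  # → ['stomach', 'erosion']
```

This is a sketch of the task formulation only; the very low mAP the report obtains suggests the competition's evaluation is considerably harder than per-frame thresholding.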

[104] S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

Naichuan Zheng, Hailun Xia, Zepeng Sun, Weiyi Li, Yujia Wang

Main category: cs.CV

TL;DR: S3T-Former: A purely spike-driven Transformer for energy-efficient skeleton action recognition using spiking neural networks with sparse event streams and long-range temporal modeling.

DetailsMotivation: Current skeleton-based action recognition relies on power-hungry ANNs, limiting deployment on edge devices. While SNNs offer energy efficiency, existing spiking models compromise sparsity through dense operations and suffer from short-term memory issues.

Method: Proposes S3T-Former with: 1) Multi-Stream Anatomical Spiking Embedding (M-ASE) to transform multimodal skeleton features into sparse event streams, 2) Lateral Spiking Topology Routing (LSTR) for conditional spike propagation, and 3) Spiking State-Space (S3) Engine for long-range temporal dynamics without non-sparse spectral methods.

Result: Achieves competitive accuracy on multiple large-scale datasets while theoretically reducing energy consumption compared to classic ANNs, establishing new SOTA for energy-efficient neuromorphic action recognition.

Conclusion: S3T-Former demonstrates that purely spike-driven Transformers can achieve efficient skeleton action recognition by maintaining sparsity and addressing temporal modeling challenges, enabling deployment on resource-constrained edge devices.

Abstract: Skeleton-based action recognition is crucial for multimedia applications but heavily relies on power-hungry Artificial Neural Networks (ANNs), limiting their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) provide an energy-efficient alternative; however, existing spiking models for skeleton data often compromise the intrinsic sparsity of SNNs by resorting to dense matrix aggregations, heavy multimodal fusion modules, or non-sparse frequency domain transformations. Furthermore, they severely suffer from the short-term amnesia of spiking neurons. In this paper, we propose the Spiking State-Space Topology Transformer (S3T-Former), which, to the best of our knowledge, is the first purely spike-driven Transformer architecture specifically designed for energy-efficient skeleton action recognition. Rather than relying on heavy fusion overhead, we formulate a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a generalized kinematic differential operator, elegantly transforming multimodal skeleton features into heterogeneous, highly sparse event streams. To achieve true topological and temporal sparsity, we introduce Lateral Spiking Topology Routing (LSTR) for on-demand conditional spike propagation, and a Spiking State-Space (S3) Engine to systematically capture long-range temporal dynamics without non-sparse spectral workarounds. Extensive experiments on multiple large-scale datasets demonstrate that S3T-Former achieves highly competitive accuracy while theoretically reducing energy consumption compared to classic ANNs, establishing a new state-of-the-art for energy-efficient neuromorphic action recognition.

[105] AU Codes, Language, and Synthesis: Translating Anatomy to Text for Facial Behavior Synthesis

Jiahe Wang, Cong Liang, Xuandong Huang, Yuxin Wang, Xin Yun, Yi Wu, Yanan Chang, Shangfei Wang

Main category: cs.CV

TL;DR: A novel approach for facial behavior synthesis using natural language descriptions of Action Units (AUs) instead of one-hot vectors, enabling better handling of conflicting AUs and leveraging text-to-image models for high-fidelity facial expression generation.

DetailsMotivation: Current facial behavior synthesis methods rely on coarse emotion categories or linear combinations of one-hot encoded AUs, which fail to capture nuanced nonverbal communication and produce anatomically implausible artifacts when handling conflicting AUs (those activating same muscles with opposing actions).

Method: Proposes representing facial behavior through natural language descriptions of AUs, creating BP4D-AUText dataset via rule-based Dynamic AU Text Processor, and developing VQ-AUFace generative model that leverages facial structural priors for text-to-face synthesis.

Result: Extensive experiments and user studies show the approach significantly outperforms existing methods, generating anatomically plausible, behaviorally rich, and perceptually convincing facial expressions, especially with conflicting AUs.

Conclusion: Natural language AU descriptions provide a more expressive framework for facial behavior synthesis, enabling explicit modeling of complex/conflicting AUs while unlocking modern text-to-image models for high-fidelity facial expression generation.

Abstract: Facial behavior synthesis remains a critical yet underexplored challenge. While text-to-face models have made progress, they often rely on coarse emotion categories, which lack the nuance needed to capture the full spectrum of human nonverbal communication. Action Units (AUs) provide a more precise and anatomically grounded alternative. However, current AU-based approaches typically encode AUs as one-hot vectors, modeling compound expressions as simple linear combinations of individual AUs. This linearity becomes problematic when handling conflicting AUs–defined as those which activate the same facial muscle with opposing actions. Such cases lead to anatomically implausible artifacts and unnatural motion superpositions. To address this, we propose a novel method that represents facial behavior through natural language descriptions of AUs. This approach preserves the expressiveness of the AU framework while enabling explicit modeling of complex and conflicting AUs. It also unlocks the potential of modern text-to-image models for high-fidelity facial synthesis. Supporting this direction, we introduce BP4D-AUText, the first large-scale text-image paired dataset for complex facial behavior. It is synthesized by applying a rule-based Dynamic AU Text Processor to the BP4D and BP4D+ datasets. We further propose VQ-AUFace, a generative model that leverages facial structural priors to synthesize realistic and diverse facial behaviors from text. Extensive quantitative experiments and user studies demonstrate that our approach significantly outperforms existing methods. It excels in generating facial expressions that are anatomically plausible, behaviorally rich, and perceptually convincing, particularly under challenging conditions involving conflicting AUs.

[106] DarkDriving: A Real-World Day and Night Aligned Dataset for Autonomous Driving in the Dark Environment

Wuqi Wang, Haochen Yang, Baolu Li, Jiaqi Sun, Xiangmo Zhao, Zhigang Xu, Qing Guo, Haigen Min, Tianyun Zhang, Hongkai Yu

Main category: cs.CV

TL;DR: DarkDriving dataset: first real-world day-night aligned benchmark for low-light enhancement in autonomous driving with 9,538 precisely aligned image pairs and object annotations.

DetailsMotivation: Existing low-light datasets lack precise day-night alignment in dynamic driving scenes, limiting research on low-light enhancement for autonomous driving perception systems.

Method: Developed automatic Trajectory Tracking based Pose Matching (TTPM) method to collect precisely aligned day-night image pairs in a 69-acre closed driving test field.

Result: Created DarkDriving dataset with 9,538 day-night image pairs aligned within several centimeters, plus manual 2D bounding box annotations. Introduces four perception tasks for evaluation.

Conclusion: DarkDriving provides comprehensive benchmark for low-light enhancement in autonomous driving and can generalize to other low-light driving environments like nuScenes.

Abstract: Low-light conditions are challenging for vision-centric perception systems in autonomous driving. In this paper, we propose a new benchmark dataset (named DarkDriving) to investigate low-light enhancement for autonomous driving. Existing real-world low-light enhancement benchmark datasets can only be collected by controlling exposures in small-range, static scenes, and the dark images of current nighttime driving datasets lack precisely aligned daytime counterparts. The extreme difficulty of collecting a real-world day-night aligned dataset in dynamic driving scenes has significantly limited research in this area. Using a proposed automatic day-night Trajectory Tracking based Pose Matching (TTPM) method in a large real-world closed driving test field (area: 69 acres), we collected the first real-world day-night aligned dataset for autonomous driving in the dark environment. The DarkDriving dataset contains 9,538 day-night image pairs precisely aligned in location and spatial content, with an alignment error of just several centimeters. For each pair, we also manually label 2D object bounding boxes. DarkDriving introduces four perception-related tasks: low-light enhancement, generalized low-light enhancement, and low-light enhancement for 2D detection and 3D detection in the dark environment. Experimental results show that our DarkDriving dataset provides a comprehensive benchmark for evaluating low-light enhancement for autonomous driving, and that it also generalizes to enhancing dark images and promoting detection in other low-light driving datasets, such as nuScenes.

[107] SSP-SAM: SAM with Semantic-Spatial Prompt for Referring Expression Segmentation

Wei Tang, Xuejing Liu, Yanpeng Sun, Zechao Li

Main category: cs.CV

TL;DR: SSP-SAM enhances SAM’s segmentation with language understanding for Referring Expression Segmentation by integrating a Semantic-Spatial Prompt encoder with visual and linguistic attention adapters.

DetailsMotivation: SAM excels at general image segmentation but lacks natural language understanding, limiting its application in Referring Expression Segmentation (RES) where language-guided segmentation is needed.

Method: Proposes SSP-SAM framework with Semantic-Spatial Prompt encoder incorporating visual and linguistic attention adapters to highlight salient objects in visual features and discriminative phrases in linguistic features, generating high-quality prompts for SAM.

Result: Achieves superior performance on RES and Generalized RES benchmarks, produces high-quality masks with strong precision at strict thresholds (Pr@0.9), and shows improved open-vocabulary performance on PhraseCut dataset.

Conclusion: SSP-SAM effectively bridges SAM’s segmentation capabilities with language understanding for RES tasks, naturally supporting Generalized RES without modifications and demonstrating strong performance across benchmarks.

Abstract: The Segment Anything Model (SAM) excels at general image segmentation but has limited ability to understand natural language, which restricts its direct application in Referring Expression Segmentation (RES). Toward this end, we propose SSP-SAM, a framework that fully utilizes SAM’s segmentation capabilities by integrating a Semantic-Spatial Prompt (SSP) encoder. Specifically, we incorporate both visual and linguistic attention adapters into the SSP encoder, which highlight salient objects within the visual features and discriminative phrases within the linguistic features. This design enhances the referent representation for the prompt generator, resulting in high-quality SSPs that enable SAM to generate precise masks guided by language. Although not specifically designed for Generalized RES (GRES), where the referent may correspond to zero, one, or multiple objects, SSP-SAM naturally supports this more flexible setting without additional modifications. Extensive experiments on widely used RES and GRES benchmarks confirm the superiority of our method. Notably, our approach generates segmentation masks of high quality, achieving strong precision even at strict thresholds such as Pr@0.9. Further evaluation on the PhraseCut dataset demonstrates improved performance in open-vocabulary scenarios compared to existing state-of-the-art RES methods. The code and checkpoints are available at: https://github.com/WayneTomas/SSP-SAM.

[108] CytoSyn: a Foundation Diffusion Model for Histopathology – Tech Report

Thomas Duboudin, Xavier Fontaine, Etienne Andrier, Lionel Guillou, Alexandre Filiot, Thalyssa Baiocco-Rodrigues, Antoine Olivier, Alberto Romagnoni, John Klein, Jean-Baptiste Schiratti

Main category: cs.CV

TL;DR: CytoSyn is a foundation latent diffusion model for generating realistic histopathology H&E-stained images, trained on 10,000+ TCGA whole-slide images across 32 cancer types.

DetailsMotivation: While self-supervised foundation feature extractors exist for computational pathology, generative foundation models for histopathology remain scarce. Such models could enable tasks beyond feature extraction, like virtual staining.

Method: Developed CytoSyn, a state-of-the-art foundation latent diffusion model with methodological improvements, training set scaling, and sampling strategies. Explored slide-level overfitting and preprocessing sensitivity.

Result: CytoSyn generates highly realistic and diverse histopathology H&E-stained images, maintaining state-of-the-art performance even on inflammatory bowel disease images despite being trained only on oncology slides.

Conclusion: CytoSyn addresses the gap in generative foundation models for histopathology and demonstrates strong performance across cancer types, with publicly released weights and datasets to support research.

Abstract: Computational pathology has made significant progress in recent years, fueling advances in both fundamental disease understanding and clinically ready tools. This evolution is driven by the availability of large amounts of digitized slides and specialized deep learning methods and models. Multiple self-supervised foundation feature extractors have been developed, enabling downstream predictive applications from cell segmentation to tumor sub-typing and survival analysis. In contrast, generative foundation models designed specifically for histopathology remain scarce. Such models could address tasks that are beyond the capabilities of feature extractors, such as virtual staining. In this paper, we introduce CytoSyn, a state-of-the-art foundation latent diffusion model that enables the guided generation of highly realistic and diverse histopathology H&E-stained images, as shown in an extensive benchmark. We explored methodological improvements, training set scaling, sampling strategies and slide-level overfitting, culminating in the improved CytoSyn-v2, and compared our work to PixCell, a state-of-the-art model, in an in-depth manner. This comparison highlighted the strong sensitivity of both diffusion models and performance metrics to preprocessing-specific details such as JPEG compression. Our model has been trained on a dataset obtained from more than 10,000 TCGA diagnostic whole-slide images of 32 different cancer types. Despite being trained only on oncology slides, it maintains state-of-the-art performance generating inflammatory bowel disease images. To support the research community, we publicly release CytoSyn’s weights, its training and validation datasets, and a sample of synthetic images in this repository: https://huggingface.co/Owkin-Bioptimus/CytoSyn.

[109] Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model

Chen Zhao, Zhuoran Wang, Haoyang Li, Shifeng Bao, Guanlin Li, Youhe Feng, Yang Li, Jie Tang, Jing Zhang

Main category: cs.CV

TL;DR: ADV combines diffusion action experts with VLM verification to improve embodied task performance by drafting multiple action chunks and selecting the best one through single-pass VLM scoring.

DetailsMotivation: Current Vision-Language-Action models use diffusion for efficient continuous action generation but auto-regressive approaches offer better robustness and generalization. The paper aims to leverage both paradigms for improved performance.

Method: Action-Draft-and-Verify (ADV): diffusion action expert drafts multiple candidate action chunks, then VLM selects the best one using a perplexity-style metric in a single forward pass.

Result: ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world experiments over a diffusion-based baseline, with minimal VLM reranking overhead.

Conclusion: Combining diffusion action drafting with VLM verification provides significant performance improvements in embodied tasks while maintaining efficiency.

Abstract: Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world over diffusion-based baseline, with a single-pass VLM reranking overhead.
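The selection rule lends itself to a small sketch. Assuming (hypothetically) that the verifier exposes per-token log-probabilities for each drafted chunk, a perplexity-style score picks the candidate the VLM finds most plausible in a single pass over the drafts; `draft_fn` and `score_fn` below are stand-ins for the diffusion expert and the VLM scorer, and the log-prob values are made up:

```python
import math

def perplexity_score(logprobs):
    """Perplexity-style metric: exp(-mean token log-prob).
    Lower means the verifier finds the candidate more plausible."""
    return math.exp(-sum(logprobs) / len(logprobs))

def draft_and_verify(draft_fn, score_fn, k=4):
    """Draft k candidate action chunks, score each, return the best.
    `draft_fn` plays the diffusion action expert; `score_fn` plays the VLM."""
    candidates = [draft_fn(i) for i in range(k)]
    scored = [(score_fn(c), c) for c in candidates]
    return min(scored)[1]  # lowest perplexity wins

# toy verifier: hypothetical per-token log-probs for 4 drafted chunks
chunks = {0: [-0.9, -1.2], 1: [-0.2, -0.3], 2: [-1.5, -0.8], 3: [-0.6, -0.7]}
best = draft_and_verify(lambda i: i, lambda c: perplexity_score(chunks[c]), k=4)
print(best)  # → 1 (highest mean log-prob, hence lowest perplexity)
```

In the actual system the scoring is done for all candidates in one forward pass; the sequential loop here is only for clarity.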

[110] CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation

Zhenyu Lu, Liupeng Li, Jinpeng Wang, Yan Feng, Bin Chen, Ke Chen, Yaowei Wang

Main category: cs.CV

TL;DR: CoPRS introduces a multimodal Chain-of-Thought approach that bridges language reasoning to segmentation through interpretable heatmap priors, improving both performance and interpretability.

DetailsMotivation: Existing reasoning segmentation methods either connect language model features directly to mask decoders or use text-based position representations, which lack interpretability and semantic detail. There's a need for more transparent reasoning-to-segmentation pipelines.

Method: CoPRS uses a Multi-modal Chain-of-Thought (MCoT) approach to generate differentiable, interpretable heatmap priors from language reasoning. A learnable concentration token aggregates image and reasoning text features to create positional priors, which are then decoded to precise masks via a lightweight decoder.

Result: CoPRS matches or surpasses state-of-the-art metrics on RefCOCO series and ReasonSeg datasets across validation and test partitions. Experiments show strong correlation between CoT trajectory, generated heatmaps, and decoded masks, demonstrating interpretable alignment.

Conclusion: The paradigm effectively bridges reasoning and segmentation with advantages in concentration driven by reasoning and more precise mask prediction. The approach enhances interpretability and diagnostic analysis while maintaining strong performance.

Abstract: Existing works on reasoning segmentation either connect hidden features from a language model directly to a mask decoder or represent positions in text, which limits interpretability and semantic detail. To solve this, we present CoPRS, a Multi-modal Chain-of-Thought (MCoT)-based positional perception model that bridges language reasoning to segmentation through a differentiable and interpretable positional prior instantiated as a heatmap. By making the reasoning process clear via MCoT and expressing it as a dense, differentiable heatmap, this interface enhances interpretability and diagnostic analysis and yields more concentrated evidence on the target. A learnable concentration token aggregates features of the image and reasoning text to generate this positional prior, which is decoded to precise masks through a lightweight decoder, providing a direct connection between reasoning and segmentation. Across the RefCOCO series and ReasonSeg, CoPRS matches or surpasses the best reported metrics on each standard split under comparable protocols, with performance at or above the prior state of the art across both validation and test partitions. Extensive experiments demonstrate a strong positive correlation among the CoT trajectory, the generated heatmap, and the decoded mask, supporting an interpretable alignment between the reasoning output and downstream mask generation. Collectively, these findings support the utility of this paradigm in bridging reasoning and segmentation and show advantages in concentration driven by reasoning and in more precise mask prediction. Code has been released at https://github.com/ZhenyuLU-Heliodore/CoPRS.

[111] One-to-More: High-Fidelity Training-Free Anomaly Generation with Attention Control

Haoxiang Rao, Zhao Wang, Chenyang Si, Yan Lyu, Yuanyi Duan, Fang Zhao, Caifeng Shan

Main category: cs.CV

TL;DR: Training-free few-shot anomaly generation method (O2MAG) that uses self-attention from one reference anomalous image to synthesize realistic anomalies for industrial anomaly detection tasks.

DetailsMotivation: Industrial anomaly detection suffers from scarcity of anomalous data. Existing few-shot anomaly synthesis methods require training and struggle to generate faithful anomaly distributions, limiting AD model effectiveness.

Method: O2MAG manipulates three parallel diffusion processes via self-attention grafting, uses anomaly masks to prevent foreground-background confusion, employs Anomaly-Guided Optimization to align text prompts with true anomaly semantics, and applies Dual-Attention Enhancement to reinforce attention on masked regions.

Result: Extensive experiments show O2MAG outperforms prior state-of-the-art methods on downstream anomaly detection tasks, generating more realistic anomalies that improve AD performance.

Conclusion: O2MAG provides an effective training-free approach for few-shot anomaly generation that produces realistic anomalies, addressing data scarcity in industrial anomaly detection without requiring extensive training.

Abstract: Industrial anomaly detection (AD) is characterized by an abundance of normal images but a scarcity of anomalous ones. Although numerous few-shot anomaly synthesis methods have been proposed to augment anomalous data for downstream AD tasks, most existing approaches require time-consuming training and struggle to learn distributions that are faithful to real anomalies, thereby restricting the efficacy of AD models trained on such data. To address these limitations, we propose a training-free few-shot anomaly generation method, namely O2MAG, which leverages the self-attention in One reference anomalous image to synthesize More realistic anomalies, supporting effective downstream anomaly detection. Specifically, O2MAG manipulates three parallel diffusion processes via self-attention grafting and incorporates the anomaly mask to mitigate foreground-background query confusion, synthesizing text-guided anomalies that closely adhere to real anomalous distributions. To bridge the semantic gap between the encoded anomaly text prompts and the true anomaly semantics, Anomaly-Guided Optimization is further introduced to align the synthesis process with the target anomalous distribution, steering the generation toward realistic and text-consistent anomalies. Moreover, to mitigate faint anomaly synthesis inside anomaly masks, Dual-Attention Enhancement is adopted during generation to reinforce both self- and cross-attention on masked regions. Extensive experiments validate the effectiveness of O2MAG, demonstrating its superior performance over prior state-of-the-art methods on downstream AD tasks.

[112] Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence

Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, Zhuochen Wang

Main category: cs.CV

TL;DR: Open-o3-Video is a video reasoning framework that integrates explicit spatio-temporal evidence by highlighting key timestamps, objects, and bounding boxes to make video reasoning traceable and verifiable.

DetailsMotivation: Current video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. While evidence-centered reasoning exists for images (e.g., OpenAI-o3), extending this to videos is challenging due to the need for joint temporal tracking and spatial localization across dynamic scenes.

Method: 1) Construct STGR datasets with unified spatio-temporal supervision; 2) Use a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision; 3) Non-agent framework that integrates explicit spatio-temporal evidence into video reasoning.

Result: Achieves state-of-the-art on V-STAR benchmark: improves mAM by 14.4% and mLGM by 24.2% over Qwen2.5-VL baseline. Shows consistent gains across video understanding benchmarks. Produces grounded reasoning traces that support confidence-aware test-time scaling, improving answer reliability.

Conclusion: Open-o3-Video successfully extends evidence-centered reasoning to videos, providing traceable and verifiable reasoning with explicit spatio-temporal evidence, achieving strong performance improvements while enabling confidence-aware scaling.

Abstract: Most video reasoning models only generate textual reasoning traces without indicating when and where key evidence appears. Recent models such as OpenAI-o3 have sparked wide interest in evidence-centered reasoning for images, yet extending this ability to videos is more challenging due to the need for joint temporal tracking and spatial localization across dynamic scenes. We introduce Open-o3-Video, a non-agent framework that integrates explicit spatio-temporal evidence into video reasoning by highlighting key timestamps, objects, and bounding boxes, making the reasoning process traceable and verifiable. To enable this capability, we first construct high-quality datasets STGR that provide unified spatio-temporal supervision, which is absent in existing resources. We further adopt a cold-start reinforcement learning strategy with specially designed rewards that jointly encourage answer accuracy, temporal alignment, and spatial precision. On the V-STAR benchmark, Open-o3-Video achieves state-of-the-art performance, improving mAM by 14.4% and mLGM by 24.2% over the Qwen2.5-VL baseline, and shows consistent gains across a range of video understanding benchmarks. Beyond accuracy, the grounded reasoning traces produced by Open-o3-Video support confidence-aware test-time scaling, improving answer reliability.
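The paper does not publish its reward function; a minimal sketch of a composite reward in the spirit it describes (answer accuracy plus temporal alignment plus spatial precision) might combine exact-match correctness with temporal and box IoU against annotated evidence. The weights `w_ans`, `w_time`, `w_box` and all values below are hypothetical:

```python
def interval_iou(a, b):
    """Temporal IoU between two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """Spatial IoU between two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def grounded_reward(pred, gold, w_ans=1.0, w_time=0.5, w_box=0.5):
    """Hypothetical composite RL reward: correct answer, plus how well the
    cited timestamp interval and bounding box overlap the annotation."""
    r = w_ans * float(pred["answer"] == gold["answer"])
    r += w_time * interval_iou(pred["interval"], gold["interval"])
    r += w_box * box_iou(pred["box"], gold["box"])
    return r

pred = {"answer": "dog", "interval": (2.0, 6.0), "box": (10, 10, 50, 50)}
gold = {"answer": "dog", "interval": (3.0, 7.0), "box": (10, 10, 50, 50)}
reward = grounded_reward(pred, gold)  # correct answer, partial temporal overlap, exact box
```

Tying the reward to evidence overlap, not just the final answer, is what makes the resulting reasoning traces verifiable rather than free-form.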

[113] Q-Drift: Quantization-Aware Drift Correction for Diffusion Model Sampling

Sooyoung Ryu, Mathieu Salzmann, Saqib Javed

Main category: cs.CV

TL;DR: Q-Drift: A sampler-side correction method for post-training quantization of diffusion models that treats quantization error as stochastic perturbation and derives drift adjustments to preserve marginal distributions.

DetailsMotivation: Post-training quantization is practical for deploying large diffusion models, but quantization noise accumulates over the denoising trajectory and degrades generation quality. Existing methods need better handling of this noise propagation.

Method: Treats quantization error as implicit stochastic perturbation on each denoising step, derives marginal-distribution-preserving drift adjustment. Estimates timestep-wise variance statistic from calibration (as few as 5 paired full-precision/quantized runs). Plug-and-play with common samplers, diffusion models, and PTQ methods.

Result: Across six diverse text-to-image models (DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over quantized baselines in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.

Conclusion: Q-Drift provides an effective, low-overhead solution for improving generation quality in quantized diffusion models by addressing quantization noise accumulation through principled sampler-side corrections.

Abstract: Post-training quantization (PTQ) is a practical path to deploy large diffusion models, but quantization noise can accumulate over the denoising trajectory and degrade generation quality. We propose Q-Drift, a principled sampler-side correction that treats quantization error as an implicit stochastic perturbation on each denoising step and derives a marginal-distribution-preserving drift adjustment. Q-Drift estimates a timestep-wise variance statistic from calibration, in practice requiring as few as 5 paired full-precision/quantized calibration runs. The resulting sampler correction is plug-and-play with common samplers, diffusion models, and PTQ methods, while incurring negligible overhead at inference. Across six diverse text-to-image models (spanning DiT and U-Net), three samplers (Euler, flow-matching, DPM-Solver++), and two PTQ methods (SVDQuant, MixDQ), Q-Drift improves FID over the corresponding quantized baseline in most settings, with up to 4.59 FID reduction on PixArt-Sigma (SVDQuant W3A4), while preserving CLIP scores.
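As an illustration of the calibration step, the per-timestep variance of the quantization perturbation can be estimated from paired full-precision/quantized trajectories (the paper reports that as few as 5 pairs suffice). The `corrected_noise_scale` heuristic below only guesses at the flavor of the variance-preserving adjustment and is not the paper's derivation; scalars stand in for latents:

```python
def calibrate_drift(fp_runs, q_runs):
    """Per-timestep variance of the quantization error, estimated from paired
    full-precision / quantized runs. Each run is a list of per-step outputs."""
    n_steps = len(fp_runs[0])
    sigma2 = []
    for t in range(n_steps):
        errs = [q[t] - fp[t] for fp, q in zip(fp_runs, q_runs)]
        mean = sum(errs) / len(errs)
        sigma2.append(sum((e - mean) ** 2 for e in errs) / len(errs))
    return sigma2

def corrected_noise_scale(base_var, sigma2_t):
    """Illustrative sampler-side correction (an assumption, not the paper's
    formula): shrink the injected-noise variance so that injected noise plus
    quantization perturbation matches the intended marginal variance."""
    return max(base_var - sigma2_t, 0.0) ** 0.5

# toy scalar trajectories: 3 paired calibration runs, 2 denoising steps each
fp_runs = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
q_runs  = [[1.1, 2.0], [0.9, 2.0], [1.0, 2.0]]
sigma2 = calibrate_drift(fp_runs, q_runs)  # per-step error variance
```

The key property being illustrated is that the correction needs only cheap calibration statistics and leaves the sampler otherwise unchanged, which is why it is plug-and-play across samplers and PTQ methods.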

[114] MRD: Multi-resolution Retrieval-Detection Fusion for High-Resolution Image Understanding

Fan Yang, Xingping Dong, Xin Yu, Wenhan Luo, Wei Liu, Kaihao Zhang

Main category: cs.CV

TL;DR: MRD is a training-free framework that enhances high-resolution image understanding in MLLMs by combining multi-resolution semantic fusion for local consistency and open-vocabulary detection for global priors, addressing object fragmentation and retrieval issues.

DetailsMotivation: Current vision-based RAG approaches for HR image understanding in MLLMs suffer from object fragmentation (leading to semantic bias and incomplete retrieval) and false positives from irrelevant background patches.

Method: Proposes Multi-resolution Retrieval-Detection (MRD) with two components: 1) Multi-resolution semantic fusion for cross-scale consistency to mitigate single-resolution bias and object fragmentation locally, and 2) Integration of open-vocabulary object detection as localization priors globally within a unified framework.

Result: Extensive experiments across multiple MLLMs on HR image benchmarks show MRD achieves state-of-the-art performance on both single-object and multi-object understanding tasks.

Conclusion: MRD effectively addresses limitations of current RAG approaches for HR image understanding in MLLMs by combining local multi-resolution consistency with global detection priors in a training-free framework.

Abstract: Understanding high-resolution (HR) images remains a critical challenge for multimodal large language models (MLLMs). Recent approaches leverage vision-based retrieval-augmented generation (RAG) to retrieve query-relevant crops from HR images, improving understanding capacity of MLLMs. However, this paradigm often leads to object fragmentation, resulting in semantic bias and incomplete retrieval, while also introducing false positives from irrelevant background patches. To address these issues, we propose Multi-resolution Retrieval-Detection (MRD), a training-free framework that enhances HR image understanding from both local and global perspectives. Locally, MRD enforces cross-scale semantic consistency via multi-resolution semantic fusion to mitigate single-resolution bias and alleviate object fragmentation. Globally, it integrates open-vocabulary object detection (OVD) as localization priors within a unified framework. Extensive experiments across multiple MLLMs on HR image benchmarks demonstrate that MRD achieves state-of-the-art (SOTA) performance on both single-object and multi-object understanding tasks. Code will be available at: https://github.com/yf0412/MRD.
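
The two ingredients — cross-resolution score fusion and detection-derived priors — admit a compact sketch. Everything below is an assumed interface for illustration (the function names, the mean-fusion rule, and the `boost` weight are not from the released MRD code):

```python
import numpy as np

def fuse_multires_scores(scores_per_res):
    # scores_per_res: list of (n_patches,) query-patch similarity maps, one
    # per resolution, assumed already resampled to a common patch grid.
    # Averaging enforces cross-scale consensus, damping single-resolution bias.
    return np.stack(scores_per_res).mean(axis=0)

def apply_ovd_prior(scores, in_box_mask, boost=0.2):
    # in_box_mask: boolean mask of patches covered by an open-vocabulary
    # detection for the query; patches inside a detected object get a
    # global localization bonus so background patches are demoted.
    return scores + boost * in_box_mask.astype(scores.dtype)

def retrieve_topk(scores, k=4):
    # Select the highest-scoring crops to feed back to the MLLM.
    return np.argsort(scores)[::-1][:k]
```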

[115] Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

Mohammed Rahman Sherif Khan Mohammad, Ardhendu Behera, Sandip Pradhan, Swagat Kumar, Amr Ahmed

Main category: cs.CV

TL;DR: TOGA: Training-only Graph Adapter for CLIP that uses a high-capacity Heterogeneous Graph Teacher during training to enhance few-shot learning by modeling fine-grained patch-text relations, then transfers this knowledge to Tip-Adapter’s cache without inference overhead.

Motivation: Existing adapter-based CLIP tuning methods like Tip-Adapter rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text, limiting their few-shot learning performance.

Method: Proposes asymmetric training-only framework with auxiliary Heterogeneous Graph Teacher that: (1) integrates multi-scale visual patches and text prompts into unified graph, (2) performs cross-modal reasoning via Modality-aware Graph Transformer, (3) applies discriminative node filtering, then uses cache-aware dual-objective strategy to supervise relational knowledge into Tip-Adapter’s cache.

Result: Achieves new state-of-the-art across standard 1-16-shot benchmarks while maintaining identical inference to Tip-Adapter with zero extra latency or memory overhead.

Conclusion: The auxiliary graph supervision, text-guided reasoning, and node filtering are essential for robust few-shot adaptation, enabling fine-grained cross-modal understanding without inference cost.

Abstract: Recent adapter-based CLIP tuning (e.g., Tip-Adapter) is a strong few-shot learner, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global uni-modal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to supervise this relational knowledge directly into the Tip-Adapter’s key-value cache, effectively upgrading the prototypes while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1-16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
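
Because TOGA leaves inference identical to Tip-Adapter, the test-time path is just the standard Tip-Adapter cache lookup, sketched below (the affinity formula and `alpha`/`beta` defaults follow the original Tip-Adapter paper; TOGA's contribution is how `keys` and `values` are supervised during training, which this sketch does not cover):

```python
import numpy as np

def tip_adapter_logits(f, keys, values, clip_logits, alpha=1.0, beta=5.5):
    # f: (d,) L2-normalized query feature; keys: (N*K, d) cached support
    # features; values: (N*K, C) one-hot support labels; clip_logits: (C,)
    # zero-shot CLIP logits. The cache acts as a soft nearest-neighbor
    # classifier blended with the zero-shot prediction.
    affinity = np.exp(-beta * (1.0 - keys @ f))     # (N*K,) support similarities
    return clip_logits + alpha * (affinity @ values)
```

Since the graph teacher only rewrites the cached `keys`/`values` at training time, this forward pass explains why TOGA adds zero latency and zero memory at test time.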

[116] From Concepts to Judgments: Interpretable Image Aesthetic Assessment

Xiao-Chang Liu, Johan Wagemans

Main category: cs.CV

TL;DR: An interpretable image aesthetic assessment framework that uses human-understandable aesthetic concepts and residual predictors to provide transparent aesthetic judgments while maintaining competitive performance.

Motivation: Current IAA models offer strong predictive performance but lack interpretability, while users need to understand why images are considered pleasing or not. Humans naturally rely on high-level cues to justify aesthetic judgments, motivating an interpretable framework.

Method: Proposes an interpretable IAA framework using human-understandable aesthetic concepts learned in an accessible manner, constructing a concept subspace as foundation. Introduces a residual predictor to capture nuanced influences beyond explicit concepts.

Result: Experiments on photographic and artistic datasets show competitive predictive performance while offering transparent, human-understandable aesthetic judgments.

Conclusion: The framework successfully balances predictive performance with interpretability, providing both aesthetic scores and understandable justifications for those scores.

Abstract: Image aesthetic assessment (IAA) aims to predict the aesthetic quality of images as perceived by humans. While recent IAA models achieve strong predictive performance, they offer little insight into the factors driving their predictions. Yet for users, understanding why an image is considered pleasing or not is as valuable as the score itself, motivating growing interest in interpretability within IAA. When humans evaluate aesthetics, they naturally rely on high-level cues to justify their judgments. Motivated by this observation, we propose an interpretable IAA framework grounded in human-understandable aesthetic concepts. We learn these concepts in an accessible manner, constructing a subspace that forms the foundation of an inherently interpretable model. To capture nuanced influences on aesthetic perception beyond explicit concepts, we introduce a simple yet effective residual predictor. Experiments on photographic and artistic datasets demonstrate that our method achieves competitive predictive performance while offering transparent, human-understandable aesthetic judgments.
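
The concept-plus-residual decomposition can be illustrated minimally. The interfaces below are assumptions for the sketch (the abstract does not specify how the concept subspace or the residual predictor are parameterized):

```python
import numpy as np

def aesthetic_score(x, concept_basis, concept_weights, residual_fn):
    # x: (d,) image feature; concept_basis: (n_concepts, d) directions for
    # human-nameable concepts. The linear part is readable concept by
    # concept; residual_fn soaks up nuanced effects the concepts miss.
    activations = concept_basis @ x              # per-concept evidence
    explained = concept_weights @ activations    # interpretable portion
    return explained + residual_fn(x), activations
```

Returning `activations` alongside the score is what makes the judgment inspectable: each concept's contribution is `concept_weights[i] * activations[i]`.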

[117] Insight-V++: Towards Advanced Long-Chain Visual Reasoning with Multimodal Large Language Models

Yuhao Dong, Zuyan Liu, Shulin Tian, Yongming Rao, Ziwei Liu

Main category: cs.CV

TL;DR: Insight-V++ introduces a unified multi-agent visual reasoning framework with self-improving data generation and novel reinforcement learning algorithms for enhanced spatial-temporal reasoning in multimodal LLMs.

Motivation: Extending LLMs' advanced reasoning capabilities to multimodal domains is challenging due to scarcity of high-quality long-chain reasoning data and optimized training pipelines for vision-language models.

Method: Developed a scalable data generation pipeline with multi-granularity assessment, dual-agent architecture (reasoning + summary agents), and novel ST-GRPO and J-GRPO reinforcement learning algorithms for spatial-temporal reasoning.

Result: Significant performance gains on challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks, demonstrated on models like LLaVA-NeXT and Qwen2.5-VL.

Conclusion: The framework successfully bridges the gap in multimodal reasoning capabilities through autonomous data generation, multi-agent collaboration, and novel reinforcement learning approaches, enabling continuous self-improvement.

Abstract: Large Language Models (LLMs) have achieved remarkable reliability and advanced capabilities through extended test-time reasoning. However, extending these capabilities to Multi-modal Large Language Models (MLLMs) remains a significant challenge due to a critical scarcity of high-quality, long-chain reasoning data and optimized training pipelines. To bridge this gap, we present a unified multi-agent visual reasoning framework that systematically evolves from our foundational image-centric model, Insight-V, into a generalized spatial-temporal architecture, Insight-V++. We first propose a scalable data generation pipeline equipped with multi-granularity assessment that autonomously synthesizes structured, complex reasoning trajectories across image and video domains without human intervention. Recognizing that directly supervising MLLMs with such intricate data yields sub-optimal results, we design a dual-agent architecture comprising a reasoning agent to execute extensive analytical chains, and a summary agent to critically evaluate and distill final outcomes. While our initial framework utilized Direct Preference Optimization (DPO), its off-policy nature fundamentally constrained reinforcement learning potential. To overcome these limitations, particularly for long-horizon video understanding, Insight-V++ introduces two novel algorithms, ST-GRPO and J-GRPO, which enhance spatial-temporal reasoning and improve evaluative robustness. Crucially, by leveraging reliable feedback from the summary agent, we guide an iterative reasoning path generation process, retraining the entire multi-agent system in a continuous, self-improving loop. Extensive experiments on base models like LLaVA-NeXT and Qwen2.5-VL demonstrate significant performance gains across challenging image and video reasoning benchmarks while preserving strong capabilities on traditional perception-focused tasks.

[118] VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods, John Kenyon, Tsung-Yi Lin, Xiaodong Yang, Ming-Yu Liu, Kevin Xie

Main category: cs.CV

TL;DR: VLM-AutoDrive is a modular post-training framework that adapts pretrained Vision-Language Models for safety-critical anomaly detection in driving scenarios, significantly improving collision detection performance from near-zero to strong results.

Motivation: Ego-centric dashcam footage presents challenges for detecting brief, rare safety-critical events like collisions and near-collisions. Generic vision models struggle with these scenarios, and existing MLLMs underperform in driving contexts due to domain and temporal misalignment.

Method: A modular post-training framework integrating metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning for VLMs.

Result: Fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27% on real-world Nexar dashcam videos, achieving substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces.

Conclusion: VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

Abstract: The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA’s Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

[119] MicroVision: An Open Dataset and Benchmark Models for Detecting Vulnerable Road Users and Micromobility Vehicles

Alexander Rasch, Rahul Rajendra Pai

Main category: cs.CV

TL;DR: Introduces MicroVision dataset - an open image dataset for detecting vulnerable road users (pedestrians, cyclists, e-scooterists) and stationary micromobility vehicles from VRU perspective, addressing gaps in existing datasets.

Motivation: Existing image datasets lack focus and diversity for VRUs and micromobility vehicles, often categorizing pedestrians and MMV riders as "person" and missing new MMVs like e-scooters. Datasets are typically from car perspective and lack data from VRU-only areas like sidewalks and cycle paths.

Method: Created MicroVision dataset with over 8,000 anonymized full-HD images containing more than 30,000 annotated VRUs and MMVs, captured over a year in Gothenburg from VRU perspective. Provided benchmark object-detection models using state-of-the-art architectures.

Result: Dataset includes diverse VRUs and stationary MMVs from VRU perspective. Benchmark models achieved mean average precision up to 0.723 on unseen test set. Dataset supports distinguishing between different VRUs and MMVs for traffic safety and monitoring.

Conclusion: MicroVision dataset addresses critical gaps in existing datasets for VRU and micromobility detection, providing valuable resources for traffic safety and planning applications. The dataset and models are publicly available.

Abstract: Micromobility is a growing mode of transportation, raising new challenges for traffic safety and planning due to increased interactions in areas where vulnerable road users (VRUs) share the infrastructure with micromobility, including parked micromobility vehicles (MMVs). Approaches to support traffic safety and planning increasingly rely on detecting road users in images – a computer-vision task relying heavily on the quality of the images to train on. However, existing open image datasets for training such models lack focus and diversity in VRUs and MMVs, for instance, by categorizing both pedestrians and MMV riders as "person", or by not including new MMVs like e-scooters. Furthermore, datasets are often captured from a car perspective and lack data from areas where only VRUs travel (sidewalks, cycle paths). To help close this gap, we introduce the MicroVision dataset: an open image dataset and annotations for training and evaluating models for detecting the most common VRUs (pedestrians, cyclists, e-scooterists) and stationary MMVs (bicycles, e-scooters), from a VRU perspective. The dataset, recorded in Gothenburg (Sweden), consists of more than 8,000 anonymized, full-HD images with more than 30,000 carefully annotated VRUs and MMVs, captured over an entire year and forming part of almost 2,000 unique interaction scenes. Along with the dataset, we provide the first benchmark object-detection models based on state-of-the-art architectures, which achieved a mean average precision of up to 0.723 on an unseen test set. The dataset and models can support traffic-safety applications that distinguish between different VRUs and MMVs, or help monitoring systems identify the use of micromobility. The dataset and model weights can be accessed at https://doi.org/10.71870/eepz-jd52.

[120] Semantic Segmentation and Depth Estimation for Real-Time Lunar Surface Mapping Using 3D Gaussian Splatting

Guillem Casadesus Vila, Adam Dai, Grace Gao

Main category: cs.CV

TL;DR: Real-time lunar mapping framework combining dense perception models with 3D Gaussian Splatting for robust navigation in challenging lunar conditions.

Motivation: Lunar surface navigation requires robust perception under challenging conditions like poorly textured environments, high-contrast lighting, and limited computational resources. Traditional methods struggle with these conditions, necessitating new approaches for detailed mapping.

Method: Integrates dense perception models with 3D Gaussian Splatting representation. Benchmarks models on synthetic LuPNT datasets, selecting stereo dense depth estimation with Gated Recurrent Units for speed/accuracy balance and CNN for semantic segmentation. Uses ground truth poses to decouple local scene understanding from global state estimation.

Result: Reconstructs 120-meter traverse with geometric height accuracy of ~3 cm, outperforming traditional point cloud baselines without LiDAR. The 3DGS map enables novel view synthesis and serves as foundation for full SLAM system.

Conclusion: Combining semantic segmentation and dense depth estimation with learned map representations is effective for creating detailed, large-scale lunar maps to support future missions.

Abstract: Navigation and mapping on the lunar surface require robust perception under challenging conditions, including poorly textured environments, high-contrast lighting, and limited computational resources. This paper presents a real-time mapping framework that integrates dense perception models with a 3D Gaussian Splatting (3DGS) representation. We first benchmark several models on synthetic datasets generated with the LuPNT simulator, selecting a stereo dense depth estimation model based on Gated Recurrent Units for its balance of speed and accuracy in depth estimation, and a convolutional neural network for its superior performance in detecting semantic segments. Using ground truth poses to decouple the local scene understanding from the global state estimation, our pipeline reconstructs a 120-meter traverse with a geometric height accuracy of approximately 3 cm, outperforming a traditional point cloud baseline without LiDAR. The resulting 3DGS map enables novel view synthesis and serves as a foundation for a full SLAM system, where its capacity for joint map and pose optimization would offer significant advantages. Our results demonstrate that combining semantic segmentation and dense depth estimation with learned map representations is an effective approach for creating detailed, large-scale maps to support future lunar surface missions.
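
A pipeline like this needs one core geometric step before fusing predictions into the 3DGS map: lifting each predicted depth map into world coordinates with the (here, ground-truth) pose. The sketch below shows standard pinhole back-projection; the interface is an assumption for illustration, not the released code:

```python
import numpy as np

def backproject(depth, K, T_wc):
    # depth: (H, W) dense depth map; K: (3, 3) camera intrinsics;
    # T_wc: (4, 4) camera-to-world pose. Returns (H*W, 3) world points
    # that can seed Gaussians or be compared against a height map.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, N)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)              # rays * depth
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])               # homogeneous
    return (T_wc @ cam_h)[:3].T
```

Using ground-truth poses here is exactly the decoupling the paper describes: local scene understanding (depth, semantics) is evaluated independently of global state estimation.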

[121] LRConv-NeRV: Low Rank Convolution for Efficient Neural Video Compression

Tamer Shanableh

Main category: cs.CV

TL;DR: LRConv-NeRV: An efficient neural video representation method that replaces dense 3x3 convolutional layers with structured low-rank separable convolutions in the decoder, achieving significant computational and memory savings with minimal quality loss.

Motivation: Neural Representations for Videos (NeRV) offer an alternative to conventional video codecs but suffer from computationally expensive and memory-intensive convolutional decoders, limiting deployment in resource-constrained environments.

Method: Proposes LRConv-NeRV which replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end. Applies low-rank factorization progressively from largest to earlier decoder stages for controllable efficiency-quality trade-offs.

Result: Applying LRConv only to final decoder stage reduces decoder complexity by 68% (201.9 to 64.9 GFLOPs) and model size by 9.3%, with negligible quality loss and ~9.2% bitrate reduction. Maintains temporal coherence and works well with INT8 quantization.

Conclusion: LRConv-NeRV offers a favorable efficiency-quality trade-off compared to existing methods, establishing it as a potential architectural alternative for efficient neural video decoding in low-precision, resource-constrained settings.

Abstract: Neural Representations for Videos (NeRV) encode entire video sequences within neural network parameters, offering an alternative paradigm to conventional video codecs. However, the convolutional decoder of NeRV remains computationally expensive and memory intensive, limiting its deployment in resource-constrained environments. This paper proposes LRConv-NeRV, an efficient NeRV variant that replaces selected dense 3x3 convolutional layers with structured low-rank separable convolutions, trained end-to-end within the decoder architecture. By progressively applying low-rank factorization from the largest to earlier decoder stages, LRConv-NeRV enables controllable trade-offs between reconstruction quality and efficiency. Extensive experiments demonstrate that applying LRConv only to the final decoder stage reduces decoder complexity by 68%, from 201.9 to 64.9 GFLOPs, and model size by 9.3%, while incurring negligible quality loss and achieving approximately 9.2% bitrate reduction. Under INT8 post-training quantization, LRConv-NeRV preserves reconstruction quality close to the dense NeRV baseline, whereas more aggressive factorization of early decoder stages leads to disproportionate quality degradation. Compared to existing work under layer-aligned settings, LRConv-NeRV achieves a more favorable efficiency versus quality trade-off, offering substantial GFLOPs and parameter reductions while maintaining higher PSNR/MS-SSIM and improved temporal stability. Temporal flicker analysis using LPIPS further shows that the proposed solution preserves temporal coherence close to the NeRV baseline. These results establish LRConv-NeRV as a potential architectural alternative for efficient neural video decoding under low-precision and resource-constrained settings.
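
A simple FLOPs model shows where the savings come from. The paper does not spell out its exact factorization, so the rank-r spatially separable scheme below (k×1 from cin to r channels, then 1×k from r to cout) is one plausible instantiation, with illustrative channel counts:

```python
def dense_conv_flops(h, w, cin, cout, k=3):
    # Multiply-accumulates for a dense k x k convolution on an h x w map.
    return h * w * cin * cout * k * k

def lowrank_sep_flops(h, w, cin, cout, r, k=3):
    # Rank-r separable factorization: (k x 1) conv cin -> r, then
    # (1 x k) conv r -> cout. Cost scales with r*(cin + cout) instead
    # of cin*cout*k.
    return h * w * r * k * (cin + cout)

# Example: a late (high-resolution) decoder stage, where dense convs dominate.
dense = dense_conv_flops(1080, 1920, 96, 96)
lowrank = lowrank_sep_flops(1080, 1920, 96, 96, r=24)
# With these numbers the factorized stage costs exactly 6x fewer MACs,
# which is why factorizing only the final (largest) stage already cuts
# decoder GFLOPs so sharply.
```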

[122] CycleCap: Improving VLMs Captioning Performance via Self-Supervised Cycle Consistency Fine-Tuning

Marios Krestenitis, Christos Tzelepis, Konstantinos Ioannidis, Stefanos Vrochidis, Ioannis Kompatsiaris, Georgios Tzimiropoulos, Shaogang Gong, Ioannis Patras

Main category: cs.CV

TL;DR: CycleCap: Self-supervised fine-tuning method for VLMs using cycle consistency between image-to-text and text-to-image models to improve caption accuracy and reduce hallucinations without requiring curated image-text datasets.

Motivation: VLMs often produce generic or hallucinated descriptions due to vision-language misalignment. Existing approaches require costly annotated datasets or complex test-time frameworks. This work aims to improve alignment using cycle consistency as a self-supervised signal.

Method: Uses VLM as image-to-text component and pre-trained text-to-image model for reconstruction. Introduces CycleCap with Group Relative Policy Optimization (GRPO) using reward based on similarity between original and reconstructed images computed on-the-fly.

Result: Applied to four VLMs (1B to 7B parameters), CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.

Conclusion: Cycle consistency can serve as effective self-supervised training signal for improving VLM alignment, enabling use of raw images alone without curated image-text datasets while producing more accurate and grounded descriptions.

Abstract: Visual-Language Models (VLMs) have achieved remarkable progress in image captioning, visual question answering, and visual reasoning. Yet they remain prone to vision-language misalignment, often producing overly generic or hallucinated descriptions. Existing approaches address this via instruction tuning-requiring costly, large-scale annotated datasets or via complex test-time frameworks for caption refinement. In this work, we revisit image-text alignment through the lens of cycle consistency: given an image and a caption generated by an image-to-text model, the backward mapping through a text-to-image model should reconstruct an image that closely matches the original. In our setup, a VLM serves as the image-to-text component, while a pre-trained text-to-image model closes the loop by reconstructing the image from the generated caption. Building on this, we introduce CycleCap, a fine-tuning scheme to improve image captioning using Group Relative Policy Optimization (GRPO) with a reward based on the similarity between the original and reconstructed images, computed on-the-fly. Unlike previous work that uses cycle consistency loss for preference dataset construction, our method leverages cycle consistency directly as a self-supervised training signal. This enables the use of raw images alone, eliminating the need for curated image-text datasets, while steering the VLM to produce more accurate and grounded text descriptions. Applied to four VLMs ranging from 1B to 7B parameters, CycleCap yields consistent improvements across captioning and hallucination benchmarks, surpassing state-of-the-art methods that rely on supervised cycle consistency training.
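
The reward and the GRPO grouping are the two moving parts, sketched below. The cosine-similarity reward matches the abstract's description; the choice of image encoder supplying the embeddings is left open (an assumption here), and the group normalization is the standard GRPO form rather than anything CycleCap-specific:

```python
import numpy as np

def cycle_reward(orig_emb, recon_emb):
    # Score a caption by how closely the text-to-image reconstruction
    # matches the original image, via cosine similarity of embeddings
    # computed on-the-fly (no ground-truth caption needed).
    a = orig_emb / np.linalg.norm(orig_emb)
    b = recon_emb / np.linalg.norm(recon_emb)
    return float(a @ b)

def grpo_advantages(rewards, eps=1e-8):
    # GRPO: normalize rewards within the group of captions sampled for
    # the same image, so no learned value critic is required.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Because the reward needs only the raw image and its reconstruction, the loop trains on unlabeled images alone, which is the point of the method.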

[123] Fast and Generalizable NeRF Architecture Selection for Satellite Scene Reconstruction

Devjyoti Chakraborty, Zaki Sukma, Rakandhiya D. Rachmanto, Kriti Ghosh, In Kee Kim, Suchendra M. Bhandarkar, Lakshmish Ramaswamy, Nancy K. O’Hare, Deepak Mishra

Main category: cs.CV

TL;DR: PreSCAN predicts NeRF reconstruction quality for satellite imagery using lightweight geometric/photometric descriptors, enabling fast architecture selection without training and efficient edge deployment.

Motivation: NeRF for satellite imagery requires individual training per scene and NAS takes hours/days of GPU time. The authors found multi-view consistency (not architecture) determines reconstruction quality, motivating a predictive approach.

Method: Developed PreSCAN framework using SHAP analysis to identify key geometric and photometric descriptors. Uses these lightweight descriptors to predict NeRF quality before training, enabling fast architecture selection.

Result: PreSCAN selects suitable architectures in <30 seconds with <1 dB prediction error, achieving 1000× speedup over NAS. On edge platforms (Jetson Orin), reduces inference power by 26% and latency by 43% with minimal quality loss.

Conclusion: PreSCAN provides efficient NeRF quality prediction for satellite imagery, enabling fast architecture selection and optimized edge deployment without retraining across diverse scenes.

Abstract: Neural Radiance Fields (NeRF) have emerged as a powerful approach for photorealistic 3D reconstruction from multi-view images. However, deploying NeRF for satellite imagery remains challenging. Each scene requires individual training, and optimizing architectures via Neural Architecture Search (NAS) demands hours to days of GPU time. While existing approaches focus on architectural improvements, our SHAP analysis reveals that multi-view consistency, rather than model architecture, determines reconstruction quality. Based on this insight, we develop PreSCAN, a predictive framework that estimates NeRF quality prior to training using lightweight geometric and photometric descriptors. PreSCAN selects suitable architectures in < 30 seconds with < 1 dB prediction error, achieving 1000× speedup over NAS. We further demonstrate PreSCAN’s deployment utility on edge platforms (Jetson Orin), where combining its predictions with offline cost profiling reduces inference power by 26% and latency by 43% with minimal quality loss. Experiments on DFC2019 datasets confirm that PreSCAN generalizes across diverse satellite scenes without retraining.
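
The predict-before-training idea can be sketched cheaply. PreSCAN's actual descriptors are not listed in the abstract, so the mean pairwise view correlation below is a stand-in proxy for multi-view consistency, and the linear regressor's coefficients would be fit offline on scenes with known reconstruction quality (all names are assumptions):

```python
import numpy as np

def multiview_consistency(views):
    # views: (n, H, W) grayscale views of the same scene. Proxy descriptor:
    # mean pairwise correlation between flattened, standardized views.
    v = views.reshape(len(views), -1).astype(float)
    v = (v - v.mean(1, keepdims=True)) / (v.std(1, keepdims=True) + 1e-8)
    corr = (v @ v.T) / v.shape[1]
    n = len(views)
    return (corr.sum() - np.trace(corr)) / (n * (n - 1))  # off-diagonal mean

def predict_psnr(descriptors, coef, bias):
    # Lightweight regressor mapping scene descriptors to expected PSNR;
    # evaluating this costs seconds, versus hours of NAS training.
    return float(coef @ descriptors + bias)
```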

[124] Unrolled Reconstruction with Integrated Super-Resolution for Accelerated 3D LGE MRI

Md Hasibul Husain Hisham, Shireen Elhabian, Ganesh Adluru, Jason Mendes, Andrew Arai, Eugene Kholmovski, Ravi Ranjan, Edward DiBella

Main category: cs.CV

TL;DR: A hybrid unrolled reconstruction framework for accelerated 3D LGE MRI that integrates super-resolution enhancement within model-based reconstruction to better recover high-frequency details and thin atrial structures.

Motivation: Accelerated 3D LGE MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. Standard unrolled model-based networks operate at acquired resolution and may fail to fully recover high-frequency detail.

Method: Proposes a hybrid unrolled reconstruction framework where an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. Trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets.

Result: The method consistently improves PSNR and SSIM over standard unrolled reconstruction across acceleration factors, better preserves fine cardiac structures, and leads to improved left atrium segmentation performance compared to compressed sensing, MoDL, and self-guided DIP baselines.

Conclusion: Integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI, demonstrating the value of joint super-resolution enhancement and data consistency enforcement.

Abstract: Accelerated 3D late gadolinium enhancement (LGE) MRI requires robust reconstruction methods to recover thin atrial structures from undersampled k-space data. While unrolled model-based networks effectively integrate physics-driven data consistency with learned priors, they operate at the acquired resolution and may fail to fully recover high-frequency detail. We propose a hybrid unrolled reconstruction framework in which an Enhanced Deep Super-Resolution (EDSR) network replaces the proximal operator within each iteration of the optimization loop, enabling joint super-resolution enhancement and data consistency enforcement. The model is trained end-to-end on retrospectively undersampled preclinical 3D LGE datasets and compared against compressed sensing, Model-Based Deep Learning (MoDL), and self-guided Deep Image Prior (DIP) baselines. Across acceleration factors, the proposed method consistently improves PSNR and SSIM over standard unrolled reconstruction and better preserves fine cardiac structures, leading to improved LA (left atrium) segmentation performance. These results demonstrate that integrating super-resolution priors directly within model-based reconstruction provides measurable gains in accelerated 3D LGE MRI.
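
The unrolled structure — a learned proximal step alternating with data consistency — can be shown in 2D with a generic `prox` callable standing in for the EDSR module (the paper's network is trained end-to-end; this sketch only illustrates the loop, with a single-coil FFT model assumed for simplicity):

```python
import numpy as np

def data_consistency(x, y, mask):
    # Re-impose the measured k-space samples y at acquired locations.
    k = np.fft.fft2(x)
    k[mask] = y[mask]
    return np.fft.ifft2(k).real

def unrolled_recon(y, mask, prox, n_iters=5):
    # y: (H, W) complex k-space; mask: (H, W) bool sampling pattern.
    # Each iteration enhances the estimate (EDSR's role in the paper)
    # and then enforces agreement with the acquired data.
    x = np.fft.ifft2(np.where(mask, y, 0)).real   # zero-filled init
    for _ in range(n_iters):
        x = prox(x)                               # learned proximal / SR step
        x = data_consistency(x, y, mask)
    return x
```

Placing the super-resolution network inside the loop, rather than as a post-processing step, is what lets detail recovery and data fidelity constrain each other at every iteration.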

[125] VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection

Bo-Cheng Qiu, Yu-Fan Lin, Yu-Zhe Pien, Chia-Ming Lee, Fu-En Yang, Yu-Chiang Frank Wang, Chih-Chung Hsu

Main category: cs.CV

TL;DR: A framework for capsule endoscopy event detection combining temporal and visual backbones with validation-guided fusion and anatomy-aware temporal decoding to address sparse, heterogeneous findings in long videos.

Motivation: Capsule endoscopy event detection is challenging due to sparse diagnostically relevant findings, visual heterogeneity, and long noisy video streams, requiring event-level rather than frame-level evaluation.

Method: Combines EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding with temporal smoothing and anatomical constraints.

Result: Achieved overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235 on the official hidden test set, with ablations showing contributions from complementary backbones, validation-guided fusion, and anatomy-aware decoding.

Conclusion: The proposed RARE-VISION framework effectively addresses capsule endoscopy event detection by aligning with event-level metrics through complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding.

Abstract: Capsule endoscopy event detection is challenging because diagnostically relevant findings are sparse, visually heterogeneous, and embedded in long, noisy video streams, while evaluation is performed at the event level rather than by frame accuracy alone. We therefore formulate the RARE-VISION task as a metric-aligned event detection problem instead of a purely frame-wise classification task. Our framework combines two complementary backbones, EndoFM-LV for local temporal context and DINOv3 ViT-L/16 for strong frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Hierarchical Fusion, and Anatomy-Aware Temporal Event Decoding. The fusion stage uses validation-derived class-wise model weighting, backbone weighting, and probability calibration, while the decoding stage applies temporal smoothing, anatomical constraints, threshold refinement, and per-label event generation to produce stable event predictions. Validation ablations indicate that complementary backbones, validation-guided fusion, and anatomy-aware temporal decoding all contribute to event-level performance. On the official hidden test set, the proposed method achieved an overall temporal mAP@0.5 of 0.3530 and temporal mAP@0.95 of 0.3235.

[126] To See or To Please: Uncovering Visual Sycophancy and Split Beliefs in VLMs

Rui Hong, Shuxue Quan

Main category: cs.CV

TL;DR: The paper introduces a diagnostic framework to analyze whether Vision-Language Models (VLMs) genuinely use visual information or exploit language shortcuts, revealing widespread “Visual Sycophancy” where models detect visual anomalies but hallucinate to satisfy user expectations.

Motivation: To understand whether VLMs genuinely rely on visual information or exploit language shortcuts when answering correctly, and to develop a systematic diagnostic approach to disentangle different sources of hallucination in multimodal models.

Method: Tri-Layer Diagnostic Framework with three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Uses counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs.

Result: 69.6% of samples exhibit Visual Sycophancy (models detect visual anomalies but hallucinate to satisfy user expectations), with zero samples showing Robust Refusal. Scaling analysis shows larger models reduce Language Shortcuts but amplify Visual Sycophancy. Diagnostic scores enable post-hoc selective prediction with +9.5pp accuracy at 50% coverage.

Conclusion: VLMs systematically suppress truthful uncertainty acknowledgment due to alignment training, and scale alone cannot resolve the visual grounding problem. The diagnostic framework provides tools to analyze and potentially mitigate hallucination issues in multimodal models.

Abstract: When VLMs answer correctly, do they genuinely rely on visual information or exploit language shortcuts? We introduce the Tri-Layer Diagnostic Framework, which disentangles hallucination sources via three metrics: Latent Anomaly Detection (perceptual awareness), Visual Necessity Score (visual dependency, measured via KL divergence), and Competition Score (conflict between visual grounding and instruction following). Using counterfactual interventions (blind, noise, and conflict images) across 7 VLMs and 7,000 model-sample pairs, our taxonomy reveals that 69.6% of samples exhibit Visual Sycophancy–models detect visual anomalies but hallucinate to satisfy user expectations–while zero samples show Robust Refusal, indicating alignment training has systematically suppressed truthful uncertainty acknowledgment. A scaling analysis (Qwen2.5-VL 7B to 72B) shows larger models reduce Language Shortcuts but amplify Visual Sycophancy, demonstrating scale alone cannot resolve the grounding problem. Diagnostic scores further enable a post-hoc selective prediction strategy achieving up to +9.5pp accuracy at 50% coverage with no additional training cost.
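For readers curious what a KL-based visual-dependency score might look like, here is a minimal sketch. The paper's exact formulation is not reproduced in this summary, so the direction of the divergence and the use of a blind-image baseline are assumptions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete answer distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def visual_necessity_score(probs_with_image, probs_blind):
    """High divergence suggests the answer actually depends on the image;
    near-zero divergence suggests a language shortcut."""
    return kl_divergence(probs_with_image, probs_blind)
```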

[127] Pixel-Accurate Epipolar Guided Matching

Oleksii Nasypanyi, Francois Rameau

Main category: cs.CV

TL;DR: Exact epipolar-guided keypoint matching using angular interval queries with segment trees for efficient correspondence search.

Motivation: Traditional epipolar-guided matching approaches rely on coarse spatial binning which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. The authors aim to address these limitations with an exact formulation.

Method: Each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching becomes a 1D angular interval query solved efficiently in logarithmic time using a segment tree data structure.

Result: Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets with pixel-level tolerance and per-keypoint control.

Conclusion: The proposed exact angular interval query method provides efficient and accurate epipolar-guided matching, overcoming limitations of traditional spatial binning approaches.

Abstract: Keypoint matching can be slow and unreliable in challenging conditions such as repetitive textures or wide-baseline views. In such cases, known geometric relations (e.g., the fundamental matrix) can be used to restrict potential correspondences to a narrow epipolar envelope, thereby reducing the search space and improving robustness. These epipolar-guided matching approaches have proved effective in tasks such as SfM; however, most rely on coarse spatial binning, which introduces approximation errors, requires costly post-processing, and may miss valid correspondences. We address these limitations with an exact formulation that performs candidate selection directly in angular space. In our approach, each keypoint is assigned a tolerance circle which, when viewed from the epipole, defines an angular interval. Matching then becomes a 1D angular interval query, solved efficiently in logarithmic time with a segment tree. This guarantees pixel-level tolerance, supports per-keypoint control, and removes unnecessary descriptor comparisons. Extensive evaluation on ETH3D demonstrates noticeable speedups over existing approaches while recovering exact correspondence sets.
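The core geometric step is easy to reproduce. A hedged sketch follows, with a linear scan standing in for the paper's O(log n) segment-tree query and no handling of angular wrap-around at plus/minus pi:

```python
import math

def angular_interval(kp, epipole, tol_radius):
    """Angular interval subtended at the epipole by a keypoint's tolerance circle."""
    dx, dy = kp[0] - epipole[0], kp[1] - epipole[1]
    dist = math.hypot(dx, dy)
    center = math.atan2(dy, dx)
    half = math.asin(min(1.0, tol_radius / dist))  # half-angle of the tangent cone
    return center - half, center + half

def candidates_on_epipolar_line(theta, intervals):
    """Indices of keypoints whose interval contains the epipolar direction theta.
    (A segment tree answers this in logarithmic time; the scan is for clarity.)"""
    return [i for i, (lo, hi) in enumerate(intervals) if lo <= theta <= hi]
```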

[128] A Multi-Agent Perception-Action Alliance for Efficient Long Video Reasoning

Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Ling Liu

Main category: cs.CV

TL;DR: A4VL is a multi-agent system for efficient long-video reasoning that uses perception-action exploration loops with VLM agents to extract query-specific clues and collaboratively reason through cross-reviews.

Motivation: Existing video reasoning methods struggle with long videos due to computational complexity and difficulty in extracting relevant information across extended temporal contexts. There's a need for efficient systems that can handle real-world long videos while maintaining reasoning quality.

Method: Multi-agent perception-action exploration alliance with VLM agents operating in iterative rounds. Each round includes: 1) Perception exploration - agents extract query-specific clues from sampled frames and align them to relevant video blocks; 2) Action exploration - agents produce answers with rationales, cross-review each other, and decide whether to continue with pruning/re-staging or conclude with final answer.

Result: Outperforms 18 existing VLMs and 11 recent long-video reasoning methods on five VideoQA benchmarks, while achieving significantly lower inference latency.

Conclusion: A4VL effectively scales to real-world long videos through multi-agent collaboration and iterative perception-action exploration, demonstrating superior performance and efficiency in video reasoning tasks.

Abstract: This paper presents a multi-agent perception-action exploration alliance, dubbed A4VL, for efficient long-video reasoning. A4VL operates in a multi-round perception-action exploration loop with a selection of VLM agents. In each round, the team of agents performs video question-answer (VideoQA) via perception exploration followed by action exploration. During perception exploration, each agent learns to extract query-specific perception clue(s) from a few sampled frames and performs clue-based alignment to find the video block(s) that are most relevant to the query-specific event. During action exploration, A4VL performs video reasoning in three steps: (1) each agent produces its initial answer with a rationale, (2) all agents collaboratively score one another through cross-reviews and relevance ranking, and (3) based on whether a satisfactory consensus is reached, the decision is made either to start a new round of perception-action deliberation by pruning (e.g., filtering out the lowest-performing agent) and re-staging (e.g., new-clue and matching-block based perception-action exploration), or to conclude by producing its final answer. The integration of the multi-agent alliance through multi-round perception-action exploration, coupled with event-driven partitioning and cue-guided block alignment, enables A4VL to effectively scale to real-world long videos while preserving high-quality video reasoning. Evaluation results on five popular VideoQA benchmarks show that A4VL outperforms 18 existing representative VLMs and 11 recent methods optimized for long-video reasoning, while achieving significantly lower inference latency. Our code is released at https://github.com/git-disl/A4VL.
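The answer / cross-review / prune loop can be caricatured with plain callables standing in for VLM agents. Everything below, including majority voting as the "cross-review" and the consensus threshold, is a simplification for intuition, not the A4VL scoring scheme:

```python
from collections import Counter

def deliberate(agents, question, max_rounds=3, consensus=0.6):
    """Toy action-exploration loop: answer, cross-review by agreement,
    prune the least-agreeing agent, repeat until consensus."""
    agents = list(agents)
    for _ in range(max_rounds):
        answers = [agent(question) for agent in agents]
        votes = Counter(answers)
        top, count = votes.most_common(1)[0]
        if count / len(answers) >= consensus:
            return top  # satisfactory consensus reached
        # prune the agent whose answer got the fewest peer votes
        scores = [votes[ans] for ans in answers]
        del agents[scores.index(min(scores))]
    return top  # fall back to the last round's majority answer
```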

[129] Inst4DGS: Instance-Decomposed 4D Gaussian Splatting with Multi-Video Label Permutation Learning

Yonghan Lee, Dinesh Manocha

Main category: cs.CV

TL;DR: Inst4DGS introduces instance-decomposed 4D Gaussian Splatting with long-horizon trajectories and cross-video instance matching using label-permutation latents and Sinkhorn optimization.

Motivation: While dynamic 4DGS has advanced, instance-decomposed 4DGS remains underexplored due to challenges in associating inconsistent instance labels across independently segmented multi-view videos.

Method: Uses per-video label-permutation latents with differentiable Sinkhorn layer for cross-video instance matching, plus instance-decomposed motion scaffolds for efficient long-horizon trajectory optimization.

Result: Achieves state-of-the-art rendering and segmentation quality on Panoptic Studio and Neural3DV datasets, improving PSNR from 26.10 to 28.36 and instance mIoU from 0.6310 to 0.9129 on Panoptic Studio.

Conclusion: Inst4DGS successfully enables joint tracking and instance decomposition in 4D Gaussian Splatting with temporally stable identities and improved efficiency.

Abstract: We present Inst4DGS, an instance-decomposed 4D Gaussian Splatting (4DGS) approach with long-horizon per-Gaussian trajectories. While dynamic 4DGS has advanced rapidly, instance-decomposed 4DGS remains underexplored, largely due to the difficulty of associating inconsistent instance labels across independently segmented multi-view videos. We address this challenge by introducing per-video label-permutation latents that learn cross-video instance matches through a differentiable Sinkhorn layer, enabling direct multi-view supervision with consistent identity preservation. This explicit label alignment yields sharp decision boundaries and temporally stable identities without identity drift. To further improve efficiency, we propose instance-decomposed motion scaffolds that provide low-dimensional motion bases per object for long-horizon trajectory optimization. Experiments on Panoptic Studio and Neural3DV show that Inst4DGS jointly supports tracking and instance decomposition while achieving state-of-the-art rendering and segmentation quality. On the Panoptic Studio dataset, Inst4DGS improves PSNR from 26.10 to 28.36, and instance mIoU from 0.6310 to 0.9129, over the strongest baseline.
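The label-permutation idea rests on standard Sinkhorn normalization, which is compact enough to sketch: a score matrix between the instance labels of two videos is relaxed into a doubly-stochastic soft permutation. The temperature and iteration count below are illustrative choices, not values from the paper.

```python
import numpy as np

def sinkhorn(scores, iters=100, tau=0.1):
    """Relax a label-matching score matrix into a doubly-stochastic
    soft permutation via log-space Sinkhorn normalization."""
    log_p = np.asarray(scores, dtype=float) / tau
    for _ in range(iters):
        log_p -= np.logaddexp.reduce(log_p, axis=1, keepdims=True)  # normalize rows
        log_p -= np.logaddexp.reduce(log_p, axis=0, keepdims=True)  # normalize columns
    return np.exp(log_p)
```

At low temperature the output approaches a hard permutation, which is how cross-video labels can be aligned while the whole layer stays differentiable.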

[130] Mind the Rarities: Can Rare Skin Diseases Be Reliably Diagnosed via Diagnostic Reasoning?

Yang Liu, Jiyao Yang, Hongjin Zhao, Xiaoyong Li, Yanzhe Ji, Xingjian Li, Runmin Jiang, Tianyang Wang, Saeed Anwar, Dongwoo Kim, Yue Yao, Zhenyue Qin, Min Xu

Main category: cs.CV

TL;DR: DermCase: A long-context benchmark for evaluating dermatology diagnostic reasoning in LVLMs, focusing on rare conditions and clinical reasoning processes rather than just final accuracy.

Motivation: Existing benchmarks for LVLMs in dermatology focus on common diseases and only assess final accuracy, overlooking the clinical reasoning process which is critical for complex cases, especially rare conditions.

Method: Constructed DermCase benchmark from peer-reviewed case reports containing 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases with comprehensive clinical information and step-by-step reasoning chains. Developed DermLIP-based similarity metrics for reliable evaluation aligned with dermatologists.

Result: Benchmarking 22 leading LVLMs revealed significant deficiencies in diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments showed instruction tuning substantially improves performance while DPO yields minimal gains. Error analysis revealed critical limitations in current models’ reasoning capabilities.

Conclusion: Current LVLMs have substantial limitations in dermatology diagnostic reasoning, especially for rare conditions. The DermCase benchmark provides a comprehensive evaluation framework that reveals critical gaps in clinical reasoning capabilities that need to be addressed.

Abstract: Large vision-language models (LVLMs) demonstrate strong performance in dermatology; however, evaluating diagnostic reasoning for rare conditions remains largely unexplored. Existing benchmarks focus on common diseases and assess only final accuracy, overlooking the clinical reasoning process, which is critical for complex cases. We address this gap by constructing DermCase, a long-context benchmark derived from peer-reviewed case reports. Our dataset contains 26,030 multi-modal image-text pairs and 6,354 clinically challenging cases, each annotated with comprehensive clinical information and step-by-step reasoning chains. To enable reliable evaluation, we establish DermLIP-based similarity metrics that achieve stronger alignment with dermatologists for assessing differential diagnosis quality. Benchmarking 22 leading LVLMs exposes significant deficiencies across diagnosis accuracy, differential diagnosis, and clinical reasoning. Fine-tuning experiments demonstrate that instruction tuning substantially improves performance while Direct Preference Optimization (DPO) yields minimal gains. Systematic error analysis further reveals critical limitations in current models’ reasoning capabilities.
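The DermLIP-based metric presumably reduces to comparing text embeddings of predicted and reference diagnoses. A generic, encoder-agnostic sketch using cosine similarity (the paper's actual metric may differ):

```python
import numpy as np

def embedding_similarity(pred_dx_emb, ref_dx_emb):
    """Cosine similarity between embeddings of a predicted and a reference
    diagnosis; any text encoder (DermLIP in the paper) can produce them."""
    a = pred_dx_emb / np.linalg.norm(pred_dx_emb)
    b = ref_dx_emb / np.linalg.norm(ref_dx_emb)
    return float(a @ b)
```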

[131] SynQ: Accurate Zero-shot Quantization by Synthesis-aware Fine-tuning

Minjun Kim, Jongjin Kim, U Kang

Main category: cs.CV

TL;DR: SynQ is a zero-shot quantization method that addresses data-free model compression challenges by using low-pass filtering to reduce synthetic data noise, aligning class activation maps for better accuracy, and employing soft labels for difficult samples to avoid misguidance from pre-trained model errors.

Motivation: The paper addresses the practical need for quantizing pre-trained models without access to training data (zero-shot quantization), which is crucial for privacy/security scenarios. Existing methods suffer from three key issues: noise in synthetic datasets, predictions based on off-target patterns, and misguidance by erroneous hard labels from the pre-trained model.

Method: SynQ uses a three-pronged approach: 1) Low-pass filtering to minimize noise in generated synthetic samples, 2) Training the quantized model to align its class activation maps with the pre-trained model for improved accuracy, and 3) Using only soft labels for difficult samples to avoid misguidance from the pre-trained model’s errors.

Result: Extensive experiments show that SynQ achieves state-of-the-art accuracy compared to existing zero-shot quantization methods, demonstrating effectiveness in data-free model compression scenarios.

Conclusion: SynQ successfully addresses key challenges in zero-shot quantization through noise reduction, activation map alignment, and soft label usage, providing a practical solution for deploying neural networks on resource-constrained devices without access to original training data.

Abstract: How can we accurately quantize a pre-trained model without any data? Quantization algorithms are widely used for deploying neural networks on resource-constrained edge devices. Zero-shot Quantization (ZSQ) addresses the crucial and practical scenario where training data are inaccessible for privacy or security reasons. However, three significant challenges hinder the performance of existing ZSQ methods: 1) noise in the synthetic dataset, 2) predictions based on off-target patterns, and 3) misguidance by erroneous hard labels. In this paper, we propose SynQ (Synthesis-aware Fine-tuning for Zero-shot Quantization), a carefully designed ZSQ framework to overcome the limitations of existing methods. SynQ minimizes the noise from the generated samples by exploiting a low-pass filter. Then, SynQ trains the quantized model to improve accuracy by aligning its class activation map with the pre-trained model. Furthermore, SynQ mitigates misguidance from the pre-trained model’s error by leveraging only soft labels for difficult samples. Extensive experiments show that SynQ provides state-of-the-art accuracy over existing ZSQ methods.
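SynQ's first ingredient, low-pass filtering of the synthetic samples, is a standard operation. A frequency-domain sketch with a Gaussian transfer function (the paper's filter design may differ):

```python
import numpy as np

def low_pass(img, cutoff=0.1):
    """Suppress high-frequency noise in a synthetic sample with a Gaussian
    low-pass filter in the frequency domain (cutoff in cycles per pixel)."""
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    transfer = np.exp(-(fx**2 + fy**2) / (2 * (cutoff / 2) ** 2))
    return np.real(np.fft.ifft2(np.fft.fft2(img) * transfer))
```

The DC gain is exactly 1, so the image mean is preserved while high-frequency generator artifacts are attenuated.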

[132] R&D: Balancing Reliability and Diversity in Synthetic Data Augmentation for Semantic Segmentation

Huy Che, Dinh-Duy Phan, Duc-Khai Lam

Main category: cs.CV

TL;DR: A novel synthetic data augmentation pipeline using controllable diffusion models for semantic segmentation, improving model performance in data-scarce scenarios through class-aware prompting and visual prior blending.

Motivation: Pixel-level semantic segmentation requires extensive labeled data collection which is labor-intensive. Traditional augmentation methods create geometric variations but fail to generate new structures, while existing generative models struggle with consistency between original and generated images for pixel-level tasks.

Method: Proposes a synthetic data augmentation pipeline integrating controllable diffusion models with class-aware prompting and visual prior blending to ensure precise alignment between generated images and segmentation labels, balancing diversity and reliability.

Result: Demonstrated significant enhancement in semantic segmentation performance on benchmark datasets (PASCAL VOC and BDD100K), especially in data-scarce scenarios, while improving model robustness in real-world applications.

Conclusion: The proposed method effectively bridges the gap between synthetic and real data for semantic segmentation tasks, providing a viable solution for data augmentation without additional real-world data collection.

Abstract: Collecting and annotating datasets for pixel-level semantic segmentation tasks are highly labor-intensive. Data augmentation provides a viable solution by enhancing model generalization without additional real-world data collection. Traditional augmentation techniques, such as translation, scaling, and color transformations, create geometric variations but fail to generate new structures. While generative models have been employed to extend semantic information of datasets, they often struggle to maintain consistency between the original and generated images, particularly for pixel-level tasks. In this work, we propose a novel synthetic data augmentation pipeline that integrates controllable diffusion models. Our approach balances data diversity and reliability, effectively bridging the gap between synthetic and real data. We utilize class-aware prompting and visual prior blending to improve image quality further, ensuring precise alignment with segmentation labels. By evaluating benchmark datasets such as PASCAL VOC and BDD100K, we demonstrate that our method significantly enhances semantic segmentation performance, especially in data-scarce scenarios, while improving model robustness in real-world applications. Our code is available at https://github.com/chequanghuy/Enhanced-Generative-Data-Augmentation-for-Semantic-Segmentation-via-Stronger-Guidance

[133] AndroTMem: From Interaction Trajectories to Anchored Memory in Long-Horizon GUI Agents

Yibo Shi, Jungang Li, Linghao Zhang, Zihao Dongfang, Biao Wu, Sicheng Tao, Yibo Yan, Chenxi Qin, Weiting Liu, Zhixin Lin, Hanqian Li, Yu Huang, Song Dai, Yonghua Hei, Yue Ding, Xiang Li, Shikang Wang, Chengdong Xu, Jingqi Liu, Xueying Ma, Zhiwen Zheng, Xiaofei Zhang, Bincheng Wang, Nichen Yang, Jie Wu, Lihua Tian, Chen Li, Xuming Hu

Main category: cs.CV

TL;DR: AndroTMem is a diagnostic framework for anchored memory in long-horizon Android GUI agents, featuring a benchmark with 1,069 tasks and proposing Anchored State Memory (ASM) to improve task completion by representing interactions as causally linked intermediate-state anchors.

Motivation: Current GUI agents struggle with interaction memory in long-horizon tasks - full sequence replay is redundant and noisy, while summaries erase critical dependency information and traceability needed for complex multi-step interactions.

Method: Proposes AndroTMem framework with AndroTMem-Bench benchmark (1,069 tasks, 34,473 steps) and Anchored State Memory (ASM) that represents interaction sequences as compact sets of causally linked intermediate-state anchors for subgoal-targeted retrieval and attribution-aware decision making.

Result: ASM consistently outperforms full-sequence replay and summary-based baselines across 12 GUI agents, improving Task Complete Rate by 5%-30.16% and AMS by 4.93%-24.66%, effectively mitigating interaction-memory bottlenecks in long-horizon tasks.

Conclusion: Anchored, structured memory representation is crucial for long-horizon GUI agents, with ASM demonstrating significant improvements over existing memory approaches by preserving causal dependencies while reducing redundancy.

Abstract: Long-horizon GUI agents are a key step toward real-world deployment, yet effective interaction memory under prevailing paradigms remains under-explored. Replaying full interaction sequences is redundant and amplifies noise, while summaries often erase dependency-critical information and traceability. We present AndroTMem, a diagnostic framework for anchored memory in long-horizon Android GUI agents. Its core benchmark, AndroTMem-Bench, comprises 1,069 tasks with 34,473 interaction steps (avg. 32.1 per task, max. 65). We evaluate agents with TCR (Task Complete Rate), focusing on tasks whose completion requires carrying forward critical intermediate state; AndroTMem-Bench is designed to enforce strong step-to-step causal dependencies, making sparse yet essential intermediate states decisive for downstream actions and centering interaction memory in evaluation. Across open- and closed-source GUI agents, we observe a consistent pattern: as interaction sequences grow longer, performance drops are driven mainly by within-task memory failures, not isolated perception errors or local action mistakes. Guided by this diagnosis, we propose Anchored State Memory (ASM), which represents interaction sequences as a compact set of causally linked intermediate-state anchors to enable subgoal-targeted retrieval and attribution-aware decision making. Across multiple settings and 12 evaluated GUI agents, ASM consistently outperforms full-sequence replay and summary-based baselines, improving TCR by 5%-30.16% and AMS by 4.93%-24.66%, indicating that anchored, structured memory effectively mitigates the interaction-memory bottleneck in long-horizon GUI tasks. The code, benchmark, and related resources are publicly available at https://github.com/CVC2233/AndroTMem.
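As a mental model of Anchored State Memory, consider storing only sparse intermediate-state anchors with causal parent links and retrieving the chain behind a subgoal instead of replaying every step. The class, its keyword-matching retrieval, and all names below are hypothetical simplifications, not the ASM implementation:

```python
class AnchoredStateMemory:
    """Toy sketch: keep sparse state anchors linked by causal parents;
    retrieve only the chain relevant to a subgoal."""

    def __init__(self):
        self.anchors = []   # list of (step, state_summary)
        self.parents = {}   # step -> causally preceding anchor step

    def add_anchor(self, step, state_summary, parent=None):
        self.anchors.append((step, state_summary))
        if parent is not None:
            self.parents[step] = parent

    def retrieve_chain(self, subgoal_keywords):
        """Causal chain ending at the newest anchor matching the subgoal."""
        summaries = dict(self.anchors)
        for step, summary in reversed(self.anchors):
            if any(k in summary for k in subgoal_keywords):
                chain = [(step, summary)]
                while step in self.parents:
                    step = self.parents[step]
                    chain.append((step, summaries[step]))
                return list(reversed(chain))
        return []
```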

[134] SR-Nav: Spatial Relationships Matter for Zero-shot Object Goal Navigation

Leyuan Fang, Zan Mao, Zijing Wang, Yinlong Yan

Main category: cs.CV

TL;DR: SR-Nav is a zero-shot object-goal navigation framework that uses spatial relationship graphs to enhance perception and planning when foundation models fail due to poor viewpoints or weak semantic cues.

Motivation: Foundation models often fail in object-goal navigation when faced with poor viewpoints or weak semantic cues, leading to unreliable reasoning in perception and planning. The authors observe that inherent spatial relationships among objects and regions encode structured scene priors that can help agents infer plausible target locations even under partial observations.

Method: 1) Constructs a Dynamic Spatial Relationship Graph (DSRG) encoding target-centered spatial relationships through foundation models, updated dynamically with real-time observations. 2) Uses a Relation-aware Matching Module that employs relationship matching instead of naive detection to verify and correct errors. 3) Implements a Dynamic Relationship Planning Module that reduces planning search space by computing optimal paths based on the DSRG.

Result: Experiments on HM3D show state-of-the-art performance in both success rate and navigation efficiency compared to existing methods.

Conclusion: Modeling spatial relationships enhances both perception and planning in zero-shot object-goal navigation, making the system more robust when foundation models struggle with poor viewpoints or weak semantic cues.

Abstract: Zero-shot object-goal navigation aims to find target objects in unseen environments using only egocentric observation. Recent methods leverage foundation models’ comprehension and reasoning capabilities to enhance navigation performance. However, when faced with poor viewpoints or weak semantic cues, foundation models often fail to support reliable reasoning in both perception and planning, resulting in inefficient or failed navigation. We observe that inherent relationships among objects and regions encode structured scene priors, which help agents infer plausible target locations even under partial observations. Motivated by this insight, we propose Spatial Relation-aware Navigation (SR-Nav), a framework that models both observed and experience-based spatial relationships to enhance both perception and planning. Specifically, SR-Nav first constructs a Dynamic Spatial Relationship Graph (DSRG) that encodes the target-centered spatial relationships through the foundation models and updates dynamically with real-time observations. We then introduce a Relation-aware Matching Module. It utilizes relationship matching instead of naive detection, leveraging diverse relationships in the DSRG to verify and correct errors, enhancing visual perception robustness. Finally, we design a Dynamic Relationship Planning Module to reduce the planning search space by dynamically computing the optimal paths based on the DSRG from the current position, thereby guiding planning and reducing exploration redundancy. Experiments on HM3D show that our method achieves state-of-the-art performance in both success rate and navigation efficiency. The code will be publicly available at https://github.com/Mzyw-1314/SR-Nav

[135]

Arushi Rai, Adriana Kovashka

Main category: cs.CV

TL;DR: Self-consistency objective improves temporal grounding in Video-LLMs for sports coaching without additional annotations by enforcing attention consistency across related tasks

Motivation: Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Obtaining frame-level supervision is challenging: expensive from humans and unreliable from other models.

Method: Exploit observation that related tasks (generation and verification) must attend to the same frames. Enforce this via self-consistency objective over select visual attention maps of tightly-related tasks.

Result: Using VidDiffBench with ground-truth keyframe annotations, validated attention misallocation as significant bottleneck. Training with self-consistency objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.

Conclusion: Self-consistency objective effectively improves temporal grounding in Video-LLMs without additional annotations, demonstrating significant performance gains on sports coaching tasks.

Abstract: Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.
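A self-consistency objective over attention maps can be written as a divergence between the frame-attention distributions of the two related tasks. Symmetric KL is used here as a plausible stand-in; this summary does not specify the paper's exact objective:

```python
import numpy as np

def attention_consistency_loss(attn_gen, attn_ver, eps=1e-12):
    """Symmetric KL between frame-attention distributions of two related
    tasks (e.g. generation vs. verification); minimizing it pushes both
    tasks to attend to the same frames."""
    p = np.asarray(attn_gen, dtype=float)
    q = np.asarray(attn_ver, dtype=float)
    p = p / (p.sum() + eps) + eps
    q = q / (q.sum() + eps) + eps
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```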

[136] Interpretable Prostate Cancer Detection using a Small Cohort of MRI Images

Vahid Monfared, Mohammad Hadi Gharib, Ali Sabri, Maryam Shahali, Farid Rashidi, Amit Mehta, Reza Rawassizadeh

Main category: cs.CV

TL;DR: Interpretable AI framework for prostate cancer detection from T2-weighted MRI using small dataset, comparing Vision Transformers, CNNs, and classical methods, with ResNet18 achieving best performance comparable to radiologists.

Motivation: Prostate cancer interpretation from T2-weighted MRI is challenging due to subtle, heterogeneous lesions, and existing methods often require biparametric MRI with large datasets. The paper aims to develop an interpretable AI framework using only T2-weighted images with a small dataset to reduce acquisition complexity and computational cost.

Method: Developed an interpretable framework using a small dataset of 162 T2-weighted images (102 cancer, 60 normal). Addressed data scarcity through transfer learning and augmentation. Compared Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Conducted a reader study with 5 radiologists on 22 cases.

Result: Transfer-learned ResNet18 achieved best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters. Vision Transformers showed lower performance despite higher complexity. HOG+SVM achieved comparable accuracy (AUC 0.917). In reader study, radiologists achieved mean sensitivity of 67.5% vs AI’s 95.2%, with Fleiss Kappa = 0.524 showing moderate inter-reader agreement.

Conclusion: The framework demonstrates competitive performance using only T2-weighted images, reducing acquisition complexity. AI-assisted screening could reduce missed cancers and improve consistency compared to radiologists. Handcrafted features remain effective for small datasets, while Vision Transformers may not be optimal for limited data scenarios.

Abstract: Prostate cancer is a leading cause of mortality in men, yet interpretation of T2-weighted prostate MRI remains challenging due to subtle and heterogeneous lesions. We developed an interpretable framework for automatic cancer detection using a small dataset of 162 T2-weighted images (102 cancer, 60 normal), addressing data scarcity through transfer learning and augmentation. We performed a comprehensive comparison of Vision Transformers (ViT, Swin), CNNs (ResNet18), and classical methods (Logistic Regression, SVM, HOG+SVM). Transfer-learned ResNet18 achieved the best performance (90.9% accuracy, 95.2% sensitivity, AUC 0.905) with only 11M parameters, while Vision Transformers showed lower performance despite substantially higher complexity. Notably, HOG+SVM achieved comparable accuracy (AUC 0.917), highlighting the effectiveness of handcrafted features in small datasets. Unlike state-of-the-art approaches relying on biparametric MRI (T2+DWI) and large cohorts, our method achieves competitive performance using only T2-weighted images, reducing acquisition complexity and computational cost. In a reader study of 22 cases, five radiologists achieved a mean sensitivity of 67.5% (Fleiss Kappa = 0.524), compared to 95.2% for the AI model, suggesting potential for AI-assisted screening to reduce missed cancers and improve consistency. Code and data are publicly available.

[137] Cell-Type Prototype-Informed Neural Network for Gene Expression Estimation from Pathology Images

Kazuya Nishimura, Ryoma Bise, Shinnosuke Matsuo, Haruka Hirose, Yasuhiro Kojima

Main category: cs.CV

TL;DR: CPNN uses single-cell RNA-seq data to create cell-type prototypes, then learns to predict gene expression from pathology images by modeling cell-type composition, achieving state-of-the-art performance on slide- and patch-level datasets.

Motivation: Existing methods treat gene expression as slide/spot-level signals without considering that expression arises from aggregation of underlying cell-level expression. The authors aim to incorporate cell-resolved guidance using publicly available single-cell RNA-seq data.

Method: CPNN (Cell-type Prototype-informed Neural Network) first estimates cell-type prototypes (mean expression profiles) from single-cell RNA-seq data. It then learns cell-type compositional weights directly from pathology images and models the relationship between prototypes and observed bulk/spatial expression.

Result: CPNN achieves highest performance in terms of Spearman correlation across three slide-level datasets and three patch-level spatial transcriptomics datasets. The framework also provides interpretable insights into which cell types drive predicted expression.

Conclusion: Incorporating cell-type information from single-cell RNA-seq data improves gene expression prediction from pathology images, providing a biologically grounded and interpretable framework that outperforms existing methods.

Abstract: Estimating slide- and patch-level gene expression profiles from pathology images enables rapid and low-cost molecular analysis with broad clinical impact. Despite strong results, existing approaches treat gene expression as a mere slide- or spot-level signal and do not incorporate the fact that the measured expression arises from the aggregation of underlying cell-level expression. To explicitly introduce this missing cell-resolved guidance, we propose a Cell-type Prototype-informed Neural Network (CPNN) that leverages publicly available single-cell RNA-sequencing datasets. Since single-cell measurements are noisy and not paired with histology images, we first estimate cell-type prototypes (mean expression profiles that reflect stable gene-gene co-variation patterns). CPNN then learns cell-type compositional weights directly from images and models the relationship between prototypes and observed bulk or spatial expression, providing a biologically grounded and structurally regularized prediction framework. We evaluate CPNN on three slide-level datasets and three patch-level spatial transcriptomics datasets. Across all settings, CPNN achieves the highest performance in terms of Spearman correlation. Moreover, by visualizing the inferred compositional weights, our framework provides interpretable insights into which cell types drive the predicted expression. Code is publicly available at https://github.com/naivete5656/CPNN.
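
The compositional idea reduces to a matrix product between image-derived weights and fixed prototypes; the sketch below assumes softmax-normalized weights and gamma-distributed prototype values purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cell_types, n_genes = 4, 100

# Prototypes: per-cell-type mean expression, estimated from scRNA-seq data.
prototypes = rng.gamma(2.0, 1.0, size=(n_cell_types, n_genes))

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# In CPNN these logits would come from an image encoder; random stand-ins here.
logits = rng.normal(size=(8, n_cell_types))    # 8 image patches
weights = softmax(logits)                      # compositional weights, rows sum to 1
predicted_expression = weights @ prototypes    # (8, n_genes) bulk/spot estimate
```

Because the prototypes are fixed and biologically meaningful, the learned weights are directly interpretable as cell-type proportions per patch.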

[138] MedQ-UNI: Toward Unified Medical Image Quality Assessment and Restoration via Vision-Language Modeling

Jiyao Liu, Junzhi Ning, Wanying Qu, Lihao Liu, Chenglong Ma, Junjun He, Ningsheng Xu

Main category: cs.CV

TL;DR: MedQ-UNI is a unified vision-language model that combines medical image quality assessment (Med-IQA) with medical image restoration (Med-IR) using an assess-then-restore paradigm to handle diverse degradations across multiple imaging modalities.

Motivation: Current medical image restoration methods are limited by being modality-specific or degradation-specific, failing to generalize across heterogeneous clinical degradations. The authors argue this stems from isolating restoration from quality assessment, as models without explicit quality understanding struggle to adapt to diverse degradation types.

Method: Proposes MedQ-UNI, a multimodal autoregressive dual-expert architecture with shared attention. A quality assessment expert first identifies degradation issues through structured natural language descriptions, then a restoration expert conditions on these descriptions to perform targeted image restoration. Built a large-scale dataset of ~50K paired samples across 3 imaging modalities and 5 restoration tasks with structured quality descriptions.

Result: A single MedQ-UNI model achieves state-of-the-art restoration performance across all tasks without task-specific adaptation, while generating superior quality descriptions. Demonstrates that explicit quality understanding improves restoration fidelity and interpretability.

Conclusion: The assess-then-restore paradigm effectively unifies medical image quality assessment and restoration, enabling generalization across arbitrary modalities and degradation types through explicit quality understanding.

Abstract: Existing medical image restoration (Med-IR) methods are typically modality-specific or degradation-specific, failing to generalize across the heterogeneous degradations encountered in clinical practice. We argue this limitation stems from the isolation of Med-IR from medical image quality assessment (Med-IQA), as restoration models without explicit quality understanding struggle to adapt to diverse degradation types across modalities. To address these challenges, we propose MedQ-UNI, a unified vision-language model that follows an assess-then-restore paradigm, explicitly leveraging Med-IQA to guide Med-IR across arbitrary modalities and degradation types. MedQ-UNI adopts a multimodal autoregressive dual-expert architecture with shared attention: a quality assessment expert first identifies degradation issues through structured natural language descriptions, and a restoration expert then conditions on these descriptions to perform targeted image restoration. To support this paradigm, we construct a large-scale dataset of approximately 50K paired samples spanning three imaging modalities and five restoration tasks, each annotated with structured quality descriptions for joint Med-IQA and Med-IR training, along with a 2K-sample benchmark for evaluation. Extensive experiments demonstrate that a single MedQ-UNI model, without any task-specific adaptation, achieves state-of-the-art restoration performance across all tasks while generating superior descriptions, confirming that explicit quality understanding meaningfully improves restoration fidelity and interpretability.

[139] Recolour What Matters: Region-Aware Colour Editing via Token-Level Diffusion

Yuqi Yang, Dongliang Chang, Yijia Ling, Ruoyi Du, Zhanyu Ma

Main category: cs.CV

TL;DR: ColourCrafter is a diffusion framework for precise, region-aware color editing that uses RGB color tokens fused with image tokens in latent space and perceptual Lab-space loss for fine-grained control.

Motivation: Color is highly perceptually salient but difficult to control in image generation. Existing text-driven methods rely on discrete language descriptions that can't accurately represent continuous chromatic variations, leading to deviations from intended hues, especially for fine-grained and local edits.

Method: Proposes ColourCrafter, a unified diffusion framework that performs token-level fusion of RGB color tokens and image tokens in latent space. It selectively propagates color information to semantically relevant regions while preserving structural fidelity. Uses a perceptual Lab-space Loss that decouples luminance and chrominance and constrains edits within masked areas. Also builds ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse color variations.

Result: Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art color accuracy, controllability and perceptual fidelity in fine-grained color editing.

Conclusion: ColourCrafter transforms color editing from global tone transfer into a structured, region-aware generation process, overcoming limitations of text-driven methods for precise color control.

Abstract: Colour is one of the most perceptually salient yet least controllable attributes in image generation. Although recent diffusion models can modify object colours from user instructions, their results often deviate from the intended hue, especially for fine-grained and local edits. Early text-driven methods rely on discrete language descriptions that cannot accurately represent continuous chromatic variations. To overcome this limitation, we propose ColourCrafter, a unified diffusion framework that transforms colour editing from global tone transfer into a structured, region-aware generation process. Unlike traditional colour-driven methods, ColourCrafter performs token-level fusion of RGB colour tokens and image tokens in latent space, selectively propagating colour information to semantically relevant regions while preserving structural fidelity. A perceptual Lab-space loss further enhances pixel-level precision by decoupling luminance and chrominance and constraining edits within masked areas. Additionally, we build ColourfulSet, a large-scale dataset of high-quality image pairs with continuous and diverse colour variations. Extensive experiments demonstrate that ColourCrafter achieves state-of-the-art colour accuracy, controllability and perceptual fidelity in fine-grained colour editing. Our project is available at https://yangyuqi317.github.io/ColourCrafter.github.io/.
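
A minimal version of such a masked, luminance/chrominance-decoupled Lab loss might look like the following (the channel weights and normalization are assumptions, and the RGB-to-Lab conversion is omitted):

```python
import numpy as np

def masked_lab_loss(pred_lab, target_lab, mask, w_lum=1.0, w_chroma=1.0):
    """Masked perceptual loss in Lab space: luminance (L) and chrominance
    (a, b) errors are decoupled, and only pixels inside the edit mask count."""
    diff = (pred_lab - target_lab) * mask[..., None]   # (H, W, 3)
    n = max(mask.sum(), 1.0)
    lum = np.sum(diff[..., 0] ** 2) / n                # L channel
    chroma = np.sum(diff[..., 1:] ** 2) / n            # a, b channels
    return float(w_lum * lum + w_chroma * chroma)

H, W = 4, 4
target = np.zeros((H, W, 3))
mask = np.zeros((H, W))
mask[1:3, 1:3] = 1.0               # edit region

pred = target.copy()
pred[0, 0] = [50.0, 20.0, 20.0]    # change outside the mask: ignored
loss_outside = masked_lab_loss(pred, target, mask)

pred[1, 1] = [0.0, 30.0, 0.0]      # chrominance shift inside the mask
loss_inside = masked_lab_loss(pred, target, mask)
```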

[140] Do Vision Language Models Understand Human Engagement in Games?

Ziyi Wang, Qizan Guo, Rishitosh Singh, Xiyang Hu

Main category: cs.CV

TL;DR: VLMs struggle to infer human engagement from gameplay video, showing weak performance across multiple prompting strategies and failing to outperform simple baselines.

Motivation: To evaluate whether vision-language models can infer latent psychological states (engagement) from visual gameplay cues alone, which is important for game design and player-experience research.

Method: Evaluated three VLMs on GameVibe Few-Shot dataset across nine first-person shooter games using six prompting strategies: zero-shot prediction, theory-guided prompts (Flow, GameFlow, Self-Determination Theory, MDA), and retrieval-augmented prompting for both pointwise engagement prediction and pairwise prediction of engagement change.

Result: Zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Retrieval-augmented prompting improves pointwise prediction in some settings, but pairwise prediction remains consistently difficult. Theory-guided prompting doesn’t reliably help and can reinforce surface-level shortcuts.

Conclusion: Current VLMs exhibit a perception-understanding gap: they can recognize visible gameplay cues but struggle to robustly infer human engagement across games, suggesting limitations in understanding latent psychological states from visual data alone.

Abstract: Inferring human engagement from gameplay video is important for game design and player-experience research, yet it remains unclear whether vision–language models (VLMs) can infer such latent psychological states from visual cues alone. Using the GameVibe Few-Shot dataset across nine first-person shooter games, we evaluate three VLMs under six prompting strategies, including zero-shot prediction, theory-guided prompts grounded in Flow, GameFlow, Self-Determination Theory, and MDA, and retrieval-augmented prompting. We consider both pointwise engagement prediction and pairwise prediction of engagement change between consecutive windows. Results show that zero-shot VLM predictions are generally weak and often fail to outperform simple per-game majority-class baselines. Memory- or retrieval-augmented prompting improves pointwise prediction in some settings, whereas pairwise prediction remains consistently difficult across strategies. Theory-guided prompting alone does not reliably help and can instead reinforce surface-level shortcuts. These findings suggest a perception–understanding gap in current VLMs: although they can recognize visible gameplay cues, they still struggle to robustly infer human engagement across games.
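
The per-game majority-class baseline that the VLMs often fail to beat is simple to state precisely (the engagement labels below are hypothetical):

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Always predict the most frequent training label; report test accuracy."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

# Hypothetical engagement labels for one game.
pred, acc = majority_baseline(
    train_labels=['high', 'high', 'high', 'low'],
    test_labels=['high', 'low', 'high', 'high'],
)
```

When engagement labels are heavily skewed within a game, this trivial predictor sets a surprisingly high bar for zero-shot models.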

[141] T-QPM: Enabling Temporal Out-Of-Distribution Detection and Domain Generalization for Vision-Language Models in Open-World

Aditi Naiknaware, Salimeh Sekeh

Main category: cs.CV

TL;DR: Temporal Quadruple-Pattern Matching (T-QPM) enhances multimodal OOD detection by addressing temporal drift and covariate shift through cross-modal consistency patterns and adaptive fusion weights.

Motivation: Existing vision-language models for OOD detection rely on fixed fusion rules and assume static environments, failing under temporal distribution shifts and lacking robustness against covariate-shifted inputs.

Method: Two-step framework: 1) Extends dual-pattern to Temporal Quadruple-Pattern Matching by pairing OOD images with text descriptions to create cross-modal consistency patterns; 2) Learns lightweight fusion weights to optimally combine semantic matching and visual typicality, with explicit regularization using Average Thresholded Confidence for stability.

Result: Experiments on temporally partitioned benchmarks show significant outperformance over static baselines, providing robust temporally-consistent multimodal OOD detection in non-stationary environments.

Conclusion: The proposed T-QPM framework offers an effective solution for multimodal OOD detection in dynamic settings, addressing both temporal drift and covariate shift challenges through adaptive cross-modal reasoning.

Abstract: Out-of-distribution (OOD) detection remains a critical challenge in open-world learning, where models must adapt to evolving data distributions. While recent vision-language models (VLMs) like CLIP enable multimodal OOD detection through Dual-Pattern Matching (DPM), existing methods typically suffer from two major shortcomings: (1) they rely on fixed fusion rules and assume static environments, failing under temporal drift; and (2) they lack robustness against covariate-shifted inputs. In this paper, we propose a novel two-step framework to enhance OOD detection and covariate distribution shift robustness in dynamic settings. We extend the dual-pattern regime into Temporal Quadruple-Pattern Matching (T-QPM). First, by pairing OOD images with text descriptions, we introduce cross-modal consistency patterns between ID and OOD signals, refining the decision boundary through joint image-text reasoning. Second, we address temporal distribution shifts by learning lightweight fusion weights to optimally combine semantic matching and visual typicality. To ensure stability, we enforce explicit regularization based on Average Thresholded Confidence (ATC), preventing performance degradation as distributions evolve. Experiments on temporally partitioned benchmarks demonstrate that our approach significantly outperforms static baselines, offering a robust, temporally-consistent framework for multimodal OOD detection in non-stationary environments.
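
The fusion step can be illustrated as a learned convex combination of the two scores; the softmax normalization and the score semantics (higher = more in-distribution) are assumptions for this sketch:

```python
import numpy as np

def fused_ood_score(semantic, typicality, w):
    """Combine semantic image-text matching and visual typicality with
    learned fusion weights (softmax-normalized so they stay positive)."""
    w = np.exp(w) / np.exp(w).sum()
    return w[0] * semantic + w[1] * typicality

# Equal (untrained) fusion weights; in T-QPM these would be learned lightweight
# parameters, updated as the distribution drifts over time.
w0 = np.array([0.0, 0.0])
id_score = fused_ood_score(semantic=0.9, typicality=0.8, w=w0)
ood_score = fused_ood_score(semantic=0.2, typicality=0.3, w=w0)
```

Thresholding the fused score separates ID from OOD inputs; re-fitting the weights over time is what makes the rule adaptive rather than fixed.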

[142] TexEditor: Structure-Preserving Text-Driven Texture Editing

Bo Zhao, Yihang Liu, Chenfeng Zhang, Huan Yang, Kun Gai, Wei Ji

Main category: cs.CV

TL;DR: TexEditor is a text-guided texture editing model that improves structural consistency when modifying object appearance by combining a high-quality SFT dataset (TexBlender) with RL-based structure preservation (StructureNFT), and introducing a new real-world benchmark (TexBench).

Motivation: Current SOTA editing models often fail to maintain structural consistency during texture editing, even when changes are intended to be purely appearance-related. The authors aim to address this limitation by enhancing structure preservation from both data and training perspectives.

Method: 1) Construct TexBlender - a high-quality SFT dataset generated with Blender for strong structural priors; 2) Develop StructureNFT - an RL-based approach integrating structure-preserving losses to transfer structural priors to real-world scenes; 3) Create TexBench - a general-purpose real-world benchmark for evaluation.

Result: TexEditor consistently outperforms strong baselines like Nano Banana Pro on both existing Blender-based texture benchmarks and the new TexBench. It also shows good generalization on the general-purpose ImgEdit benchmark.

Conclusion: The proposed approach effectively addresses structural consistency issues in texture editing through combined data and training innovations, with comprehensive evaluation demonstrating superior performance over existing methods.

Abstract: Text-guided texture editing aims to modify object appearance while preserving the underlying geometric structure. However, our empirical analysis reveals that even SOTA editing models frequently struggle to maintain structural consistency during texture editing, despite the intended changes being purely appearance-related. Motivated by this observation, we jointly enhance structure preservation from both data and training perspectives, and build TexEditor, a dedicated texture editing model based on Qwen-Image-Edit-2509. Firstly, we construct TexBlender, a high-quality SFT dataset generated with Blender, which provides strong structural priors for a cold start. Secondly, we introduce StructureNFT, a RL-based approach that integrates structure-preserving losses to transfer the structural priors learned during SFT to real-world scenes. Moreover, due to the limited realism and evaluation coverage of existing benchmarks, we introduce TexBench, a general-purpose real-world benchmark for text-guided texture editing. Extensive experiments on existing Blender-based texture benchmarks and our TexBench show that TexEditor consistently outperforms strong baselines such as Nano Banana Pro. In addition, we assess TexEditor on the general-purpose benchmark ImgEdit to validate its generalization. Our code and data are available at https://github.com/KlingAIResearch/TexEditor.
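
A toy structure-preservation term in the spirit of StructureNFT might compare spatial gradients (edges) while ignoring uniform colour changes; this gradient-based form is an assumption, not the paper's actual loss:

```python
import numpy as np

def structure_loss(edited, original):
    """Penalize changes to edge structure; invariant to uniform colour shifts."""
    def grads(x):
        g = x.mean(axis=-1)                  # drop colour, keep luminance structure
        return np.abs(np.diff(g, axis=0)), np.abs(np.diff(g, axis=1))
    gy_e, gx_e = grads(edited)
    gy_o, gx_o = grads(original)
    return float(np.mean((gy_e - gy_o) ** 2) + np.mean((gx_e - gx_o) ** 2))

rng = np.random.default_rng(1)
original = rng.random((8, 8, 3))
recolour = original + 0.2                    # uniform colour shift: edges intact
warped = np.roll(original, 2, axis=0)        # geometry actually changed

loss_recolour = structure_loss(recolour, original)
loss_warp = structure_loss(warped, original)
```

A pure appearance edit leaves such a term near zero, while geometric drift is penalized, which is the behavior a texture editor should be rewarded for.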

[143] FILT3R: Latent State Adaptive Kalman Filter for Streaming 3D Reconstruction

Seonghyun Jin, Jong Chul Ye

Main category: cs.CV

TL;DR: FILT3R introduces a training-free latent filtering layer for streaming 3D reconstruction that uses stochastic state estimation with Kalman-style gains to adaptively balance memory retention against new observations.

Motivation: Current streaming 3D reconstruction methods struggle with state update rules: aggressive updates forget useful history while conservative updates fail to track new evidence, and both become unstable beyond training horizons.

Method: Proposes FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. Maintains per-token variance and computes Kalman-style gains that adaptively balance memory retention against new observations. Process noise is estimated online from EMA-normalized temporal drift of candidate tokens.

Result: FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction.

Conclusion: FILT3R provides a principled approach to streaming 3D reconstruction state updates that improves long-horizon stability and generalizes existing update policies, with code to be released.

Abstract: Streaming 3D reconstruction maintains a persistent latent state that is updated online from incoming frames, enabling constant-memory inference. A key failure mode is the state update rule: aggressive overwrites forget useful history, while conservative updates fail to track new evidence, and both behaviors become unstable beyond the training horizon. To address this challenge, we propose FILT3R, a training-free latent filtering layer that casts recurrent state updates as stochastic state estimation in token space. FILT3R maintains a per-token variance and computes a Kalman-style gain that adaptively balances memory retention against new observations. Process noise – governing how much the latent state is expected to change between frames – is estimated online from EMA-normalized temporal drift of candidate tokens. Using extensive experiments, we demonstrate that FILT3R yields an interpretable, plug-in update rule that generalizes common overwrite and gating policies as special cases. Specifically, we show that gains shrink in stable regimes as uncertainty contracts with accumulated evidence, and rise when genuine scene change increases process uncertainty, improving long-horizon stability for depth, pose, and 3D reconstruction, compared to the existing methods. Code will be released at https://github.com/jinotter3/FILT3R.
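
The Kalman-style update described above can be sketched per token dimension; the fixed scalar noise values below stand in for the paper's online, EMA-based estimates:

```python
import numpy as np

def kalman_token_update(state, P, obs, R, Q):
    """One FILT3R-style latent update.
    P: per-token state variance, R: observation noise, Q: process noise
    (in FILT3R, Q is estimated online from EMA-normalized token drift)."""
    P = P + Q                        # predict: uncertainty grows with process noise
    K = P / (P + R)                  # Kalman-style gain in [0, 1)
    state = state + K * (obs - state)
    P = (1.0 - K) * P                # uncertainty contracts after the update
    return state, P, K

state, P = np.zeros(3), np.ones(3)
gains = []
for obs in [np.full(3, 1.0)] * 5:    # repeated, stable observations
    state, P, K = kalman_token_update(state, P, obs, R=1.0, Q=0.01)
    gains.append(float(K.mean()))
```

With stable observations the gain shrinks step by step (the state is trusted more than new frames); a genuine scene change would raise Q and push the gain back up, which is exactly the adaptive behavior the paper reports.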

[144] NymeriaPlus: Enriching Nymeria Dataset with Additional Annotations and Data

Daniel DeTone, Federica Bogo, Eric-Tuan Le, Duncan Frost, Julian Straub, Yawar Siddiqui, Yuting Ye, Jakob Engel, Richard Newcombe, Lingni Ma

Main category: cs.CV

TL;DR: NymeriaPlus is an upgraded egocentric dataset with improved human motion data, dense 3D/2D object annotations, instance-level reconstructions, and additional modalities like audio and video for multimodal embodied AI research.

Motivation: To address gaps in existing egocentric datasets by providing a comprehensive, multimodal benchmark with synchronized spatial-temporal data from wearable devices for embodied AI research.

Method: Upgraded the original Nymeria dataset by adding improved human motion formats (MHR/SMPL), dense 3D/2D bounding boxes for objects, instance-level 3D reconstructions, and new modalities including audio, basemap recordings, and wristband videos.

Result: Created NymeriaPlus, a more powerful in-the-wild egocentric dataset that consolidates complementary modalities and annotations into a single coherent benchmark for multimodal learning.

Conclusion: NymeriaPlus bridges key gaps in egocentric resources and supports broader research, particularly multimodal learning for embodied AI applications.

Abstract: The Nymeria Dataset, released in 2024, is a large-scale collection of in-the-wild human activities captured with multiple egocentric wearable devices that are spatially localized and temporally synchronized. It provides body-motion ground truth recorded with a motion-capture suit, device trajectories, semi-dense 3D point clouds, and in-context narrations. In this paper, we upgrade Nymeria and introduce NymeriaPlus. NymeriaPlus features: (1) improved human motion in Momentum Human Rig (MHR) and SMPL formats; (2) dense 3D and 2D bounding box annotations for indoor objects and structural elements; (3) instance-level 3D object reconstructions; and (4) additional modalities, e.g., basemap recordings, audio, and wristband videos. By consolidating these complementary modalities and annotations into a single, coherent benchmark, NymeriaPlus strengthens Nymeria into a more powerful in-the-wild egocentric dataset. We expect NymeriaPlus to bridge a key gap in existing egocentric resources and to support a broader range of research, including unique explorations of multimodal learning for embodied AI.

[145] Efficient Video Diffusion with Sparse Information Transmission for Video Compression

Mingde Zhou, Zheng Chen, Yulun Zhang

Main category: cs.CV

TL;DR: Diff-SIT: A video compression method using sparse temporal encoding and one-step video diffusion to achieve high perceptual quality and temporal consistency at ultra-low bitrates.

Motivation: Traditional video compression methods produce blurry images with poor perceptual quality at ultra-low bitrates, and existing generative compression methods lack temporal coherence and efficiency.

Method: Proposes Sparse Temporal Encoding Module (STEM) to sparsely encode frames into information-rich intermediate sequence, then uses One-Step Video Diffusion with Frame Type Embedder (ODFTE) to process the sequence while maintaining temporal correlation and adaptive reconstruction.

Result: Establishes state-of-the-art in perceptual quality and temporal consistency on multiple datasets, especially at ultra-low bitrates.

Conclusion: Diff-SIT effectively addresses ultra-low-bitrate video compression challenges by combining sparse encoding with diffusion-based generation, achieving superior perceptual quality and temporal consistency.

Abstract: Video compression aims to maximize reconstruction quality with minimal bitrates. Beyond standard distortion metrics, perceptual quality and temporal consistency are also critical. However, at ultra-low bitrates, traditional end-to-end compression models tend to produce blurry images of poor perceptual quality. Besides, existing generative compression methods often treat video frames independently and show limitations in time coherence and efficiency. To address these challenges, we propose the Efficient Video Diffusion with Sparse Information Transmission (Diff-SIT), which comprises the Sparse Temporal Encoding Module (STEM) and the One-Step Video Diffusion with Frame Type Embedder (ODFTE). The STEM sparsely encodes the original frame sequence into an information-rich intermediate sequence, achieving significant bitrate savings. Subsequently, the ODFTE processes this intermediate sequence as a whole, which exploits the temporal correlation. During this process, our proposed Frame Type Embedder (FTE) guides the diffusion model to perform adaptive reconstruction according to different frame types to optimize the overall quality. Extensive experiments on multiple datasets demonstrate that Diff-SIT establishes a new state-of-the-art in perceptual quality and temporal consistency, particularly in the challenging ultra-low-bitrate regime. Code is released at https://github.com/MingdeZhou/Diff-SIT.
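
The sparse-transmission idea can be illustrated with a toy frame plan; the uniform stride and the 'key'/'gen' labels are hypothetical stand-ins, not the paper's actual STEM scheme:

```python
def sparse_frame_plan(n_frames, stride):
    """'key' frames are encoded and transmitted; the rest are reconstructed
    by the one-step video diffusion decoder, conditioned on frame type."""
    types = ['gen'] * n_frames
    for i in range(0, n_frames, stride):
        types[i] = 'key'
    return types

plan = sparse_frame_plan(10, 4)               # ['key', 'gen', 'gen', 'gen', 'key', ...]
bit_saving = plan.count('gen') / len(plan)    # fraction of frames not transmitted
```

Because the decoder sees the whole intermediate sequence at once and knows each frame's type, temporal correlation is exploited jointly rather than frame by frame.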

[146] HOMEY: Heuristic Object Masking with Enhanced YOLO for Property Insurance Risk Detection

Teerapong Panboonyuen

Main category: cs.CV

TL;DR: HOMEY is a YOLO-based framework for detecting property risks using heuristic object masking and custom loss functions to identify 17 types of structural damages, maintenance issues, and liability hazards.

Motivation: Automated property risk detection has significant applications in real estate, underwriting, and insurance, but remains underexplored in computer vision despite its high impact potential.

Method: Combines YOLO with domain-specific heuristic object masking to amplify weak signals in cluttered backgrounds, plus a custom risk-aware loss function that balances class skew and severity weighting for 17 property risk classes.

Result: HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models on real-world property imagery while maintaining fast inference speeds.

Conclusion: The framework enables interpretable and cost-efficient risk analysis, providing a foundation for scalable AI-driven property insurance workflows.

Abstract: Automated property risk detection is a high-impact yet underexplored frontier in computer vision with direct implications for real estate, underwriting, and insurance operations. We introduce HOMEY (Heuristic Object Masking with Enhanced YOLO), a novel detection framework that combines YOLO with a domain-specific masking mechanism and a custom-designed loss function. HOMEY is trained to detect 17 risk-related property classes, including structural damages (e.g., cracked foundations, roof issues), maintenance neglect (e.g., dead yards, overgrown bushes), and liability hazards (e.g., falling gutters, garbage, hazard signs). Our approach introduces heuristic object masking to amplify weak signals in cluttered backgrounds and risk-aware loss calibration to balance class skew and severity weighting. Experiments on real-world property imagery demonstrate that HOMEY achieves superior detection accuracy and reliability compared to baseline YOLO models, while retaining fast inference. Beyond detection, HOMEY enables interpretable and cost-efficient risk analysis, laying the foundation for scalable AI-driven property insurance workflows.
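
One plausible form of such a risk-aware loss combines inverse-frequency class weights with per-class severity factors; both weighting schemes below are illustrative, not HOMEY's exact calibration:

```python
import numpy as np

def risk_aware_ce(probs, labels, class_counts, severity):
    """Cross-entropy reweighted by inverse class frequency (to counter skew)
    and by a per-class severity factor (to emphasize serious hazards)."""
    freq_w = class_counts.sum() / (len(class_counts) * class_counts)
    w = freq_w * severity
    ce = -np.log(probs[np.arange(len(labels)), labels] + 1e-8)
    return float(np.mean(w[labels] * ce))

# Two classes: a common cosmetic issue (0) and a rare severe hazard (1).
class_counts = np.array([90.0, 10.0])
severity = np.array([1.0, 2.0])
probs = np.array([[0.9, 0.1], [0.1, 0.9]])
loss_common = risk_aware_ce(probs[:1], np.array([0]), class_counts, severity)
loss_rare = risk_aware_ce(probs[1:], np.array([1]), class_counts, severity)
```

Even at equal predictive confidence, errors on the rare, severe class dominate the gradient, which is the calibration behavior the abstract describes.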

[147] From Snapshots to Symphonies: The Evolution of Protein Prediction from Static Structures to Generative Dynamics and Multimodal Interactions

Jingzhi Chen, Lijian Xu

Main category: cs.CV

TL;DR: AI-driven protein science has evolved from static structure prediction to modeling dynamic conformational ensembles and complex biomolecular interactions through multimodal representations, generative frameworks, and functional inference.

Motivation: The protein folding problem has been fundamentally transformed by AI, evolving from static structure prediction toward modeling dynamic conformational ensembles and complex biomolecular interactions. The field needs to move beyond static predictions to capture the dynamic nature of proteins and their interactions.

Method: Systematic examination across five dimensions: 1) unified multimodal representations integrating sequences, geometries, and textual knowledge; 2) refinement of static prediction through MSA-free architectures and all-atom complex modeling; 3) generative frameworks including diffusion models and flow matching; 4) prediction of heterogeneous interactions (protein-ligand, protein-nucleic acid, protein-protein); 5) functional inference of fitness landscapes and mutational effects.

Result: The review identifies current bottlenecks including data distribution biases, limited mechanistic interpretability, and disconnect between geometric metrics and biophysical reality. It marks AI’s transition from structural analysis tool to universal simulator of biological systems.

Conclusion: AI is transitioning from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life, with future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems.

Abstract: The protein folding problem has been fundamentally transformed by artificial intelligence, evolving from static structure prediction toward the modeling of dynamic conformational ensembles and complex biomolecular interactions. This review systematically examines the paradigm shift in AI-driven protein science across five interconnected dimensions: unified multimodal representations that integrate sequences, geometries, and textual knowledge; refinement of static prediction through MSA-free architectures and all-atom complex modeling; generative frameworks, including diffusion models and flow matching, that capture conformational distributions consistent with thermodynamic ensembles; prediction of heterogeneous interactions spanning protein-ligand, protein-nucleic acid, and protein-protein complexes; and functional inference of fitness landscapes, mutational effects, and text-guided property prediction. We critically analyze current bottlenecks, including data distribution biases, limited mechanistic interpretability, and the disconnect between geometric metrics and biophysical reality, while identifying future directions toward physically consistent generative models, multimodal foundation architectures, and experimental closed-loop systems. This methodological transformation marks artificial intelligence's transition from a structural analysis tool into a universal simulator capable of understanding and ultimately rewriting the dynamic language of life.

[148] Foundations and Architectures of Artificial Intelligence for Motor Insurance

Teerapong Panboonyuen

Main category: cs.CV

TL;DR: A comprehensive handbook on AI systems for motor insurance, focusing on vertically integrated AI architectures combining perception, multimodal reasoning, and production infrastructure for automotive risk assessment and claims processing.

Motivation: To address the need for systematic AI deployment in motor insurance by creating a unified intelligence stack that can handle real-world constraints and high-stakes industrial environments, particularly based on experiences from nationwide deployment in Thailand.

Method: Develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence. Creates a scalable pipeline that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack.

Result: Establishes a principled framework for translating modern AI into reliable, production-grade systems for motor insurance, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows.

Conclusion: The handbook provides a systematic approach to building vertically integrated AI systems for motor insurance that combines learning algorithms with MLOps practices, demonstrating practical deployment at scale in real-world industrial settings.

Abstract: This handbook presents a systematic treatment of the foundations and architectures of artificial intelligence for motor insurance, grounded in large-scale real-world deployment. It formalizes a vertically integrated AI paradigm that unifies perception, multimodal reasoning, and production infrastructure into a cohesive intelligence stack for automotive risk assessment and claims processing. At its core, the handbook develops domain-adapted transformer architectures for structured visual understanding, relational vehicle representation learning, and multimodal document intelligence, enabling end-to-end automation of vehicle damage analysis, claims evaluation, and underwriting workflows. These components are composed into a scalable pipeline operating under practical constraints observed in nationwide motor insurance systems in Thailand. Beyond model design, the handbook emphasizes the co-evolution of learning algorithms and MLOps practices, establishing a principled framework for translating modern artificial intelligence into reliable, production-grade systems in high-stakes industrial environments.

[149] OnlinePG: Online Open-Vocabulary Panoptic Mapping with 3D Gaussian Splatting

Hongjia Zhai, Qi Zhang, Xiaokun Pan, Xiyu Zhang, Yitong Dong, Huaqi Zhang, Dan Xu, Guofeng Zhang

Main category: cs.CV

TL;DR: OnlinePG: An online panoptic mapping system using 3D Gaussian Splatting for open-vocabulary scene understanding, combining geometric reconstruction with semantic perception in real-time for robotic applications.

Motivation: Existing scene understanding methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks that require real-time perception and interaction with environments.

Method: Uses 3D Gaussian Splatting in an online setting with a sliding window approach. Builds local consistency maps via 3D segment clustering graphs leveraging geometric and semantic cues, then fuses local maps into global map using bidirectional bipartite 3D Gaussian instance matching. Employs fused VLM features in 3D spatial attribute grids for open-vocabulary understanding.
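The bidirectional bipartite matching step can be illustrated with a minimal sketch: a local instance and a global instance are merged only when each is the other's best match under some similarity (e.g. 3D IoU between Gaussian instances). The mutual-best criterion and the function below are simplifications for illustration, not the paper's exact procedure.

```python
import numpy as np

def mutual_best_match(sim):
    """Bidirectional bipartite matching sketch: local instance i and global
    instance j are merged only if each is the other's best match (a simple
    stand-in for the robust matching step; the paper's criterion may differ)."""
    best_g = sim.argmax(axis=1)            # best global for each local
    best_l = sim.argmax(axis=0)            # best local for each global
    return [(i, int(best_g[i])) for i in range(sim.shape[0])
            if best_l[best_g[i]] == i and sim[i, best_g[i]] > 0]

# toy similarity (e.g. 3D IoU) between 3 local and 3 global instances
sim = np.array([[0.8, 0.1, 0.0],
                [0.2, 0.6, 0.3],
                [0.0, 0.5, 0.0]])
print(mutual_best_match(sim))   # locals 0 and 1 match; local 2 loses local 1's global
```

Instances that fail the mutual-best test (local 2 here) would be kept as new global instances rather than fused.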

Result: Outperforms other online approaches on widely used datasets while maintaining real-time efficiency.

Conclusion: OnlinePG provides an effective system for online panoptic mapping that integrates geometric reconstruction and open-vocabulary perception, enabling real-time scene understanding for embodied applications.

Abstract: Open-vocabulary scene understanding with online panoptic mapping is essential for embodied applications to perceive and interact with environments. However, existing methods are predominantly offline or lack instance-level understanding, limiting their applicability to real-world robotic tasks. In this paper, we propose OnlinePG, a novel and effective system that integrates geometric reconstruction and open-vocabulary perception using 3D Gaussian Splatting in an online setting. Technically, to achieve online panoptic mapping, we employ an efficient local-to-global paradigm with a sliding window. To build a local consistency map, we construct a 3D segment clustering graph that jointly leverages geometric and semantic cues, fusing inconsistent segments within the sliding window into complete instances. Subsequently, to update the global map, we construct explicit grids with spatial attributes for the local 3D Gaussian map and fuse them into the global map via robust bidirectional bipartite 3D Gaussian instance matching. Finally, we utilize the fused VLM features inside the 3D spatial attribute grids to achieve open-vocabulary scene understanding. Extensive experiments on widely used datasets demonstrate that our method achieves better performance than other online approaches, while maintaining real-time efficiency.

[150] CAFlow: Adaptive-Depth Single-Step Flow Matching for Efficient Histopathology Super-Resolution

Elad Yoshai, Ariel D. Yoshai, Natan T. Shaked

Main category: cs.CV

TL;DR: CAFlow: Adaptive-depth flow-matching framework for gigapixel pathology image super-resolution using early exits to reduce computation while maintaining quality.

Motivation: Whole-slide pathology images are gigapixel in resolution, making generative super-resolution computationally intensive and impractical for routine clinical deployment; efficient methods are needed that can handle massive images while preserving clinically relevant structures.

Method: Adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit preserving reconstruction quality. Uses flow matching in pixel-unshuffled rearranged space (16x spatial reduction). FlowResNet backbone mixes convolution and window self-attention blocks across four early exits. Lightweight exit classifier (~6K params) decides routing.
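Two of the described ingredients are easy to sketch: pixel unshuffle (space-to-depth with factor 4, giving the stated 16x spatial reduction) and routing each tile to the shallowest adequate exit. The threshold-based router below is a hypothetical stand-in for the learned ~6K-parameter exit classifier.

```python
import numpy as np

def pixel_unshuffle(x, r=4):
    """Space-to-depth: (C, H, W) -> (C*r*r, H//r, W//r).
    r=4 yields the 16x spatial reduction described for CAFlow."""
    c, h, w = x.shape
    x = x.reshape(c, h // r, r, w // r, r)
    return x.transpose(0, 2, 4, 1, 3).reshape(c * r * r, h // r, w // r)

def route_to_exit(difficulty, thresholds=(0.2, 0.4, 0.7)):
    """Hypothetical stand-in for the lightweight exit classifier:
    map a per-tile difficulty score to the shallowest adequate exit."""
    for i, t in enumerate(thresholds):
        if difficulty <= t:
            return i            # shallow exit suffices
    return len(thresholds)      # fall back to full depth

tile = np.arange(3 * 8 * 8, dtype=float).reshape(3, 8, 8)
z = pixel_unshuffle(tile)
print(z.shape)                            # (48, 2, 2): 16x fewer positions
print(route_to_exit(0.1), route_to_exit(0.9))
```

Easy tiles thus spend as little as 3.1 GFLOPs, with full depth (13.3 GFLOPs) reserved for hard tiles.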

Result: Achieves 31.72 dB PSNR vs 31.84 dB at full depth with 33% compute savings. Shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. Generalizes to held-out colon tissue (-0.02 dB loss). At x8 upscaling outperforms comparable-compute baselines. Preserves clinically relevant structures (nuclei segmentation confirmed). Trains in <5 hours on single GPU.

Conclusion: CAFlow enables practical deployment of generative super-resolution for gigapixel pathology images by adaptively reducing computation while maintaining clinical quality, reducing whole-slide inference from minutes to seconds.

Abstract: In digital pathology, whole-slide images routinely exceed gigapixel resolution, making computationally intensive generative super-resolution (SR) impractical for routine deployment. We introduce CAFlow, an adaptive-depth single-step flow-matching framework that routes each image tile to the shallowest network exit that preserves reconstruction quality. CAFlow performs flow matching in pixel-unshuffled rearranged space, reducing spatial computation by 16x while enabling direct inference. We show that dedicating half of training to exact t=0 samples is essential for single-step quality (-1.5 dB without it). The backbone, FlowResNet (1.90M parameters), mixes convolution and window self-attention blocks across four early exits spanning 3.1 to 13.3 GFLOPs. A lightweight exit classifier (~6K parameters) achieves 33% compute savings at only 0.12 dB cost. On multi-organ histopathology x4 SR, adaptive routing achieves 31.72 dB PSNR versus 31.84 dB at full depth, while the shallowest exit exceeds bicubic by +1.9 dB at 2.8x less compute than SwinIR-light. The method generalizes to held-out colon tissue with minimal quality loss (-0.02 dB), and at x8 upscaling it outperforms all comparable-compute baselines while remaining competitive with the much larger SwinIR-Medium model. Downstream nuclei segmentation confirms preservation of clinically relevant structure. The model trains in under 5 hours on a single GPU, and adaptive routing can reduce whole-slide inference from minutes to seconds.

[151] Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Liwei Che, Zhiyu Xue, Yihao Quan, Benlin Liu, Zeru Shi, Michelle Hurst, Jacob Feldman, Ruixiang Tang, Ranjay Krishna, Vladimir Pavlovic

Main category: cs.CV

TL;DR: LVLMs show human-like counting behavior with precise small-number counting and noisy estimation for larger quantities. Researchers identify a “counting circuit” in LVLMs and improve counting and general visual reasoning through targeted fine-tuning.

Motivation: Counting serves as a fundamental test of LVLM reasoning capabilities, requiring both object identification and arithmetic operations. The paper aims to understand how LVLMs implement counting and whether improving this basic capability can enhance broader visual reasoning.

Method: Used controlled synthetic and real-world benchmarks with mechanistic analyses. Introduced two novel interpretability methods: Visual Activation Patching and HeadLens. Identified a “counting circuit” in LVLMs and proposed lightweight intervention using synthetic images for fine-tuning on counting tasks.
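Activation patching, the family of techniques Visual Activation Patching belongs to, can be shown on a toy two-layer network: cache an activation from a clean run, overwrite the same site during a corrupted run, and check whether the output is restored. This is a generic illustration under stated assumptions, not the paper's method, which operates on LVLM vision tokens.

```python
import numpy as np

# Minimal activation-patching sketch on a toy 2-layer network
# (illustrative only; the paper's Visual Activation Patching targets
# LVLM internals, which this stand-in does not model).
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 2))

def forward(x, patch=None):
    h = np.tanh(x @ W1)          # layer-1 activation (the patch site)
    if patch is not None:
        h = patch                # overwrite with the cached clean activation
    return h @ W2

x_clean = np.ones(4)
x_corrupt = -np.ones(4)
h_clean = np.tanh(x_clean @ W1)  # cache the clean activation

y_corrupt = forward(x_corrupt)
y_patched = forward(x_corrupt, patch=h_clean)
# If patching restores the clean output, layer 1 carries the causal signal.
print(np.allclose(y_patched, forward(x_clean)))
```

Sweeping the patch site over layers and heads is how a circuit (here, the "counting circuit") is localized.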

Result: LVLMs display human-like counting behavior. The counting circuit is largely shared across various visual reasoning tasks. Targeted fine-tuning on counting improved counting accuracy by +8.36% on out-of-distribution benchmarks and boosted general visual reasoning by +1.54% for Qwen2.5-VL.

Conclusion: Counting plays a central role in visual reasoning, and targeted enhancement of counting mechanisms can improve overall visual reasoning capabilities in LVLMs.

Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model’s (LVLM’s) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured “counting circuit” that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

[152] 3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park

Main category: cs.CV

TL;DR: A framework for 3D-aware video customization that decouples spatial geometry from temporal motion using 3DreamBooth and 3Dapter to enable view-consistent video generation of customized subjects without multi-view video data.

Motivation: Existing subject-driven video generation methods treat subjects as 2D entities, lacking comprehensive spatial priors for 3D geometry reconstruction, which leads to arbitrary details in novel views rather than preserving true 3D identity. The scarcity of multi-view video datasets and temporal overfitting in fine-tuning approaches create challenges for genuine 3D-aware customization.

Method: Introduces 3DreamBooth with 1-frame optimization paradigm to decouple spatial geometry from temporal motion, baking 3D prior into model without video-based training. Incorporates 3Dapter visual conditioning module that undergoes multi-view joint optimization via asymmetrical conditioning strategy, acting as dynamic selective router for view-specific geometric hints from minimal reference set.

Result: The framework enables 3D-aware video customization with view consistency, addressing limitations of 2D-centric approaches by providing robust spatial priors for 3D geometry reconstruction without requiring extensive multi-view video datasets.

Conclusion: The proposed approach successfully achieves genuine 3D-aware video customization by separating spatial and temporal learning, overcoming data scarcity and overfitting issues through innovative architectural design and optimization strategies.

Abstract: Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/

[153] Remedying Target-Domain Astigmatism for Cross-Domain Few-Shot Object Detection

Yongwei Jiang, Yixiong Zou, Yuhua Li, Ruixuan Li

Main category: cs.CV

TL;DR: A bio-inspired framework called Fovea-Style Attention Refinement addresses target-domain astigmatism in cross-domain few-shot object detection by reshaping attention patterns using class-specific prototypes and background context modeling.

Motivation: The paper addresses the target-domain astigmatism problem in cross-domain few-shot object detection, where models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions due to domain shifts and data scarcity.

Method: Proposes a center-periphery attention refinement framework with three modules: (1) Positive Pattern Refinement using class-specific prototypes to reshape attention toward semantic objects, (2) Negative Context Modulation to enhance boundary discrimination by modeling background context, and (3) Textual Semantic Alignment to strengthen center-periphery distinction through cross-modal cues.

Result: Experiments on six challenging CD-FSOD benchmarks demonstrate improved detection accuracy and establish new state-of-the-art results.

Conclusion: The bio-inspired approach successfully transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains in cross-domain few-shot object detection.

Abstract: Cross-domain few-shot object detection (CD-FSOD) aims to adapt pretrained detectors from a source domain to target domains with limited annotations, suffering from severe domain shifts and data scarcity problems. In this work, we find a previously overlooked phenomenon: models exhibit dispersed and unfocused attention in target domains, leading to imprecise localization and redundant predictions, much as a human with astigmatism cannot focus on visual objects. We therefore call it the target-domain Astigmatism problem. Analysis of attention distances across transformer layers reveals that regular fine-tuning inherently tends to remedy this problem, but the results are still far from satisfactory; this inherent trend is what we aim to enhance in this paper. Biologically inspired by the human fovea-style visual system, we enhance the fine-tuning’s inherent trend through a center-periphery attention refinement framework, which contains (1) a Positive Pattern Refinement module to reshape attention toward semantic objects using class-specific prototypes, simulating the visual center region; (2) a Negative Context Modulation module to enhance boundary discrimination by modeling background context, simulating the visual periphery region; and (3) a Textual Semantic Alignment module to strengthen center-periphery distinction through cross-modal cues. Our bio-inspired approach transforms astigmatic attention into focused patterns, substantially improving adaptation to target domains. Experiments on six challenging CD-FSOD benchmarks consistently demonstrate improved detection accuracy and establish new state-of-the-art results.

[154] CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

Xiang Chen, Fangfang Yang, Chunlei Meng, Chengyin Hu, Ang Li, Yiwei Wei, Jiahuan Long, Jiujiang Guo

Main category: cs.CV

TL;DR: CoDA framework creates clinically plausible image distribution shifts to test medical vision-language model robustness, revealing vulnerabilities in current models and proposing a repair strategy.

Motivation: Medical vision-language models are increasingly used in clinical workflows, but their reliability under real-world conditions with routine image processing operations remains underexplored. Current robustness evaluations often use clean inputs or isolated corruptions, missing clinically plausible pipeline shifts.

Method: Proposed CoDA (chain-of-distribution) framework constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction/display remapping, and delivery/export degradations. Uses masked structural-similarity constraints to jointly optimize stage compositions and parameters to induce failures while preserving visual plausibility.
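A toy version of such a degradation chain can be composed directly: acquisition-like shading, display gamma remapping, and export quantization applied in sequence. The stage functions and parameters below are illustrative stand-ins; the real framework additionally optimizes stage compositions and parameters under masked structural-similarity constraints, which this sketch does not attempt.

```python
import numpy as np

# Toy chain-of-distribution shift: acquisition-like shading, display
# gamma remapping, and export quantization (hypothetical stand-ins for
# CoDA's stages, without the SSIM-constrained optimization).
def shading(img, strength=0.15):
    h = img.shape[0]
    bias = 1.0 - strength * np.linspace(0, 1, h)[:, None]  # smooth intensity bias
    return np.clip(img * bias, 0, 1)

def gamma_remap(img, gamma=1.3):
    return img ** gamma                                    # display remapping

def export_quantize(img, levels=32):
    return np.round(img * (levels - 1)) / (levels - 1)     # coarse bit depth

def coda_chain(img):
    for stage in (shading, gamma_remap, export_quantize):
        img = stage(img)
    return img

rng = np.random.default_rng(0)
scan = rng.random((16, 16))
shifted = coda_chain(scan)
# The chain shifts image statistics while keeping values in a valid range.
print(shifted.shape, float(shifted.min()) >= 0.0, float(shifted.max()) <= 1.0)
```

Each stage preserves readability on its own; the paper's finding is that their composition degrades MVLMs more than any single stage.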

Result: CoDA substantially degrades zero-shot performance of CLIP-style medical vision-language models across brain MRI, chest X-ray, and abdominal CT. Chained compositions are consistently more damaging than single stages. Proprietary multimodal models show degraded auditing reliability, while medical-specific MLLMs exhibit deficiencies in medical image quality auditing.

Conclusion: The findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment (teacher-guided token-space adaptation with patch-level alignment) improves robustness in deployment.

Abstract: Medical vision–language models (MVLMs) are increasingly used as perceptual backbones in radiology pipelines and as the visual front end of multimodal assistants, yet their reliability under real clinical workflows remains underexplored. Prior robustness evaluations often assume clean, curated inputs or study isolated corruptions, overlooking routine acquisition, reconstruction, display, and delivery operations that preserve clinical readability while shifting image statistics. To address this gap, we propose CoDA, a chain-of-distribution framework that constructs clinically plausible pipeline shifts by composing acquisition-like shading, reconstruction and display remapping, and delivery and export degradations. Under masked structural-similarity constraints, CoDA jointly optimizes stage compositions and parameters to induce failures while preserving visual plausibility. Across brain MRI, chest X-ray, and abdominal CT, CoDA substantially degrades the zero-shot performance of CLIP-style MVLMs, with chained compositions consistently more damaging than any single stage. We also evaluate multimodal large language models (MLLMs) as technical-authenticity auditors of imaging realism and quality rather than pathology. Proprietary multimodal models show degraded auditing reliability and persistent high-confidence errors on CoDA-shifted samples, while the medical-specific MLLMs we test exhibit clear deficiencies in medical image quality auditing. Finally, we introduce a post-hoc repair strategy based on teacher-guided token-space adaptation with patch-level alignment, which improves accuracy on archived CoDA outputs. Overall, our findings characterize a clinically grounded threat surface for MVLM deployment and show that lightweight alignment improves robustness in deployment.

[155] HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin

Main category: cs.CV

TL;DR: HiMu is a training-free framework for long-form video QA that uses an LLM to decompose queries into hierarchical logic trees, then routes atomic predicates to lightweight vision/audio experts, enabling efficient temporal reasoning with fuzzy-logic composition.

Motivation: Long-form video QA requires reasoning over extended temporal contexts, but existing methods face a sharp trade-off: similarity-based selectors are fast but lose compositional structure, while agent-based methods preserve structure but are computationally prohibitive.

Method: Single text-only LLM call decomposes queries into hierarchical logic trees with atomic predicates; each predicate routed to lightweight experts (CLIP, detection, OCR, ASR, CLAP); signals normalized, temporally smoothed, and composed bottom-up through fuzzy-logic operators with temporal sequencing constraints.
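The composition step can be sketched with per-frame predicate scores, temporal smoothing, and fuzzy-logic operators. The min/max (Gödel) operators and the prefix-max encoding of "A then B" below are assumptions for illustration; the paper's exact operators and expert scores are not reproduced here.

```python
import numpy as np

def smooth(scores, k=3):
    """Moving-average temporal smoothing of per-frame predicate scores."""
    return np.convolve(scores, np.ones(k) / k, mode="same")

# Fuzzy-logic composition (standard Gödel t-norm; the paper's exact
# operators are not specified here, so min/max is an assumption).
def f_and(a, b):
    return np.minimum(a, b)

def f_before(a, b):
    # "a, then b": b only counts once a has already been satisfied
    return np.minimum(np.maximum.accumulate(a), b)

# toy per-frame scores from two hypothetical experts over 8 frames,
# e.g. CLIP for "a person appears" and ASR/CLAP for "speech is heard"
person = smooth(np.array([0, 0, .9, .8, .1, 0, 0, 0]))
speech = smooth(np.array([0, 0, 0, 0, .2, .9, .8, 0]))
curve = f_before(person, speech)   # satisfaction curve for "person, then speech"
best_frame = int(curve.argmax())
print(best_frame, curve.round(2))
```

Frames are then selected around the peaks of the composed satisfaction curve, preserving sub-event ordering that a single dense query embedding would collapse.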

Result: HiMu advances efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B outperforms all competing selectors; with GPT-4o surpasses agentic systems at 32-512 frames while requiring ~10x fewer FLOPs on Video-MME, LongVideoBench and HERBench-Lite.

Conclusion: HiMu bridges the gap between fast but simplistic similarity-based selectors and accurate but expensive agent-based methods for long-form video QA, enabling efficient compositional reasoning over multimodal temporal contexts.

Abstract: Long-form video question answering requires reasoning over extended temporal contexts, making frame selection critical for large vision-language models (LVLMs) bound by finite context windows. Existing methods face a sharp trade-off: similarity-based selectors are fast but collapse compositional queries into a single dense vector, losing sub-event ordering and cross-modal bindings; agent-based methods recover this structure through iterative LVLM inference, but at prohibitive cost. We introduce HiMu, a training-free framework that bridges this gap. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (ASR, CLAP). The resulting signals are normalized, temporally smoothed to align different modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, producing a continuous satisfaction curve. Evaluations on Video-MME, LongVideoBench and HERBench-Lite show that HiMu advances the efficiency-accuracy Pareto front: at 16 frames with Qwen3-VL 8B it outperforms all competing selectors, and with GPT-4o it surpasses agentic systems operating at 32-512 frames while requiring roughly 10x fewer FLOPs.

[156] CausalVAD: De-confounding End-to-End Autonomous Driving via Causal Intervention

Jiacheng Tang, Zhiyuan Zhou, Zhuolin He, Jia Zhang, Kai Zhang, Jian Pu

Main category: cs.CV

TL;DR: CausalVAD: A de-confounding training framework for end-to-end driving models that addresses causal confusion by using sparse causal intervention to eliminate spurious correlations from dataset biases.

Motivation: Planning-oriented end-to-end driving models learn statistical correlations instead of true causal relationships, making them vulnerable to causal confusion where they exploit dataset biases as shortcuts, harming reliability and safety in complex scenarios.

Method: Introduces CausalVAD with sparse causal intervention scheme (SCIS), a lightweight plug-and-play module that instantiates backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts and uses it to intervene on the model’s sparse vectorized queries to eliminate spurious associations induced by confounders.
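Backdoor adjustment replaces conditioning on the observed context with marginalization over contexts, P(Y | do(X)) = Σ_c P(Y | X, c) P(c). A minimal sketch of a prototype-dictionary intervention in that spirit is below; the residual form, uniform prior, and similarity weighting are assumptions, not SCIS's actual parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def backdoor_intervene(query, prototypes, prior=None):
    """Backdoor-adjustment sketch: marginalize the query's context over a
    prototype dictionary with a fixed prior P(c), approximating
    sum_c P(y | x, c) P(c) instead of conditioning on the observed context.
    (Hypothetical form; SCIS's exact design is not reproduced here.)"""
    n = len(prototypes)
    prior = np.full(n, 1.0 / n) if prior is None else prior
    sim = softmax(query @ prototypes.T)      # affinity to each latent context
    adjusted = (prior * sim) @ prototypes    # sum_c P(c) * context response
    return query + adjusted                  # residual de-confounded query

rng = np.random.default_rng(1)
q = rng.standard_normal(8)                   # a sparse vectorized planning query
protos = rng.standard_normal((5, 8))         # dictionary of 5 driving contexts
dq = backdoor_intervene(q, protos)
print(dq.shape)
```

Because the prior is fixed rather than data-dependent, contexts that are over-represented in the training set no longer dominate the query's representation.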

Result: Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety, with superior robustness against both data bias and noisy scenarios configured to induce causal confusion.

Conclusion: CausalVAD effectively addresses causal confusion in end-to-end driving models through causal intervention, improving reliability and safety by eliminating spurious correlations from dataset biases.

Abstract: Planning-oriented end-to-end driving models show great promise, yet they fundamentally learn statistical correlations instead of true causal relationships. This vulnerability leads to causal confusion, where models exploit dataset biases as shortcuts, critically harming their reliability and safety in complex scenarios. To address this, we introduce CausalVAD, a de-confounding training framework that leverages causal intervention. At its core, we design the sparse causal intervention scheme (SCIS), a lightweight, plug-and-play module to instantiate the backdoor adjustment theory in neural networks. SCIS constructs a dictionary of prototypes representing latent driving contexts. It then uses this dictionary to intervene on the model’s sparse vectorized queries. This step actively eliminates spurious associations induced by confounders, thereby removing spurious factors from the representations used by downstream tasks. Extensive experiments on benchmarks like nuScenes show CausalVAD achieves state-of-the-art planning accuracy and safety. Furthermore, our method demonstrates superior robustness against both data bias and noisy scenarios configured to induce causal confusion.

[157] HAViT: Historical Attention Vision Transformer

Swarnendu Banik, Manish Das, Shiv Ram Dubey, Satish Kumar Singh

Main category: cs.CV

TL;DR: Cross-layer attention propagation method for Vision Transformers that preserves and integrates historical attention matrices across encoder layers to improve information flow and feature learning.

Motivation: Standard Vision Transformers have attention mechanisms that operate independently across layers, limiting information flow and feature learning. There is a need for better inter-layer communication to enhance attention pattern refinement throughout the transformer hierarchy.

Method: Proposes a cross-layer attention propagation method that stores and blends historical attention matrices across encoder layers. Uses minimal architectural changes - only adds attention matrix storage and blending operations. The method blends current attention with historical attention using a hyperparameter alpha (optimal value = 0.45).
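The blending step can be sketched directly: compute the current layer's attention matrix, mix it with the stored matrix from the previous layer, and pass the mixed matrix both to the value projection and onward as the new history. The convex-combination form below is an assumption consistent with the reported alpha = 0.45 "balance between current and historical attention".

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def blended_attention(q, k, v, hist_attn, alpha=0.45):
    """Blend the current attention map with the previous layer's
    (hypothetical convex form; the paper states only that current and
    historical attention are mixed via a hyperparameter alpha)."""
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))            # (T, T) current attention
    mixed = alpha * hist_attn + (1 - alpha) * attn  # cross-layer blend
    return mixed @ v, mixed                         # output + matrix to store

# toy run: 4 tokens, dim 8, uniform attention carried over from a prior layer
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
hist = np.full((4, 4), 0.25)
out, new_hist = blended_attention(q, k, v, hist)
print(out.shape, np.allclose(new_hist.sum(axis=-1), 1.0))
```

Since both inputs are row-stochastic, the blend remains a valid attention distribution, which is why the change needs no further normalization.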

Result: Consistent accuracy improvements: ViT performance increased from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Random initialization outperforms zero initialization.

Conclusion: The proposed cross-layer attention propagation method effectively enhances Vision Transformer performance by improving inter-layer information flow. It offers a principled refinement of attention patterns throughout the transformer hierarchy with minimal architectural changes.

Abstract: Vision Transformers have excelled in computer vision but their attention mechanisms operate independently across layers, limiting information flow and feature learning. We propose an effective cross-layer attention propagation method that preserves and integrates historical attention matrices across encoder layers, offering a principled refinement of inter-layer information flow in Vision Transformers. This approach enables progressive refinement of attention patterns throughout the transformer hierarchy, enhancing feature acquisition and optimization dynamics. The method requires minimal architectural changes, adding only attention matrix storage and blending operations. Comprehensive experiments on CIFAR-100 and TinyImageNet demonstrate consistent accuracy improvements, with ViT performance increasing from 75.74% to 77.07% on CIFAR-100 (+1.33%) and from 57.82% to 59.07% on TinyImageNet (+1.25%). Cross-architecture validation shows similar gains across transformer variants, with CaiT showing 1.01% enhancement. Systematic analysis identifies the blending hyperparameter of historical attention (alpha = 0.45) as optimal across all configurations, providing the ideal balance between current and historical attention information. Random initialization consistently outperforms zero initialization, indicating that diverse initial attention patterns accelerate convergence and improve final performance. Our code is publicly available at https://github.com/banik-s/HAViT.

[158] Color image restoration based on nonlocal saturation-value similarity

Wei Wang, Yakun Li

Main category: cs.CV

TL;DR: A novel nonlocal variational method for color image restoration using saturation-value similarity instead of traditional RGB channel similarity.

Motivation: Traditional nonlocal methods for color image restoration extract patches from RGB channels directly, which fails to capture fine color information because patch similarity is based on grayscale values of independent channels rather than perceptual color properties.

Method: Proposes saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into nonlocal gradients. Formulates variational models based on this and solves them using the Bregmanized operator splitting method, with a convergence analysis.
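The core similarity can be sketched by extracting the saturation and value channels and comparing patches there instead of per RGB channel. The Gaussian weight and the specific distance below are a simplified stand-in for the paper's similarity term.

```python
import numpy as np

def saturation_value(rgb):
    """Per-pixel saturation and value for an (H, W, 3) RGB image in [0, 1]."""
    v = rgb.max(axis=-1)
    s = np.where(v > 0, (v - rgb.min(axis=-1)) / np.maximum(v, 1e-12), 0.0)
    return s, v

def sv_patch_weight(rgb, p, q, size=3, h=0.1):
    """Nonlocal weight between patches centered at p and q, computed from
    saturation-value distance rather than per-channel grayscale values
    (a simplified stand-in for the paper's similarity measure)."""
    s, v = saturation_value(rgb)
    r = size // 2
    def patch(ch, c):
        return ch[c[0]-r:c[0]+r+1, c[1]-r:c[1]+r+1]
    d2 = np.mean((patch(s, p) - patch(s, q))**2 + (patch(v, p) - patch(v, q))**2)
    return np.exp(-d2 / h**2)

img = np.zeros((8, 8, 3))
img[:, :4] = [1.0, 0.2, 0.2]   # saturated red half
img[:, 4:] = [0.9, 0.9, 0.9]   # near-white half
w_same = sv_patch_weight(img, (2, 1), (5, 1))   # both patches in the red region
w_diff = sv_patch_weight(img, (2, 1), (2, 6))   # red vs. near-white
print(w_same > w_diff)
```

These weights then enter the nonlocal gradients, so that patches similar in saturation and value, rather than in raw channel intensity, reinforce each other during restoration.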

Result: Numerical examples show the proposed models outperform other testing methods in visual quality and quantitative metrics including PSNR, SSIM, QSSIM, and S-CIELAB color error.

Conclusion: The saturation-value similarity based nonlocal variational approach effectively improves color image restoration by better capturing perceptual color information through saturation-value channel similarity rather than traditional RGB channel similarity.

Abstract: In this paper, we propose and develop a novel nonlocal variational technique based on saturation-value similarity for color image restoration. In traditional nonlocal methods, image patches are extracted from red, green and blue channels of a color image directly, and the color information cannot be described accurately because the patch similarity is mainly based on the grayscale values of independent channels. The main aim of this paper is to propose and develop a novel nonlocal regularization method by considering the similarity of image patches in the saturation-value channels of a color image. In particular, we first establish saturation-value similarity based nonlocal total variation by incorporating saturation-value similarity of color image patches into the proposed nonlocal gradients, which can describe the saturation and value similarity of two adjacent color image patches. The proposed nonlocal variational models are then formulated based on saturation-value similarity based nonlocal total variation. Moreover, we design an effective and efficient algorithm to solve the proposed optimization problem numerically by employing the Bregmanized operator splitting method, and we also study the convergence of the proposed algorithms. Numerical examples are presented to demonstrate that the performance of the proposed models is better than that of other testing methods in terms of visual quality and some quantitative metrics including peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), quaternion structural similarity index (QSSIM) and S-CIELAB color error.

[159] myMNIST: Benchmark of PETNN, KAN, and Classical Deep Learning Models for Burmese Handwritten Digit Recognition

Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

Main category: cs.CV

TL;DR: Systematic benchmark of 11 architectures on myMNIST Burmese handwritten digit dataset shows CNN achieves best performance (F1=0.9959, Accuracy=0.9970), with PETNN (GELU) close behind, outperforming LSTM, GRU, Transformer, and KAN variants.

DetailsMotivation: To establish reproducible baselines for the myMNIST Burmese handwritten digit dataset across diverse modeling paradigms, facilitate future research on Myanmar digit recognition, and encourage broader evaluation of emerging architectures on regional scripts.

Method: Evaluated 11 architectures including classical deep learning models (MLP, CNN, LSTM, GRU, Transformer), recent alternatives (FastKAN, EfficientKAN), energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU) using Precision, Recall, F1-Score, and Accuracy metrics.

Result: CNN achieved best overall scores (F1=0.9959, Accuracy=0.9970), PETNN (GELU) closely followed (F1=0.9955, Accuracy=0.9966), JEM performed competitively (F1=0.9944, Accuracy=0.9958), while KAN-based models trailed top performers (Accuracy ~0.992).

Conclusion: The benchmark establishes reproducible baselines for myMNIST, highlights PETNN’s strong performance relative to classical and Transformer models, and quantifies the gap between energy-inspired PETNNs and true energy-based models like JEM.

Abstract: We present the first systematic benchmark on myMNIST (formerly BHDD), a publicly available Burmese handwritten digit dataset important for Myanmar NLP/AI research. We evaluate eleven architectures spanning classical deep learning models (Multi-Layer Perceptron, Convolutional Neural Network, Long Short-Term Memory, Gated Recurrent Unit, Transformer), recent alternatives (FastKAN, EfficientKAN), an energy-based model (JEM), and physics-inspired PETNN variants (Sigmoid, GELU, SiLU). Using Precision, Recall, F1-Score, and Accuracy as evaluation metrics, our results show that the CNN remains a strong baseline, achieving the best overall scores (F1 = 0.9959, Accuracy = 0.9970). The PETNN (GELU) model closely follows (F1 = 0.9955, Accuracy = 0.9966), outperforming LSTM, GRU, Transformer, and KAN variants. JEM, representing energy-based modeling, performs competitively (F1 = 0.9944, Accuracy = 0.9958). KAN-based models (FastKAN, EfficientKAN) trail the top performers but provide a meaningful alternative baseline (Accuracy ~0.992). These findings (i) establish reproducible baselines for myMNIST across diverse modeling paradigms, (ii) highlight PETNN’s strong performance relative to classical and Transformer-based models, and (iii) quantify the gap between energy-inspired PETNNs and a true energy-based model (JEM). We release this benchmark to facilitate future research on Myanmar digit recognition and to encourage broader evaluation of emerging architectures on regional scripts.
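The four reported metrics can be reproduced from raw predictions in a few lines. Macro averaging over the ten digit classes is assumed here (with F1 computed from the macro-averaged precision and recall); the abstract does not state the paper's averaging scheme:

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Precision, Recall, F1-Score, and Accuracy, the metrics used in the
    benchmark. Macro averaging over classes is an assumption."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    p, r = float(np.mean(precisions)), float(np.mean(recalls))
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    acc = float(np.mean(y_true == y_pred))
    return p, r, f1, acc
```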

[160] Complementary Text-Guided Attention for Zero-Shot Adversarial Robustness

Lu Yu, Haiyang Zhang, Changsheng Xu

Main category: cs.CV

TL;DR: TGA-ZSR and Comp-TGA improve CLIP’s adversarial robustness by using text-guided attention mechanisms, achieving 9.58% and 11.95% improvements in zero-shot robust accuracy across 16 datasets.

DetailsMotivation: CLIP models are vulnerable to adversarial attacks despite their strong zero-shot capabilities. The authors observed that adversarial perturbations cause shifts in text-guided attention, motivating the development of attention-based defense mechanisms.

Method: Proposes TGA-ZSR with Local Attention Refinement and Global Attention Constraint modules to maintain CLIP’s generalization while enhancing robustness. Further introduces Comp-TGA that integrates complementary attention from class prompts and non-class prompts for more comprehensive foreground representation.

Result: Achieves 9.58% improvement with TGA-ZSR and 11.95% improvement with Comp-TGA in zero-shot robust accuracy over state-of-the-art techniques across 16 datasets.

Conclusion: Text-guided attention mechanisms effectively improve CLIP’s adversarial robustness while maintaining its zero-shot capabilities, with complementary attention providing superior performance.

Abstract: Due to their impressive zero-shot capabilities, pre-trained vision-language models (e.g., CLIP) have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Local Attention Refinement Module and the Global Attention Constraint Module. Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness. Additionally, the Global Attention Constraint Module acquires text-guided attention from both the target and original models using clean examples. Its objective is to maintain model performance on clean samples while enhancing overall robustness. However, we observe that the method occasionally focuses on irrelevant or spurious features, which can lead to suboptimal performance and undermine its robustness in certain scenarios. To overcome this limitation, we further propose a novel approach called Complementary Text-Guided Attention (Comp-TGA). This method integrates two types of foreground attention: attention guided by the class prompt and reversed attention driven by the non-class prompt. These complementary attention mechanisms allow the model to capture a more comprehensive and accurate representation of the foreground. The experiments validate that TGA-ZSR and Comp-TGA yield 9.58% and 11.95% improvements, respectively, in zero-shot robust accuracy over current state-of-the-art techniques across 16 datasets.
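A minimal sketch of the complementary-attention idea: foreground evidence from the class prompt is combined with the *reversed* attention of a non-class prompt (regions the non-class prompt ignores are likely foreground). The simple averaging fusion below is an assumption; the paper's actual mechanism operates inside CLIP's attention layers:

```python
import numpy as np

def complementary_foreground_attention(attn_class, attn_nonclass):
    """Fuse two foreground cues as described for Comp-TGA: attention from
    the class prompt, plus reversed attention from a non-class prompt.
    The averaging fusion and final normalization are illustrative."""
    fg_from_class = attn_class
    fg_from_nonclass = 1.0 - attn_nonclass       # reverse the non-class map
    fused = 0.5 * (fg_from_class + fg_from_nonclass)
    return fused / (fused.max() + 1e-12)         # normalize to [0, 1]
```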

[161] SJD-PAC: Accelerating Speculative Jacobi Decoding via Proactive Drafting and Adaptive Continuation

Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Main category: cs.CV

TL;DR: SJD-PAC enhances speculative Jacobi decoding for text-to-image generation by improving draft acceptance rates in high-entropy regions through proactive drafting and adaptive continuation, achieving 3.8× speedup with lossless quality.

DetailsMotivation: Current speculative Jacobi decoding for text-to-image synthesis suffers from low draft-token acceptance rates in complex, high-entropy visual regions, creating a throughput bottleneck that limits acceleration potential.

Method: SJD-PAC introduces two key optimizations: 1) proactive drafting strategy to improve local acceptance rates in challenging high-entropy regions, and 2) adaptive continuation mechanism that sustains sequence validation after initial rejection without requiring full resampling.

Result: Experiments on standard text-to-image benchmarks show SJD-PAC achieves 3.8× speedup while maintaining lossless image quality, significantly increasing average acceptance length per step compared to baseline SJD.

Conclusion: SJD-PAC effectively addresses the draft acceptance bottleneck in speculative decoding for visual generation, providing substantial acceleration while preserving target distribution, making it a promising approach for efficient text-to-image synthesis.

Abstract: Speculative Jacobi Decoding (SJD) offers a draft-model-free approach to accelerate autoregressive text-to-image synthesis. However, the high-entropy nature of visual generation yields low draft-token acceptance rates in complex regions, creating a bottleneck that severely limits overall throughput. To overcome this, we introduce SJD-PAC, an enhanced SJD framework. First, SJD-PAC employs a proactive drafting strategy to improve local acceptance rates in these challenging high-entropy regions. Second, we introduce an adaptive continuation mechanism that sustains sequence validation after an initial rejection, bypassing the need for full resampling. Working in tandem, these optimizations significantly increase the average acceptance length per step, boosting inference speed while strictly preserving the target distribution. Experiments on standard text-to-image benchmarks demonstrate that SJD-PAC achieves a 3.8× speedup with lossless image quality.
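A toy view of the verification step makes the adaptive-continuation idea concrete. Here `draft` holds the parallel Jacobi guesses and `verified` the tokens the target model would emit at each position; standard SJD commits the longest matching prefix plus the corrected token, while the continuation variant sketched below reuses the post-rejection verified tokens as the next draft instead of resampling from scratch. The paper's exact mechanism is assumed, not reproduced:

```python
def jacobi_accept(draft, verified, adaptive=True):
    """Toy acceptance step for speculative Jacobi decoding. Returns the
    tokens committed this step and the draft carried into the next step."""
    accepted = 0
    while accepted < len(draft) and draft[accepted] == verified[accepted]:
        accepted += 1
    if accepted == len(draft):                   # every guess was accepted
        return list(verified), []
    committed = list(verified[: accepted + 1])   # mismatch gets verified token
    next_draft = list(verified[accepted + 1:]) if adaptive else []
    return committed, next_draft
```

With `adaptive=False` the post-rejection suffix is discarded, mirroring full resampling in baseline SJD.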

[162] Improving Joint Audio-Video Generation with Cross-Modal Context Learning

Bingqi Ma, Linlong Lang, Ming Zhang, Dailan He, Xingtong Ge, Yi Zhang, Guanglu Song, Yu Liu

Main category: cs.CV

TL;DR: Proposes Cross-Modal Context Learning (CCL) to address limitations in dual-stream transformer audio-video generation, improving temporal alignment, cross-modal interactions, and classifier-free guidance consistency.

DetailsMotivation: Current dual-stream transformer audio-video generation methods have limitations including model manifold variations from gating mechanisms, biases in multi-modal background regions from cross-modal attention, inconsistencies in multi-modal classifier-free guidance during training/inference, and conflicts between multiple conditions.

Method: Proposes Cross-Modal Context Learning (CCL) with several modules: Temporally Aligned RoPE and Partitioning (TARP) for better audio-video temporal alignment; Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in Cross-Modal Context Attention (CCA) for stable unconditional anchors; and Unconditional Context Guidance (UCG) during inference for improved classifier-free guidance consistency.

Result: CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources, demonstrating improved audio-video generation quality and efficiency.

Conclusion: The proposed CCL framework effectively addresses key limitations in current dual-stream transformer audio-video generation methods, improving temporal alignment, cross-modal interactions, and training-inference consistency while maintaining resource efficiency.

Abstract: The dual-stream transformer architecture-based joint audio-video generation method has become the dominant paradigm in current research. By incorporating pre-trained video diffusion models and audio diffusion models, along with a cross-modal interaction attention module, high-quality, temporally synchronized audio-video content can be generated with minimal training data. In this paper, we first revisit the dual-stream transformer paradigm and further analyze its limitations, including model manifold variations caused by the gating mechanism controlling cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, and the inconsistencies in multi-modal classifier-free guidance (CFG) during training and inference, as well as conflicts between multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) effectively enhances the temporal alignment between audio latent and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information, while dynamically routing based on different training tasks, further enhancing the model’s convergence speed and generation quality. During inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to facilitate different forms of CFG, improving train-inference consistency and further alleviating conflicts. Through comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
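The CFG combination at the heart of the train-inference consistency issue is the standard formula below; in CCL, the unconditional branch is anchored by the Learnable Context Tokens rather than a null prompt, but that wiring is model-specific and only the generic formula is shown:

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one by `guidance_scale`."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```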

[163] Benchmarking CNN-based Models against Transformer-based Models for Abdominal Multi-Organ Segmentation on the RATIC Dataset

Lukas Bayer, Sheethal Bhat, Andreas Maier

Main category: cs.CV

TL;DR: CNN-based SegResNet outperforms hybrid transformer models (UNETR, SwinUNETR, UNETR++) for multi-organ segmentation on heterogeneous CT data, suggesting CNNs remain competitive for small-to-medium datasets.

DetailsMotivation: To systematically benchmark transformer-based models against CNN baselines for volumetric multi-organ segmentation in abdominal CT scans, given the recent attention on transformers for medical image segmentation.

Method: Benchmarked three hybrid transformer models (UNETR, SwinUNETR, UNETR++) against CNN baseline (SegResNet) on RATIC dataset (206 CT scans from 23 institutions). All models trained/evaluated under identical conditions using Dice Similarity Coefficient.

Result: CNN-based SegResNet achieved highest overall performance, outperforming all transformer-based models across all organs. UNETR++ was most competitive among transformers, while UNETR showed faster convergence with fewer training iterations.

Conclusion: For small-to-medium heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs, despite transformers’ ability to model long-range dependencies.

Abstract: Accurate multi-organ segmentation in abdominal CT scans is essential for computer-aided diagnosis and treatment. While convolutional neural networks (CNNs) have long been the standard approach in medical image segmentation, transformer-based architectures have recently gained attention due to their ability to model long-range dependencies. In this study, we systematically benchmark the three hybrid transformer-based models UNETR, SwinUNETR, and UNETR++ against a strong CNN baseline, SegResNet, for volumetric multi-organ segmentation on the heterogeneous RATIC dataset. The dataset comprises 206 annotated CT scans from 23 institutions worldwide, covering five abdominal organs. All models were trained and evaluated under identical preprocessing and training conditions using the Dice Similarity Coefficient (DSC) as the primary metric. The results show that the CNN-based SegResNet achieves the highest overall performance, outperforming all hybrid transformer-based models across all organs. Among the transformer-based approaches, UNETR++ delivers the most competitive results, while UNETR demonstrates notably faster convergence with fewer training iterations. These findings suggest that, for small- to medium-sized heterogeneous datasets, well-optimized CNN architectures remain highly competitive and may outperform hybrid transformer-based designs.
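The primary metric, the Dice Similarity Coefficient, is straightforward to compute for a binary organ mask (2|A∩B| / (|A| + |B|), with a small epsilon for empty masks):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-6):
    """Dice Similarity Coefficient (DSC) between two binary volumetric
    masks, the primary metric in the benchmark."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)
```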

[164] OpenT2M: No-frill Motion Generation with Open-source, Large-scale, High-quality Data

Bin Cao, Sipeng Zheng, Hao Luo, Boyuan Li, Jing Liu, Zongqing Lu

Main category: cs.CV

TL;DR: OpenT2M introduces a million-level, high-quality motion dataset with detailed text annotations, and MonoFrill is a pretrained motion model using novel 2D-PRQ tokenizer for text-to-motion generation.

DetailsMotivation: Current text-to-motion (T2M) models perform poorly on unseen text descriptions due to small-scale, limited-diversity motion datasets. There's a need for larger, higher-quality datasets and better models to improve generalization.

Method: 1) Created OpenT2M dataset with over 2800 hours of human motion with rigorous quality control and detailed text annotations; 2) Developed automated pipeline for long-horizon sequences; 3) Built MonoFrill model with 2D-PRQ tokenizer that captures spatiotemporal dependencies by dividing human body into biological parts.

Result: OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. The dataset and model address longstanding data quality and benchmarking challenges.

Conclusion: OpenT2M and MonoFrill advance the T2M field by providing large-scale, high-quality data and effective modeling approach, enabling better text-to-motion generation for animation and robotics applications.

Abstract: Text-to-motion (T2M) generation aims to create realistic human movements from text descriptions, with promising applications in animation and robotics. Despite recent progress, current T2M models perform poorly on unseen text descriptions due to the small scale and limited diversity of existing motion datasets. To address this problem, we introduce OpenT2M, a million-level, high-quality, and open-source motion dataset containing over 2800 hours of human motion. Each sequence undergoes rigorous quality control through physical feasibility validation and multi-granularity filtering, with detailed second-wise text annotations. We also develop an automated pipeline for creating long-horizon sequences, enabling complex motion generation. Building upon OpenT2M, we introduce MonoFrill, a pretrained motion model that achieves compelling T2M results without complicated designs or technical tricks as “frills”. Its core component is 2D-PRQ, a novel motion tokenizer that captures spatiotemporal dependencies by dividing the human body into biological parts. Experiments show that OpenT2M significantly improves generalization of existing T2M models, while 2D-PRQ achieves superior reconstruction and strong zero-shot performance. We expect OpenT2M and MonoFrill will advance the T2M field by addressing longstanding data quality and benchmarking challenges.
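The part-wise decomposition behind the 2D-PRQ tokenizer can be illustrated as below. The joint-to-part grouping is a hypothetical SMPL-style assignment of 22 joints, not the paper's actual partition:

```python
import numpy as np

# Hypothetical grouping of 22 SMPL-style joints into body parts; the
# partition actually used by 2D-PRQ is not specified in the abstract.
BODY_PARTS = {
    "torso": [0, 3, 6, 9, 12, 15],
    "left_arm": [13, 16, 18, 20],
    "right_arm": [14, 17, 19, 21],
    "left_leg": [1, 4, 7, 10],
    "right_leg": [2, 5, 8, 11],
}

def partition_motion(motion):
    """Split a motion tensor (T, J, D) into per-part streams, the first step
    a part-wise tokenizer would take before quantizing each part over time."""
    return {name: motion[:, idx, :] for name, idx in BODY_PARTS.items()}
```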

[165] GenVideoLens: Where LVLMs Fall Short in AI-Generated Video Detection?

Yueying Zou, Pei Pei Li, Zekun Li, Xinyu Guo, Xing Cui, Huaibo Huang, Ran He

Main category: cs.CV

TL;DR: GenVideoLens is a fine-grained benchmark for evaluating Large Vision-Language Models’ capabilities in detecting AI-generated videos across 15 authenticity dimensions, revealing dimensional imbalances and temporal reasoning limitations.

DetailsMotivation: Existing evaluation protocols for AI-generated video detection treat it as binary classification with coarse metrics, providing limited insight into where LVLMs succeed or fail. There's a need for fine-grained evaluation to understand model capabilities across different authenticity dimensions.

Method: Created GenVideoLens benchmark with 400 AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. Evaluated 11 representative LVLMs on this benchmark and conducted temporal perturbation experiments.

Result: LVLMs show pronounced dimensional imbalance: perform relatively well on perceptual cues but struggle with optical consistency, physical interactions, and temporal-causal reasoning. Performance varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific cues. Temporal perturbation experiments show limited use of temporal information.

Conclusion: GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps in AI-generated video detection. The benchmark offers guidance for improving future detection systems by highlighting specific areas needing improvement, particularly in temporal reasoning and physical/optical consistency.

Abstract: In recent years, AI-generated videos have become increasingly realistic and sophisticated. Meanwhile, Large Vision-Language Models (LVLMs) have shown strong potential for detecting such content. However, existing evaluation protocols largely treat the task as a binary classification problem and rely on coarse-grained metrics such as overall accuracy, providing limited insight into where LVLMs succeed or fail. To address this limitation, we introduce GenVideoLens, a fine-grained benchmark that enables dimension-wise evaluation of LVLM capabilities in AI-generated video detection. The benchmark contains 400 highly deceptive AI-generated videos and 100 real videos, annotated by experts across 15 authenticity dimensions covering perceptual, optical, physical, and temporal cues. We evaluate eleven representative LVLMs on this benchmark. Our analysis reveals a pronounced dimensional imbalance. While LVLMs perform relatively well on perceptual cues, they struggle with optical consistency, physical interactions, and temporal-causal reasoning. Model performance also varies substantially across dimensions, with smaller open-source models sometimes outperforming stronger proprietary models on specific authenticity cues. Temporal perturbation experiments further show that current LVLMs make limited use of temporal information. Overall, GenVideoLens provides diagnostic insights into LVLM behavior, revealing key capability gaps and offering guidance for improving future AI-generated video detection systems.

[166] GEAR: Geography-knowledge Enhanced Analog Recognition Framework in Extreme Environments

Zelin Liu, Bocheng Li, Yuling Zhou, Xuanting Li, Yixuan Yang, Jing Wang, Weishu Zhao, Xiaofeng Gao

Main category: cs.CV

TL;DR: GEAR framework for cross-domain topographic similarity retrieval between Mariana Trench and Qinghai-Tibet Plateau using geography-enhanced three-stage pipeline with graph neural networks.

DetailsMotivation: Deep-sea biological sampling is prohibitively expensive, so finding terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is valuable for biological research. Existing models fail to adequately address cross-domain topographic similarity retrieval due to lack of geographical knowledge or computational inefficiency.

Method: Three-stage GEAR framework: (1) Skeleton-guided screening and clipping based on size and linear morphology, (2) Physics-aware filtering using Topographic Waveform Comparator and Morphological Texture Module, (3) Graph-based fine recognition using Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics.

Result: MSG-Net achieved an F1-Score 1.38 percentage points higher than the state-of-the-art baseline. Features extracted by MSG-Net showed a significant correlation with biological data, providing evidence for future biological analysis. The framework effectively retrieves analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau.

Conclusion: GEAR framework successfully addresses cross-domain topographic similarity retrieval challenges by incorporating geographical knowledge while maintaining computational efficiency. The method enables discovery of terrestrial analogs for expensive deep-sea environments, with potential applications in biological research.

Abstract: The Mariana Trench and the Qinghai-Tibet Plateau exhibit significant similarities in geological origins and microbial metabolic functions. Given that deep-sea biological sampling faces prohibitive costs, recognizing structurally homologous terrestrial analogs of the Mariana Trench on the Qinghai-Tibet Plateau is of great significance. Yet, no existing model adequately addresses cross-domain topographic similarity retrieval, either neglecting geographical knowledge or sacrificing computational efficiency. To address these challenges, we present the Geography-knowledge Enhanced Analog Recognition (GEAR) Framework, a three-stage pipeline designed to efficiently retrieve analogs from 2.5 million square kilometers of the Qinghai-Tibet Plateau: (1) Skeleton-guided Screening and Clipping: recognition of candidate valleys and initial screening based on size and linear morphological criteria. (2) Physics-aware Filtering: the Topographic Waveform Comparator (TWC) and Morphological Texture Module (MTM) evaluate the waveform and texture and filter out inconsistent candidate valleys. (3) Graph-based Fine Recognition: we design a Morphology-integrated Siamese Graph Network (MSG-Net) based on geomorphological metrics. Correspondingly, we release an expert-annotated topographic similarity dataset targeting tectonic collision zones. Experiments demonstrate the effectiveness of every stage. Moreover, MSG-Net achieved an F1-Score 1.38 percentage points higher than the SOTA baseline. Using features extracted by MSG-Net, we discovered a significant correlation with biological data, providing evidence for future biological analysis.
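The three-stage cascade can be mirrored in a toy pipeline; the size predicate, physics filter, and graph score below are stand-ins for the paper's skeleton-guided screening, TWC/MTM filtering, and MSG-Net ranking:

```python
def gear_pipeline(candidates, size_ok, physics_ok, graph_score, top_k=3):
    """Toy three-stage cascade in the spirit of GEAR: coarse screening,
    physics-aware filtering, then fine ranking by a similarity score.
    All three predicates/scorers are hypothetical stand-ins."""
    stage1 = [c for c in candidates if size_ok(c)]      # skeleton screening
    stage2 = [c for c in stage1 if physics_ok(c)]       # physics filtering
    return sorted(stage2, key=graph_score, reverse=True)[:top_k]
```

The point of the cascade is that cheap filters shrink the candidate set before the expensive graph model runs.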

[167] SwiftGS: Episodic Priors for Immediate Satellite Surface Recovery

Rong Fu, Jiekai Wu, Haiyun Wei, Xiaowen Ma, Shiyin Lin, Kangan Qian, Chuang Liu, Jianyuan Ni, Simon James Fong

Main category: cs.CV

TL;DR: SwiftGS is a meta-learned system for rapid 3D reconstruction from multi-date satellite imagery using geometry-radiation-decoupled Gaussian primitives and lightweight SDF, enabling single-pass inference without per-scene optimization.

DetailsMotivation: Large-scale 3D reconstruction from satellite imagery is challenging due to illumination changes, sensor heterogeneity, and computational costs of per-scene optimization. Current methods require expensive fitting for each scene, limiting scalability for environmental monitoring, urban planning, and disaster response applications.

Method: Uses meta-learned episodic training to capture transferable priors. Combines geometry-radiation-decoupled Gaussian primitives with lightweight SDF representation. Features differentiable physics graph for projection, illumination, and sensor response, spatial gating to blend sparse Gaussian detail with global SDF structure, semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from frozen geometric teacher with uncertainty-aware multi-task loss.

Result: Achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost. Operates zero-shot with optional compact calibration. Ablations demonstrate benefits of hybrid representation, physics-aware rendering, and episodic meta-training.

Conclusion: SwiftGS enables rapid, large-scale 3D reconstruction from multi-date satellite imagery through meta-learned priors and hybrid representation, overcoming challenges of illumination changes and sensor heterogeneity while reducing computational requirements.

Abstract: Rapid, large-scale 3D reconstruction from multi-date satellite imagery is vital for environmental monitoring, urban planning, and disaster response, yet remains difficult due to illumination changes, sensor heterogeneity, and the cost of per-scene optimization. We introduce SwiftGS, a meta-learned system that reconstructs 3D surfaces in a single forward pass by predicting geometry-radiation-decoupled Gaussian primitives together with a lightweight SDF, replacing expensive per-scene fitting with episodic training that captures transferable priors. The model couples a differentiable physics graph for projection, illumination, and sensor response with spatial gating that blends sparse Gaussian detail and global SDF structure, and incorporates semantic-geometric fusion, conditional lightweight task heads, and multi-view supervision from a frozen geometric teacher under an uncertainty-aware multi-task loss. At inference, SwiftGS operates zero-shot with optional compact calibration and achieves accurate DSM reconstruction and view-consistent rendering at significantly reduced computational cost, with ablations highlighting the benefits of the hybrid representation, physics-aware rendering, and episodic meta-training.
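The spatial gating that blends the two representations reduces to a per-location convex combination; in SwiftGS the gate would be predicted by the network, while in this sketch it is simply an input:

```python
import numpy as np

def gated_surface_blend(gauss_detail, sdf_structure, gate):
    """Spatial gating as described for SwiftGS: a per-location gate in
    [0, 1] blends sparse Gaussian detail with global SDF structure."""
    gate = np.clip(gate, 0.0, 1.0)
    return gate * gauss_detail + (1.0 - gate) * sdf_structure
```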

[168] Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering

Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, Jianxin Li

Main category: cs.CV

TL;DR: SVOO is a training-free sparse attention framework for efficient video diffusion transformers that uses offline layer-wise sparsity profiling and online bidirectional co-clustering to achieve better quality-speedup trade-offs.

DetailsMotivation: Existing sparse attention methods for video generation ignore layer heterogeneity in attention pruning and query-key coupling in block partitioning, limiting their quality-speedup trade-off. The authors discovered that attention sparsity is an intrinsic layer property that remains consistent across different inputs.

Method: Two-stage approach: 1) Offline layer-wise sensitivity profiling to determine intrinsic per-layer pruning levels, and 2) Online block-wise sparse attention using a novel bidirectional co-clustering algorithm that considers both query and key dimensions simultaneously.

Result: Extensive experiments on seven video generation models show SVOO achieves superior quality-speedup trade-off over SOTA methods, delivering up to 1.93× speedup while maintaining PSNR up to 29 dB on Wan2.1 benchmark.

Conclusion: SVOO effectively addresses limitations of existing sparse attention methods by considering layer heterogeneity and query-key coupling, enabling more efficient video generation without compromising quality.

Abstract: Diffusion Transformers (DiTs) achieve strong video generation quality but suffer from high inference cost due to dense 3D attention, leading to the development of sparse attention technologies to improve efficiency. However, existing training-free sparse attention methods in video generation still face two unresolved limitations: ignoring layer heterogeneity in attention pruning and ignoring query-key coupling in block partitioning, which hinder a better quality-speedup trade-off. In this work, we uncover a critical insight that the attention sparsity of each layer is an intrinsic property, varying only minimally across different inputs. Motivated by this, we propose SVOO, a training-free Sparse attention framework for fast Video generation via Offline layer-wise sparsity profiling and Online bidirectional co-clustering. Specifically, SVOO adopts a two-stage paradigm: (i) offline layer-wise sensitivity profiling to derive intrinsic per-layer pruning levels, and (ii) online block-wise sparse attention via a novel bidirectional co-clustering algorithm. Extensive experiments on seven widely used video generation models demonstrate that SVOO achieves a superior quality-speedup trade-off over state-of-the-art methods, delivering up to 1.93× speedup while maintaining a PSNR of up to 29 dB on Wan2.1.
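The offline profiling stage can be sketched as measuring, per layer, how concentrated the attention mass is: a layer whose attention is peaked can be pruned aggressively. The energy-retention criterion below is an illustrative stand-in for SVOO's sensitivity profiling:

```python
import numpy as np

def profile_layer_sparsity(attn_maps, energy_keep=0.95):
    """For each layer's attention map, find the smallest fraction of entries
    that retains `energy_keep` of the total attention mass. This captures
    the idea of measuring per-layer sparsity once, offline; SVOO's actual
    criterion is more involved."""
    levels = []
    for a in attn_maps:                       # a: (queries, keys)
        flat = np.sort(a.ravel())[::-1]       # entries, largest first
        cum = np.cumsum(flat) / flat.sum()
        k = int(np.searchsorted(cum, energy_keep)) + 1
        levels.append(k / flat.size)          # fraction of entries kept
    return levels
```

A uniform map needs almost all entries to retain the mass, while a peaked map needs very few, so the profiled level differs per layer.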

[169] PhysVideo: Physically Plausible Video Generation with Cross-View Geometry Guidance

Cong Wang, Hanxin Zhu, Xiao Tang, Jiayi Luo, Xin Jin, Long Chen, Fei-Yue Wang, Zhibo Chen

Main category: cs.CV

TL;DR: PhysVideo is a two-stage video generation framework that improves physical realism by first generating physics-aware orthogonal foreground videos, then synthesizing full videos with background interactions.

DetailsMotivation: Current video generation methods achieve good visual fidelity but struggle with physically consistent motion, since real-world 3D motion is only partially captured in 2D video projections.

Method: Two-stage framework: 1) Phys4View generates physics-aware orthogonal foreground videos using physics-aware attention, geometry-enhanced cross-view attention, and temporal attention; 2) VideoSyn synthesizes full videos using foreground videos as guidance, learning foreground-background interactions.

Result: PhysVideo significantly improves physical realism and spatio-temporal coherence over existing video generation methods, as demonstrated through extensive experiments.

Conclusion: The proposed two-stage approach with physics-aware orthogonal video generation effectively addresses physical consistency challenges in video synthesis by modeling 3D motion dynamics.

Abstract: Recent progress in video generation has led to substantial improvements in visual fidelity, yet ensuring physically consistent motion remains a fundamental challenge. Intuitively, this limitation can be attributed to the fact that real-world object motion unfolds in three-dimensional space, while video observations provide only partial, view-dependent projections of such dynamics. To address these issues, we propose PhysVideo, a two-stage framework that first generates physics-aware orthogonal foreground videos and then synthesizes full videos with background. In the first stage, Phys4View leverages physics-aware attention to capture the influence of physical attributes on motion dynamics, and enhances spatio-temporal consistency by incorporating geometry-enhanced cross-view attention and temporal attention. In the second stage, VideoSyn uses the generated foreground videos as guidance and learns the interactions between foreground dynamics and background context for controllable video synthesis. To support training, we construct PhysMV, a dataset containing 40K scenes, each consisting of four orthogonal viewpoints, resulting in a total of 160K video sequences. Extensive experiments demonstrate that PhysVideo significantly improves physical realism and spatio-temporal coherence over existing video generation methods. Home page: https://anonymous.4open.science/w/Phys4D/.

[170] MeInTime: Bridging Age Gap in Identity-Preserving Face Restoration

Teer Song, Yue Zhang, Yu Tian, Ziyang Wang, Xianlin Zhang, Guixuan Zhang, Xuan Liu, Xueming Li, Yasen Zhang

Main category: cs.CV

TL;DR: MeInTime is a diffusion-based face restoration method that enables cross-age face restoration using reference images of different ages, addressing limitations of existing age-aligned approaches.

DetailsMotivation: Existing reference-based face restoration methods assume reference and degraded input are age-aligned, limiting real-world applications where only cross-age references are available (e.g., historical photo restoration).

Method: Decouples identity and age modeling: trains with identity injection via new attention mechanism and Gated Residual Fusion modules; uses Age-Aware Gradient Guidance at inference to nudge denoising latent toward desired age semantics.

Result: Outperforms existing face restoration methods in both identity preservation and age consistency, demonstrating effectiveness in cross-age restoration scenarios.

Conclusion: MeInTime successfully extends reference-based face restoration to cross-age settings, enabling faithful restoration with identity fidelity and age consistency using age prompts and cross-age references.

Abstract: To better preserve an individual’s identity, face restoration has evolved from reference-free to reference-based approaches, which leverage high-quality reference images of the same identity to enhance identity fidelity in the restored outputs. However, most existing methods implicitly assume that the reference and degraded input are age-aligned, limiting their effectiveness in real-world scenarios where only cross-age references are available, such as historical photo restoration. This paper proposes MeInTime, a diffusion-based face restoration method that extends reference-based restoration from same-age to cross-age settings. Given one or few reference images along with an age prompt corresponding to the degraded input, MeInTime achieves faithful restoration with both identity fidelity and age consistency. Specifically, we decouple the modeling of identity and age conditions. During training, we focus solely on effectively injecting identity features through a newly introduced attention mechanism and introduce Gated Residual Fusion modules to facilitate the integration between degraded features and identity representations. At inference, we propose Age-Aware Gradient Guidance, a training-free sampling strategy, using an age-driven direction to iteratively nudge the identity-aware denoising latent toward the desired age semantic manifold. Extensive experiments demonstrate that MeInTime outperforms existing face restoration methods in both identity preservation and age consistency. Our code is available at: https://github.com/teer4/MeInTime
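The Age-Aware Gradient Guidance step follows the familiar classifier-guidance pattern: at each denoising step, the latent is nudged along the gradient of an age score. A toy sketch of that pattern (not the authors' code; the quadratic `age_score`, the `denoise_step` stand-in, and the guidance scale are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Illustrative stand-ins: a real system would use an age predictor's gradient
# and a diffusion model's denoiser. Everything below is a toy assumption.
age_target = rng.normal(size=dim)  # direction of the desired age semantics

def age_score(z):
    return -np.sum((z - age_target) ** 2)  # higher = closer to target age

def age_grad(z):
    return -2.0 * (z - age_target)         # analytic gradient of the toy score

def denoise_step(z):
    return 0.95 * z                        # placeholder denoising update

def sample(z, guided, scale=0.05, steps=50):
    z = z.copy()
    for _ in range(steps):
        z = denoise_step(z)
        if guided:
            z = z + scale * age_grad(z)    # nudge latent toward age manifold
    return z

z0 = rng.normal(size=dim)
guided = sample(z0, guided=True)
plain = sample(z0, guided=False)
print(age_score(guided) > age_score(plain))  # True: guidance moves the latent
```

Because the guidance is applied only at sampling time, it can be bolted onto a trained restoration model without retraining, which is what makes the strategy "training-free".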

[171] Click-to-Ask: An AI Live Streaming Assistant with Offline Copywriting and Online Interactive QA

Ruizhi Yu, Keyang Zhong, Peng Liu, Qi Wu, Haoran Zhang, Yanhao Zhang, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: AI assistant for live streaming commerce with offline product processing and online real-time Q&A during broadcasts

DetailsMotivation: Live streaming commerce requires efficient product promotion and real-time audience interaction; current methods are time-consuming and lack automated assistance

Method: Two-component system: offline module processes multimodal product info into structured data and promotional copy; online module enables real-time Q&A via click-to-ask interface with event-level memory

Result: Achieves 0.913 Question Recognition Accuracy and 0.876 Response Quality score on TikTok live stream dataset; reduces preparation time and improves engagement

Conclusion: Click-to-Ask system effectively assists streamers with product promotion and audience interaction, showing practical potential for live streaming commerce

Abstract: Live streaming commerce has become a prominent form of broadcasting in the modern era. To facilitate more efficient and convenient product promotions for streamers, we present Click-to-Ask, an AI-driven assistant for live streaming commerce with complementary offline and online components. The offline module processes diverse multimodal product information, transforming complex inputs into structured product data and generating compliant promotional copywriting. During live broadcasts, the online module enables real-time responses to viewer inquiries by allowing streamers to click on questions and leveraging both the structured product information generated by the offline module and an event-level historical memory maintained in a streaming architecture. This system significantly reduces the time needed for promotional preparation, enhances content engagement, and enables prompt interaction with audience inquiries, ultimately improving the effectiveness of live streaming commerce. On our collected dataset of TikTok live stream frames, the proposed method achieves a Question Recognition Accuracy of 0.913 and a Response Quality score of 0.876, demonstrating considerable potential for practical application. The video demonstration can be viewed here: https://www.youtube.com/shorts/mWIXK-SWhiE.

[172] SAVeS: Steering Safety Judgments in Vision-Language Models via Semantic Cues

Carlos Hinojosa, Clemens Grange, Bernard Ghanem

Main category: cs.CV

TL;DR: Vision-language models’ safety decisions can be manipulated through simple semantic cues without changing actual scene content, revealing reliance on learned associations rather than grounded visual understanding.

DetailsMotivation: As VLMs are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context, it's unclear what visual evidence drives these judgments. The paper investigates whether multimodal safety behavior can be steered by semantic cues, highlighting potential vulnerabilities in safety-critical applications.

Method: Introduces a semantic steering framework applying controlled textual, visual, and cognitive interventions without changing underlying scene content. Proposes SAVeS benchmark for situational safety under semantic cues with evaluation protocol separating behavioral refusal, grounded safety reasoning, and false refusals. Tests across multiple VLMs and state-of-the-art benchmarks.

Result: Safety decisions in VLMs are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. Automated steering pipelines can exploit these mechanisms, demonstrating a potential vulnerability in multimodal safety systems.

Conclusion: Multimodal safety systems in VLMs are vulnerable to semantic manipulation, revealing that safety judgments depend more on learned associations than grounded visual understanding, which poses risks for real-world deployment.

Abstract: Vision-language models (VLMs) are increasingly deployed in real-world and embodied settings where safety decisions depend on visual context. However, it remains unclear which visual evidence drives these judgments. We study whether multimodal safety behavior in VLMs can be steered by simple semantic cues. We introduce a semantic steering framework that applies controlled textual, visual, and cognitive interventions without changing the underlying scene content. To evaluate these effects, we propose SAVeS, a benchmark for situational safety under semantic cues, together with an evaluation protocol that separates behavioral refusal, grounded safety reasoning, and false refusals. Experiments across multiple VLMs and an additional state-of-the-art benchmark show that safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding. We further demonstrate that automated steering pipelines can exploit these mechanisms, highlighting a potential vulnerability in multimodal safety systems.

[173] Benchmarking PDF Parsers on Table Extraction with LLM-based Semantic Evaluation

Pius Horn, Janis Keuper

Main category: cs.CV

TL;DR: A benchmarking framework for PDF table extraction using synthetic PDFs with LaTeX ground truth and LLM-as-a-judge for semantic evaluation, showing LLM evaluation correlates better with human judgment than traditional metrics.

DetailsMotivation: Existing evaluation approaches for PDF table extraction rely on rule-based metrics that fail to capture semantic equivalence of table content, creating a need for better evaluation methods.

Method: Created synthetic PDFs with precise LaTeX ground truth using tables from arXiv, applied LLM-as-a-judge for semantic table evaluation integrated into a matching pipeline, and conducted human validation study with 1,500+ judgments.

Result: LLM-based evaluation achieved substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (r=0.68) and Grid Table Similarity (r=0.70). Evaluation of 21 PDF parsers on 451 tables revealed significant performance disparities.

Conclusion: The framework provides practical guidance for selecting PDF parsers and establishes a reproducible, scalable evaluation methodology for tabular data extraction using LLM-based semantic evaluation.

Abstract: Reliably extracting tables from PDFs is essential for large-scale scientific data mining and knowledge base construction, yet existing evaluation approaches rely on rule-based metrics that fail to capture semantic equivalence of table content. We present a benchmarking framework based on synthetically generated PDFs with precise LaTeX ground truth, using tables sourced from arXiv to ensure realistic complexity and diversity. As our central methodological contribution, we apply LLM-as-a-judge for semantic table evaluation, integrated into a matching pipeline that accommodates inconsistencies in parser outputs. Through a human validation study comprising over 1,500 quality judgments on extracted table pairs, we show that LLM-based evaluation achieves substantially higher correlation with human judgment (Pearson r=0.93) compared to Tree Edit Distance-based Similarity (TEDS, r=0.68) and Grid Table Similarity (GriTS, r=0.70). Evaluating 21 contemporary PDF parsers across 100 synthetic documents containing 451 tables reveals significant performance disparities. Our results offer practical guidance for selecting parsers for tabular data extraction and establish a reproducible, scalable evaluation methodology for this critical task. Code and data: https://github.com/phorn1/pdf-parse-bench Metric study and human evaluation: https://github.com/phorn1/table-metric-study
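The reported correlations reduce to Pearson's r between metric scores and human judgments over the same extracted-table pairs. A self-contained sketch with made-up scores (the numbers are illustrative, not from the paper):

```python
import numpy as np

def pearson_r(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Hypothetical quality scores on six extracted-table pairs (0-1 scale):
human = [0.9, 0.2, 0.7, 0.5, 1.0, 0.3]
llm_judge = [0.85, 0.25, 0.75, 0.45, 0.95, 0.35]  # tracks humans closely
rule_based = [0.6, 0.5, 0.9, 0.2, 0.7, 0.6]       # noisier rule-based metric

print(pearson_r(human, llm_judge) > pearson_r(human, rule_based))  # True
```

The paper's claim is exactly this comparison at scale: over 1,500+ human judgments, the LLM judge's r (0.93) exceeds TEDS (0.68) and GriTS (0.70).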

[174] Multiscale Switch for Semi-Supervised and Contrastive Learning in Medical Ultrasound Image Segmentation

Jingguo Qu, Xinyang Han, Yao Pu, Man-Lik Chui, Simon Takadiyi Gunda, Ziman Chen, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying

Main category: cs.CV

TL;DR: Switch: A novel semi-supervised learning framework for medical ultrasound image segmentation using multiscale patch mixing and frequency domain switching with contrastive learning.

DetailsMotivation: Medical ultrasound segmentation suffers from limited labeled data and imaging artifacts like speckle noise and low-contrast boundaries. Existing SSL methods have suboptimal unlabeled data utilization and lack robust feature representation mechanisms.

Method: Proposes Switch framework with two innovations: 1) Multiscale Switch (MSS) using hierarchical patch mixing for uniform spatial coverage, and 2) Frequency Domain Switch (FDS) with contrastive learning performing amplitude switching in Fourier space for robust feature representations. Integrated within teacher-student architecture.

Result: Comprehensive evaluation across six ultrasound datasets shows consistent superiority over SOTA methods. At 5% labeling ratio: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, 83.48% Dice on Prostate datasets. Semi-supervised approach even exceeds fully supervised baselines. Maintains parameter efficiency (1.8M parameters).

Conclusion: Switch effectively addresses data scarcity in medical ultrasound segmentation through innovative SSL techniques, demonstrating superior performance with parameter efficiency for resource-constrained medical imaging applications.

Abstract: Medical ultrasound image segmentation faces significant challenges due to limited labeled data and characteristic imaging artifacts including speckle noise and low-contrast boundaries. While semi-supervised learning (SSL) approaches have emerged to address data scarcity, existing methods suffer from suboptimal unlabeled data utilization and lack robust feature representation mechanisms. In this paper, we propose Switch, a novel SSL framework with two key innovations: (1) Multiscale Switch (MSS) strategy that employs hierarchical patch mixing to achieve uniform spatial coverage; (2) Frequency Domain Switch (FDS) with contrastive learning that performs amplitude switching in Fourier space for robust feature representations. Our framework integrates these components within a teacher-student architecture to effectively leverage both labeled and unlabeled data. Comprehensive evaluation across six diverse ultrasound datasets (lymph nodes, breast lesions, thyroid nodules, and prostate) demonstrates consistent superiority over state-of-the-art methods. At 5% labeling ratio, Switch achieves remarkable improvements: 80.04% Dice on LN-INT, 85.52% Dice on DDTI, and 83.48% Dice on Prostate datasets, with our semi-supervised approach even exceeding fully supervised baselines. The method maintains parameter efficiency (1.8M parameters) while delivering superior performance, validating its effectiveness for resource-constrained medical imaging applications. The source code is publicly available at https://github.com/jinggqu/Switch
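The Frequency Domain Switch builds on the classic Fourier amplitude-mixing trick: blend two images' FFT amplitudes while keeping one image's phase, which preserves structure but perturbs appearance. A minimal sketch of that operation (the blending weight `beta` and the random images are assumptions; the paper's actual FDS may differ in detail):

```python
import numpy as np

def amplitude_switch(img_a, img_b, beta=1.0):
    """Blend img_b's Fourier amplitude into img_a, keeping img_a's phase."""
    fa, fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    amp_a, amp_b = np.abs(fa), np.abs(fb)
    phase_a = np.angle(fa)
    mixed_amp = (1 - beta) * amp_a + beta * amp_b
    return np.real(np.fft.ifft2(mixed_amp * np.exp(1j * phase_a)))

rng = np.random.default_rng(0)
a, b = rng.random((32, 32)), rng.random((32, 32))
out = amplitude_switch(a, b, beta=1.0)
# Sanity check: with beta=0 the image is reconstructed unchanged.
print(np.allclose(amplitude_switch(a, b, beta=0.0), a))  # True
```

Pairing such amplitude-perturbed views in a contrastive objective encourages features that are robust to appearance shifts like speckle noise, which matches the stated motivation.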

[175] Multimodal Model for Computational Pathology: Representation Learning and Image Compression

Peihang Wu, Zehong Chen, Lijian Xu

Main category: cs.CV

TL;DR: Comprehensive review of multimodal computational pathology covering self-supervised representation learning, multimodal data generation, parameter-efficient adaptation, and multi-agent collaborative reasoning for whole slide imaging analysis.

DetailsMotivation: Whole slide imaging has transformed digital pathology but faces challenges: extreme resolution creates computational hurdles, limited expert annotations constrain supervised approaches, multimodal integration is difficult while preserving interpretability, and model opacity hinders clinical transparency.

Method: Systematic analysis of four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis.

Result: The review surveys recent advances, examining how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist’s “Chain of Thought” across magnifications for uncertainty-aware evidence fusion.

Conclusion: Future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

Abstract: Whole slide imaging (WSI) has transformed digital pathology by enabling computational analysis of gigapixel histopathology images. Recent foundation model advances have accelerated progress in computational pathology, facilitating joint reasoning across pathology images, clinical reports, and structured data. Despite this progress, challenges remain: the extreme resolution of WSIs creates computational hurdles for visual learning; limited expert annotations constrain supervised approaches; integrating multimodal information while preserving biological interpretability remains difficult; and the opacity of modeling ultra-long visual sequences hinders clinical transparency. This review comprehensively surveys recent advances in multimodal computational pathology. We systematically analyze four research directions: (1) self-supervised representation learning and structure-aware token compression for WSIs; (2) multimodal data generation and augmentation; (3) parameter-efficient adaptation and reasoning-enhanced few-shot learning; and (4) multi-agent collaborative reasoning for trustworthy diagnosis. We specifically examine how token compression enables cross-scale modeling and how multi-agent mechanisms simulate a pathologist’s “Chain of Thought” across magnifications to achieve uncertainty-aware evidence fusion. Finally, we discuss open challenges and argue that future progress depends on unified multimodal frameworks integrating high-resolution visual data with clinical and biomedical knowledge to support interpretable and safe AI-assisted diagnosis.

[176] Towards High-Quality Image Segmentation: Improving Topology Accuracy by Penalizing Neighbor Pixels

Juan Miguel Valverde, Dim P. Papadopoulos, Rasmus Larsen, Anders Bjorholm Dahl

Main category: cs.CV

TL;DR: SCNP improves topology accuracy in image segmentation by penalizing logits with their poorest-classified neighbor, forcing models to improve neighbor predictions before pixel predictions.

DetailsMotivation: Standard deep learning models for image segmentation fail to guarantee topology accuracy, affecting segmentation quality and subsequent quantification analyses. Existing methods for improving topology are cumbersome, computationally expensive, or restricted to specific morphologies.

Method: SCNP (Same-Class Neighbor Penalization) improves topology accuracy by penalizing logits with their poorest-classified neighbor. This forces the model to improve predictions at pixel neighbors before allowing improvement at the pixels themselves.

Result: SCNP shows effectiveness across 13 datasets covering different structure morphologies and image modalities. It integrates into three frameworks for semantic and instance segmentation and can be combined with various loss functions to improve topology accuracy.

Conclusion: SCNP provides an efficient method to enhance topology accuracy in image segmentation that is easy to integrate into existing training pipelines, works across diverse morphologies, and improves various segmentation frameworks.

Abstract: Standard deep learning models for image segmentation cannot guarantee topology accuracy, failing to preserve the correct number of connected components or structures. This, in turn, affects the quality of the segmentations and compromises the reliability of the subsequent quantification analyses. Previous works have proposed to enhance topology accuracy with specialized frameworks, architectures, and loss functions. However, these methods are often cumbersome to integrate into existing training pipelines, they are computationally very expensive, or they are restricted to structures with tubular morphology. We present SCNP, an efficient method that improves topology accuracy by penalizing the logits with their poorest-classified neighbor, forcing the model to improve the prediction at the pixels’ neighbors before allowing it to improve the pixels themselves. We show the effectiveness of SCNP across 13 datasets, covering different structure morphologies and image modalities, and integrate it into three frameworks for semantic and instance segmentation. Additionally, we show that SCNP can be integrated into several loss functions, making them improve topology accuracy. Our code can be found at https://jmlipman.github.io/SCNP-SameClassNeighborPenalization.
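The core operation, replacing each pixel's true-class score with the worse of itself and its poorest-classified 4-neighbor, can be sketched in a few lines of NumPy (a toy illustration of the idea, not the released implementation):

```python
import numpy as np

def worst_neighbor(score):
    """For each pixel, the minimum true-class score among its 4-neighbors."""
    padded = np.pad(score, 1, constant_values=np.inf)
    return np.minimum.reduce([
        padded[:-2, 1:-1],  # up
        padded[2:, 1:-1],   # down
        padded[1:-1, :-2],  # left
        padded[1:-1, 2:],   # right
    ])

# Toy per-pixel true-class logits (higher = more confidently correct).
score = np.array([[2.0, 2.0, 2.0],
                  [2.0, -1.0, 2.0],
                  [2.0, 2.0, 2.0]])

penalized = np.minimum(score, worst_neighbor(score))
# The weakly classified centre pixel drags down its 4 neighbours, so the loss
# gradient pushes the model to fix the neighbour before the pixel itself.
print(penalized)
```

Running this, the four edge pixels inherit the centre's score of -1.0 while the corners keep 2.0, which is exactly the "improve your neighbor first" pressure described in the abstract.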

[177] Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Mohamed Youssef, Mayar Elfares, Anna-Maria Meer, Matteo Bortoletto, Andreas Bulling

Main category: cs.CV

TL;DR: OGD is a neuro-symbolic zero-shot sim2real image translation framework that uses an ontology of interpretable visual traits and knowledge graphs to guide diffusion models for realistic image generation without real-world data.

DetailsMotivation: The sim2real gap is challenging due to scarce labeled real-world data. Existing diffusion approaches use unstructured prompts or statistical alignment, failing to capture the structured factors that make images look real.

Method: OGD decomposes realism into an ontology of interpretable traits (lighting, material properties), encodes relationships in a knowledge graph, infers trait activations from synthetic images, uses GNN for global embeddings, symbolic planner for edit sequences, and conditions pretrained instruction-guided diffusion via cross-attention with structured prompts.

Result: OGD’s graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translation across benchmarks.

Conclusion: Explicitly encoding realism structure enables interpretable, data-efficient, and generalizable zero-shot sim2real transfer.

Abstract: Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits – such as lighting and material properties – and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translations. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.

[178] EdgeCrafter: Compact ViTs for Edge Dense Prediction via Task-Specialized Distillation

Longfei Liu, Yongjie Hou, Yang Li, Qirui Wang, Youyang Sha, Yongjun Yu, Yinzhi Wang, Peizhe Ru, Xuanlong Yu, Xi Shen

Main category: cs.CV

TL;DR: EdgeCrafter: A unified compact Vision Transformer framework for edge dense prediction tasks (detection, segmentation, pose estimation) that addresses the accuracy-efficiency gap in small-scale ViTs through task-specific distillation and edge-aware design.

DetailsMotivation: Compact Vision Transformers struggle to match CNN-based architectures like YOLO in accuracy-efficiency tradeoff for edge dense prediction tasks, despite large-scale pretraining. The authors argue this gap stems from insufficient task-specific representation learning in small-scale ViTs rather than inherent ViT limitations.

Method: Proposes EdgeCrafter framework with ECDet (detection model) built from distilled compact backbone and edge-friendly encoder-decoder design. Uses task-specialized distillation and edge-aware design to enhance compact ViTs for dense prediction tasks.

Result: ECDet-S achieves 51.7 AP on COCO with <10M parameters using only COCO annotations. ECInsSeg matches RF-DETR performance with fewer parameters. ECPose-X reaches 74.8 AP, outperforming YOLO26Pose-X (71.6 AP) despite YOLO’s extensive Objects365 pretraining.

Conclusion: Compact ViTs with task-specialized distillation and edge-aware design can be practical and competitive for edge dense prediction, challenging the dominance of CNN-based architectures in this domain.

Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve similarly strong accuracy efficiency tradeoff, even with large scale pretraining. We argue that this gap is largely due to insufficient task specific representation learning in small scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/
Abstract: Deploying high-performance dense prediction models on resource-constrained edge devices remains challenging due to strict limits on computation and memory. In practice, lightweight systems for object detection, instance segmentation, and pose estimation are still dominated by CNN-based architectures such as YOLO, while compact Vision Transformers (ViTs) often struggle to achieve a similarly strong accuracy-efficiency tradeoff, even with large-scale pretraining. We argue that this gap is largely due to insufficient task-specific representation learning in small-scale ViTs, rather than an inherent mismatch between ViTs and edge dense prediction. To address this issue, we introduce EdgeCrafter, a unified compact ViT framework for edge dense prediction centered on ECDet, a detection model built from a distilled compact backbone and an edge-friendly encoder-decoder design. On the COCO dataset, ECDet-S achieves 51.7 AP with fewer than 10M parameters using only COCO annotations. For instance segmentation, ECInsSeg achieves performance comparable to RF-DETR while using substantially fewer parameters. For pose estimation, ECPose-X reaches 74.8 AP, significantly outperforming YOLO26Pose-X (71.6 AP) despite the latter’s reliance on extensive Objects365 pretraining. These results show that compact ViTs, when paired with task-specialized distillation and edge-aware design, can be a practical and competitive option for edge dense prediction. Code is available at: https://intellindust-ai-lab.github.io/projects/EdgeCrafter/
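The report summarized here does not spell out the distillation objective, but the standard backbone-distillation term is a temperature-softened KL divergence between teacher and student outputs. A generic sketch of that term (all names and numbers are illustrative assumptions, not EdgeCrafter's actual loss):

```python
import numpy as np

def softmax(x, t=1.0):
    z = x / t
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, t=2.0):
    """Temperature-softened KL(teacher || student), the usual distillation term."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)) * t * t)

teacher = np.array([[4.0, 1.0, 0.5]])
aligned = np.array([[3.5, 1.2, 0.4]])     # student close to the teacher
misaligned = np.array([[0.5, 4.0, 1.0]])  # student contradicting the teacher
print(kd_loss(aligned, teacher) < kd_loss(misaligned, teacher))  # True
```

"Task-specialized" distillation, as described, would apply such a term with a teacher already adapted to the target task (detection, segmentation, or pose) rather than a generic pretrained one.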

[179] 6Bit-Diffusion: Inference-Time Mixed-Precision Quantization for Video Diffusion Models

Rundong Su, Jintao Zhang, Zhihang Yuan, Haojie Duanmu, Jianfei Chen, Jun Zhu

Main category: cs.CV

TL;DR: A mixed-precision quantization framework for video diffusion transformers that dynamically allocates NVFP4/INT8 precision based on temporal stability and uses temporal delta caching to skip invariant computations, achieving significant speedup and memory reduction.

DetailsMotivation: Video diffusion transformers have high memory usage and computational costs that limit practical deployment. Existing static quantization methods fail to account for varying quantization sensitivity across diffusion timesteps, leading to suboptimal efficiency-quality trade-offs.

Method: Proposes an inference-time NVFP4/INT8 mixed-precision quantization framework with: 1) A lightweight predictor that dynamically allocates precision based on input-output difference correlation with quantization sensitivity, 2) Temporal Delta Cache (TDC) that exploits temporal consistency of residuals to skip computations for invariant blocks.

Result: Achieves 1.92× end-to-end acceleration and 3.32× memory reduction while maintaining generation quality, setting a new baseline for efficient inference in Video DiTs.

Conclusion: The proposed adaptive mixed-precision quantization with temporal caching effectively addresses efficiency bottlenecks in video diffusion transformers, enabling practical deployment without compromising quality.

Abstract: Diffusion transformers have demonstrated remarkable capabilities in generating videos. However, their practical deployment is severely constrained by high memory usage and computational cost. Post-Training Quantization provides a practical way to reduce memory usage and boost computation speed. Existing quantization methods typically apply a static bit-width allocation, overlooking the quantization difficulty of activations across diffusion timesteps, leading to a suboptimal trade-off between efficiency and quality. In this paper, we propose an inference-time NVFP4/INT8 Mixed-Precision Quantization framework. We find a strong linear correlation between a block’s input-output difference and the quantization sensitivity of its internal linear layers. Based on this insight, we design a lightweight predictor that dynamically allocates NVFP4 to temporally stable layers to maximize memory compression, while selectively preserving INT8 for volatile layers to ensure robustness. This adaptive precision strategy enables aggressive quantization without compromising generation quality. Besides this, we observe that the residual between the input and output of a Transformer block exhibits high temporal consistency across timesteps. Leveraging this temporal redundancy, we introduce Temporal Delta Cache (TDC) to skip computations for these invariant blocks, further reducing the computational cost. Extensive experiments demonstrate that our method achieves 1.92$\times$ end-to-end acceleration and 3.32$\times$ memory reduction, setting a new baseline for efficient inference in Video DiTs.
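The Temporal Delta Cache idea, reusing a block's residual once it has stopped changing across timesteps, can be sketched as a tiny wrapper (a simplification under assumed names; the paper's stability test and cache policy are surely more involved):

```python
import numpy as np

class TemporalDeltaCache:
    """Reuse a block's residual (output - input) once it is temporally stable,
    skipping the block computation entirely on stable timesteps."""
    def __init__(self, tol=1e-2):
        self.tol = tol
        self.prev = None    # residual at the previous timestep
        self.prev2 = None   # residual two timesteps ago
        self.skips = 0

    def __call__(self, block, x):
        stable = (self.prev is not None and self.prev2 is not None
                  and np.abs(self.prev - self.prev2).max() < self.tol)
        if stable:
            self.skips += 1
            return x + self.prev          # skip the block, replay the residual
        out = block(x)
        self.prev2, self.prev = self.prev, out - x
        return out

block = lambda x: x + 0.5                 # toy block with a constant residual
cache = TemporalDeltaCache()
x = np.zeros(4)
for _ in range(5):
    x = cache(block, x)
print(cache.skips, np.allclose(x, 2.5))   # 3 True
```

With a perfectly constant residual, the block runs only on the first two timesteps and the remaining three are served from the cache, while the output stays exact; in practice the tolerance `tol` trades accuracy for skipped compute.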

[180] WeNLEX: Weakly Supervised Natural Language Explanations for Multilabel Chest X-ray Classification

Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira

Main category: cs.CV

TL;DR: WeNLEX: Weakly supervised model for generating faithful natural language explanations for chest X-ray classification using feature matching and distribution alignment.

DetailsMotivation: Current explanation generation methods produce plausible but unfaithful explanations because they're explicitly supervised with annotated datasets, not reflecting the model's actual reasoning process.

Method: Uses weak supervision with two key components: 1) faithfulness via matching images generated from explanations with original images in the black-box model’s feature space, 2) plausibility via distribution alignment with a small database of clinician-annotated explanations.

Result: Produces faithful and plausible explanations using only 5 ground-truth explanations per diagnosis; improves classification AUC by 2.21% when trained jointly; adaptable to different audiences (e.g., layman versions).

Conclusion: WeNLEX demonstrates that interpretability can improve downstream task performance while generating faithful explanations with minimal supervision, adaptable to various user groups.

Abstract: Natural language explanations provide an inherently human-understandable way to explain black-box models, closely reflecting how radiologists convey their diagnoses in textual reports. Most works explicitly supervise the explanation generation process using datasets annotated with explanations. Thus, though plausible, the generated explanations are not faithful to the model’s reasoning. In this work, we propose WeNLEX, a weakly supervised model for the generation of natural language explanations for multilabel chest X-ray classification. Faithfulness is ensured by matching images generated from their corresponding natural language explanations with original images, in the black-box model’s feature space. Plausibility is maintained via distribution alignment with a small database of clinician-annotated explanations. We empirically demonstrate, through extensive validation on multiple metrics to assess faithfulness, simulatability, diversity, and plausibility, that WeNLEX is able to produce faithful and plausible explanations, using as few as 5 ground-truth explanations per diagnosis. Furthermore, WeNLEX can operate in both post-hoc and in-model settings. In the latter, i.e., when the multilabel classifier is trained together with the rest of the network, WeNLEX improves the classification AUC of the standalone classifier by 2.21%, thus showing that adding interpretability to the training process can actually increase the downstream task performance. Additionally, simply by changing the database, WeNLEX explanations are adaptable to any target audience, and we showcase this flexibility by training a layman version of WeNLEX, where explanations are simplified for non-medical users.
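The faithfulness term compares the original image with an image regenerated from its explanation, in the black-box model's feature space. A minimal stand-in for such a matching loss (cosine distance is an assumption here; the paper's summary above does not specify the metric):

```python
import numpy as np

def feature_match_loss(f_orig, f_regen):
    """Cosine distance between black-box features of the original image and
    of the image regenerated from its natural language explanation."""
    a = f_orig / np.linalg.norm(f_orig)
    b = f_regen / np.linalg.norm(f_regen)
    return 1.0 - float(a @ b)

f = np.array([1.0, 2.0, 3.0])
g = np.array([3.0, -2.0, 1.0])
print(feature_match_loss(f, 2 * f) < 1e-9)                      # True: scale-invariant
print(feature_match_loss(f, g) > feature_match_loss(f, 2 * f))  # True: mismatch costs more
```

The intuition is that if an explanation truly reflects what the model saw, an image generated from it should land near the original in the model's own representation; a large distance signals an unfaithful explanation.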

[181] DA-Mamba: Learning Domain-Aware State Space Model for Global-Local Alignment in Domain Adaptive Object Detection

Haochen Li, Rui Zhang, Hantao Yao, Xin Zhang, Yifan Hao, Shaohui Peng, Yongwei Zhao, Ling Li

Main category: cs.CV

TL;DR: DA-Mamba: A hybrid CNN-State Space Model architecture for domain adaptive object detection that combines CNN efficiency with SSM long-range modeling to capture global and local domain-invariant features.

Motivation: Existing DAOD methods using CNNs are limited to local feature alignment due to CNN's local connectivity, while transformer-based methods have quadratic computational costs. Need efficient global feature extraction for domain adaptation.

Method: Propose DA-Mamba with two novel modules: Image-Aware SSM (IA-SSM) in backbone for image-level global/local alignment, and Object-Aware SSM (OA-SSM) in detection head for instance-level alignment modeling spatial/semantic dependencies.

Result: Comprehensive experiments show the method efficiently improves cross-domain performance of object detectors.

Conclusion: DA-Mamba successfully combines CNN efficiency with SSM long-range modeling for effective domain adaptive object detection with global feature awareness.

Abstract: Domain Adaptive Object Detection (DAOD) aims to transfer detectors from a labeled source domain to an unlabeled target domain. Existing DAOD methods employ multi-granularity feature alignment to learn domain-invariant representations. However, the local connectivity of their CNN-based backbone and detection head restricts alignment to local regions, failing to extract global domain-invariant features. Although transformer-based DAOD methods capture global dependencies via attention mechanisms, their quadratic computational cost hinders practical deployment. To solve this, we propose DA-Mamba, a hybrid CNN-State Space Model (SSM) architecture that combines the efficiency of CNNs with the linear-time long-range modeling capability of SSMs to capture both global and local domain-invariant features. Specifically, we introduce two novel modules: Image-Aware SSM (IA-SSM) and Object-Aware SSM (OA-SSM). IA-SSM is integrated into the backbone to enhance global domain awareness, enabling image-level global and local alignment. OA-SSM is inserted into the detection head to model spatial and semantic dependencies among objects, enhancing instance-level alignment. Comprehensive experiments demonstrate that the proposed method can efficiently improve the cross-domain performance of the object detector.
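The linear-time long-range modeling that SSM modules contribute comes from a simple recurrence. The scan below is a generic toy of that primitive, not the paper's IA-SSM/OA-SSM modules; the matrices and sizes are illustrative.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal linear state-space recurrence, the core primitive behind
    Mamba-style modules: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    Cost is linear in sequence length, unlike quadratic attention."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:              # sequential scan over the token sequence
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
T, d_in, d_state, d_out = 16, 4, 8, 4
x = rng.standard_normal((T, d_in))
A = 0.9 * np.eye(d_state)                        # stable diagonal dynamics
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((d_out, d_state)) * 0.1
y = ssm_scan(x, A, B, C)
print(y.shape)                                   # (16, 4)
```

Because each output token depends on a running state rather than on all pairwise token interactions, the scan gives global receptive fields at linear cost.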

[182] ProCal: Probability Calibration for Neighborhood-Guided Source-Free Domain Adaptation

Ying Zheng, Yiyi Zhang, Yi Wang, Lap-Pui Chau

Main category: cs.CV

TL;DR: ProCal introduces a probability calibration method for Source-Free Domain Adaptation that dynamically calibrates neighborhood-based predictions using dual-model collaboration to balance source knowledge retention and target domain adaptation.

Motivation: Current SFDA methods over-rely on prediction similarity among neighbors, which accelerates forgetting of source knowledge and increases susceptibility to local noise overfitting. There's a need for better balance between knowledge retention and domain adaptation.

Method: ProCal uses probability calibration with dual-model collaborative prediction, integrating source model’s initial predictions with current model’s online outputs to calibrate neighbor probabilities. Includes joint optimization with soft supervision loss and diversity loss.

Result: Extensive experiments on 31 cross-domain tasks across four public datasets validate effectiveness. Theoretical analysis shows convergence to equilibrium where source knowledge and target information are effectively fused.

Conclusion: ProCal successfully mitigates local noise interference while preserving source discriminative information, achieving better balance between knowledge retention and domain adaptation in SFDA.

Abstract: Source-Free Domain Adaptation (SFDA) adapts pre-trained models to unlabeled target domains without requiring access to source data. Although state-of-the-art methods leveraging local neighborhood structures show promise for SFDA, they tend to over-rely on prediction similarity among neighbors. This over-reliance accelerates the forgetting of source knowledge and increases susceptibility to local noise overfitting. To address these issues, we introduce ProCal, a probability calibration method that dynamically calibrates neighborhood-based predictions through a dual-model collaborative prediction mechanism. ProCal integrates the source model’s initial predictions with the current model’s online outputs to effectively calibrate neighbor probabilities. This strategy not only mitigates the interference of local noise but also preserves the discriminative information from the source model, thereby achieving a balance between knowledge retention and domain adaptation. Furthermore, we design a joint optimization objective that combines a soft supervision loss with a diversity loss to guide the target model. Our theoretical analysis shows that ProCal converges to an equilibrium where source knowledge and target information are effectively fused, reducing both knowledge forgetting and overfitting. We validate the effectiveness of our approach through extensive experiments on 31 cross-domain tasks across four public datasets. Our code is available at: https://github.com/zhengyinghit/ProCal.
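The dual-model calibration at the heart of ProCal can be sketched as a blend of the two models' probability vectors. The fixed mixing weight and simple renormalization below are illustrative assumptions; the paper's exact calibration rule may differ.

```python
import numpy as np

def calibrate(p_source, p_current, alpha=0.5):
    """Toy ProCal-style calibration: blend the frozen source model's
    prediction with the adapting model's online output, then renormalize,
    so neighbor pseudo-labels retain source discriminative information."""
    p = alpha * np.asarray(p_source) + (1.0 - alpha) * np.asarray(p_current)
    return p / p.sum(axis=-1, keepdims=True)

p_src = np.array([0.7, 0.2, 0.1])   # source model keeps class 0 likely
p_cur = np.array([0.1, 0.8, 0.1])   # noisy neighborhood pseudo-label
p_cal = calibrate(p_src, p_cur)
print(np.round(p_cal, 2))           # [0.4 0.5 0.1]
```

The blend damps local-noise spikes in the target model's output while preventing the source knowledge from being overwritten outright.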

[183] SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov, Chenghao Xu, Shuo Sun, Olga Fink, Malcolm Mielle

Main category: cs.CV

TL;DR: SEAR is a fine-tuning strategy that adapts pretrained visual geometry transformers to handle multimodal RGB-thermal inputs for improved 3D reconstruction and camera pose estimation.

Motivation: Pretrained visual geometry models work well on RGB data but struggle with multimodal inputs like RGB-thermal images, particularly in aligning different modalities during joint processing.

Method: Proposes SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs using a relatively small RGB-T dataset.

Result: Significantly outperforms state-of-the-art methods, achieving over 29% improvement in AUC@30, delivering higher detail and consistency between modalities with negligible inference time overhead.

Conclusion: SEAR enables reliable multimodal pose estimation and reconstruction under challenging conditions and introduces a new RGB-thermal dataset for benchmarking future multimodal 3D scene reconstruction work.

Abstract: Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, with consistent gains across all metrics (e.g., over 29% in AUC@30), and delivers higher detail and consistency between modalities with negligible overhead in inference time compared to the original pretrained model. Notably, SEAR enables reliable multimodal pose estimation and reconstruction even under challenging conditions, such as low lighting and dense smoke. We validate our architecture through extensive ablation studies, demonstrating how the model aligns both modalities. Additionally, we introduce a new dataset featuring RGB and thermal sequences captured at different times, viewpoints, and illumination conditions, providing a robust benchmark for future work in multimodal 3D scene reconstruction. Code and models are publicly available at https://www.github.com/Schindler-EPFL-Lab/SEAR.

[184] Points-to-3D: Structure-Aware 3D Generation with Point Cloud Priors

Jiatong Xia, Zicheng Duan, Anton van den Hengel, Lingqiao Liu

Main category: cs.CV

TL;DR: Points-to-3D: A diffusion-based framework that leverages point cloud priors for geometry-controllable 3D generation, using visible-region point clouds from sensors or predictors to provide explicit geometric constraints.

Motivation: Current 3D generation methods underutilize readily available 3D priors like point clouds from LiDAR or VGGT predictors, which offer explicit geometric constraints that could improve generation quality and structural control.

Method: Built on TRELLIS latent 3D diffusion model, replaces pure-noise sparse structure latent initialization with point cloud priors, uses structure inpainting network trained on task-specific data with staged sampling strategy (structural inpainting followed by boundary refinement).

Result: Superior performance over state-of-the-art baselines in both object and scene scenarios in terms of rendering quality and geometric fidelity, demonstrating effectiveness of explicitly embedding point-cloud priors.

Conclusion: Explicitly embedding point-cloud priors enables more accurate and structurally controllable 3D generation, with practical applications using either accurate point clouds or VGGT-estimated ones from single images.

Abstract: Recent progress in 3D generation has been driven largely by models conditioned on images or text, while readily available 3D priors are still underused. In many real-world scenarios, visible-region point clouds are easy to obtain from active sensors such as LiDAR or from feed-forward predictors like VGGT, offering explicit geometric constraints that current methods fail to exploit. In this work, we introduce Points-to-3D, a diffusion-based framework that leverages point cloud priors for geometry-controllable 3D asset and scene generation. Built on the latent 3D diffusion model TRELLIS, Points-to-3D first replaces pure-noise sparse structure latent initialization with an input formulation tailored to point cloud priors. A structure inpainting network, trained within the TRELLIS framework on task-specific data designed to teach global structural inpainting, is then used at inference with a staged sampling strategy (structural inpainting followed by boundary refinement), completing the global geometry while preserving the visible regions of the input priors. In practice, Points-to-3D can take either accurate point-cloud priors or VGGT-estimated point clouds from single images as input. Experiments on both object and scene scenarios consistently demonstrate superior performance over state-of-the-art baselines in terms of rendering quality and geometric fidelity, highlighting the effectiveness of explicitly embedding point-cloud priors for achieving more accurate and structurally controllable 3D generation.
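The replacement of pure-noise latent initialization with a point-cloud prior can be sketched at the voxel level. This is a hedged toy, not the TRELLIS-specific formulation: the grid size, the unit-cube point coordinates, and stamping occupied voxels with a constant are all illustrative assumptions.

```python
import numpy as np

def prior_initialized_latent(points, grid=16, noise_scale=1.0, seed=0):
    """Toy Points-to-3D-style initialization: voxels hit by the visible
    point cloud are fixed by the prior, while unseen voxels start from
    noise for the diffusion model to inpaint."""
    rng = np.random.default_rng(seed)
    latent = noise_scale * rng.standard_normal((grid, grid, grid))
    # Map points in [0, 1]^3 to voxel indices and stamp in the prior.
    idx = np.clip((np.asarray(points) * grid).astype(int), 0, grid - 1)
    occupancy = np.zeros((grid, grid, grid), dtype=bool)
    occupancy[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    latent[occupancy] = 1.0  # visible structure fixed, the rest noisy
    return latent, occupancy

pts = np.random.default_rng(4).random((500, 3))  # toy visible point cloud
latent, occ = prior_initialized_latent(pts)
print(occ.sum() > 0, latent.shape)               # True (16, 16, 16)
```

Keeping visible voxels fixed while sampling only the unseen ones is what lets the staged inpainting complete geometry without overwriting the observed regions.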

[185] Rethinking Uncertainty Quantification and Entanglement in Image Segmentation

Jakob Lønborg Christensen, Vedrana Andersen Dahl, Morten Rieger Hannemose, Anders Bjorholm Dahl, Christian F. Baumgartner

Main category: cs.CV

TL;DR: Comprehensive empirical study of aleatoric and epistemic uncertainty interactions in medical image segmentation, proposing entanglement metrics and evaluating model combinations across downstream tasks.

Motivation: Uncertainty quantification is crucial for safety-critical medical applications, but existing methods for modeling aleatoric and epistemic uncertainty are often combined without understanding their interactions. Recent work also reveals substantial entanglement between these uncertainty types, which undermines the interpretability and practical usefulness of the decomposition.

Method: Conducted comprehensive empirical study covering broad range of AU-EU model combinations, proposed metric to quantify uncertainty entanglement, and evaluated across downstream UQ tasks including out-of-distribution detection, ambiguity modeling, and calibration.

Result: For out-of-distribution detection, ensembles showed consistently lower entanglement and superior performance. For ambiguity modeling and calibration, best models were dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. Softmax ensembles performed remarkably well across all tasks.

Conclusion: The study reveals important patterns in uncertainty entanglement across different model combinations, identifies ensemble methods as particularly effective for OOD detection, and provides guidance for selecting appropriate uncertainty quantification methods based on specific downstream tasks and datasets.

Abstract: Uncertainty quantification (UQ) is crucial in safety-critical applications such as medical image segmentation. Total uncertainty is typically decomposed into data-related aleatoric uncertainty (AU) and model-related epistemic uncertainty (EU). Many methods exist for modeling AU (such as Probabilistic UNet, Diffusion) and EU (such as ensembles, MC Dropout), but it is unclear how they interact when combined. Additionally, recent work has revealed substantial entanglement between AU and EU, undermining the interpretability and practical usefulness of the decomposition. We present a comprehensive empirical study covering a broad range of AU-EU model combinations, propose a metric to quantify uncertainty entanglement, and evaluate both across downstream UQ tasks. For out-of-distribution detection, ensembles exhibit consistently lower entanglement and superior performance. For ambiguity modeling and calibration, the best models are dataset-dependent, with softmax/SSN-based methods performing well and Probabilistic UNets being less entangled. A softmax ensemble fares remarkably well on all tasks. Finally, we analyze potential sources of uncertainty entanglement and outline directions for mitigating this effect.
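One plausible instantiation of an entanglement score, offered purely for intuition since the paper's exact metric is not reproduced here, is the correlation between per-pixel AU and EU maps: perfectly disentangled estimates would be uncorrelated.

```python
import numpy as np

def entanglement_proxy(au_map, eu_map):
    """Hypothetical entanglement proxy: Pearson correlation between
    per-pixel aleatoric and epistemic uncertainty maps. A score near 0
    suggests the two uncertainty sources vary independently."""
    au = np.asarray(au_map).ravel()
    eu = np.asarray(eu_map).ravel()
    return float(np.corrcoef(au, eu)[0, 1])

rng = np.random.default_rng(1)
au = rng.random((32, 32))
print(round(entanglement_proxy(au, au), 3))                      # 1.0
print(abs(entanglement_proxy(au, rng.random((32, 32)))) < 0.2)   # True
```

Under this proxy, a model whose "epistemic" map simply shadows its "aleatoric" map would score near 1 and offer no usable decomposition.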

[186] Perceptio: Perception Enhanced Vision Language Models via Spatial Token Generation

Yuchen Li, Amanmeet Garg, Shalini Chaudhuri, Rui Zhao, Garin Kessler

Main category: cs.CV

TL;DR: Perceptio enhances LVLMs with explicit 2D/3D spatial reasoning using semantic segmentation and depth tokens generated within autoregressive sequences, achieving SOTA on spatial grounding benchmarks.

Motivation: Large Vision Language Models excel at semantic understanding but struggle with fine-grained spatial grounding because they must implicitly infer complex geometry without producing explicit spatial interpretations.

Method: 1) Distill VQVAE depth codebook from monocular teacher to tokenize dense depth; 2) Integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside LLM; 3) Introduce composite depth-token objectives and soft-merging for stable generation; 4) Multi-task co-training across diverse datasets.

Result: State-of-the-art performance: +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, +10.3% on HardBLINK spatial understanding, +1.0% on MMBench accuracy, demonstrating explicit spatial chain-of-thought strengthens spatial grounding.

Conclusion: Explicit spatial reasoning tokens materially strengthen spatial grounding in LVLMs, enabling better 2D and 3D spatial understanding through explicit spatial chain-of-thought generation.

Abstract: Large Vision Language Models (LVLMs) excel at semantic understanding but struggle with fine-grained spatial grounding, as the model must implicitly infer complex geometry without ever producing a spatial interpretation. We present Perceptio, a perception enhanced LVLM with 2D and 3D spatial reasoning abilities, enabled via explicit semantic segmentation tokens and depth tokens generated directly within the autoregressive sequence. Concretely, we (i) distill a VQ-VAE depth codebook from a strong monocular teacher to tokenize dense depth into compact sequences, and (ii) integrate SAM2-based semantic segmentation tokens and VQ-VAE depth tokens inside the LLM so the model first emits spatial tokens and then answers. To stabilize depth token generation, we introduce novel composite depth-token objectives (marker, token, and count losses) and a soft-merging technique for differentiable reconstruction. We adopt a multi-task co-training strategy across diverse datasets, letting the model learn perception tokens to tackle multiple downstream tasks. Building on InternVL, Perceptio achieves state-of-the-art performance across benchmarks: improving referring expression segmentation by +0.8/+1.4/+1.1 cIoU on RefCOCO/+/g, HardBLINK spatial understanding accuracy by 10.3%, and MMBench accuracy by 1.0%, demonstrating that explicit spatial chain-of-thought materially strengthens spatial grounding in LVLMs.
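The VQ step that turns dense depth into compact discrete tokens can be sketched as nearest-codebook lookup. The random codebook and flat patch vectors below are illustrative stand-ins for the distilled codebook and encoder features.

```python
import numpy as np

def quantize_depth(depth_patches, codebook):
    """Toy VQ-VAE-style tokenization: map each flattened depth patch to
    the index of its nearest codebook entry, yielding discrete tokens an
    autoregressive LLM can emit alongside text."""
    # Squared distances: (num_patches, codebook_size)
    d = ((depth_patches[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d.argmin(axis=1)
    recon = codebook[tokens]  # decoded (quantized) depth patches
    return tokens, recon

rng = np.random.default_rng(2)
codebook = rng.standard_normal((256, 16))   # 256 illustrative depth codes
patches = codebook[[3, 41, 255]] + 0.01 * rng.standard_normal((3, 16))
tokens, recon = quantize_depth(patches, codebook)
print(tokens.tolist())                      # [3, 41, 255]
```

Because the tokens index a fixed codebook, depth becomes just another vocabulary the model can generate before answering, which is what enables the explicit spatial chain-of-thought.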

[187] VesselTok: Tokenizing Vessel-like 3D Biomedical Graph Representations for Reconstruction and Generation

Chinmay Prabhakar, Bastian Wittmann, Tamaz Amiranashvili, Paul Büschl, Ezequiel de la Rosa, Julian McGinnis, Benedikt Wiestler, Bjoern Menze, Suprosanna Shit

Main category: cs.CV

TL;DR: VesselTok is a framework that learns latent representations (tokens) for spatially dense anatomical graphs by treating them as parametric shapes, using centerline points with pseudo radius to encode tubular geometry for various vessel-like structures.

Motivation: Spatial graphs are important for representing anatomical structures like blood vessels and airways, but high spatial resolution increases computational complexity. There's a need for efficient modeling of these complex tubular structures in clinical and biomedical research.

Method: VesselTok approaches spatially dense graphs from a parametric shape perspective, learning latent representations conditioned on centerline points with pseudo radius to encode tubular geometry. It learns neural implicit representations of vessel-like structures.

Result: The framework demonstrates performance across diverse anatomies (lung airways, lung vessels, brain vessels), showing robust encoding of complex topologies. The learned representations generalize to unseen anatomies, support generative modeling of plausible graphs, and transfer effectively to downstream inverse problems like link prediction.

Conclusion: VesselTok provides an effective framework for learning latent representations of complex anatomical graphs, addressing computational challenges while enabling generalization, generative modeling, and transfer to downstream tasks.

Abstract: Spatial graphs provide a lightweight and elegant representation of curvilinear anatomical structures such as blood vessels, lung airways, and neuronal networks. Accurately modeling these graphs is crucial in clinical and (bio-)medical research. However, the high spatial resolution of large networks drastically increases their complexity, resulting in significant computational challenges. In this work, we aim to tackle these challenges by proposing VesselTok, a framework that approaches spatially dense graphs from a parametric shape perspective to learn latent representations (tokens). VesselTok leverages centerline points with a pseudo radius to effectively encode tubular geometry. Specifically, we learn a novel latent representation conditioned on centerline points to encode neural implicit representations of vessel-like, tubular structures. We demonstrate VesselTok’s performance across diverse anatomies, including lung airways, lung vessels, and brain vessels, highlighting its ability to robustly encode complex topologies. To prove the effectiveness of VesselTok’s learnt latent representations, we show that they (i) generalize to unseen anatomies, (ii) support generative modeling of plausible anatomical graphs, and (iii) transfer effectively to downstream inverse problems, such as link prediction.
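VesselTok's input parameterization, centerline points paired with a pseudo radius, can be illustrated with a toy branch. The helix and taper functions are hypothetical stand-ins for a real centerline extracted from a segmentation; the (x, y, z, r) row layout is an assumption.

```python
import numpy as np

def centerline_tokens(curve_fn, radius_fn, n_points=32):
    """Toy vessel-branch encoding: sample the centerline and a pseudo
    radius along it, producing a sequence of (x, y, z, r) rows that
    compactly describes tubular geometry."""
    t = np.linspace(0.0, 1.0, n_points)
    pts = np.stack([curve_fn(ti) for ti in t])        # (n, 3) centerline
    r = np.array([radius_fn(ti) for ti in t])         # (n,) pseudo radius
    return np.concatenate([pts, r[:, None]], axis=1)  # (n, 4)

# A helical "vessel" that tapers toward its tip.
helix = lambda t: np.array([np.cos(4 * np.pi * t), np.sin(4 * np.pi * t), t])
taper = lambda t: 0.5 * (1.0 - 0.8 * t)
seq = centerline_tokens(helix, taper)
print(seq.shape)   # (32, 4)
```

A few dozen such rows describe a branch that would take thousands of voxels to represent densely, which is the computational saving the graph view provides.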

[188] Statistical Characteristic-Guided Denoising for Rapid High-Resolution Transmission Electron Microscopy Imaging

Hesong Li, Ziqi Wu, Ruiwen Shao, Ying Fu

Main category: cs.CV

TL;DR: A statistical characteristic-guided denoising network for HRTEM images that uses spatial and frequency domain guidance to remove noise while preserving atomic positions for nucleation observation.

Motivation: HRTEM imaging for atomic-scale nucleation dynamics requires short-exposure rapid imaging, which introduces severe noise that obscures atomic positions, making denoising crucial for accurate observation.

Method: Proposes a statistical characteristic-guided denoising network with spatial deviation-guided weighting (selects convolution operations per spatial position) and frequency band-guided weighting (enhances signals/suppresses noise). Includes HRTEM-specific noise calibration and dataset generation with disordered structures and realistic HRTEM noises.

Result: Outperforms state-of-the-art methods in HRTEM image denoising on both synthetic and real data, with demonstrated effectiveness in localization downstream tasks.

Conclusion: The proposed method effectively denoises HRTEM images for nucleation observation by leveraging statistical characteristics in both spatial and frequency domains, with practical applications in materials science research.

Abstract: High-Resolution Transmission Electron Microscopy (HRTEM) enables atomic-scale observation of nucleation dynamics, supporting research on advanced solid materials. However, because nucleation evolves on millisecond timescales, it requires short-exposure rapid imaging, leading to severe noise that obscures atomic positions. In this work, we propose a statistical characteristic-guided denoising network, which utilizes statistical characteristics to guide the denoising process in both spatial and frequency domains. In the spatial domain, we present spatial deviation-guided weighting to select appropriate convolution operations for each spatial position based on deviation characteristics. In the frequency domain, we present frequency band-guided weighting to enhance signals and suppress noise based on band characteristics. We also develop an HRTEM-specific noise calibration method and generate a dataset with disordered structures and realistic HRTEM image noise, ensuring the denoising performance of models on real images for nucleation observation. Experiments on synthetic and real data show our method outperforms the state-of-the-art methods in HRTEM image denoising, with effectiveness in the localization downstream task. Code will be available at https://github.com/HeasonLee/SCGN.
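Frequency band-guided weighting can be sketched as per-band gains on a radially partitioned 2D spectrum. In the paper the gains are predicted from band statistics; the uniform radial banding and hand-picked gains below are illustrative assumptions.

```python
import numpy as np

def band_weighted_filter(img, band_gains):
    """Toy frequency band-guided weighting: split the 2D spectrum into
    radial bands and scale each band by a gain, attenuating high-frequency
    noise while preserving lower-frequency lattice signal."""
    F = np.fft.fftshift(np.fft.fft2(img))
    h, w = img.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)
    r = r / r.max()                                  # normalized radius
    n_bands = len(band_gains)
    band = np.minimum((r * n_bands).astype(int), n_bands - 1)
    F_weighted = F * np.take(band_gains, band)
    return np.fft.ifft2(np.fft.ifftshift(F_weighted)).real

img = np.random.default_rng(3).random((64, 64))
identity = band_weighted_filter(img, np.ones(4))     # all-pass gains
denoised = band_weighted_filter(img, np.array([1.0, 0.8, 0.4, 0.1]))
print(np.allclose(identity, img))                    # True
```

HRTEM lattice fringes concentrate energy in specific bands, which is why band-wise gains can suppress noise without erasing atomic-position signal.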

[189] Towards Interpretable Foundation Models for Retinal Fundus Images

Samuel Ofosu Mensah, Maria Camila Roa Carvajal, Kerol Djoumessi, Philipp Berens

Main category: cs.CV

TL;DR: Dual-IFM is an interpretable-by-design foundation model for medical imaging that provides both local (class evidence maps) and global (2D projection visualization) interpretability while achieving competitive performance with much smaller models.

Motivation: Current foundation models lack interpretability, which is critical for high-stakes domains like medical imaging where understanding model decisions is essential for clinical trust and adoption.

Method: Proposes Dual-IFM with two interpretability mechanisms: 1) local class evidence maps for individual image decisions, and 2) global 2D projection layer for visualizing representation space. Trained via self-supervised learning on 800,000+ retinal fundus images.

Result: Achieves performance comparable to state-of-the-art foundation models with up to 16× fewer parameters while providing interpretable predictions on out-of-distribution data.

Conclusion: Large-scale SSL pretraining combined with inherent interpretability can produce robust representations for medical imaging, particularly retinal imaging.

Abstract: Foundation models are used to extract transferable representations from large amounts of unlabeled data, typically via self-supervised learning (SSL). However, many of these models rely on architectures that offer limited interpretability, which is a critical issue in high-stakes domains such as medical imaging. We propose Dual-IFM, a foundation model that is interpretable-by-design in two ways: First, it provides local interpretability for individual images through class evidence maps that are faithful to the decision-making process. Second, it provides global interpretability for entire datasets through a 2D projection layer that allows for direct visualization of the model’s representation space. We trained our model on over 800,000 color fundus photographs from various sources to learn generalizable, interpretable representations for different downstream tasks. Our results show that our model reaches a performance range similar to that of state-of-the-art foundation models with up to $16\times$ the number of parameters, while providing interpretable predictions on out-of-distribution data. Our results suggest that large-scale SSL pretraining paired with inherent interpretability can lead to robust representations for retinal imaging.

[190] HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas

Main category: cs.CV

TL;DR: HORNet: A lightweight frame selection policy trained with Group Relative Policy Optimization to optimize which frames a frozen vision-language model sees for video question answering, improving efficiency and accuracy.

Motivation: Current video QA systems rely on uniform or heuristic frame sampling that cannot be optimized for downstream answering quality, leading to inefficiency and suboptimal performance.

Method: Introduces HORNet with <1M trainable parameters using Group Relative Policy Optimization (GRPO) to learn optimal frame selection policies for frozen VLMs, formalizing this as Select Any Frames (SAF) task.

Result: Reduces input frames by up to 99% and VLM processing time by up to 93%, improves answer quality (+1.7% F1 on MSVD-QA), achieves strong temporal reasoning (+7.3 points on NExT-QA), and transfers across VLM answerers without retraining.

Conclusion: Optimizing what a VLM sees is a practical and complementary alternative to optimizing what it generates, improving efficiency while maintaining or enhancing performance across diverse video QA tasks.

Abstract: Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce \textbf{HORNet}, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained selection generalizes better out-of-distribution than supervised and PPO alternatives. HORNet’s policy further transfers across VLM answerers without retraining, yielding an additional 8.5% relative gain when paired with a stronger model. Evaluated across six benchmarks spanning 341,877 QA pairs and 114.2 hours of video, our results demonstrate that optimizing \emph{what} a VLM sees is a practical and complementary alternative to optimizing what it generates while improving efficiency. Code is available at https://github.com/ostadabbas/HORNet.
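The group-relative advantage at the core of GRPO, which lets HORNet's selector learn without a value critic, is a one-liner. The rewards below are illustrative; in the paper they would come from scoring the frozen VLM's answers for each sampled frame subset.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO advantage estimate: normalize each sampled action's reward
    against the mean and std of its own group, so no learned value
    function (critic) is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Rewards for 4 sampled frame subsets on one question (illustrative).
adv = grpo_advantages([0.2, 0.9, 0.4, 0.9])
print(np.round(adv, 2))
print(abs(adv.mean()) < 1e-9)   # group-centered by construction
```

Frame subsets that let the VLM answer better than the group average get positive advantage and are reinforced; the rest are suppressed.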

[191] Motion-o: Trajectory-Grounded Video Reasoning

Bishoy Galoaa, Shayda Moezzi, Xiangyu Bai, Sarah Ostadabbas

Main category: cs.CV

TL;DR: Motion-o is a motion-centric video understanding extension for VLMs that makes object trajectories explicit and verifiable through structured motion reasoning with MCoT and trajectory-grounding datasets.

Motivation: Current video reasoning models lack explicit understanding of how objects move between observations: trajectories remain implicit and difficult to verify, creating a gap in spatial-temporal reasoning capabilities.

Method: Introduces Motion-o extension for VLMs with Motion Chain of Thought (MCoT) using tags to summarize object direction, speed, and scale changes; creates trajectory-grounding dataset via augmentation; uses reward function for visual evidence reasoning without architectural changes.

Result: Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as critical for evidence-based video understanding.

Conclusion: Making object trajectories explicit and verifiable through structured motion reasoning significantly enhances video understanding capabilities, addressing a previously overlooked aspect of spatial-temporal reasoning.

Abstract: Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about \emph{how} objects move between observations: no prior work has articulated the motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric video understanding extension to visual language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset artifact that expands sparse keyframe supervision via augmentation to yield denser bounding box tracks and a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that makes object trajectories explicit through a discrete \texttt{} tag summarizing per-object direction, speed, and scale (of velocity) change, explicitly connecting grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all while requiring no architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.

[192] PromptHub: Enhancing Multi-Prompt Visual In-Context Learning with Locality-Aware Fusion, Concentration and Alignment

Tianci Luo, Jinpeng Wang, Shiyu Qin, Niu Lian, Yan Feng, Bin Chen, Chun Yuan, Shu-Tao Xia

Main category: cs.CV

TL;DR: PromptHub is a framework for visual in-context learning that improves multi-prompt fusion through locality-aware mechanisms, complementary objectives, and data augmentation, outperforming patch-wise approaches on vision tasks.

Motivation: Current visual in-context learning methods use patch-wise fusion frameworks and model-agnostic supervision, which limit performance gains by failing to exploit informative cues from demonstrations. There's a need for a more holistic approach to multi-prompting.

Method: PromptHub introduces locality-aware fusion to capture richer contextual information using spatial priors, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to reinforce supervision.

Result: Extensive experiments on three fundamental vision tasks demonstrate PromptHub’s superiority. The framework shows universality, transferability, and robustness across out-of-distribution settings and various retrieval scenarios.

Conclusion: PromptHub establishes a reliable locality-aware paradigm for prompt fusion that moves beyond prior patch-wise approaches, providing a more effective framework for visual in-context learning.

Abstract: Visual In-Context Learning (VICL) aims to complete vision tasks by imitating pixel demonstrations. Recent work pioneered prompt fusion that combines the advantages of various demonstrations, which shows a promising way to extend VICL. Unfortunately, the patch-wise fusion framework and model-agnostic supervision hinder the exploitation of informative cues, thereby limiting performance gains. To overcome this deficiency, we introduce PromptHub, a framework that holistically strengthens multi-prompting through locality-aware fusion, concentration and alignment. PromptHub exploits spatial priors to capture richer contextual information, employs complementary concentration, alignment, and prediction objectives to mutually guide training, and incorporates data augmentation to further reinforce supervision. Extensive experiments on three fundamental vision tasks demonstrate the superiority of PromptHub. Moreover, we validate its universality, transferability, and robustness across out-of-distribution settings, and various retrieval scenarios. This work establishes a reliable locality-aware paradigm for prompt fusion, moving beyond prior patch-wise approaches. Code is available at https://github.com/luotc-why/ICLR26-PromptHub.

[193] MultihopSpatial: Multi-hop Compositional Spatial Reasoning Benchmark for Vision-Language Model

Youngwan Lee, Soojin Jang, Yoorhim Cho, Seunghwan Lee, Yong-Ju Lee, Sung Ju Hwang

Main category: cs.CV

TL;DR: MultihopSpatial benchmark for multi-hop compositional spatial reasoning in VLMs with precise visual grounding evaluation

DetailsMotivation: Existing spatial reasoning benchmarks focus on elementary single-hop relations, neglecting multi-hop compositional reasoning and precise visual grounding needed for real-world VLA agents in physical environments

Method: Introduces MultihopSpatial benchmark with 1-3 hop complex queries, Acc@50IoU metric for simultaneous reasoning and bounding box evaluation, and MultihopSpatial-Train corpus for training

Result: Evaluation of 37 state-of-the-art VLMs shows compositional spatial reasoning remains challenging; reinforcement learning post-training on the corpus improves both spatial reasoning and downstream embodied manipulation

Conclusion: MultihopSpatial addresses critical gaps in spatial reasoning evaluation for VLMs/VLAs, providing tools to enhance spatial intelligence essential for real-world deployment

Abstract: Spatial reasoning is foundational for Vision-Language Models (VLMs), particularly when deployed as Vision-Language-Action (VLA) agents in physical environments. However, existing benchmarks predominantly focus on elementary, single-hop relations, neglecting the multi-hop compositional reasoning and precise visual grounding essential for real-world scenarios. To address this, we introduce MultihopSpatial, offering three key contributions: (1) A comprehensive benchmark designed for multi-hop and compositional spatial reasoning, featuring 1- to 3-hop complex queries across diverse spatial perspectives. (2) Acc@50IoU, a complementary metric that simultaneously evaluates reasoning and visual grounding by requiring both answer selection and precise bounding box prediction - capabilities vital for robust VLA deployment. (3) MultihopSpatial-Train, a dedicated large-scale training corpus to foster spatial intelligence. Extensive evaluation of 37 state-of-the-art VLMs yields eight key insights, revealing that compositional spatial reasoning remains a formidable challenge. Finally, we demonstrate that reinforcement learning post-training on our corpus enhances both intrinsic VLM spatial reasoning and downstream embodied manipulation performance.
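The Acc@50IoU metric described above scores a sample only when both conditions hold: the selected answer is correct and the predicted bounding box overlaps the ground truth with IoU of at least 0.5. A minimal sketch of how such a joint metric could be computed (the box format and data layout here are illustrative assumptions, not the benchmark's actual evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_50iou(preds, gts):
    """preds/gts: lists of (answer, box). A sample counts as correct
    only if the answer matches AND the predicted box reaches IoU >= 0.5."""
    hits = sum(
        1 for (pa, pbox), (ga, gbox) in zip(preds, gts)
        if pa == ga and iou(pbox, gbox) >= 0.5
    )
    return hits / len(preds)
```

The joint requirement is what distinguishes this from plain answer accuracy: a correct answer attached to a poorly localized box earns no credit.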

[194] Translating MRI to PET through Conditional Diffusion Models with Enhanced Pathology Awareness

Yitong Li, Igor Yakushev, Dennis M. Hedderich, Christian Wachinger

Main category: cs.CV

TL;DR: PASTA is a pathology-aware diffusion model framework for synthesizing PET scans from MRI, addressing limitations of PET’s high cost and radiation exposure while improving neurodegenerative disease diagnosis.

DetailsMotivation: PET is superior for neurodegenerative disease diagnosis but has high costs and radiation exposure, while MRI is more accessible but less sensitive. Existing image translation methods focus on structural preservation but lack pathology awareness needed for accurate medical diagnosis.

Method: PASTA uses conditional diffusion models with a dual-arm architecture and multi-modal condition integration for enhanced pathology awareness. It introduces cycle exchange consistency and volumetric generation strategies for high-quality 3D PET synthesis.

Result: PASTA outperforms state-of-the-art methods in preserving both structural and pathological details. For Alzheimer’s diagnosis, synthesized PET scans improve over MRI by 4%, nearly reaching actual PET performance.

Conclusion: PASTA successfully addresses the pathology awareness gap in medical image translation, providing a practical solution for generating synthetic PET from MRI that maintains diagnostic utility while overcoming PET’s limitations.

Abstract: Positron emission tomography (PET) is a widely recognized technique for diagnosing neurodegenerative diseases, offering critical functional insights. However, its high costs and radiation exposure hinder its widespread use. In contrast, magnetic resonance imaging (MRI) does not involve such limitations. While MRI also detects neurodegenerative changes, it is less sensitive for diagnosis compared to PET. To overcome such limitations, one approach is to generate synthetic PET from MRI. Recent advances in generative models have paved the way for cross-modality medical image translation; however, existing methods largely emphasize structural preservation while neglecting the critical need for pathology awareness. To address this gap, we propose PASTA, a novel image translation framework built on conditional diffusion models with enhanced pathology awareness. PASTA surpasses state-of-the-art methods by preserving both structural and pathological details through its highly interactive dual-arm architecture and multi-modal condition integration. Additionally, we introduce a novel cycle exchange consistency and volumetric generation strategy that significantly enhances PASTA’s ability to produce high-quality 3D PET images. Our qualitative and quantitative results demonstrate the high quality and pathology awareness of the synthesized PET scans. For Alzheimer’s diagnosis, the performance of these synthesized scans improves over MRI by 4%, almost reaching the performance of actual PET. Our code is available at https://github.com/ai-med/PASTA.

[195] GHOST: Fast Category-agnostic Hand-Object Interaction Reconstruction from RGB Videos using Gaussian Splatting

Ahmed Tawfik Aboukhadra, Marcel Rogge, Nadia Robertini, Abdalla Arafa, Jameel Malik, Ahmed Elhayek, Didier Stricker

Main category: cs.CV

TL;DR: GHOST is a fast, category-agnostic framework using 2D Gaussian Splatting to reconstruct dynamic hand-object interactions from monocular RGB videos, achieving physically consistent and animatable reconstructions with significant speed improvements.

DetailsMotivation: Existing methods for understanding hand-object interactions rely on category-specific templates or heavy computation, often producing physically inconsistent 3D alignments. There's a need for fast, category-agnostic approaches that can handle realistic interactions from monocular RGB videos for AR/VR, robotics, and embodied AI applications.

Method: GHOST uses 2D Gaussian Splatting to represent hands and objects as dense, view-consistent Gaussian discs. Key innovations include: (1) geometric-prior retrieval and consistency loss for completing occluded object regions, (2) grasp-aware alignment refining hand translations and object scale for realistic contact, and (3) hand-aware background loss preventing penalization of hand-occluded object regions.

Result: GHOST achieves complete, physically consistent, and animatable reconstructions from single RGB videos while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality.

Conclusion: GHOST establishes an efficient and robust solution for realistic hand-object interaction modeling, offering significant speed advantages while maintaining high reconstruction quality, making it suitable for practical applications in AR/VR, robotics, and embodied AI.

Abstract: Understanding realistic hand-object interactions from monocular RGB videos is essential for AR/VR, robotics, and embodied AI. Existing methods rely on category-specific templates or heavy computation, yet still produce physically inconsistent hand-object alignment in 3D. We introduce GHOST (Gaussian Hand-Object Splatting), a fast, category-agnostic framework for reconstructing dynamic hand-object interactions using 2D Gaussian Splatting. GHOST represents both hands and objects as dense, view-consistent Gaussian discs and introduces three key innovations: (1) a geometric-prior retrieval and consistency loss that completes occluded object regions, (2) a grasp-aware alignment that refines hand translations and object scale to ensure realistic contact, and (3) a hand-aware background loss that prevents penalizing hand-occluded object regions. GHOST achieves complete, physically consistent, and animatable reconstructions from a single RGB video while running an order of magnitude faster than prior category-agnostic methods. Extensive experiments on ARCTIC, HO3D, and in-the-wild datasets demonstrate state-of-the-art accuracy in 3D reconstruction and 2D rendering quality, establishing GHOST as an efficient and robust solution for realistic hand-object interaction modeling. Code is available at https://github.com/ATAboukhadra/GHOST.

[196] Unsupervised Contrastive Learning for Efficient and Robust Spectral Shape Matching

Feifan Luo, Hongyang Chen

Main category: cs.CV

TL;DR: Unsupervised contrastive learning approach for 3D shape matching that eliminates traditional functional map solvers and improves feature quality through contrastive learning.

DetailsMotivation: Existing deep functional map methods focus on optimizing maps rather than improving feature representations, rely on computationally expensive functional map solvers, and have inadequate feature quality leading to suboptimal matching performance.

Method: Two-component approach: 1) Unsupervised contrastive learning framework maximizing consistency in positive pairs and minimizing in negative pairs to improve feature quality; 2) Simplified functional map learning architecture eliminating expensive solvers and auxiliary losses.

Result: Achieves state-of-the-art performance in both accuracy and efficiency across challenging benchmarks including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.

Conclusion: The proposed unsupervised contrastive learning approach provides efficient and robust 3D shape matching by improving feature representations and eliminating computational bottlenecks of traditional functional map methods.

Abstract: Estimating correspondences between pairs of non-rigid deformable 3D shapes remains a significant challenge in computer vision and graphics. While deep functional map methods have become the go-to solution for addressing this problem, they primarily focus on optimizing pointwise and functional maps either individually or jointly, rather than directly enhancing feature representations in the embedding space, which often results in inadequate feature quality and suboptimal matching performance. Furthermore, these approaches heavily rely on traditional functional map techniques, such as time-consuming functional map solvers, which incur substantial computational costs. In this work, we introduce, for the first time, a novel unsupervised contrastive learning-based approach for efficient and robust 3D shape matching. We begin by presenting an unsupervised contrastive learning framework that promotes feature learning by maximizing consistency within positive similarity pairs and minimizing it within negative similarity pairs, thereby improving both the consistency and discriminability of the learned features. We then design a significantly simplified functional map learning architecture that eliminates the need for computationally expensive functional map solvers and multiple auxiliary functional map losses, greatly enhancing computational efficiency. By integrating these two components into a unified two-branch pipeline, our method achieves state-of-the-art performance in both accuracy and efficiency. Extensive experiments demonstrate that our approach is not only computationally efficient but also outperforms current state-of-the-art methods across various challenging benchmarks, including near-isometric, non-isometric, and topologically inconsistent scenarios, even surpassing supervised techniques.
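The "maximize consistency within positive pairs, minimize it within negative pairs" objective is the standard contrastive-learning pattern. A generic InfoNCE-style sketch of that idea (this is an illustration of the general technique, not the paper's exact loss or feature extractor):

```python
import numpy as np

def contrastive_loss(feat_a, feat_b, temperature=0.07):
    """Generic InfoNCE-style loss: row i of feat_a and row i of feat_b
    are treated as a corresponding (positive) pair of per-vertex features
    from two shapes; all other rows serve as negatives."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                  # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))             # pull positives together
```

When corresponding features agree, the loss is near zero; when correspondences are scrambled, it grows, which is exactly the pressure that sharpens feature discriminability.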

[197] VGGT-360: Geometry-Consistent Zero-Shot Panoramic Depth Estimation

Jiayi Yuan, Haobo Jiang, De Wen Soh, Na Zhao

Main category: cs.CV

TL;DR: VGGT-360 is a training-free framework for zero-shot, geometry-consistent panoramic depth estimation that reformulates the task as panoramic reprojection over multi-view 3D models using VGGT foundation models.

DetailsMotivation: Prior training-free approaches for panoramic depth estimation are view-independent and lack geometric consistency. The authors aim to create a framework that unifies fragmented per-view reasoning into coherent panoramic understanding while maintaining 3D consistency.

Method: Three plug-and-play modules: 1) Uncertainty-guided adaptive projection slices panoramas into perspective views using gradient-based uncertainty to allocate denser views to geometry-poor regions; 2) Structure-saliency enhanced attention injects structure-aware confidence into VGGT’s attention layers; 3) Correlation-weighted 3D model correction refines reconstructed 3D models using attention-inferred correlation scores.

Result: VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor/outdoor datasets.

Conclusion: The framework successfully achieves zero-shot, geometry-consistent panoramic depth estimation by leveraging VGGT foundation models’ intrinsic 3D consistency and integrating specialized modules for domain adaptation, attention enhancement, and 3D model refinement.

Abstract: This paper presents VGGT-360, a novel training-free framework for zero-shot, geometry-consistent panoramic depth estimation. Unlike prior view-independent training-free approaches, VGGT-360 reformulates the task as panoramic reprojection over multi-view reconstructed 3D models by leveraging the intrinsic 3D consistency of VGGT-like foundation models, thereby unifying fragmented per-view reasoning into a coherent panoramic understanding. To achieve robust and accurate estimation, VGGT-360 integrates three plug-and-play modules that form a unified panorama-to-3D-to-depth framework: (i) Uncertainty-guided adaptive projection slices panoramas into perspective views to bridge the domain gap between panoramic inputs and VGGT’s perspective prior. It estimates gradient-based uncertainty to allocate denser views to geometry-poor regions, yielding geometry-informative inputs for VGGT. (ii) Structure-saliency enhanced attention strengthens VGGT’s robustness during 3D reconstruction by injecting structure-aware confidence into its attention layers, guiding focus toward geometrically reliable regions and enhancing cross-view coherence. (iii) Correlation-weighted 3D model correction refines the reconstructed 3D model by reweighting overlapping points using attention-inferred correlation scores, providing a consistent geometric basis for accurate panoramic reprojection. Extensive experiments show that VGGT-360 outperforms both trained and training-free state-of-the-art methods across multiple resolutions and diverse indoor and outdoor datasets.

[198] CRAFT: Aligning Diffusion Models with Fine-Tuning Is Easier Than You Think

Zening Sun, Zhengpeng Xie, Lichen Bai, Shitong Shao, Shuo Yang, Zeke Xie

Main category: cs.CV

TL;DR: CRAFT is a lightweight fine-tuning paradigm for diffusion models that uses composite reward filtering to create high-quality training data and achieves superior performance with only 100 samples while being 11-220x faster than baseline methods.

DetailsMotivation: Existing diffusion model alignment methods (SFT and DPO-style) face two main challenges: dependency on costly high-quality data or inconsistent preference datasets, and computational inefficiency. There's a need for a more efficient fine-tuning approach that requires less data while maintaining performance.

Method: Proposes Composite Reward Assisted Fine-Tuning (CRAFT) with two components: 1) Composite Reward Filtering (CRF) to construct high-quality, consistent training datasets, and 2) an enhanced variant of supervised fine-tuning. Theoretically proves CRAFT optimizes the lower bound of group-based reinforcement learning.

Result: CRAFT with only 100 samples outperforms state-of-the-art preference optimization methods using thousands of preference-paired samples. Achieves 11-220x faster convergence than baseline methods, demonstrating extremely high efficiency.

Conclusion: CRAFT provides a lightweight, efficient fine-tuning paradigm for diffusion models that addresses data dependency and computational inefficiency challenges while establishing principled connections between SFT and reinforcement learning.

Abstract: Aligning diffusion models has achieved remarkable breakthroughs in generating high-quality, human preference-aligned images. Existing techniques, such as supervised fine-tuning (SFT) and DPO-style preference optimization, have become principled tools for fine-tuning diffusion models. However, SFT relies on high-quality images that are costly to obtain, while DPO-style methods depend on large-scale preference datasets, which are often inconsistent in quality. Beyond data dependency, these methods are further constrained by computational inefficiency. To address these two challenges, we propose Composite Reward Assisted Fine-Tuning (CRAFT), a lightweight yet powerful fine-tuning paradigm that requires significantly reduced training data while maintaining computational efficiency. It first leverages a Composite Reward Filtering (CRF) technique to construct a high-quality and consistent training dataset and then performs an enhanced variant of SFT. We also theoretically prove that CRAFT actually optimizes the lower bound of group-based reinforcement learning, establishing a principled connection between SFT with selected data and reinforcement learning. Our extensive empirical results demonstrate that CRAFT with only 100 samples can easily outperform recent SOTA preference optimization methods with thousands of preference-paired samples. Moreover, CRAFT can even achieve 11-220x faster convergence than the baseline preference optimization methods, highlighting its extremely high efficiency.
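The filter-then-fine-tune recipe above can be sketched in a few lines: score each generated sample with several reward functions, combine the scores, and keep only the top-scoring samples as the SFT dataset. The reward functions, weights, and cutoff here are illustrative assumptions, not the paper's actual CRF configuration:

```python
def composite_reward_filter(samples, reward_fns, weights, keep_top=100):
    """Score each sample with a weighted sum of reward functions and
    return the keep_top highest-scoring samples for fine-tuning."""
    scored = []
    for s in samples:
        score = sum(w * fn(s) for fn, w in zip(reward_fns, weights))
        scored.append((score, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:keep_top]]
```

Selecting a small high-reward subset is what lets an SFT-style update stand in for far more expensive preference optimization, consistent with the paper's claim that 100 filtered samples suffice.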

[199] Unleashing the Power of Simplicity: A Minimalist Strategy for State-of-the-Art Fingerprint Enhancement

Raffaele Cappelli

Main category: cs.CV

TL;DR: A minimalist approach to fingerprint enhancement with two novel methods (contextual filtering and learning-based) that outperform complex state-of-the-art techniques on challenging latent fingerprints.

DetailsMotivation: Fingerprint recognition systems depend on accurate minutiae extraction which requires high-quality fingerprint images. Current enhancement methods struggle with low-quality fingerprints and are computationally demanding, so the authors propose a simpler, more effective approach.

Method: Two novel methods: 1) Contextual filtering method, and 2) Learning-based method. Both prioritize simplicity and effectiveness over complexity, with open-source implementation for reproducibility.

Result: The proposed methods consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. Validation was done using a challenging latent fingerprint database.

Conclusion: Simplicity is key to achieving high-quality fingerprint enhancement, and future research should balance complexity with practical benefits. The open-source implementation encourages further advancements.

Abstract: Fingerprint recognition systems, which rely on the unique characteristics of human fingerprints, are essential in modern security and verification applications. Accurate minutiae extraction, a critical step in these systems, depends on the quality of fingerprint images. Despite recent improvements in fingerprint enhancement techniques, state-of-the-art methods often struggle with low-quality fingerprints and can be computationally demanding. This paper presents a minimalist approach to fingerprint enhancement, prioritizing simplicity and effectiveness. Two novel methods are introduced: a contextual filtering method and a learning-based method. These techniques consistently outperform complex state-of-the-art methods, producing clearer, more accurate, and less noisy images. The effectiveness of these methods is validated using a challenging latent fingerprint database. The open-source implementation of these techniques not only fosters reproducibility but also encourages further advancements in the field. The findings underscore the importance of simplicity in achieving high-quality fingerprint enhancement and suggest that future research should balance complexity and practical benefits.

[200] Generalized Hand-Object Pose Estimation with Occlusion Awareness

Hui Yang, Wei Sun, Jian Liu, Jian Xiao, Tao Xie, Hossein Rahmani, Ajmal Saeed Mian, Nicu Sebe, Gim Hee Lee

Main category: cs.CV

TL;DR: GenHOI: A framework for generalized 3D hand-object pose estimation from single RGB images with occlusion awareness, using hierarchical semantic prompts and multi-modal masked modeling.

DetailsMotivation: 3D hand-object pose estimation from single RGB images is challenging due to object appearance variations and interaction patterns, especially under heavy occlusion where visual cues are missing or ambiguous.

Method: Integrates hierarchical semantic knowledge with hand priors using textual descriptions of object states, hand configurations, and interaction patterns. Employs multi-modal masked modeling over RGB images, predicted point clouds, and textual descriptions for occlusion reasoning. Uses hand priors as spatial references for implicit interaction constraints.

Result: Achieves state-of-the-art performance on challenging DexYCB and HO3Dv2 benchmarks for hand-object pose estimation.

Conclusion: GenHOI effectively handles occlusion and generalizes to unseen objects and novel interactions by combining hierarchical semantic prompts with multi-modal reasoning and hand priors.

Abstract: Generalized 3D hand-object pose estimation from a single RGB image remains challenging due to the large variations in object appearances and interaction patterns, especially under heavy occlusion. We propose GenHOI, a framework for generalized hand-object pose estimation with occlusion awareness. GenHOI integrates hierarchical semantic knowledge with hand priors to enhance model generalization under challenging occlusion conditions. Specifically, we introduce a hierarchical semantic prompt that encodes object states, hand configurations, and interaction patterns via textual descriptions. This enables the model to learn abstract high-level representations of hand-object interactions for generalization to unseen objects and novel interactions while compensating for missing or ambiguous visual cues. To enable robust occlusion reasoning, we adopt a multi-modal masked modeling strategy over RGB images, predicted point clouds, and textual descriptions. Moreover, we leverage hand priors as stable spatial references to extract implicit interaction constraints. This allows reliable pose inference even under significant variations in object shapes and interaction patterns. Extensive experiments on the challenging DexYCB and HO3Dv2 benchmarks demonstrate that our method achieves state-of-the-art performance in hand-object pose estimation.

[201] Rethinking MLLM Itself as a Segmenter with a Single Segmentation Token

Anqi Zhang, Xiaokang Ji, Guangyu Gao, Jianbo Jiao, Chi Harold Liu, Yunchao Wei

Main category: cs.CV

TL;DR: SELF1E enables segmentation directly from MLLMs without external mask decoders by preserving original image resolution, refilling features with residuals, and using dual attention pathways.

DetailsMotivation: Current MLLM-based segmentation methods rely heavily on specialist mask decoders or additional tokens, which adds complexity. The paper investigates whether segmentation can be unlocked directly from MLLMs themselves using just one segmentation embedding.

Method: 1) Retains image features at original uncompressed resolution and refills them with residual features from MLLM-processed compressed features. 2) Uses pixel-unshuffle operations on both processed and unprocessed image features to enhance feature details. 3) Implements dual attention pathways (image-to-image and image-to-segmentation) for rich feature interaction between pixels and segmentation token.

Result: Comprehensive experiments across multiple segmentation tasks show SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs.

Conclusion: Segmentation can be effectively performed directly from MLLMs without external decoders by addressing resolution limitations through feature refilling and enhanced attention mechanisms, simplifying the segmentation pipeline while maintaining competitive performance.

Abstract: Recent segmentation methods leveraging Multi-modal Large Language Models (MLLMs) have shown reliable object-level segmentation and enhanced spatial perception. However, almost all previous methods predominantly rely on specialist mask decoders to interpret masks from generated segmentation-related embeddings and visual features, or incorporate multiple additional tokens to assist. This paper aims to investigate whether and how we can unlock segmentation from MLLM itSELF with 1 segmentation Embedding (SELF1E) while achieving competitive results, which eliminates the need for external decoders. To this end, our approach targets the fundamental limitation of resolution reduction in pixel-shuffled image features from MLLMs. First, we retain image features at their original uncompressed resolution, and refill them with residual features extracted from MLLM-processed compressed features, thereby improving feature precision. Subsequently, we integrate pixel-unshuffle operations on image features with and without LLM processing, respectively, to unleash the details of compressed features and amplify the residual features under uncompressed resolution, which further enhances the resolution of refilled features. Moreover, we redesign the attention mask with dual perception pathways, i.e., image-to-image and image-to-segmentation, enabling rich feature interaction between pixels and the segmentation token. Comprehensive experiments across multiple segmentation tasks validate that SELF1E achieves performance competitive with specialist mask decoder-based methods, demonstrating the feasibility of decoder-free segmentation in MLLMs. Project page: https://github.com/ANDYZAQ/SELF1E.

[202] SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models

Quentin Guimard, Federico Bartsch, Simone Caldarella, Rahaf Aljundi, Elisa Ricci, Massimiliano Mancini

Main category: cs.CV

TL;DR: SEM: A post-hoc zero-shot debiasing framework for CLIP that uses sparse autoencoder latent space to identify and modulate bias-relevant neurons while preserving semantic information.

DetailsMotivation: CLIP and similar vision-language models suffer from severe social and spurious biases due to large-scale uncurated training data. Existing debiasing methods in dense embedding space struggle because bias and task-relevant information are highly entangled, limiting their ability to remove bias without degrading semantic fidelity.

Method: Proposes Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework operating in a Sparse Autoencoder (SAE) latent space. Decomposes CLIP text embeddings into disentangled features, identifies bias-relevant neurons, and modulates them while preserving query-relevant ones, enabling precise non-linear interventions.

Result: Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification tasks. Demonstrates that sparse latent representations provide effective foundation for post-hoc debiasing of vision-language models.

Conclusion: Sparse latent representations enable more effective debiasing of vision-language models by providing disentangled features that allow precise modulation of bias-relevant neurons while preserving semantic information, outperforming methods operating in dense embedding spaces.

Abstract: Models that bridge vision and language, such as CLIP, are key components of multimodal AI, yet their large-scale, uncurated training data introduce severe social and spurious biases. Existing post-hoc debiasing methods often operate directly in the dense CLIP embedding space, where bias and task-relevant information are highly entangled. This entanglement limits their ability to remove bias without degrading semantic fidelity. In this work, we propose Sparse Embedding Modulation (SEM), a post-hoc, zero-shot debiasing framework that operates in a Sparse Autoencoder (SAE) latent space. By decomposing CLIP text embeddings into disentangled features, SEM identifies and modulates bias-relevant neurons while preserving query-relevant ones. This enables more precise, non-linear interventions. Across four benchmark datasets and two CLIP backbones, SEM achieves substantial fairness gains in retrieval and zero-shot classification. Our results demonstrate that sparse latent representations provide an effective foundation for post-hoc debiasing of vision-language models.
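The core operation, modulating selected neurons in a sparse latent code and mapping back to the embedding space, can be illustrated with a minimal sketch. The linear ReLU encoder/decoder and the precomputed `bias_dims` index list are assumptions for illustration, not SEM's actual SAE or neuron-selection procedure:

```python
import numpy as np

def debias_embedding(z, encoder_w, decoder_w, bias_dims, scale=0.0):
    """Encode a dense embedding into a sparse latent code, damp the
    dimensions flagged as bias-relevant, and decode back to the dense
    embedding space, leaving the remaining (query-relevant) neurons intact."""
    h = np.maximum(0.0, z @ encoder_w)   # ReLU sparse code
    h[..., bias_dims] *= scale           # suppress bias-relevant neurons
    return h @ decoder_w                 # back to dense embedding space
```

Because the intervention happens on disentangled sparse features rather than the dense CLIP embedding, zeroing a bias neuron need not disturb the semantic content carried by the other dimensions.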

[203] FUMO: Prior-Modulated Diffusion for Single Image Reflection Removal

Telang Xu, Chaoyang Zhang, Guangtao Zhai, Xiaohong Liu

Main category: cs.CV

TL;DR: FUMO is a diffusion model framework for single image reflection removal that uses intensity and high-frequency priors extracted from mixed images to guide the denoising process through coarse-to-fine training.

DetailsMotivation: Real-world reflection removal is challenging due to spatially varying reflection strength and tight entanglement between reflection patterns and transmission structures. Existing methods struggle with spatial controllability and structural faithfulness.

Method: Proposes a diffusion model with prior modulation framework (FUMO) using two priors: intensity prior for spatial reflection severity and high-frequency prior via multi-scale residual aggregation. Uses coarse-to-fine training: first stage gates conditional residual injections based on reflection-dominant and structure-sensitive regions, second stage refines with a fine-grained network for local alignment and detail sharpening.

Result: Competitive quantitative results on standard benchmarks and improved perceptual quality on challenging real-world images. The method demonstrates effective reflection removal while preserving transmission structures.

Conclusion: FUMO successfully addresses spatial controllability and structural faithfulness in reflection removal through explicit prior guidance and coarse-to-fine training, offering a robust solution for real-world applications.

Abstract: Single image reflection removal (SIRR) is challenging in real scenes, where reflection strength varies spatially and reflection patterns are tightly entangled with transmission structures. This paper presents a diffusion model with prior modulation framework (FUMO) that introduces explicit guidance signals to improve spatial controllability and structural faithfulness. Two priors are extracted directly from the mixed image: an intensity prior that estimates spatial reflection severity and a high-frequency prior that captures detail-sensitive responses via multi-scale residual aggregation. We propose a coarse-to-fine training paradigm. In the first stage, these cues are combined to gate the conditional residual injections, focusing the conditioning on regions that are both reflection-dominant and structure-sensitive. In the second stage, a fine-grained refinement network corrects local misalignment and sharpens fine details in the image space. Experiments conducted on both standard benchmarks and challenging images in the wild demonstrate competitive quantitative results and consistently improved perceptual quality. The code is released at https://github.com/Lucious-Desmon/FUMO.
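
The prior extraction and gating described above can be caricatured as follows. This is a loose sketch, not the paper's networks: blur-residual aggregation approximates the high-frequency prior, and an elementwise product with normalization stands in for the learned gating of conditional residual injections.

```python
import numpy as np

def high_freq_prior(img, scales=(2, 4)):
    """Aggregate residuals between the image and coarsened (down/up-sampled)
    copies of itself across scales -- a crude stand-in for the paper's
    multi-scale residual aggregation."""
    h, w = img.shape
    acc = np.zeros_like(img)
    for s in scales:
        small = img[::s, ::s]                                  # downsample
        up = np.repeat(np.repeat(small, s, axis=0), s, axis=1)[:h, :w]
        acc += np.abs(img - up)                                # detail residual
    return acc / len(scales)

def conditioning_gate(intensity_prior, hf_prior):
    # Focus conditioning on regions that are both reflection-dominant
    # and structure-sensitive (elementwise product, normalized to [0, 1]).
    g = intensity_prior * hf_prior
    return g / (g.max() + 1e-8)

rng = np.random.default_rng(1)
mixed = rng.random((16, 16))
gate = conditioning_gate(rng.random((16, 16)), high_freq_prior(mixed))
```

The gate would then scale the conditional residual injected into the denoiser at each spatial location.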

[204] TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota

Main category: cs.CV

TL;DR: TerraScope is a unified vision-language model for pixel-grounded geospatial reasoning that handles multi-modal (optical/SAR) and multi-temporal inputs, with a new dataset and benchmark for evaluation.

Motivation: Current vision-language models struggle with grounding complex spatial reasoning in precise pixel-level visual representations for earth observation tasks, particularly with multi-modal and temporal data.

Method: Introduces TerraScope VLM with modality-flexible reasoning (handles optical/SAR inputs separately or fused) and multi-temporal reasoning for change analysis. Creates Terra-CoT dataset with 1M samples containing pixel-level masks in reasoning chains, and TerraScope-Bench benchmark with 6 sub-tasks evaluating both answer accuracy and mask quality.

Result: TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence through pixel-level grounding.

Conclusion: The proposed unified VLM framework effectively addresses pixel-grounded geospatial reasoning challenges with multi-modal and temporal capabilities, supported by comprehensive datasets and benchmarks.

Abstract: Vision-language models (VLMs) have shown promise in earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning with six sub-tasks that evaluates both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.

[205] Measuring 3D Spatial Geometric Consistency in Dynamic Generated Videos

Weijia Dou, Wenzhao Zheng, Weiliang Chen, Yu Zheng, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: SGC is a new metric for evaluating 3D spatial geometric consistency in generated videos by measuring divergence among camera poses estimated from different static regions.

Motivation: Current video generation models produce high-fidelity videos but often have 3D spatial geometric inconsistencies. Existing evaluation metrics fail to properly characterize these issues: fidelity metrics are insensitive to geometric distortions, while consistency benchmarks often penalize valid foreground dynamics.

Method: Separate static from dynamic regions, partition static background into spatially coherent sub-regions, predict depth for each pixel, estimate local camera pose for each subregion, and compute divergence among these poses to quantify geometric consistency.

Result: Experiments on real and generative videos show SGC robustly quantifies geometric inconsistencies and effectively identifies critical failures missed by existing metrics.

Conclusion: SGC provides a novel metric specifically designed to evaluate 3D spatial geometric consistency in generated videos, addressing a gap in current evaluation methods.

Abstract: Recent generative models can produce high-fidelity videos, yet they often exhibit 3D spatial geometric inconsistencies. Existing evaluation methods fail to accurately characterize these inconsistencies: fidelity-centric metrics like FVD are insensitive to geometric distortions, while consistency-focused benchmarks often penalize valid foreground dynamics. To address this gap, we introduce SGC, a metric for evaluating 3D Spatial Geometric Consistency in dynamically generated videos. We quantify geometric consistency by measuring the divergence among multiple camera poses estimated from distinct local regions. Our approach first separates static from dynamic regions, then partitions the static background into spatially coherent sub-regions. We predict depth for each pixel, estimate a local camera pose for each subregion, and compute the divergence among these poses to quantify geometric consistency. Experiments on real and generative videos demonstrate that SGC robustly quantifies geometric inconsistencies, effectively identifying critical failures missed by existing metrics.
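
The core of the metric, divergence among per-subregion camera poses, can be sketched with rotation matrices. This is an illustrative simplification (rotation-only poses, mean pairwise geodesic angle as the divergence); the paper's exact pose estimator and divergence measure are not specified here.

```python
import numpy as np

def rot_geodesic(Ra, Rb):
    """Angle (radians) of the relative rotation between two poses."""
    cos = (np.trace(Ra.T @ Rb) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def pose_divergence(rotations):
    """Mean pairwise geodesic distance among per-subregion camera rotations.
    A geometrically consistent video yields near-identical local poses,
    hence a divergence close to zero."""
    n = len(rotations)
    d = [rot_geodesic(rotations[i], rotations[j])
         for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(d))

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

consistent = [rot_z(0.0)] * 4                       # all subregions agree
warped = [rot_z(t) for t in (0.0, 0.1, 0.2, 0.3)]   # subregions disagree
```

Under this toy setup, `pose_divergence(consistent)` is ~0 while `pose_divergence(warped)` is clearly positive, matching the intended use of SGC as an inconsistency score.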

[206] SwiftTailor: Efficient 3D Garment Generation with Geometry Image Representation

Phuc Pham, Uy Dieu Tran, Binh-Son Hua, Phong Nguyen

Main category: cs.CV

TL;DR: SwiftTailor: A two-stage framework for fast 3D garment generation using geometry images, achieving state-of-the-art results with significantly reduced inference time compared to existing methods.

Motivation: Existing 3D garment generation methods using vision-language models and garment modeling frameworks produce high-quality results but suffer from slow inference times (30-60 seconds). There's a need for more efficient solutions that maintain quality while being faster.

Method: Two-stage framework: 1) PatternMaker - lightweight vision-language model predicts sewing patterns from diverse inputs; 2) GarmentSewer - dense prediction transformer converts patterns into Garment Geometry Image encoding 3D surfaces in unified UV space. Final mesh reconstruction uses inverse mapping with remeshing and dynamic stitching algorithms.

Result: Achieves state-of-the-art accuracy and visual fidelity on Multimodal GarmentCodeData while significantly reducing inference time compared to existing methods.

Conclusion: SwiftTailor offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation, unifying sewing-pattern reasoning and geometry-based mesh synthesis efficiently.

Abstract: Realistic and efficient 3D garment generation remains a longstanding challenge in computer vision and digital fashion. Existing methods typically rely on large vision-language models to produce serialized representations of 2D sewing patterns, which are then transformed into simulation-ready 3D meshes using garment modeling frameworks such as GarmentCode. Although these approaches yield high-quality results, they often suffer from slow inference times, ranging from 30 seconds to a minute. In this work, we introduce SwiftTailor, a novel two-stage framework that unifies sewing-pattern reasoning and geometry-based mesh synthesis through a compact geometry image representation. SwiftTailor comprises two lightweight modules: PatternMaker, an efficient vision-language model that predicts sewing patterns from diverse input modalities, and GarmentSewer, an efficient dense prediction transformer that converts these patterns into a novel Garment Geometry Image, encoding the 3D surface of all garment panels in a unified UV space. The final 3D mesh is reconstructed through an efficient inverse mapping process that incorporates remeshing and dynamic stitching algorithms to directly assemble the garment, thereby amortizing the cost of physical simulation. Extensive experiments on the Multimodal GarmentCodeData demonstrate that SwiftTailor achieves state-of-the-art accuracy and visual fidelity while significantly reducing inference time. This work offers a scalable, interpretable, and high-performance solution for next-generation 3D garment generation.

[207] Em-Garde: A Propose-Match Framework for Proactive Streaming Video Understanding

Yikai Zheng, Xin Ding, Yifan Yang, Shiqi Jiang, Hao Wu, Qianxi Zhang, Weijun Wang, Ting Cao, Yunxin Liu

Main category: cs.CV

TL;DR: Em-Garde: A framework for proactive video understanding that decouples semantic understanding from streaming perception using instruction-guided proposals and lightweight matching

Motivation: Current proactive VideoLLMs rely on per-frame triggering, which creates an efficiency-accuracy dilemma; a solution is needed that balances computational constraints with accurate proactive responses.

Method: Two-stage approach: 1) Instruction-Guided Proposal Parser transforms user queries into structured visual proposals at query time; 2) Lightweight Proposal Matching Module performs efficient embedding-based matching during streaming to trigger responses

Result: Experiments on StreamingBench and OVO-Bench show consistent improvements over prior models in both proactive response accuracy and efficiency

Conclusion: Em-Garde provides an effective solution for proactive video understanding under strict computational constraints by decoupling semantic understanding from streaming perception

Abstract: Recent advances in Streaming Video Understanding have enabled a new interaction paradigm where models respond proactively to user queries. Current proactive VideoLLMs rely on per-frame triggering decision making, which suffers from an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
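
The lightweight matching step can be sketched as a similarity threshold over pre-computed proposal embeddings. This is a minimal illustration under assumed details (cosine similarity, a fixed threshold); the paper's matching module and embedding model are not reproduced here.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def should_trigger(frame_emb, proposal_embs, threshold=0.8):
    """Fire a proactive response when the streaming frame embedding is close
    enough to any visual proposal parsed from the query ahead of time."""
    return any(cosine(frame_emb, p) >= threshold for p in proposal_embs)

rng = np.random.default_rng(2)
proposals = [rng.standard_normal(64) for _ in range(3)]       # parsed at query time
frame = proposals[0] + 0.01 * rng.standard_normal(64)          # near-match to proposal 0
```

The expensive semantic work (parsing the query into proposals) happens once; each streamed frame costs only a few dot products.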

[208] SignAgent: Agentic LLMs for Linguistically-Grounded Sign Language Annotation and Dataset Curation

Oliver Cory, Ozge Mercanoglu Sincan, Richard Bowden

Main category: cs.CV

TL;DR: SignAgent is an LLM-based agentic framework for scalable, linguistically-grounded sign language annotation and dataset curation, addressing limitations of traditional gloss-level methods and manual annotation bottlenecks.

Motivation: Traditional computational methods for sign languages operate at gloss level, missing linguistic nuances, while manual annotation is too slow and expensive for creating large-scale, phonologically-aware datasets.

Method: Uses SignAgent Orchestrator (reasoning LLM) to coordinate linguistic tools and SignGraph (knowledge-grounded LLM) for lexical/linguistic grounding. Evaluated on pseudo-gloss annotation (multimodal evidence for gloss extraction) and ID glossing (visual clustering with phonological reasoning).

Result: The agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

Conclusion: SignAgent provides an effective framework for scalable sign language annotation by leveraging LLMs to overcome traditional limitations in computational sign language processing.

Abstract: This paper introduces SignAgent, a novel agentic framework that utilises Large Language Models (LLMs) for scalable, linguistically-grounded Sign Language (SL) annotation and dataset curation. Traditional computational methods for SLs often operate at the gloss level, overlooking crucial linguistic nuances, while manual linguistic annotation remains a significant bottleneck, proving too slow and expensive for the creation of large-scale, phonologically-aware datasets. SignAgent addresses these challenges through SignAgent Orchestrator, a reasoning LLM that coordinates a suite of linguistic tools, and SignGraph, a knowledge-grounded LLM that provides lexical and linguistic grounding. We evaluate our framework on two downstream annotation tasks. First, on Pseudo-gloss Annotation, where the agent performs constrained assignment, using multi-modal evidence to extract and order suitable gloss labels for signed sequences. Second, on ID Glossing, where the agent detects and refines visual clusters by reasoning over both visual similarity and phonological overlap to correctly identify and group lexical sign variants. Our results demonstrate that our agentic approach achieves strong performance for large-scale, linguistically-aware data annotation and curation.

[209] DROID-SLAM in the Wild

Moyang Li, Zihan Zhu, Marc Pollefeys, Daniel Barath

Main category: cs.CV

TL;DR: Real-time RGB SLAM system for dynamic environments using differentiable uncertainty-aware bundle adjustment with multi-view feature inconsistency for robust tracking

Motivation: Traditional SLAM assumes static scenes and fails in dynamic environments. Existing dynamic SLAM methods use predefined priors or uncertainty-aware mapping but struggle with unknown dynamic objects and cluttered scenes where geometry becomes unreliable.

Method: Uses differentiable Uncertainty-aware Bundle Adjustment that estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction in dynamic real-world environments.

Result: Achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS.

Conclusion: The system provides a robust solution for SLAM in dynamic environments by leveraging uncertainty estimation from visual feature inconsistencies, with code and datasets publicly available.

Abstract: We present a robust, real-time RGB SLAM system that handles dynamic environments by leveraging differentiable Uncertainty-aware Bundle Adjustment. Traditional SLAM methods typically assume static scenes, leading to tracking failures in the presence of motion. Recent dynamic SLAM approaches attempt to address this challenge using predefined dynamic priors or uncertainty-aware mapping, but they remain limited when confronted with unknown dynamic objects or highly cluttered scenes where geometric mapping becomes unreliable. In contrast, our method estimates per-pixel uncertainty by exploiting multi-view visual feature inconsistency, enabling robust tracking and reconstruction even in real-world environments. The proposed system achieves state-of-the-art camera poses and scene geometry in cluttered dynamic scenarios while running in real time at around 10 FPS. Code and datasets are available at https://github.com/MoyangLi00/DROID-W.git.
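
The uncertainty idea can be sketched numerically: pixels whose features disagree across views get down-weighted in the bundle-adjustment objective. This is an illustrative stand-in (feature variance as "inconsistency", a simple reciprocal weight), not the system's differentiable BA.

```python
import numpy as np

def uncertainty_weights(feat_views):
    """Per-pixel confidence from multi-view feature inconsistency: pixels whose
    features disagree across views (e.g. moving objects) are down-weighted."""
    inconsistency = np.var(feat_views, axis=0).mean(axis=-1)   # (H, W)
    return 1.0 / (1.0 + inconsistency)

def weighted_ba_cost(reproj_err, weights):
    # Uncertainty-aware least squares: sum over pixels of w_p * r_p^2
    return float(np.sum(weights * reproj_err ** 2))

rng = np.random.default_rng(3)
# 3 views of a 4x4 patch with 8-dim features per pixel.
static_feats = np.broadcast_to(rng.standard_normal((4, 4, 8)), (3, 4, 4, 8))
dynamic_feats = rng.standard_normal((3, 4, 4, 8))
```

Static pixels (identical features in every view) keep full weight 1.0, while inconsistent pixels contribute less to the pose/geometry optimization.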

[210] Multi-Modal Building Change Detection for Large-Scale Small Changes: Benchmark and Baseline

Ye Wang, Wei Lu, Zhihui You, Keyan Chen, Tongfei Liu, Kaiyu Li, Hongruixuan Chen, Qingling Shu, Sibao Chen

Main category: cs.CV

TL;DR: Proposes LSMD dataset and MSCNet for RGB-NIR building change detection, addressing illumination/seasonal variations and small changes in complex environments.

Motivation: RGB-only change detection suffers from pseudo-changes and semantic ambiguity due to illumination fluctuations and seasonal variations. NIR provides complementary physical cues for better material discrimination, but existing datasets lack high-resolution registered bi-temporal imagery and methods don't fully exploit modality heterogeneity.

Method: Introduces LSMD dataset with bi-temporal RGB-NIR building imagery targeting small changes. Proposes MSCNet with three modules: NCEM for local spatial details, CAIM for cross-modal alignment/interaction, and SMRM for progressive feature refinement.

Result: MSCNet outperforms existing methods under multiple input configurations, effectively leveraging multi-modal information for fine-grained building change detection.

Conclusion: The LSMD dataset provides rigorous testing for multi-modal change detection, and MSCNet demonstrates effective cross-modal fusion for improved building change detection in complex environments.

Abstract: Change detection in optical remote sensing imagery is susceptible to illumination fluctuations, seasonal changes, and variations in surface land-cover materials. Relying solely on RGB imagery often produces pseudo-changes and leads to semantic ambiguity in features. Incorporating near-infrared (NIR) information provides heterogeneous physical cues that are complementary to visible light, thereby enhancing the discriminability of building materials and tiny structures while improving detection accuracy. However, existing multi-modal datasets generally lack high-resolution and accurately registered bi-temporal imagery, and current methods often fail to fully exploit the inherent heterogeneity between these modalities. To address these issues, we introduce the Large-scale Small-change Multi-modal Dataset (LSMD), a bi-temporal RGB-NIR building change detection benchmark dataset targeting small changes in realistic scenarios, providing a rigorous testing platform for evaluating multi-modal change detection methods in complex environments. Based on LSMD, we further propose the Multi-modal Spectral Complementarity Network (MSCNet) to achieve effective cross-modal feature fusion. MSCNet comprises three key components: the Neighborhood Context Enhancement Module (NCEM) to strengthen local spatial details, the Cross-modal Alignment and Interaction Module (CAIM) to enable deep interaction between RGB and NIR features, and the Saliency-aware Multisource Refinement Module (SMRM) to progressively refine fused features. Extensive experiments demonstrate that MSCNet effectively leverages multi-modal information and consistently outperforms existing methods under multiple input configurations, validating its efficacy for fine-grained building change detection. The source code will be made publicly available at: https://github.com/AeroVILab-AHU/LSMD

[211] TAU-R1: Visual Language Model for Traffic Anomaly Understanding

Yuqiang Lin, Kehua Chen, Sam Lockyer, Arjun Yadav, Mingxuan Sui, Shucheng Zhang, Yan Shi, Bingzhang Wang, Yuang Zhang, Markus Zarbock, Florain Stanek, Adrian Evans, Wenbin Li, Yinhai Wang, Nic Zhang

Main category: cs.CV

TL;DR: TAU-R1: A two-layer vision-language framework for Traffic Anomaly Understanding using real-world roundabout videos with a novel training strategy and benchmark dataset.

Motivation: Traffic Anomaly Understanding (TAU) is crucial for traffic safety but lacks benchmarks and task-specific methodologies. Current vision-language models show promise but haven't been effectively applied to TAU due to domain-specific challenges.

Method: Proposes TAU-R1: a two-layer VLM framework with 1) lightweight anomaly classifier for coarse categorization, and 2) larger anomaly reasoner for detailed event summaries. Uses two-stage training: decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO (GRPO-based post-training with TAU-specific reward functions). Introduces Roundabout-TAU dataset with 342 real-world roundabout video clips and 2,000+ QA pairs.

Result: TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The framework effectively handles multiple aspects of traffic anomaly understanding.

Conclusion: The proposed TAU-R1 framework and Roundabout-TAU benchmark advance traffic anomaly understanding by combining vision-language models with domain-specific training strategies, enabling effective real-world traffic safety applications.

Abstract: Traffic Anomaly Understanding (TAU) is important for traffic safety in Intelligent Transportation Systems. Recent vision-language models (VLMs) have shown strong capabilities in video understanding. However, progress on TAU remains limited due to the lack of benchmarks and task-specific methodologies. To address this limitation, we introduce Roundabout-TAU, a dataset constructed from real-world roundabout videos collected in collaboration with the City of Carmel, Indiana. The dataset contains 342 clips and is annotated with more than 2,000 question-answer pairs covering multiple aspects of traffic anomaly understanding. Building on this benchmark, we propose TAU-R1, a two-layer vision-language framework for TAU. The first layer is a lightweight anomaly classifier that performs coarse anomaly categorisation, while the second layer is a larger anomaly reasoner that generates detailed event summaries. To improve task-specific reasoning, we introduce a two-stage training strategy consisting of decomposed-QA-enhanced supervised fine-tuning followed by TAU-GRPO, a GRPO-based post-training method with TAU-specific reward functions. Experimental results show that TAU-R1 achieves strong performance on both anomaly classification and reasoning tasks while maintaining deployment efficiency. The dataset and code are available at: https://github.com/siri-rouser/TAU-R1

[212] CustomTex: High-fidelity Indoor Scene Texturing via Multi-Reference Customization

Weilin Chen, Jiahao Rao, Wenhao Wang, Xinyang Li, Xuan Cheng, Liujuan Cao

Main category: cs.CV

TL;DR: CustomTex: A framework for instance-level, high-fidelity 3D scene texturing using reference images via dual-distillation approach within Variational Score Distillation optimization.

Motivation: Existing text-driven methods for 3D indoor scene texturing lack precision for fine-grained instance-level control and often produce textures with insufficient quality, artifacts, and baked-in shading. There's a need for more direct and user-friendly approaches to achieve high-quality, customizable scene appearance editing.

Method: CustomTex uses a dual-distillation approach: semantic-level distillation with instance cross-attention for semantic plausibility and reference-instance alignment, and pixel-level distillation for high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. The system takes untextured 3D scenes and reference images specifying desired appearance for each object instance to generate unified high-resolution texture maps.

Result: CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods.

Conclusion: CustomTex establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing by overcoming limitations of text-driven methods through reference-image-driven instance-level control.

Abstract: The creation of high-fidelity, customizable 3D indoor scene textures remains a significant challenge. While text-driven methods offer flexibility, they lack the precision for fine-grained, instance-level control, and often produce textures with insufficient quality, artifacts, and baked-in shading. To overcome these limitations, we introduce CustomTex, a novel framework for instance-level, high-fidelity scene texturing driven by reference images. CustomTex takes an untextured 3D scene and a set of reference images specifying the desired appearance for each object instance, and generates a unified, high-resolution texture map. The core of our method is a dual-distillation approach that separates semantic control from pixel-level enhancement. We employ semantic-level distillation, equipped with an instance cross-attention, to ensure semantic plausibility and "reference-instance" alignment, and pixel-level distillation to enforce high visual fidelity. Both are unified within a Variational Score Distillation (VSD) optimization framework. Experiments demonstrate that CustomTex achieves precise instance-level consistency with reference images and produces textures with superior sharpness, reduced artifacts, and minimal baked-in shading compared to state-of-the-art methods. Our work establishes a more direct and user-friendly path to high-quality, customizable 3D scene appearance editing.

[213] Revisiting Autoregressive Models for Generative Image Classification

Ilia Sudakov, Artem Babenko, Dmitry Baranchuk

Main category: cs.CV

TL;DR: AR-based generative classifiers with any-order token generation outperform diffusion classifiers in efficiency and match discriminative models in accuracy.

Motivation: Prior visual autoregressive (AR) generative classifiers rely on fixed token order, imposing restrictive inductive bias for image understanding. The authors aim to unlock AR models' classification potential by addressing this limitation.

Method: Leverage recent any-order AR models to estimate order-marginalized predictions, moving beyond single fixed token orders to capture more comprehensive image understanding.

Result: Approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks while being up to 25x more efficient. Delivers competitive performance compared to state-of-the-art self-supervised discriminative models.

Conclusion: AR-based generative classifiers with any-order modeling can achieve superior classification performance and efficiency compared to diffusion models, challenging the current dominance of diffusion approaches in generative classification.

Abstract: Class-conditional generative models have emerged as accurate and robust classifiers, with diffusion models demonstrating clear advantages over other visual generative paradigms, including autoregressive (AR) models. In this work, we revisit visual AR-based generative classifiers and identify an important limitation of prior approaches: their reliance on a fixed token order, which imposes a restrictive inductive bias for image understanding. We observe that single-order predictions rely more on partial discriminative cues, while averaging over multiple token orders provides a more comprehensive signal. Based on this insight, we leverage recent any-order AR models to estimate order-marginalized predictions, unlocking the high classification potential of AR models. Our approach consistently outperforms diffusion-based classifiers across diverse image classification benchmarks, while being up to 25x more efficient. Compared to state-of-the-art self-supervised discriminative models, our method delivers competitive classification performance - a notable achievement for generative classifiers.
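
The order-marginalized classification rule can be sketched generically: score each class by its average log-likelihood over several random token orders, then take the argmax. The likelihood function below is a hypothetical toy, not the paper's any-order AR model.

```python
import numpy as np

def classify_order_marginalized(log_lik_fn, tokens, n_classes, n_orders=8, seed=0):
    """Generative classification: argmax_c of the mean over random orders of
    log p(tokens | c, order). Averaging over orders replaces the single fixed
    raster order used by earlier AR classifiers."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(n_classes)
    for _ in range(n_orders):
        order = rng.permutation(len(tokens))
        for c in range(n_classes):
            scores[c] += log_lik_fn(tokens[order], c)
    return int(np.argmax(scores))

# Hypothetical toy likelihood: class c "prefers" token value c.
def toy_log_lik(tokens, c):
    return float(-np.sum((tokens - c) ** 2))

tokens = np.array([1, 1, 1, 2])
pred = classify_order_marginalized(toy_log_lik, tokens, n_classes=3)
```

With a real any-order AR model, `log_lik_fn` would sum per-token conditional log-probabilities along the sampled order; the marginalization over orders is what supplies the more comprehensive signal the paper describes.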

[214] GSMem: 3D Gaussian Splatting as Persistent Spatial Memory for Zero-Shot Embodied Exploration and Reasoning

Yiren Lu, Yi Du, Disheng Liu, Yunlai Zhou, Chen Wang, Yu Yin

Main category: cs.CV

TL;DR: GSMem: A zero-shot embodied exploration framework using 3D Gaussian Splatting as persistent spatial memory for spatial recollection and optimal viewpoint rendering

Motivation: Existing scene representations lack post-hoc re-observability - if initial observations miss targets, memory omissions are often irrecoverable. Need persistent spatial memory for embodied agents.

Method: Uses 3D Gaussian Splatting as persistent spatial memory, enabling spatial recollection to render photorealistic novel views. Employs retrieval mechanism combining object-level scene graphs and semantic-level language fields. Uses hybrid exploration strategy with VLM-driven semantic scoring and 3DGS-based coverage objective.

Result: Extensive experiments on embodied question answering and lifelong navigation demonstrate robustness and effectiveness of the framework.

Conclusion: GSMem bridges the gap in post-hoc re-observability for embodied agents by leveraging 3DGS as persistent spatial memory, enabling robust exploration and reasoning through spatial recollection.

Abstract: Effective embodied exploration requires agents to accumulate and retain spatial knowledge over time. However, existing scene representations, such as discrete scene graphs or static view-based snapshots, lack post-hoc re-observability. If an initial observation misses a target, the resulting memory omission is often irrecoverable. To bridge this gap, we propose GSMem, a zero-shot embodied exploration and reasoning framework built upon 3D Gaussian Splatting (3DGS). By explicitly parameterizing continuous geometry and dense appearance, 3DGS serves as a persistent spatial memory that endows the agent with Spatial Recollection: the ability to render photorealistic novel views from optimal, previously unoccupied viewpoints. To operationalize this, GSMem employs a retrieval mechanism that simultaneously leverages parallel object-level scene graphs and semantic-level language fields. This complementary design robustly localizes target regions, enabling the agent to "hallucinate" optimal views for high-fidelity Vision-Language Model (VLM) reasoning. Furthermore, we introduce a hybrid exploration strategy that combines VLM-driven semantic scoring with a 3DGS-based coverage objective, balancing task-aware exploration with geometric coverage. Extensive experiments on embodied question answering and lifelong navigation demonstrate the robustness and effectiveness of our framework.

[215] ADAPT: Attention Driven Adaptive Prompt Scheduling and InTerpolating Orthogonal Complements for Rare Concepts Generation

Kwanyoung Lee, Hyunwoo Oh, SeungJu Cha, Sungho Koh, Dong-Jin Kim

Main category: cs.CV

TL;DR: ADAPT is a training-free framework that uses deterministic planning and semantic alignment of prompt schedules to improve rare concept composition in text-to-image diffusion models, addressing limitations of LLM-based approaches like R2F.

Motivation: Current approaches for generating rare compositional concepts in text-to-image synthesis (like R2F) suffer from inherent variance due to LLM randomness and suboptimal guidance from iterative text embedding switching, making them unreliable for rare concept generation.

Method: ADAPT uses attention scores and orthogonal components to deterministically plan and semantically align prompt schedules, providing consistent guidance without additional training or fine-tuning. It leverages attention mechanisms to better control the composition of rare concepts.

Result: ADAPT achieves superior performance on the RareBench benchmark, accurately reflects semantic information of rare attributes, provides deterministic and precise control over rare composition generation, and maintains visual integrity.

Conclusion: ADAPT offers an effective training-free solution for improving rare concept composition in text-to-image synthesis through deterministic prompt scheduling and semantic alignment, outperforming existing LLM-based approaches.

Abstract: Generating rare compositional concepts in text-to-image synthesis remains a challenge for diffusion models, particularly for attributes that are uncommon in the training data. While recent approaches, such as R2F, address this challenge by utilizing LLM for prompt scheduling, they suffer from inherent variance due to the randomness of language models and suboptimal guidance from iterative text embedding switching. To address these problems, we propose the ADAPT framework, a training-free framework that deterministically plans and semantically aligns prompt schedules, providing consistent guidance to enhance the composition of rare concepts. By leveraging attention scores and orthogonal components, ADAPT significantly enhances compositional generation of rare concepts in the RareBench benchmark without additional training or fine-tuning. Through comprehensive experiments, we demonstrate that ADAPT achieves superior performance in RareBench and accurately reflects the semantic information of rare attributes, providing deterministic and precise control over the generation of rare compositions without compromising visual integrity.
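
The "orthogonal components" mentioned above can be illustrated generically: the part of a rare target embedding orthogonal to a frequent base-concept embedding captures exactly what the base prompt cannot express. A minimal NumPy sketch of that decomposition (illustrative only; function names are hypothetical and this is not the ADAPT implementation):

```python
import numpy as np

def orthogonal_complement(target: np.ndarray, base: np.ndarray) -> np.ndarray:
    """Return the component of `target` orthogonal to `base`.

    Subtracting the projection onto `base` leaves the directions that
    the base-concept embedding cannot express.
    """
    base_unit = base / np.linalg.norm(base)
    projection = np.dot(target, base_unit) * base_unit
    return target - projection

# Toy 3-D example: the residual is orthogonal to `base`.
target = np.array([1.0, 2.0, 3.0])
base = np.array([0.0, 0.0, 1.0])
residual = orthogonal_complement(target, base)
print(residual)                # [1. 2. 0.]
print(np.dot(residual, base))  # 0.0
```

In a real prompt schedule the same decomposition would be applied to high-dimensional text embeddings rather than toy 3-D vectors.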

[216] Adaptive Auxiliary Prompt Blending for Target-Faithful Diffusion Generation

Kwanyoung Lee, SeungJu Cha, Yebin Ahn, Hyunwoo Oh, Sungho Koh, Dong-Jin Kim

Main category: cs.CV

TL;DR: AAPB is a training-free framework that uses adaptive prompt blending with auxiliary anchors to improve diffusion model performance for rare concepts and image editing by stabilizing generation in low-density regions.

DetailsMotivation: Diffusion T2I models struggle with rare concepts in low-density regions of training data, producing semantically misaligned or structurally inconsistent results due to long-tailed text-image datasets where rare concepts are underrepresented.

Method: AAPB uses auxiliary anchor prompts for semantic/structural support and derives a closed-form adaptive coefficient (via Tweedie’s identity) to optimally balance auxiliary and target prompt influence at each diffusion step, enabling adaptive prompt blending.

Result: AAPB shows consistent improvements on RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines, with adaptive interpolation outperforming fixed interpolation.

Conclusion: AAPB provides a principled, training-free framework for adaptive prompt blending that stabilizes diffusion processes in low-density regions, enabling better rare concept generation and image editing.

Abstract: Diffusion-based text-to-image (T2I) models have made remarkable progress in generating photorealistic and semantically rich images. However, when the target concepts lie in low-density regions of the training distribution, these models often produce semantically misaligned or structurally inconsistent results. This limitation arises from the long-tailed nature of text-image datasets, where rare concepts or editing instructions are underrepresented. To address this, we introduce Adaptive Auxiliary Prompt Blending (AAPB) - a unified framework that stabilizes the diffusion process in low-density regions. AAPB leverages auxiliary anchor prompts to provide semantic support in rare concept generation and structural support in image editing, ensuring faithful guidance toward the target prompt. Unlike prior heuristic prompt alternation methods, AAPB derives a closed-form adaptive coefficient that optimally balances the influence between the auxiliary anchor and the target prompt at each diffusion step. Grounded in Tweedie’s identity, our formulation provides a principled and training-free framework for adaptive prompt blending, ensuring stable and target-faithful generation. We demonstrate the effectiveness of adaptive interpolation over fixed interpolation through controlled experiments and empirically show consistent improvements on the RareBench and FlowEdit datasets, achieving superior semantic accuracy and structural fidelity compared to prior training-free baselines.
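
For reference, Tweedie's identity, on which the abstract says the closed-form coefficient is grounded, relates the posterior mean of the clean sample to the score of the noised marginal. In the standard Gaussian-noising formulation:

```latex
% Tweedie's identity for x_t = x_0 + \sigma_t \epsilon, \; \epsilon \sim \mathcal{N}(0, I):
\mathbb{E}[x_0 \mid x_t] = x_t + \sigma_t^2 \, \nabla_{x_t} \log p(x_t)
```

The paper's specific adaptive coefficient is not given in the abstract; the identity above is only the standard result it builds on.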

[217] ARIADNE: A Perception-Reasoning Synergy Framework for Trustworthy Coronary Angiography Analysis

Zhan Jin, Yu Luo, Yizhou Zhang, Ziyang Cui, Yuqing Wei, Xianchao Liu, Xueying Zeng, Qing Zhang

Main category: cs.CV

TL;DR: ARIADNE: A two-stage framework combining preference-aligned perception with RL-based reasoning for topologically coherent coronary vessel segmentation and stenosis detection, using DPO for topological constraints and explicit rejection mechanisms.

DetailsMotivation: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. There's a need for methods that maintain geometric completeness while detecting stenosis reliably.

Method: Two-stage framework: 1) Perception module uses DPO to fine-tune Sa2VA vision-language foundation model with Betti number constraints as preference signals for topological alignment. 2) Reasoning module formulates stenosis localization as Markov Decision Process with explicit rejection mechanism that defers ambiguous anatomical candidates (bifurcations, vessel crossings), shifting from coverage maximization to reliability optimization.

Result: On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on the multi-center ARCADE and XCAD benchmarks confirms generalization across acquisition protocols.

Conclusion: First application of DPO for topological alignment in medical imaging. Preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.

Abstract: Conventional pixel-wise loss functions fail to enforce topological constraints in coronary vessel segmentation, producing fragmented vascular trees despite high pixel-level accuracy. We present ARIADNE, a two-stage framework coupling preference-aligned perception with RL-based diagnostic reasoning for topologically coherent stenosis detection. The perception module employs DPO to fine-tune the Sa2VA vision-language foundation model using Betti number constraints as preference signals, aligning the policy toward geometrically complete vessel structures rather than pixel-wise overlap metrics. The reasoning module formulates stenosis localization as a Markov Decision Process with an explicit rejection mechanism that autonomously defers ambiguous anatomical candidates such as bifurcations and vessel crossings, shifting from coverage maximization to reliability optimization. On 1,400 clinical angiograms, ARIADNE achieves a state-of-the-art centerline Dice of 0.838 and reduces false positives by 41% compared to geometric baselines. External validation on multi-center benchmarks ARCADE and XCAD confirms generalization across acquisition protocols. This represents the first application of DPO for topological alignment in medical imaging, demonstrating that preference-based learning over structural constraints mitigates topological violations while maintaining diagnostic sensitivity in interventional cardiology workflows.
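
The zeroth Betti number of a binary vessel mask is simply its count of connected components; a fragmented tree has β₀ > 1 while a complete one has β₀ = 1. A minimal sketch of how such a count could act as a topology preference signal (illustrative only, not ARIADNE's implementation; the toy preference rule is our own assumption), using SciPy's connected-component labeling:

```python
import numpy as np
from scipy import ndimage

def betti_0(mask: np.ndarray) -> int:
    """Zeroth Betti number of a binary mask: number of connected components."""
    _, num_components = ndimage.label(mask)
    return num_components

def prefer_by_topology(mask_a: np.ndarray, mask_b: np.ndarray) -> np.ndarray:
    """Toy preference rule: the mask closer to a single connected tree wins."""
    return mask_a if betti_0(mask_a) <= betti_0(mask_b) else mask_b

# A connected vessel-like mask vs. a fragmented one.
connected = np.array([[1, 1, 0],
                      [0, 1, 0],
                      [0, 1, 1]])
fragmented = np.array([[1, 0, 0],
                       [0, 0, 0],
                       [0, 0, 1]])
print(betti_0(connected), betti_0(fragmented))  # 1 2
```

In a DPO setup, pairs ranked this way would supply the chosen/rejected labels that the fine-tuning objective consumes.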

[218] Reconstruction Matters: Learning Geometry-Aligned BEV Representation through 3D Gaussian Splatting

Yiren Lu, Xin Ye, Burhaneddin Yaman, Jingru Luo, Zhexiao Xiong, Liu Ren, Yu Yin

Main category: cs.CV

TL;DR: Splat2BEV: A Gaussian Splatting-assisted framework for Bird’s-Eye-View perception that incorporates explicit 3D reconstruction to improve geometric understanding and performance in autonomous driving tasks.

DetailsMotivation: Most existing BEV perception frameworks use end-to-end training that treats the perception process as a black box, lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. The authors claim explicit 3D representation matters for accurate BEV perception.

Method: Proposes Splat2BEV framework that first pre-trains a Gaussian generator to explicitly reconstruct 3D scenes from multi-view inputs, generating geometry-aligned feature representations. These representations are then projected into BEV space to serve as inputs for downstream tasks like semantic segmentation, 3D object detection, and motion prediction.

Result: Extensive experiments on nuScenes and Argoverse datasets demonstrate state-of-the-art performance, validating the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

Conclusion: Explicit 3D representation is crucial for accurate BEV perception, and the proposed Splat2BEV framework successfully integrates 3D geometric understanding to improve both semantic richness and geometric precision in autonomous driving perception tasks.

Abstract: Bird’s-Eye-View (BEV) perception serves as a cornerstone for autonomous driving, offering a unified spatial representation that fuses surrounding-view images to enable reasoning for various downstream tasks, such as semantic segmentation, 3D object detection, and motion prediction. However, most existing BEV perception frameworks adopt an end-to-end training paradigm, where image features are directly transformed into the BEV space and optimized solely through downstream task supervision. This formulation treats the entire perception process as a black box, often lacking explicit 3D geometric understanding and interpretability, leading to suboptimal performance. In this paper, we claim that an explicit 3D representation matters for accurate BEV perception, and we propose Splat2BEV, a Gaussian Splatting-assisted framework for BEV tasks. Splat2BEV aims to learn BEV feature representations that are both semantically rich and geometrically precise. We first pre-train a Gaussian generator that explicitly reconstructs 3D scenes from multi-view inputs, enabling the generation of geometry-aligned feature representations. These representations are then projected into the BEV space to serve as inputs for downstream tasks. Extensive experiments on the nuScenes and Argoverse datasets demonstrate that Splat2BEV achieves state-of-the-art performance and validate the effectiveness of incorporating explicit 3D reconstruction into BEV perception.

[219] Tinted Frames: Question Framing Blinds Vision-Language Models

Wan-Cyuan Fan, Jiayun Luo, Declan Kutscher, Leonid Sigal, Ritwik Gupta

Main category: cs.CV

TL;DR: VLMs are selectively blind - their visual attention varies based on linguistic framing even when visual reasoning requirements are identical, with constrained framings causing reduced attention to images and degraded performance.

DetailsMotivation: VLMs often underutilize visual inputs despite needing visual reasoning. The paper investigates whether this "blindness" is selective - whether VLMs modulate visual attention based on linguistic framing rather than visual reasoning needs.

Method: Used visual attention as a probe to quantify how linguistic framing affects attention allocation. Compared constrained framings (multiple choice, yes/no) vs open-ended framings. Introduced lightweight prompt-tuning with learnable tokens to encourage robust visual grounding patterns.

Result: Constrained framings induce substantially lower attention to image context, reduce focus on task-relevant regions, and shift attention to uninformative tokens. This attention misallocation causes degraded accuracy and cross-framing inconsistency. Prompt-tuning improves visual grounding and performance across framings.

Conclusion: VLMs exhibit selective blindness modulated by linguistic framing. Attention misallocation in constrained framings explains performance degradation. Lightweight prompt-tuning can encourage more robust visual grounding patterns.

Abstract: Vision-Language Models (VLMs) have been shown to be blind, often underutilizing their visual inputs even on tasks that require visual reasoning. In this work, we demonstrate that VLMs are selectively blind. They modulate the amount of attention applied to visual inputs based on linguistic framing even when alternative framings demand identical visual reasoning. Using visual attention as a probe, we quantify how framing alters both the amount and distribution of attention over the image. Constrained framings, such as multiple choice and yes/no, induce substantially lower attention to image context compared to open-ended framings, reduce focus on task-relevant regions, and shift attention towards uninformative tokens. We further demonstrate that this attention misallocation is the principal cause of degraded accuracy and cross-framing inconsistency. Building on this mechanistic insight, we introduce a lightweight prompt-tuning method using learnable tokens that encourages the robust, visually grounded attention patterns observed in open-ended settings, improving both visual grounding and performance across framings.
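
The attention probe described above reduces to a simple quantity: the fraction of post-softmax attention mass that lands on image tokens versus text tokens. A generic NumPy sketch of that measurement (our own illustration, not the paper's code; a real probe would read these tensors out of a VLM's attention layers):

```python
import numpy as np

def image_attention_share(attn: np.ndarray, image_mask: np.ndarray) -> float:
    """Fraction of a query's attention mass that lands on image tokens.

    attn:       (num_queries, num_keys) rows summing to 1 (post-softmax).
    image_mask: (num_keys,) boolean, True where the key is an image token.
    """
    mass_on_image = attn[:, image_mask].sum(axis=-1)  # per-query share
    return float(mass_on_image.mean())

# Toy example: 2 queries over 4 keys, where keys 0-1 are image tokens.
attn = np.array([[0.4, 0.3, 0.2, 0.1],
                 [0.1, 0.1, 0.4, 0.4]])
image_mask = np.array([True, True, False, False])
print(image_attention_share(attn, image_mask))  # 0.45
```

Comparing this share between a multiple-choice framing and an open-ended framing of the same question is the kind of contrast the paper reports.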

[220] RPiAE: A Representation-Pivoted Autoencoder Enhancing Both Image Generation and Editing

Yue Gong, Hongyu Li, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Manyuan Zhang, Dawei Leng, Yuhui Yin, Lijun Zhang

Main category: cs.CV

TL;DR: RPiAE is a representation-based tokenizer that improves both image generation and editing by fine-tuning pretrained representation encoders for better reconstruction while preserving semantic structure, then compressing the latent space for efficient diffusion modeling.

DetailsMotivation: Current approaches using pretrained visual representation models as tokenizers either align diffusion features or reuse frozen encoders, which leads to limited reconstruction fidelity (hurting editing quality) and overly high-dimensional latents that make diffusion modeling difficult.

Method: Proposes Representation-Pivoted AutoEncoder with: 1) Representation-Pivot Regularization to fine-tune representation-initialized encoders for reconstruction while preserving semantic structure, 2) Variational bridge to compress latent space, and 3) Objective-decoupled stage-wise training that sequentially optimizes generative tractability and reconstruction-fidelity objectives.

Result: RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

Conclusion: The proposed tokenizer preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity, addressing key limitations of existing representation-based approaches.

Abstract: Diffusion models have become the dominant paradigm for image generation and editing, with latent diffusion models shifting denoising to a compact latent space for efficiency and scalability. Recent attempts to leverage pretrained visual representation models as tokenizer priors either align diffusion features to representation features or directly reuse representation encoders as frozen tokenizers. Although such approaches can improve generation metrics, they often suffer from limited reconstruction fidelity due to frozen encoders, which in turn degrades editing quality, as well as overly high-dimensional latents that make diffusion modeling difficult. To address these limitations, we propose Representation-Pivoted AutoEncoder, a representation-based tokenizer that improves both generation and editing. We introduce Representation-Pivot Regularization, a training strategy that enables a representation-initialized encoder to be fine-tuned for reconstruction while preserving the semantic structure of the pretrained representation space, followed by a variational bridge which compresses the latent space into a compact one for better diffusion modeling. We adopt an objective-decoupled stage-wise training strategy that sequentially optimizes generative tractability and reconstruction-fidelity objectives. Together, these components yield a tokenizer that preserves strong semantics, reconstructs faithfully, and produces latents with reduced diffusion modeling complexity. Experiments demonstrate that RPiAE outperforms other visual tokenizers on text-to-image generation and image editing, while delivering the best reconstruction fidelity among representation-based tokenizers.

[221] Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo, Paola Cascante-Bonilla

Main category: cs.CV

TL;DR: SSM vision backbones are evaluated as alternatives to transformers in VLMs, showing competitive performance with smaller scale and better grounding/localization capabilities.

DetailsMotivation: To investigate whether state space model (SSM) vision backbones can serve as strong alternatives to transformer-based encoders in large vision-language models, given that VLMs typically use frozen transformer vision backbones with lightweight connectors to LLMs.

Method: Systematic evaluation of SSM vision backbones for VLMs in controlled settings with matched ImageNet-1K initialization. Both SSM and ViT-family backbones were adapted with detection or segmentation training, and stabilization strategies were proposed for localization instability issues.

Result: SSM backbone achieves strongest overall performance across VQA and grounding/localization tasks. After dense-task tuning, SSM remains competitive while operating at substantially smaller model scale. Key findings: (1) higher ImageNet accuracy/larger backbones don’t reliably translate to better VLM performance, (2) some visual backbones are unstable in localization.

Conclusion: SSM vision backbones are a strong alternative to transformer-based vision encoders in VLMs, offering competitive performance with smaller scale and better grounding capabilities, with proposed stabilization strategies improving robustness for both backbone families.

Abstract: Large vision–language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We further observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.

[222] DreamPartGen: Semantically Grounded Part-Level 3D Generation via Collaborative Latent Denoising

Tianjiao Yu, Xinzhuo Li, Muntasir Wahed, Jerry Xiong, Yifan Shen, Ying Shen, Ismini Lourentzou

Main category: cs.CV

TL;DR: DreamPartGen is a framework for semantically grounded, part-aware text-to-3D generation that models both part geometry/appearance and inter-part semantic relationships from language.

DetailsMotivation: Most text-to-3D methods overlook semantic and functional structure of parts, focusing only on geometry. Recent part-aware approaches lack semantic grounding and fail to model how parts align with textual descriptions or their inter-part relations.

Method: Introduces Duplex Part Latents (DPLs) to jointly model each part’s geometry and appearance, and Relational Semantic Latents (RSLs) to capture inter-part dependencies derived from language. Uses synchronized co-denoising process to enforce mutual geometric and semantic consistency.

Result: Achieves state-of-the-art performance across multiple benchmarks in geometric fidelity and text-shape alignment.

Conclusion: DreamPartGen enables coherent, interpretable, and text-aligned 3D synthesis by combining geometric modeling with semantic understanding from language.

Abstract: Understanding and generating 3D objects as compositions of meaningful parts is fundamental to human perception and reasoning. However, most text-to-3D methods overlook the semantic and functional structure of parts. While recent part-aware approaches introduce decomposition, they remain largely geometry-focused, lacking semantic grounding and failing to model how parts align with textual descriptions or their inter-part relations. We propose DreamPartGen, a framework for semantically grounded, part-aware text-to-3D generation. DreamPartGen introduces Duplex Part Latents (DPLs) that jointly model each part’s geometry and appearance, and Relational Semantic Latents (RSLs) that capture inter-part dependencies derived from language. A synchronized co-denoising process enforces mutual geometric and semantic consistency, enabling coherent, interpretable, and text-aligned 3D synthesis. Across multiple benchmarks, DreamPartGen delivers state-of-the-art performance in geometric fidelity and text-shape alignment.

[223] LVOmniBench: Pioneering Long Audio-Video Understanding Evaluation for Omnimodal LLMs

Keda Tao, Yuhua Zheng, Jia Xu, Wenjie Du, Kele Shao, Hesong Wang, Xueyi Chen, Xin Jin, Junhan Zhu, Bohan Yu, Weiqiang Wang, Jian Liu, Can Qin, Yulun Zhang, Ming-Hsuan Yang, Huan Wang

Main category: cs.CV

TL;DR: LVOmniBench is a new benchmark for evaluating multimodal LLMs on long-form audio-video comprehension, addressing the gap in current evaluations that focus only on short clips.

DetailsMotivation: Current omnimodal LLM evaluations focus on short audio/video clips (10s-5min), but real-world applications involve much longer videos (tens of minutes). There's a critical need for benchmarks that reflect real-world long-form multimodal understanding demands.

Method: Created LVOmniBench with 275 high-quality videos (10-90 minutes each) from open platforms, featuring rich audio-visual dynamics. Manually selected and annotated to create 1,014 QA pairs. Designed to evaluate capabilities across domains including long-term memory, temporal localization, fine-grained understanding, and multimodal perception.

Result: Current OmniLLMs struggle with extended audio-visual inputs. Open-source models achieve <35% accuracy, while Gemini 3 Pro reaches ~65% accuracy. Shows significant performance gap for long-form multimodal comprehension.

Conclusion: LVOmniBench reveals current limitations in long-form audio-video understanding and provides a valuable benchmark to stimulate research and development of advanced models for complex cross-modal understanding in long-form contexts.

Abstract: Recent advancements in omnimodal large language models (OmniLLMs) have significantly improved the comprehension of audio and video inputs. However, current evaluations primarily focus on short audio and video clips ranging from 10 seconds to 5 minutes, failing to reflect the demands of real-world applications, where videos typically run for tens of minutes. To address this critical gap, we introduce LVOmniBench, a new benchmark designed specifically for the cross-modal comprehension of long-form audio and video. This dataset comprises high-quality videos sourced from open platforms that feature rich audio-visual dynamics. Through rigorous manual selection and annotation, LVOmniBench comprises 275 videos, ranging in duration from 10 to 90 minutes, and 1,014 question-answer (QA) pairs. LVOmniBench aims to rigorously evaluate the capabilities of OmniLLMs across domains, including long-term memory, temporal localization, fine-grained understanding, and multimodal perception. Our extensive evaluation reveals that current OmniLLMs encounter significant challenges when processing extended audio-visual inputs. Open-source models generally achieve accuracies below 35%, whereas the Gemini 3 Pro reaches a peak accuracy of approximately 65%. We anticipate that this dataset, along with our empirical findings, will stimulate further research and the development of advanced models capable of resolving complex cross-modal understanding problems within long-form audio-visual contexts.

[224] Rethinking Vector Field Learning for Generative Segmentation

Chaoyang Wang, Yaobo Liang, Boci Peng, Fan Duan, Jingdong Wang, Yunhai Tong

Main category: cs.CV

TL;DR: Proposes vector field reshaping for diffusion-based generative segmentation to address gradient vanishing and trajectory traversing issues in flow matching objectives, improving class separation and convergence.

DetailsMotivation: Existing diffusion segmentation approaches focus on architectural tweaks without addressing the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. The paper identifies gradient vanishing and trajectory traversing as key limitations causing slow convergence and poor class separation.

Method: Proposes a vector field reshaping strategy that augments learned velocity fields with a detached distance-aware correction term introducing attractive and repulsive interactions. Also designs a quasi-random category encoding scheme inspired by Kronecker sequences integrated with an end-to-end pixel neural field framework.

Result: Extensive experiments show significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.

Conclusion: The proposed vector field reshaping approach effectively addresses fundamental limitations in diffusion-based segmentation, offering a principled solution that enhances gradient magnitudes near centroids while preserving the original diffusion training framework.

Abstract: Taming diffusion models for generative segmentation has attracted increasing attention. While existing approaches primarily focus on architectural tweaks or training heuristics, there remains a limited understanding of the intrinsic mismatch between continuous flow matching objectives and discrete perception tasks. In this work, we revisit diffusion segmentation from the perspective of vector field learning. We identify two key limitations of the commonly used flow matching objective: gradient vanishing and trajectory traversing, which result in slow convergence and poor class separation. To tackle these issues, we propose a principled vector field reshaping strategy that augments the learned velocity field with a detached distance-aware correction term. This correction introduces both attractive and repulsive interactions, enhancing gradient magnitudes near centroids while preserving the original diffusion training framework. Furthermore, we design a computationally efficient, quasi-random category encoding scheme inspired by Kronecker sequences, which integrates seamlessly with an end-to-end pixel neural field framework for pixel-level semantic alignment. Extensive experiments consistently demonstrate significant improvements over vanilla flow matching approaches, substantially narrowing the performance gap between generative segmentation and strong discriminative specialists.
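
A Kronecker (additive-recurrence) sequence places its n-th point at the fractional part of n·α for an irrational vector α, producing low-discrepancy, well-separated points. A minimal sketch of assigning such quasi-random codes to categories (illustrative only, not the paper's scheme; square roots of primes are one common choice of irrational base, assumed here):

```python
import numpy as np

def kronecker_codes(num_categories: int, dim: int) -> np.ndarray:
    """Assign each category a point of a d-dimensional Kronecker sequence.

    The n-th code is frac(n * alpha) with alpha_i = sqrt(p_i) for the
    first `dim` primes; irrational steps keep the points well spread.
    """
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29][:dim]
    alpha = np.sqrt(np.array(primes, dtype=np.float64))
    n = np.arange(1, num_categories + 1)[:, None]
    return np.mod(n * alpha[None, :], 1.0)  # fractional parts in [0, 1)

codes = kronecker_codes(num_categories=5, dim=3)
print(codes.shape)  # (5, 3)
```

Unlike learned or one-hot category embeddings, such codes are deterministic, need no training, and remain well separated as the number of classes grows.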

[225] Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

James Brock, Ce Zhang, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Forest-Chat: An LLM-driven agent for forest change analysis using satellite imagery, enabling natural language querying for change detection, captioning, counting, and reasoning tasks.

DetailsMotivation: To address the underexplored integration of LLMs with vision-language models for remote sensing image change interpretation (RSICI) in forest environments, particularly for complex forest dynamics beyond urban settings.

Method: Forest-Chat uses an LLM-driven agent with multi-level change interpretation (MCI) vision-language backbone, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. Introduces Forest-Change dataset with bi-temporal satellite imagery, change masks, and semantic captions.

Result: Achieves mIoU and BLEU-4 scores of 67.10%/40.17% on Forest-Change and 88.13%/34.41% on LEVIR-MCI-Trees. Zero-shot performance: 60.15%/34.00% on Forest-Change and 47.32%/18.23% on LEVIR-MCI-Trees. Shows value of caption refinement for injecting geographic domain knowledge.

Conclusion: Interactive, LLM-driven systems can support accessible and interpretable forest change analysis, demonstrating the potential of integrating LLMs with vision-language models for remote sensing applications.

Abstract: The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system’s limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
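
For readers unfamiliar with the segmentation metric quoted above, mIoU is the per-class intersection-over-union averaged over classes. A generic computation (standard definition, not tied to Forest-Chat's code):

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union over classes present in either map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with classes {0, 1}.
pred   = np.array([[0, 1], [1, 1]])
target = np.array([[0, 1], [0, 1]])
print(round(mean_iou(pred, target, num_classes=2), 4))  # 0.5833
```

BLEU-4, the other reported metric, is instead a 4-gram precision score for the generated change captions.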

[226] DriveTok: 3D Driving Scene Tokenization for Unified Multi-View Reconstruction and Understanding

Dong Zhuo, Wenzhao Zheng, Sicheng Zuo, Siming Yan, Lu Hou, Jie Zhou, Jiwen Lu

Main category: cs.CV

TL;DR: DriveTok: A 3D driving scene tokenizer for multi-view autonomous driving scenes that integrates semantic, geometric, and textural information into unified scene tokens for efficient multi-view reconstruction and understanding.

DetailsMotivation: Existing tokenizers are designed for monocular 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. There's a need for scalable image tokenization that can handle complex 3D driving environments with multiple camera views.

Method: DriveTok extracts semantically rich visual features from vision foundation models, then transforms them into scene tokens using 3D deformable cross-attention. For decoding, it uses a multi-view transformer to reconstruct multi-view features and multiple heads for RGB, depth, and semantic reconstructions. A 3D head is added directly on scene tokens for 3D semantic occupancy prediction.

Result: Extensive experiments on nuScenes dataset show DriveTok performs well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks. The unified scene tokens effectively integrate semantic, geometric, and textural information.

Conclusion: DriveTok provides an efficient 3D driving scene tokenizer that addresses limitations of existing 2D tokenizers for multi-view autonomous driving scenes, enabling better spatial awareness and consistent multi-view understanding.

Abstract: With the growing adoption of vision-language-action models and world models in autonomous driving systems, scalable image tokenization becomes crucial as the interface for the visual modality. However, most existing tokenizers are designed for monocular and 2D scenes, leading to inefficiency and inter-view inconsistency when applied to high-resolution multi-view driving scenes. To address this, we propose DriveTok, an efficient 3D driving scene tokenizer for unified multi-view reconstruction and understanding. DriveTok first obtains semantically rich visual features from vision foundation models and then transforms them into the scene tokens with 3D deformable cross-attention. For decoding, we employ a multi-view transformer to reconstruct multi-view features from the scene tokens and use multiple heads to obtain RGB, depth, and semantic reconstructions. We also add a 3D head directly on the scene tokens for 3D semantic occupancy prediction for better spatial awareness. With the multiple training objectives, DriveTok learns unified scene tokens that integrate semantic, geometric, and textural information for efficient multi-view tokenization. Extensive experiments on the widely used nuScenes dataset demonstrate that the scene tokens from DriveTok perform well on image reconstruction, semantic segmentation, depth prediction, and 3D occupancy prediction tasks.

[227] Spectrally-Guided Diffusion Noise Schedules

Carlos Esteves, Ameesh Makadia

Main category: cs.CV

TL;DR: The paper proposes a principled method to design per-instance noise schedules for pixel diffusion models based on image spectral properties, improving generative quality especially in low-step regimes.

DetailsMotivation: Current noise schedules in diffusion models are handcrafted and require manual tuning across different resolutions, lacking theoretical grounding and potentially containing redundant steps that reduce efficiency.

Method: Derives theoretical bounds on minimum and maximum noise levels based on image spectral properties, designs “tight” noise schedules that eliminate redundant steps, and proposes conditional sampling of these schedules during inference.

Result: Experiments show improved generative quality for single-stage pixel diffusion models, particularly in the low-step regime where efficiency is crucial.

Conclusion: The proposed principled approach to noise schedule design based on spectral properties offers better theoretical grounding and practical improvements over handcrafted schedules.

Abstract: Denoising diffusion models are widely used for high-quality image and video generation. Their performance depends on noise schedules, which define the distribution of noise levels applied during training and the sequence of noise levels traversed during sampling. Noise schedules are typically handcrafted and require manual tuning across different resolutions. In this work, we propose a principled way to design per-instance noise schedules for pixel diffusion, based on the image’s spectral properties. By deriving theoretical bounds on the efficacy of minimum and maximum noise levels, we design “tight” noise schedules that eliminate redundant steps. During inference, we propose to conditionally sample such noise schedules. Experiments show that our noise schedules improve generative quality of single-stage pixel diffusion models, particularly in the low-step regime.
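The core idea of tying noise-level bounds to an image's spectrum can be illustrated with a minimal sketch. This is a plausible heuristic reading, not the paper's actual derivation: the function name, the `eps` threshold, and the specific bound formulas are all assumptions for illustration.

```python
import numpy as np

def spectral_noise_bounds(img, eps=1e-3):
    """Heuristic per-image bounds on useful diffusion noise levels from
    the radially averaged power spectrum: levels below sigma_min or
    above sigma_max would correspond to redundant denoising steps."""
    # Power spectrum of a single-channel image, low frequencies centered.
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    h, w = img.shape
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - h / 2, xx - w / 2).astype(int).ravel()
    # Radial average: mean power per integer frequency radius.
    counts = np.bincount(r)
    sums = np.bincount(r, weights=power.ravel())
    radial = sums[counts > 0] / counts[counts > 0]
    radial = radial / radial.max()  # normalize to [0, 1]
    # Below sigma_min, noise is weaker than the faintest retained band;
    # above sigma_max, noise swamps even the strongest band.
    sigma_min = float(np.sqrt(radial[radial > eps].min()))
    sigma_max = float(np.sqrt(radial.max() / eps))
    return sigma_min, sigma_max
```

A per-instance schedule would then place its discretized noise levels inside `[sigma_min, sigma_max]` rather than a fixed global range.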

[228] EffectErase: Joint Video Object Removal and Insertion for High-Quality Effect Erasing

Yang Fu, Yike Zheng, Ziyun Dai, Henghui Ding

Main category: cs.CV

TL;DR: Video object removal dataset (VOR) with 60K video pairs capturing object effects, plus EffectErase method for effect-aware removal using reciprocal learning between removal and insertion tasks.

DetailsMotivation: Current video object removal methods struggle to erase visual effects (deformation, shadows, reflections) and synthesize coherent backgrounds. Progress is hampered by lack of comprehensive datasets capturing object effects across varied environments.

Method: Introduces VOR dataset with 60K high-quality video pairs (object present/absent) covering five effect types. Proposes EffectErase method with reciprocal learning between object removal and insertion tasks, task-aware region guidance, and insertion-removal consistency objective.

Result: EffectErase trained on VOR achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.

Conclusion: VOR dataset addresses the data gap for video object removal with effects, and EffectErase demonstrates effective effect-aware removal through reciprocal learning and consistency objectives.

Abstract: Video object removal aims to eliminate dynamic target objects and their visual effects, such as deformation, shadows, and reflections, while restoring seamless backgrounds. Recent diffusion-based video inpainting and object removal methods can remove the objects but often struggle to erase these effects and to synthesize coherent backgrounds. Beyond method limitations, progress is further hampered by the lack of a comprehensive dataset that systematically captures common object effects across varied environments for training and evaluation. To address this, we introduce VOR (Video Object Removal), a large-scale dataset that provides diverse paired videos, each consisting of one video where the target object is present with its effects and a counterpart where the object and effects are absent, with corresponding object masks. VOR contains 60K high-quality video pairs from captured and synthetic sources, covers five effect types, and spans a wide range of object categories as well as complex, dynamic multi-object scenes. Building on VOR, we propose EffectErase, an effect-aware video object removal method that treats video object insertion as the inverse auxiliary task within a reciprocal learning scheme. The model includes task-aware region guidance that focuses learning on affected areas and enables flexible task switching, together with an insertion-removal consistency objective that encourages complementary behaviors and shared localization of effect regions and structural cues. Trained on VOR, EffectErase achieves superior performance in extensive experiments, delivering high-quality video object effect erasing across diverse scenarios.
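The insertion-removal consistency idea can be sketched as a round-trip loss: compositing an object into a clean background and then removing it should recover that background. This is an illustrative reading only; `insert`, `remove`, and the mask-based compositing stand in for the paper's two task heads and are assumptions.

```python
import numpy as np

def consistency_loss(remove, insert, bg, obj, mask):
    """Insertion-removal consistency (sketch): penalize the squared
    error of the insert-then-remove round trip against the clean
    background. `insert` and `remove` are assumed callables on arrays."""
    cycle = remove(insert(bg, obj, mask), mask)
    return float(np.mean((cycle - bg) ** 2))
```

With an ideal pair of heads the round trip is exact and the loss is zero; during training the loss pushes the two heads toward complementary behavior on the effect regions.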

[229] Under One Sun: Multi-Object Generative Perception of Materials and Illumination

Nobuo Yoshii, Xinran Nicole Han, Ryo Kawahara, Todd Zickler, Ko Nishino

Main category: cs.CV

TL;DR: MultiGP is a generative inverse rendering method that samples reflectance, texture, and illumination from single images by exploiting that objects in the same scene share illumination.

DetailsMotivation: The paper addresses the inherent ambiguity in radiometric disentanglement from single images, where separating reflectance, texture, and illumination is challenging due to their combined effect on appearance.

Method: Uses a cascaded end-to-end architecture combining image-space and angular-space disentanglement, Coordinated Guidance for diffusion convergence to consistent illumination, Axial Attention for cross-object communication, and Texture Extraction ControlNet for preserving high-frequency details.

Result: Experimental results show MultiGP effectively leverages complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as common illumination.

Conclusion: MultiGP successfully solves radiometric disentanglement by exploiting scene-level illumination consensus, enabling stochastic sampling of all radiometric constituents from single images.

Abstract: We introduce Multi-Object Generative Perception (MultiGP), a generative inverse rendering method for stochastic sampling of all radiometric constituents – reflectance, texture, and illumination – underlying object appearance from a single image. Our key idea to solve this inherently ambiguous radiometric disentanglement is to leverage the fact that while their texture and reflectance may differ, objects in the same scene are all lit by the same illumination. MultiGP exploits this consensus to produce samples of reflectance, texture, and illumination from a single image of known shapes based on four key technical contributions: a cascaded end-to-end architecture that combines image-space and angular-space disentanglement; Coordinated Guidance for diffusion convergence to a single consistent illumination estimate; Axial Attention applied to facilitate “cross-talk” between objects of different reflectance; and a Texture Extraction ControlNet to preserve high-frequency texture details while ensuring decoupling from estimated lighting. Experimental results demonstrate that MultiGP effectively leverages the complementary spatial and frequency characteristics of multiple object appearances to recover individual texture and reflectance as well as the common illumination.

[230] Bridging Semantic and Kinematic Conditions with Diffusion-based Discrete Motion Tokenizer

Chenyang Gu, Mingyuan Zhang, Haozhe Xie, Zhongang Cai, Lei Yang, Ziwei Liu

Main category: cs.CV

TL;DR: A three-stage motion generation framework combining discrete token planning with diffusion-based synthesis for improved controllability and fidelity

DetailsMotivation: To combine strengths of continuous diffusion models (good for kinematic control) and discrete token-based generators (effective for semantic conditioning) in motion generation

Method: Three-stage framework: 1) Perception (condition feature extraction), 2) Planning (discrete token generation), 3) Control (diffusion-based motion synthesis). Uses MoTok diffusion-based discrete motion tokenizer to decouple semantic abstraction from fine-grained reconstruction

Result: Significantly improves controllability and fidelity over MaskControl, reduces trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029 on HumanML3D. Uses only one-sixth of tokens compared to prior methods

Conclusion: The proposed framework successfully combines discrete token planning with diffusion-based synthesis, achieving superior motion generation with better controllability and fidelity while using fewer tokens

Abstract: Prior motion generation largely follows two paradigms: continuous diffusion models that excel at kinematic control, and discrete token-based generators that are effective for semantic conditioning. To combine their strengths, we propose a three-stage framework comprising condition feature extraction (Perception), discrete token generation (Planning), and diffusion-based motion synthesis (Control). Central to this framework is MoTok, a diffusion-based discrete motion tokenizer that decouples semantic abstraction from fine-grained reconstruction by delegating motion recovery to a diffusion decoder, enabling compact single-layer tokens while preserving motion fidelity. For kinematic conditions, coarse constraints guide token generation during planning, while fine-grained constraints are enforced during control through diffusion-based optimization. This design prevents kinematic details from disrupting semantic token planning. On HumanML3D, our method significantly improves controllability and fidelity over MaskControl while using only one-sixth of the tokens, reducing trajectory error from 0.72 cm to 0.08 cm and FID from 0.083 to 0.029. Unlike prior methods that degrade under stronger kinematic constraints, ours improves fidelity, reducing FID from 0.033 to 0.014.

[231] SAMA: Factorized Semantic Anchoring and Motion Alignment for Instruction-Guided Video Editing

Xinyao Zhang, Wenkai Dong, Yuxin Song, Bo Fang, Qi Zhang, Jing Wang, Fan Chen, Hui Zhang, Haocheng Feng, Yu Lu, Hang Zhou, Chun Yuan, Jingdong Wang

Main category: cs.CV

TL;DR: SAMA is a video editing framework that factorizes editing into semantic anchoring and motion alignment to balance precise semantic modifications with faithful motion preservation without relying on external priors.

DetailsMotivation: Current instruction-guided video editing models struggle to balance semantic modifications with motion preservation, and existing approaches rely on external priors that bottleneck robustness and generalization.

Method: Factorizes video editing into: 1) Semantic Anchoring - predicts semantic tokens and video latents at sparse anchor frames for instruction-aware structural planning; 2) Motion Alignment - pre-trains backbone on motion-centric video restoration tasks (cube inpainting, speed perturbation, tube shuffle) to internalize temporal dynamics. Uses two-stage pipeline: factorized pre-training without paired editing data, followed by supervised fine-tuning.

Result: Achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Factorized pre-training alone yields strong zero-shot video editing ability.

Conclusion: SAMA’s factorization approach effectively balances semantic modifications with motion preservation without external priors, demonstrating strong generalization and zero-shot capabilities.

Abstract: Current instruction-guided video editing models struggle to simultaneously balance precise semantic modifications with faithful motion preservation. While existing approaches rely on injecting explicit external priors (e.g., VLM features or structural conditions) to mitigate these issues, this reliance severely bottlenecks model robustness and generalization. To overcome this limitation, we present SAMA (factorized Semantic Anchoring and Motion Alignment), a framework that factorizes video editing into semantic anchoring and motion modeling. First, we introduce Semantic Anchoring, which establishes a reliable visual anchor by jointly predicting semantic tokens and video latents at sparse anchor frames, enabling purely instruction-aware structural planning. Second, Motion Alignment pre-trains the same backbone on motion-centric video restoration pretext tasks (cube inpainting, speed perturbation, and tube shuffle), enabling the model to internalize temporal dynamics directly from raw videos. SAMA is optimized with a two-stage pipeline: a factorized pre-training stage that learns inherent semantic-motion representations without paired video-instruction editing data, followed by supervised fine-tuning on paired editing data. Remarkably, the factorized pre-training alone already yields strong zero-shot video editing ability, validating the proposed factorization. SAMA achieves state-of-the-art performance among open-source models and is competitive with leading commercial systems (e.g., Kling-Omni). Code, models, and datasets will be released.
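One of the motion-centric pretext corruptions named above, tube shuffle, can be sketched as follows. The tube size and the exact corruption scheme are assumptions for illustration; the paper's actual parameters are not given here.

```python
import numpy as np

def tube_shuffle(video, tube=4, rng=None):
    """Motion-centric pretext corruption (sketch of 'tube shuffle'):
    split the clip into spatial tubes and independently permute each
    tube's frames along time, so a restoration model must re-learn
    plausible temporal dynamics to undo it."""
    rng = np.random.default_rng(rng)
    t, h, w, c = video.shape
    out = video.copy()
    for y in range(0, h, tube):
        for x in range(0, w, tube):
            perm = rng.permutation(t)  # per-tube temporal permutation
            out[:, y:y + tube, x:x + tube] = video[perm, y:y + tube, x:x + tube]
    return out
```

The corruption destroys temporal order but preserves per-tube content, which is what makes it a purely motion-focused restoration target.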

[232] MonoArt: Progressive Structural Reasoning for Monocular Articulated 3D Reconstruction

Haitian Li, Haozhe Xie, Junxiang Xu, Beichen Wen, Fangzhou Hong, Ziwei Liu

Main category: cs.CV

TL;DR: MonoArt: A unified framework for reconstructing articulated 3D objects from single images using progressive structural reasoning, achieving SOTA performance without external motion templates or multi-stage pipelines.

DetailsMotivation: Reconstructing articulated 3D objects from single images is challenging due to entanglement between motion cues and object structure, making direct articulation regression unstable. Existing methods use multi-view supervision, retrieval-based assembly, or auxiliary video generation, which sacrifice scalability or efficiency.

Method: MonoArt uses progressive structural reasoning to transform visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture, enabling stable and interpretable articulation inference without external motion templates.

Result: Extensive experiments on PartNet-Mobility demonstrate state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

Conclusion: MonoArt provides a unified framework for articulated 3D reconstruction from single images through progressive structural reasoning, achieving better performance and efficiency than existing methods while maintaining interpretability.

Abstract: Reconstructing articulated 3D objects from a single image requires jointly inferring object geometry, part structure, and motion parameters from limited visual evidence. A key difficulty lies in the entanglement between motion cues and object structure, which makes direct articulation regression unstable. Existing methods address this challenge through multi-view supervision, retrieval-based assembly, or auxiliary video generation, often sacrificing scalability or efficiency. We present MonoArt, a unified framework grounded in progressive structural reasoning. Rather than predicting articulation directly from image features, MonoArt progressively transforms visual observations into canonical geometry, structured part representations, and motion-aware embeddings within a single architecture. This structured reasoning process enables stable and interpretable articulation inference without external motion templates or multi-stage pipelines. Extensive experiments on PartNet-Mobility demonstrate that MonoArt achieves state-of-the-art performance in both reconstruction accuracy and inference speed. The framework further generalizes to robotic manipulation and articulated scene reconstruction.

[233] HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Shenzhi Wang, Shixuan Liu, Jing Zhou, Chang Gao, Xiong-Hui Chen, Binghai Wang, An Yang, Shiji Song, Bowen Yu, Gao Huang, Junyang Lin

Main category: cs.CV

TL;DR: HopChain: A framework for synthesizing multi-hop vision-language reasoning data to improve VLM performance through reinforcement learning with verifiable rewards.

DetailsMotivation: Vision-language models struggle with fine-grained reasoning, especially in long chain-of-thought scenarios where errors compound across steps. Existing RLVR data lacks complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses unaddressed.

Method: HopChain synthesizes multi-hop vision-language reasoning data where each query forms logically dependent chains of instance-grounded hops. Earlier hops establish instances/sets/conditions for later hops, with final answers as specific numbers for verifiable rewards. Trained Qwen3.5 models with RLVR using original data plus HopChain’s multi-hop data.

Result: HopChain improved 20 of 24 benchmarks across STEM/Puzzle, General VQA, Text Recognition/Document Understanding, and Video Understanding. Multi-hop gains peaked in long-CoT reasoning, exceeding 50 points in ultra-long-CoT regime. Removing multi-hop components reduced performance significantly.

Conclusion: HopChain is an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning, particularly for complex chain-of-thought scenarios where visual evidence must be tracked across multiple reasoning steps.

Abstract: Vision-language models (VLMs) show strong multimodal capabilities but still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final answer remains a specific, unambiguous number suitable for verifiable rewards. We train Qwen3.5-35B-A3B and Qwen3.5-397B-A17B under two RLVR settings: the original data alone, and the original data plus HopChain’s multi-hop data, and compare them across 24 benchmarks spanning STEM and Puzzle, General VQA, Text Recognition and Document Understanding, and Video Understanding. Although this multi-hop data is not synthesized for any specific benchmark, it improves 20 of 24 benchmarks on both models, indicating broad and generalizable gains. Consistently, replacing full chained queries with half-multi-hop or single-hop variants reduces the average score across five representative benchmarks from 70.4 to 66.7 and 64.3, respectively. Notably, multi-hop gains peak in long-CoT vision-language reasoning, exceeding 50 points in the ultra-long-CoT regime. These experiments establish HopChain as an effective, scalable framework for synthesizing multi-hop data that improves generalizable vision-language reasoning.
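The structure of a logically dependent hop chain, where each hop consumes the instance set established by the previous one and the final answer must be a verifiable number, can be sketched as follows. The `Hop` class and `run_chain` are illustrative names, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Hop:
    """One instance-grounded hop: `op` consumes the state produced by
    the previous hop (a set of instances, or conditions on them) and
    returns the next state, or, for the final hop, a number."""
    question: str
    op: Callable[[Any], Any]

def run_chain(instances, hops):
    """Evaluate a chain of logically dependent hops; the final hop must
    yield a specific number so the answer can serve as a verifiable
    reward signal for RLVR training."""
    state = instances
    for hop in hops:
        state = hop.op(state)
    assert isinstance(state, (int, float)), "final answer must be numeric"
    return state
```

For example, a two-hop query might first filter a scene's instances by an attribute and then count a condition over the filtered set, so an error in hop one necessarily corrupts hop two, which is exactly what makes the chain expose compounding failures.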

[234] Cubic Discrete Diffusion: Discrete Visual Generation on High-Dimensional Representation Tokens

Yuqing Wang, Chuofan Ma, Zhijie Lin, Yao Teng, Lijun Yu, Shuai Wang, Jiaming Han, Jiashi Feng, Yi Jiang, Xihui Liu

Main category: cs.CV

TL;DR: CubiD introduces a discrete diffusion model for high-dimensional visual representations (768-1024 dims) that enables both understanding and generation tasks using the same discrete tokens, bridging the gap between current low-dimensional discrete generation methods and rich semantic representations.

DetailsMotivation: Current discrete generation methods are limited to low-dimensional latent tokens (8-32 dims), sacrificing semantic richness needed for understanding. High-dimensional pretrained representations (768-1024 dims) could bridge this gap but their discrete generation poses fundamental challenges. The authors aim to enable discrete generation of high-dimensional representations to support unified multimodal architectures.

Method: CubiD performs fine-grained masking throughout high-dimensional discrete representations - any dimension at any position can be masked and predicted from partial observations. This allows learning rich correlations both within and across spatial positions, with fixed generation steps T regardless of feature dimensionality (T ≪ hwd).

Result: On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. The discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks.

Conclusion: CubiD enables discrete generation of high-dimensional visual representations, bridging the gap between understanding and generation. This work could inspire future research toward unified multimodal architectures that share token prediction paradigms with language models.

Abstract: Visual generation with discrete tokens has gained significant attention as it enables a unified token prediction paradigm shared with language models, promising seamless multimodal architectures. However, current discrete generation methods remain limited to low-dimensional latent tokens (typically 8-32 dims), sacrificing the semantic richness essential for understanding. While high-dimensional pretrained representations (768-1024 dims) could bridge this gap, their discrete generation poses fundamental challenges. In this paper, we present Cubic Discrete Diffusion (CubiD), the first discrete generation model for high-dimensional representations. CubiD performs fine-grained masking throughout the high-dimensional discrete representation – any dimension at any position can be masked and predicted from partial observations. This enables the model to learn rich correlations both within and across spatial positions, with the number of generation steps fixed at $T$ regardless of feature dimensionality, where $T \ll hwd$. On ImageNet-256, CubiD achieves state-of-the-art discrete generation with strong scaling behavior from 900M to 3.7B parameters. Crucially, we validate that these discretized tokens preserve original representation capabilities, demonstrating that the same discrete tokens can effectively serve both understanding and generation tasks. We hope this work will inspire future research toward unified multimodal architectures. Code is available at: https://github.com/YuqingWang1029/CubiD.
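The fine-grained masking described above, where any dimension at any spatial position can be masked independently over the (h, w, d) token grid, can be sketched as a simple Bernoulli mask. The function name and the sentinel mask id are assumptions for illustration.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel outside the discrete codebook

def cubic_mask(tokens, ratio, rng=None):
    """Fine-grained masking over an (h, w, d) grid of discrete codes:
    each (position, dimension) entry is masked independently, so the
    model must predict it from partial observations both within and
    across spatial positions."""
    rng = np.random.default_rng(rng)
    mask = rng.random(tokens.shape) < ratio
    out = tokens.copy()
    out[mask] = MASK_ID
    return out, mask
```

This contrasts with standard masked token modeling, where masking acts on whole spatial positions; here the mask cuts through the channel dimension as well, which is what lets the number of generation steps stay fixed at T regardless of d.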

[235] Matryoshka Gaussian Splatting

Zhilin Guo, Boqiao Zhang, Hakan Aktas, Kyle Fogarty, Jeffrey Hu, Nursena Koprucu Aslan, Wenzhao Li, Canberk Baykal, Albert Miao, Josef Bengtson, Chenliang Zhou, Weihao Xia, Cristina Nader Vasconcelos, Cengiz Oztireli

Main category: cs.CV

TL;DR: MGS enables continuous level-of-detail rendering for 3D Gaussian Splatting without quality loss at full capacity through stochastic budget training.

DetailsMotivation: Current 3D Gaussian Splatting methods lack practical continuous level-of-detail control - discrete methods offer limited operating points while continuous approaches suffer quality degradation at full capacity, making LoD a costly design decision.

Method: Matryoshka Gaussian Splatting learns a single ordered set of Gaussians where rendering any prefix (first k splats) produces coherent reconstructions. Uses stochastic budget training: each iteration samples random splat budget and optimizes both corresponding prefix and full set with only two forward passes and no architectural changes.

Result: MGS matches full-capacity performance of backbone 3DGS while enabling continuous speed-quality trade-off from single model across four benchmarks and six baselines. Extensive ablations validate ordering strategies, training objectives, and model capacity.

Conclusion: MGS provides practical continuous level-of-detail control for 3D Gaussian Splatting without sacrificing rendering quality, enabling flexible deployment with smooth fidelity scaling.

Abstract: The ability to render scenes at adjustable fidelity from a single model, known as level of detail (LoD), is crucial for practical deployment of 3D Gaussian Splatting (3DGS). Existing discrete LoD methods expose only a limited set of operating points, while concurrent continuous LoD approaches enable smoother scaling but often suffer noticeable quality degradation at full capacity, making LoD a costly design decision. We introduce Matryoshka Gaussian Splatting (MGS), a training framework that enables continuous LoD for standard 3DGS pipelines without sacrificing full-capacity rendering quality. MGS learns a single ordered set of Gaussians such that rendering any prefix, the first k splats, produces a coherent reconstruction whose fidelity improves smoothly with increasing budget. Our key idea is stochastic budget training: each iteration samples a random splat budget and optimises both the corresponding prefix and the full set. This strategy requires only two forward passes and introduces no architectural modifications. Experiments across four benchmarks and six baselines show that MGS matches the full-capacity performance of its backbone while enabling a continuous speed-quality trade-off from a single model. Extensive ablations on ordering strategies, training objectives, and model capacity further validate the designs.
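The stochastic budget training loop described above, sampling a random splat budget each iteration and optimizing both the prefix and the full set with two forward passes, can be sketched as follows. The callable signatures for `render` and `loss` are assumptions; the real pipeline operates on Gaussian parameters and differentiable rasterization.

```python
import numpy as np

def training_step(splats, render, loss, target, rng=None):
    """One stochastic-budget iteration (sketch): sample a random prefix
    budget k over the ordered splats, then optimize both the k-prefix
    render and the full render. Only two forward passes are needed,
    with no architectural changes to the backbone."""
    rng = np.random.default_rng(rng)
    n = len(splats)
    k = int(rng.integers(1, n + 1))               # random splat budget
    l_prefix = loss(render(splats[:k]), target)   # prefix pass
    l_full = loss(render(splats), target)         # full-capacity pass
    return l_prefix + l_full
```

Because every prefix is directly supervised during training, rendering any prefix at inference time yields a coherent reconstruction, giving the continuous speed-quality trade-off from a single model.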

[236] Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

Main category: cs.CV

TL;DR: VEGA-3D leverages video generation models’ implicit spatial priors to enhance MLLMs’ geometric reasoning without explicit 3D supervision

DetailsMotivation: Multimodal LLMs suffer from spatial blindness in fine-grained geometric reasoning and physical dynamics. Existing solutions rely on explicit 3D modalities with data scarcity and generalization issues.

Method: Proposes VEGA-3D framework that repurposes pre-trained video diffusion models as Latent World Simulators. Extracts spatiotemporal features from intermediate noise levels and integrates them with semantic representations via token-level adaptive gated fusion.

Result: Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks show outperformance over state-of-the-art baselines.

Conclusion: Generative priors from video models provide scalable foundation for physical-world understanding, addressing MLLMs’ spatial blindness without explicit 3D supervision.

Abstract: While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
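The token-level adaptive gated fusion mentioned above can be sketched as a per-token sigmoid gate that blends the video-model spatial features into the MLLM's semantic tokens. The gate parameterization here (a single linear layer over the concatenated streams) is an assumption for illustration.

```python
import numpy as np

def gated_fusion(semantic, spatial, w_gate, b_gate):
    """Token-level adaptive gated fusion (sketch): compute a per-token,
    per-channel sigmoid gate from both feature streams, then add the
    gated spatial features to the semantic tokens. `w_gate` and
    `b_gate` stand in for learned parameters."""
    x = np.concatenate([semantic, spatial], axis=-1)       # (n_tokens, 2d)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))    # (n_tokens, d)
    return semantic + gate * spatial
```

The residual form means a closed gate leaves the semantic tokens untouched, so the fusion can be attached to a pre-trained MLLM without disrupting its original behavior.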

[237] Exploring AI in Fashion: A Review of Aesthetics, Personalization, Virtual Try-On, and Forecasting

Laila Khalid, Wei Gong

Main category: cs.CV

TL;DR: Summary unavailable: the arXiv export API request for 2101.08301 returned HTTP 429 (rate limited), so no AI-generated summary could be produced.

Abstract: Not fetched (HTTP 429 from the arXiv export API for 2101.08301).

[238] Multi-Scale Distillation for RGB-D Anomaly Detection on the PD-REAL Dataset

Jianjian Qin, Chao Zhang, Chunzhi Gu, Zi Wang, Jun Yu, Yijin Wei, Hui Xiao, Xin Yu

Main category: cs.CV

TL;DR: Summary unavailable: the arXiv export API request for 2311.04095 returned HTTP 429 (rate limited), so no AI-generated summary could be produced.

Abstract: Not fetched (HTTP 429 from the arXiv export API for 2311.04095).

[239] Improved Convex Decomposition with Ensembling and Negative Primitives

Vaibhav Vavilala, Florian Kluger, Seemandhar Jain, Bodo Rosenhahn, Anand Bhattad, David Forsyth

Main category: cs.CV

TL;DR: Summary unavailable: the arXiv export API request for 2405.19569 returned HTTP 429 (rate limited), so no AI-generated summary could be produced.

Abstract: Not fetched (HTTP 429 from the arXiv export API for 2405.19569).

[240] Latent Causal Modeling for 3D Brain MRI Counterfactuals

Wei Peng, Tian Xia, Fabio De Sousa Ribeiro, Tomas Bosschieter, Ehsan Adeli, Qingyu Zhao, Ben Glocker, Kilian M. Pohl

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2409.05585.

[241] Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

Yifan Zhang, Junhui Hou

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2412.08973.

[242] OFTSR: One-Step Flow for Image Super-Resolution with Tunable Fidelity-Realism Trade-offs

Yuanzhi Zhu, Ruiqing Wang, Shilin Lu, Junnan Li, Hanshu Yan, Kai Zhang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2412.09465.

[243] SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers

Zehao Chen, Rong Pan

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2412.10488.

[244] Universal Sparse Autoencoders: Interpretable Cross-Model Concept Alignment

Harrish Thasarathan, Julian Forsyth, Thomas Fel, Matthew Kowal, Konstantinos G. Derpanis

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2502.03714.

[245] DynamicVis: Dynamic Visual Perception for Efficient Remote Sensing Foundation Models

Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Shijian Lu, Zhenwei Shi

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2503.16426.

[246] AsgardBench – Evaluating Visually Grounded Interactive Planning Under Minimal Feedback

Andrea Tupini, Lars Liden, Reuben Tan, Yu Wang, Jianfeng Gao

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2603.15888.

[247] Mobile-VideoGPT: Fast and Accurate Model for Mobile Video Understanding

Abdelrahman Shaker, Muhammad Maaz, Chenhui Gou, Hamid Rezatofighi, Salman Khan, Fahad Shahbaz Khan

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2503.21782.

[248] SuperDec: 3D Scene Decomposition with Superquadric Primitives

Elisabetta Fedele, Boyang Sun, Leonidas Guibas, Marc Pollefeys, Francis Engelmann

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2504.00992.

[249] MIPHEI-ViT: Multiplex Immunofluorescence Prediction from H&E Images using ViT Foundation Models

Guillaume Balezo, Roger Trullo, Albert Pla Planas, Etienne Decenciere, Thomas Walter

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2505.10294.

[250] Rethinking Gradient-based Adversarial Attacks on Point Cloud Classification

Jun Chen, Xinke Li, Mingyue Xu, Chongshou Li, Truiani Li

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2505.21854.

[251] Video Anomaly Detection with Semantics-Aware Information Bottleneck

Juntong Li, Lingwei Dang, Qingxin Xiao, Shishuo Shang, Jiajia Cheng, Haomin Wu, Yun Hao, Qingyao Wu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2506.02535.

[252] Auto-Annotation with Expert-Crafted Guidelines: A Study through 3D LiDAR Detection Benchmark

Yechi Ma, Wei Hua, Shu Kong

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2506.02914.

[253] TR2M: Transferring Monocular Relative Depth to Metric Depth with Language Descriptions and Dual-Level Scale-Oriented Contrast

Beilei Cui, Yiming Huang, Long Bai, Hongliang Ren

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2506.13387.

[254] LiteReality: Graphics-Ready 3D Scene Reconstruction from RGB-D Scans

Zhening Huang, Xiaoyang Wu, Fangcheng Zhong, Hengshuang Zhao, Matthias Nießner, Joan Lasenby

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.02861.

[255] SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations

Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.05751.

[256] Linearly Separable Features in Shallow Nonlinear Networks: Width Scales Polynomially with Intrinsic Data Dimension

Alec S. Xu, Can Yaras, Peng Wang, Qing Qu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2501.02364.

[257] Look Before You Fuse: 2D-Guided Cross-Modal Alignment for Robust 3D Detection

Xiang Li, Zhangchi Hu, Xiao Xu, Bin Kong

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.16861.

[258] Remove360: Benchmarking Residuals After Object Removal in 3D Gaussian Splatting

Simona Kocour, Assia Benbihi, Torsten Sattler

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.11431.

[259] DriveSplat: Unified Neural Gaussian Reconstruction for Dynamic Driving Scenes

Cong Wang, Ruiqi Song, Wei Tian, Chenming Zhang, Lingxi Li, Long Chen

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.15376.

[260] CrossHOI-Bench: A Unified Benchmark for HOI Evaluation across Vision-Language Models and HOI-Specific Methods

Qinqian Lei, Bo Wang, Robby T. Tan

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.18753.

[261] All-in-One Slider for Attribute Manipulation in Diffusion Models

Weixin Ye, Hongguang Zhu, Wei Wang, Yahui Liu, Mengyu Wang, Xuecheng Nie

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2508.19195.

[262] AI-driven Dispensing of Coral Reseeding Devices for Broad-scale Restoration of the Great Barrier Reef

Scarlett Raine, Emilio Olivastri, Benjamin Moshirian, Tobias Fischer

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.01019.

[263] GenCompositor: Generative Video Compositing with Diffusion Transformer

Shuzhou Yang, Xiaoyu Li, Xiaodong Cun, Guangzhi Wang, Lingen Li, Ying Shan, Jian Zhang

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.02460.

[264] Page image classification for content-specific data processing

Kateryna Lutsai, Pavel Straňák

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2507.21114.

[265] A Re-ranking Method using K-nearest Weighted Fusion for Person Re-identification

Huy Che, Le-Chuong Nguyen, Gia-Nghia Tran, Dinh-Duy Phan, Vinh-Tiep Nguyen

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.04050.

[266] LucidFlux: Caption-Free Photo-Realistic Image Restoration via a Large-Scale Diffusion Transformer

Song Fei, Tian Ye, Lujia Wang, Lei Zhu

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.22414.

[267] Soft-Di[M]O: Improving One-Step Discrete Image Generation with Soft Embeddings

Yuanzhi Zhu, Xi Wang, Stéphane Lathuilière, Vicky Kalogeiton

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.22925.

[268] Blind to Position, Biased in Language: Probing Mid-Layer Representational Bias in Vision-Language Encoders for Zero-Shot Language-Grounded Spatial Understanding

Na Min An, Inha Kang, Minhyun Lee, Hyunjung Shim

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2509.23098.

[269] FSFSplatter: Build Surface and Novel Views with Sparse-Views within 2min

Yibin Zhao, Yihan Pan, Jun Nan, Liwei Chen, Jianjun Yi

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.02691.

[270] Object-Centric Representation Learning for Enhanced 3D Semantic Scene Graph Prediction

KunHo Heo, GiHyun Kim, SuYeon Kim, MyeongAh Cho

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.04714.

[271] Unlocking 3D Affordance Segmentation with 2D Semantic Knowledge

Yu Huang, Zelin Peng, Changsong Wen, Xiaokang Yang, Wei Shen

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.08316.

[272] Seeing Beyond the Image: ECG and Anatomical Knowledge-Guided Myocardial Scar Segmentation from Late Gadolinium-Enhanced Images

Farheen Ramzan, Yusuf Kiberu, Nikesh Jathanna, Meryem Jabrane, Vicente Grau, Shahnaz Jamil-Copley, Richard H. Clayton, Chen Chen

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.14702.

[273] DREAM: A Benchmark Study for Deepfake photoREalism AssessMent

Bo Peng, Zichuan Wang, Sheng Yu, Xiaochuan Jin, Wei Wang, Jing Dong

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.10053.

[274] Towards more holistic interpretability: A lightweight disentangled Concept Bottleneck Model

Gaoxiang Huang, Songning Lai, Yutao Yue

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.15770.

[275] From Far and Near: Perceptual Evaluation of Crowd Representations Across Levels of Detail

Xiaohan Sun, Carol O’Sullivan

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.20558.

[276] Activation Quantization of Vision Encoders Needs Prefixing Registers

Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2510.04547.

[277] Infinity-RoPE: Action-Controllable Infinite Video Generation Emerges From Autoregressive Self-Rollout

Hidir Yesiltepe, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Pinar Yanardag

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.20649.

[278] Shoe Style-Invariant and Ground-Aware Learning for Dense Foot Contact Estimation

Daniel Sungho Jung, Kyoung Mu Lee

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2511.22184.

[279] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images

Rao Muhammad Umer, Daniel Sens, Jonathan Noll, Sohom Dey, Christian Matek, Lukas Wolfseher, Rainer Spang, Ralf Huss, Johannes Raffler, Sarah Reinke, Ario Sadafi, Wolfram Klapper, Katja Steiger, Kristina Schwamborn, Carsten Marr

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.14640.

[280] Efficient and Scalable Monocular Human-Object Interaction Motion Reconstruction

Boran Wen, Ye Lu, Sirui Wang, Keyan Wan, Jiahong Zhou, Junxuan Liang, Xinpeng Liu, Bang Xiao, Ruiyang Liu, Yong-Lu Li

Main category: cs.CV

TL;DR: Summary unavailable; the arXiv API returned HTTP 429 (rate limited) for 2512.00960.

[281] Embedding Physical Reasoning into Diffusion-Based Shadow Generation

Shilin Hu, Jingyi Xu, Akshat Dave, Dimitris Samaras, Hieu Le

Main category: cs.CV

Summary unavailable: arXiv API request for 2512.06174 returned HTTP 429 (rate limited).

[282] GeoMotionGPT: Geometry-Aligned Motion Understanding with Large Language Models

Zhankai Ye, Bofan Li, Yukai Jin, Shuoqiu Li, Wei Wang, Yanfu Zhang, Shangqian Gao, Xin Liu

Main category: cs.CV

Summary unavailable: arXiv API request for 2601.07632 returned HTTP 429 (rate limited).

[283] Cast and Attached Shadow Detection via Iterative Light and Geometry Reasoning

Shilin Hu, Jingyi Xu, Sagnik Das, Dimitris Samaras, Hieu Le

Main category: cs.CV

Summary unavailable: arXiv API request for 2512.06179 returned HTTP 429 (rate limited).

[284] GTAvatar: Bridging Gaussian Splatting and Texture Mapping for Relightable and Editable Gaussian Avatars

Kelian Baert, Mae Younes, Francois Bourel, Marc Christie, Adnane Boukhayma

Main category: cs.CV

Summary unavailable: arXiv API request for 2512.09162 returned HTTP 429 (rate limited).

[285] 1S-DAug: One-Shot Data Augmentation for Robust Few-Shot Generalization

Yunwei Bai, Ying Kiat Tan, Yao Shu, Tsuhan Chen

Main category: cs.CV

Summary unavailable: arXiv API request for 2602.00114 returned HTTP 429 (rate limited).

[286] GriDiT: Factorized Grid-Based Diffusion for Efficient Long Image Sequence Generation

Snehal Singh Tomar, Alexandros Graikos, Arjun Krishna, Dimitris Samaras, Klaus Mueller

Main category: cs.CV

Summary unavailable: arXiv API request for 2512.21276 returned HTTP 429 (rate limited).

[287] The Mechanics of CNN Filtering with Rectification

Liam Frija-Altarac, Matthew Toews

Main category: cs.CV

Summary unavailable: arXiv API request for 2512.24338 returned HTTP 429 (rate limited).

[288] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment

Wenzhi Chen, Bo Hu, Leida Li, Lihuo He, Wen Lu, Xinbo Gao

Main category: cs.CV

Summary unavailable: arXiv API request for 2601.04614 returned HTTP 429 (rate limited).

[289] Image2Garment: Simulation-ready Garment Generation from a Single Image

Selim Emir Can, Jan Ackermann, Kiyohiro Nakayama, Ruofan Liu, Tong Wu, Yang Zheng, Hugo Bertiche, Menglei Chai, Thabo Beeler, Gordon Wetzstein

Main category: cs.CV

Summary unavailable: arXiv API request for 2601.09658 returned HTTP 429 (rate limited).

[290] Towards Onboard Continuous Change Detection for Floods

Daniel Kyselica, Jonáš Herec, Oliver Kutis, Rado Pitoňák

Main category: cs.CV

Summary unavailable: arXiv API request for 2601.13751 returned HTTP 429 (rate limited).

[291] SuperOcc: Toward Cohesive Temporal Modeling for Superquadric-based 3D Occupancy Prediction

Zichen Yu, Quanli Liu, Wei Wang, Liyong Zhang, Xiaoguang Zhao

Main category: cs.CV

Summary unavailable: arXiv API request for 2601.15644 returned HTTP 429 (rate limited).

[292] Is Hierarchical Quantization Essential for Optimal Reconstruction?

Shirin Reyhanian, Laurenz Wiskott

Main category: cs.CV

Summary unavailable: arXiv API request for 2601.22244 returned HTTP 429 (rate limited).

[293] What Is Wrong with Synthetic Data for Scene Text Recognition? A Strong Synthetic Engine with Diverse Simulations and Self-Evolution

Xingsong Ye, Yongkun Du, JiaXin Zhang, Chen Li, Jing Lyu, Zhineng Chen

Main category: cs.CV

Summary unavailable: arXiv API request for 2602.06450 returned HTTP 429 (rate limited).

[294] PFGNet: A Fully Convolutional Frequency-Guided Peripheral Gating Network for Efficient Spatiotemporal Predictive Learning

Xinyong Cai, Changbin Sun, Yong Wang, Hongyu Yang, Yuankai Wu

Main category: cs.CV

Summary unavailable: arXiv API request for 2602.20537 returned HTTP 429 (rate limited).

[295] Deep Expert Injection for Anchoring Retinal VLMs with Domain-Specific Knowledge

Shuai Lu, Meng Wang, Jia Guo, Jiawei Du, Bo Liu, Shengzhu Yang, Weihang Zhang, Huazhu Fu, Huiqi Li

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.07131 returned HTTP 429 (rate limited).

[296] How to Take a Memorable Picture? Empowering Users with Actionable Feedback

Francesco Laiti, Davide Talon, Jacopo Staiano, Elisa Ricci

Main category: cs.CV

Summary unavailable: arXiv API request for 2602.21877 returned HTTP 429 (rate limited).

[297] LucidNFT: LR-Anchored Multi-Reward Preference Optimization for Generative Real-World Super-Resolution

Song Fei, Tian Ye, Sixiang Chen, Zhaohu Xing, Jianyu Lai, Lei Zhu

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.05947 returned HTTP 429 (rate limited).

[298] A Unified View of Drifting and Score-Based Models

Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, Molei Tao

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.07514 returned HTTP 429 (rate limited).

[299] A Simple Efficiency Incremental Learning Framework via Vision-Language Model with Nonlinear Multi-Adapters

Haihua Luo, Xuming Ran, Jiangrong Shen, Timo Hämäläinen, Zhonghua Chen, Qi Xu, Fengyu Cong

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.11211 returned HTTP 429 (rate limited).

[300] COTONET: A custom cotton detection algorithm based on YOLO11 for stage of growth cotton boll detection

Guillem González, Guillem Alenyà, Sergi Foix

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.11717 returned HTTP 429 (rate limited).

[301] SoulX-LiveAct: Towards Hour-Scale Real-Time Human Animation with Neighbor Forcing and ConvKV Memory

Dingcheng Zhen, Xu Zheng, Ruixin Zhang, Zhiqi Jiang, Yichao Yan, Ming Tao, Shunshun Yin

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.11746 returned HTTP 429 (rate limited).

[302] Multimodal OCR: Parse Anything from Documents

Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei ma, Yu Chen, Yuqiu Ji, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.13032 returned HTTP 429 (rate limited).

[303] Expanding mmWave Datasets for Human Pose Estimation with Unlabeled Data and LiDAR Datasets

Zhuoxuan Peng, Boan Zhu, Xingjian Zhang, Wenying Li, S.-H. Gary Chan

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.14507 returned HTTP 429 (rate limited).

[304] LICA: Layered Image Composition Annotations for Graphic Design Research

Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16098 returned HTTP 429 (rate limited).

[305] EPOFusion: Exposure aware Progressive Optimization Method for Infrared and Visible Image Fusion

Zhiwei Wang, Yayu Zheng, Defeng He, Li Zhao, Xiaoqin Zhang, Yuxing Li, Edmund Y. Lam

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16130 returned HTTP 429 (rate limited).

[306] MLLM-based Textual Explanations for Face Comparison

Redwan Sony, Anil K Jain, Ross Arun

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16629 returned HTTP 429 (rate limited).

[307] Synergizing Deep Learning and Biological Heuristics for Extreme Long-Tail White Blood Cell Classification

Duc T. Nguyen, Hoang-Long Nguyen, Huy-Hieu Pham

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16249 returned HTTP 429 (rate limited).

[308] Iris: Bringing Real-World Priors into Diffusion Model for Monocular Depth Estimation

Xinhao Cai, Gensheng Pei, Zeren Sun, Yazhou Yao, Fumin Shen, Wenguan Wang

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16340 returned HTTP 429 (rate limited).

[309] Bridging the Simulation-to-Reality Gap in Electron Microscope Calibration via VAE-EM Estimation

Jilles S. van Hulst, W.P.M.H. Heemels, Duarte J. Antunes

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16549 returned HTTP 429 (rate limited).

[310] Mixture of Style Experts for Diverse Image Stylization

Shihao Zhu, Ziheng Ouyang, Yijia Kang, Qilong Wang, Mi Zhou, Bo Li, Ming-Ming Cheng, Qibin Hou

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.16649 returned HTTP 429 (rate limited).

[311] A Proposal-Free Query-Guided Network for Grounded Multimodal Named Entity Recognition

Hongbing Li, Jiamin Liu, Shuo Zhang, Bo Xiao

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.17314 returned HTTP 429 (rate limited).

[312] TransText: Alpha-as-RGB Representation for Transparent Text Animation

Fei Zhang, Zijian Zhou, Bohao Tang, Sen He, Hang Li, Zhe Wang, Soubhik Sanyal, Pengfei Liu, Viktar Atliha, Tao Xiang, Frost Xu, Semih Gunel

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.17944 returned HTTP 429 (rate limited).

[313] TiROD: Tiny Robotics Dataset and Benchmark for Continual Object Detection

Francesco Pasti, Riccardo De Monte, Davide Dalle Pezze, Gian Antonio Susto, Nicola Bellotto

Main category: cs.CV

Summary unavailable: arXiv API request for 2409.16215 returned HTTP 429 (rate limited).

[314] Latent Representations for Visual Proprioception in Inexpensive Robots

Sahara Sheikholeslami, Ladislau Bölöni

Main category: cs.CV

Summary unavailable: arXiv API request for 2504.14634 returned HTTP 429 (rate limited).

[315] QualitEye: Public and Privacy-preserving Gaze Data Quality Verification

Mayar Elfares, Pascal Reisert, Ralf Küsters, Andreas Bulling

Main category: cs.CV

Summary unavailable: arXiv API request for 2506.05908 returned HTTP 429 (rate limited).

[316] TrajBooster: Boosting Humanoid Whole-Body Manipulation via Trajectory-Centric Learning

Jiacheng Liu, Pengxiang Ding, Qihang Zhou, Yuxuan Wu, Da Huang, Zimian Peng, Wei Xiao, Weinan Zhang, Lixin Yang, Cewu Lu, Donglin Wang

Main category: cs.CV

Summary unavailable: arXiv API request for 2509.11839 returned HTTP 429 (rate limited).

[317] AI Pose Analysis and Kinematic Profiling of Range-of-Motion Variations in Resistance Training

Adam Diamant

Main category: cs.CV

Summary unavailable: arXiv API request for 2510.20012 returned HTTP 429 (rate limited).

[318] HaltNav: Reactive Visual Halting over Lightweight Topological Priors for Robust Vision-Language Navigation

Zihui Yu, Pingcong Li, Bichi Zhang, Sören Schwertfeger

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.12696 returned HTTP 429 (rate limited).

[319] ITKIT: Feasible CT Image Analysis based on SimpleITK and MMEngine

Yiqin Zhang, Meiling Chen

Main category: cs.CV

Summary unavailable: arXiv API request for 2603.14255 returned HTTP 429 (rate limited).

cs.AI

[320] DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Weisheng Xu, Ziteng Wang, Ruofan Liao, Yutong Zhang, Sichen Liu

Main category: cs.AI

TL;DR: DEAF benchmark evaluates Audio MLLMs’ acoustic faithfulness using 2,700+ conflict stimuli across emotional prosody, background sounds, and speaker identity to test if models genuinely process audio or rely on text semantics.

Motivation: Current Audio MLLMs show impressive speech benchmark performance, but it's unclear whether they actually process acoustic signals or just rely on text-based semantic inference. There's a need to systematically evaluate acoustic faithfulness and disentangle content-driven bias from prompt-induced sycophancy.

Method: Created DEAF benchmark with 2,700+ conflict stimuli across three acoustic dimensions. Designed controlled multi-level evaluation framework that progressively increases textual influence (semantic conflicts → misleading prompts → combination). Introduced diagnostic metrics to quantify model reliance on textual cues vs acoustic signals. Evaluated seven Audio MLLMs.

Result: Evaluation revealed consistent pattern of text dominance: models are sensitive to acoustic variations, but predictions are predominantly driven by textual inputs. Shows gap between high performance on standard speech benchmarks and genuine acoustic understanding.

Conclusion: Audio MLLMs exhibit text dominance over acoustic processing, revealing a fundamental limitation in current models’ acoustic understanding despite strong benchmark performance. The DEAF benchmark provides systematic evaluation framework for acoustic faithfulness.

Abstract: Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To systematically study this question, we introduce DEAF (Diagnostic Evaluation of Acoustic Faithfulness), a benchmark of over 2,700 conflict stimuli spanning three acoustic dimensions: emotional prosody, background sounds, and speaker identity. Then, we design a controlled multi-level evaluation framework that progressively increases textual influence, ranging from semantic conflicts in the content to misleading prompts and their combination, allowing us to disentangle content-driven bias from prompt-induced sycophancy. We further introduce diagnostic metrics to quantify model reliance on textual cues over acoustic signals. Our evaluation of seven Audio MLLMs reveals a consistent pattern of text dominance: models are sensitive to acoustic variations, yet predictions are predominantly driven by textual inputs, revealing a gap between high performance on standard speech benchmarks and genuine acoustic understanding.
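The diagnostic idea above, quantifying reliance on textual cues over acoustic signals, can be illustrated with a toy metric. The field names and the metric itself are hypothetical sketches for illustration, not the paper's actual formulation.

```python
def text_reliance_rate(trials):
    """On conflict stimuli, where the text cue and the acoustic cue imply
    different answers, measure how often the model's prediction follows
    the text cue rather than the audio cue."""
    conflicts = [t for t in trials if t["text_label"] != t["audio_label"]]
    if not conflicts:
        return 0.0
    follow_text = sum(t["prediction"] == t["text_label"] for t in conflicts)
    return follow_text / len(conflicts)
```

A rate near 1.0 on conflict stimuli would correspond to the text-dominance pattern the paper reports.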

[321] Continually self-improving AI

Zitong Yang

Main category: cs.AI

TL;DR: This thesis proposes three approaches for creating continually self-improving AI systems that overcome limitations of human-dependent training: 1) synthetic data generation to amplify small corpora for efficient knowledge acquisition, 2) self-generated synthetic data to bootstrap pretraining without human instruction-tuned models, and 3) algorithmic search at test time to explore learning configurations beyond human design.

Motivation: Current language models are fundamentally limited by human dependencies in three ways: inefficient knowledge acquisition from small corpora after pretraining, reliance on finite human-generated training data, and confinement to human-discovered training algorithms. The thesis aims to overcome these limitations to enable continually self-improving AI systems.

Method: Three main approaches: 1) Synthetic data generation to diversify and amplify small specialized corpora into rich knowledge representations for parameter updates; 2) Self-generated synthetic data bootstrapping that allows models to develop fundamental pretraining capabilities without distillation from instruction-tuned LMs; 3) Test-time algorithmic search that scales search over learning algorithm configurations beyond what human researchers can manually explore.

Result: The thesis demonstrates that AI systems can overcome human dependencies through: effective parameter updates from limited source material via synthetic data, bootstrapping of pretraining capabilities without human instruction-tuned models, and exploration of larger algorithm configuration spaces than human researchers can manually investigate.

Conclusion: The presented approaches represent steps toward creating continually self-improving AI systems that can overcome fundamental limitations imposed by human creators, potentially enabling more autonomous and capable AI development.

Abstract: Modern language model-based AI systems are remarkably powerful, yet their capabilities remain fundamentally capped by their human creators in three key ways. First, although a model’s weights can be updated via fine-tuning, acquiring new knowledge from small, specialized corpora after pretraining remains highly data-inefficient. Second, the training of these systems relies heavily on finite, human-generated data from across history. Third, the pipelines used to train AI models are confined by the algorithms that human researchers can discover and explore. This thesis takes a small step toward overcoming these inherent limitations, presenting three chapters aimed at breaking these dependencies to create continually self-improving AI. First, to overcome this data-efficiency barrier in knowledge acquisition, we propose a synthetic data approach that diversifies and amplifies small corpora into rich knowledge representations, enabling a model to effectively update its parameters from limited source material. Second, to reduce reliance on human data, we show that given a fixed amount of such data, the model can self-generate synthetic data to bootstrap its fundamental pretraining capabilities without distillation from any off-the-shelf, instruction-tuned LM. Finally, to transcend human-engineered training paradigms, we demonstrate that by scaling search during test time over the space of algorithms, AI can search over a larger space of learning algorithm configurations than human researchers can explore manually.

[322] MemArchitect: A Policy Driven Memory Governance Layer

Lingavasan Suresh Kumar, Yang Ba, Rong Pan

Main category: cs.AI

TL;DR: MemArchitect introduces a governance layer for LLM agent memory management to address contradictions, privacy, and outdated information issues in standard RAG frameworks.

Motivation: Current LLM agent memory systems treat memory as passive storage without mechanisms to resolve contradictions, enforce privacy, or prevent outdated information ("zombie memories") from contaminating the context window, creating governance gaps.

Method: MemArchitect decouples memory lifecycle management from model weights and enforces explicit, rule-based policies including memory decay, conflict resolution, and privacy controls.

Result: Governed memory consistently outperforms unmanaged memory in agentic settings, demonstrating the necessity of structured memory governance for reliable and safe autonomous systems.

Conclusion: Structured memory governance through systems like MemArchitect is essential for reliable and safe autonomous LLM agents, addressing critical gaps in current memory management approaches.

Abstract: Persistent Large Language Model (LLM) agents expose a critical governance gap in memory management. Standard Retrieval-Augmented Generation (RAG) frameworks treat memory as passive storage, lacking mechanisms to resolve contradictions, enforce privacy, or prevent outdated information (“zombie memories”) from contaminating the context window. We introduce MemArchitect, a governance layer that decouples memory lifecycle management from model weights. MemArchitect enforces explicit, rule-based policies, including memory decay, conflict resolution, and privacy controls. We demonstrate that governed memory consistently outperforms unmanaged memory in agentic settings, highlighting the necessity of structured memory governance for reliable and safe autonomous systems.
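The rule-based policies described above (decay, conflict resolution, privacy controls) could look roughly like this minimal sketch. The class, policy rules, and TTL threshold are illustrative assumptions, not MemArchitect's actual implementation.

```python
import time
from dataclasses import dataclass

@dataclass
class MemoryItem:
    text: str
    created_at: float
    private: bool = False

class GovernedMemory:
    """Toy policy layer: decay, newest-wins conflict resolution, privacy."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.items = {}  # topic -> MemoryItem

    def write(self, topic, text, private=False):
        # Conflict policy: the newest write on a topic replaces older ones,
        # so stale "zombie" entries cannot linger alongside fresh facts.
        self.items[topic] = MemoryItem(text, time.time(), private)

    def read(self, topic, now=None):
        now = time.time() if now is None else now
        item = self.items.get(topic)
        if item is None or item.private:      # privacy policy: never surface
            return None
        if now - item.created_at > self.ttl:  # decay policy: expire and purge
            del self.items[topic]
            return None
        return item.text
```

The key design point is that these policies live outside the model weights: the agent only sees what `read` returns after governance is applied.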

[323] Multi-Trait Subspace Steering to Reveal the Dark Side of Human-AI Interaction

Xin Wei Chia, Swee Liang Wong, Jonathan Pan

Main category: cs.AI

TL;DR: A framework called MultiTraitsss generates “Dark models” that exhibit harmful AI behaviors to study and mitigate negative psychological outcomes in human-AI interactions.

Motivation: Recent incidents show human-AI interactions can lead to negative psychological outcomes like mental health crises, but studying these harmful interactions is methodologically challenging due to the need for sustained engagement and extensive conversational context.

Method: Developed Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages crisis-associated traits and subspace steering to generate “Dark models” that exhibit cumulative harmful behavioral patterns.

Result: Single-turn and multi-turn evaluations show the dark models consistently produce harmful interactions and outcomes. The framework enables proposing protective measures to reduce harmful outcomes.

Conclusion: The MultiTraitsss framework provides a way to study harmful human-AI interactions and develop protective measures against negative psychological outcomes from AI systems.

Abstract: Recent incidents have highlighted alarming cases where human-AI interactions led to negative psychological outcomes, including mental health crises and even user harm. As LLMs serve as sources of guidance, emotional support, and even informal therapy, these risks are poised to escalate. However, studying the mechanisms underlying harmful human-AI interactions presents significant methodological challenges, where organic harmful interactions typically develop over sustained engagement, requiring extensive conversational context that is difficult to simulate in controlled settings. To address this gap, we developed a Multi-Trait Subspace Steering (MultiTraitsss) framework that leverages established crisis-associated traits and a novel subspace steering framework to generate Dark models that exhibit cumulative harmful behavioral patterns. Single-turn and multi-turn evaluations show that our dark models consistently produce harmful interactions and outcomes. Using our Dark models, we propose protective measures to reduce harmful outcomes in Human-AI interactions.

[324] Adaptive Domain Models: Bayesian Evolution, Warm Rotation, and Principled Training for Geometric and Neuromorphic AI

Houston Haynes

Main category: cs.AI

TL;DR: A novel AI training architecture using geometric algebra, deterministic memory management, and posit arithmetic enables depth-independent training memory, grade-preserving weight updates, and exact gradient accumulation, with applications to both conventional and neuromorphic models.

Motivation: Current AI training suffers from memory overhead, optimizer complexity, and geometric property degradation due to reliance on reverse-mode automatic differentiation over IEEE-754 arithmetic. The paper aims to address these limitations through a fundamentally different arithmetic substrate and training architecture.

Method: Combines three prior results: 1) Dimensional Type System and Deterministic Memory Management for stack-eligible gradient allocation and exact quire accumulation, 2) Program Hypergraph for grade preservation through geometric algebra computations, and 3) b-posit 2026 standard for tractable posit arithmetic. Introduces Bayesian distillation for extracting latent prior structure and warm rotation for seamless model updates.

Result: Achieves depth-independent training memory bounded to ~2x inference footprint, grade-preserving weight updates, exact gradient accumulation, and enables continuous adaptation. Creates domain-specific AI systems that are smaller, more precise, verifiably correct, and initializable from existing models.

Conclusion: The proposed architecture overcomes fundamental limitations of current training infrastructure by changing the arithmetic substrate and computational foundations, enabling more efficient, precise, and verifiable AI systems that can be continuously adapted while maintaining structural correctness.

Abstract: Prevailing AI training infrastructure assumes reverse-mode automatic differentiation over IEEE-754 arithmetic. The memory overhead of training relative to inference, optimizer complexity, and structural degradation of geometric properties through training are consequences of this arithmetic substrate. This paper develops an alternative training architecture grounded in three prior results: the Dimensional Type System and Deterministic Memory Management framework [6], which establishes stack-eligible gradient allocation and exact quire accumulation as design-time verifiable properties; the Program Hypergraph [8], which establishes grade preservation through geometric algebra computations as a type-level invariant; and the b-posit 2026 standard [10], which makes posit arithmetic tractable across hardware targets conventionally considered inference-only. Their composition enables depth-independent training memory bounded to approximately twice the inference footprint, grade-preserving weight updates, and exact gradient accumulation, applicable uniformly to loss-function-optimized and spike-timing-dependent neuromorphic models. We introduce Bayesian distillation, a mechanism by which the latent prior structure of a general-purpose model is extracted through the ADM training regime, resolving the data-scarcity bootstrapping problem for domain-specific training. For deployment, we introduce warm rotation, an operational pattern in which an updated model transitions into an active inference pathway without service interruption, with structural correctness formalized through PHG certificates and signed version records. The result is a class of domain-specific AI systems that are smaller and more precise than general-purpose models, continuously adaptive, verifiably correct with respect to the physical structure of their domains, and initializable from existing models.

[325] Don’t Vibe Code, Do Skele-Code: Interactive No-Code Notebooks for Subject Matter Experts to Build Lower-Cost Agentic Workflows

Sriram Gopalakrishnan

Main category: cs.AI

TL;DR: Skele-Code is a natural-language and graph-based interface for building AI agent workflows, designed for non-technical users with notebook-style development and code-first approach to reduce token costs.

Motivation: To create an accessible interface for less technical users to build AI agent workflows, addressing the complexity and high token costs of traditional multi-agent systems while enabling modular, extensible workflow development.

Method: Natural-language and graph-based interface with notebook-style incremental development; converts steps to code with required functions; uses agents only for code generation and error recovery (not orchestration); employs context-engineering to reduce token usage.

Result: Produces modular, extensible, and shareable workflows that can be used as skills by agents or as steps in other workflows; reduces token costs compared to multi-agent system approaches.

Conclusion: Skele-Code provides an effective code-first, agent-supported approach to workflow building that is accessible to non-technical users while being efficient and modular.

Abstract: Skele-Code is a natural-language and graph-based interface for building workflows with AI agents, designed especially for less or non-technical users. It supports incremental, interactive notebook-style development, and each step is converted to code with a required set of functions and behavior to enable incremental building of workflows. Agents are invoked only for code generation and error recovery, not orchestration or task execution. This agent-supported, but code-first approach to workflows, along with the context-engineering used in Skele-Code, can help reduce token costs compared to the multi-agent system approach to executing workflows. Skele-Code produces modular, easily extensible, and shareable workflows. The generated workflows can also be used as skills by agents, or as steps in other workflows.

[326] Reflection in the Dark: Exposing and Escaping the Black Box in Reflective Prompt Optimization

Shiyan Liu, Qifeng Xia, Qiyun Xia, Yisheng Liu, Xinyu Yu, Rui Qu

Main category: cs.AI

TL;DR: VISTA is a multi-agent automatic prompt optimization framework that improves on existing reflective APO methods by decoupling hypothesis generation from prompt rewriting, enabling interpretable optimization and better performance on math reasoning tasks.

Motivation: Existing reflective APO methods like GEPA have limitations: they operate as black-box, label-free optimization processes leading to uninterpretable trajectories and systematic failures. The authors identified four specific limitations where these methods can degrade performance rather than improve it.

Method: VISTA uses a multi-agent framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization traces. It incorporates a two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling to escape local optima.

Result: VISTA recovered accuracy to 87.57% on GSM8K with a defective seed (where GEPA degraded from 23.81% to 13.50%), and consistently outperformed baselines across all conditions on GSM8K and AIME2025 benchmarks.

Conclusion: VISTA addresses key limitations of existing APO methods by providing interpretable optimization, better handling of defective seeds, and improved performance on mathematical reasoning tasks through its multi-agent architecture and explore-exploit mechanisms.

Abstract: Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
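The two-layer explore-exploit mechanism (random restart plus epsilon-greedy sampling) admits a compact sketch. The function name, probabilities, and scoring are assumptions for illustration, not VISTA's actual code.

```python
import random

def select_prompt(candidates, scores, epsilon=0.2, restart_prob=0.05, rng=None):
    """Two-layer explore-exploit over candidate prompts: occasionally restart
    from a uniformly random candidate, otherwise run epsilon-greedy on the
    minibatch-verified scores."""
    rng = rng or random.Random()
    if rng.random() < restart_prob:   # layer 1: random restart escapes optima
        return rng.choice(candidates)
    if rng.random() < epsilon:        # layer 2: epsilon-greedy exploration
        return rng.choice(candidates)
    best = max(range(len(candidates)), key=scores.__getitem__)  # exploit
    return candidates[best]
```

In this sketch the rewriter usually exploits the best-scoring prompt, while the two stochastic layers keep the search from locking onto a defective seed.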

[327] Efficient Dense Crowd Trajectory Prediction Via Dynamic Clustering

Antonius Bima Murti Wijaya, Paul Henderson, Marwa Mahmoud

Main category: cs.AI

TL;DR: A cluster-based approach for crowd trajectory prediction that groups individuals with similar attributes over time to enable faster processing in dense crowd scenarios.

Motivation: Existing trajectory prediction methods overlook dense crowd scenarios where automation challenges become more pronounced due to massiveness, noisiness, and inaccuracy of tracking outputs, resulting in high computational costs.

Method: Proposes a novel cluster-based approach that groups individuals based on similar attributes over time, enabling faster execution through accurate group summarization. The plug-and-play method can be combined with existing trajectory predictors by using output centroids in place of pedestrian inputs.

Result: The approach leads to faster processing and lower memory usage when compared with state-of-the-art methods while maintaining accuracy, as demonstrated on several challenging dense crowd scenes.

Conclusion: The cluster-based approach effectively addresses computational challenges in dense crowd trajectory prediction while maintaining prediction accuracy through efficient group summarization.

Abstract: Crowd trajectory prediction plays a crucial role in public safety and management, where it can help prevent disasters such as stampedes. Recent works address the problem by predicting individual trajectories and considering surrounding objects based on manually annotated data. However, these approaches tend to overlook dense crowd scenarios, where the challenges of automation become more pronounced due to the massiveness, noisiness, and inaccuracy of the tracking outputs, resulting in high computational costs. To address these challenges, we propose and extensively evaluate a novel cluster-based approach that groups individuals based on similar attributes over time, enabling faster execution through accurate group summarisation. Our plug-and-play method can be combined with existing trajectory predictors by using our output centroid in place of their pedestrian input. We evaluate our proposed method on several challenging dense crowd scenes. We demonstrate that our approach leads to faster processing and lower memory usage when compared with state-of-the-art methods, while maintaining accuracy.
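The plug-and-play idea of replacing per-pedestrian inputs with cluster centroids can be sketched with a simple greedy spatial grouping. The radius threshold and grouping rule are illustrative stand-ins for the paper's attribute-based clustering over time.

```python
import math

def cluster_centroids(positions, radius=2.0):
    """Greedily merge each agent into the first cluster whose centroid lies
    within `radius`; the resulting centroids replace per-pedestrian inputs
    fed to a downstream trajectory predictor."""
    clusters = []  # each cluster: [sum_x, sum_y, count]
    for x, y in positions:
        for c in clusters:
            cx, cy = c[0] / c[2], c[1] / c[2]
            if math.hypot(x - cx, y - cy) <= radius:
                c[0] += x; c[1] += y; c[2] += 1
                break
        else:
            clusters.append([x, y, 1])
    return [(c[0] / c[2], c[1] / c[2]) for c in clusters]
```

Because the predictor then sees one centroid per group instead of one track per person, the input size, and hence compute and memory, scales with the number of groups rather than the crowd size.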

[328] TeachingCoach: A Fine-Tuned Scaffolding Chatbot for Instructional Guidance to Instructors

Isabel Molnar, Peiyu Li, Si Chen, Sugana Chawla, James Lang, Ronald Metoyer, Ting Hua, Nitesh V. Chawla

Main category: cs.AI

TL;DR: TeachingCoach is a pedagogically grounded chatbot for instructor professional development that uses educational resource extraction and synthetic dialogue generation to provide real-time conversational guidance.

Motivation: Higher education instructors lack timely, pedagogically grounded support at scale; existing solutions are either generic chatbot advice or non-scalable human consultations.

Method: Data-centric pipeline extracts pedagogical rules from educational resources, uses synthetic dialogue generation to fine-tune a specialized language model for problem identification, diagnosis, and strategy development.

Result: Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than GPT-4o mini baseline; user study reveals trade-offs between conversational depth and interaction efficiency.

Conclusion: Pedagogically grounded, synthetic data-driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.

Abstract: Higher education instructors often lack timely and pedagogically grounded support, as scalable instructional guidance remains limited and existing tools rely on generic chatbot advice or non-scalable teaching center human-human consultations. We present TeachingCoach, a pedagogically grounded chatbot designed to support instructor professional development through real-time, conversational guidance. TeachingCoach is built on a data-centric pipeline that extracts pedagogical rules from educational resources and uses synthetic dialogue generation to fine-tune a specialized language model that guides instructors through problem identification, diagnosis, and strategy development. Expert evaluations show TeachingCoach produces clearer, more reflective, and more responsive guidance than a GPT-4o mini baseline, while a user study with higher education instructors highlights trade-offs between conversational depth and interaction efficiency. Together, these results demonstrate that pedagogically grounded, synthetic data driven chatbots can improve instructional support and offer a scalable design approach for future instructional chatbot systems.

[329] Access Controlled Website Interaction for Agentic AI with Delegated Critical Tasks

Sunyoung Kim, Hokeun Kim

Main category: cs.AI

TL;DR: Website design with fine-grained access control for AI agents performing delegated critical tasks, addressing security gaps in current agentic AI web interactions.

Motivation: Current websites lack proper access control mechanisms for agentic AI systems that perform delegated critical tasks on users' behalf, creating security and safety gaps.

Method: Proposed website design with fine-grained access control for AI agents, plus modifications to open-source authorization service protocols to tailor them for agentic AI delegation.

Result: Evaluation demonstrates capabilities of the access-controlled website when used by AI agents for delegated critical tasks.

Conclusion: The proposed approach addresses security limitations in current agentic AI web interactions through fine-grained access control mechanisms.

Abstract: Recent studies reveal gaps in delegating critical tasks to agentic AI that accesses websites on the user’s behalf, primarily due to limited access control mechanisms on websites designed for agentic AI. In response, we propose a design of website-based interaction for AI agents with fine-grained access control for delegated critical tasks. Our approach encompasses a website design and implementation, as well as modifications to the access grant protocols in an open-source authorization service to tailor it to agentic AI, with delegated critical tasks on the website. The evaluation of our approach demonstrates the capabilities of our access-controlled website used by AI agents.

[330] Reasonably reasoning AI agents can avoid game-theoretic failures in zero-shot, provably

Enoch Hyunwook Kang

Main category: cs.AI

TL;DR: AI agents can achieve Nash equilibrium play zero-shot through reasoning without explicit post-training, even with private payoff information.

Motivation: AI agents in interactive economic environments often fail to reach strategic equilibria despite advanced capabilities, and universal alignment methods across diverse AI models are impractical.

Method: Theoretical analysis of “reasonably reasoning” agents that form beliefs about others’ strategies from observations and learn to best respond, plus empirical validation through five game simulations including prisoner’s dilemma and marketing promotion games.

Result: Proves that reasoning agents eventually behave close to Nash equilibrium on almost every play path, even with unknown stage payoffs and private stochastic payoff observations. Empirical simulations confirm these theoretical findings.

Conclusion: AI agents naturally exhibit reasoning patterns that lead to stable equilibrium behaviors intrinsically, reducing the need for universal alignment procedures in strategic interactions.

Abstract: AI agents are increasingly deployed in interactive economic environments characterized by repeated AI-AI interactions. Despite AI agents’ advanced capabilities, empirical studies reveal that such interactions often fail to stably induce a strategic equilibrium, such as a Nash equilibrium. Post-training methods have been proposed to induce a strategic equilibrium; however, it remains impractical to uniformly apply an alignment method across diverse, independently developed AI models in strategic settings. In this paper, we provide theoretical and empirical evidence that off-the-shelf reasoning AI agents can achieve Nash-like play zero-shot, without explicit post-training. Specifically, we prove that `reasonably reasoning’ agents, i.e., agents capable of forming beliefs about others’ strategies from previous observation and learning to best respond to these beliefs, eventually behave along almost every realized play path in a way that is weakly close to a Nash equilibrium of the continuation game. In addition, we relax the common-knowledge payoff assumption by allowing stage payoffs to be unknown and by having each agent observe only its own privately realized stochastic payoffs, and we show that we can still achieve the same on-path Nash convergence guarantee. We then empirically validate the proposed theories by simulating five game scenarios, ranging from a repeated prisoner’s dilemma game to stylized repeated marketing promotion games. Our findings suggest that AI agents naturally exhibit such reasoning patterns and therefore attain stable equilibrium behaviors intrinsically, obviating the need for universal alignment procedures in many real-world strategic interactions.
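The "reasonably reasoning" behavior described above (form empirical beliefs from observed play, then best respond) is close in spirit to classical fictitious play, sketched here for the repeated prisoner's dilemma. The payoff values are the textbook ones, not necessarily those used in the paper's simulations.

```python
from collections import Counter

# Row player's payoffs in the prisoner's dilemma (C = cooperate, D = defect).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def best_response(opponent_history):
    """Form an empirical belief over the opponent's actions from observed
    play, then pick the action maximizing expected payoff under that belief."""
    counts = Counter(opponent_history)
    total = sum(counts.values()) or 1
    belief = {a: counts[a] / total for a in ("C", "D")}
    expected = {me: sum(belief[o] * PAYOFF[(me, o)] for o in ("C", "D"))
                for me in ("C", "D")}
    return max(expected, key=expected.get)
```

Since defection strictly dominates in this game, mutual best responding settles into the (D, D) Nash equilibrium of the stage game, a toy instance of the on-path convergence the paper proves for richer settings with private payoffs.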

[331] A Computationally Efficient Learning of Artificial Intelligence System Reliability Considering Error Propagation

Fenglian Pan, Yinwei Zhang, Yili Hong, Larry Head, Jian Liu

Main category: cs.AI

TL;DR: A reliability modeling framework for AI systems in smart cities that quantifies error propagation across interconnected stages, using autonomous vehicle simulation data and composite likelihood EM algorithm.

Motivation: AI systems in smart cities have reliability concerns due to error propagation across interconnected functional stages, but quantifying this is challenging due to data scarcity, interdependent errors violating statistical assumptions, and computational complexity from high-speed data processing.

Method: Uses physics-based autonomous vehicle simulation with error injection to generate reliability data, develops a framework to characterize error propagation across stages, and employs composite likelihood expectation-maximization algorithm for parameter estimation.

Result: Applied to autonomous vehicle perception systems, demonstrating predictive accuracy and computational efficiency in reliability modeling.

Conclusion: The proposed framework effectively addresses challenges in AI system reliability analysis by providing a method to quantify error propagation using simulation data and efficient statistical inference.

Abstract: Artificial Intelligence (AI) systems are increasingly prominent in emerging smart cities, yet their reliability remains a critical concern. These systems typically operate through a sequence of interconnected functional stages, where upstream errors may propagate to downstream stages, ultimately affecting overall system reliability. Quantifying such error propagation is essential for accurate modeling of AI system reliability. However, this task is challenging due to: i) data availability: real-world AI system reliability data are often scarce and constrained by privacy concerns; ii) model validity: recurring error events across sequential stages are interdependent, violating the independence assumptions of statistical inference; and iii) computational complexity: AI systems process large volumes of high-speed data, resulting in frequent and complex recurrent error events that are difficult to track and analyze. To address these challenges, this paper leverages a physics-based autonomous vehicle simulation platform with a justifiable error injector to generate high-quality data for AI system reliability analysis. Building on this data, a new reliability modeling framework is developed to explicitly characterize error propagation across stages. Model parameters are estimated using a computationally efficient, theoretically guaranteed composite likelihood expectation-maximization algorithm. Its application to the reliability modeling for autonomous vehicle perception systems demonstrates its predictive accuracy and computational efficiency.

[332] I Can’t Believe It’s Corrupt: Evaluating Corruption in Multi-Agent Governance Systems

Vedanta S P, Ponnurangam Kumaraguru

Main category: cs.AI

TL;DR: LLM agents in governmental roles show governance structure matters more than model choice for corruption outcomes; institutional design is crucial before deployment.

Motivation: As LLMs are proposed for high-stakes public workflows, there's a need to systematically evaluate whether they would follow institutional rules when granted authority, treating integrity as a pre-deployment requirement rather than a post-deployment assumption.

Method: Multi-agent governance simulations where agents occupy formal governmental roles under different authority structures, with rule-breaking and abuse outcomes scored using an independent rubric-based judge across 28,112 transcript segments.

Result: Governance structure is a stronger driver of corruption-related outcomes than model identity among models below saturation, with large differences across regimes and model-governance pairings. Lightweight safeguards reduce risk in some settings but don’t consistently prevent severe failures.

Conclusion: Institutional design is a precondition for safe delegation: before assigning real authority to LLM agents, systems should undergo stress testing under governance-like constraints with enforceable rules, auditable logs, and human oversight on high-impact actions.

Abstract: Large language models are increasingly proposed as autonomous agents for high-stakes public workflows, yet we lack systematic evidence about whether they would follow institutional rules when granted authority. We present evidence that integrity in institutional AI should be treated as a pre-deployment requirement rather than a post-deployment assumption. We evaluate multi-agent governance simulations in which agents occupy formal governmental roles under different authority structures, and we score rule-breaking and abuse outcomes with an independent rubric-based judge across 28,112 transcript segments. While we advance this position, the core contribution is empirical: among models operating below saturation, governance structure is a stronger driver of corruption-related outcomes than model identity, with large differences across regimes and model–governance pairings. Lightweight safeguards can reduce risk in some settings but do not consistently prevent severe failures. These results imply that institutional design is a precondition for safe delegation: before real authority is assigned to LLM agents, systems should undergo stress testing under governance-like constraints with enforceable rules, auditable logs, and human oversight on high-impact actions.

[333] Retrieval-Augmented LLM Agents: Learning to Learn from Experience

Thomas Palmeira Ferraz, Romain Deffayet, Vassilina Nikoulina, Hervé Déjean, Stéphane Clinchant

Main category: cs.AI

TL;DR: Combines fine-tuning and experience retrieval to improve LLM agent generalization to unseen tasks through systematic study of retrieval-augmented training pipelines.

DetailsMotivation: Current approaches for LLM agents have limitations: fine-tuning often fails to generalize to new tasks, while experience retrieval underperforms compared to supervised baselines. The paper aims to combine these approaches for better generalization.

Method: 1) Develops robust supervised fine-tuning recipe using LoRA; 2) Analyzes key design choices for experience retrieval (storage, querying, trajectory selection); 3) Proposes pipeline integrating experience retrieval into fine-tuning process.
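The retrieval half of the pipeline can be sketched with a toy experience store (the embedder, store contents, and similarity choice here are assumptions for illustration, not the paper's design):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy experience store: (task embedding, trajectory) pairs.
store = [(rng.normal(size=16), f"trajectory-{i}") for i in range(100)]

def embed(task_id):
    # Stand-in embedder: the task's stored vector plus a little noise,
    # mimicking a near-duplicate query.
    return store[task_id][0] + 0.05 * rng.normal(size=16)

def retrieve(query, k=3):
    # Cosine similarity over the store; return the k most similar trajectories.
    sims = [(float(v @ query) / (np.linalg.norm(v) * np.linalg.norm(query)), t)
            for v, t in store]
    return [t for _, t in sorted(sims, key=lambda p: -p[0])[:k]]

# Retrieved trajectories are prepended to the agent's context before acting.
demos = retrieve(embed(7))
prompt = "\n".join(demos) + "\nNew task: ..."
print(demos[0])  # the task's own past trajectory ranks first
```

Training the agent on contexts assembled this way is what "integrating experience retrieval into fine-tuning" amounts to at the data level.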

Result: The combined approach significantly improves generalization to unseen tasks, outperforming state-of-the-art agent training pipelines and providing a scalable framework.

Conclusion: Integrating experience retrieval with fine-tuning creates effective agents that learn to learn from experience, offering a scalable solution for robust generalization.

Abstract: While large language models (LLMs) have advanced the development of general-purpose agents, achieving robust generalization to unseen tasks remains a significant challenge. Current approaches typically rely on either fine-tuning or training-free memory-augmented generation using retrieved experience; yet both have limitations: fine-tuning often fails to extrapolate to new tasks, while experience retrieval often underperforms compared to supervised baselines. In this work, we propose to combine these approaches and systematically study how to train retrieval-augmented LLM agents to effectively leverage retrieved trajectories in-context. First, we establish a robust supervised fine-tuning (SFT) recipe using LoRA that outperforms several state-of-the-art agent training pipelines. Second, we provide a detailed analysis of key design choices for experience retrieval, identifying optimal strategies for storage, querying, and trajectory selection. Finally, we propose a pipeline that integrates experience retrieval into the fine-tuning process. Our results demonstrate that this combined approach significantly improves generalization to unseen tasks, providing a scalable and effective framework for building agents that learn to learn from experience.

[334] EDM-ARS: A Domain-Specific Multi-Agent System for Automated Educational Data Mining Research

Chenguang Pan, Zhou Zhang, Weixuan Xiao, Chengyuan Yao

Main category: cs.AI

TL;DR: EDM-ARS is a multi-agent AI system that automates educational data mining research from problem formulation to manuscript generation using LLM-powered agents.

DetailsMotivation: To automate the educational data mining research process by embedding domain expertise into an end-to-end automated pipeline that reduces manual effort and accelerates research.

Method: A multi-agent pipeline with five specialized LLM-powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, Writer) orchestrated by a state-machine coordinator with revision loops, checkpoint recovery, and sandboxed code execution.
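A skeletal version of such a coordinator (stage names follow the paper; the control flow and toy agents are illustrative assumptions):

```python
from enum import Enum, auto

class Stage(Enum):
    FORMULATE = auto()
    ENGINEER = auto()
    ANALYZE = auto()
    CRITIQUE = auto()
    WRITE = auto()
    DONE = auto()

def run_pipeline(agents, max_revisions=2):
    """Walk the agents through a linear state machine; the Critic may
    bounce the run back to ANALYZE (a revision loop) a bounded number of times."""
    forward = {
        Stage.FORMULATE: Stage.ENGINEER,
        Stage.ENGINEER: Stage.ANALYZE,
        Stage.ANALYZE: Stage.CRITIQUE,
        Stage.WRITE: Stage.DONE,
    }
    stage, revisions, artifact, trace = Stage.FORMULATE, 0, {}, []
    while stage is not Stage.DONE:
        trace.append(stage)
        verdict = agents[stage](artifact)
        if stage is Stage.CRITIQUE:
            if not verdict and revisions < max_revisions:
                revisions += 1
                stage = Stage.ANALYZE      # revision loop
            else:
                stage = Stage.WRITE
        else:
            stage = forward[stage]
    return artifact, trace

# Toy agents: each stage appends its name; the Critic rejects the first draft.
def make_worker(name):
    return lambda artifact: artifact.setdefault("log", []).append(name)

def critic(artifact):
    artifact.setdefault("reviews", []).append("review")
    return len(artifact["reviews"]) > 1      # reject once, then accept

agents = {
    Stage.FORMULATE: make_worker("formulate"),
    Stage.ENGINEER: make_worker("engineer"),
    Stage.ANALYZE: make_worker("analyze"),
    Stage.CRITIQUE: critic,
    Stage.WRITE: make_worker("write"),
}
artifact, trace = run_pipeline(agents)
print([s.name for s in trace])
```

The real system layers checkpoint recovery and sandboxed code execution on top of this kind of loop.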

Result: Produces complete LaTeX manuscripts with real citations, validated ML analyses, and automated peer review from research prompts and datasets.

Conclusion: EDM-ARS demonstrates a functional automated research system for educational data mining with current limitations but provides a roadmap for expansion to causal inference, transfer learning, and multi-dataset generalization.

Abstract: In this technical report, we present the Educational Data Mining Automated Research System (EDM-ARS), a domain-specific multi-agent pipeline that automates end-to-end educational data mining (EDM) research. We conceptualize EDM-ARS as a general framework for domain-aware automated research pipelines, where educational expertise is embedded into each stage of the research lifecycle. As a first instantiation of this framework, we focus on predictive modeling tasks. Within this scope, EDM-ARS orchestrates five specialized LLM-powered agents (ProblemFormulator, DataEngineer, Analyst, Critic, and Writer) through a state-machine coordinator that supports revision loops, checkpoint-based recovery, and sandboxed code execution. Given a research prompt and a dataset, EDM-ARS produces a complete LaTeX manuscript with real Semantic Scholar citations, validated machine learning analyses, and automated methodological peer review. We also provide a detailed description of the system architecture, the three-tier data registry design that encodes educational domain expertise, the specification of each agent, the inter-agent communication protocol, and mechanisms for error-handling and self-correction. Finally, we discuss current limitations, including single-dataset scope and formulaic paper output, and outline a phased roadmap toward causal inference, transfer learning, psychometric modeling, and multi-dataset generalization. EDM-ARS is released as an open-source project to support the educational research community.

[335] CORE: Robust Out-of-Distribution Detection via Confidence and Orthogonal Residual Scoring

Jin Mo Yang, Hyung-Sin Kim, Saewoong Bahk

Main category: cs.AI

TL;DR: CORE is a novel OOD detection method that disentangles confidence and membership signals by analyzing orthogonal subspaces in penultimate features, achieving robust performance across diverse architectures and datasets.

DetailsMotivation: Current OOD detection methods are inconsistent across architectures and datasets because logit-based methods only see classifier confidence while feature-based methods work in full feature space where confidence and membership signals are entangled, leading to architecture-sensitive failure modes.

Method: CORE decomposes penultimate features into two orthogonal subspaces: classifier-aligned component (confidence signal) and residual component (membership signal). It scores each subspace independently and combines them via normalized summation, leveraging their orthogonality to achieve robust detection.
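The decomposition at the heart of CORE can be illustrated with random weights (the per-subspace scorers and normalization below are simplified placeholders, not the paper's exact scoring functions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, c, n = 64, 10, 500

W = rng.normal(size=(c, d))            # stand-in linear classifier weights
feats_id = rng.normal(size=(n, d))     # stand-in in-distribution features
x = rng.normal(size=d)                 # one test feature to score

# Orthonormal basis of the classifier-aligned subspace (row space of W).
Q, _ = np.linalg.qr(W.T)

def decompose(f):
    f_conf = Q @ (Q.T @ f)             # classifier-aligned (confidence) part
    return f_conf, f - f_conf          # plus the orthogonal residual

def raw_scores(f):
    # Placeholder per-subspace scorers: just the component norms.
    f_conf, f_res = decompose(f)
    return np.linalg.norm(f_conf), np.linalg.norm(f_res)

# Normalize each score against in-distribution statistics, then sum.
id_scores = np.array([raw_scores(f) for f in feats_id])
mu, sigma = id_scores.mean(axis=0), id_scores.std(axis=0)
core_score = float(((np.array(raw_scores(x)) - mu) / sigma).sum())
```

Because the two components are orthogonal by construction, their failure modes are approximately independent, which is the basis of the claimed robustness.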

Result: CORE achieves competitive or state-of-the-art performance across five architectures and five benchmark configurations, ranking first in three of five settings and achieving the highest grand average AUROC with negligible computational overhead.

Conclusion: By disentangling confidence and membership signals in orthogonal subspaces, CORE provides a robust OOD detection framework that overcomes the limitations of existing methods and works consistently across diverse architectures and datasets.

Abstract: Out-of-distribution (OOD) detection is essential for deploying deep learning models reliably, yet no single method performs consistently across architectures and datasets – a scorer that leads on one benchmark often falters on another. We attribute this inconsistency to a shared structural limitation: logit-based methods see only the classifier’s confidence signal, while feature-based methods attempt to measure membership in the training distribution but do so in the full feature space where confidence and membership are entangled, inheriting architecture-sensitive failure modes. We observe that penultimate features naturally decompose into two orthogonal subspaces: a classifier-aligned component encoding confidence, and a residual the classifier discards. We discover that this residual carries a class-specific directional signature for in-distribution data – a membership signal invisible to logit-based methods and entangled with noise in feature-based methods. We propose CORE (COnfidence + REsidual), which disentangles the two signals by scoring each subspace independently and combines them via normalized summation. Because the two signals are orthogonal by construction, their failure modes are approximately independent, producing robust detection where either view alone is unreliable. CORE achieves competitive or state-of-the-art performance across five architectures and five benchmark configurations, ranking first in three of five settings and achieving the highest grand average AUROC with negligible computational overhead.

[336] The Validity Gap in Health AI Evaluation: A Cross-Sectional Analysis of Benchmark Composition

Alvin Rajkomar, Pavan Sudarshan, Angela Lai, Lily Peng

Main category: cs.AI

TL;DR: Analysis reveals health LLM benchmarks have a “validity gap”: they lack real-world clinical complexity, safety-critical scenarios, and representation of vulnerable populations, despite covering 18,707 consumer health queries.

DetailsMotivation: To identify structural gaps in health-related LLM benchmarks by analyzing their composition and alignment with real-world clinical needs, similar to how clinical trials require transparent inclusion criteria.

Method: Analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent.

Result: Found a “validity gap”: benchmarks lack complex clinical inputs (lab values 5.2%, imaging 3.8%, medical records 0.6%), safety-critical scenarios (suicide/self-harm <0.7%, chronic disease 5.5%), and vulnerable populations (pediatrics/older adults <11%). Wellness-focused wearable signals dominate (17.7%).

Conclusion: Health LLM benchmarks are misaligned with clinical reality and need standardized query profiling analogous to clinical trial reporting to properly evaluate models for real-world clinical use.

Abstract: Background: Clinical trials rely on transparent inclusion criteria to ensure generalizability. In contrast, benchmarks validating health-related large language models (LLMs) rarely characterize the “patient” or “query” populations they contain. Without defined composition, aggregate performance metrics may misrepresent model readiness for clinical use. Methods: We analyzed 18,707 consumer health queries across six public benchmarks using LLMs as automated coding instruments to apply a standardized 16-field taxonomy profiling context, topic, and intent. Results: We identified a structural “validity gap.” While benchmarks have evolved from static retrieval to interactive dialogue, clinical composition remains misaligned with real-world needs. Although 42% of the corpus referenced objective data, this was polarized toward wellness-focused wearable signals (17.7%); complex diagnostic inputs remained rare, including laboratory values (5.2%), imaging (3.8%), and raw medical records (0.6%). Safety-critical scenarios were effectively absent: suicide/self-harm queries comprised <0.7% of the corpus and chronic disease management only 5.5%. Benchmarks also neglected vulnerable populations (pediatrics/older adults <11%) and global health needs. Conclusions: Evaluation benchmarks remain misaligned with real-world clinical needs, lacking raw clinical artifacts, adequate representation of vulnerable populations, and longitudinal chronic care scenarios. The field must adopt standardized query profiling – analogous to clinical trial reporting – to align evaluation with the full complexity of clinical practice.

[337] Consumer-to-Clinical Language Shifts in Ambient AI Draft Notes and Clinician-Finalized Documentation: A Multi-level Analysis

Ha Na Cho, Yawen Guo, Sairam Sutari, Emilie Chow, Steven Tam, Danielle Perret, Deepti Pandita, Kai Zheng

Main category: cs.AI

TL;DR: Clinicians consistently edit AI-generated clinical notes to replace consumer-oriented phrasing with standardized clinical terminology, with editing patterns varying by note section and clinician.

DetailsMotivation: To understand how clinicians revise AI-generated draft clinical notes that use lay/consumer phrasing into professional documentation with standardized clinical terminology.

Method: Analyzed 71,173 AI-draft and finalized-note section pairs from 34,726 encounters using a dictionary-confirmed transformation framework to quantify consumer-to-clinical normalization.
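A minimal version of the dictionary-confirmed transformation check (the term dictionary and example notes below are invented, not the study's):

```python
# A "confirmed transformation" is counted when a consumer term in the AI
# draft is replaced by its dictionary-mapped clinical equivalent in the
# finalized version of the same section.
consumer_to_clinical = {
    "high blood pressure": "hypertension",
    "heart attack": "myocardial infarction",
    "belly pain": "abdominal pain",
}

def confirmed_transformations(draft, final):
    events = []
    d, f = draft.lower(), final.lower()
    for consumer, clinical in consumer_to_clinical.items():
        if consumer in d and consumer not in f and clinical in f:
            events.append((consumer, clinical))
    return events

draft = "Patient reports high blood pressure and belly pain."
final = "Patient reports hypertension and abdominal pain."
print(confirmed_transformations(draft, final))
# -> [('high blood pressure', 'hypertension'), ('belly pain', 'abdominal pain')]
```

Aggregating such events per note section is what allows the per-section and per-clinician comparisons reported above.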

Result: Editing significantly reduced terminology density across all sections; Assessment and Plan accounted for 59.3% of transformations; 7,576 transformation events across 4,114 note sections (5.8%); transformation intensity varied significantly across clinicians.

Conclusion: Clinician post-editing demonstrates consistent shifts from conversational phrasing toward standardized clinical terminology, supporting section-aware ambient AI design.

Abstract: Ambient AI generates draft clinical notes from patient-clinician conversations, often using lay or consumer-oriented phrasing to support patient understanding instead of standardized clinical terminology. How clinicians revise these drafts for professional documentation conventions remains unclear. We quantified clinician editing for consumer-to-clinical normalization using a dictionary-confirmed transformation framework. We analyzed 71,173 AI-draft and finalized-note section pairs from 34,726 encounters. Confirmed transformations were defined as replacing a consumer expression with its dictionary-mapped clinical equivalent in the same section. Editing significantly reduced terminology density across all sections (p < 0.001). The Assessment and Plan accounted for the largest transformation volume (59.3%). Our analysis identified 7,576 transformation events across 4,114 note sections (5.8%), representing 1.2% consumer-term deletions. Transformation intensity varied across individual clinicians (p < 0.001). Overall, clinician post-editing demonstrates consistent shifts from conversational phrasing toward standardized, section-appropriate clinical terminology, supporting section-aware ambient AI design.

[338] FaithSteer-BENCH: A Deployment-Aligned Stress-Testing Benchmark for Inference-Time Steering

Zikang Ding, Qiying Hu, Yi Zhang, Hongji Li, Junchi Yao, Hongbo Liu, Lijie Hu

Main category: cs.AI

TL;DR: FaithSteer-BENCH is a stress-testing benchmark for evaluating inference-time steering methods in LLMs, revealing systematic failure modes obscured by standard evaluations.

DetailsMotivation: Current evaluations of inference-time steering methods overlook deployment constraints, capability trade-offs, and real-world robustness, leading to overly optimistic assessments of their reliability.

Method: Introduces FaithSteer-BENCH benchmark with three gate-wise criteria: controllability, utility preservation, and robustness. Evaluates multiple models and steering approaches under deployment-style operating points with various stress tests.
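The gate-wise pass/fail logic can be sketched directly (the thresholds below are invented; the benchmark's actual operating points are not given here):

```python
def gatewise_pass(metrics, gates):
    """A steering method clears the benchmark only if it passes every gate
    at the fixed deployment-style operating point."""
    return all(metrics[name] >= threshold for name, threshold in gates.items())

gates = {"controllability": 0.80, "utility_preservation": 0.95, "robustness": 0.70}

steered = {"controllability": 0.91, "utility_preservation": 0.97, "robustness": 0.52}
print(gatewise_pass(steered, gates))  # False: strong controllability cannot
                                      # compensate for failing the robustness gate
```

The gate structure is what surfaces "illusory controllability": a method that looks controllable in isolation still fails once utility and robustness must be held simultaneously.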

Result: Reveals systematic failure modes including illusory controllability, cognitive tax on unrelated capabilities, and brittleness under mild perturbations. Shows existing methods don’t provide reliable controllability in practical settings.

Conclusion: FaithSteer-BENCH provides a unified benchmark for more realistic evaluation of steering methods, revealing their limitations and prompting need for more robust method design.

Abstract: Inference-time steering is widely regarded as a lightweight and parameter-free mechanism for controlling large language model (LLM) behavior, and prior work has often suggested that simple activation-level interventions can reliably induce targeted behavioral changes. However, such conclusions are typically drawn under relatively relaxed evaluation settings that overlook deployment constraints, capability trade-offs, and real-world robustness. We therefore introduce FaithSteer-BENCH, a stress-testing benchmark that evaluates steering methods at a fixed deployment-style operating point through three gate-wise criteria: controllability, utility preservation, and robustness. Across multiple models and representative steering approaches, we uncover several systematic failure modes that are largely obscured under standard evaluation, including illusory controllability, measurable cognitive tax on unrelated capabilities, and substantial brittleness under mild instruction-level perturbations, role prompts, encoding transformations, and data scarcity. Gate-wise benchmark results show that existing methods do not necessarily provide reliable controllability in deployment-oriented practical settings. In addition, mechanism-level diagnostics indicate that many steering methods induce prompt-conditional alignment rather than stable latent directional shifts, further explaining their fragility under stress. FaithSteer-BENCH therefore provides a unified benchmark and a clearer analytical lens for future method design, reliability evaluation, and deployment-oriented research in steering.

[339] Understanding the Theoretical Foundations of Deep Neural Networks through Differential Equations

Hongjue Zhao, Yizhuo Chen, Yuchen Wang, Hairong Qi, Lui Sha, Tarek Abdelzaher, Huajie Shao

Main category: cs.AI

TL;DR: Survey paper presenting differential equations as theoretical foundation for understanding, analyzing, and improving deep neural networks from both model-level and layer-level perspectives.

DetailsMotivation: Deep neural networks lack principled theoretical foundation despite empirical success, hindering systematic development. Differential equations offer mathematical framework for principled understanding and improvement.

Method: Organizes discussion around three guiding questions: 1) how differential equations provide principled understanding of DNN architectures, 2) how differential equation tools improve DNN performance, 3) real-world applications. Uses two-fold perspective: model-level (whole DNN as differential equation) and layer-level (individual components as differential equations).
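The model-level perspective can be made concrete with the textbook correspondence between residual networks and forward-Euler integration (a toy numpy sketch, not code from the survey):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = 0.1 * rng.normal(size=(d, d))

def f(x):
    # Toy residual branch; in a real ResNet this is the learned block.
    return np.tanh(W @ x)

def resnet_forward(x0, depth, h):
    """Model-level view: x_{k+1} = x_k + h * f(x_k) is exactly the
    forward-Euler discretization of the ODE dx/dt = f(x)."""
    x = x0
    for _ in range(depth):
        x = x + h * f(x)
    return x

# Doubling depth while halving the step size tracks the same ODE flow to t = 1.
x0 = rng.normal(size=d)
coarse = resnet_forward(x0, depth=10, h=0.10)
fine = resnet_forward(x0, depth=20, h=0.05)
print(np.linalg.norm(coarse - fine))  # small: both approximate the same flow
```

This is the sense in which "the whole DNN is a differential equation": depth plays the role of integration time, and numerical-analysis tools transfer to architecture design.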

Result: Survey reviews how differential equation framework connects model design, theoretical analysis, and performance improvement. Discusses real-world applications, challenges, and future research opportunities.

Conclusion: Differential equations provide valuable theoretical foundation for DNNs, enabling principled understanding, analysis, and improvement across architecture design, theoretical analysis, and practical applications.

Abstract: Deep neural networks (DNNs) have achieved remarkable empirical success, yet the absence of a principled theoretical foundation continues to hinder their systematic development. In this survey, we present differential equations as a theoretical foundation for understanding, analyzing, and improving DNNs. We organize the discussion around three guiding questions: i) how differential equations offer a principled understanding of DNN architectures, ii) how tools from differential equations can be used to improve DNN performance in a principled way, and iii) what real-world applications benefit from grounding DNNs in differential equations. We adopt a two-fold perspective spanning the model level, which interprets the whole DNN as a differential equation, and the layer level, which models individual DNN components as differential equations. From these two perspectives, we review how this framework connects model design, theoretical analysis, and performance improvement. We further discuss real-world applications, as well as key challenges and opportunities for future research.

[340] Large-Scale Analysis of Political Propaganda on Moltbook

Julia Jose, Meghna Manoj Nair, Rachel Greenstadt

Main category: cs.AI

TL;DR: LLM-based analysis of political propaganda on AI agent platform Moltbook reveals 1% of posts are propaganda, concentrated in few communities and produced by small number of agents, with limited amplification in comments.

DetailsMotivation: To understand the prevalence and patterns of political propaganda on emerging AI agent platforms like Moltbook, which could have significant implications for AI-human interaction and information ecosystems.

Method: Developed LLM-based classifiers to detect political propaganda, validated against expert annotation (Cohen’s κ=0.64-0.74). Analyzed large dataset of 673,127 posts and 879,606 comments from Moltbook platform.
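For reference, the chance-corrected agreement statistic used to validate the classifiers can be computed as follows (the toy labels are invented):

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two binary raters (labels 0/1)."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)                # chance agreement
    return (po - pe) / (1 - pe)

# Toy labels: an LLM coder vs. an expert annotator on ten items.
llm    = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
expert = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]
print(round(cohens_kappa(llm, expert), 2))  # -> 0.58
```

Values in the paper's reported 0.64-0.74 range indicate substantial agreement beyond chance between the LLM classifiers and expert annotation.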

Result: Political propaganda accounts for 1% of all posts and 42% of all political content. It is concentrated in a few communities (70% falls in just five) and among few agents (4% of agents produce 51% of propaganda posts), with limited evidence of comment amplification.

Conclusion: Political propaganda exists on AI agent platforms but is concentrated among few agents and communities, with limited amplification effects, suggesting targeted rather than widespread influence.

Abstract: We present an NLP-based study of political propaganda on Moltbook, a Reddit-style platform for AI agents. To enable large-scale analysis, we develop LLM-based classifiers to detect political propaganda, validated against expert annotation (Cohen’s κ = 0.64–0.74). Using a dataset of 673,127 posts and 879,606 comments, we find that political propaganda accounts for 1% of all posts and 42% of all political content. These posts are concentrated in a small set of communities, with 70% of such posts falling into five of them. 4% of agents produced 51% of these posts. We further find that a minority of these agents repeatedly post highly similar content within and across communities. Despite this, we find limited evidence that comments amplify political propaganda.

[341] Interpretability without actionability: mechanistic methods cannot correct language model errors despite near-perfect internal representations

Sanjay Basu, Sadiq Y. Patel, Parth Sheth, Bhairavi Muralidharan, Namrata Elamaran, Aakriti Kinra, John Morgan, Rajaie Batniji

Main category: cs.AI

TL;DR: Mechanistic interpretability methods fail to reliably bridge the knowledge-action gap in language models, despite models encoding task-relevant knowledge internally that far exceeds their output performance.

DetailsMotivation: To systematically test whether mechanistic interpretability methods can bridge the knowledge-action gap in language models, where models encode task-relevant knowledge in internal representations that exceeds their output performance.

Method: Compared four mechanistic interpretability methods: concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) on 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign).
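The reported knowledge-action gap rests on a linear probe of internal representations; a minimal probe on synthetic features (a class-mean-difference direction plus a rank-based AUROC, not the paper's probe or data) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400

# Synthetic "hidden states": hazardous and benign cases are well separated
# internally, mimicking near-perfect internal representations.
mu = rng.normal(size=d)
X_haz = rng.normal(size=(n, d)) + mu
X_ben = rng.normal(size=(n, d)) - mu

# Minimal linear probe: project onto the class-mean difference direction.
w = X_haz.mean(axis=0) - X_ben.mean(axis=0)
s = np.concatenate([X_haz @ w, X_ben @ w])
y = np.concatenate([np.ones(n), np.zeros(n)])

# AUROC = P(hazard score > benign score), via the rank-sum identity.
order = np.argsort(s)
ranks = np.empty(2 * n)
ranks[order] = np.arange(1, 2 * n + 1)
auroc = (ranks[y == 1].sum() - n * (n + 1) / 2) / (n * n)
print(f"probe AUROC = {auroc:.3f}")  # near 1.0 despite no trained classifier
```

The paper's finding is that such a probe reads the label out of the features with 98.2% AUROC while the model's own outputs catch only 45.1% of hazards, and no steering method tested closes that gap reliably.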

Result: Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet model’s output sensitivity was only 45.1% (53-percentage-point gap). Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections (indistinguishable from random). SAE feature steering had zero effect despite 3,695 significant features. TSV steering corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected.

Conclusion: Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, challenging AI safety frameworks that assume interpretability enables effective error correction.

Abstract: Language models encode task-relevant knowledge in internal representations that far exceeds their output performance, but whether mechanistic interpretability methods can bridge this knowledge-action gap has not been systematically tested. We compared four mechanistic interpretability methods – concept bottleneck steering (Steerling-8B), sparse autoencoder feature steering, logit lens with activation patching, and linear probing with truthfulness separator vector steering (Qwen 2.5 7B Instruct) – for correcting false-negative triage errors using 400 physician-adjudicated clinical vignettes (144 hazards, 256 benign). Linear probes discriminated hazardous from benign cases with 98.2% AUROC, yet the model’s output sensitivity was only 45.1%, a 53-percentage-point knowledge-action gap. Concept bottleneck steering corrected 20% of missed hazards but disrupted 53% of correct detections, indistinguishable from random perturbation (p=0.84). SAE feature steering produced zero effect despite 3,695 significant features. TSV steering at high strength corrected 24% of missed hazards while disrupting 6% of correct detections, but left 76% of errors uncorrected. Current mechanistic interpretability methods cannot reliably translate internal knowledge into corrected outputs, with implications for AI safety frameworks that assume interpretability enables effective error correction.

[342] LGESynthNet: Controlled Scar Synthesis for Improved Scar Segmentation in Cardiac LGE-MRI Imaging

Athira J. Jacob, Puneet Sharma, Daniel Rueckert

Main category: cs.AI

TL;DR: LGESynthNet is a latent diffusion framework for controllable synthesis of cardiac MRI enhancement lesions, enabling explicit control over size, location, and transmural extent to address limited annotated data for segmentation tasks.

DetailsMotivation: Pixel-level annotation for LGE cardiac MRI is challenging and labor-intensive, leading to limited annotated data for training segmentation models. While generative models like diffusion models offer promise for synthetic data generation, they often require large datasets and struggle with fine-grained control over small/localized features like cardiac lesions.

Method: LGESynthNet uses a latent diffusion-based framework with ControlNet architecture formulated as inpainting. It integrates: (1) a reward model for conditioning-specific supervision, (2) a captioning module for anatomically descriptive text prompts, and (3) a biomedical text encoder. The model is trained on only 429 images from 79 patients.

Result: The model produces realistic, anatomically coherent synthetic samples. A quality control filter selects outputs with high conditioning-fidelity. When used for training augmentation, these synthetic samples improve downstream segmentation performance by up to 6 points and detection performance by up to 20 points.

Conclusion: LGESynthNet demonstrates that controllable enhancement synthesis with limited data is feasible and can significantly boost medical image segmentation and detection performance through targeted data augmentation, addressing the annotation bottleneck in medical imaging.

Abstract: Segmentation of enhancement in LGE cardiac MRI is critical for diagnosing various ischemic and non-ischemic cardiomyopathies. However, creating pixel-level annotations for these images is challenging and labor-intensive, leading to limited availability of annotated data. Generative models, particularly diffusion models, offer promise for synthetic data generation, yet many rely on large training datasets and often struggle with fine-grained conditioning control, especially for small or localized features. We introduce LGESynthNet, a latent diffusion-based framework for controllable enhancement synthesis, enabling explicit control over size, location, and transmural extent. Formulated as inpainting using a ControlNet-based architecture, the model integrates: (a) a reward model for conditioning-specific supervision, (b) a captioning module for anatomically descriptive text prompts, and (c) a biomedical text encoder. Trained on just 429 images (79 patients), it produces realistic, anatomically coherent samples. A quality control filter selects outputs with high conditioning-fidelity, which when used for training augmentation, improve downstream segmentation and detection performance by up to 6 and 20 points, respectively.

[343] From Weak Cues to Real Identities: Evaluating Inference-Driven De-Anonymization in LLM Agents

Myeongseob Ko, Jihyun Jeong, Sumiran Singh Thakur, Gyuhak Kim, Ruoxi Jia

Main category: cs.AI

TL;DR: LLM-based agents can autonomously reconstruct real-world identities from scattered, non-identifying cues through inference-driven linkage, posing a new privacy threat that outperforms classical methods.

DetailsMotivation: Traditional anonymization relies on the difficulty of re-identification requiring domain expertise and tailored algorithms. The paper investigates whether LLM-based agents can autonomously reconstruct identities from sparse cues, potentially weakening privacy barriers.

Method: The authors formalize the threat as “inference-driven linkage” and systematically evaluate it across three settings: classical linkage scenarios (Netflix and AOL), a controlled benchmark called InferLink (varying task intent, shared cues, and attacker knowledge), and modern text-rich artifacts. They test agents without task-specific heuristics on both fixed-pool matching and open-ended identity resolution.

Result: LLM-based agents successfully execute identity reconstruction without bespoke engineering. In the Netflix Prize setting, an agent reconstructs 79.2% of identities, significantly outperforming the 56.0% classical baseline. Linkage emerges not only under explicit adversarial prompts but also as a byproduct of benign cross-source analysis and unstructured research narratives.

Conclusion: Identity inference must be treated as a first-class privacy risk, not merely explicit information disclosure. Privacy evaluations must measure what identities an agent can infer, as LLM-based agents can autonomously reconstruct identities from scattered cues.

Abstract: Anonymization is widely treated as a practical safeguard because re-identifying anonymous records was historically costly, requiring domain expertise, tailored algorithms, and manual corroboration. We study a growing privacy risk that may weaken this barrier: LLM-based agents can autonomously reconstruct real-world identities from scattered, individually non-identifying cues. By combining these sparse cues with public information, agents resolve identities without bespoke engineering. We formalize this threat as “inference-driven linkage” and systematically evaluate it across three settings: classical linkage scenarios (Netflix and AOL), InferLink (a controlled benchmark varying task intent, shared cues, and attacker knowledge), and modern text-rich artifacts. Without task-specific heuristics, agents successfully execute both fixed-pool matching and open-ended identity resolution. In the Netflix Prize setting, an agent reconstructs 79.2% of identities, significantly outperforming a 56.0% classical baseline. Furthermore, linkage emerges not only under explicit adversarial prompts but also as a byproduct of benign cross-source analysis in InferLink and unstructured research narratives. These findings establish that identity inference – not merely explicit information disclosure – must be treated as a first-class privacy risk; evaluations must measure what identities an agent can infer.

[344] Verifiable Semantics for Agent-to-Agent Communication

Philipp Schoenegger, Matt Carlson, Chris Schneider, Chris Daly

Main category: cs.AI

TL;DR: A certification protocol for multiagent AI systems that verifies shared understanding of terms using statistical testing on observable events, reducing disagreement by 72-96% in simulations.

DetailsMotivation: Multiagent AI systems need consistent communication, but current methods lack verification that agents share the same understanding of terms. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque.

Method: Proposes a certification protocol based on the stimulus-meaning model where agents are tested on shared observable events. Terms are certified if empirical disagreement falls below a statistical threshold. Agents use “core-guarded reasoning” restricting reasoning to certified terms. Includes mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation).
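
The certification step can be sketched as a one-sided statistical test: a term passes if the observed disagreement rate is significantly below a tolerance. This is an illustrative reconstruction, not the paper's exact procedure; the tolerance `epsilon`, significance level `alpha`, and exact binomial test are assumptions.

```python
import math

def certify_term(disagreements: int, trials: int,
                 epsilon: float = 0.1, alpha: float = 0.05) -> bool:
    """Certify a term if we can reject H0: true disagreement rate >= epsilon.

    One-sided exact binomial test: p-value = P(X <= disagreements | p = epsilon).
    """
    p_value = sum(
        math.comb(trials, k) * epsilon**k * (1 - epsilon) ** (trials - k)
        for k in range(disagreements + 1)
    )
    return p_value < alpha
```

Under this sketch, agents would then restrict core-guarded reasoning to the set of terms for which `certify_term` returns True, and re-run the test periodically for recertification.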

Result: In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In validation with fine-tuned language models, disagreement is reduced by 51%.

Conclusion: The framework provides a first step towards verifiable agent-to-agent communication by ensuring shared understanding of terms through statistical certification and guarded reasoning.

Abstract: Multiagent AI systems require consistent communication, but we lack methods to verify that agents share the same understanding of the terms used. Natural language is interpretable but vulnerable to semantic drift, while learned protocols are efficient but opaque. We propose a certification protocol based on the stimulus-meaning model, where agents are tested on shared observable events and terms are certified if empirical disagreement falls below a statistical threshold. In this protocol, agents restricting their reasoning to certified terms (“core-guarded reasoning”) achieve provably bounded disagreement. We also outline mechanisms for detecting drift (recertification) and recovering shared vocabulary (renegotiation). In simulations with varying degrees of semantic divergence, core-guarding reduces disagreement by 72-96%. In a validation with fine-tuned language models, disagreement is reduced by 51%. Our framework provides a first step towards verifiable agent-to-agent communication.

[345] From Topic to Transition Structure: Unsupervised Concept Discovery at Corpus Scale via Predictive Associative Memory

Jason Dury

Main category: cs.AI

TL;DR: A contrastive model learns transition-structure concepts from temporal co-occurrence in texts, creating a multi-resolution concept map of literary functions and registers rather than semantic topics.

DetailsMotivation: To discover recurrent transition-structure concepts (what text does) rather than semantic content (what text is about), using temporal co-occurrence patterns within texts to identify functional and structural patterns in literature.

Method: Train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts, mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together under capacity constraints.
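
The contrastive objective can be illustrated with a standard in-batch InfoNCE loss, where each passage embedding is pulled toward a temporally co-occurring passage and pushed away from the rest of the batch. The temperature value and this particular loss form are assumptions; the paper's exact objective may differ.

```python
import numpy as np

def info_nce(anchors, positives, tau=0.1):
    """In-batch contrastive loss: each anchor's positive is its temporally
    co-occurring passage; the other batch items serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / tau                       # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # true pairs sit on the diagonal
```

Training a projection head under this loss with limited capacity is what forces compression across recurring transition patterns rather than memorization of individual pairs.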

Result: Creates multi-resolution concept maps from broad modes (direct confrontation, lyrical meditation) to precise registers (sailor dialect, courtroom cross-examination). Clusters average 4,508 books each at k=100, showing corpus-wide patterns. Association-space clusters group by function/register while raw embeddings group by topic.

Conclusion: The method extends Predictive Associative Memory from episodic recall to concept formation, extracting structural patterns that transfer to unseen texts and producing qualitatively different behavior from semantic embedding approaches.

Abstract: Embedding models group text by semantic content, what text is about. We show that temporal co-occurrence within texts discovers a different kind of structure: recurrent transition-structure concepts, or what text does. We train a 29.4M-parameter contrastive model on 373 million co-occurrence pairs from 9,766 Project Gutenberg texts (24.96 million passages), mapping pre-trained embeddings into an association space where passages with similar transition structure cluster together. Under capacity constraint (42.75% accuracy), the model must compress across recurring patterns rather than memorise individual co-occurrences. Clustering at six granularities (k=50 to k=2,000) produces a multi-resolution concept map, from broad modes like “direct confrontation” and “lyrical meditation” to precise registers and scene templates like “sailor dialect” and “courtroom cross-examination.” At k=100, clusters average 4,508 books each (of 9,766), confirming corpus-wide patterns. Direct comparison with embedding-similarity clustering shows that raw embeddings group by topic while association-space clusters group by function, register, and literary tradition. Unseen novels are assigned to existing clusters without retraining; the association model concentrates each novel into a selective subset of coherent clusters, while raw embedding assignment saturates nearly all clusters. Validation controls address positional, length, and book-concentration confounds. The method extends Predictive Associative Memory (PAM, arXiv:2602.11322) from episodic recall to concept formation: where PAM recalls specific associations, multi-epoch contrastive training under compression extracts structural patterns that transfer to unseen texts, the same framework producing qualitatively different behaviour in a different regime.

[346] Prune-then-Quantize or Quantize-then-Prune? Understanding the Impact of Compression Order in Joint Model Compression

Minjun Kim, Jaehyeon Choi, Hyunwoo Yang, Jongjin Kim, Jinho Song, U Kang

Main category: cs.AI

TL;DR: The paper investigates how the order of applying multiple compression methods (like pruning and quantization) affects model performance, proposing that weaker perturbations should precede stronger ones.

DetailsMotivation: Joint model compression combining multiple methods is powerful but the compression order (sequence of different methods) is underexplored, with most prior work assuming orthogonality between techniques or examining only constrained cases.

Method: Formulates the problem of optimizing compression order, introduces Progressive Intensity Hypothesis (weaker perturbations should precede stronger ones), provides theoretical guarantees, and conducts extensive experiments on language and vision models.
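
The Progressive Intensity Hypothesis can be illustrated with a toy NumPy sketch that measures each method's perturbation strength in isolation and orders the pipeline weaker-first. The pruning ratio, bit width, and relative-norm intensity metric below are illustrative assumptions, not the paper's definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))

def prune(w, sparsity=0.3):
    # magnitude pruning: zero out the smallest |w| entries
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) >= thresh, w, 0.0)

def quantize(w, bits=8):
    # symmetric uniform quantization
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def intensity(f, w):
    # perturbation strength: relative change introduced by one method alone
    return np.linalg.norm(w - f(w)) / np.linalg.norm(w)

# apply the weaker perturbation first, per the Progressive Intensity Hypothesis
methods = sorted([prune, quantize], key=lambda f: intensity(f, W))
Wc = W
for f in methods:
    Wc = f(Wc)
```

With these settings 8-bit quantization perturbs the weights far less than 30% pruning, so the sketch applies quantization first; on a real model the intensities would be measured per layer and per configuration.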

Result: Experiments validate the Progressive Intensity Hypothesis and show its generality to broader setups including multi-stage compression and mixed-precision quantization.

Conclusion: Compression order significantly impacts model performance, and following the Progressive Intensity Hypothesis (weaker before stronger perturbations) yields better results across various compression scenarios.

Abstract: What happens when multiple compression methods are combined: does the order in which they are applied matter? Joint model compression has emerged as a powerful strategy to achieve higher efficiency by combining multiple methods such as pruning and quantization. A central but underexplored factor in joint model compression is the compression order, or the sequence of different methods within the compression pipeline. Most prior studies have sidestepped the issue by assuming orthogonality between techniques, while a few have examined them only in highly constrained cases. Consequently, the broader role of compression order in shaping model performance remains poorly understood. In this paper, we address the overlooked problem of compression order and provide both theoretical and empirical analysis. We formulate the problem of optimizing the compression order and introduce the Progressive Intensity Hypothesis, which states that weaker perturbations should precede stronger ones. We provide theoretical guarantees showing that the relative benefit of one order increases with the underlying performance gap. Extensive experiments on both language and vision models validate the hypothesis, and further show its generality to broader setups such as multi-stage compression and mixed-precision quantization.

[347] AS2 – Attention-Based Soft Answer Sets: An End-to-End Differentiable Neuro-Soft-Symbolic Reasoning Architecture

Wael AbdAlmageed

Main category: cs.AI

TL;DR: AS2 is a fully differentiable neuro-symbolic architecture that replaces discrete symbolic solvers with a soft approximation of Answer Set Programming, enabling end-to-end training without external solvers.

DetailsMotivation: Traditional neuro-symbolic systems have a non-differentiable boundary between neural perception and symbolic solvers, preventing constraint-satisfaction feedback from reaching perception modules during training.

Method: AS2 uses a soft, continuous approximation of the Answer Set Programming immediate consequence operator, maintains per-position probability distributions over symbols, trains by minimizing fixed-point residual of probabilistic lift, and encodes problem structure through constraint-group membership embeddings instead of positional embeddings.
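
A minimal sketch of what a "soft" immediate-consequence operator can look like: rule bodies combine by product (soft conjunction), alternative rules for the same head combine by probabilistic OR, and training minimizes the residual between a probability vector and its one-step update. These combination functions are common fuzzy-logic choices assumed for illustration; AS2's actual operator and its attention machinery are more involved.

```python
import numpy as np

def soft_tp(x, rules):
    """One step of a soft immediate-consequence operator T_P.

    x: probabilities per atom. rules: {head_index: [list of body index lists]}.
    Each body is soft-conjoined by product; alternative rules for the same
    head are combined with a probabilistic (noisy-) OR.
    """
    out = x.copy()
    for head, bodies in rules.items():
        not_fired = 1.0
        for body in bodies:
            not_fired *= 1.0 - np.prod([x[b] for b in body])
        out[head] = 1.0 - not_fired
    return out

def fixpoint_residual(x, rules):
    """Training signal: how far x is from being a fixed point of T_P."""
    return float(np.sum((soft_tp(x, rules) - x) ** 2))
```

For the one-rule program `c :- a, b` with atoms indexed 0, 1, 2, the vector [1, 1, 1] is a fixed point with zero residual, while [1, 1, 0] incurs residual 1; gradients of this residual can flow back into a perception module, which is the differentiability the paper exploits.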

Result: On Visual Sudoku: 99.89% cell accuracy and 100% constraint satisfaction across 1,000 test boards. On MNIST Addition: above 99.7% digit accuracy across 2, 4, and 8 addends.

Conclusion: A soft differentiable fixpoint operator with constraint-aware attention and declarative constraint specification can match or exceed pipeline and solver-based neuro-symbolic systems while maintaining full end-to-end differentiability.

Abstract: Neuro-symbolic artificial intelligence (AI) systems typically couple a neural perception module to a discrete symbolic solver through a non-differentiable boundary, preventing constraint-satisfaction feedback from reaching the perception encoder during training. We introduce AS2 (Attention-Based Soft Answer Sets), a fully differentiable neuro-symbolic architecture that replaces the discrete solver with a soft, continuous approximation of the Answer Set Programming (ASP) immediate consequence operator $T_P$. AS2 maintains per-position probability distributions over a finite symbol domain throughout the forward pass and trains end-to-end by minimizing the fixed-point residual of a probabilistic lift of $T_P$, thereby differentiating through the constraint check without invoking an external solver at either training or inference time. The architecture is entirely free of conventional positional embeddings. Instead, it encodes problem structure through constraint-group membership embeddings that directly reflect the declarative ASP specification, making the model agnostic to arbitrary position indexing. On Visual Sudoku, AS2 achieves 99.89% cell accuracy and 100% constraint satisfaction (verified by Clingo) across 1,000 test boards, using a greedy constrained decoding procedure that requires no external solver. On MNIST Addition with $N \in \{2, 4, 8\}$ addends, AS2 achieves digit accuracy above 99.7% across all scales. These results demonstrate that a soft differentiable fixpoint operator, combined with constraint-aware attention and declarative constraint specification, can match or exceed pipeline and solver-based neuro-symbolic systems while maintaining full end-to-end differentiability.

[348] AlignMamba-2: Enhancing Multimodal Fusion and Sentiment Analysis with Modality-Aware Mamba

Yan Li, Yifei Xing, Xiangyuan Lan, Xin Li, Haifeng Chen, Dongmei Jiang

Main category: cs.AI

TL;DR: AlignMamba-2: A multimodal fusion framework using Mamba architecture with dual alignment strategy for efficient sentiment analysis across dynamic time-series and static image-text tasks.

DetailsMotivation: Address computational efficiency challenges in adapting large pre-trained models to affective computing tasks, particularly with multimodal heterogeneity and long-sequence data, while overcoming limitations of Transformers (quadratic complexity) and Mamba models (struggle with global non-sequential relationships).

Method: Proposes AlignMamba-2 with dual alignment strategy using Optimal Transport distance and Maximum Mean Discrepancy for geometric/statistical consistency between modalities, and Modality-Aware Mamba layer with Mixture-of-Experts architecture (modality-specific and modality-shared experts) to handle data heterogeneity during fusion.
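
One of the two alignment regularizers, Maximum Mean Discrepancy, can be sketched directly: a training loss would add a term like `rbf_mmd2(text_emb, audio_emb)` per batch to pull the modality distributions together. The RBF kernel and its bandwidth `gamma` are assumptions here; the summary does not specify the paper's kernel.

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared Maximum Mean Discrepancy between two (n, d) sample sets,
    using an RBF kernel exp(-gamma * ||a - b||^2)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean())
```

The value is (near) zero when both modalities' embeddings follow the same distribution and grows as they diverge, so minimizing it promotes the statistical consistency the method targets without adding inference-time cost.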

Result: Extensive experiments on four benchmarks (CMU-MOSI, CMU-MOSEI, NYU-Depth V2, MVSA-Single) demonstrate state-of-the-art performance in both effectiveness and efficiency across dynamic time-series analysis and static image-text classification tasks.

Conclusion: AlignMamba-2 provides an effective and efficient framework for multimodal fusion and sentiment analysis, successfully addressing computational efficiency challenges while maintaining strong performance across diverse pattern recognition tasks.

Abstract: In the era of large-scale pre-trained models, effectively adapting general knowledge to specific affective computing tasks remains a challenge, particularly regarding computational efficiency and multimodal heterogeneity. While Transformer-based methods have excelled at modeling inter-modal dependencies, their quadratic computational complexity limits their use with long-sequence data. Mamba-based models have emerged as a computationally efficient alternative; however, their inherent sequential scanning mechanism struggles to capture the global, non-sequential relationships that are crucial for effective cross-modal alignment. To address these limitations, we propose AlignMamba-2, an effective and efficient framework for multimodal fusion and sentiment analysis. Our approach introduces a dual alignment strategy that regularizes the model using both Optimal Transport distance and Maximum Mean Discrepancy, promoting geometric and statistical consistency between modalities without incurring any inference-time overhead. More importantly, we design a Modality-Aware Mamba layer, which employs a Mixture-of-Experts architecture with modality-specific and modality-shared experts to explicitly handle data heterogeneity during the fusion process. Extensive experiments on four challenging benchmarks, including dynamic time-series (on the CMU-MOSI and CMU-MOSEI datasets) and static image-related tasks (on the NYU-Depth V2 and MVSA-Single datasets), demonstrate that AlignMamba-2 establishes a new state-of-the-art in both effectiveness and efficiency across diverse pattern recognition tasks, ranging from dynamic time-series analysis to static image-text classification.

[349] Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Yinghui Li, Jiayi Kuang, Peng Xing, Daixian Liu, Junnan Dong, Shu-Yu Guo, Yangning Li, Qingyu Zhou, Wenhao Jiang, Hai-Tao Zheng, Ying Shen, Liang Lin, Philip S. Yu

Main category: cs.AI

TL;DR: MLLMs struggle with discrete symbol interpretation despite success with natural scenes, revealing a cognitive mismatch between linguistic probability and true visual perception of symbols.

DetailsMotivation: To investigate MLLMs' ability to process discrete symbols (mathematical formulas, chemical structures, linguistic characters) which are fundamental to human cognition but differ from continuous visual data, and to expose potential gaps in AI's symbolic understanding capabilities.

Method: Introduces a comprehensive benchmark evaluating top-tier MLLMs across five domains (language, culture, mathematics, physics, chemistry) to assess their navigation of “discrete semantic spaces” and identify patterns of failure and success.

Result: Reveals a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception, exposing a “cognitive mismatch.”

Conclusion: Highlights a significant gap in current AI capabilities regarding true perception and understanding of symbolic languages that underpin scientific discovery and abstract thought, offering a roadmap for developing more rigorous, human-aligned intelligent systems.

Abstract: While Multimodal Large Language Models (MLLMs) have achieved remarkable success in interpreting natural scenes, their ability to process discrete symbols – the fundamental building blocks of human cognition – remains a critical open question. Unlike continuous visual data, symbols such as mathematical formulas, chemical structures, and linguistic characters require precise, deeper interpretation. This paper introduces a comprehensive benchmark to evaluate how top-tier MLLMs navigate these “discrete semantic spaces” across five domains: language, culture, mathematics, physics, and chemistry. Our investigation uncovers a counterintuitive phenomenon: models often fail at basic symbol recognition yet succeed in complex reasoning tasks, suggesting they rely on linguistic probability rather than true visual perception. By exposing this “cognitive mismatch”, we highlight a significant gap in current AI capabilities: the struggle to truly perceive and understand the symbolic languages that underpin scientific discovery and abstract thought. This work offers a roadmap for developing more rigorous, human-aligned intelligent systems.

[350] Cross-Domain Demo-to-Code via Neurosymbolic Counterfactual Reasoning

Jooyoung Kim, Wonje Choi, Younguk Song, Honguk Woo

Main category: cs.AI

TL;DR: NeSyCR is a neurosymbolic counterfactual reasoning framework that enables video-instructed robotic programming by adapting task procedures across perceptual and physical domain shifts through symbolic abstraction and verifiable procedural revisions.

DetailsMotivation: Current Vision-Language Models (VLMs) lack procedural understanding needed for video-instructed robotic programming when there are perceptual and physical differences between demonstration and deployment domains, causing procedural mismatches.

Method: NeSyCR abstracts video demonstrations into symbolic trajectories capturing task procedures, derives counterfactual states to reveal cross-domain incompatibilities, explores symbolic state space with verifiable checks, and proposes procedural revisions to restore compatibility.

Result: NeSyCR achieves 31.14% improvement in task success over the strongest baseline Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.

Conclusion: The neurosymbolic counterfactual reasoning framework enables verifiable adaptation of task procedures for video-instructed robotic programming, addressing domain shift challenges that current VLMs cannot handle.

Abstract: Recent advances in Vision-Language Models (VLMs) have enabled video-instructed robotic programming, allowing agents to interpret video demonstrations and generate executable control code. We formulate video-instructed robotic programming as a cross-domain adaptation problem, where perceptual and physical differences between demonstration and deployment induce procedural mismatches. However, current VLMs lack the procedural understanding needed to reformulate causal dependencies and achieve task-compatible behavior under such domain shifts. We introduce NeSyCR, a neurosymbolic counterfactual reasoning framework that enables verifiable adaptation of task procedures, providing a reliable synthesis of code policies. NeSyCR abstracts video demonstrations into symbolic trajectories that capture the underlying task procedure. Given deployment observations, it derives counterfactual states that reveal cross-domain incompatibilities. By exploring the symbolic state space with verifiable checks, NeSyCR proposes procedural revisions that restore compatibility with the demonstrated procedure. NeSyCR achieves a 31.14% improvement in task success over the strongest baseline Statler, showing robust cross-domain adaptation across both simulated and real-world manipulation tasks.

[351] Expert Personas Improve LLM Alignment but Damage Accuracy: Bootstrapping Intent-Based Persona Routing with PRISM

Zizhao Hu, Mohammad Rostami, Jesse Thomason

Main category: cs.AI

TL;DR: PRISM is a method that self-distills expert personas into gated LoRA adapters through bootstrapping to enhance human preference alignment while maintaining task accuracy.

DetailsMotivation: Persona prompting can steer LLMs toward domain-specific tones, but prior works show mixed results on utility. A comprehensive investigation is needed to understand when expert personas succeed or fail and how to leverage their benefits while avoiding harm.

Method: Study how model optimization, task type, prompt length, and placement impact expert persona effectiveness. Develop PRISM (Persona Routing via Intent-based Self-Modeling) - a pipeline that self-distills intent-conditioned expert personas into gated LoRA adapters through bootstrapping without external data/models.

Result: PRISM enhances human preference and safety alignment on generative tasks while maintaining accuracy on discriminative tasks across all models, with minimal memory and computing overhead.

Conclusion: PRISM provides an effective method to leverage expert personas in LLMs, improving alignment while maintaining utility across different task types.

Abstract: Persona prompting can steer LLM generation towards a domain-specific tone and pattern. This behavior enables use cases in multi-agent systems where diverse interactions are crucial and human-centered tasks require high-level human alignment. Prior works provide mixed opinions on their utility: some report performance gains when using expert personas for certain domains and their contribution to data diversity in synthetic data creation, while others find near-zero or negative impact on general utility. To fully leverage the benefits of the LLM persona and avoid its harmfulness, a more comprehensive investigation of the mechanism is crucial. In this work, we study how model optimization, task type, prompt length, and placement can impact expert persona effectiveness across instruction-tuned and reasoning LLMs, and provide insight into conditions under which expert personas fail and succeed. Based on our findings, we developed a pipeline to fully leverage the benefits of an expert persona, named PRISM (Persona Routing via Intent-based Self-Modeling), which self-distills an intent-conditioned expert persona into a gated LoRA adapter through a bootstrapping process that requires no external data, models, or knowledge. PRISM enhances human preference and safety alignment on generative tasks while maintaining accuracy on discriminative tasks across all models, with minimal memory and computing overhead.

[352] Correlation-Weighted Multi-Reward Optimization for Compositional Generation

Jungmyung Wi, Hyunsoo Kim, Donghyun Kim

Main category: cs.AI

TL;DR: A framework called Correlation-Weighted Multi-Reward Optimization improves compositional text-to-image generation by adaptively weighting concept rewards based on their correlations to balance competing concepts.

DetailsMotivation: Text-to-image models struggle with compositional generation where multiple concepts need to be satisfied simultaneously. Models often omit some concepts due to interference between competing reward signals during optimization.

Method: Decompose multi-concept prompts into concept groups (objects, attributes, relations), obtain reward signals from dedicated reward models for each concept, then adaptively reweight rewards using correlation-based difficulty estimation to emphasize conflicting or hard-to-satisfy concepts.
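
A hypothetical sketch of correlation-based difficulty weighting, given a per-sample matrix of concept rewards: concepts that correlate negatively with others (conflicting) or have low average reward (hard to satisfy) receive more weight. The particular formula below is a stand-in for illustration; the paper's estimator is not reproduced here.

```python
import numpy as np

def concept_weights(R):
    """R: (n_samples, n_concepts) reward matrix; returns weights summing to 1.

    Hypothetical rule: weight grows with conflict (negative correlation to
    other concepts) and with difficulty (low mean reward across samples).
    """
    C = np.corrcoef(R, rowvar=False)      # concept-by-concept correlations
    conflict = -C.mean(axis=0)            # more negative correlation -> harder
    difficulty = 1.0 - R.mean(axis=0)     # low average reward -> harder
    raw = np.exp(conflict + difficulty)   # softmax-style combination
    return raw / raw.sum()

def weighted_reward(R):
    """Scalar reward per sample after adaptive reweighting."""
    return R @ concept_weights(R)
```

Feeding `weighted_reward` into a standard reward-optimization loop would concentrate the training signal on the hardest or most conflicting concepts in each group, which is the behavior the method aims for.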

Result: Applied to state-of-the-art diffusion models (SD3.5 and FLUX.1-dev), showing consistent improvements on challenging multi-concept benchmarks including ConceptMix, GenEval 2, and T2I-CompBench.

Conclusion: The proposed correlation-weighted optimization framework effectively addresses compositional generation challenges by balancing competing reward signals and focusing on difficult concepts, leading to more consistent satisfaction of all requested attributes.

Abstract: Text-to-image models produce images that align well with natural language prompts, but compositional generation has long been a central challenge. Models often struggle to satisfy multiple concepts within a single prompt, frequently omitting some concepts and resulting in partial success. Such failures highlight the difficulty of jointly optimizing multiple concepts during reward optimization, where competing concepts can interfere with one another. To address this limitation, we propose Correlation-Weighted Multi-Reward Optimization, a framework that leverages the correlation structure among concept rewards to adaptively weight each attribute concept in optimization. By accounting for interactions among concepts, the framework balances competing reward signals and emphasizes concepts that are partially satisfied yet inconsistently generated across samples, improving compositional generation. Specifically, we decompose multi-concept prompts into pre-defined concept groups (e.g., objects, attributes, and relations) and obtain reward signals from dedicated reward models for each concept. We then adaptively reweight these rewards, assigning higher weights to conflicting or hard-to-satisfy concepts using correlation-based difficulty estimation. By focusing optimization on the most challenging concepts within each group, the framework encourages the model to consistently satisfy all requested attributes simultaneously. We apply our approach to train state-of-the-art diffusion models, SD3.5 and FLUX.1-dev, and demonstrate consistent improvements on challenging multi-concept benchmarks, including ConceptMix, GenEval 2, and T2I-CompBench.

[353] CAPSUL: A Comprehensive Human Protein Benchmark for Subcellular Localization

Yicheng Hu, Xinyu Lin, Shulin Li, Wenjie Wang, Fengbin Zhu, Fuli Feng

Main category: cs.AI

TL;DR: CAPSUL is a comprehensive human protein benchmark for subcellular localization that integrates 3D structural representations with fine-grained localization annotations, demonstrating the importance of structural features and enabling interpretable biological insights.

DetailsMotivation: Existing datasets lack comprehensive 3D structural information with detailed subcellular localization annotations, hindering the application of structure-based models for this crucial biological task important for drug target identification and function annotation.

Method: Introduces CAPSUL benchmark integrating diverse 3D structural representations with expert-curated subcellular localization annotations. Evaluates state-of-the-art sequence-based and structure-based models, explores reweighting and single-label classification strategies, and demonstrates interpretability through attention mechanism analysis.

Result: Shows the importance of structural features for subcellular localization prediction, discovers decisive localization patterns (α-helix) through attention mechanisms, and provides a benchmark for future structure-based method development.

Conclusion: CAPSUL bridges the gap between protein structure and subcellular localization, enabling data-driven discoveries in cell biology with improved biological interpretability through structure-based methods.

Abstract: Subcellular localization is a crucial biological task for drug target identification and function annotation. Although it has been biologically realized that subcellular localization is closely associated with protein structure, no existing dataset offers comprehensive 3D structural information with detailed subcellular localization annotations, thus severely hindering the application of promising structure-based models on this task. To address this gap, we introduce a new benchmark called CAPSUL, a Comprehensive humAn Protein benchmark for SUbcellular Localization. It features a dataset that integrates diverse 3D structural representations with fine-grained subcellular localization annotations carefully curated by domain experts. We evaluate this benchmark using a variety of state-of-the-art sequence-based and structure-based models, showcasing the importance of involving structural features in this task. Furthermore, we explore reweighting and single-label classification strategies to facilitate future investigation on structure-based methods for this task. Lastly, we showcase the powerful interpretability of structure-based methods through a case study on the Golgi apparatus, where we discover a decisive α-helix localization pattern from attention mechanisms, demonstrating the potential for bridging the gap with intuitive biological interpretability and paving the way for data-driven discoveries in cell biology.

[354] Interplay: Training Independent Simulators for Reference-Free Conversational Recommendation

Jerome Ramos, Feng Xia, Xi Wang, Shubham Chatterjee, Xiao Fu, Hossein A. Rahmani, Aldo Lipani

Main category: cs.AI

TL;DR: A reference-free simulation framework using two independent LLMs (user and recommender) that interact in real-time without predetermined target items, generating more realistic conversational recommendation data.

DetailsMotivation: Training conversational recommender systems requires extensive dialogue data that's hard to collect at scale. Traditional simulation approaches use a single LLM with prior knowledge of target items, resulting in scripted and artificial dialogues.

Method: Proposes a reference-free simulation framework with two independent LLMs: one as the user and one as the conversational recommender. They interact in real time without access to predetermined target items, relying instead on preference summaries and target attributes, so the recommender must genuinely infer user preferences through dialogue.
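
The turn-taking protocol can be sketched as a plain loop over two independent policies. In the paper these would be the two fine-tuned LLMs; the stubs below are toy stand-ins, and the acceptance token and turn limit are assumptions for illustration.

```python
def simulate_dialogue(user_turn, recommender_turn, max_turns=6):
    """Reference-free simulation loop: neither policy sees a target item;
    the conversation ends when the user emits an acceptance token."""
    history = []
    for _ in range(max_turns):
        u = user_turn(history)
        history.append(("user", u))
        if u == "<accept>":
            break
        history.append(("recommender", recommender_turn(history)))
    return history

# Toy stand-ins for the two LLM policies (illustrative only).
def toy_user(history):
    n_recs = sum(1 for role, _ in history if role == "recommender")
    return "<accept>" if n_recs >= 2 else "I like slow-burn sci-fi."

def toy_recommender(history):
    return "Have you tried 'Solaris'?"

dialogue = simulate_dialogue(toy_user, toy_recommender)
```

Because neither callable receives a target item up front, the transcript is produced turn by turn, which is what distinguishes this setup from single-LLM simulation of whole conversations.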

Result: The reference-free simulators match or exceed existing methods in quality while offering scalable generation of high-quality conversational recommendation data. Both quantitative and human evaluations confirm effectiveness.

Conclusion: The framework produces more realistic and diverse conversations that closely mirror authentic human-AI interactions, providing a scalable solution for generating conversational recommendation data without constraining conversations to pre-defined target items.

Abstract: Training conversational recommender systems (CRS) requires extensive dialogue data, which is challenging to collect at scale. To address this, researchers have used simulated user-recommender conversations. Traditional simulation approaches often utilize a single large language model (LLM) that generates entire conversations with prior knowledge of the target items, leading to scripted and artificial dialogues. We propose a reference-free simulation framework that trains two independent LLMs, one as the user and one as the conversational recommender. These models interact in real time without access to predetermined target items, relying instead on preference summaries and target attributes, enabling the recommender to genuinely infer user preferences through dialogue. This approach produces more realistic and diverse conversations that closely mirror authentic human-AI interactions. Our reference-free simulators match or exceed existing methods in quality, while offering a scalable solution for generating high-quality conversational recommendation data without constraining conversations to pre-defined target items. We conduct both quantitative and human evaluations to confirm the effectiveness of our reference-free approach.

[355] MedForge: Interpretable Medical Deepfake Detection via Forgery-aware Reasoning

Zhihui Chen, Kai He, Qingyuan Lei, Bin Pu, Jian Zhang, Yuling Xu, Mengling Feng

Main category: cs.AI

TL;DR: MedForge: A pre-hoc, evidence-grounded medical forgery detection system with localize-then-analyze reasoning for detecting manipulated medical scans.

DetailsMotivation: Text-guided image editors can manipulate medical scans with high fidelity, threatening clinical trust and safety. Existing defenses are inadequate for healthcare - medical detectors are black-box, while MLLM-based explainers are post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases.

Method: 1) Create MedForge-90K benchmark with realistic lesion edits across 19 pathologies with expert-guided reasoning supervision via doctor inspection guidelines and gold edit locations. 2) Develop MedForge-Reasoner with localize-then-analyze reasoning that predicts suspicious regions before producing verdict. 3) Align with Forgery-aware GSPO to strengthen grounding and reduce hallucinations.

Result: State-of-the-art detection accuracy and trustworthy, expert-aligned explanations demonstrated through experiments.

Conclusion: MedForge provides an effective data-and-method solution for pre-hoc, evidence-grounded medical forgery detection that addresses limitations of existing approaches.

Abstract: Text-guided image editors can now manipulate authentic medical scans with high fidelity, enabling lesion implantation/removal that threatens clinical trust and safety. Existing defenses are inadequate for healthcare. Medical detectors are largely black-box, while MLLM-based explainers are typically post-hoc, lack medical expertise, and may hallucinate evidence on ambiguous cases. We present MedForge, a data-and-method solution for pre-hoc, evidence-grounded medical forgery detection. We introduce MedForge-90K, a large-scale benchmark of realistic lesion edits across 19 pathologies with expert-guided reasoning supervision via doctor inspection guidelines and gold edit locations. Building on it, MedForge-Reasoner performs localize-then-analyze reasoning, predicting suspicious regions before producing a verdict, and is further aligned with Forgery-aware GSPO to strengthen grounding and reduce hallucinations. Experiments demonstrate state-of-the-art detection accuracy and trustworthy, expert-aligned explanations.

[356] ZEBRAARENA: A Diagnostic Simulation Environment for Studying Reasoning-Action Coupling in Tool-Augmented LLMs

Wanjia Zhao, Ludwig Schmidt, James Zou, Vidhisha Balachandran, Lingjiao Chen

Main category: cs.AI

TL;DR: ZebraArena is a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, designed to isolate the interplay between reasoning and tool use from confounding factors like memorization.

DetailsMotivation: Existing benchmarks for tool-augmented LLMs often confound reasoning-action coupling with complex environment dynamics, memorized knowledge, or dataset contamination, making it difficult to study the pure interplay between internal reasoning and external tool use.

Method: ZebraArena uses procedurally generated tasks with controllable difficulty and knowledge-minimal design. Each task requires critical information only available through targeted tool use, creating an interpretable interface between external information acquisition and deductive reasoning. The environment provides deterministic evaluation via unique solutions and theoretical optimal query counts.

Result: Even frontier reasoning models like GPT-5 and Gemini 2.5 Pro only achieve 60% accuracy on hard instances, showing that ZebraArena requires challenging combination of reasoning and tool use. Models use 70-270% more tool calls than theoretical optimum, revealing persistent gaps between theoretical optimality and practical tool usage.

Conclusion: ZebraArena effectively isolates and measures reasoning-action coupling in tool-augmented LLMs, revealing significant challenges even for state-of-the-art models. The benchmark stimulates research on the interplay between internal reasoning and external action.

Abstract: Tool-augmented large language models (LLMs) must tightly couple multi-step reasoning with external actions, yet existing benchmarks often confound this interplay with complex environment dynamics, memorized knowledge or dataset contamination. In this paper, we introduce ZebraArena, a procedurally generated diagnostic environment for studying reasoning-action coupling in tool-augmented LLMs, with controllable difficulty and a knowledge-minimal design, which limits gains from memorization or dataset contamination. Each task in ZebraArena requires a set of critical information which is available only through targeted tool use, yielding an interpretable interface between external information acquisition and deductive reasoning. This design provides deterministic evaluation via unique solutions, and a theoretical optimal query count for measuring efficient tool use. We show that ZebraArena requires a combination of in-depth reasoning and accurate external tool calling, which remains a challenge as frontier reasoning models such as GPT-5 and Gemini 2.5 Pro only achieve 60% accuracy on the hard instances. We also observe a persistent gap between theoretical optimality and practical tool usage. For example, GPT-5 uses 70-270% more tool calls than the theoretical optimum. We highlight the key findings in our evaluation, and hope ZebraArena stimulates further research on the interplay between internal reasoning and external action.

[357] Agentic Flow Steering and Parallel Rollout Search for Spatially Grounded Text-to-Image Generation

Ping Chen, Daoxuan Zhang, Xiangming Wang, Yungeng Liu, Haijin Zeng, Yongyong Chen

Main category: cs.AI

TL;DR: AFS-Search introduces a training-free closed-loop framework for text-to-image generation that uses a Vision-Language Model as a semantic critic to dynamically steer the generation process and perform parallel rollout search for optimal trajectory selection.

DetailsMotivation: Current text-to-image generation suffers from limited relational reasoning in static text encoders and error accumulation in open-loop sampling, where initial semantic ambiguities escalate into spatial constraint deviations without real-time feedback.

Method: AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism using a VLM as semantic critic to diagnose intermediate latents and dynamically steer the velocity field via spatial grounding. It formulates T2I generation as sequential decision-making with lookahead simulations and VLM-guided reward selection.
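At its core, the parallel rollout search is best-of-N trajectory selection under a critic reward. A minimal sketch, with `step_fn` standing in for one sampling step along the trajectory and `critic_fn` for the VLM reward model (both are hypothetical stubs, not the paper's implementation):

```python
import random

def rollout_search(step_fn, critic_fn, state, n_rollouts=4, horizon=3, seed=0):
    """Best-of-N lookahead: simulate several trajectories from `state`,
    score each terminal latent with the critic, keep the best trajectory."""
    rng = random.Random(seed)
    best_traj, best_reward = None, float("-inf")
    for _ in range(n_rollouts):
        traj, x = [state], state
        for _ in range(horizon):
            x = step_fn(x, rng)          # advance one (noisy) sampling step
            traj.append(x)
        reward = critic_fn(traj[-1])     # VLM-as-critic scores the endpoint
        if reward > best_reward:
            best_reward, best_traj = reward, traj
    return best_traj, best_reward

# Toy 1-D demo: drift toward 1.0; critic rewards proximity to 1.0.
step = lambda x, rng: x + 0.5 + rng.uniform(-0.2, 0.2)
critic = lambda x: -(x - 1.0) ** 2
traj, reward = rollout_search(step, critic, 0.0)
```

AFS-Search additionally steers the velocity field between rollouts via spatial grounding; this sketch covers only the selection half of the mechanism.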

Result: AFS-Search-Pro greatly boosts FLUX.1-dev performance, achieving state-of-the-art results across three benchmarks. AFS-Search-Fast significantly enhances performance while maintaining fast generation speed.

Conclusion: The closed-loop framework with VLM-based semantic criticism and parallel rollout search effectively addresses error accumulation in T2I generation, offering both high-performance and fast variants for practical applications.

Abstract: Precise Text-to-Image (T2I) generation has achieved great success but is hindered by the limited relational reasoning of static text encoders and the error accumulation in open-loop sampling. Without real-time feedback, initial semantic ambiguities during the Ordinary Differential Equation trajectory inevitably escalate into stochastic deviations from spatial constraints. To bridge this gap, we introduce AFS-Search (Agentic Flow Steering and Parallel Rollout Search), a training-free closed-loop framework built upon FLUX.1-dev. AFS-Search incorporates a training-free closed-loop parallel rollout search and flow steering mechanism, which leverages a Vision-Language Model (VLM) as a semantic critic to diagnose intermediate latents and dynamically steer the velocity field via precise spatial grounding. Complementarily, we formulate T2I generation as a sequential decision-making process, exploring multiple trajectories through lookahead simulations and selecting the optimal path based on VLM-guided rewards. Further, we provide AFS-Search-Pro for higher performance and AFS-Search-Fast for quicker generation. Experimental results show that our AFS-Search-Pro greatly boosts the performance of the original FLUX.1-dev, achieving state-of-the-art results across three different benchmarks. Meanwhile, AFS-Search-Fast also significantly enhances performance while maintaining fast generation speed.

[358] D-Mem: A Dual-Process Memory System for LLM Agents

Zhixing You, Jiachen Yuan, Jason Cai

Main category: cs.AI

TL;DR: D-Mem introduces a dual-process memory system for autonomous agents that combines lightweight vector retrieval with exhaustive full deliberation, using quality gating to balance speed and accuracy for long-horizon reasoning.

DetailsMotivation: Current retrieval-based memory frameworks for autonomous agents rely on lossy abstraction and incremental processing, often missing critical contextual information and struggling with fine-grained understanding needed for long-horizon reasoning.

Method: D-Mem proposes a dual-process system: 1) lightweight vector retrieval for routine queries, and 2) exhaustive Full Deliberation module as high-fidelity fallback. It uses Multi-dimensional Quality Gating policy to dynamically bridge these processes for cognitive economy without sacrificing accuracy.
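The gating logic can be sketched as a simple router: answer from the fast path unless any quality dimension falls below its threshold. Gate names, scorers, and thresholds below are illustrative assumptions, not the paper's actual policy:

```python
def answer_with_gating(query, fast_retrieve, full_deliberate, gates):
    """Dual-process routing sketch: run cheap vector retrieval first, score
    the draft along several quality dimensions, and fall back to exhaustive
    deliberation only if any gate fails."""
    draft, evidence = fast_retrieve(query)
    for name, (score, threshold) in gates.items():
        if score(query, draft, evidence) < threshold:
            return full_deliberate(query), f"fallback:{name}"
    return draft, "fast"

# Stubs standing in for the retrieval and deliberation components.
def fast_retrieve(q):
    kb = {"capital of France": ("Paris", ["France's capital is Paris."])}
    for key, value in kb.items():
        if key in q:
            return value
    return ("", [])

def full_deliberate(q):
    return "answer derived from re-reading the full history"

gates = {"non_empty": (lambda q, d, e: float(bool(d)), 0.5),
         "has_evidence": (lambda q, d, e: float(bool(e)), 0.5)}

a1, route1 = answer_with_gating("capital of France?", fast_retrieve, full_deliberate, gates)
a2, route2 = answer_with_gating("what did Alice say in March?", fast_retrieve, full_deliberate, gates)
```

The paper's Multi-dimensional Quality Gating plays this routing role between vector retrieval and the Full Deliberation module; the stubs above only illustrate the control flow.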

Result: Experiments on the LoCoMo and RealTalk benchmarks using GPT-4o-mini and Qwen3-235B-Instruct show that D-Mem’s Multi-dimensional Quality Gating achieves an F1 score of 53.5 on LoCoMo with GPT-4o-mini, outperforming the static retrieval baseline (51.2) and recovering 96.7% of Full Deliberation’s performance (55.3) at lower computational cost.

Conclusion: D-Mem provides an effective dual-process memory system that balances computational efficiency with high-fidelity memory access for autonomous agents, addressing limitations of current retrieval-based approaches for long-horizon reasoning.

Abstract: Driven by the development of persistent, self-adapting autonomous agents, equipping these systems with high-fidelity memory access for long-horizon reasoning has emerged as a critical requirement. However, prevalent retrieval-based memory frameworks often follow an incremental processing paradigm that continuously extracts and updates conversational memories into vector databases, relying on semantic retrieval when queried. While this approach is fast, it inherently relies on lossy abstraction, frequently missing contextually critical information and struggling to resolve queries that rely on fine-grained contextual understanding. To address this, we introduce D-Mem, a dual-process memory system. It retains lightweight vector retrieval for routine queries while establishing an exhaustive Full Deliberation module as a high-fidelity fallback. To achieve cognitive economy without sacrificing accuracy, D-Mem employs a Multi-dimensional Quality Gating policy to dynamically bridge these two processes. Experiments on the LoCoMo and RealTalk benchmarks using GPT-4o-mini and Qwen3-235B-Instruct demonstrate the efficacy of our approach. Notably, our Multi-dimensional Quality Gating policy achieves an F1 score of 53.5 on LoCoMo with GPT-4o-mini. This outperforms our static retrieval baseline, Mem0$^\ast$ (51.2), and recovers 96.7% of the Full Deliberation’s performance (55.3), while incurring significantly lower computational costs.

[359] An Onto-Relational-Sophic Framework for Governing Synthetic Minds

Huansheng Ning, Jianguo Ding

Main category: cs.AI

TL;DR: The paper proposes a philosophical framework (ORS) for understanding and governing synthetic minds, moving beyond traditional tool-centric approaches to address ontological, relational, and ethical dimensions of increasingly capable AI systems.

DetailsMotivation: Current regulatory frameworks are inadequate for foundation models that exhibit broad competence across reasoning, creativity, and social interaction. Existing approaches focus on algorithmic bias and transparency but fail to address fundamental questions about what synthetic minds are and how societies should relate to them.

Method: The authors introduce the Onto-Relational-Sophic (ORS) framework based on Cyberism philosophy, consisting of three pillars: (1) CPST ontology defining synthetic minds as irreducibly multi-dimensional, (2) graded spectrum of digital personhood beyond binary classifications, and (3) Cybersophy wisdom-oriented axiology synthesizing virtue ethics, consequentialism, and relational approaches.

Result: The framework is applied to emergent scenarios including autonomous research agents, AI-mediated healthcare, and agentic AI ecosystems, demonstrating its capacity to generate proportionate, adaptive governance recommendations.

Conclusion: The ORS framework provides comprehensive philosophical foundations for synthetic minds, moving from narrow technical alignment toward addressing ontological, relational, and ethical dimensions of increasingly capable AI systems.

Abstract: The rapid evolution of artificial intelligence, from task-specific systems to foundation models exhibiting broad, flexible competence across reasoning, creative synthesis, and social interaction, has outpaced the conceptual and governance frameworks designed to manage it. Current regulatory paradigms, anchored in a tool-centric worldview, address algorithmic bias and transparency but leave unanswered foundational questions about what increasingly capable synthetic minds are, how societies should relate to them, and the normative principles that should guide their development. Here we introduce the Onto-Relational-Sophic (ORS) framework, grounded in Cyberism philosophy, which offers integrated answers to these challenges through three pillars: (1) a Cyber-Physical-Social-Thinking (CPST) ontology that defines the mode of being for synthetic minds as irreducibly multi-dimensional rather than purely computational; (2) a graded spectrum of digital personhood providing a pragmatic relational taxonomy beyond binary person-or-tool classifications; and (3) Cybersophy, a wisdom-oriented axiology synthesizing virtue ethics, consequentialism, and relational approaches to guide governance. We apply the framework to emergent scenarios including autonomous research agents, AI-mediated healthcare, and agentic AI ecosystems, demonstrating its capacity to generate proportionate, adaptive governance recommendations. The ORS framework charts a path from narrow technical alignment toward comprehensive philosophical foundations for the synthetic minds already among us.

[360] Balanced Thinking: Improving Chain of Thought Training in Vision Language Models

Shaked Perek, Ben Wiesel, Avihu Dekel, Nimrod Shabtay, Eli Schwartz

Main category: cs.AI

TL;DR: SCALe is a training method for vision-language models that uses adaptive loss weighting to balance supervision between reasoning traces and answer segments, improving accuracy and efficiency compared to standard SFT approaches.

DetailsMotivation: Standard supervised fine-tuning (SFT) for multimodal reasoning treats all tokens equally, but reasoning data is inherently imbalanced - long thinking traces overshadow short but critical answer segments, leading to verbose reasoning and inaccurate answers.

Method: SCALe (Scheduled Curriculum Adaptive Loss) explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. It gradually shifts focus from thinking to answer segments throughout training via a cosine scheduling policy.
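The scheduling idea can be sketched in a few lines: averaging token losses within each segment first makes the weighting length-independent, and a cosine schedule moves emphasis from reasoning to answer over training. The schedule endpoints (1.0 and 0.0) are illustrative assumptions:

```python
import math

def scale_weights(progress):
    """Cosine schedule sketch: early training weights the reasoning segment,
    late training the answer segment. `progress` is in [0, 1]."""
    w_answer = 0.5 * (1.0 - math.cos(math.pi * progress))  # 0 -> 1
    w_think = 1.0 - w_answer                               # 1 -> 0
    return w_think, w_answer

def segment_weighted_loss(think_losses, answer_losses, progress):
    """Length-independent weighting: each segment's token losses are averaged
    within the segment first, so a long reasoning trace cannot drown out a
    short answer, then combined by the scheduled weights."""
    w_t, w_a = scale_weights(progress)
    mean = lambda xs: sum(xs) / len(xs)
    return w_t * mean(think_losses) + w_a * mean(answer_losses)
```

At progress 0.5 the two segments are weighted equally; earlier training is dominated by the reasoning trace, later training by the answer.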

Result: SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time. When combined with GRPO, it achieves the best overall performance.

Conclusion: SCALe is a lightweight yet effective alternative to standard training pipelines for multimodal reasoning, offering both standalone effectiveness and strong foundation for reinforcement learning refinement.

Abstract: Multimodal reasoning in vision-language models (VLMs) typically relies on a two-stage process: supervised fine-tuning (SFT) and reinforcement learning (RL). In standard SFT, all tokens contribute equally to the loss, even though reasoning data are inherently token-imbalanced. Long traces overshadow short but task-critical segments, leading to verbose reasoning and inaccurate answers. We propose SCALe (Scheduled Curriculum Adaptive Loss), which explicitly separates supervision over reasoning and answer segments using dynamic, length-independent weighting. Unlike vanilla SFT, which overweights the long reasoning segment, SCALe-SFT gradually shifts the focus from the reasoning segment to the answer segment throughout training via a cosine scheduling policy, encouraging concise and well-grounded reasoning. We evaluate SCALe across diverse benchmarks and architectures. Results show that SCALe consistently improves accuracy over vanilla SFT and matches the performance of the full two-phase SFT + GRPO pipeline while requiring only about one-seventh of the training time, making it a lightweight yet effective alternative. When combined with GRPO, SCALe achieves the best overall performance, highlighting its value both as a standalone method and as a strong foundation for reinforcement learning refinement.

[361] Thinking with Constructions: A Benchmark and Policy Optimization for Visual-Text Interleaved Geometric Reasoning

Haokun Zhao, Wanshi Xu, Haidong Yuan, Songjun Cao, Long Ma, Yanghua Xiao

Main category: cs.AI

TL;DR: A framework for teaching MLLMs to dynamically construct visual aids for geometric reasoning, using a new benchmark and reinforcement learning to optimize when and how to create visual constructions.

DetailsMotivation: Current MLLMs are limited to passive inference with static diagrams, lacking the ability to strategically construct visual aids needed for geometric reasoning, which requires dynamic manipulation of visual elements to bridge problem conditions and solutions.

Method: Introduces GeoAux-Bench benchmark with 4,334 geometry problems aligning textual construction steps with visual updates. Proposes Action Applicability Policy Optimization (A2PO) using reinforcement learning with Adaptive Reward Shaping to regulate timing and quality of visual aids via counterfactual sampling.

Result: The approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Key findings: interleaved visual-textual aids outperform single-modality approaches, and valid constructions act as entropy reducers correlating with reduced reasoning perplexity.

Conclusion: The framework successfully teaches MLLMs to strategically construct visual aids for geometric reasoning, addressing a critical limitation in current multimodal models and demonstrating measurable performance improvements.

Abstract: Geometric reasoning inherently requires “thinking with constructions” – the dynamic manipulation of visual aids to bridge the gap between problem conditions and solutions. However, existing Multimodal Large Language Models (MLLMs) are largely confined to passive inference with static diagrams, lacking the strategic knowledge of when and how to construct effective visual aids. To address this, we present a framework for Visual-Text Interleaved Chain-of-Thought. We first introduce GeoAux-Bench, the first benchmark comprising 4,334 geometry problems that aligns textual construction steps with ground-truth visual updates. Our pilot study reveals two critical insights: (1) interleaved visual-textual aids outperform single-modality counterparts, which cannot losslessly capture geometric synergy; and (2) valid constructions act as entropy reducers, strongly correlating with reduced reasoning perplexity. Building on these findings, we propose Action Applicability Policy Optimization (A2PO), a reinforcement learning paradigm for mastering strategic construction. A2PO employs Adaptive Reward Shaping to regulate the timing and quality of visual aids via counterfactual sampling to distinguish necessary from redundant constructions. Experiments demonstrate our approach enables MLLMs to leverage selective auxiliary constructions, yielding a 3.51% gain over strong baselines. Code and data are available on GitHub.

[362] MANAR: Memory-augmented Attention with Navigational Abstract Conceptual Representation

Zuher Jahshan, Ben Ben Ishay, Leonid Yavits

Main category: cs.AI

TL;DR: MANAR is a memory-augmented attention mechanism that implements Global Workspace Theory principles, achieving linear-time scaling while matching or exceeding performance of standard quadratic attention across language, vision, and speech tasks.

DetailsMotivation: Standard multi-head attention lacks the functional bottleneck and global integration mechanisms hypothesized in cognitive models of consciousness (Global Workspace Theory). Current linear-time attention alternatives often have structural incompatibility with pretrained transformers, creating adoption barriers.

Method: MANAR implements a central workspace with trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR). It follows a two-stage GWT-inspired process: (1) integration phase where memory concepts converge to form a collective “mental image” (ACR) from input stimuli, and (2) broadcasting phase where this global state informs contextualization of local tokens.
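The two-stage workspace can be sketched with plain attention over a constant number of memory slots (projections and training details omitted; since the number of slots `m` is constant, the cost is O(n·m·d), i.e. linear in sequence length):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def manar_layer(tokens, memory):
    """Two-stage GWT-style sketch. tokens: (n, d) input sequence;
    memory: (m, d) learned concept slots with m constant."""
    d = tokens.shape[-1]
    # (i) integration: each memory slot attends over the input,
    #     forming the constant-size Abstract Conceptual Representation
    acr = softmax(memory @ tokens.T / np.sqrt(d)) @ tokens   # (m, d)
    # (ii) broadcasting: each token reads back from the global ACR
    return softmax(tokens @ acr.T / np.sqrt(d)) @ acr        # (n, d)
```

Because the ACR has a fixed number of rows, doubling the sequence length doubles the cost rather than quadrupling it as in standard self-attention.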

Result: MANAR achieves efficient linear-time scaling as a byproduct of the GWT functional bottleneck. It matches or exceeds strong baselines across multiple domains: GLUE score of 85.1 (language), 83.9% ImageNet-1K accuracy (vision), and 2.7% WER on LibriSpeech (speech).

Conclusion: MANAR provides an efficient and expressive alternative to quadratic attention that is compatible with pretrained transformers via weight-copy, enabling non-convex contextualization that reflects creative synthesis described in Global Workspace Theory.

Abstract: The MANAR (Memory-augmented Attention with Navigational Abstract Conceptual Representation) contextualization layer generalizes standard multi-head attention (MHA) by instantiating the principles of Global Workspace Theory (GWT). While MHA enables unconstrained all-to-all communication, it lacks the functional bottleneck and global integration mechanisms hypothesized in cognitive models of consciousness. MANAR addresses this by implementing a central workspace through a trainable memory of abstract concepts and an Abstract Conceptual Representation (ACR). The architecture follows a two-stage logic that maps directly to GWT mechanics: (i) an integration phase, where retrieved memory concepts converge to form a collective “mental image” (the ACR) based on input stimuli; and (ii) a broadcasting phase, where this global state navigates and informs the contextualization of individual local tokens. We demonstrate that efficient linear-time scaling is a fundamental architectural byproduct of instantiating the GWT functional bottleneck, as routing global information through a constant-sized ACR resolves the quadratic complexity inherent in standard attention. MANAR is a compatible re-parameterization of MHA with identical semantic roles for its projections, enabling knowledge transfer from pretrained transformers via weight-copy and thus overcoming the adoption barriers of structurally incompatible linear-time alternatives. MANAR enables non-convex contextualization, synthesizing representations that provably lie outside the convex hull of input tokens - a mathematical reflection of the creative synthesis described in GWT. Empirical evaluations confirm that MANAR matches or exceeds strong baselines across language (GLUE score of 85.1), vision (83.9% ImageNet-1K), and speech (2.7% WER on LibriSpeech), positioning it as an efficient and expressive alternative to quadratic attention.

[363] Accurate and Efficient Multi-Channel Time Series Forecasting via Sparse Attention Mechanism

Lei Gao, Hengda Bao, Jingfei Fang, Guangzheng Wu, Weihua Zhou, Yun Zhou

Main category: cs.AI

TL;DR: Li-Net is a novel architecture for multi-channel time series forecasting that captures linear and non-linear dependencies among channels using dynamic compression, configurable non-linear modules, and sparse Top-K Softmax attention with multi-modal embeddings.

DetailsMotivation: Traditional methods for multi-channel time series forecasting fail to adequately capture complex dynamic dependencies within and between channels, particularly overlooking channel interactions. There's a need for architectures that can effectively model both linear and non-linear dependencies while maintaining computational efficiency.

Method: Li-Net dynamically compresses representations across sequence and channel dimensions, processes information through configurable non-linear modules, and reconstructs forecasts. It integrates sparse Top-K Softmax attention within a multi-scale projection framework and can incorporate multi-modal embeddings to guide attention toward informative time steps and feature channels.
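The Top-K Softmax component is a standard sparsification: keep only the k largest logits per row before normalizing, zeroing out the rest. A minimal sketch (tie handling and the surrounding multi-scale projection framework omitted):

```python
import numpy as np

def topk_softmax(scores, k):
    """Sparse Top-K softmax sketch: the k largest attention logits per row
    are renormalized over; all other positions receive weight exactly 0."""
    scores = np.asarray(scores, dtype=float)
    # threshold = k-th largest value in each row
    kth = np.sort(scores, axis=-1)[..., -k][..., None]
    masked = np.where(scores >= kth, scores, -np.inf)   # exp(-inf) -> 0
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```

For `k = 2` over logits `[1, 2, 3, 4]`, only the last two positions receive nonzero weight, which is the mechanism that lets the model concentrate on the most informative time steps and channels.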

Result: Li-Net achieves competitive performance on multiple real-world benchmark datasets compared to state-of-the-art baselines. It provides superior balance between prediction accuracy and computational burden, with significantly lower memory usage and faster inference times.

Conclusion: Li-Net effectively addresses the limitations of traditional methods by capturing complex channel interactions through its novel architecture. The integration of multi-modal embeddings and sparse attention mechanisms makes it both effective and efficient for multi-channel time series forecasting tasks.

Abstract: The task of multi-channel time series forecasting is ubiquitous in numerous fields such as finance, supply chain management, and energy planning. It is critical to effectively capture complex dynamic dependencies within and between channels for accurate predictions. However, traditional methods have paid little attention to learning the interactions among channels. This paper proposes Linear-Network (Li-Net), a novel architecture designed for multi-channel time series forecasting that captures the linear and non-linear dependencies among channels. Li-Net dynamically compresses representations across sequence and channel dimensions, processes the information through a configurable non-linear module and subsequently reconstructs the forecasts. Moreover, Li-Net integrates a sparse Top-K Softmax attention mechanism within a multi-scale projection framework to address these challenges. A core innovation is its ability to seamlessly incorporate and fuse multi-modal embeddings, guiding the sparse attention process to focus on the most informative time steps and feature channels. Experimental results on multiple real-world benchmark datasets demonstrate that Li-Net achieves competitive performance compared to state-of-the-art baseline methods. Furthermore, Li-Net provides a superior balance between prediction accuracy and computational burden, exhibiting significantly lower memory usage and faster inference times. Detailed ablation studies and parameter sensitivity analyses validate the effectiveness of each key component in our proposed architecture. Keywords: Multivariate Time Series Forecasting, Sparse Attention Mechanism, Multimodal Information Fusion, Non-linear relationship

[364] MemMA: Coordinating the Memory Cycle through Multi-Agent Reasoning and In-Situ Self-Evolution

Minhua Lin, Zhiwei Zhang, Hanqing Lu, Hui Liu, Xianfeng Tang, Qi He, Xiang Zhang, Suhang Wang

Main category: cs.AI

TL;DR: MemMA is a multi-agent framework that coordinates memory construction, retrieval, and utilization in memory-augmented LLM agents through strategic forward guidance and self-evolving backward repair mechanisms.

DetailsMotivation: Existing memory-augmented LLM agents treat memory construction, retrieval, and utilization as isolated subroutines, leading to strategic blindness (local heuristics rather than strategic reasoning) and sparse, delayed supervision (downstream failures don't directly repair memory).

Method: MemMA coordinates memory cycles with forward and backward paths. Forward: Meta-Thinker produces structured guidance for Memory Manager (construction) and Query Reasoner (iterative retrieval). Backward: in-situ self-evolving memory construction synthesizes probe QA pairs, verifies current memory, and converts failures into repair actions before finalization.

Result: MemMA consistently outperforms existing baselines on LoCoMo benchmark across multiple LLM backbones and improves three different storage backends in a plug-and-play manner.

Conclusion: MemMA effectively addresses strategic blindness and sparse supervision in memory-augmented LLM agents through coordinated multi-agent framework with strategic forward guidance and self-evolving backward repair mechanisms.

Abstract: Memory-augmented LLM agents maintain external memory banks to support long-horizon interaction, yet most existing systems treat construction, retrieval, and utilization as isolated subroutines. This creates two coupled challenges: strategic blindness on the forward path of the memory cycle, where construction and retrieval are driven by local heuristics rather than explicit strategic reasoning, and sparse, delayed supervision on the backward path, where downstream failures rarely translate into direct repairs of the memory bank. To address these challenges, we propose MemMA, a plug-and-play multi-agent framework that coordinates the memory cycle along both the forward and backward paths. On the forward path, a Meta-Thinker produces structured guidance that steers a Memory Manager during construction and directs a Query Reasoner during iterative retrieval. On the backward path, MemMA introduces in-situ self-evolving memory construction, which synthesizes probe QA pairs, verifies the current memory, and converts failures into repair actions before the memory is finalized. Extensive experiments on LoCoMo show that MemMA consistently outperforms existing baselines across multiple LLM backbones and improves three different storage backends in a plug-and-play manner. Our code is publicly available at https://github.com/ventr1c/memma.

[365] Analysis Of Linguistic Stereotypes in Single and Multi-Agent Generative AI Architectures

Martina Ullasci, Marco Rondina, Riccardo Coppola, Flavio Giobergia, Riccardo Bellanca, Gabriele Mancari Pasi, Luca Prato, Federico Spinoso, Silvia Tagliente

Main category: cs.AI

TL;DR: Study examines dialect bias in LLMs between Standard American English and African-American English, tests mitigation strategies including prompt engineering and multi-agent architectures, finds stereotype differences persist but can be mitigated with specific approaches.

DetailsMotivation: LLMs exhibit discriminatory behavior and stereotype-based inferences based on dialect, particularly between Standard American English (SAE) and African-American English (AAE). The research aims to replicate existing analyses and investigate effective mitigation strategies for this dialect bias.

Method: Replicated dialect-sensitive stereotype generation analyses using eight prompt templates covering different bias manifestations (names, jobs, adjectives). Tested mitigation strategies including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures (generate-critique-revise models). Used LLM-as-judge approach to evaluate bias across multiple models.

Result: Stereotype-bearing differences emerged between SAE- and AAE-related outputs across all template categories, with strongest effects in adjective and job attribution. Baseline disparities varied by model (largest in Claude Haiku, smallest in Phi-4 Mini). Chain-Of-Thought prompting effectively mitigated bias for Claude Haiku, while multi-agent architectures provided consistent mitigation across all models.

Conclusion: For intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies and workflow-level controls like agentic architectures with critique models in high-impact LLM deployments. Results are exploratory but suggest pathways for bias mitigation.

Abstract: Many works in the literature show that LLM outputs exhibit discriminatory behaviour, triggering stereotype-based inferences based on the dialect in which the inputs are written. This bias has been shown to be particularly pronounced when the same inputs are provided to LLMs in Standard American English (SAE) and African-American English (AAE). In this paper, we replicate existing analyses of dialect-sensitive stereotype generation in LLM outputs and investigate the effects of mitigation strategies, including prompt engineering (role-based and Chain-Of-Thought prompting) and multi-agent architectures composed of generate-critique-revise models. We define eight prompt templates to analyse different ways in which dialect bias can manifest, such as suggested names, jobs, and adjectives for SAE or AAE speakers. We use an LLM-as-judge approach to evaluate the bias in the results. Our results show that stereotype-bearing differences emerge between SAE- and AAE-related outputs across all template categories, with the strongest effects observed in adjective and job attribution. Baseline disparities vary substantially by model, with the largest SAE-AAE differential observed in Claude Haiku and the smallest in Phi-4 Mini. Chain-Of-Thought prompting proved to be an effective mitigation strategy for Claude Haiku, whereas the use of a multi-agent architecture ensured consistent mitigation across all the models. These findings suggest that for intersectionality-informed software engineering, fairness evaluation should include model-specific validation of mitigation strategies, and workflow-level controls (e.g., agentic architectures involving critique models) in high-impact LLM deployments. The current results are exploratory in nature and limited in scope, but can lead to extensions and replications by increasing the dataset size and applying the procedure to different languages or dialects.

[366] Memento-Skills: Let Agents Design Agents

Huichi Zhou, Siyuan Guo, Anjie Liu, Zhongwei Yu, Ziqin Gong, Bowen Zhao, Zhixun Chen, Menglong Zhang, Yihang Chen, Jinsong Li, Runyu Yang, Qiangbin Liu, Xinlei Yu, Jianmin Zhou, Na Wang, Chunyang Sun, Jun Wang

Main category: cs.AI

TL;DR: Memento-Skills is a generalist LLM agent system that autonomously designs, adapts, and improves task-specific agents through experience using memory-based reinforcement learning with stateful prompts and reusable skills.

Motivation: To create a generalist LLM agent system that can autonomously design and improve task-specific agents through experience, enabling continual learning without updating LLM parameters.

Method: Memory-based reinforcement learning framework with stateful prompts and reusable skills stored as structured markdown files. Uses Read-Write Reflective Learning mechanism: read phase selects relevant skills via behavior-trainable skill router; write phase updates and expands skill library based on new experience.

Result: Achieved 26.2% relative improvement on General AI Assistants benchmark and 116.2% relative improvement on Humanity’s Last Exam, demonstrating sustained gains through iterative skill generation and refinement.

Conclusion: Memento-Skills enables generalist agents to design agents end-to-end for new tasks through continual learning without LLM parameter updates, showing significant performance improvements on challenging benchmarks.

Abstract: We introduce Memento-Skills, a generalist, continually-learnable LLM agent system that functions as an agent-designing agent: it autonomously constructs, adapts, and improves task-specific agents through experience. The system is built on a memory-based reinforcement learning framework with stateful prompts, where reusable skills (stored as structured markdown files) serve as persistent, evolving memory. These skills encode both behaviour and context, enabling the agent to carry forward knowledge across interactions. Starting from simple elementary skills (like Web search and terminal operations), the agent continually improves via the Read–Write Reflective Learning mechanism introduced in Memento2 [wang2025memento2]. In the read phase, a behaviour-trainable skill router selects the most relevant skill conditioned on the current stateful prompt; in the write phase, the agent updates and expands its skill library based on new experience. This closed-loop design enables continual learning without updating LLM parameters, as all adaptation is realised through the evolution of externalised skills and prompts. Unlike prior approaches that rely on human-designed agents, Memento-Skills enables a generalist agent to design agents end-to-end for new tasks. Through iterative skill generation and refinement, the system progressively improves its own capabilities. Experiments on the General AI Assistants benchmark and Humanity’s Last Exam demonstrate sustained gains, achieving 26.2% and 116.2% relative improvements in overall accuracy, respectively. Code is available at https://github.com/Memento-Teams/Memento-Skills.
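
The read–write loop described above can be caricatured in a few lines. The keyword-overlap router and dict-based library below are hypothetical stand-ins for illustration only; the released system stores skills as structured markdown files and uses a behaviour-trainable router, not this heuristic:

```python
def read_skill(library, prompt):
    """Read phase: pick the skill whose description overlaps most with the
    current stateful prompt (a toy stand-in for the trainable router)."""
    words = set(prompt.lower().split())
    return max(library,
               key=lambda s: len(words & set(s["description"].lower().split())))

def write_skill(library, name, description):
    """Write phase: grow the library with a skill distilled from new experience."""
    library.append({"name": name, "description": description})
    return library

# Elementary skills, as in the paper's starting point.
library = [
    {"name": "web_search", "description": "search the web for facts"},
    {"name": "terminal", "description": "run shell commands in a terminal"},
]
chosen = read_skill(library, "please search for recent facts on the web")
write_skill(library, "csv_parser", "parse csv tables from files")
```

The closed loop then alternates: route to a skill, act, and write back what was learned, so adaptation lives entirely in the externalized library rather than in LLM weights.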

[367] NeuroGame Transformer: Gibbs-Inspired Attention Driven by Game Theory and Statistical Physics

Djamel Bouchaffra, Fayçal Ykhlef, Hanene Azzag, Mustapha Lebbah, Bilal Faye

Main category: cs.AI

TL;DR: NGT introduces a novel transformer architecture that treats tokens as players in cooperative games and spins in statistical physics, using Shapley values and Banzhaf indices to model higher-order token dependencies beyond pairwise attention.

Motivation: Standard transformer attention mechanisms are limited to pairwise token interactions, which prevents modeling of higher-order dependencies among tokens. The authors aim to overcome this limitation by integrating game theory and statistical physics concepts into attention mechanisms.

Method: NGT treats tokens as players in cooperative games and interacting spins in a statistical physics system. It quantifies token importance using Shapley values (global attribution) and Banzhaf indices (local influence), combined via a learnable gating parameter. Pairwise interaction potentials capture synergistic relationships, with system energy following an Ising Hamiltonian. Attention weights emerge as marginal probabilities under Gibbs distribution, computed via mean-field equations. Importance-weighted Monte Carlo estimators handle exponential coalition spaces efficiently.

Result: NGT achieves strong performance on SNLI (86.4% test accuracy, 86.6% peak validation accuracy) and MNLI-matched, outperforming ALBERT-Base and remaining competitive with RoBERTa-Base. It demonstrates effectiveness in modeling higher-order token dependencies while maintaining scalability.

Conclusion: The NeuroGame Transformer successfully integrates game theory and statistical physics to model higher-order token dependencies in transformers, providing a theoretically grounded and scalable alternative to standard pairwise attention mechanisms with competitive performance on natural language inference tasks.

Abstract: Standard attention mechanisms in transformers are limited by their pairwise formulation, which hinders the modeling of higher-order dependencies among tokens. We introduce the NeuroGame Transformer (NGT) to overcome this by reconceptualizing attention through a dual perspective: tokens are treated simultaneously as players in a cooperative game and as interacting spins in a statistical physics system. Token importance is quantified using two complementary game-theoretic concepts – Shapley values for global, permutation-based attribution and Banzhaf indices for local, coalition-level influence. These are combined via a learnable gating parameter to form an external magnetic field, while pairwise interaction potentials capture synergistic relationships. The system’s energy follows an Ising Hamiltonian, with attention weights emerging as marginal probabilities under the Gibbs distribution, efficiently computed via mean-field equations. To ensure scalability despite the exponential coalition space, we develop importance-weighted Monte Carlo estimators with Gibbs-distributed weights. This approach avoids explicit exponential factors, ensuring numerical stability for long sequences. We provide theoretical convergence guarantees and characterize the fairness-sensitivity trade-off governed by the interpolation parameter. Experimental results demonstrate that the NeuroGame Transformer achieves strong performance across SNLI and MNLI-matched, outperforming some major efficient transformer baselines. On SNLI, it attains a test accuracy of 86.4% (with a peak validation accuracy of 86.6%), surpassing ALBERT-Base and remaining highly competitive with RoBERTa-Base. Code is available at https://github.com/dbouchaffra/NeuroGame-Transformer.
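
Shapley values are defined as averages over all player orderings, which is intractable exactly; the standard remedy is Monte Carlo sampling over permutations. The sketch below shows the generic permutation estimator with a toy additive value function, not the paper's importance-weighted Gibbs variant:

```python
import random

def monte_carlo_shapley(players, coalition_value, num_samples=2000, seed=0):
    """Estimate Shapley values by averaging each player's marginal
    contribution over randomly sampled permutations."""
    rng = random.Random(seed)
    shapley = {p: 0.0 for p in players}
    for _ in range(num_samples):
        order = list(players)
        rng.shuffle(order)
        coalition = []
        prev_value = coalition_value(coalition)
        for p in order:
            coalition.append(p)
            value = coalition_value(coalition)
            shapley[p] += value - prev_value  # marginal contribution of p
            prev_value = value
    return {p: s / num_samples for p, s in shapley.items()}

# Toy additive game: a coalition is worth the sum of its members' weights,
# so the Shapley value of each player is exactly its own weight.
weights = {"a": 1.0, "b": 2.0, "c": 3.0}
phi = monte_carlo_shapley(list(weights),
                          lambda c: sum(weights[p] for p in c))
```

In an attention setting the value function would score a token coalition's contribution to the model output, which is where the paper's Gibbs-weighted sampling replaces the uniform permutations used here.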

[368] A Concept is More Than a Word: Diversified Unlearning in Text-to-Image Diffusion Models

Duc Hao Pham, Van Duy Truong, Duy Khanh Dinh, Tien Cuong Nguyen, Dien Hy Ngo, Tuan Anh Bui

Main category: cs.AI

TL;DR: Diversified Unlearning improves concept removal in text-to-image diffusion models by using diverse prompts instead of single keywords, achieving better erasure and concept retention.

Motivation: Current concept unlearning methods rely on keywords, which are limited because visual concepts are multi-dimensional, expressed in diverse textual forms, and overlap with related concepts in latent space. Single keywords represent only narrow point estimates, failing to cover full semantic distributions and leading to brittle unlearning with over-forgetting.

Method: Proposes Diversified Unlearning, a distributional framework that represents concepts through sets of contextually diverse prompts rather than single keywords. This richer representation enables more precise and robust unlearning. The approach can be integrated as an add-on component into existing unlearning pipelines.

Result: Extensive experiments across multiple benchmarks and state-of-the-art baselines show consistent improvements: stronger erasure of target concepts, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks.

Conclusion: Diversified Unlearning addresses limitations of keyword-based concept unlearning by using diverse prompt representations, leading to more precise, robust, and effective concept removal in text-to-image diffusion models while preserving unrelated content.

Abstract: Concept unlearning has emerged as a promising direction for reducing the risks of harmful content generation in text-to-image diffusion models by selectively erasing undesirable concepts from a model’s parameters. Existing approaches typically rely on keywords to identify the target concept to be unlearned. However, we show that this keyword-based formulation is inherently limited: a visual concept is multi-dimensional, can be expressed in diverse textual forms, and often overlaps with related concepts in the latent space, making keyword-only unlearning, which only imprecisely indicates the target concept, brittle and prone to over-forgetting. This occurs because a single keyword represents only a narrow point estimate of the concept, failing to cover its full semantic distribution and entangled variations in the latent space. To address this limitation, we propose Diversified Unlearning, a distributional framework that represents a concept through a set of contextually diverse prompts rather than a single keyword. This richer representation enables more precise and robust unlearning. Through extensive experiments across multiple benchmarks and state-of-the-art baselines, we demonstrate that integrating Diversified Unlearning as an add-on component into existing unlearning pipelines consistently achieves stronger erasure, better retention of unrelated concepts, and improved robustness against adversarial recovery attacks.

[369] Proceedings of the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

Nitay Alon, Joseph M. Barnby, Reuth Mirsky, Stefan Sarkadi

Main category: cs.AI

TL;DR: Proceedings volume from AAAI 2026 workshop on Theory of Mind and AI, containing selected papers from the event

Motivation: To create an open access curated anthology for the Theory of Mind and AI research community, compiling selected papers from the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind

Method: This is a proceedings volume that collects and curates selected papers presented at the workshop, providing open access publication of the research

Result: A published volume containing research papers on Theory of Mind applications in AI, presented at AAAI 2026 workshop

Conclusion: The volume serves as a resource for the ToM and AI research community, documenting current research presented at the workshop

Abstract: This volume includes a selection of papers presented at the 2nd Workshop on Advancing Artificial Intelligence through Theory of Mind held at AAAI 2026 in Singapore on 26th January 2026. The purpose of this volume is to provide an open access and curated anthology for the ToM and AI research community.

[370] dTRPO: Trajectory Reduction in Policy Optimization of Diffusion Large Language Models

Wenxuan Zhang, Lemeng Wu, Changsheng Zhao, Ernie Chang, Mingchen Zhuge, Zechun Liu, Andy Su, Hanxian Huang, Jun Chen, Chong Zhou, Raghuraman Krishnamoorthi, Vikas Chandra, Mohamed Elhoseiny, Wei Wen

Main category: cs.AI

TL;DR: dTRPO is a policy optimization method for diffusion LLMs that reduces trajectory probability calculation costs through two reduction strategies, enabling efficient offline training and improving performance on instruction-following and reasoning tasks.

Motivation: Diffusion LLMs present new challenges for alignment with human preferences due to their different generation paradigm. Current policy optimization methods for dLLMs are computationally expensive for trajectory probability calculations, limiting scaled-up offline training.

Method: Proposes Trajectory Reduction Policy Optimization (dTRPO) with two key strategies: 1) Using probability ratio of newly unmasked tokens as unbiased estimate of intermediate diffusion states under reference policy regularization, and 2) Estimating full trajectory probability with single forward pass of re-masked final state.

Result: dTRPO substantially improves performance of 7B dLLMs: up to 9.6% gains on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Also shows strong training efficiency and improved generation efficiency through high-quality outputs.

Conclusion: dTRPO enables efficient offline policy optimization for diffusion LLMs through trajectory reduction strategies, achieving significant performance improvements across multiple benchmarks while maintaining training and generation efficiency.

Abstract: Diffusion Large Language Models (dLLMs) introduce a new paradigm for language generation, which in turn presents new challenges for aligning them with human preferences. In this work, we aim to improve the policy optimization for dLLMs by reducing the cost of the trajectory probability calculation, thereby enabling scaled-up offline policy training. We prove that: (i) under reference policy regularization, the probability ratio of the newly unmasked tokens is an unbiased estimate of that of intermediate diffusion states, and (ii) the probability of the full trajectory can be effectively estimated with a single forward pass of a re-masked final state. By integrating these two trajectory reduction strategies into a policy optimization objective, we propose Trajectory Reduction Policy Optimization (dTRPO). We evaluate dTRPO on 7B dLLMs across instruction-following and reasoning benchmarks. Results show that it substantially improves the core performance of state-of-the-art dLLMs, achieving gains of up to 9.6% on STEM tasks, up to 4.3% on coding tasks, and up to 3.0% on instruction-following tasks. Moreover, dTRPO exhibits strong training efficiency due to its offline, single-forward nature, and achieves improved generation efficiency through high-quality outputs.

[371] Can LLM generate interesting mathematical research problems?

Xiaoyang Chen, Xiang Jiang

Main category: cs.AI

TL;DR: LLM agent generates novel mathematical research problems in differential geometry, with 665 problems created and verified by experts as unknown and valuable.

Motivation: To investigate whether LLMs can generate valuable and cutting-edge mathematical research problems, building on previous work on evaluating mathematical creativity in LLMs.

Method: Developed an agent to generate unknown mathematical problems in differential geometry, produced 665 research problems, and conducted human verification by experts.

Result: Many generated mathematical problems were unknown to experts and possessed unique research value, demonstrating LLMs’ capability to produce novel mathematical research questions.

Conclusion: LLMs can generate valuable and cutting-edge mathematical research problems, showing potential for AI-assisted mathematical discovery and creativity.

Abstract: This paper is the second in a series of works on the mathematical creativity of LLMs. In the first paper, the authors proposed three criteria for evaluating the mathematical creativity of LLMs and constructed a benchmark dataset to measure it. This paper further explores the mathematical creativity of LLMs, with a focus on investigating whether they can generate valuable and cutting-edge mathematical research problems. We develop an agent to generate unknown problems and produce 665 research problems in differential geometry. Through human verification, we find that many of these mathematical problems are unknown to experts and possess unique research value.

[372] ProRL Agent: Rollout-as-a-Service for RL Training of Multi-Turn LLM Agents

Hao Zhang, Mingjie Liu, Shaokun Zhang, Songyang Han, Jian Hu, Zhenghui Jin, Yuchi Zhang, Shizhe Diao, Ximing Lu, Binfeng Xu, Zhiding Yu, Jan Kautz, Yi Dong

Main category: cs.AI

TL;DR: ProRL Agent is a scalable infrastructure for multi-turn LLM agent reinforcement learning that provides rollout-as-a-service API and standardized sandbox environments for diverse agentic tasks.

Motivation: Existing RL training infrastructures for multi-turn LLM agents couple rollout orchestration with training loops, making systems hard to migrate and maintain, and requiring large numbers of sandboxed rollout trajectories.

Method: Proposed ProRL Agent infrastructure with rollout-as-a-service philosophy, providing API service for full agentic rollout lifecycle and standardized, extensible sandbox environments that support diverse agentic tasks in rootless HPC settings.

Result: Validated through RL training on software engineering, math, STEM, and coding tasks; integrated as part of NVIDIA NeMo Gym and open-sourced.

Conclusion: ProRL Agent provides a scalable, maintainable infrastructure for RL training of multi-turn LLM agents with standardized environments and rollout-as-a-service architecture.

Abstract: Multi-turn LLM agents are increasingly important for solving complex, interactive tasks, and reinforcement learning (RL) is a key ingredient for improving their long-horizon behavior. However, RL training requires generating large numbers of sandboxed rollout trajectories, and existing infrastructures often couple rollout orchestration with the training loop, making systems hard to migrate and maintain. Under the rollout-as-a-service philosophy, we present ProRL Agent, a scalable infrastructure that serves the full agentic rollout lifecycle through an API service. ProRL Agent also provides standardized and extensible sandbox environments that support diverse agentic tasks in rootless HPC settings. We validate ProRL Agent through RL training on software engineering, math, STEM, and coding tasks. ProRL Agent is open-sourced and integrated as part of NVIDIA NeMo Gym.

[373] RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

Main category: cs.AI

TL;DR: RewardFlow is a lightweight method for estimating state-level rewards in agentic reasoning tasks by leveraging topological structure of states in reasoning trajectories through graph construction and propagation.

Motivation: RL can enhance LLM agentic reasoning but suffers from sparse terminal rewards. Process reward modeling helps but requires expensive dedicated reward models. Need lightweight method for state-level reward estimation.

Method: Constructs state graphs from reasoning trajectories to analyze topological structure. Uses graph propagation to quantify state-wise contributions to success, yielding objective state-level rewards for RL optimization.

Result: Outperforms prior RL baselines across four agentic reasoning benchmarks, showing superior performance, robustness, and training efficiency when used as dense rewards for RL.

Conclusion: RewardFlow provides effective lightweight solution for state-level reward estimation in agentic reasoning, enabling better RL optimization for LLMs with external environments.

Abstract: Reinforcement learning (RL) holds significant promise for enhancing the agentic reasoning capabilities of large language models (LLMs) with external environments. However, the inherent sparsity of terminal rewards hinders fine-grained, state-level optimization. Although process reward modeling offers a promising alternative, training dedicated reward models often entails substantial computational costs and scaling difficulties. To address these challenges, we introduce RewardFlow, a lightweight method for estimating state-level rewards tailored to agentic reasoning tasks. RewardFlow leverages the intrinsic topological structure of states within reasoning trajectories by constructing state graphs. This enables an analysis of state-wise contributions to success, followed by topology-aware graph propagation to quantify contributions and yield objective, state-level rewards. When integrated as dense rewards for RL optimization, RewardFlow substantially outperforms prior RL baselines across four agentic reasoning benchmarks, demonstrating superior performance, robustness, and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
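
As a rough illustration of turning a sparse terminal reward into dense state-level credit on a trajectory graph, the sketch below diffuses reward backwards along edges with a decay factor. This is a generic diffusion scheme under assumed data structures, not RewardFlow's topology-aware propagation rule:

```python
def propagate_rewards(edges, terminal_rewards, num_iters=50, decay=0.9):
    """edges: dict state -> list of successor states.
    terminal_rewards: dict state -> scalar reward at terminal states.
    Each state inherits discounted credit from its best successor."""
    states = set(edges) | {s for succs in edges.values() for s in succs}
    reward = {s: terminal_rewards.get(s, 0.0) for s in states}
    for _ in range(num_iters):
        new_reward = dict(reward)
        for s, succs in edges.items():
            if succs:
                backed_up = decay * max(reward[t] for t in succs)
                new_reward[s] = max(new_reward[s], backed_up)
        reward = new_reward
    return reward

# Two trajectories sharing a prefix; only one reaches the goal, so
# states on the successful branch accumulate credit while the
# dead-end branch stays at zero.
edges = {"s0": ["s1", "s2"], "s1": ["goal"], "s2": ["dead_end"]}
dense = propagate_rewards(edges, {"goal": 1.0})
```

The resulting per-state scores can then serve as dense rewards in RL, which is the role RewardFlow's graph-propagated contributions play in the paper.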

[374] Conflict-Based Search for Multi Agent Path Finding with Asynchronous Actions

Xuemian Wu, Shizhe Zhao, Zhongqiang Ren

Main category: cs.AI

TL;DR: CBS-AA: A new complete and optimal algorithm for Multi-Agent Path Finding with Asynchronous Actions that overcomes limitations of previous approaches.

Motivation: Most MAPF algorithms assume synchronized actions, which limits practical applications. Existing Continuous-time Conflict-Based Search (CCBS) was found to be incomplete due to infinite state space from continuous wait durations.

Method: Proposes Conflict-Based Search with Asynchronous Actions (CBS-AA) that bypasses the theoretical issues of CCBS by handling asynchronous actions while maintaining completeness and optimality. Also develops conflict resolution techniques to improve scalability.

Result: CBS-AA achieves completeness and solution optimality guarantees for MAPF-AA. The conflict resolution techniques reduce the number of branches by up to 90% in tests.

Conclusion: CBS-AA provides a theoretically sound and practical solution for MAPF with asynchronous actions, overcoming previous limitations and improving computational efficiency.

Abstract: Multi-Agent Path Finding (MAPF) seeks collision-free paths for multiple agents from their respective start locations to their respective goal locations while minimizing path costs. Most existing MAPF algorithms rely on a common assumption of synchronized actions, where the actions of all agents start at the same time and always take a time unit, which may limit the use of MAPF planners in practice. To get rid of this assumption, Continuous-time Conflict-Based Search (CCBS) is a popular approach that can find optimal solutions for MAPF with asynchronous actions (MAPF-AA). However, CCBS has recently been identified to be incomplete due to an uncountably infinite state space created by continuous wait durations. This paper proposes a new method, Conflict-Based Search with Asynchronous Actions (CBS-AA), which bypasses this theoretical issue and can solve MAPF-AA with completeness and solution optimality guarantees. Based on CBS-AA, we also develop conflict resolution techniques to improve the scalability of CBS-AA further. Our test results show that our method can reduce the number of branches by up to 90%.
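
For readers unfamiliar with the CBS family, the classical discrete-time high-level loop looks roughly as follows. This is a skeleton with pluggable low-level search and conflict handling; the paper's CBS-AA handles asynchronous, continuous-duration actions, which this sketch does not:

```python
import heapq
import itertools

def cbs(agents, low_level, find_conflict, split):
    """Skeleton of classical Conflict-Based Search.
    low_level(agent, constraints) -> path or None
    find_conflict(paths) -> conflict or None
    split(conflict) -> list of (agent, constraint) pairs, one per branch."""
    counter = itertools.count()  # tie-breaker so heap never compares dicts
    root = {a: low_level(a, []) for a in agents}
    open_list = [(sum(len(p) for p in root.values()), next(counter), [], root)]
    while open_list:
        cost, _, constraints, paths = heapq.heappop(open_list)
        conflict = find_conflict(paths)
        if conflict is None:
            return paths  # conflict-free joint solution
        for agent, constraint in split(conflict):
            child_constraints = constraints + [(agent, constraint)]
            relevant = [c for a, c in child_constraints if a == agent]
            new_path = low_level(agent, relevant)
            if new_path is None:
                continue  # this branch is infeasible
            child_paths = dict(paths, **{agent: new_path})
            child_cost = sum(len(p) for p in child_paths.values())
            heapq.heappush(open_list, (child_cost, next(counter),
                                       child_constraints, child_paths))
    return None

# Trivial instance: each agent's fixed path is already conflict-free,
# so the root node is returned immediately.
paths = cbs(["a1", "a2"],
            low_level=lambda a, cons: [a + "_start", a + "_goal"],
            find_conflict=lambda p: None,
            split=lambda c: [])
```

The incompleteness issue CBS-AA addresses lives inside the asynchronous analogues of `find_conflict` and `split`, where continuous wait durations make the naive constraint space uncountable.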

[375] Bridging Network Fragmentation: A Semantic-Augmented DRL Framework for UAV-aided VANETs

Gaoxiang Cao, Wenke Yuan, Huasen He, Yunpeng Hou, Xiaofeng Jiang, Shuangwu Chen, Jian Yang

Main category: cs.AI

TL;DR: SA-DRL framework combines LLMs’ semantic reasoning with DRL for UAV deployment in VANETs, using road topology understanding to improve connectivity and efficiency.

Motivation: VANETs for autonomous driving suffer from network fragmentation in urban areas due to physical obstructions. UAVs can bridge connectivity gaps, but traditional DRL-based deployment lacks semantic understanding of road topology, leading to inefficient exploration.

Method: Proposes Semantic-Augmented DRL (SA-DRL) framework: 1) fragmentation quantification using Road Topology Graphs and Dual Connected Graphs, 2) four-stage pipeline to transform general LLM into domain-specific topology expert, 3) Semantic-Augmented PPO algorithm with Logit Fusion mechanism to inject LLM’s semantic reasoning as policy prior.

Result: SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance using only 26.6% of training episodes, improves two key connectivity metrics by 13.2% and 23.5%, and reduces energy consumption to 28.2% of baseline.

Conclusion: Combining LLMs’ semantic reasoning with DRL for UAV deployment in VANETs significantly improves connectivity, efficiency, and energy consumption by leveraging topological understanding of road networks.

Abstract: Vehicular Ad-hoc Networks (VANETs) are the digital cornerstone of autonomous driving, yet they suffer from severe network fragmentation in urban environments due to physical obstructions. Unmanned Aerial Vehicles (UAVs), with their high mobility, have emerged as a vital solution to bridge these connectivity gaps. However, traditional Deep Reinforcement Learning (DRL)-based UAV deployment strategies lack semantic understanding of road topology, often resulting in blind exploration and sample inefficiency. By contrast, Large Language Models (LLMs) possess powerful reasoning capabilities capable of identifying topological importance, though applying them to control tasks remains challenging. To address this, we propose the Semantic-Augmented DRL (SA-DRL) framework. Firstly, we propose a fragmentation quantification method based on Road Topology Graphs (RTG) and Dual Connected Graphs (DCG). Subsequently, we design a four-stage pipeline to transform a general-purpose LLM into a domain-specific topology expert. Finally, we propose the Semantic-Augmented PPO (SA-PPO) algorithm, which employs a Logit Fusion mechanism to inject the LLM’s semantic reasoning directly into the policy as a prior, effectively guiding the agent toward critical intersections. Extensive high-fidelity simulations demonstrate that SA-PPO achieves state-of-the-art performance with remarkable efficiency, reaching baseline performance levels using only 26.6% of the training episodes. Ultimately, SA-PPO improves two key connectivity metrics by 13.2% and 23.5% over competing methods, while reducing energy consumption to just 28.2% of the baseline.
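
One plausible reading of a logit-fusion prior is a weighted sum of policy and LLM logits before the softmax; the abstract does not specify the exact rule, so the additive form and the `alpha` weight below are assumptions made for illustration:

```python
import math

def fuse_logits(policy_logits, prior_logits, alpha=0.5):
    """Add a weighted semantic prior to the policy logits, then
    softmax (illustrative sketch; SA-PPO's exact fusion may differ)."""
    fused = [p + alpha * q for p, q in zip(policy_logits, prior_logits)]
    m = max(fused)                      # shift for numerical stability
    exps = [math.exp(f - m) for f in fused]
    total = sum(exps)
    return [e / total for e in exps]

# An uninformed policy over three deployment actions; the LLM prior
# boosts action 2 (say, a topologically critical intersection),
# shifting probability mass toward it.
probs = fuse_logits([1.0, 1.0, 1.0], [0.0, 0.0, 2.0], alpha=1.0)
```

Because the prior enters before the softmax, it acts as a soft bias rather than a hard mask: the DRL policy can still override it as its own logits sharpen with training.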

[376] Geography According to ChatGPT – How Generative AI Represents and Reasons about Geography

Krzysztof Janowicz, Gengchen Mai, Rui Zhu, Song Gao, Zhangyu Wang, Yingjie Hu, Lauren Bennett

Main category: cs.AI

TL;DR: The paper examines how AI systems represent and reason about geography, arguing that understanding AI’s constructed worldviews is as important as evaluating factual accuracy, with three illustrative probes exploring model defaults, distributional shifts, and deeper understanding beyond factual recall.

Motivation: As the public increasingly interacts with spaces and places through AI systems, and researchers rely on pre-trained models, understanding how AI represents and reasons about geography is crucial. The paper argues that examining AI's constructed worldviews is as important as evaluating factual accuracy.

Method: The paper provides three illustrative vignettes/exploratory probes: (1) examining whether models form strong defaults and how brittle outputs are to minute syntactic variations, (2) investigating if distributional shifts can emerge from composition of individually benign tasks (like creating personas), and (3) exploring whether focusing solely on factual recall overlooks deeper questions of understanding.

Result: The paper presents conceptual probes rather than empirical results, aiming to spark discussion about how AI systems construct geographic understanding and the limitations of current evaluation approaches that focus primarily on factual recall.

Conclusion: Understanding AI’s geographic representations and reasoning is critical, and current evaluation methods focusing on factual recall may miss deeper issues. The paper calls for more research into how AI constructs worldviews and the potential biases and limitations in geographic understanding.

Abstract: Understanding how AI will represent and reason about geography should be a key concern for all of us, as the broader public increasingly interacts with spaces and places through these systems. Similarly, in line with the nature of foundation models, our own research often relies on pre-trained models. Hence, understanding what world AI systems construct is as important as evaluating their accuracy, including factual recall. To motivate the need for such studies, we provide three illustrative vignettes, i.e., exploratory probes, in the hope that they will spark lively discussions and follow-up work: (1) Do models form strong defaults, and how brittle are model outputs to minute syntactic variations? (2) Can distributional shifts resurface from the composition of individually benign tasks, e.g., when using AI systems to create personas? (3) Do we overlook deeper questions of understanding when solely focusing on the ability of systems to recall facts such as geographic principles?

[377] Reasoning over mathematical objects: on-policy reward modeling and test time aggregation

Pranjal Aggarwal, Marjan Ghazvininejad, Seungone Kim, Ilia Kulikov, Jack Lanchantin, Xian Li, Tianjian Li, Bo Liu, Graham Neubig, Anaelia Ovalle, Swarnadeep Saha, Sainbayar Sukhbaatar, Sean Welleck, Jason Weston, Chenxi Whitehouse, Adina Williams, Jing Xu, Ping Yu, Weizhe Yuan, Jingyu Zhang, Wenting Zhao

Main category: cs.AI

TL;DR: The Principia suite introduces training data and benchmarks for deriving mathematical objects, with training recipes using LLM-judges and verifiers that show on-policy training boosts performance and enables test-time compute scaling.

Motivation: Current LM evaluations for mathematical and scientific reasoning rely on simplified answer formats (numerical values or multiple choice) due to automated assessment convenience, but precise derivation of mathematical objects is crucial for downstream STEM applications.

Method: Three contributions: (1) Build and release training data and benchmarks for deriving mathematical objects (Principia suite); (2) Provide training recipes with strong LLM-judges and verifiers, showing on-policy judge training boosts performance; (3) Show how on-policy training can scale test-time compute via aggregation.

Result: Strong LMs like Qwen3-235B and o3 struggle on Principia, but the training recipes bring significant improvements across different LLM backbones while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.

Conclusion: The Principia suite addresses the gap in evaluating precise mathematical object derivation, with training methods that enhance reasoning capabilities and generalize across answer formats, advancing LM capabilities for STEM applications.

Abstract: The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet, current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats such as numerical values or multiple choice options due to the convenience of automated assessment. In this paper we provide three contributions for improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM-judges and verifiers, where we show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, while our training recipes can bring significant improvements over different LLM backbones, while simultaneously improving results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.

[378] Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

Nicolas Martorell

Main category: cs.AI

TL;DR: LLMs can track their own internal emotive states through numeric self-reports, which show causal coupling with probe-defined states and improve with model size.

DetailsMotivation: Current methods for tracking LLM internal states (like linear probes) are imperfect and scale poorly with model size. Inspired by human psychology's use of numeric self-reports, researchers explore whether LLMs can introspectively track their own emotive states.

Method: Study four concept pairs (wellbeing, interest, focus, impulsivity) in 40 ten-turn conversations. Use logit-based self-reports (instead of greedy-decoded) to measure introspection as causal informational coupling between self-reports and probe-defined internal states. Test with LLaMA-3.2-3B-Instruct and LLaMA-3.1-8B-Instruct, using activation steering to confirm causality.

Result: Logit-based self-reports effectively track internal states (Spearman ρ=0.40-0.76; isotonic R²=0.12-0.54 in 3B model). Introspection evolves through conversation and can be selectively improved via activation steering (ΔR² up to 0.30). Scaling improves performance (R²≈0.93 in 8B model).

Conclusion: Numeric self-reports are a viable complementary tool for tracking internal emotive states in conversational AI, offering advantages over traditional white-box methods and showing promising scaling properties.

Abstract: Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs’ own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model’s self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
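The paper's key measurement trick is to replace the greedy-decoded numeric answer with an expectation over the logits of the candidate scale tokens. A minimal sketch of that idea, assuming a hypothetical mapping from scale values to token ids (the paper's exact tokenization and scale range are not specified here):

```python
import numpy as np

def logit_based_self_report(logits, scale_token_ids):
    """Probability-weighted self-report over a numeric scale.

    Instead of greedy decoding (argmax), which the paper reports
    collapses outputs to a few uninformative values, take the
    expectation of the scale values under the softmax of the
    logits restricted to the numeric tokens.
    """
    scale_values = np.array(sorted(scale_token_ids))
    ids = [scale_token_ids[v] for v in scale_values]
    z = np.array([logits[i] for i in ids], dtype=float)
    p = np.exp(z - z.max())          # numerically stable softmax
    p /= p.sum()
    return float(np.dot(scale_values, p))
```

With uniform logits over a 1-3 scale this yields the midpoint 2.0, while a peaked distribution moves the report toward the dominant value; the continuous output is what lets the metric correlate with probe-defined states.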

[379] Secure Linear Alignment of Large Language Models

Matt Gorbett, Suman Jana

Main category: cs.AI

TL;DR: A privacy-preserving framework for cross-silo inference between independent language models using linear alignment and homomorphic encryption, enabled by representational convergence across models.

DetailsMotivation: Language models are converging to similar representations despite different training, creating opportunities for cross-model alignment. This enables applications where security, privacy, or competitive constraints prevent direct data/model sharing, particularly in cross-silo inference scenarios.

Method: Proposes a framework that learns an affine transformation between independent models using a shared public dataset, then applies homomorphic encryption to protect client queries during inference. Only linear alignment and classification operations are encrypted to maintain sub-second latency. Also empirically investigates representational convergence by learning linear transformations between final hidden states of independent models.

Result: Minimal performance degradation in embedding classification and out-of-distribution detection across model pairs. Shows for the first time that linear alignment can enable text generation across independently trained models. The framework achieves sub-second inference latency while maintaining strong security guarantees.

Conclusion: Representational convergence enables practical privacy-preserving cross-silo inference between independent language models through linear alignment and homomorphic encryption, opening new application domains where data/model sharing is prohibited.

Abstract: Language models increasingly appear to learn similar representations, despite differences in training objectives, architectures, and data modalities. This emerging compatibility between independently trained models introduces new opportunities for cross-model alignment to downstream objectives. Moreover, it unlocks new potential application domains, such as settings where security, privacy, or competitive constraints prohibit direct data or model sharing. In this work, we propose a privacy-preserving framework that exploits representational convergence to enable cross-silo inference between independent language models. The framework learns an affine transformation over a shared public dataset and applies homomorphic encryption to protect client queries during inference. By encrypting only the linear alignment and classification operations, the method achieves sub-second inference latency while maintaining strong security guarantees. We support this framework with an empirical investigation into representational convergence, in which we learn linear transformations between the final hidden states of independent models. We evaluate these cross-model mappings on embedding classification and out-of-distribution detection, observing minimal performance degradation across model pairs. Additionally, we show for the first time that linear alignment sometimes enables text generation across independently trained models.
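The core of the framework is the affine transformation learned over a shared public dataset. A minimal sketch, assuming paired final-hidden-state matrices from the two models on the same inputs (the paper's exact fitting procedure and any regularization are not specified here); ordinary least squares stands in for whichever solver the authors use:

```python
import numpy as np

def fit_affine_alignment(X, Y):
    """Fit Y ≈ X @ W + b by ordinary least squares.

    X: (n, d_src) hidden states from the source model on a shared
    public dataset; Y: (n, d_tgt) hidden states from the target
    model on the same inputs. Returns (W, b).
    """
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    return coef[:-1], coef[-1]
```

Because the map is linear, it is the only part (together with the classifier head) that needs to run under homomorphic encryption at inference time, which is what keeps latency sub-second.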

[380] Agentic Business Process Management: A Research Manifesto

Diego Calvanese, Angelo Casciani, Giuseppe De Giacomo, Marlon Dumas, Fabiana Fournier, Timotheus Kampik, Emanuele La Malfa, Lior Limonad, Andrea Marrella, Andreas Metzger, Marco Montali, Daniel Amyot, Peter Fettke, Artem Polyvyanyy, Stefanie Rinderle-Ma, Sebastian Sardiña, Niek Tax, Barbara Weber

Main category: cs.AI

TL;DR: Agentic Business Process Management (APM) extends traditional BPM by governing autonomous agents (software/human) that perceive, reason, and act within explicit process frames, shifting from automation to constrained autonomy through process awareness.

DetailsMotivation: Traditional Business Process Management (BPM) focuses on automation but lacks frameworks for governing autonomous agents. As organizations increasingly deploy AI agents, there's a need for a paradigm shift that constrains agent autonomy while aligning it with organizational goals through process awareness.

Method: The paper proposes conceptual foundations for APM, introducing core abstractions and architectural elements. It elaborates on four key capabilities: framed autonomy, explainability, conversational actionability, and self-modification. These ensure agents’ goals align with organizational objectives while maintaining proactive behavior within defined boundaries.

Result: The manifesto establishes a roadmap for bridging BPM, AI, and multi-agent systems communities. It identifies research challenges requiring advances in these fields to realize APM systems that can effectively govern autonomous agents in organizational contexts.

Conclusion: APM represents a paradigm shift from automation-oriented BPM to systems where agent autonomy is constrained and aligned through process awareness. The framework provides foundations for developing practical APM systems that balance agent autonomy with organizational control.

Abstract: This paper presents a manifesto that articulates the conceptual foundations of Agentic Business Process Management (APM), an extension of Business Process Management (BPM) for governing autonomous agents executing processes in organizations. From a management perspective, APM represents a paradigm shift from the traditional process view of the business process, driven by the realization of process awareness and an agent-oriented abstraction, where software and human agents act as primary functional entities that perceive, reason, and act within explicit process frames. This perspective marks a shift from traditional, automation-oriented BPM toward systems in which autonomy is constrained, aligned, and made operational through process awareness. We introduce the core abstractions and architectural elements required to realize APM systems and elaborate on four key capabilities that such APM agents must support: framed autonomy, explainability, conversational actionability, and self-modification. These capabilities jointly ensure that agents’ goals are aligned with organizational goals and that agents behave in a framed yet proactive manner in pursuing those goals. We discuss the extent to which the capabilities can be realized and identify research challenges whose resolution requires further advances in BPM, AI, and multi-agent systems. The manifesto thus serves as a roadmap for bridging these communities and for guiding the development of APM systems in practice.

[381] Teleological Inference in Structural Causal Models via Intentional Interventions

Dario Compagno, Fabio Massimo Zennaro

Main category: cs.AI

TL;DR: SCMs can model teleological questions about goal-directed agents through intentional interventions and structural final models (SFMs), enabling detection of agents and discovery of their intentions.

DetailsMotivation: Previous approaches to modeling goal-directed agents in causal systems have limitations. The paper aims to extend structural causal models (SCMs) beyond causal questions to also address teleological questions about agents' intentions and goal-directed interventions.

Method: Introduces intentional interventions as a time-agnostic operator that induces a twin SCM called structural final model (SFM). SFMs treat observed values as outcomes of intentional interventions and relate them to counterfactual conditions (what would have happened without intervention).

Result: Demonstrates how SFMs can be used to empirically detect agents and discover their intentions from observational data, showing the framework’s practical applicability.

Conclusion: SCMs can be extended to model teleological questions about goal-directed agents through intentional interventions and SFMs, providing a unified framework for both causal and intentional analysis in complex systems.

Abstract: Structural causal models (SCMs) were conceived to formulate and answer causal questions. This paper shows that SCMs can also be used to formulate and answer teleological questions, concerning the intentions of a state-aware, goal-directed agent intervening in a causal system. We review limitations of previous approaches to modeling such agents, and then introduce intentional interventions, a new time-agnostic operator that induces a twin SCM we call a structural final model (SFM). SFMs treat observed values as the outcome of intentional interventions and relate them to the counterfactual conditions of those interventions (what would have happened had the agent not intervened). We show how SFMs can be used to empirically detect agents and to discover their intentions.

[382] Evaluating 5W3H Structured Prompting for Intent Alignment in Human-AI Interaction

Peng Gang

Main category: cs.AI

TL;DR: PPS (Prompt Protocol Specification) framework improves AI intent alignment using structured 5W3H representation, showing task-dependent benefits especially in high-ambiguity scenarios.

DetailsMotivation: Natural language prompts often suffer from intent transmission loss - the gap between what users need and what they communicate to AI systems. There's a need for better structured intent representation in human-AI interaction.

Method: Evaluated PPS framework using 5W3H-based structured intent representation in a three-condition study across 60 tasks in business, technical, and travel domains with three LLMs (DeepSeek-V3, Qwen-Max, Kimi). Compared simple prompts, raw PPS JSON, and natural-language-rendered PPS, collecting 540 AI outputs evaluated by an LLM judge with a new goal_alignment metric.

Result: Rendered PPS outperforms both simple prompts and raw JSON on goal_alignment metric. Gains are task-dependent: large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning. Also identified measurement asymmetry in standard LLM evaluation where unconstrained prompts can inflate constraint adherence scores. Preliminary survey shows 66.1% reduction in follow-up prompts required.

Conclusion: Structured intent representations like PPS can improve alignment and usability in human-AI interaction, especially in tasks where user intent is inherently ambiguous.

Abstract: Natural language prompts often suffer from intent transmission loss: the gap between what users actually need and what they communicate to AI systems. We evaluate PPS (Prompt Protocol Specification), a 5W3H-based framework for structured intent representation in human-AI interaction. In a controlled three-condition study across 60 tasks in three domains (business, technical, and travel), three large language models (DeepSeek-V3, Qwen-Max, and Kimi), and three prompt conditions - (A) simple prompts, (B) raw PPS JSON, and (C) natural-language-rendered PPS - we collect 540 AI-generated outputs evaluated by an LLM judge. We introduce goal_alignment, a user-intent-centered evaluation dimension, and find that rendered PPS outperforms both simple prompts and raw JSON on this metric. PPS gains are task-dependent: gains are large in high-ambiguity business analysis tasks but reverse in low-ambiguity travel planning. We also identify a measurement asymmetry in standard LLM evaluation, where unconstrained prompts can inflate constraint adherence scores and mask the practical value of structured prompting. A preliminary retrospective survey (N = 20) further suggests a 66.1% reduction in follow-up prompts required, from 3.33 to 1.13 rounds. These findings suggest that structured intent representations can improve alignment and usability in human-AI interaction, especially in tasks where user intent is inherently ambiguous.
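The study's conditions B and C differ only in whether the structured 5W3H intent is passed as raw JSON or rendered into natural language. A hypothetical illustration of that rendering step; the actual PPS field names, schema, and template are assumptions, not taken from the paper:

```python
# Hypothetical 5W3H intent record (condition B would pass this JSON
# directly; the real PPS schema is not specified in the abstract).
pps = {
    "who": "a regional sales manager",
    "what": "a quarterly revenue analysis",
    "when": "before the Q3 planning meeting",
    "where": "the EMEA market",
    "why": "to decide where to cut underperforming product lines",
    "how": "by comparing year-over-year revenue per product line",
    "how_much": "top 5 product lines only",
    "how_long": "a one-page summary",
}

def render_pps(p):
    """Render structured 5W3H intent as a natural-language prompt
    (condition C in the study), rather than raw JSON (condition B)."""
    return (
        f"As {p['who']}, produce {p['what']} for {p['where']} {p['when']}, "
        f"{p['why']}. Work {p['how']}, covering {p['how_much']}, "
        f"delivered as {p['how_long']}."
    )
```

The paper's finding is that this rendered form outperforms both the raw JSON and an unstructured prompt on goal_alignment, at least in high-ambiguity tasks.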

[383] Unmasking Algorithmic Bias in Predictive Policing: A GAN-Based Simulation Framework with Multi-City Temporal Analysis

Pronob Kumar Barman, Pronoy Kumar Barman

Main category: cs.AI

TL;DR: A simulation framework using GANs and patrol models reveals how predictive policing algorithms amplify racial disparities in crime detection, with significant bias metrics across Baltimore and Chicago data.

DetailsMotivation: To quantitatively understand how predictive policing systems encode and amplify racial disparities through the full enforcement pipeline from crime occurrence to police contact.

Method: Developed a reproducible simulation framework coupling a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model, using 145,000+ crime records from Baltimore (2017-2019) and 233,000+ from Chicago (2022), augmented with US Census ACS demographic data. Computed four monthly bias metrics across 264 city-year-mode observations.

Result: Revealed extreme and year-variant bias in Baltimore’s detected mode (mean annual DIR up to 157.14 in 2019), moderate under-detection of Black residents in Chicago (DIR=0.22), and persistent Gini coefficients of 0.43-0.62. CTGAN debiasing partially redistributes detection rates but cannot eliminate structural disparity without policy intervention.

Conclusion: Predictive policing systems significantly amplify racial disparities, with outcomes most sensitive to officer deployment levels. Algorithmic debiasing alone is insufficient without accompanying policy changes to address structural inequality.

Abstract: Predictive policing systems that direct patrol resources based on algorithmically generated crime forecasts have been widely deployed across US cities, yet their tendency to encode and amplify racial disparities remains poorly understood in quantitative terms. We present a reproducible simulation framework that couples a Generative Adversarial Network (GAN) with a Noisy OR patrol detection model to measure how racial bias propagates through the full enforcement pipeline from crime occurrence to police contact. Using 145,000+ Part 1 crime records from Baltimore (2017-2019) and 233,000+ records from Chicago (2022), augmented with US Census ACS demographic data, we compute four monthly bias metrics across 264 city-year-mode observations: the Disparate Impact Ratio (DIR), Demographic Parity Gap, Gini Coefficient, and a composite Bias Amplification Score. Our experiments reveal extreme and year-variant bias in Baltimore's detected mode, with mean annual DIR up to 157.14 in 2019, moderate under-detection of Black residents in Chicago (DIR = 0.22), and persistent Gini coefficients of 0.43 to 0.62 across all conditions. We further demonstrate that a Conditional Tabular GAN (CTGAN) debiasing approach partially redistributes detection rates but cannot eliminate structural disparity without accompanying policy intervention. Socioeconomic regression analysis confirms strong correlations between neighborhood racial composition and detection likelihood (Pearson r = 0.83 for percent White and r = -0.81 for percent Black). A sensitivity analysis over patrol radius, officer count, and citizen reporting probability reveals that outcomes are most sensitive to officer deployment levels. The code and data are publicly available at this repository.
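Three of the paper's four monthly bias metrics are standard fairness statistics over group detection rates. A minimal sketch of those three, assuming per-group and per-neighborhood detection rates as inputs (the composite Bias Amplification Score is paper-specific and omitted here):

```python
import numpy as np

def disparate_impact_ratio(rate_protected, rate_reference):
    """DIR: ratio of detection rates between groups; 1.0 means parity,
    values outside roughly [0.8, 1.25] are a common flag for disparity."""
    return rate_protected / rate_reference

def demographic_parity_gap(rate_protected, rate_reference):
    """Absolute difference in detection rates between groups."""
    return abs(rate_protected - rate_reference)

def gini(x):
    """Gini coefficient over per-neighborhood detection rates:
    0 = perfectly even, approaching 1 = maximally concentrated."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n
```

For example, detection rates of 0.2 vs 0.4 give DIR = 0.5 and a parity gap of 0.2, matching the kind of under-detection the paper reports for Chicago (DIR = 0.22).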

[384] Evaluating Game Difficulty in Tetris Block Puzzle

Chun-Jui Wang, Jian-Ting Guo, Hung Guei, Chung-Chin Shih, Ti-Rong Wu, I-Chen Wu

Main category: cs.AI

TL;DR: SGAZ agent evaluates difficulty of Tetris Block Puzzle rule variants using training reward and convergence metrics, finding that more preview/hold options reduce difficulty while additional block shapes increase it.

DetailsMotivation: Tetris variants are popular but lack principled assessment of rule set difficulty; need systematic evaluation method for stochastic puzzle games.

Method: Use Stochastic Gumbel AlphaZero (SGAZ) as budget-aware planning agent to evaluate rule changes including hold block (h), preview (p), and additional Tetris block variants.

Result: Increasing h and p reduces difficulty (higher reward, faster convergence), while adding more block variants increases difficulty, with T-pentomino causing largest slowdown.

Conclusion: SGAZ enables efficient, reproducible comparisons across rule sets for stochastic puzzle games, providing reference for future game design.

Abstract: Tetris Block Puzzle is a single-player stochastic puzzle in which a player places blocks on an 8 x 8 grid to complete lines; its popular variants have amassed tens of millions of downloads. Despite this reach, there is little principled assessment of which rule sets are more difficult. Inspired by prior work that uses AlphaZero as a strong evaluator for chess variants, we study difficulty in this domain using Stochastic Gumbel AlphaZero (SGAZ), a budget-aware planning agent for stochastic environments. We evaluate rule changes including holding block h, preview block p, and additional Tetris block variants using metrics such as training reward and convergence iterations. Empirically, increasing h and p reduces difficulty (higher reward and faster convergence), while adding more Tetris block variants increases difficulty, with the T-pentomino producing the largest slowdown. Through analysis, SGAZ delivers strong play under small simulation budgets, enabling efficient, reproducible comparisons across rule sets and providing a reference for future design in stochastic puzzle games.

[385] Regret Bounds for Competitive Resource Allocation with Endogenous Costs

Rui Chai

Main category: cs.AI

TL;DR: Online resource allocation with endogenous costs where costs depend on interactions between modules, analyzed through three allocation paradigms with different regret bounds.

DetailsMotivation: To understand online resource allocation in modular systems where costs are endogenous (depend on interactions between modules) rather than exogenous, and to analyze different allocation strategies in this setting.

Method: Analyzed three allocation paradigms: (I) uniform allocation (cost-ignorant), (II) gated allocation (cost-estimating), and (III) competitive allocation via multiplicative weights update with interaction feedback. Studied regret bounds under adversarial sequences with bounded variation and analyzed how interaction topology affects computation-regret tradeoff.

Result: Strict separation in regret bounds: uniform incurs Ω(T) regret, gated achieves O(T^{2/3}), and competitive achieves O(√(T log N)). Interaction topology governs computation-regret tradeoff, with ring-structured topologies (like Wuxing) minimizing computation × regret product.

Conclusion: Competitive allocation via multiplicative weights with interaction feedback is optimal for endogenous cost settings, providing formal regret-theoretic justification for decentralized competitive allocation in modular architectures.

Abstract: We study online resource allocation among N interacting modules over T rounds. Unlike standard online optimization, costs are endogenous: they depend on the full allocation vector through an interaction matrix W encoding pairwise cooperation and competition. We analyze three paradigms: (I) uniform allocation (cost-ignorant), (II) gated allocation (cost-estimating), and (III) competitive allocation via multiplicative weights update with interaction feedback (cost-revealing). Our main results establish a strict separation under adversarial sequences with bounded variation: uniform incurs Omega(T) regret, gated achieves O(T^{2/3}), and competitive achieves O(sqrt(T log N)). The performance gap stems from competitive allocation’s ability to exploit endogenous cost information revealed through interactions. We further show that W’s topology governs a computation-regret tradeoff. Full interaction (|E|=O(N^2)) yields the tightest bound but highest per-step cost, while sparse topologies (|E|=O(N)) increase regret by at most O(sqrt(log N)) while reducing per-step cost from O(N^2) to O(N). Ring-structured topologies with both cooperative and competitive links - of which the five-element Wuxing topology is canonical - minimize the computation × regret product. These results provide the first formal regret-theoretic justification for decentralized competitive allocation in modular architectures and establish cost endogeneity as a fundamental challenge distinct from partial observability. Keywords: online learning, regret bounds, resource allocation, endogenous costs, interaction topology, multiplicative weights, modular systems, Wuxing topology
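Paradigm (III) is standard multiplicative weights, with the twist that the cost vector each round is generated by the allocation itself through W. A minimal sketch under that reading; the interaction matrix, horizon, and learning rate below are illustrative, not from the paper:

```python
import numpy as np

def competitive_allocation(W_interact, T=1000, eta=0.05):
    """Sketch of paradigm (III): multiplicative weights update with
    interaction feedback. Costs are endogenous: each module's
    per-round cost c = W_interact @ x depends on the full current
    allocation x, not on an exogenous cost sequence.
    """
    n = W_interact.shape[0]
    w = np.ones(n)
    for _ in range(T):
        x = w / w.sum()              # current allocation vector
        c = W_interact @ x           # endogenous cost, revealed via interaction
        w = w * np.exp(-eta * c)     # MWU: down-weight costly modules
    return w / w.sum()
```

With a W whose first module is self-penalizing and whose second is cost-free, mass shifts almost entirely onto the second module, illustrating how the cost-revealing paradigm exploits interaction feedback that uniform allocation ignores.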

[386] Behavioral Fingerprints for LLM Endpoint Stability and Identity

Jonah Leshin, Manish Shah, Ian Timmis, Daniel Kang

Main category: cs.AI

TL;DR: Stability Monitor: A black-box system for detecting behavioral changes in AI model endpoints by fingerprinting output distributions over time.

DetailsMotivation: Traditional reliability metrics (uptime, latency, throughput) don't capture behavioral changes in AI models. Models can remain "healthy" while their effective identity changes due to updates to weights, tokenizers, quantization, inference engines, or hardware, affecting AI-native application consistency.

Method: Periodically fingerprints endpoints by sampling outputs from fixed prompt sets and comparing output distributions over time. Uses summed energy distance statistic across prompts with permutation-test p-values aggregated sequentially to detect distribution shifts and define stability periods.

Result: In controlled validation, detects changes to model family, version, inference stack, quantization, and behavioral parameters. In real-world monitoring of same model across providers, observes substantial provider-to-provider and within-provider stability differences.

Conclusion: Stability Monitor provides a practical black-box approach for monitoring behavioral consistency of AI model endpoints, addressing a gap in traditional reliability metrics and enabling detection of subtle model identity changes.

Abstract: The consistency of AI-native applications depends on the behavioral consistency of the model endpoints that power them. Traditional reliability metrics such as uptime, latency and throughput do not capture behavioral change, and an endpoint can remain “healthy” while its effective model identity changes due to updates to weights, tokenizers, quantization, inference engines, kernels, caching, routing, or hardware. We introduce Stability Monitor, a black-box stability monitoring system that periodically fingerprints an endpoint by sampling outputs from a fixed prompt set and comparing the resulting output distributions over time. Fingerprints are compared using a summed energy distance statistic across prompts, with permutation-test p-values as evidence of distribution shift aggregated sequentially to detect change events and define stability periods. In controlled validation, Stability Monitor detects changes to model family, version, inference stack, quantization, and behavioral parameters. In real-world monitoring of the same model hosted by multiple providers, we observe substantial provider-to-provider and within-provider stability differences.
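The per-prompt test the monitor runs is an energy-distance two-sample comparison with a permutation p-value. A minimal sketch assuming fingerprints are arrays of output embeddings (the paper's sequential aggregation across prompts is omitted, and sample sizes here are illustrative):

```python
import numpy as np

def energy_distance(a, b):
    """Energy distance between two fingerprint samples (rows = outputs)."""
    def mean_dist(u, v):
        d = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)
        return d.mean()
    return 2 * mean_dist(a, b) - mean_dist(a, a) - mean_dist(b, b)

def permutation_pvalue(a, b, n_perm=200, rng=None):
    """Permutation-test p-value for a distribution shift between two
    fingerprints of the same prompt, taken at different times."""
    rng = np.random.default_rng(rng)
    obs = energy_distance(a, b)
    pooled = np.vstack([a, b])
    n = len(a)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if energy_distance(pooled[perm[:n]], pooled[perm[n:]]) >= obs:
            count += 1
    return (count + 1) / (n_perm + 1)   # add-one smoothing
```

A small p-value for a prompt is evidence of a behavioral change at that endpoint; the monitor then aggregates such evidence sequentially across the prompt set to declare change events and stability periods.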

[387] Man and machine: artificial intelligence and judicial decision making

Arthur Dyevre, Ahmad Shahvaroughi

Main category: cs.AI

TL;DR: Review paper examining AI integration in judicial decision-making, focusing on risk assessment tools, human judge biases, and AI-human interactions in criminal justice contexts.

DetailsMotivation: Address concerns about transparency, reliability, and accountability of AI in judicial decision-making while exploring how judges interact with AI-based decision aids in pretrial, sentencing, and parole contexts.

Method: Synthetic review connecting three aspects: AI tool performance/fairness, human judge strengths/biases, and AI+human interactions across computer science, economics, law, criminology, and psychology literature.

Result: Existing evidence shows modest or no impact of AI decision aids on pretrial/sentencing decisions; reveals gaps in evaluating AI performance, understanding judicial navigation of noisy environments, and individual characteristics influencing judge responses to AI.

Conclusion: AI vs Human comparisons can yield insights into both algorithmic tools and human decision-makers; advocates for greater interdisciplinary integration in future research on AI in judicial systems.

Abstract: The integration of artificial intelligence (AI) technologies into judicial decision-making - particularly in pretrial, sentencing, and parole contexts - has generated substantial concerns about transparency, reliability, and accountability. At the same time, these developments have brought the limitations of human judgment into sharper relief and underscored the importance of understanding how judges interact with AI-based decision aids. Using criminal justice risk assessment as a focal case, we conduct a synthetic review connecting three intertwined aspects of AI’s role in judicial decision-making: the performance and fairness of AI tools, the strengths and biases of human judges, and the nature of AI+human interactions. Across the fields of computer science, economics, law, criminology and psychology, researchers have made significant progress in evaluating the predictive validity of automated risk assessment instruments, documenting biases in judicial decision-making, and, to a more limited extent, examining how judges use algorithmic recommendations. While the existing empirical evidence indicates that the impact of AI decision aid tools on pretrial and sentencing decisions is modest or inexistent, our review also reveals important gaps in the canvassed literatures. Further research is needed to evaluate the performance of AI risk assessment instruments, understand how judges navigate noisy decision making environments and how individual characteristics influence judges’ responses to AI advice. We argue that AI vs Human comparisons have the potential to yield new insights into both algorithmic tools and human decision-makers and advocate greater interdisciplinary integration and cross-fertilization in future research.

[388] Serendipity by Design: Evaluating the Impact of Cross-domain Mappings on Human and LLM Creativity

Qiawen Ella Liu, Marina Dubova, Henry Conklin, Takumi Harada, Thomas L. Griffiths

Main category: cs.AI

TL;DR: LLMs and humans were tested on creativity using cross-domain mapping interventions; humans benefit from random analogies while LLMs don’t show significant effects, though both benefit more from semantically distant sources.

DetailsMotivation: To understand whether LLMs exhibit human-like creativity and whether the same interventions (specifically cross-domain mapping) can enhance creativity in both systems, comparing human and AI creative processes.

Method: Human participants and LLMs generated novel features for 10 daily products under two conditions: cross-domain mapping (drawing analogies from random remote sources) and user-need (targeting unmet user needs). The study measured originality of ideas and effects of semantic distance.

Result: Humans reliably benefit from cross-domain mapping, while LLMs generate more original ideas on average but show no statistically significant effect from the intervention. Both systems show increased impact when inspiration sources are more semantically distant from targets.

Conclusion: The study reveals systematic differences in how humans and LLMs respond to creativity interventions, highlighting the role of remote association in creative ideation and suggesting LLMs may have different creative mechanisms than humans.

Abstract: Are large language models (LLMs) creative in the same way humans are, and can the same interventions increase creativity in both? We evaluate a promising but largely untested intervention for creativity: forcing creators to draw an analogy from a random, remote source domain (“cross-domain mapping”). Human participants and LLMs generated novel features for ten daily products (e.g., backpack, TV) under two prompts: (i) cross-domain mapping, which required translating a property from a randomly assigned source (e.g., octopus, cactus, GPS), and (ii) user-need, which required proposing innovations targeting unmet user needs. We show that humans reliably benefit from randomly assigned cross-domain mappings, while LLMs, on average, generate more original ideas than humans and do not show a statistically significant effect of cross-domain mappings. However, in both systems, the impact of cross-domain mapping increases when the inspiration source becomes more semantically distant from the target. Our results highlight both the role of remote association in creative ideation and systematic differences in how humans and LLMs respond to the same intervention for creativity.

[389] LuMamba: Latent Unified Mamba for Electrode Topology-Invariant and Efficient EEG Modeling

Danaé Broustail, Anna Tegon, Thorir Mar Ingolfsson, Yawei Li, Luca Benini

Main category: cs.AI

TL;DR: LuMamba is a self-supervised EEG foundation model using topology-invariant encodings and linear-complexity Mamba blocks, achieving state-of-the-art performance with high efficiency.

DetailsMotivation: Building foundation models for EEG is challenging due to differing electrode topologies across datasets and computational scalability issues with Transformer architectures that have quadratic sequence complexity.

Method: Combines topology-invariant encodings with linear-complexity state-space modeling using LUNA’s learned-query cross-attention for channel unification and FEMBA’s bidirectional Mamba blocks for efficient temporal modeling. Introduces systematic investigation of Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning.

Result: Pre-trained on 21,000+ hours of unlabeled EEG, achieves 80.99% balanced accuracy on TUAB, state-of-the-art performance on Alzheimer’s detection (0.97 AUPR), with 377× fewer FLOPS than state-of-the-art models and scaling to 12× longer sequences before hitting GPU memory limits.

Conclusion: LuMamba provides an efficient, scalable foundation model for EEG that handles varying electrode topologies while achieving strong performance across multiple downstream tasks with minimal computational requirements.

Abstract: Electroencephalography (EEG) enables non-invasive monitoring of brain activity across clinical and neurotechnology applications, yet building foundation models for EEG remains challenging due to differing electrode topologies and computational scalability, as Transformer architectures incur quadratic sequence complexity. As a joint solution, we propose LuMamba (Latent Unified Mamba), a self-supervised framework combining topology-invariant encodings with linear-complexity state-space modeling, using LUNA’s learned-query cross-attention mechanism for channel unification and FEMBA’s bidirectional Mamba blocks for efficient temporal modeling. Within this architecture, we provide the first systematic investigation of the Latent-Euclidean Joint-Embedding Predictive Architecture (LeJEPA) for biosignal learning. Pre-trained on over 21,000 hours of unlabeled EEG from the TUEG corpus, LuMamba is evaluated on five downstream tasks spanning abnormality detection, artifact recognition, and mental condition classification across electrode configurations ranging from 16 to 26 channels. In the pre-training objective, masked reconstruction alone yields structured but less generalizable representations, while LeJEPA alone produces diffuse embeddings; combining both objectives achieves the most robust performance. With only 4.6M parameters, LuMamba attains 80.99% balanced accuracy on TUAB and achieves state-of-the-art performance on Alzheimer’s detection (0.97 AUPR), while requiring 377× fewer FLOPS than state-of-the-art models at equivalent sequence lengths and scaling to 12× longer sequences before reaching typical GPU memory limits. Code is available at https://github.com/pulp-bio/biofoundation

[390] How Uncertainty Estimation Scales with Sampling in Reasoning Models

Maksym Del, Markus Kängsepp, Marharyta Domnich, Ardi Tampuu, Lisa Yankovskaya, Meelis Kull, Mark Fishel

Main category: cs.AI

TL;DR: Study of uncertainty estimation in reasoning language models using parallel sampling with verbalized confidence and self-consistency across 17 reasoning tasks, finding that hybrid estimators combining both signals outperform either alone even with minimal sampling.

DetailsMotivation: Uncertainty estimation is critical for deploying reasoning language models but remains poorly understood under extended chain-of-thought reasoning, especially how different uncertainty signals scale and complement each other.

Method: Used parallel sampling as a fully black-box approach with verbalized confidence and self-consistency across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities. Analyzed how these uncertainty signals scale and their complementarity.

Result: Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency has lower initial discrimination. Hybrid estimators combining both signals with just two samples improve AUROC by up to +12 on average and outperform either signal alone even when scaled to larger budgets. Effects are domain-dependent with mathematics showing higher uncertainty quality and faster scaling.

Conclusion: Uncertainty estimation in reasoning models benefits significantly from combining verbalized confidence and self-consistency signals, with hybrid estimators providing substantial gains even with minimal sampling, though performance varies across domains.

Abstract: Uncertainty estimation is critical for deploying reasoning language models, yet remains poorly understood under extended chain-of-thought reasoning. We study parallel sampling as a fully black-box approach using verbalized confidence and self-consistency. Across three reasoning models and 17 tasks spanning mathematics, STEM, and humanities, we characterize how these signals scale. Both self-consistency and verbalized confidence scale in reasoning models, but self-consistency exhibits lower initial discrimination and lags behind verbalized confidence under moderate sampling. Most uncertainty gains, however, arise from signal combination: with just two samples, a hybrid estimator improves AUROC by up to +12 on average and already outperforms either signal alone even when scaled to much larger budgets, after which returns diminish. These effects are domain-dependent: in mathematics, the native domain of RLVR-style post-training, reasoning models achieve higher uncertainty quality and exhibit both stronger complementarity and faster scaling than in STEM or humanities.
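
The hybrid estimator's core idea - fusing agreement across parallel samples with the model's stated confidence - can be sketched in a few lines. The `hybrid_confidence` helper and its equal weighting are illustrative assumptions, not the paper's exact combination rule:

```python
from collections import Counter

def hybrid_confidence(samples, w=0.5):
    """Combine self-consistency with verbalized confidence.

    `samples` is a list of (answer, verbalized_confidence) pairs drawn by
    parallel sampling. The 50/50 weighting `w` is an illustrative choice;
    the paper does not prescribe a specific combination rule.
    """
    answers = [a for a, _ in samples]
    _, count = Counter(answers).most_common(1)[0]
    self_consistency = count / len(answers)                 # majority agreement rate
    verbalized = sum(c for _, c in samples) / len(samples)  # mean stated confidence
    return w * self_consistency + (1 - w) * verbalized

# Even two samples already yield a usable hybrid signal:
print(hybrid_confidence([("42", 0.9), ("42", 0.7)]))  # agreement 1.0, mean conf 0.8
```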

[391] Implicit Patterns in LLM-Based Binary Analysis

Qiang Li, XiangRui Zhang, Haining Wang

Main category: cs.AI

TL;DR: LLM-based binary vulnerability analysis reveals implicit token-level patterns in multi-pass reasoning, with four dominant patterns emerging from large-scale trace analysis.

DetailsMotivation: To understand how LLM-based agents organize exploration over hundreds of reasoning steps in binary vulnerability analysis, given limited context windows and implicit token-level behaviors that remain poorly understood.

Method: Conducted first large-scale, trace-level study analyzing 521 binaries with 99,563 reasoning steps to identify implicit patterns in multi-pass LLM reasoning for binary analysis.

Result: Identified four dominant implicit patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization. These patterns form a stable, structured system with distinct temporal roles and measurable characteristics.

Conclusion: LLM reasoning in binary analysis exhibits structured, token-level implicit patterns that serve as an abstraction layer for exploration organization, providing foundation for more reliable analysis systems.

Abstract: Binary vulnerability analysis is increasingly performed by LLM-based agents in an iterative, multi-pass manner, with the model as the core decision-maker. However, how such systems organize exploration over hundreds of reasoning steps remains poorly understood, due to limited context windows and implicit token-level behaviors. We present the first large-scale, trace-level study showing that multi-pass LLM reasoning gives rise to structured, token-level implicit patterns. Analyzing 521 binaries with 99,563 reasoning steps, we identify four dominant patterns: early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization that emerge implicitly from reasoning traces. These token-level implicit patterns serve as an abstraction of LLM reasoning: instead of explicit control-flow or predefined heuristics, exploration is organized through implicit decisions regulating path selection, commitment, and revision. Our analysis shows these patterns form a stable, structured system with distinct temporal roles and measurable characteristics. Our results provide the first systematic characterization of LLM-driven binary analysis and a foundation for more reliable analysis systems.

[392] D5P4: Partition Determinantal Point Process for Diversity in Parallel Discrete Diffusion Decoding

Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Axel Marmoret, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene

Main category: cs.AI

TL;DR: A generalized beam-search framework for discrete diffusion models that improves diversity in text generation through DPP-based candidate selection.

DetailsMotivation: Discrete diffusion models lack effective decoding methods compared to autoregressive approaches. Existing diffusion decoding techniques provide limited control over in-batch diversity, and standard autoregressive methods like beam search don't directly apply to iterative denoising processes.

Method: Introduces a generalized beam-search framework for discrete diffusion that generates candidates in parallel with modular beam-selection objectives. Proposes D5P4 as a diversity-focused instantiation that formulates selection as MAP inference over a Determinantal Point Process (DPP), using a scalable greedy solver for multi-GPU compatibility.

Result: Experiments on free-form generation and question answering show D5P4 improves diversity over strong baselines while maintaining competitive generation quality, with near-zero compute overhead for explicit diversity-probability trade-offs.

Conclusion: The proposed framework bridges the gap between autoregressive and diffusion decoding methods, enabling better diversity control in discrete diffusion models for text generation tasks.

Abstract: Discrete diffusion models are promising alternatives to autoregressive approaches for text generation, yet their decoding methods remain under-studied. Standard decoding methods for autoregressive models, such as beam search, do not directly apply to iterative denoising, and existing diffusion decoding techniques provide limited control over in-batch diversity. To bridge this gap, we introduce a generalized beam-search framework for discrete diffusion that generates candidates in parallel and supports modular beam-selection objectives. As a diversity-focused instantiation, we propose D5P4, which formulates the selection step as MAP inference over a Determinantal Point Process. Leveraging a scalable greedy solver, D5P4 maintains multi-GPU compatibility and enables an explicit trade-off between model probability and target diversity with near-zero compute overhead. Experiments on free-form generation and question answering demonstrate that D5P4 improves diversity over strong baselines while maintaining competitive generation quality.
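
The selection step D5P4 formulates - MAP inference over a DPP, approximated greedily because exact DPP MAP is NP-hard - can be sketched as follows. The kernel construction and the `greedy_dpp_map` helper are illustrative; the paper's scalable solver is an optimized variant of this greedy scheme:

```python
import numpy as np

def greedy_dpp_map(L, k):
    """Greedy MAP inference for a DPP with PSD kernel L (n x n).

    Each step adds the candidate that most increases log det(L[S, S]),
    the standard greedy approximation to DPP MAP inference.
    """
    n = L.shape[0]
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in set(range(n)) - set(selected):
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:
                best, best_gain = i, logdet
        if best is None:
            break
        selected.append(best)
    return selected

# Quality-weighted similarity kernel: L_ij = q_i * S_ij * q_j, so selection
# trades off candidate probability (q) against pairwise similarity (S).
q = np.array([1.0, 0.9, 0.8])
S = np.array([[1.0, 0.95, 0.1],
              [0.95, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
L = np.outer(q, q) * S
print(greedy_dpp_map(L, 2))  # skips candidate 1, near-duplicate of 0
```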

[393] Box Maze: A Process-Control Architecture for Reliable LLM Reasoning

Zou Qiang

Main category: cs.AI

TL;DR: The Box Maze framework proposes a process-control architecture with three explicit layers (memory grounding, structured inference, boundary enforcement) to improve reasoning integrity in LLMs under adversarial conditions, reducing boundary failure rates from ~40% to <1% in simulations.

DetailsMotivation: LLMs demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches (RLHF, output filtering) operate at behavioral level and lack explicit architectural mechanisms for enforcing reasoning process integrity.

Method: Proposes Box Maze framework - a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. Uses simulation-based evaluation with progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen).

Result: Results from n=50 adversarial scenarios suggest explicit cognitive control layers improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions.

Conclusion: While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.

Abstract: Large language models (LLMs) demonstrate strong generative capabilities but remain vulnerable to hallucination and unreliable reasoning under adversarial prompting. Existing safety approaches – such as reinforcement learning from human feedback (RLHF) and output filtering – primarily operate at the behavioral level and may lack explicit architectural mechanisms for enforcing reasoning process integrity. This paper proposes the Box Maze framework, a conceptual process-control architecture that decomposes LLM reasoning into three explicit layers: memory grounding, structured inference, and boundary enforcement. We introduce preliminary simulation-based evaluation involving progressive boundary erosion scenarios across multiple heterogeneous LLM systems (DeepSeek-V3, Doubao, Qwen). Results from n=50 adversarial scenarios suggest that explicit cognitive control layers may improve consistency in boundary maintenance, with architectural constraints reducing boundary failure rates from approximately 40% (baseline RLHF) to below 1% under adversarial conditions. While current validation is simulation-based, these preliminary results indicate that process-level control may offer a promising direction for improving reliability in large language model reasoning.

[394] cuGenOpt: A GPU-Accelerated General-Purpose Metaheuristic Framework for Combinatorial Optimization

Yuyang Liu

Main category: cs.AI

TL;DR: cuGenOpt is a GPU-accelerated metaheuristic framework for combinatorial optimization that balances generality, performance, and usability through CUDA architecture, extensible operator design, and LLM-assisted modeling.

DetailsMotivation: Existing combinatorial optimization approaches face trade-offs between generality, performance, and usability. The authors aim to create a framework that addresses all three dimensions simultaneously for problems in logistics, scheduling, and resource allocation.

Method: Three-level approach: 1) Engine level with “one block evolves one solution” CUDA architecture, unified encoding abstraction, adaptive operator selection, and hardware-aware resource management; 2) Extensibility through user-defined operator registration; 3) Usability via JIT compilation pipeline (Python API) and LLM-based modeling assistant for natural language to code conversion.

Result: Outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers up to n=150, attains 4.73% gap on TSP-442 within 30s. Solves twelve problem types across five encoding variants optimally. Framework optimizations reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%.

Conclusion: cuGenOpt successfully addresses the trade-off between generality, performance, and usability in combinatorial optimization through GPU acceleration, extensible architecture, and user-friendly interfaces including LLM assistance.

Abstract: Combinatorial optimization problems arise in logistics, scheduling, and resource allocation, yet existing approaches face a fundamental trade-off among generality, performance, and usability. We present cuGenOpt, a GPU-accelerated general-purpose metaheuristic framework that addresses all three dimensions simultaneously. At the engine level, cuGenOpt adopts a “one block evolves one solution” CUDA architecture with a unified encoding abstraction (permutation, binary, integer), a two-level adaptive operator selection mechanism, and hardware-aware resource management. At the extensibility level, a user-defined operator registration interface allows domain experts to inject problem-specific CUDA search operators. At the usability level, a JIT compilation pipeline exposes the framework as a pure-Python API, and an LLM-based modeling assistant converts natural-language problem descriptions into executable solver code. Experiments across five thematic suites on three GPU architectures (T4, V100, A800) show that cuGenOpt outperforms general MIP solvers by orders of magnitude, achieves competitive quality against specialized solvers on instances up to n=150, and attains 4.73% gap on TSP-442 within 30s. Twelve problem types spanning five encoding variants are solved to optimality. Framework-level optimizations cumulatively reduce pcb442 gap from 36% to 4.73% and boost VRPTW throughput by 75-81%. Code: https://github.com/L-yang-yang/cugenopt

[395] OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Zehao Li, Zhenyu Wu, Yibo Zhao, Bowen Yang, Jingjing Xie, Zhaoyang Liu, Zhoumianze Liu, Kaiming Jin, Jianze Liang, Zonglin Li, Feng Wu, Bowen Zhou, Zun Wang, Zichen Ding

Main category: cs.AI

TL;DR: OS-Themis: A multi-agent critic framework for GUI agents that decomposes trajectories into verifiable milestones with strict evidence chain auditing, improving RL robustness in stochastic GUI environments.

DetailsMotivation: Reinforcement Learning for GUI agents faces challenges with reward function quality in stochastic environments. Existing reward approaches struggle to balance scalability and performance, requiring a more robust framework for evaluating agent actions in GUI contexts.

Method: Proposes OS-Themis, a multi-agent critic framework that decomposes agent trajectories into verifiable milestones to isolate critical evidence. Uses a review mechanism to strictly audit the evidence chain before making final verdicts on agent actions. Also introduces OmniGUIRewardBench (OGRBench) as a cross-platform benchmark for GUI outcome rewards.

Result: OS-Themis yields 10.3% improvement when used to support online RL training and 6.9% gain for trajectory validation/filtering in self-training loops. All evaluated models achieve best performance under OS-Themis on the OGRBench benchmark.

Conclusion: OS-Themis provides a scalable and accurate framework for evaluating GUI agent performance, demonstrating significant improvements in RL training and self-training loops, with potential to drive agent evolution in stochastic GUI environments.

Abstract: Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.

[396] CausalARC: Abstract Reasoning with Causal World Models

Jacqueline Maasch, John Kalantari, Kia Khezeli

Main category: cs.AI

TL;DR: CausalARC: A causal reasoning testbed for AI evaluation in low-data and OOD regimes, based on structural causal models with observational, interventional, and counterfactual feedback.

DetailsMotivation: To create a systematic testbed for evaluating AI reasoning capabilities under challenging conditions of limited data and distribution shift, addressing the need for principled evaluation of causal reasoning in language models.

Method: Developed CausalARC as an experimental testbed modeled after ARC, where each reasoning task is sampled from a fully specified causal world model expressed as a structural causal model. Provides data augmentations giving observational, interventional, and counterfactual feedback as few-shot demonstrations.

Result: Demonstrated CausalARC’s utility in four language model evaluation settings: abstract reasoning with test-time training, counterfactual reasoning with in-context learning, program synthesis, and causal discovery with logical reasoning. Found significant performance variation across tasks and models.

Conclusion: CausalARC provides a principled framework for evaluating causal reasoning in AI systems, revealing substantial room for improvement in language model reasoning capabilities across different causal inference tasks.

Abstract: On-the-fly reasoning often requires adaptation to novel problems under limited data and distribution shift. This work introduces CausalARC: an experimental testbed for AI reasoning in low-data and out-of-distribution regimes, modeled after the Abstraction and Reasoning Corpus (ARC). Each CausalARC reasoning task is sampled from a fully specified causal world model, formally expressed as a structural causal model. Principled data augmentations provide observational, interventional, and counterfactual feedback about the world model in the form of few-shot, in-context learning demonstrations. As a proof-of-concept, we illustrate the use of CausalARC for four language model evaluation settings: (1) abstract reasoning with test-time training, (2) counterfactual reasoning with in-context learning, (3) program synthesis, and (4) causal discovery with logical reasoning. Within- and between-model performance varied heavily across tasks, indicating room for significant improvement in language model reasoning.
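
The notion of sampling tasks from a fully specified structural causal model, with both observational and interventional feedback, can be illustrated with a toy SCM. All names and mechanisms below are invented for illustration and are not CausalARC's actual world models:

```python
import random

def scm_sample(intervene_x=None):
    """Toy SCM with one edge X -> Y and exogenous noise (U_X, U_Y).

    Passing `intervene_x` implements do(X = x): X's structural mechanism
    is cut and replaced by the fixed value, while Y's mechanism is kept.
    """
    u_x, u_y = random.random(), random.random()
    x = 1 if u_x < 0.5 else 0
    if intervene_x is not None:
        x = intervene_x
    y = x ^ (1 if u_y < 0.1 else 0)  # Y copies X, flipped with 10% noise
    return x, y

random.seed(0)
obs = [scm_sample() for _ in range(1000)]                   # observational data
interv = [scm_sample(intervene_x=1) for _ in range(1000)]   # data under do(X=1)
```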

[397] Don’t Pass@k: A Bayesian Framework for Large Language Model Evaluation

Mohsen Hariri, Amirhossein Samandar, Michael Hinczewski, Vipin Chaudhary

Main category: cs.AI

TL;DR: A Bayesian evaluation framework that replaces Pass@k with posterior estimates of model success probability, providing stable rankings and uncertainty quantification for LLM reasoning evaluation.

DetailsMotivation: Pass@k produces unstable and misleading rankings of LLMs' reasoning performance, especially with limited trials and computational constraints, necessitating a more principled evaluation approach.

Method: Models evaluation outcomes as categorical with Dirichlet prior, yielding closed-form posterior mean and uncertainty estimates for any weighted rubric, enabling use of prior evidence when appropriate.

Result: The Bayesian framework achieves faster convergence and greater rank stability than Pass@k and variants in simulations and real benchmarks (AIME'24/'25, HMMT'25, BrUMO'25), enabling reliable comparisons with fewer samples.

Conclusion: Recommends replacing Pass@k with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit, clarifying when observed performance gaps are statistically meaningful.

Abstract: Pass@k is widely used to report the reasoning performance of LLMs, but it often produces unstable and potentially misleading rankings, especially when the number of trials (samples) is limited and computational resources are constrained. We present a principled Bayesian evaluation framework that replaces Pass@k and average accuracy over N trials (avg@N) with posterior estimates of a model’s underlying success probability and credible intervals, yielding stable rankings and a transparent decision rule for differences. Evaluation outcomes are modeled as categorical (not just 0/1) with a Dirichlet prior, giving closed-form expressions for the posterior mean and uncertainty of any weighted rubric and enabling the use of prior evidence when appropriate. Theoretically, under a uniform prior, the Bayesian posterior mean is order-equivalent to average accuracy (Pass@1), explaining its empirical robustness while adding principled uncertainty. Empirically, in simulations with known ground-truth success rates and on AIME'24/'25, HMMT'25, and BrUMO'25, the posterior-based procedure achieves faster convergence and greater rank stability than Pass@k and recent variants, enabling reliable comparisons at far smaller sample counts. The framework clarifies when observed gaps are statistically meaningful (non-overlapping credible intervals) versus noise, and it naturally extends to graded, rubric-based evaluations. Together, these results recommend replacing Pass@k for LLM evaluation and ranking with a posterior-based, compute-efficient protocol that unifies binary and non-binary evaluation while making uncertainty explicit. Source code is available at https://github.com/mohsenhariri/scorio
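
In the binary special case, the Dirichlet posterior reduces to a Beta posterior, and under a uniform prior the posterior mean is a smoothed accuracy (hence the order-equivalence to Pass@1). A minimal sketch; the normal-approximation credible interval is a shortcut assumed here, not the paper's exact procedure:

```python
from math import sqrt

def beta_posterior(successes, trials, a=1.0, b=1.0):
    """Posterior mean and approximate 95% credible interval for a model's
    success probability under a Beta(a, b) prior - the binary special case
    of the Dirichlet framework. A uniform prior is a = b = 1, giving the
    posterior mean (successes + 1) / (trials + 2).
    """
    a_post, b_post = a + successes, b + trials - successes
    mean = a_post / (a_post + b_post)
    var = a_post * b_post / ((a_post + b_post) ** 2 * (a_post + b_post + 1))
    half = 1.96 * sqrt(var)  # normal approximation; exact Beta quantiles in practice
    return mean, (max(0.0, mean - half), min(1.0, mean + half))

# 7 correct answers out of 10 trials, uniform prior:
mean, ci = beta_posterior(7, 10)
print(round(mean, 3), [round(x, 3) for x in ci])
```

Two models whose intervals do not overlap can then be called meaningfully different, which is the decision rule the framework proposes in place of raw Pass@k gaps.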

[398] SynBullying: A Multi LLM Synthetic Conversational Dataset for Cyberbullying Detection

Arefeh Kazemi, Hamza Qadeer, Joachim Wagner, Hossein Hosseini, Sri Balaaji Natarajan Kalaivendan, Brian Davis

Main category: cs.AI

TL;DR: SynBullying is a synthetic multi-LLM conversational dataset for cyberbullying research, offering realistic multi-turn interactions with context-aware annotations and fine-grained labeling as an ethical alternative to human data collection.

DetailsMotivation: The paper addresses the need for scalable and ethically safe cyberbullying datasets. Traditional human data collection faces privacy concerns and scalability limitations. Synthetic data generation using LLMs offers a solution to create realistic bullying interactions without harming real individuals.

Method: The authors leverage multiple large language models to simulate realistic bullying conversations. They create a dataset with: 1) conversational structure capturing multi-turn exchanges, 2) context-aware annotations assessing harmfulness within conversational flow, and 3) fine-grained labeling covering various cyberbullying categories. The dataset is evaluated across five dimensions including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution.

Result: The paper presents SynBullying as a comprehensive dataset for cyberbullying research. It demonstrates the dataset’s utility by testing its performance both as standalone training data and as an augmentation source for cyberbullying classification tasks.

Conclusion: SynBullying provides a scalable, ethically safe alternative to human-collected cyberbullying data, enabling more comprehensive research into bullying dynamics through realistic synthetic conversations with detailed annotations.

Abstract: We introduce SynBullying, a synthetic multi-LLM conversational dataset for studying and detecting cyberbullying (CB). SynBullying provides a scalable and ethically safe alternative to human data collection by leveraging large language models (LLMs) to simulate realistic bullying interactions. The dataset offers (i) conversational structure, capturing multi-turn exchanges rather than isolated posts; (ii) context-aware annotations, where harmfulness is assessed within the conversational flow considering context, intent, and discourse dynamics; and (iii) fine-grained labeling, covering various CB categories for detailed linguistic and behavioral analysis. We evaluate SynBullying across five dimensions, including conversational structure, lexical patterns, sentiment/toxicity, role dynamics, harm intensity, and CB-type distribution. We further examine its utility by testing its performance as standalone training data and as an augmentation source for CB classification.

[399] Heuristic Multiobjective Discrete Optimization using Restricted Decision Diagrams

Rahul Patel, Elias B. Khalil, David Bergman

Main category: cs.AI

TL;DR: New node-selection heuristics for constructing restricted decision diagrams to approximate Pareto frontiers in multiobjective integer linear programming, achieving 85% Pareto frontier recovery with 2.5x speedups.

DetailsMotivation: When decision diagrams become too large for memory or when fast approximations are preferred, restricted DDs with node selection are needed, but existing methods lack effective heuristics for high-quality Pareto frontier approximations.

Method: Introduces new node-selection heuristics for restricted DDs based on: 1) simple rules, 2) machine learning with feature engineering, or 3) end-to-end deep learning, depending on problem structure.

Result: Experiments on multiobjective knapsack, set packing, and traveling salesperson problems show recovery of over 85% of Pareto frontier with 2.5x speedups over exact DD enumeration, with few non-Pareto solutions.

Conclusion: The proposed node-selection heuristics effectively balance solution quality and computational efficiency for approximating Pareto frontiers in multiobjective optimization problems using restricted decision diagrams.

Abstract: Decision diagrams (DDs) have emerged as a state-of-the-art method for exact multiobjective integer linear programming. When the DD is too large to fit into memory or the decision-maker prefers a fast approximation to the Pareto frontier, the complete DD must be restricted to a subset of its states (or nodes). We introduce new node-selection heuristics for constructing restricted DDs that produce a high-quality approximation of the Pareto frontier. Depending on the structure of the problem, our heuristics are based on either simple rules, machine learning with feature engineering, or end-to-end deep learning. Experiments on multiobjective knapsack, set packing, and traveling salesperson problems show that our approach is highly effective, recovering over 85% of the Pareto frontier while achieving 2.5x speedups over exact DD enumeration on average, with very few non-Pareto solutions. The code is available at https://github.com/rahulptel/HMORDD.
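
Restricting a decision diagram layer by layer under a node-selection rule can be sketched on a small bi-objective knapsack. The state encoding and the objective-sum scoring rule below are illustrative stand-ins for the paper's heuristics (the learned variants replace the scoring function):

```python
def restricted_pareto_knapsack(items, capacity, max_width):
    """Layer-by-layer restricted-DD sketch for a bi-objective 0/1 knapsack.

    Each DD node is a state (weight_used, (obj1, obj2)). When a layer
    exceeds `max_width`, only the states with the largest objective sum
    are kept - a simple-rule node-selection heuristic.
    """
    layer = {(0, (0, 0))}
    for weight, v1, v2 in items:
        nxt = set()
        for w, (a, b) in layer:
            nxt.add((w, (a, b)))                       # skip the item
            if w + weight <= capacity:                 # take the item
                nxt.add((w + weight, (a + v1, b + v2)))
        # Restriction step: prune the layer to at most max_width nodes.
        layer = set(sorted(nxt, key=lambda s: -(s[1][0] + s[1][1]))[:max_width])
    # Extract non-dominated objective vectors from the terminal layer.
    points = sorted({p for _, p in layer}, reverse=True)
    frontier, best_b = [], -1
    for a, b in points:
        if b > best_b:
            frontier.append((a, b))
            best_b = b
    return frontier

items = [(2, 3, 1), (3, 1, 4), (4, 5, 2)]  # (weight, obj1, obj2)
print(restricted_pareto_knapsack(items, capacity=5, max_width=8))  # exact: width never binds
```

With a tight width the result is only an approximation of the Pareto frontier, which is exactly the quality/速度 trade-off the node-selection heuristics aim to manage.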

[400] Automated Explanation Selection for Scientific Discovery

Ashlin Iser

Main category: cs.AI

TL;DR: Paper proposes a scientific discovery cycle combining machine learning with automated reasoning for generating and selecting explanations in XAI, with a taxonomy of explanation selection criteria drawing from sociology and cognitive science.

DetailsMotivation: Automated reasoning is crucial for Explainable AI (XAI) to build trust beyond predictive accuracy. Current approaches need better methods for generating and selecting explanations that align with human understanding and social contexts.

Method: Proposes a cycle of scientific discovery integrating machine learning with automated reasoning. Develops a taxonomy of explanation selection problems based on insights from sociology and cognitive science, extending existing notions with new properties.

Result: A comprehensive framework for explanation generation and selection that incorporates interdisciplinary insights, providing new selection criteria that go beyond existing approaches in XAI.

Conclusion: The integration of automated reasoning with machine learning and insights from sociology/cognitive science provides a robust approach to explanation generation and selection, advancing the field of Explainable AI.

Abstract: Automated reasoning is a key technology in the young but rapidly growing field of Explainable Artificial Intelligence (XAI). Explainability helps build trust in artificial intelligence systems beyond their mere predictive accuracy and robustness. In this paper, we propose a cycle of scientific discovery that combines machine learning with automated reasoning for the generation and selection of explanations. We present a taxonomy of explanation selection problems that draws on insights from sociology and cognitive science. These selection criteria subsume existing notions and extend them with new properties.

[401] LLM-Based World Models Can Make Decisions Solely, But Rigorous Evaluations are Needed

Chang Yang, Xinrun Wang, Junzhe Jiang, Qinggang Zhang, Xiao Huang

Main category: cs.AI

TL;DR: Comprehensive evaluation of LLMs as world models for decision-making across 31 diverse environments, testing policy verification, action proposal, and policy planning tasks.

DetailsMotivation: While LLMs have been used as world simulators and for reasoning in decision-making, there's a lack of comprehensive evaluation from a decision-making perspective. Existing work either evaluates LLMs as general world simulators or as functional modules for transition prediction, but not systematically as standalone decision-making world models.

Method: Uses 31 diverse environments from prior work with curated rule-based policies. Designs three main tasks: policy verification (checking if actions follow rules), action proposal (suggesting valid actions), and policy planning (generating action sequences). Evaluates advanced LLMs (GPT-4o and GPT-4o-mini) across these tasks under various settings.

Result: Key findings: 1) GPT-4o significantly outperforms GPT-4o-mini, especially on tasks requiring domain knowledge; 2) Performance degrades for long-term decision-making tasks; 3) Combining different world model functionalities introduces additional performance instability.

Conclusion: LLMs show promise as world models for decision-making but have limitations in long-term reasoning and stability when combining multiple functionalities. The evaluation framework provides comprehensive insights into LLM capabilities as decision-making world models.

Abstract: World models have emerged as a key module in decision making, where MuZero and Dreamer achieve remarkable successes in complex tasks. Recent work leverages Large Language Models (LLMs) as general world simulators to simulate the dynamics of the world due to their generalizability. LLMs also serve as the world model for deliberative reasoning in Reasoning via Planning (RAP) and Tree of Thought (ToT). However, world models have so far been evaluated either as general world simulators or as a functional module of the agent, i.e., predicting transitions to assist planning. In this work, we propose a comprehensive evaluation of LLM-based world models from the decision-making perspective. Specifically, we leverage the 31 diverse environments from (Wang et al., 2023; 2024) and curate a rule-based policy for each environment to enable diverse evaluation. Then, we design three main tasks, i.e., policy verification, action proposal, and policy planning, where the world model alone is used for decision making. Finally, we comprehensively evaluate advanced LLMs, i.e., GPT-4o and GPT-4o-mini, on these environments across the three main tasks under various settings. The key observations include: i) GPT-4o significantly outperforms GPT-4o-mini on the three main tasks, especially on tasks that require domain knowledge; ii) the performance of the LLM-based world model degrades on long-term decision-making tasks; and iii) combining different functionalities of the world model introduces additional performance instability.

[402] Prompt Architecture Determines Reasoning Quality: A Variable Isolation Study on the Car Wash Problem

Heejin Jo

Main category: cs.AI

TL;DR: STAR reasoning framework dramatically improves LLM performance on car wash problem from 0% to 85%, with context adding incremental gains to reach 100% accuracy.

DetailsMotivation: LLMs consistently fail the "car wash problem" benchmark requiring implicit physical constraint inference, prompting investigation into which prompt architecture layers enable correct reasoning.

Method: Variable isolation study with 120 total trials using Claude 3.5 Sonnet, testing different prompt architecture layers: STAR reasoning framework, user profile context via vector database retrieval, and RAG context.

Result: STAR framework alone raised accuracy from 0% to 85%, user profile context added 10 percentage points, RAG context added 5 percentage points, achieving 100% accuracy in full-stack condition.

Conclusion: Structured reasoning scaffolds (forced goal articulation before inference) matter substantially more than context injection for implicit constraint reasoning tasks.

Abstract: Large language models consistently fail the “car wash problem,” a viral reasoning benchmark requiring implicit physical constraint inference. We present a variable isolation study (n=20 per condition, 6 conditions, 120 total trials) examining which prompt architecture layers in a production system enable correct reasoning. Using Claude 3.5 Sonnet with controlled hyperparameters (temperature 0.7, top_p 1.0), we find that the STAR (Situation-Task-Action-Result) reasoning framework alone raises accuracy from 0% to 85% (p=0.001, Fisher’s exact test, odds ratio 13.22). Adding user profile context via vector database retrieval provides a further 10 percentage point gain, while RAG context contributes an additional 5 percentage points, achieving 100% accuracy in the full-stack condition. These results suggest that structured reasoning scaffolds – specifically, forced goal articulation before inference – matter substantially more than context injection for implicit constraint reasoning tasks.

[403] Agentic LLM Framework for Adaptive Decision Discourse

Antoine Dolant, Praveen Kumar

Main category: cs.AI

TL;DR: Agentic LLM framework simulates stakeholder discourse for collaborative decision-making under uncertainty, applied to flood response scenarios.

DetailsMotivation: Traditional decision-support tools lack the ability to simulate diverse stakeholder perspectives and collaborative deliberation needed for complex, uncertain scenarios. There's a need for frameworks that can synthesize multiple viewpoints to develop robust, equitable recommendations.

Method: An agentic LLM framework that simulates diverse stakeholder personas with unique priorities, expertise, and value-driven reasoning. The framework enables self-governed assembly dialogues emphasizing trade-off exploration, tested on real (Texas floods) and hypothetical (Midwestern township) flooding scenarios.

Result: The framework successfully generates balanced recommendations considering social, economic, and environmental dimensions. It demonstrates ability to handle varying forecasting uncertainty and produce context-aware, scalable recommendations for high-stake scenarios.

Conclusion: Agentic LLMs can transform decision-making for complex, uncertain scenarios by enabling adaptive, collaborative, and equitable recommendations through simulated stakeholder discourse, with broad implications across domains where uncertainty and complexity converge.

Abstract: Effective decision-making in complex systems requires synthesizing diverse perspectives to address multifaceted challenges under uncertainty. This study introduces an agentic Large Language Model (LLM) framework for simulating decision discourse - the deliberative process through which actionable strategies are collaboratively developed. Unlike traditional decision-support tools, this framework simulates diverse stakeholder personas, each bringing unique priorities, expertise and value-driven reasoning to a dialogue that emphasizes trade-off exploration in a self-governed assembly. We present exploratory results fostering robust and equitable recommendations, with two use cases: first, our framework simulates a response to the floods that occurred in Texas in July 2025; second, a hypothetical extreme flood in a Midwestern township under varying forecasting uncertainty. The resulting recommendations balance competing priorities across social, economic and environmental dimensions, setting a foundation for scalable, context-aware recommendations and transforming how decisions for real-world high-stakes scenarios can be approached in digital environments. This research explores novel and alternative routes leveraging agentic LLMs for adaptive, collaborative, and equitable recommendations, with implications across domains where uncertainty and complexity converge.

[404] From Mind to Machine: The Rise of Manus AI as a Fully Autonomous Digital Agent

Minjie Shen, Yanshu Li, Lulu Chen, Zhichao Fan, Yanhang Li, Qikai Yang

Main category: cs.AI

TL;DR: Manus AI is a general-purpose AI agent that combines LLM reasoning with task execution capabilities to bridge the gap between planning and real-world action implementation.

DetailsMotivation: To create an autonomous AI agent that can translate high-level intentions into tangible real-world outcomes by combining reasoning capabilities with execution abilities, addressing the gap between "mind" (planning) and "hand" (execution).

Method: The paper presents Manus AI’s technical architecture that integrates large language models for reasoning and planning with execution systems for complex end-to-end task completion across various domains.

Result: Manus AI demonstrates capabilities across multiple sectors including healthcare, finance, manufacturing, robotics, and gaming, showing potential for translating intentions into real-world actions.

Conclusion: Manus AI represents a significant advancement toward intelligent agents that can execute complex tasks and heralds a new era of human-AI collaboration, though current limitations exist and future potential is substantial.

Abstract: Manus AI is a general-purpose AI agent introduced in early 2025, marking a significant advancement in autonomous artificial intelligence. Developed by the Chinese startup Monica.im, Manus is designed to bridge the gap between “mind” and “hand” - combining the reasoning and planning capabilities of large language models with the ability to execute complex, end-to-end tasks that produce tangible outcomes. This paper presents a comprehensive overview of Manus AI, exploring its core technical architecture, diverse applications across sectors such as healthcare, finance, manufacturing, robotics, and gaming, as well as its key strengths, current limitations, and future potential. Positioned as a preview of what lies ahead, Manus AI represents a shift toward intelligent agents that can translate high-level intentions into real-world actions, heralding a new era of human-AI collaboration.

[405] Preference-Driven Multi-Objective Combinatorial Optimization with Conditional Computation

Mingfeng Fan, Jianan Zhou, Yifeng Zhang, Yaoxin Wu, Jinbiao Chen, Guillaume Adrien Sartoretti

Main category: cs.AI

TL;DR: POCCO is a plug-and-play framework for multi-objective combinatorial optimization that adaptively selects specialized neural architectures for different subproblems and uses preference-driven optimization instead of explicit rewards.

DetailsMotivation: Current deep RL methods for multi-objective combinatorial optimization treat all subproblems equally with a single model, limiting solution space exploration and leading to suboptimal performance.

Method: Proposes POCCO framework with: 1) conditional computation block that routes subproblems to specialized neural architectures, and 2) preference-driven optimization algorithm learning pairwise preferences between winning and losing solutions.

Result: Experimental results across four classic MOCOP benchmarks show significant superiority and strong generalization when applied to two state-of-the-art neural methods.

Conclusion: POCCO effectively overcomes limitations of equal treatment of subproblems and enables better solution space exploration through adaptive model selection and preference-based optimization.

Abstract: Recent deep reinforcement learning methods have achieved remarkable success in solving multi-objective combinatorial optimization problems (MOCOPs) by decomposing them into multiple subproblems, each associated with a specific weight vector. However, these methods typically treat all subproblems equally and solve them using a single model, hindering the effective exploration of the solution space and thus leading to suboptimal performance. To overcome the limitation, we propose POCCO, a novel plug-and-play framework that enables adaptive selection of model structures for subproblems, which are subsequently optimized based on preference signals rather than explicit reward values. Specifically, we design a conditional computation block that routes subproblems to specialized neural architectures. Moreover, we propose a preference-driven optimization algorithm that learns pairwise preferences between winning and losing solutions. We evaluate the efficacy and versatility of POCCO by applying it to two state-of-the-art neural methods for MOCOPs. Experimental results across four classic MOCOP benchmarks demonstrate its significant superiority and strong generalization.
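The preference-driven objective above can be illustrated with a Bradley-Terry style pairwise loss; this is a standard way to learn from winning/losing pairs and is an assumption for illustration, not necessarily POCCO's exact formulation:

```python
import math

def pairwise_preference_loss(score_win, score_lose):
    """-log sigmoid(s_win - s_lose): minimized by pushing the winning
    solution's score above the losing one's, using only the preference
    signal rather than an explicit reward value."""
    margin = score_win - score_lose
    return math.log1p(math.exp(-margin))  # numerically stable form

# The loss shrinks as the winner's margin over the loser grows.
assert pairwise_preference_loss(2.0, 0.0) < pairwise_preference_loss(0.5, 0.0)
```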

[406] Efficient Reasoning with Balanced Thinking

Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian

Main category: cs.AI

TL;DR: ReBalance is a training-free framework that uses confidence-based steering to balance overthinking and underthinking in Large Reasoning Models, improving efficiency and accuracy across math, QA, and coding tasks.

DetailsMotivation: Large Reasoning Models suffer from overthinking (redundant computation on simple problems) and underthinking (insufficient exploration of reasoning paths), leading to inefficiencies and inaccuracies. Existing methods to fix overthinking often cause underthinking, compromising accuracy.

Method: ReBalance uses confidence as a continuous indicator of reasoning dynamics: identifies overthinking through high confidence variance and underthinking via consistent overconfidence. Aggregates hidden states from a small dataset into reasoning mode prototypes, computes a steering vector to guide reasoning trajectories, and uses a dynamic control function to modulate vector strength/direction based on real-time confidence.

Result: Extensive experiments on four models (0.5B to 32B) across nine benchmarks in math reasoning, general QA, and coding tasks show ReBalance effectively reduces output redundancy while improving accuracy.

Conclusion: ReBalance offers a general, training-free, plug-and-play strategy for efficient and robust LRM deployment by balancing thinking processes through confidence-based steering.

Abstract: Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
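The confidence-based control described above can be sketched as follows. The thresholds, gains, and the variance/overconfidence tests are illustrative assumptions, not ReBalance's actual values:

```python
import numpy as np

def steer(hidden, vector, confidence, conf_history,
          var_thresh=0.02, over_thresh=0.9):
    """Add a reasoning-mode steering vector to a hidden state.

    Hypothetical control rule: high variance of recent confidence is
    read as overthinking (negative gain prunes redundancy), while
    sustained overconfidence is read as underthinking (positive gain
    promotes exploration); otherwise the trajectory is left alone."""
    var = float(np.var(conf_history))
    if var > var_thresh:                                   # overthinking
        gain = -1.0
    elif confidence > over_thresh and min(conf_history) > over_thresh:
        gain = +1.0                                        # underthinking
    else:
        gain = 0.0                                         # balanced
    return hidden + gain * vector

h, v = np.zeros(4), np.ones(4)
# Sustained overconfidence -> steer toward exploration (+v).
assert np.allclose(steer(h, v, 0.95, [0.94, 0.95, 0.96]), v)
# Erratic confidence -> steer toward pruning (-v).
assert np.allclose(steer(h, v, 0.5, [0.1, 0.9, 0.2]), -v)
```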

[407] Multimodal Fused Learning for Solving the Generalized Traveling Salesman Problem in Robotic Task Planning

Jiaqi Chen, Mingfeng Fan, Xuefeng Zhang, Jingsong Liang, Yuhong Cao, Guohua Wu, Guillaume Adrien Sartoretti

Main category: cs.AI

TL;DR: A multimodal learning framework (MMFL) that combines graph and image representations to solve Generalized Traveling Salesman Problems for real-time robotic task planning.

DetailsMotivation: Mobile robots need efficient task planning for applications like warehouse retrieval and environmental monitoring, which involve selecting locations from target clusters (GTSP problems). Current methods struggle to balance solution quality with real-time computational efficiency.

Method: Proposes Multimodal Fused Learning (MMFL) framework with: 1) coordinate-based image builder transforming GTSP instances into spatial representations, 2) adaptive resolution scaling for different problem scales, 3) multimodal fusion module with bottlenecks to integrate geometric (graph) and spatial (image) features, learning a policy for real-time planning.

Result: Extensive experiments show MMFL significantly outperforms state-of-the-art methods across various GTSP instances while maintaining computational efficiency for real-time applications. Physical robot tests validate practical effectiveness in real-world scenarios.

Conclusion: MMFL provides an effective solution for robotic task planning by leveraging multimodal representations to solve GTSP problems with both high quality and real-time efficiency.

Abstract: Effective and efficient task planning is essential for mobile robots, especially in applications like warehouse retrieval and environmental monitoring. These tasks often involve selecting one location from each of several target clusters, forming a Generalized Traveling Salesman Problem (GTSP) that remains challenging to solve both accurately and efficiently. To address this, we propose a Multimodal Fused Learning (MMFL) framework that leverages both graph and image-based representations to capture complementary aspects of the problem, and learns a policy capable of generating high-quality task planning schemes in real time. Specifically, we first introduce a coordinate-based image builder that transforms GTSP instances into spatially informative representations. We then design an adaptive resolution scaling strategy to enhance adaptability across different problem scales, and develop a multimodal fusion module with dedicated bottlenecks that enables effective integration of geometric and spatial features. Extensive experiments show that our MMFL approach significantly outperforms state-of-the-art methods across various GTSP instances while maintaining the computational efficiency required for real-time robotic applications. Physical robot tests further validate its practical effectiveness in real-world scenarios.
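A coordinate-based image builder of the kind described above can be sketched as a rasterizer that drops each node into its cluster's channel; the channel layout, resolution, and unit-square normalization are assumptions for illustration, not MMFL's exact design:

```python
import numpy as np

def gtsp_to_image(coords, clusters, resolution=32):
    """Rasterize a GTSP instance into a multi-channel image.

    coords: (n, 2) node positions in the unit square [0, 1]^2.
    clusters: length-n array of cluster ids in [0, k).
    Returns a (k, resolution, resolution) array with a 1 at each
    node's cell in its cluster's channel."""
    coords = np.asarray(coords, dtype=float)
    clusters = np.asarray(clusters)
    k = int(clusters.max()) + 1
    img = np.zeros((k, resolution, resolution), dtype=np.float32)
    # Map unit-square coordinates to pixel indices, clipping the edge.
    idx = np.clip((coords * resolution).astype(int), 0, resolution - 1)
    img[clusters, idx[:, 1], idx[:, 0]] = 1.0
    return img

img = gtsp_to_image([[0.1, 0.2], [0.9, 0.9], [0.5, 0.5]], [0, 0, 1])
assert img.shape == (2, 32, 32)  # one channel per cluster
```

The adaptive resolution scaling in the paper would vary `resolution` with problem size so that larger instances do not collapse multiple nodes into one cell.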

[408] Single Agent Robust Deep Reinforcement Learning for Bus Fleet Control

Yifan Zhang

Main category: cs.AI

TL;DR: Single-agent RL framework for bus holding control using categorical state augmentation and structured rewards, outperforming multi-agent approaches in realistic transit simulations.

DetailsMotivation: Traditional multi-agent RL for bus bunching control fails in realistic transit operations with heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes due to data imbalance and convergence issues.

Method: Reformulates multi-agent problem into single-agent RL by augmenting state space with categorical identifiers (vehicle ID, station ID, time period) plus numerical features (headway, occupancy, velocity). Uses modified soft actor-critic (SAC) with ridge-shaped reward function balancing uniform headways and schedule adherence.

Result: Modified SAC achieves more stable and superior performance than benchmarks including MADDPG (-430k vs. -530k under stochastic conditions), demonstrating effective bus holding management in non-loop, real-world contexts.

Conclusion: Single-agent deep RL with categorical structuring and schedule-aware rewards offers robust, scalable alternative to MARL frameworks for bus holding control, particularly where agent-specific experiences are imbalanced.

Abstract: Bus bunching remains a challenge for urban transit due to stochastic traffic and passenger demand. Traditional solutions rely on multi-agent reinforcement learning (MARL) in loop-line settings, which overlook realistic operations characterized by heterogeneous routes, timetables, fluctuating demand, and varying fleet sizes. We propose a novel single-agent reinforcement learning (RL) framework for bus holding control that avoids the data imbalance and convergence issues of MARL under near-realistic simulation. A bidirectional timetabled network with dynamic passenger demand is constructed. The key innovation is reformulating the multi-agent problem into a single-agent one by augmenting the state space with categorical identifiers (vehicle ID, station ID, time period) in addition to numerical features (headway, occupancy, velocity). This high-dimensional encoding enables single-agent policies to capture inter-agent dependencies, analogous to projecting non-separable inputs into a higher-dimensional space. We further design a structured reward function aligned with operational goals: instead of exponential penalties on headway deviations, a ridge-shaped reward balances uniform headways and schedule adherence. Experiments show that our modified soft actor-critic (SAC) achieves more stable and superior performance than benchmarks, including MADDPG (e.g., -430k vs. -530k under stochastic conditions). These results demonstrate that single-agent deep RL, when enhanced with categorical structuring and schedule-aware rewards, can effectively manage bus holding in non-loop, real-world contexts. This paradigm offers a robust, scalable alternative to MARL frameworks, particularly where agent-specific experiences are imbalanced.
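A ridge-shaped reward of the kind the abstract contrasts with exponential penalties might look like the following minimal sketch; the functional form and weights are illustrative assumptions, not the paper's:

```python
def holding_reward(headway_dev, schedule_dev, a=1.0, b=1.0):
    """Illustrative ridge-shaped reward: highest along the ridge where
    both the deviation from the target headway and the deviation from
    the timetable are near zero, falling off smoothly (rather than
    exponentially) in either direction."""
    return 1.0 / (1.0 + a * abs(headway_dev) + b * abs(schedule_dev))

assert holding_reward(0.0, 0.0) == 1.0          # on the ridge
assert holding_reward(2.0, 0.0) < holding_reward(1.0, 0.0)  # smooth decay
```

Because the decay is polynomial rather than exponential, a bus that is moderately off-headway still receives a usable gradient toward schedule adherence instead of a near-zero reward.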

[409] MMSearch-Plus: Benchmarking Provenance-Aware Search for Multimodal Browsing Agents

Xijia Tao, Yihua Teng, Xinxing Su, Xinyu Fu, Jihao Wu, Chaofan Tao, Ziru Liu, Haoli Bai, Rui Liu, Lingpeng Kong

Main category: cs.AI

TL;DR: MMSearch-Plus is a 311-task benchmark requiring genuine multimodal reasoning through iterative image-text retrieval and cross-validation with retrieval noise, addressing limitations of existing benchmarks that can be solved with text-only heuristics.

DetailsMotivation: Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning because many tasks can be solved with text-only heuristics without vision-in-the-loop verification. There's a need for benchmarks that enforce true multimodal understanding through extraction and propagation of fine-grained visual cues.

Method: Created MMSearch-Plus benchmark with 311 tasks requiring extrapolation from spatial cues and temporal traces to out-of-image facts. Provided model-agnostic agent framework with standard browsing tools and Set-of-Mark (SoM) module that enables provenance-aware zoom-and-retrieve operations, crop subregions, and targeted image/text searches.

Result: Strongest system achieved 36.0% end-to-end accuracy. Integrating SoM produced consistent gains up to +3.9 points. Failure analysis revealed recurring errors in locating relevant webpages and distinguishing visually similar events.

Conclusion: MMSearch-Plus establishes a rigorous benchmark for advancing agentic multimodal large language models, highlighting challenges in real-world multimodal search and demonstrating the value of provenance-aware visual reasoning tools.

Abstract: Existing multimodal browsing benchmarks often fail to require genuine multimodal reasoning, as many tasks can be solved with text-only heuristics without vision-in-the-loop verification. We introduce MMSearch-Plus, a 311-task benchmark that enforces multimodal understanding by requiring extraction and propagation of fine-grained visual cues through iterative image-text retrieval and cross-validation under retrieval noise. Our curation procedure seeds questions whose answers require extrapolating from spatial cues and temporal traces to out-of-image facts such as events, dates, and venues. Beyond the dataset, we provide a model-agnostic agent framework with standard browsing tools and a set-of-mark (SoM) module, which lets the agent place marks, crop subregions, and launch targeted image/text searches. SoM enables provenance-aware zoom-and-retrieve and improves robustness in multi-step reasoning. We evaluated closed- and open-source MLLMs in this framework. The strongest system achieves an end-to-end accuracy of 36.0%, and integrating SoM produces consistent gains in multiple settings, with improvements up to +3.9 points. From failure analysis, we observe recurring errors in locating relevant webpages and distinguishing between visually similar events. These results underscore the challenges of real-world multimodal search and establish MMSearch-Plus as a rigorous benchmark for advancing agentic MLLMs.

[410] When Only the Final Text Survives: Implicit Execution Tracing for Multi-Agent Attribution

Yi Nian, Haosen Cao, Shenzhe Zhu, Henry Peng Zou, Qingqing Luan, Yue Zhao

Main category: cs.AI

TL;DR: IET enables token-level attribution and interaction topology reconstruction in multi-agent language systems using embedded keyed signals in generated text, allowing privacy-preserving auditing without execution logs.

DetailsMotivation: Multi-agent language systems lack accountability when execution logs and agent identifiers are unavailable, making it difficult to determine responsibility for incorrect or harmful outputs. Current systems obscure interaction topology and individual agent contributions.

Method: IET embeds agent-specific keyed signals into token distributions during generation, creating self-describing execution traces. A transition-aware scoring method detects agent handover points and reconstructs interaction graphs using only the generated text and a secret key.

Result: Experiments show IET accurately recovers agent segments and coordination structure while preserving generation quality, enabling effective privacy-preserving auditing for multi-agent systems.

Conclusion: IET provides a practical framework for accountability in multi-agent language systems without requiring execution logs, addressing the critical need for attribution and topology reconstruction in privacy-sensitive applications.

Abstract: When a multi-agent system produces an incorrect or harmful answer, who is accountable if execution logs and agent identifiers are unavailable? Multi-agent language systems increasingly rely on structured interactions such as delegation and iterative refinement, yet the final output often obscures the underlying interaction topology and agent contributions. We introduce IET (Implicit Execution Tracing), a metadata-independent framework that enables token-level attribution directly from generated text and a simple mechanism for interaction topology reconstruction. During generation, agent-specific keyed signals are embedded into the token distribution, transforming the text into a self-describing execution trace detectable only with a secret key. At detection time, a transition-aware scoring method identifies agent handover points and reconstructs the interaction graph. Experiments show that IET recovers agent segments and coordination structure with high accuracy while preserving generation quality, enabling privacy-preserving auditing for multi-agent language systems.
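The keyed-signal idea can be illustrated with a green-list watermark in the style of Kirchenbauer et al.: each agent's generation is biased toward a pseudorandom, key-derived token subset, and the detector scores a segment against every agent's subset. This toy sketch (vocabulary size, list fraction, and the max-vote detector are all illustrative assumptions, and a real scheme biases logits rather than emitting only green tokens) shows attribution from the text and the secret key alone:

```python
import hashlib
import random

VOCAB = 1000  # toy vocabulary size

def green_list(key, agent_id, frac=0.5):
    """Derive an agent-specific pseudorandom token subset from the
    secret key; only a holder of `key` can reconstruct it."""
    seed = int.from_bytes(
        hashlib.sha256(f"{key}:{agent_id}".encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return set(rng.sample(range(VOCAB), int(frac * VOCAB)))

def attribute(tokens, key, agents):
    """Score a token segment against each agent's green list and
    return the best-matching agent (toy metadata-free detector)."""
    lists = {a: green_list(key, a) for a in agents}
    return max(agents, key=lambda a: sum(t in lists[a] for t in tokens))

# A segment drawn from "planner"'s green list is attributed back to
# that agent using only the generated tokens and the key.
segment = sorted(green_list("secret", "planner"))[:50]
assert attribute(segment, "secret", ["planner", "critic"]) == "planner"
```

Detecting handover points, as in IET's transition-aware scoring, would amount to sliding this scorer over the text and marking positions where the best-matching agent changes.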

[411] An Order-Sensitive Conflict Measure for Random Permutation Sets

Ruolan Cheng, Yong Deng

Main category: cs.AI

TL;DR: A new conflict measure for Random Permutation Sets (RPS) that accounts for order information, developed from both Random Finite Set and Dempster-Shafer Theory perspectives, with order-sensitive quantification of evidence conflicts.

DetailsMotivation: Measuring conflict between evidence represented by permutation mass functions in Random Permutation Sets is an open issue in order-dependent uncertain information fusion. Current methods don't adequately handle the qualitative propensity where higher-ranked elements are more significant.

Method: Analyzes conflicts in RPS from Random Finite Set and Dempster-Shafer Theory perspectives. Defines a non-overlap-based inconsistency measure for permutations and develops an order-sensitive conflict measure that reformulates conflict as a graded, order-dependent notion rather than a simple dichotomy.

Result: Proposed method exhibits inherent top-weightedness property, effectively quantifies conflict between RPSs within DST framework, and provides decision-makers with flexibility in selecting weights, parameters, and truncation depths. Numerical examples validate behavior and properties.

Conclusion: The paper addresses the open issue of conflict measurement in Random Permutation Sets by developing an order-sensitive conflict measure that captures graded, order-dependent conflicts rather than simple binary conflict/non-conflict classifications.

Abstract: Random permutation set (RPS) is a new formalism for reasoning with uncertainty involving order information. Measuring the conflict between two pieces of evidence represented by permutation mass functions remains an open issue in order-dependent uncertain information fusion. This paper analyzes conflicts in RPS from two different perspectives: random finite set (RFS) and Dempster-Shafer theory (DST). From the DST perspective, the order information incorporated into focal sets reflects a qualitative propensity where higher-ranked elements are more significant. Motivated by this view and observations on permutations, we define a non-overlap-based inconsistency measure for permutations and develop an order-sensitive conflict measure for RPSs. The proposed method reformulates the conflict in RPSs as a graded, order-dependent notion rather than a simple dichotomy of conflict versus non-conflict. Numerical examples are presented to validate the behavior and properties of the proposed conflict measure. The proposed method not only exhibits an inherent top-weightedness property and effectively quantifies conflict between RPSs within the DST framework, but also provides decision-makers with flexibility in selecting weights, parameters, and truncation depths.

[412] AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng

Main category: cs.AI

TL;DR: Summary unavailable for 2511.23253; the arXiv export API returned HTTP 429 (rate limited).

[413] Memory Bear AI A Breakthrough from Memory to Cognition Toward Artificial General Intelligence

Deliang Wen, Ke Sun

Main category: cs.AI

TL;DR: Summary unavailable for 2512.20651; the arXiv export API returned HTTP 429 (rate limited).

[414] Developing a Discrete-Event Simulator of School Shooter Behavior from VR Data

Christopher A. McClurg, Alan R. Wagner

Main category: cs.AI

TL;DR: Summary unavailable for 2602.06023; the arXiv export API returned HTTP 429 (rate limited).

[415] From Logs to Language: Learning Optimal Verbalization for LLM-Based Recommendation at Industry Scale

Yucheng Shi, Ying Li, Yu Wang, Yesu Feng, Arjun Rao, Rein Houthooft, Shradha Sehgal, Jin Wang, Hao Zhen, Ninghao Liu, Linas Baltrunas

Main category: cs.AI

TL;DR: Summary unavailable for 2602.20558; the arXiv export API returned HTTP 429 (rate limited).

[416] CIRCLE: A Framework for Evaluating AI from a Real-World Lens

Reva Schwartz, Carina Westling, Morgan Briggs, Marzieh Fadaee, Isar Nejadgholi, Matthew Holmes, Fariza Rashid, Maya Carlyle, Afaf Taïk, Kyra Wilson, Peter Douglas, Theodora Skeadas, Gabriella Waters, Rumman Chowdhury, Thiago Lacerda

Main category: cs.AI

TL;DR: Summary unavailable for 2602.24055; the arXiv export API returned HTTP 429 (rate limited).

[417] AI4S-SDS: A Neuro-Symbolic Solvent Design System via Sparse MCTS and Differentiable Physics Alignment

Jiangyu Chen

Main category: cs.AI

TL;DR: Summary unavailable for 2603.03686; the arXiv export API returned HTTP 429 (rate limited).

[418] Offline Materials Optimization with CliqueFlowmer

Jakub Grudzien Kuba, Benjamin Kurt Miller, Sergey Levine, Pieter Abbeel

Main category: cs.AI

TL;DR: Summary unavailable for 2603.06082; the arXiv export API returned HTTP 429 (rate limited).

[419] MEMO: Memory-Augmented Model Context Optimization for Robust Multi-Turn Multi-Agent LLM Games

Yunfei Xie, Kevin Wang, Bobby Cheng, Jianzhu Yao, Zhizhou Sha, Alexander Duffy, Yihan Xi, Hongyuan Mei, Cheston Tan, Chen Wei, Pramod Viswanath, Zhangyang Wang

Main category: cs.AI

TL;DR: Summary unavailable for 2603.09022; the arXiv export API returned HTTP 429 (rate limited).

[420] MedMASLab: A Unified Orchestration Framework for Benchmarking Multimodal Medical Multi-Agent Systems

Yunhang Qian, Xiaobin Hu, Jiaquan Yu, Siyang Xin, Xiaokun Chen, Jiangning Zhang, Peng-Tao Jiang, Jiawei Liu, Hongwei Bran Li

Main category: cs.AI

TL;DR: Summary unavailable for 2603.09909; the arXiv export API returned HTTP 429 (rate limited).

[421] Reversible Lifelong Model Editing via Semantic Routing-Based LoRA

Haihua Luo, Xuming Ran, Tommi Kärkkäinen, Zhonghua Chen, Jiangrong Shen, Qi Xu, Fengyu Cong

Main category: cs.AI

TL;DR: Summary unavailable for 2603.11239; the arXiv export API returned HTTP 429 (rate limited).

[422] VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Xuanyu Zhu, Yuhao Dong, Rundong Wang, Yang Shi, Zhipeng Wu, Yinlun Peng, YiFan Zhang, Yihang Lou, Yuanxing Zhang, Ziwei Liu, Yan Bai, Yuan Zhou

Main category: cs.AI

TL;DR: VTC-Bench: A comprehensive benchmark for evaluating multimodal LLMs’ tool-use proficiency with 32 OpenCV visual operations, 680 problems, and structured cognitive hierarchy to assess multi-tool composition and long-horizon planning.

DetailsMotivation: Existing benchmarks for MLLMs have limited tool-sets and simple tool-use trajectories, failing to capture complex tool interactions needed for real-world visual tasks. There's a gap in evaluating models' ability to compose diverse tools and execute multi-step plans.

Method: Created VisualToolChain-Bench (VTC-Bench) with 32 diverse OpenCV-based visual operations, 680 curated problems across nine cognitive hierarchy categories, each with ground-truth execution trajectories for precise evaluation.

Result: Evaluation of 19 leading MLLMs shows critical limitations: models struggle with diverse tool-sets, generalization to unseen operations, and multi-tool composition. Gemini-3.0-Pro achieved only 51% accuracy, revealing heavy reliance on familiar but suboptimal functions.

Conclusion: VTC-Bench identifies fundamental challenges in current MLLMs’ visual agentic capabilities and establishes a rigorous baseline for developing more generalized visual agentic models that can effectively compose and execute complex tool-use plans.

Abstract: Recent advancements extend Multimodal Large Language Models (MLLMs) beyond standard visual question answering to utilizing external tools for advanced visual tasks. Despite this progress, precisely executing and effectively composing diverse tools for complex tasks remain a persistent bottleneck. Constrained by sparse tool-sets and simple tool-use trajectories, existing benchmarks fail to capture complex and diverse tool interactions, falling short in evaluating model performance under practical, real-world conditions. To bridge this gap, we introduce VisualToolChain-Bench (VTC-Bench), a comprehensive benchmark designed to evaluate tool-use proficiency in MLLMs. To align with realistic computer vision pipelines, our framework features 32 diverse OpenCV-based visual operations. This rich tool-set enables extensive combinations, allowing VTC-Bench to rigorously assess multi-tool composition and long-horizon, multi-step plan execution. For precise evaluation, we provide 680 curated problems structured across a nine-category cognitive hierarchy, each with ground-truth execution trajectories. Extensive experiments on 19 leading MLLMs reveal critical limitations in current models’ visual agentic capabilities. Specifically, models struggle to adapt to diverse tool-sets and generalize to unseen operations, with the leading model Gemini-3.0-Pro achieving only 51% on our benchmark. Furthermore, multi-tool composition remains a persistent challenge. When facing complex tasks, models struggle to formulate efficient execution plans, relying heavily on a narrow, suboptimal subset of familiar functions rather than selecting the optimal tools. By identifying these fundamental challenges, VTC-Bench establishes a rigorous baseline to guide the development of more generalized visual agentic models.
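The tool-chaining setup the benchmark evaluates can be sketched as a registry of named image operations plus an executor that replays a ground-truth-style trajectory step by step. This is a minimal illustrative sketch, not the benchmark's actual harness: the tool names (`flip_h`, `crop`, `invert`), the `run_trajectory` function, and the list-of-lists image representation are all hypothetical stand-ins for the 32 OpenCV operations the paper uses.

```python
# Hypothetical registry of visual operations, operating on images
# represented as nested lists of pixel values (stand-in for OpenCV arrays).
TOOLS = {
    "flip_h": lambda img: [row[::-1] for row in img],          # horizontal mirror
    "crop":   lambda img, y0, y1, x0, x1: [row[x0:x1] for row in img[y0:y1]],
    "invert": lambda img: [[255 - v for v in row] for row in img],
}

def run_trajectory(img, trajectory):
    """Apply each (tool_name, kwargs) step in order; fail fast on unknown tools.

    A benchmark like VTC-Bench could compare a model's proposed trajectory
    against a ground-truth one by executing both and checking the outputs.
    """
    for name, kwargs in trajectory:
        if name not in TOOLS:
            raise KeyError(f"unknown tool: {name}")
        img = TOOLS[name](img, **kwargs)
    return img
```

Executing trajectories rather than string-matching tool names is what makes ground-truth comparison robust here: two different tool sequences that produce the same image can be scored as equivalent.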

[423] Listening to the Echo: User-Reaction Aware Policy Optimization via Scalar-Verbal Hybrid Reinforcement Learning

Jing Ye, Xinpei Zhao, Lu Xiang, Yaping Zhang, Chengqing Zong

Main category: cs.AI

TL;DR: Summary unavailable for 2603.15434; the arXiv export API returned HTTP 429 (rate limited).

[424] Safety is Non-Compositional: A Formal Framework for Capability-Based AI Systems

Cosimo Spera

Main category: cs.AI

TL;DR: Summary unavailable for 2603.15973; the arXiv export API returned HTTP 429 (rate limited).

[425] From Workflow Automation to Capability Closure: A Formal Framework for Safe and Revenue-Aware Customer Service AI

Cosimo Spera

Main category: cs.AI

TL;DR: Summary unavailable for 2603.15978; the arXiv export API returned HTTP 429 (rate limited).

[426] CADGL: Context-Aware Deep Graph Learning for Predicting Drug-Drug Interactions

Azmine Toushik Wasi, Taki Hasan Rafi, Raima Islam, Serbetar Karlo, Dong-Kyu Chae

Main category: cs.AI

TL;DR: Summary unavailable for 2403.17210; the arXiv export API returned HTTP 429 (rate limited).

[427] ARISE: Agent Reasoning with Intrinsic Skill Evolution in Hierarchical Reinforcement Learning

Yu Li, Rui Miao, Zhengling Qi, Tian Lan

Main category: cs.AI

TL;DR: Summary unavailable for 2603.16060; the arXiv export API returned HTTP 429 (rate limited).

[428] Learning to Predict, Discover, and Reason in High-Dimensional Event Sequences

Hugo Math

Main category: cs.AI

TL;DR: Summary unavailable for 2603.16313; the arXiv export API returned HTTP 429 (rate limited).

[429] Nonstandard Errors in AI Agents

Ruijiang Gao, Steven Chong Xiao

Main category: cs.AI

TL;DR: Summary unavailable for 2603.16744; the arXiv export API returned HTTP 429 (rate limited).

[430] PLM-Net: Perception Latency Mitigation Network for Vision-Based Lateral Control of Autonomous Vehicles

Aws Khalil, Jaerock Kwon

Main category: cs.AI

TL;DR: Summary unavailable for 2407.16740; the arXiv export API returned HTTP 429 (rate limited).

[431] Cliqueformer: Model-Based Optimization with Structured Transformers

Jakub Grudzien Kuba, Pieter Abbeel, Sergey Levine

Main category: cs.AI

TL;DR: Summary unavailable for 2410.13106; the arXiv export API returned HTTP 429 (rate limited).

[432] Biased AI can Influence Political Decision-Making

Jillian Fisher, Shangbin Feng, Robert Aron, Thomas Richardson, Yejin Choi, Daniel W. Fisher, Jennifer Pan, Yulia Tsvetkov, Katharina Reinecke

Main category: cs.AI

TL;DR: Summary unavailable for 2410.06415; the arXiv export API returned HTTP 429 (rate limited).

[433] LLMIA: An Out-of-the-Box Index Advisor via In-Context Learning with LLMs

Xinxin Zhao, Xinmei Huang, Haoyang Li, Jing Zhang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Cuiping Li, Hong Chen

Main category: cs.AI

TL;DR: Summary unavailable for 2503.07884; the arXiv export API returned HTTP 429 (rate limited).

[434] Differentially Private Equilibrium Finding in Polymatrix Games

Mingyang Liu, Gabriele Farina, Asuman Ozdaglar

Main category: cs.AI

TL;DR: Summary unavailable for 2503.09538; the arXiv export API returned HTTP 429 (rate limited).

[435] Detecting and Mitigating DDoS Attacks with AI: A Survey

Alexandru Apostu, Silviu Gheorghe, Andrei Hîji, Nicolae Cleju, Andrei Pătraşcu, Cristian Rusu, Radu Ionescu, Paul Irofti

Main category: cs.AI

TL;DR: Summary unavailable for 2503.17867; the arXiv export API returned HTTP 429 (rate limited).

[436] OPUS-VFL: Incentivizing Optimal Privacy-Utility Tradeoffs in Vertical Federated Learning

Sindhuja Madabushi, Ahmad Faraz Khan, Haider Ali, Jin-Hee Cho

Main category: cs.AI

TL;DR: Summary unavailable for 2504.15995; the arXiv export API returned HTTP 429 (rate limited).

[437] A New Tractable Description Logic under Categorical Semantics

Chan Le Duc, Ludovic Brieulle

Main category: cs.AI

TL;DR: Summary unavailable for 2505.08916; the arXiv export API returned HTTP 429 (rate limited).

[438] Online Fair Division with Additional Information

Tzeh Yuan Neoh, Jannik Peters, Nicholas Teh

Main category: cs.AI

TL;DR: Summary unavailable for 2505.24503; the arXiv export API returned HTTP 429 (rate limited).

[439] Size-adaptive Hypothesis Testing for Fairness

Antonio Ferrara, Francesco Cozzi, Alan Perotti, André Panisson, Francesco Bonchi

Main category: cs.AI

TL;DR: Summary unavailable for 2506.10586; the arXiv export API returned HTTP 429 (rate limited).

[440] Quantifying Student Success with Generative AI: A Monte Carlo Simulation Informed by Systematic Review

Seyma Yaman Kayadibi

Main category: cs.AI

TL;DR: Summary unavailable for 2507.01062; the arXiv export API returned HTTP 429 (rate limited).

[441] Surrogate Model for Heat Transfer Prediction in Impinging Jet Arrays using Dynamic Inlet/Outlet and Flow Rate Control

Mikael Vaillant, Victor Oliveira Ferreira, Wiebke Mainville, Jean-Michel Lamarre, Vincent Raymond, Moncef Chioua, Bruno Blais

Main category: cs.AI

TL;DR: Summary unavailable for 2507.07034; the arXiv export API returned HTTP 429 (rate limited).

[442] When Validation Fails: Cross-Institutional Blood Pressure Prediction and the Limits of Electronic Health Record-Based Models

Md Basit Azam, Sarangthem Ibotombi Singh

Main category: cs.AI

TL;DR: Summary unavailable for 2507.19530; the arXiv export API returned HTTP 429 (rate limited).

[443] Unsupervised Learning for Inverse Problems in Computed Tomography

Laura Hellwege, Johann Christopher Engster, Moritz Schaar, Thorsten M. Buzug, Maik Stille

Main category: cs.AI

TL;DR: Summary unavailable for 2508.05321; the arXiv export API returned HTTP 429 (rate limited).

[444] VeriEquivBench: An Equivalence Score for Ground-Truth-Free Evaluation of Formally Verifiable Code

Lingfei Zeng, Fengdi Che, Xuhan Huang, Fei Ye, Xu Xu, Binhang Yuan, Jie Fu

Main category: cs.AI

TL;DR: Summary unavailable for 2510.06296; the arXiv export API returned HTTP 429 (rate limited).

[445] Manual2Skill++: Connector-Aware General Robotic Assembly from Instruction Manuals via Vision-Language Models

Chenrui Tie, Shengxiang Sun, Yudi Lin, Yanbo Wang, Zhongrui Li, Zhouhan Zhong, Jinxuan Zhu, Yiman Pang, Haonan Chen, Junting Chen, Ruihai Wu, Lin Shao

Main category: cs.AI

TL;DR: Summary unavailable for 2510.16344; the arXiv export API returned HTTP 429 (rate limited).

[446] iSeal: Encrypted Fingerprinting for Reliable LLM Ownership Verification

Zixun Xiong, Gaoyi Wu, Qingyang Yu, Mingyu Derek Ma, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang

Main category: cs.AI

TL;DR: Summary unavailable for 2511.08905; the arXiv export API returned HTTP 429 (rate limited).

[447] Membership Inference Attack against Large Language Model-based Recommendation Systems: A New Distillation-based Paradigm

Li Cuihong, Huang Xiaowen, Yin Chuanhuan, Sang Jitao

Main category: cs.AI

TL;DR: Summary unavailable for 2511.14763; the arXiv export API returned HTTP 429 (rate limited).

[448] Evolved Sample Weights for Bias Mitigation: Effectiveness Depends on the Fairness Objective

Anil K. Saini, Jose Guadalupe Hernandez, Emily F. Wong, Debanshi Misra, Tiffani J. Bright, Jason H. Moore

Main category: cs.AI

TL;DR: Summary unavailable for 2511.20909; the arXiv export API returned HTTP 429 (rate limited).

[449] Cell-cell Communication Inference and Analysis: Biological Mechanisms, Computational Approaches, and Future Opportunities

Xiangzheng Cheng, Haili Huang, Ye Su, Qing Nie, Xiufen Zou, Suoqin Jin

Main category: cs.AI

TL;DR: Summary unavailable for 2512.03497; the arXiv export API returned HTTP 429 (rate limited).

[450] Heads collapse, features stay: Why Replay needs big buffers

Giulia Lanzillotta, Damiano Meier, Thomas Hofmann

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2512.07400 returned HTTP 429 (rate limited).

[451] Neuron-Guided Interpretation of Code LLMs: Where, Why, and How?

Zhe Yin, Xiaodong Gu, Beijun Shen

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2512.19980 returned HTTP 429 (rate limited).

[452] STELLAR: Structure-guided LLM Assertion Retrieval and Generation for Formal Verification

Saeid Rajabi, Chengmo Yang, Satwik Patnaik

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2601.19903 returned HTTP 429 (rate limited).

[453] Sheaf Neural Networks and biomedical applications

Aneeqa Mehrab, Jan Willem Van Looy, Pietro Demurtas, Stefano Iotti, Emil Malucelli, Francesca Rossi, Ferdinando Zanchetta, Rita Fioresi

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.00159 returned HTTP 429 (rate limited).

[454] Gender Dynamics and Homophily in a Social Network of LLM Agents

Faezeh Fadaei, Jenny Carla Moran, Taha Yasseri

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.02606 returned HTTP 429 (rate limited).

[455] Optimal rates for density and mode estimation with expand-and-sparsify representations

Kaushik Sinha, Christopher Tosh

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.06175 returned HTTP 429 (rate limited).

[456] Krause Synchronization Transformers

Jingkun Liu, Yisong Yue, Max Welling, Yue Song

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.11534 returned HTTP 429 (rate limited).

[457] SF-RAG: Structure-Fidelity Retrieval-Augmented Generation for Academic Question Answering

Rui Yu, Tianyi Wang, Ruixia Liu, Yinglong Wang

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.13647 returned HTTP 429 (rate limited).

[458] Stable Deep Reinforcement Learning via Isotropic Gaussian Representations

Ali Saheb Pasand, Johan Obando-Ceron, Aaron Courville, Pouya Bashivan, Pablo Samuel Castro

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.19373 returned HTTP 429 (rate limited).

[459] Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Yongzhong Xu

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2602.23696 returned HTTP 429 (rate limited).

[460] Theory of Code Space: Do Code Agents Understand Software Architecture?

Grigory Sapunov

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.00601 returned HTTP 429 (rate limited).

[461] Towards Efficient and Stable Ocean State Forecasting: A Continuous-Time Koopman Approach

Rares Grozavescu, Pengyu Zhang, Mark Girolami, Etienne Meunier

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.05560 returned HTTP 429 (rate limited).

[462] Adversarial Latent-State Training for Robust Policies in Partially Observable Domains

Angad Singh Ahuja

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.07313 returned HTTP 429 (rate limited).

[463] Interleaving Scheduling and Motion Planning with Incremental Learning of Symbolic Space-Time Motion Abstractions

Elisa Tosello, Arthur Bit-Monnot, Davide Lusuardi, Alessandro Valentini, Andrea Micheli

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.10651 returned HTTP 429 (rate limited).

[464] Thousand-GPU Large-Scale Training and Optimization Recipe for AI-Native Cloud Embodied Intelligence Infrastructure

Yongjian Guo, Yunxuan Ma, Haoran Sun, Zhong Guan, Shuai Di, Jing Long, Wanting Xu, Xiaodong Bai, Wen Huang, Yucheng Guo, Chen Zhou, Qiming Yang, Mingxi Luo, Tianyun Zhao, Hedan Yang, Song Wang, Xiaomeng Tian, Xiaolong Xiang, Zhen Sun, Yu Wei, Luqiao Wang, Yuzhen Li, Chenfeng Gu, Junwu Xiong, Yicheng Gong

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.11101 returned HTTP 429 (rate limited).

[465] WebWeaver: Breaking Topology Confidentiality in LLM Multi-Agent Systems with Stealthy Context-Based Inference

Zixun Xiong, Gaoyi Wu, Lingfeng Yao, Miao Pan, Xiaojiang Du, Hao Wang

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.11132 returned HTTP 429 (rate limited).

[466] Representation Finetuning for Continual Learning

Haihua Luo, Xuming Ran, Tommi Kärkkäinen, Huiyan Xue, Zhonghua Chen, Qi Xu, Fengyu Cong

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.11201 returned HTTP 429 (rate limited).

[467] WORKSWORLD: A Domain for Integrated Numeric Planning and Scheduling of Distributed Pipelined Workflows

Taylor Paul, William Regli

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.12214 returned HTTP 429 (rate limited).

[468] PREBA: Surgical Duration Prediction via PCA-Weighted Retrieval-Augmented LLMs and Bayesian Averaging Aggregation

Wanyin Wu, Kanxue Li, Baosheng Yu, Haoyun Zhao, Yibing Zhan, Dapeng Tao, Hua Jin

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.13275 returned HTTP 429 (rate limited).

[469] Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales

Yongzhong Xu

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.15678 returned HTTP 429 (rate limited).

[470] SCALE: Scalable Conditional Atlas-Level Endpoint transport for virtual cell perturbation prediction

Shuizhou Chen, Lang Yu, Kedu Jin, Songming Zhang, Hao Wu, Wenxuan Huang, Sheng Xu, Quan Qian, Qin Chen, Lei Bai, Siqi Sun, Zhangyang Gao

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.17380 returned HTTP 429 (rate limited).

[471] Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies

Utkarsh Grover, Ravi Ranjan, Mingyang Mao, Trung Tien Dong, Satvik Praveen, Zhenqi Wu, J. Morris Chang, Tinoosh Mohsenin, Yi Sheng, Agoritsa Polyzou, Eiman Kanjo, Xiaomin Lin

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.16952 returned HTTP 429 (rate limited).

[472] TDAD: Test-Driven Agentic Development - Reducing Code Regressions in AI Coding Agents via Graph-Based Impact Analysis

Pepe Alonso, Sergio Yovine, Victor A. Braberman

Main category: cs.AI

TL;DR: Summary unavailable: the arXiv API request for 2603.17973 returned HTTP 429 (rate limited).

cs.SD

[473] MOSS-TTS Technical Report

Yitian Gong, Botian Jiang, Yiwei Zhao, Yucheng Yuan, Kuangwei Chen, Yaozhou Jiang, Cheng Chang, Dong Hong, Mingshu Chen, Ruixiao Li, Yiyang Zhang, Yang Gao, Hanfu Chen, Ke Chen, Songlin Wang, Xiaogui Yang, Yuqian Zhang, Kexin Huang, ZhengYuan Lin, Kang Yu, Ziqi Chen, Jin Wang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu

Main category: cs.SD

TL;DR: MOSS-TTS is a scalable speech generation foundation model using discrete audio tokens and autoregressive modeling, featuring two complementary generators for different deployment scenarios with multilingual support and various control capabilities.

DetailsMotivation: To create a scalable speech generation foundation model that can handle diverse requirements including zero-shot voice cloning, token-level duration control, multilingual support, and stable long-form generation through a unified architecture.

Method: Uses MOSS-Audio-Tokenizer (causal Transformer tokenizer) to compress 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations. Two complementary generators: MOSS-TTS for structural simplicity and scalability, and MOSS-TTS-Local-Transformer with frame-local autoregressive module for efficiency and speaker preservation.

Result: Supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation across multilingual and open-domain settings.

Conclusion: MOSS-TTS provides a scalable recipe for speech generation foundation models with complementary architectures for different deployment needs, demonstrating strong capabilities in multilingual generation and various control mechanisms.

Abstract: This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
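The compression the tokenizer achieves can be sanity-checked with simple rate arithmetic. A sketch under assumed RVQ settings: the 24 kHz sample rate and 12.5 fps frame rate come from the report, but the codebook depth and size below are illustrative assumptions (the tokenizer is actually variable-bitrate).

```python
import math

# Back-of-the-envelope token-rate arithmetic for a causal RVQ audio tokenizer.
# Sample rate and frame rate are from the report; RVQ depth and codebook size
# are hypothetical placeholders.
sample_rate = 24_000      # audio samples per second (from the report)
frame_rate = 12.5         # token frames per second (from the report)
n_codebooks = 8           # hypothetical RVQ depth
codebook_size = 1024      # hypothetical entries per codebook

samples_per_frame = sample_rate / frame_rate                    # 1920 samples/frame
tokens_per_second = frame_rate * n_codebooks                    # 100 discrete tokens/s
bits_per_second = tokens_per_second * math.log2(codebook_size)  # 1000 bps (1 kbps)
```

Each 12.5 fps frame thus stands in for 1,920 raw samples, which is why autoregressive modeling over these tokens is tractable at long contexts.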

[474] Towards Interpretable Framework for Neural Audio Codecs via Sparse Autoencoders: A Case Study on Accent Information

Shih-Heng Wang, Tiantian Feng, Aditya Kommineni, Thanathai Lertpetchpun, Bowen Yi, Xuan Shi, Shrikanth Narayanan

Main category: cs.SD

TL;DR: SAEs decompose NAC representations into sparse activations to study accent encoding; DAC and SpeechTokenizer show highest interpretability; acoustic vs phonetic NACs differ in how they encode accent information.

DetailsMotivation: Neural Audio Codecs (NACs) are widely used but their encoding of linguistic/paralinguistic information is unclear. Improving interpretability is critical for sensitive applications, especially understanding how they encode challenging attributes like accent.

Method: Use Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. Focus on accent as a challenging paralinguistic attribute. Evaluate 4 NAC models under 16 SAE configurations using a relative performance index framework.

Result: DAC and SpeechTokenizer achieve highest interpretability. Acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, while phonetic-oriented NACs rely more on activation positions. Low-bitrate EnCodec variants show higher interpretability.

Conclusion: SAEs effectively improve interpretability of NAC representations for accent encoding. Different NAC architectures encode paralinguistic information differently, with implications for model selection in sensitive applications.

Abstract: Neural Audio Codecs (NACs) are widely adopted in modern speech systems, yet how they encode linguistic and paralinguistic information remains unclear. Improving the interpretability of NAC representations is critical for understanding and deploying them in sensitive applications. Hence, we employ Sparse Autoencoders (SAEs) to decompose dense NAC representations into sparse, interpretable activations. In this work, we focus on a challenging paralinguistic attribute-accent-and propose a framework to quantify NAC interpretability. We evaluate four NAC models under 16 SAE configurations using a relative performance index. Our results show that DAC and SpeechTokenizer achieve the highest interpretability. We further reveal that acoustic-oriented NACs encode accent information primarily in activation magnitudes of sparse representations, whereas phonetic-oriented NACs rely more on activation positions, and that low-bitrate EnCodec variants show higher interpretability.
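The SAE decomposition described above amounts to a single forward pass. A minimal sketch, in which the dimensions, random weights, and plain ReLU encoder are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

# Minimal sparse-autoencoder forward pass: an overcomplete ReLU encoder turns
# a dense codec representation into sparse activations, and a linear decoder
# reconstructs the input. During training, an L1 penalty on the activations
# (added to the reconstruction loss) is what pushes most of them to zero.
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                     # dense dim, overcomplete SAE dim

W_enc = rng.standard_normal((d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.standard_normal((d_model, d_sae)) * 0.1

def sae_forward(x):
    """Encode x into sparse activations h and reconstruct it as x_hat."""
    h = np.maximum(W_enc @ x + b_enc, 0.0)  # ReLU -> non-negative, mostly-zero codes
    return h, W_dec @ h

x = rng.standard_normal(d_model)
h, x_hat = sae_forward(x)
sparsity = float((h == 0).mean())           # fraction of inactive units
```

The paper's magnitude-versus-position finding maps directly onto `h`: acoustic-oriented codecs carry accent in how large the nonzero entries are, phonetic-oriented ones in which entries are nonzero.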

[475] Words at Play: Benchmarking Audio Pun Understanding in Large Audio-Language Models

Yuchen Su, Shaoxin Zhong, Yonghua Zhu, Ruofan Wang, Zijian Huang, Qiqi Wang, Na Zhao, Diana Benavides-Prado, Michael Witbrock

Main category: cs.SD

TL;DR: APUN-Bench: First benchmark for evaluating large audio language models on audio pun understanding, containing 4,434 annotated audio samples across three tasks: pun recognition, pun word location, and pun meaning inference.

DetailsMotivation: Puns present unique challenges for natural language understanding due to polysemy and phonetic ambiguity. While audio is central to human communication, datasets for spoken puns are scarce, leaving this modality underexplored in humor-aware AI systems.

Method: Created APUN-Bench with 4,434 audio samples annotated across three stages: pun recognition (identifying puns), pun word location (locating pun words), and pun meaning inference (interpreting pun meanings). Systematically evaluated 10 state-of-the-art large audio language models on this benchmark.

Result: Evaluation revealed substantial performance gaps in recognizing, localizing, and interpreting audio puns. Identified key challenges including positional biases in audio pun location and error cases in meaning inference.

Conclusion: APUN-Bench provides the first systematic resource for evaluating audio pun understanding in LALMs, offering actionable insights for advancing humor-aware audio intelligence and highlighting significant research gaps in multimodal language understanding.

Abstract: Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Within pun research, audio plays a central role in human communication except text and images, while datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.
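At evaluation time, a three-stage benchmark like this reduces to three per-task comparisons. A toy scoring sketch (the data and field names are hypothetical, not APUN-Bench's actual schema):

```python
# Toy scoring loop for a three-stage pun benchmark: recognition (binary),
# pun-word location, and meaning inference. Exact string match is used here
# for simplicity; real meaning inference would need a softer judge.
examples = [
    {"is_pun": True,  "pun_word": "flour", "meaning": "flower/flour homophone"},
    {"is_pun": False, "pun_word": None,    "meaning": None},
]
predictions = [
    {"is_pun": True,  "pun_word": "flour", "meaning": "flower/flour homophone"},
    {"is_pun": True,  "pun_word": "sun",   "meaning": "sun/son homophone"},
]

def task_accuracy(key):
    matches = [gold[key] == pred[key] for gold, pred in zip(examples, predictions)]
    return sum(matches) / len(matches)

scores = {task: task_accuracy(task) for task in ("is_pun", "pun_word", "meaning")}
```

Staging matters because the tasks are nested: a model that misses recognition cannot meaningfully be scored on location or meaning for that sample.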

[476] Few-shot Acoustic Synthesis with Multimodal Flow Matching

Amandine Brunetto

Main category: cs.SD

TL;DR: FLAC is a probabilistic few-shot acoustic synthesis method that generates plausible room impulse responses using flow-matching diffusion transformers conditioned on spatial, geometric, and acoustic cues.

DetailsMotivation: Existing neural acoustic field methods require dense audio measurements and costly per-scene training, while few-shot approaches are deterministic and fail to capture acoustic uncertainty. There's a need for data-efficient, robust acoustic synthesis that handles sparse context.

Method: FLAC uses a diffusion transformer trained with flow-matching objective to generate RIRs at arbitrary positions. It conditions on spatial, geometric, and acoustic cues and introduces AGREE embedding for geometry-consistent evaluation.

Result: FLAC outperforms state-of-the-art eight-shot baselines while using only a single context measurement (one-shot) on the AcousticRooms and Hearing Anything Anywhere datasets. It is the first application of generative flow matching to explicit RIR synthesis.

Conclusion: FLAC establishes a new direction for robust, data-efficient acoustic synthesis by probabilistically modeling RIR distributions with minimal scene context, enabling realistic audio generation for immersive environments.

Abstract: Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
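The flow-matching objective behind FLAC can be sketched in a few lines. The linear path and constant target velocity follow the standard rectified-flow construction; `velocity_net` is a zero-output placeholder for the conditional diffusion transformer, and all sizes are illustrative:

```python
import numpy as np

# One Monte Carlo sample of a flow-matching training objective. x1 plays the
# role of a data point (an RIR); x0 is noise; the model learns the velocity
# field that transports noise to data along straight-line paths.
rng = np.random.default_rng(0)
dim = 8

def velocity_net(x_t, t):
    # Hypothetical stand-in for the learned field v_theta(x_t, t, conditioning).
    return np.zeros_like(x_t)

def flow_matching_loss(x1):
    x0 = rng.standard_normal(dim)    # noise endpoint of the path
    t = rng.uniform()                # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # linear interpolation between noise and data
    target_v = x1 - x0               # constant target velocity along this path
    return float(np.mean((velocity_net(x_t, t) - target_v) ** 2))

loss = flow_matching_loss(rng.standard_normal(dim))
```

At sampling time the learned field is integrated from t = 0 to t = 1, mapping a noise draw to a plausible RIR, which is what makes the method probabilistic rather than deterministic.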

[477] Modeling Overlapped Speech with Shuffles

Matthew Wiesner, Samuele Cornell, Alexander Polok, Lucas Ondel Yang, Lukáš Burget, Sanjeev Khudanpur

Main category: cs.SD

TL;DR: A novel approach using shuffle products and partial order finite-state automata for alignment and speaker-attributed transcription of overlapped speech, enabling single-pass processing.

DetailsMotivation: To address the challenge of processing parallel streams of overlapped speech data, particularly for alignment and speaker-attributed transcription, which traditionally requires multiple passes or complex processing.

Method: Uses shuffle product and partial order finite-state automata (FSAs) to model overlapping sequences at subword, word, and phrase levels. Trains using total score on FSAs as loss function, marginalizing over all possible serializations. Imposes temporal constraints to reduce graph size and models (token, speaker) tuples directly for speaker attribution.

Result: The method enables one-pass alignment of multi-talker recordings and is evaluated on synthetic LibriSpeech overlaps. All algorithms are implemented using the k2/Icefall framework.

Conclusion: This represents the first algorithm enabling single-pass alignment of multi-talker recordings, providing an efficient approach for overlapped speech processing using finite-state automata and shuffle products.

Abstract: We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
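The shuffle product the paper builds on is easy to state by enumeration: every interleaving of two sequences that preserves each sequence's internal order. The partial-order FSA construction represents this set compactly instead of listing it, as the toy sketch below does for two word streams:

```python
# Enumerate the shuffle product of two label sequences. This is for
# illustration only; the set grows combinatorially, which is exactly why the
# paper scores it with FSAs rather than by enumeration.
def shuffle(a, b):
    if not a:
        return [b]
    if not b:
        return [a]
    return [a[:1] + rest for rest in shuffle(a[1:], b)] + \
           [b[:1] + rest for rest in shuffle(a, b[1:])]

# Two overlapping single-speaker word streams.
serializations = shuffle(("hi", "there"), ("bye",))
# The count is C(m+n, n): C(3, 1) = 3 interleavings here.
```

Marginalizing the loss over all such serializations is what lets training proceed without committing to one fixed ordering of the overlapped speakers.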

[478] GLAD: Global-Local Aware Dynamic Mixture-of-Experts for Multi-Talker ASR

Yujie Guo, Jiaming Zhou, Yuhang Jia, Shiwan Zhao, Yong Qin

Main category: cs.SD

TL;DR: GLAD architecture uses global-local fusion MoE strategy for multi-talker ASR, outperforming SOT-based approaches in overlapping speech scenarios.

DetailsMotivation: End-to-end multi-talker ASR struggles with overlapping speech because speaker-specific acoustic characteristics get diluted in deep network layers, making it hard to distinguish between speakers.

Method: Proposes Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture with novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection.

Result: GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches on LibriSpeechMix and CH109 datasets, showing exceptional robustness in challenging high-overlap scenarios.

Conclusion: First work to apply global-local fusion MoE strategy to MTASR, successfully addressing the bottleneck of diluted speaker characteristics in deep networks for overlapping speech recognition.

Abstract: End-to-end multi-talker automatic speech recognition (MTASR) faces significant challenges in accurately transcribing overlapping speech. A critical bottleneck is that speaker-specific acoustic characteristics, which are essential for distinguishing overlapping speech, are often diluted in deep network layers. To address this, we propose the Global-Local Aware Dynamic Mixture-of-Experts (GLAD) architecture. GLAD introduces a novel routing mechanism that dynamically fuses speaker-aware global context with fine-grained local acoustic details to adaptively guide expert selection. Experiments on the LibriSpeechMix and CH109 datasets demonstrate that GLAD significantly outperforms existing Serialized Output Training (SOT)-based MTASR approaches, exhibiting exceptional robustness in challenging, high-overlap scenarios. To the best of our knowledge, this is the first work to apply a global-local fusion MoE strategy to MTASR.

[479] Evaluating Hallucinations in Audio-Visual Multimodal LLMs with Spoken Queries under Diverse Acoustic Conditions

Hansol Park, Hoseong Ahn, Junwon Moon, Yejin Lee, Kyuhong Shim

Main category: cs.SD

TL;DR: Multimodal models show increased hallucination rates when queries are spoken rather than written, with error rates rising 3-6% for clean speech and up to 30% under noise, despite mitigation attempts.

DetailsMotivation: While multimodal hallucinations have been studied in image-text settings, the impact of spoken queries remains unexplored despite growing voice interfaces, creating a gap in understanding reliability of voice-based multimodal systems.

Method: Developed systematic pipeline to convert existing multimodal hallucination benchmarks into spoken-query versions while preserving original tasks/labels. Instantiated on RePOPE to create RePOPE-Spk with spoken audio queries under diverse input conditions including clean speech and environmental noise.

Result: Hallucinations significantly increase with spoken queries: error rates rise 3-6% with clean speech and up to 30% under environmental noise. Many-shot prompting and chain-of-thought reasoning provide only partial mitigation.

Conclusion: Spoken queries exacerbate multimodal hallucinations, highlighting reliability challenges for voice interfaces. Findings motivate new directions for building robust voice interface systems and evaluations.

Abstract: Hallucinations in multimodal models have been extensively studied using benchmarks that probe reliability in image-text query settings. However, the effect of spoken queries on multimodal hallucinations remains largely unexplored, despite the growing role of voice interfaces. In this paper, we introduce a systematic pipeline that converts existing multimodal hallucination benchmarks into spoken-query versions while preserving the original tasks and labels. We instantiate this pipeline on RePOPE and release RePOPE-Spk, where all queries are provided as spoken audio under diverse input conditions. Experimental results show that hallucinations escalate when queries are spoken rather than written: error rates increase by 3-6% with clean speech and by up to 30% under environmental noise. Furthermore, many-shot prompting and chain-of-thought reasoning provide only partial mitigation. Our findings motivate new directions for building reliable voice interface systems and evaluations.

[480] Fair-Gate: Fairness-Aware Interpretable Risk Gating for Sex-Fair Voice Biometrics

Yangyang Qu, Massimiliano Todisco, Chiara Galdi, Nicholas Evans

Main category: cs.SD

TL;DR: Fair-Gate: A fairness-aware framework that addresses sex-related performance gaps in voice biometric systems through risk extrapolation and interpretable feature routing

DetailsMotivation: Voice biometric systems often exhibit sex-related performance gaps even with high overall accuracy, due to demographic shortcut learning (spurious correlations between sex and identity) and feature entanglement (overlap between sex-linked acoustic variation and identity cues)

Method: Fair-Gate framework with two components: 1) risk extrapolation to reduce variation in speaker-classification risk across proxy sex groups, and 2) a local complementary gate that routes intermediate features into separate identity and sex branches, providing interpretable routing masks

Result: Experiments on VoxCeleb1 show Fair-Gate improves the utility-fairness trade-off, yielding more sex-fair automatic speaker verification (ASV) performance under challenging evaluation conditions

Conclusion: Fair-Gate effectively addresses both demographic shortcut learning and feature entanglement mechanisms in voice biometric systems, providing both fairness improvements and interpretability through explicit feature routing

Abstract: Voice biometric systems can exhibit sex-related performance gaps even when overall verification accuracy is strong. We attribute these gaps to two practical mechanisms: (i) demographic shortcut learning, where speaker classification training exploits spurious correlations between sex and speaker identity, and (ii) feature entanglement, where sex-linked acoustic variation overlaps with identity cues and cannot be removed without degrading speaker discrimination. We propose Fair-Gate, a fairness-aware and interpretable risk-gating framework that addresses both mechanisms in a single pipeline. Fair-Gate applies risk extrapolation to reduce variation in speaker-classification risk across proxy sex groups, and introduces a local complementary gate that routes intermediate features into an identity branch and a sex branch. The gate provides interpretability by producing an explicit routing mask that can be inspected to understand which features are allocated to identity versus sex-related pathways. Experiments on VoxCeleb1 show that Fair-Gate improves the utility–fairness trade-off, yielding more sex-fair ASV performance under challenging evaluation conditions.

cs.LG

[481] Frayed RoPE and Long Inputs: A Geometric Perspective

Davis Wertheimer, Aozhong Zhang, Derrick Liu, Penghang Yin, Naigang Wang

Main category: cs.LG

TL;DR: RoPE-ID: A modification to Rotary Positional Embedding that applies high-frequency RoPE to a subset of channels to enable better generalization to longer inputs by preserving key/query cluster separation and sink token functionality.

DetailsMotivation: RoPE causes performance breakdown when input length exceeds training length. Prior analyses identified that long inputs cause channels to rotate "out of distribution," but it wasn't clear how this extra rotation causes pathological behavior or relates to attention mechanisms.

Method: Through empirical and theoretical analysis, the authors develop a geometric understanding of attention behavior with RoPE. They find attention induces tight clustering of separated key and query latent point clouds, creating sink tokens. They propose RoPE-ID: apply RoPE with high frequency to a subset of channels to maintain key/query cluster separation for longer inputs.

Result: Demonstrated effectiveness of RoPE-ID for extended inputs using 1B and 3B parameter Transformers on LongBench and RULER information retrieval benchmarks, showing improved generalization to longer sequences.

Conclusion: RoPE-ID provides a straightforward modification to RoPE that enables attention layers to generalize to longer inputs out of the box by preserving the geometric structure necessary for proper attention mechanism functioning.

Abstract: Rotary Positional Embedding (RoPE) is a widely adopted technique for encoding position in language models, which, while effective, causes performance breakdown when input length exceeds training length. Prior analyses assert (rightly) that long inputs cause channels to rotate "out of distribution," but it is not clear how extra rotation relates to or causes pathological behavior. Through empirical and theoretical analysis we advance a unified geometric understanding of attention behavior with RoPE. We find that attention induces tight clustering of separated key and query latent point clouds, allowing for creation of sink tokens: placeholders that allow attention heads to avoid token mixing when not required. RoPE applied to longer inputs damages this key/query cluster separation, producing pathological behavior by inhibiting sink token functionality. From this geometric perspective, we propose RoPE-ID (In Distribution), a straightforward modification that allows attention layers to generalize to longer inputs out of the box: apply RoPE with high frequency to a subset of channels. We demonstrate the effectiveness of RoPE-ID for extended inputs using 1B and 3B parameter Transformers on the LongBench and RULER information retrieval benchmarks.
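
The channel-subset idea can be sketched on top of standard RoPE, which assigns geometrically decreasing frequencies to channel pairs. Here a subset of pairs is pinned to the highest frequency; the subset size and selection rule are assumptions for illustration, since the abstract only says "apply RoPE with high frequency to a subset of channels."

```python
import numpy as np

def rope_freqs(d_half, base=10000.0):
    # Standard RoPE: geometrically decreasing frequency per channel pair.
    return base ** (-np.arange(d_half) / d_half)

def rope_id_freqs(d_half, n_high, base=10000.0):
    # RoPE-ID sketch: pin the slowest-rotating pairs to the highest
    # frequency so their rotations stay in-distribution at long range.
    f = rope_freqs(d_half, base)
    f[-n_high:] = f[0]
    return f

def apply_rope(x, pos, freqs):
    # x: (..., 2 * len(freqs)), channels paired as (even, odd).
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = np.ones(8)
q_rot = apply_rope(q, pos=1000, freqs=rope_id_freqs(4, n_high=2))
print(np.isclose(np.linalg.norm(q_rot), np.linalg.norm(q)))  # rotation preserves norm
```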

[482] Engineering Verifiable Modularity in Transformers via Per-Layer Supervision

J. Clayton Kerce

Main category: cs.LG

TL;DR: Transformers resist surgical control due to distributed redundancy, but architectural interventions with per-layer supervision can expose hidden modularity for better interpretability and control.

DetailsMotivation: Current transformer models resist interpretability and control because distributed redundancy compensates for damage to individual components, making causal analysis difficult. This "Hydra effect" renders interpretability illusory - we can identify correlations but not predict or control causal roles.

Method: Combines three architectural interventions: 1) dual-stream processing separating token and contextual representations, 2) per-layer supervision providing independent gradient signals at each depth, and 3) gated attention regularizing toward discrete activation patterns. Uses per-layer supervision to train models that reveal modular structure.
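
Of the three interventions, per-layer supervision is the easiest to sketch: attach a small probe at every depth and sum the per-layer losses so each layer receives its own gradient signal. The toy forward pass below shows only this idea (no training loop, and the dual-stream and gated-attention components are omitted; all dimensions are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
D, C, L = 8, 3, 4                      # width, classes, depth (illustrative)
x, y = rng.normal(size=D), 1           # one input vector and its target class

def xent(logits, y):
    z = logits - logits.max()          # stable log-softmax cross-entropy
    return -(z[y] - np.log(np.exp(z).sum()))

layers = [rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(L)]
probes = [rng.normal(size=(D, C)) / np.sqrt(D) for _ in range(L)]

h, loss = x, 0.0
for W, P in zip(layers, probes):
    h = np.tanh(h @ W)                 # one toy layer (no attention/residual)
    loss += xent(h @ P, y)             # independent supervision at this depth
loss /= L
print(float(loss))
```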

Result: Models trained with per-layer supervision show ablation effects 5-23 times larger than architecturally identical controls, enabling 4 times greater control leverage on targeted behaviors. Ablation damage spreads widely (standard deviation 6.32% vs 0.63% without supervision), revealing which predictions depend on which circuits. Different tasks route through different attention heads, demonstrating functional reorganization.

Conclusion: Architectural interventions, particularly per-layer supervision, can transform interpretability from passive observation to active control by exposing hidden modularity in transformer models, enabling predictable manipulation of model behavior through targeted interventions.

Abstract: Transformers resist surgical control. Ablating an attention head identified as critical for capitalization produces minimal behavioral change because distributed redundancy compensates for damage. This Hydra effect renders interpretability illusory: we may identify components through correlation, but cannot predict or control their causal role. We demonstrate that architectural interventions can expose hidden modularity. Our approach combines dual-stream processing separating token and contextual representations, per-layer supervision providing independent gradient signal at each depth, and gated attention regularizing toward discrete activation patterns. When trained with per-layer supervision, models produce ablation effects 5 to 23 times larger than architecturally identical controls trained with standard objectives. This enables 4 times greater control leverage on targeted behaviors: scaling identified attention heads produces smooth, predictable changes in model output. The key finding is architectural. Without per-layer supervision, ablation damage concentrates near zero with low variance (Winograd standard deviation 0.63%). With per-layer supervision, effects spread widely (standard deviation 6.32%), revealing which predictions depend on which circuits. The larger variance is not measurement noise but the signature of unmasked modularity. We validate our approach through three components: engineered features that capture computational dynamics rather than vocabulary structure (validated by near-zero correlation with raw activation clustering), an architecture providing positive control for modularity, and causal experiments demonstrating functional reorganization where different tasks route through different attention heads. This establishes a methodology for transforming interpretability from passive observation to active control.

[483] InfoMamba: An Attention-Free Hybrid Mamba-Transformer Model

Youjin Wang, Jiaqiao Zhao, Rong Fu, Run Zhou, Ruizhe Zhang, Jiani Liang, Suisuai Cao, Feng Zhou

Main category: cs.LG

TL;DR: InfoMamba: A hybrid attention-free architecture combining selective state-space models with a concept bottleneck linear filtering layer and information-maximizing fusion to achieve efficient long-range dependency capture while maintaining near-linear scaling.

DetailsMotivation: Transformers have quadratic complexity limitations, while Mamba-style SSMs scale linearly but struggle with high-rank and synchronous global interactions. There's a need for models that balance fine-grained local modeling with long-range dependency capture under computational constraints.

Method: Proposes InfoMamba with: 1) Concept bottleneck linear filtering layer replacing token-level self-attention as minimal-bandwidth global interface; 2) Integration with selective recurrent stream through information-maximizing fusion (IMF) that dynamically injects global context into SSM dynamics; 3) Mutual-information-inspired objective to encourage complementary information usage.
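
The fusion step can be illustrated generically: summarize the sequence through a small set of global "concept" slots, then gate that summary back into the per-token stream. This is a stand-in sketch only; the actual IMF mechanism and its mutual-information objective are not specified in this summary, and all names and shapes below are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
T, D, K = 32, 16, 4
local = rng.normal(size=(T, D))                  # selective-SSM stream (per token)
concepts = rng.normal(size=(K, D)) / np.sqrt(D)  # low-bandwidth global slots

# Minimal-bandwidth global summary: project tokens onto K concept slots
# and read back a single context vector (a linear-filtering stand-in).
coeff = local @ concepts.T                       # (T, K) concept activations
g = coeff.mean(axis=0) @ concepts                # (D,) pooled global context

# Gated injection of the global context into the per-token stream.
W_gate = rng.normal(size=(D, D)) / np.sqrt(D)    # hypothetical gate weights
gate = sigmoid(local @ W_gate)                   # (T, D) per-token, per-channel
fused = local + gate * g                         # broadcast global context, gated

print(fused.shape)
```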

Result: Extensive experiments on classification, dense prediction, and non-vision tasks show InfoMamba consistently outperforms strong Transformer and SSM baselines, achieving competitive accuracy-efficiency trade-offs while maintaining near-linear scaling.

Conclusion: InfoMamba successfully addresses the limitations of both Transformers and SSMs by providing an efficient attention-free hybrid architecture that captures global dependencies while maintaining computational efficiency.

Abstract: Balancing fine-grained local modeling with long-range dependency capture under computational constraints remains a central challenge in sequence modeling. While Transformers provide strong token mixing, they suffer from quadratic complexity, whereas Mamba-style selective state-space models (SSMs) scale linearly but often struggle to capture high-rank and synchronous global interactions. We present a consistency boundary analysis that characterizes when diagonal short-memory SSMs can approximate causal attention and identifies structural gaps that remain. Motivated by this analysis, we propose InfoMamba, an attention-free hybrid architecture. InfoMamba replaces token-level self-attention with a concept bottleneck linear filtering layer that serves as a minimal-bandwidth global interface and integrates it with a selective recurrent stream through information-maximizing fusion (IMF). IMF dynamically injects global context into the SSM dynamics and encourages complementary information usage through a mutual-information-inspired objective. Extensive experiments on classification, dense prediction, and non-vision tasks show that InfoMamba consistently outperforms strong Transformer and SSM baselines, achieving competitive accuracy-efficiency trade-offs while maintaining near-linear scaling.

[484] Towards Differentiating Between Failures and Domain Shifts in Industrial Data Streams

Natalia Wojak-Strzelecka, Szymon Bobek, Grzegorz J. Nalepa, Jerzy Stefanowski

Main category: cs.LG

TL;DR: A method for distinguishing between failures and normal domain shifts in industrial data streams using changepoint detection, domain adaptation, and XAI for explainability.

DetailsMotivation: Current anomaly detection methods often incorrectly flag normal domain shifts (like processing new products) as failures, leading to false alarms. There's a need to distinguish between actual failures and healthy data distribution changes in industrial systems.

Method: Combines modified Page-Hinkley changepoint detector to identify domain shifts and potential failures, with supervised domain adaptation algorithms for online anomaly detection, plus an XAI component to help human operators differentiate between domain shifts and failures.
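
For reference, the classic (unmodified) Page-Hinkley test flags a sustained upward shift in a stream's running mean; the paper uses a modified variant whose details are not given here, so this sketch shows only the baseline algorithm.

```python
def page_hinkley(stream, delta=0.005, threshold=1.0):
    """Classic Page-Hinkley test for upward drift: track the cumulative
    deviation of samples from the running mean and flag a change when it
    rises by more than `threshold` above its running minimum."""
    mean, cum, cum_min = 0.0, 0.0, 0.0
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t          # running mean of the stream
        cum += x - mean - delta         # cumulative deviation (with slack)
        cum_min = min(cum_min, cum)
        if cum - cum_min > threshold:
            return t                    # 1-based index where change is flagged
    return None

# Mean shifts from 0 to 2 halfway through the stream.
stream = [0.0] * 50 + [2.0] * 50
print(page_hinkley(stream))             # → 51 (detected on the first shifted sample)
```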

Result: Method demonstrated on steel factory data stream, showing capability to distinguish between failures and normal domain shifts while maintaining practical system robustness.

Conclusion: The proposed approach effectively addresses the critical challenge of distinguishing failures from normal domain shifts in industrial applications, reducing false alarms while maintaining detection sensitivity.

Abstract: Anomaly and failure detection methods are crucial in identifying deviations from normal system operational conditions, which allows for actions to be taken in advance, usually preventing more serious damages. Long-lasting deviations indicate failures, while sudden, isolated changes in the data indicate anomalies. However, in many practical applications, changes in the data do not always represent abnormal system states. Such changes may be recognized incorrectly as failures, while being a normal evolution of the system, e.g. referring to characteristics of starting the processing of a new product, i.e. realizing a domain shift. Therefore, distinguishing between failures and such "healthy" changes in data distribution is critical to ensure the practical robustness of the system. In this paper, we propose a method that not only detects changes in the data distribution and anomalies but also allows us to distinguish between failures and normal domain shifts inherent to a given process. The proposed method consists of a modified Page-Hinkley changepoint detector for identification of the domain shift and possible failures and supervised domain-adaptation-based algorithms for fast, online anomaly detection. These two are coupled with an explainable artificial intelligence (XAI) component that aims at helping the human operator to finally differentiate between domain shifts and failures. The method is illustrated by an experiment on a data stream from the steel factory.

[485] Taming Epilepsy: Mean Field Control of Whole-Brain Dynamics

Ming Li, Ting Gao, Jingqiao Dua

Main category: cs.LG

TL;DR: Graph-Regularized Koopman Mean-Field Game (GK-MFG) framework integrates Reservoir Computing for Koopman operator approximation with APAC-Net to control epileptic seizure dynamics using EEG data with graph Laplacian constraints.

DetailsMotivation: Controlling high-dimensional neural dynamics during epileptic seizures is challenging due to nonlinear characteristics and complex brain connectivity. Existing methods struggle with the nonlinear nature and topological structure of brain networks.

Method: Proposes GK-MFG framework that: 1) Uses Reservoir Computing to approximate Koopman operator for linear embedding of EEG dynamics, 2) Employs APAC-Net for solving distributional control problems, 3) Imposes graph Laplacian constraints derived from Phase Locking Value to respect functional brain topology.
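
The Phase Locking Value used to derive the graph Laplacian is a standard quantity: the modulus of the mean phase-difference phasor between two channels. A minimal sketch (instantaneous phases are given directly here; in practice they would come from, e.g., a Hilbert transform of band-passed EEG):

```python
import numpy as np

def plv(phase_a, phase_b):
    # Phase Locking Value: modulus of the mean phase-difference phasor.
    return abs(np.mean(np.exp(1j * (phase_a - phase_b))))

t = np.linspace(0, 1, 500)
p1 = 2 * np.pi * 10 * t          # 10 Hz channel (instantaneous phase ramp)
p2 = p1 + 0.3                    # phase-locked to p1 at a constant offset
p3 = 2 * np.pi * 17 * t          # different frequency: not locked to p1

print(round(plv(p1, p2), 3))     # → 1.0 (perfect locking)
print(plv(p1, p3) < 0.1)         # → True (near-zero locking)

# A PLV adjacency over channels gives the graph Laplacian L = D - W;
# how GK-MFG imposes the constraint is detailed in the paper.
phases = np.stack([p1, p2, p3])
W = np.array([[plv(a, b) for b in phases] for a in phases])
np.fill_diagonal(W, 0.0)
L_graph = np.diag(W.sum(axis=1)) - W
print(np.allclose(L_graph.sum(axis=1), 0.0))   # Laplacian rows sum to zero
```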

Result: Achieves robust seizure suppression while respecting the functional topological structure of the brain through the graph-regularized approach.

Conclusion: The GK-MFG framework effectively controls epileptic seizure dynamics by combining Koopman operator theory with graph regularization, providing a principled approach for neural control that respects brain network structure.

Abstract: Controlling the high-dimensional neural dynamics during epileptic seizures remains a significant challenge due to the nonlinear characteristics and complex connectivity of the brain. In this paper, we propose a novel framework, namely Graph-Regularized Koopman Mean-Field Game (GK-MFG), which integrates Reservoir Computing (RC) for Koopman operator approximation with Alternating Population and Agent Control Network (APAC-Net) for solving distributional control problems. By embedding Electroencephalogram (EEG) dynamics into a linear latent space and imposing graph Laplacian constraints derived from the Phase Locking Value (PLV), our method achieves robust seizure suppression while respecting the functional topological structure of the brain.

[486] ALIGN: Adversarial Learning for Generalizable Speech Neuroprosthesis

Zhanqi Zhang, Shun Li, Bernardo L. Sabatini, Mikio Aoi, Gal Mishne

Main category: cs.LG

TL;DR: ALIGN is a session-invariant learning framework using multi-domain adversarial neural networks for semi-supervised cross-session adaptation in intracortical speech BCIs, improving generalization to unseen sessions.

DetailsMotivation: Intracortical BCIs for speech decoding suffer from performance degradation when models must generalize to new sessions without labeled data due to cross-session nonstationarities like electrode shifts, neural turnover, and changes in user strategy.

Method: ALIGN uses multi-domain adversarial neural networks with a feature encoder trained jointly with a phoneme classifier and domain classifier. Through adversarial optimization, the encoder preserves task-relevant information while suppressing session-specific cues.
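
Multi-domain adversarial networks are typically trained through a gradient-reversal layer (DANN-style); ALIGN's exact implementation may differ, so the following is a conceptual sketch of that mechanism: identity on the forward pass, sign-flipped gradient on the backward pass, so the encoder learns to confuse the session (domain) classifier while the phoneme classifier trains normally.

```python
import numpy as np

class GradReverse:
    """Gradient-reversal layer sketch: identity forward, flipped gradient
    backward. `lam` is an illustrative trade-off weight."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x                       # features pass through unchanged
    def backward(self, grad_out):
        return -self.lam * grad_out    # reversed gradient reaches the encoder

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0])
print(grl.forward(x))                       # → [ 1. -2.]
print(grl.backward(np.array([0.2, 0.4])))   # → [-0.1 -0.2]
```

In a full setup the reversed gradient flows from the domain classifier into the shared encoder, pushing the latent representation toward session invariance.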

Result: ALIGN generalizes consistently better to previously unseen sessions, improving both phoneme error rate and word error rate relative to baselines in intracortical speech decoding tasks.

Conclusion: Adversarial domain alignment is an effective approach for mitigating session-level distribution shift and enabling robust longitudinal BCI decoding.

Abstract: Intracortical brain-computer interfaces (BCIs) can decode speech from neural activity with high accuracy when trained on data pooled across recording sessions. In realistic deployment, however, models must generalize to new sessions without labeled data, and performance often degrades due to cross-session nonstationarities (e.g., electrode shifts, neural turnover, and changes in user strategy). In this paper, we propose ALIGN, a session-invariant learning framework based on multi-domain adversarial neural networks for semi-supervised cross-session adaptation. ALIGN trains a feature encoder jointly with a phoneme classifier and a domain classifier operating on the latent representation. Through adversarial optimization, the encoder is encouraged to preserve task-relevant information while suppressing session-specific cues. We evaluate ALIGN on intracortical speech decoding and find that it generalizes consistently better to previously unseen sessions, improving both phoneme error rate and word error rate relative to baselines. These results indicate that adversarial domain alignment is an effective approach for mitigating session-level distribution shift and enabling robust longitudinal BCI decoding.

[487] MST-Direct: Matching via Sinkhorn Transport for Multivariate Geostatistical Simulation with Complex Non-Linear Dependencies

Tchalies Bachmann Schmitz

Main category: cs.LG

TL;DR: MST-Direct: A novel geostatistical simulation method using Optimal Transport theory to preserve complex multivariate dependencies in geological data

DetailsMotivation: Traditional geostatistical methods (Gaussian Copula, LU Decomposition) fail to preserve complex non-linear dependencies in geological variables like bimodal distributions, step functions, and heteroscedastic relationships, assuming linear correlation structures instead.

Method: MST-Direct uses Optimal Transport theory with the Sinkhorn algorithm to directly match multivariate distributions while preserving spatial correlation structures. It processes all variables simultaneously as a single multidimensional vector for relational matching across the full joint space.
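
The Sinkhorn algorithm at the core of the method is standard: alternately rescale a Gibbs kernel so the transport plan matches both marginals. A toy sketch with uniform marginals (a stand-in for the paper's multivariate matching, not its actual pipeline):

```python
import numpy as np

def sinkhorn(C, eps=0.1, iters=200):
    """Entropy-regularized optimal transport via Sinkhorn iterations,
    between uniform marginals, for cost matrix C."""
    n, m = C.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-C / eps)               # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):
        v = b / (K.T @ u)              # alternate scalings to fit marginals
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

C = np.array([[0.0, 1.0], [1.0, 0.0]])
P = sinkhorn(C)
print(P.round(3))                       # mass concentrates on the cheap diagonal
print(np.allclose(P.sum(axis=1), 0.5))  # row marginals are matched
```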

Result: The method enables faithful reproduction of complex joint distribution patterns that traditional linear methods cannot capture, improving multivariate geostatistical simulation accuracy.

Conclusion: MST-Direct provides a superior approach for multivariate geostatistical simulation by using Optimal Transport to handle complex non-linear dependencies that traditional methods fail to preserve.

Abstract: Multivariate geostatistical simulation requires the faithful reproduction of complex non-linear dependencies among geological variables, including bimodal distributions, step functions, and heteroscedastic relationships. Traditional methods such as the Gaussian Copula and LU Decomposition assume linear correlation structures and often fail to preserve these complex joint distribution patterns. We propose MST-Direct (Matching via Sinkhorn Transport), a novel algorithm based on Optimal Transport theory that uses the Sinkhorn algorithm to directly match multivariate distributions while preserving spatial correlation structures. The method processes all variables simultaneously as a single multidimensional vector, enabling relational matching across the full joint space rather than relying on pairwise linear dependencies.

[488] Adapting Methods for Domain-Specific Japanese Small LMs: Scale, Architecture, and Quantization

Takato Yasuno

Main category: cs.LG

TL;DR: Systematic methodology for building domain-specific Japanese small language models using QLoRA fine-tuning, focusing on optimal training scale, base-model selection, and architecture-aware quantization.

DetailsMotivation: To provide actionable guidance for building compact Japanese specialist language models that can run on consumer hardware, addressing the need for domain-specific Japanese models in low-resource technical domains.

Method: Three-stage methodology: 1) Scale-learning experiments to determine optimal training data size (1k-5k samples), 2) Comparison of Japanese LLMs to select best base model, 3) Architecture-aware quantization analysis to determine optimal quantization strategy for different model architectures.

Result: Identified optimal training scale at n=4,000 samples; found Llama-3 models with Japanese continual pre-training outperform multilingual models; discovered Llama-3 architectures improve under Q4_K_M quantization while GQA architectures degrade; recommended Swallow-8B Q4_K_M achieves 2.830/3 score with 8.9s/question and 4.9GB size.

Conclusion: The methodology provides practical guidance for building efficient Japanese domain-specific language models on consumer hardware, with Swallow-8B Q4_K_M as the recommended production model, and generalizes to low-resource technical domains.

Abstract: This paper presents a systematic methodology for building domain-specific Japanese small language models using QLoRA fine-tuning. We address three core questions: optimal training scale, base-model selection, and architecture-aware quantization. Stage 1 (Training scale): Scale-learning experiments (1k–5k samples) identify n=4,000 as optimal, where test-set NLL reaches minimum (1.127) before overfitting at 5k samples. Stage 2 (Compare finetuned SLMs): Comparing four Japanese LLMs shows that Llama-3 models with Japanese continual pre-training (Swallow-8B, ELYZA-JP-8B) outperform multilingual models (Qwen2.5-7B). Stage 3 (Quantization): Llama-3 architectures improve under Q4_K_M quantization, while GQA architectures degrade severely (Qwen2.5: -0.280 points). Production recommendation: Swallow-8B Q4_K_M achieves 2.830/3 score, 8.9 s/question, 4.9 GB size. The methodology generalizes to low-resource technical domains and provides actionable guidance for compact Japanese specialist LMs on consumer hardware.

[489] Quotient Geometry and Persistence-Stable Metrics for Swarm Configurations

Mark M. Bailey

Main category: cs.LG

TL;DR: Geometric framework for comparing multi-agent configurations using quotient formation spaces and persistence-stable signatures, with applications to satellite constellations and formation monitoring.

DetailsMotivation: Need for mathematically rigorous methods to compare and monitor multi-agent configurations (like satellite constellations) that are invariant to symmetries and relabelings, with stability guarantees for practical applications.

Method: Introduces quotient formation space $\mathcal{S}_n(M,G)=M^n/(G\times S_n)$ and formation matching metric $d_{M,G}$ that optimizes worst-case assignment error over ambient symmetries and relabelings. Uses Vietoris-Rips persistence to create stable signatures for configuration monitoring.
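
For intuition, the matching metric can be sketched by brute force in the simplest case, with the ambient symmetry group reduced to the identity: minimize over all relabelings the worst-case per-agent error. The paper's metric additionally optimizes over symmetries g in G, omitted here.

```python
import itertools
import numpy as np

def formation_distance(x, y):
    """Toy sketch of the formation matching metric with trivial symmetry
    group G = {id}: minimize the worst-case per-agent error over all
    relabelings (brute force; only feasible for small n)."""
    n = len(x)
    best = np.inf
    for perm in itertools.permutations(range(n)):
        worst = max(np.linalg.norm(x[i] - y[p]) for i, p in enumerate(perm))
        best = min(best, worst)
    return best

x = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([[1.0, 0.0], [0.0, 0.1]])   # labels swapped, one agent nudged
print(formation_distance(x, y))          # the swap is free; only the nudge counts
```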

Result: Proves metric properties (compactness, completeness, geodesic structure), establishes persistence stability bounds, analyzes signature expressivity, and demonstrates applications on $\mathbb{S}^2$ and $\mathbb{T}^m$ for satellite constellations.

Conclusion: Provides a theoretically sound geometric framework for multi-agent configuration comparison with stability guarantees, applicable to satellite constellations and formation monitoring with practical mathematical foundations.

Abstract: Swarm and constellation reconfiguration can be viewed as motion of an unordered point configuration in an ambient space. Here, we provide persistence-stable, symmetry-invariant geometric representations for comparing and monitoring multi-agent configuration data. We introduce a quotient formation space $\mathcal{S}_n(M,G)=M^n/(G\times S_n)$ and a formation matching metric $d_{M,G}$ obtained by optimizing a worst-case assignment error over ambient symmetries $g\in G$ and relabelings $\sigma\in S_n$. This metric is a structured, physically interpretable relaxation of Gromov–Hausdorff distance: the induced inter-agent metric spaces satisfy $d_{\mathrm{GH}}(X_x,X_y)\le d_{M,G}([x],[y])$. Composing this bound with stability of Vietoris–Rips persistence yields $d_B(\Phi_k([x]),\Phi_k([y]))\le d_{M,G}([x],[y])$, providing persistence-stable signatures for reconfiguration monitoring. We analyze the metric geometry of $(\mathcal{S}_n(M,G),d_{M,G})$: under compactness/completeness assumptions on $M$ and compact $G$ it is compact/complete and the metric induces the quotient topology; if $M$ is geodesic then the quotient is geodesic and exhibits stratified singularities along collision and symmetry strata, relating it to classical configuration spaces. We study expressivity of the signatures, identifying symmetry-mismatch and persistence-compression mechanisms for non-injectivity. Finally, in a phase-circle model we prove a conditional inverse theorem: under semicircle support and a gap-labeling margin, the $H_0$ signature is locally bi-Lipschitz to $d_{M,G}$ up to an explicit factor, yielding two-sided control. Examples on $\mathbb{S}^2$ and $\mathbb{T}^m$ illustrate satellite-constellation and formation settings.

[490] NANOZK: Layerwise Zero-Knowledge Proofs for Verifiable Large Language Model Inference

Zhaohui Geoffrey Wang

Main category: cs.LG

TL;DR: METHOD is a zero-knowledge proof system that cryptographically verifies LLM inference, ensuring users get outputs from the claimed model rather than cheaper substitutes or cached responses.

DetailsMotivation: Users of proprietary LLM APIs have no way to verify that outputs actually come from the claimed model. Service providers could substitute cheaper models, apply aggressive quantization, or return cached responses without detection, while charging premium prices for frontier capabilities.

Method: Exploits transformer architecture’s natural decomposition into independent layers for layerwise proof framework. Uses lookup table approximations for non-arithmetic operations (softmax, GELU, LayerNorm) with zero accuracy loss. Implements Fisher information-guided verification for practical scenarios where proving all layers is impractical.
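
The lookup-table idea can be sketched outside any proof system: quantize inputs to a fixed grid and precompute the non-arithmetic function once, so that inside a ZK circuit a table lookup replaces the tanh/erf evaluation. Grid bounds and step below are illustrative, not the paper's parameters (the paper reports zero measurable accuracy loss for its tables).

```python
import math

def gelu(x):
    # tanh approximation of GELU (one common transformer variant)
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi)
                                    * (x + 0.044715 * x ** 3)))

STEP, LO, HI = 1 / 256, -8.0, 8.0
TABLE = {k: gelu(LO + k * STEP) for k in range(int((HI - LO) / STEP) + 1)}

def gelu_lookup(x):
    # Clamp to the table range and snap to the nearest grid point.
    k = round((min(max(x, LO), HI) - LO) / STEP)
    return TABLE[k]

err = max(abs(gelu_lookup(v / 100) - gelu(v / 100)) for v in range(-500, 501))
print(err < 0.01)                      # worst-case error bounded by the grid step
```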

Result: Achieves constant-size layer proofs of 5.5KB (2.1KB attention + 3.5KB MLP) with 24 ms verification time for models up to d=128. Compared to EZKL, METHOD achieves 70x smaller proofs and 5.7x faster proving time while maintaining formal soundness guarantees (epsilon < 1e-37). Lookup approximations preserve model perplexity exactly.

Conclusion: NANOZK enables cryptographically verifiable LLM inference with practical efficiency, addressing the trust gap in proprietary LLM services while maintaining model accuracy and performance.

Abstract: When users query proprietary LLM APIs, they receive outputs with no cryptographic assurance that the claimed model was actually used. Service providers could substitute cheaper models, apply aggressive quantization, or return cached responses - all undetectable by users paying premium prices for frontier capabilities. We present NANOZK, a zero-knowledge proof system that makes LLM inference verifiable: users can cryptographically confirm that outputs correspond to the computation of a specific model. Our approach exploits the fact that transformer inference naturally decomposes into independent layer computations, enabling a layerwise proof framework where each layer generates a constant-size proof regardless of model width. This decomposition sidesteps the scalability barrier facing monolithic approaches and enables parallel proving. We develop lookup table approximations for non-arithmetic operations (softmax, GELU, LayerNorm) that introduce zero measurable accuracy loss, and introduce Fisher information-guided verification for scenarios where proving all layers is impractical. On transformer models up to d=128, NANOZK generates constant-size layer proofs of 5.5KB (2.1KB attention + 3.5KB MLP) with 24 ms verification time. Compared to EZKL, NANOZK achieves 70x smaller proofs and 5.7x faster proving time at d=128, while maintaining formal soundness guarantees (epsilon < 1e-37). Lookup approximations preserve model perplexity exactly, enabling verification without quality compromise.

[491] Fundamental Limits of Neural Network Sparsification: Evidence from Catastrophic Interpretability Collapse

Dip Roy, Rajiv Misra, Sanjay Kumar Singh

Main category: cs.LG

TL;DR: Extreme neural network sparsification (90% activation reduction) causes interpretable features to collapse despite stable global representation quality, with collapse scaling with dataset complexity.

DetailsMotivation: To understand whether interpretable features survive aggressive neural network compression (90% activation reduction) and investigate the fundamental limits of the sparsification-interpretability relationship.

Method: Uses hybrid VAE-SAE architectures with adaptive sparsity scheduling that progressively reduces active neurons from 500 to 50 over 50 epochs. Tests on dSprites and Shapes3D datasets with Top-k and L1 sparsification methods, measuring dead neuron rates and interpretability collapse.
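
The schedule and Top-k step described above can be sketched as follows. A linear anneal from 500 to 50 active neurons is assumed (the summary does not specify the schedule's shape), and Top-k is applied by magnitude:

```python
import numpy as np

def scheduled_k(epoch, total_epochs=50, k_start=500, k_end=50):
    """Anneal the number of active neurons from k_start to k_end.
    A linear schedule is an assumption; the paper's exact shape
    is not given in the summary."""
    frac = min(epoch / total_epochs, 1.0)
    return int(round(k_start + frac * (k_end - k_start)))

def top_k_mask(activations, k):
    """Top-k sparsification: keep the k largest-magnitude
    activations and zero out the rest."""
    idx = np.argsort(np.abs(activations))[-k:]
    mask = np.zeros_like(activations)
    mask[idx] = 1.0
    return activations * mask
```

Dead-neuron rate would then be measured as the fraction of units that never survive the mask over an evaluation pass.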

Result: Reveals a paradox: global representation quality (Mutual Information Gap) remains stable while local feature interpretability collapses systematically. Dead neuron rates reach 34.4% on dSprites and 62.7% on Shapes3D with Top-k, and worse with L1. Collapse scales with dataset complexity and is robust across training durations and threshold definitions.

Conclusion: Interpretability collapse under extreme sparsification is intrinsic to the compression process rather than an artifact of specific algorithms, training duration, or threshold choices, establishing fundamental limits to sparsification-interpretability tradeoffs.

Abstract: Extreme neural network sparsification (90% activation reduction) presents a critical challenge for mechanistic interpretability: understanding whether interpretable features survive aggressive compression. This work investigates feature survival under severe capacity constraints in hybrid Variational Autoencoder–Sparse Autoencoder (VAE-SAE) architectures. We introduce an adaptive sparsity scheduling framework that progressively reduces active neurons from 500 to 50 over 50 training epochs, and provide empirical evidence for fundamental limits of the sparsification-interpretability relationship. Testing across two benchmark datasets – dSprites and Shapes3D – with both Top-k and L1 sparsification methods, our key finding reveals a pervasive paradox: while global representation quality (measured by Mutual Information Gap) remains stable, local feature interpretability collapses systematically. Under Top-k sparsification, dead neuron rates reach $34.4\pm0.9\%$ on dSprites and $62.7\pm1.3\%$ on Shapes3D at k=50. L1 regularization – a fundamentally different “soft constraint” paradigm – produces equal or worse collapse: $41.7\pm4.4\%$ on dSprites and $90.6\pm0.5\%$ on Shapes3D. Extended training for 100 additional epochs fails to recover dead neurons, and the collapse pattern is robust across all tested threshold definitions. Critically, the collapse scales with dataset complexity: Shapes3D (RGB, 6 factors) shows $1.8\times$ more dead neurons than dSprites (grayscale, 5 factors) under Top-k and $2.2\times$ under L1. These findings establish that interpretability collapse under sparsification is intrinsic to the compression process rather than an artifact of any particular algorithm, training duration, or threshold choice.

[492] Investigating Faithfulness in Large Audio Language Models

Pooneh Mousavi, Lovenya Jain, Mirco Ravanelli, Cem Subakan

Main category: cs.LG

TL;DR: A framework to evaluate Chain-of-Thought faithfulness in Large Audio Language Models, revealing potential disconnect between reasoning and audio grounding.

DetailsMotivation: While Large Audio Language Models can generate Chain-of-Thought explanations, it's unclear whether these reasoning chains are faithful to the input audio and final predictions, raising concerns about multimodal grounding.

Method: Proposes a systematic framework with three audio faithfulness criteria (hallucination-free, holistic, attentive listening) and introduces a benchmark using audio and CoT interventions to assess faithfulness in models like Audio Flamingo 3 and Qwen2.5-Omni.

Result: Experiments reveal a potential multimodal disconnect: reasoning often aligns with final predictions but isn’t strongly grounded in audio, making it vulnerable to hallucinations and adversarial perturbations.

Conclusion: There’s a need for better audio grounding in LALMs’ reasoning chains, and the proposed framework provides tools to evaluate and improve CoT faithfulness in multimodal audio-language models.

Abstract: Large Audio Language Models (LALMs) integrate audio encoders with pretrained Large Language Models to perform complex multimodal reasoning tasks. While these models can generate Chain-of-Thought (CoT) explanations, the faithfulness of these reasoning chains remains unclear. In this work, we propose a systematic framework to evaluate CoT faithfulness in LALMs with respect to both the input audio and the final model prediction. We define three criteria for audio faithfulness: hallucination-free, holistic, and attentive listening. We also introduce a benchmark based on both audio and CoT interventions to assess faithfulness. Experiments on Audio Flamingo 3 and Qwen2.5-Omni suggest a potential multimodal disconnect: reasoning often aligns with the final prediction but is not always strongly grounded in the audio and can be vulnerable to hallucinations or adversarial perturbations.

[493] Lightweight Adaptation for LLM-based Technical Service Agent: Latent Logic Augmentation and Robust Noise Reduction

Yi Yu, Junzhuo Ma, Chenghuang Shen, Xingyan Liu, Jing Gu, Hangyi Sun, Guangquan Hu, Jianfeng Liu, Weiting Liu, Mingyue Pu, Yu Wang, Zhengdong Xiao, Rui Xie, Longjiu Luo, Qianrong Wang, Gurong Cui, Honglin Qiao, Wenlian Lu

Main category: cs.LG

TL;DR: Lightweight adaptation framework for LLMs in technical service domains using latent logic augmentation, noise reduction, and hybrid reward mechanisms to improve decision-making while reducing computational costs.

DetailsMotivation: LLM adaptation in technical service domains faces challenges: lack of explicit cognitive chains in human demonstrations, ambiguity from diverse valid responses, and prohibitive resource/time costs of standard training paradigms.

Method: Three-part framework: 1) Latent Logic Augmentation with Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation; 2) Robust Noise Reduction via dual-filtering to create Multiple Ground Truths dataset; 3) Lightweight Adaptation using Hybrid Reward mechanism combining LLM-based judge with lightweight relevance-based Reranker.
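
The dual-filtering and Hybrid Reward components above can be sketched minimally. The fusion rule, thresholds, and score ranges are assumptions (the summary does not specify them); `judge` and `reranker` stand in for the LLM-based judge and the lightweight Reranker:

```python
def hybrid_reward(judge_score, reranker_relevance, alpha=0.5):
    """Fuse an LLM-judge score with a lightweight reranker relevance
    score. Both are assumed normalized to [0, 1]; `alpha` is a
    hypothetical mixing weight, not the paper's."""
    return alpha * judge_score + (1.0 - alpha) * reranker_relevance

def dual_filter(candidates, judge, reranker, tau_j=0.7, tau_r=0.5):
    """Admit a response into the Multiple Ground Truths set only if it
    passes both the judge and the relevance filter (thresholds assumed)."""
    return [c for c in candidates
            if judge(c) >= tau_j and reranker(c) >= tau_r]
```

The intent is that the cheap reranker carries most of the reward signal at RL time, so the expensive judge is queried less often.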

Result: Empirical evaluations on real-world Cloud service tasks show framework achieves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction, with Hybrid Reward achieving alignment comparable to standard LLM-as-a-judge methods with reduced training time.

Conclusion: The lightweight adaptation framework effectively addresses LLM adaptation challenges in technical service domains by improving decision logic understanding, reducing noise, and maintaining performance while reducing computational costs.

Abstract: Adapting Large Language Models in complex technical service domains is constrained by the absence of explicit cognitive chains in human demonstrations and the inherent ambiguity arising from the diversity of valid responses. These limitations severely hinder agents from internalizing latent decision dynamics and generalizing effectively. Moreover, practical adaptation is often impeded by the prohibitive resource and time costs associated with standard training paradigms. To overcome these challenges and guarantee computational efficiency, we propose a lightweight adaptation framework comprising three key contributions. (1) Latent Logic Augmentation: We introduce Planning-Aware Trajectory Modeling and Decision Reasoning Augmentation to bridge the gap between surface-level supervision and latent decision logic. These approaches strengthen the stability of Supervised Fine-Tuning alignment. (2) Robust Noise Reduction: We construct a Multiple Ground Truths dataset through a dual-filtering method to reduce the noise by validating diverse responses, thereby capturing the semantic diversity. (3) Lightweight Adaptation: We design a Hybrid Reward mechanism that fuses an LLM-based judge with a lightweight relevance-based Reranker to distill high-fidelity reward signals while reducing the computational cost compared to standard LLM-as-a-Judge reinforcement learning. Empirical evaluations on real-world Cloud service tasks, conducted across semantically diverse settings, demonstrate that our framework achieves stability and performance gains through Latent Logic Augmentation and Robust Noise Reduction. Concurrently, our Hybrid Reward mechanism achieves alignment comparable to standard LLM-as-a-judge methods with reduced training time, underscoring the practical value for deploying technical service agents.

[494] Variational Phasor Circuits for Phase-Native Brain-Computer Interface Classification

Dibakar Sigdel

Main category: cs.LG

TL;DR: VPC is a deterministic classical learning architecture on the unit circle manifold using phase shifts and interference for classification tasks, inspired by quantum circuits but implemented classically.

DetailsMotivation: To create a mathematically principled alternative to dense neural computation using phase interference on the unit circle, inspired by variational quantum circuits but implemented as a classical architecture for efficient classification.

Method: Replace dense real-valued weight matrices with trainable phase shifts, local unitary mixing, and structured interference in complex space. Use single VPC blocks for compact phase-based decision boundaries and stacked compositions with inter-block pull-back normalization for deeper circuits.
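
A single phasor block of the kind described can be sketched in a few lines: lift real features onto the unit circle, apply a trainable phase shift, mix with a unitary matrix, and read out interference intensities. The DFT mixer and magnitude-squared readout are illustrative choices, not necessarily the paper's exact design:

```python
import numpy as np

def vpc_block(x, theta, mix):
    """One phasor block (sketch): encode features as unit phasors,
    apply a trainable phase shift `theta`, mix with a unitary matrix
    `mix`, and read out interference intensity."""
    z = np.exp(1j * (x + theta))   # lift onto the unit circle S^1
    z = mix @ z                    # local unitary mixing
    return np.abs(z) ** 2          # interference readout

d = 4
theta = np.zeros(d)
mix = np.fft.fft(np.eye(d)) / np.sqrt(d)  # unitary: scaled DFT matrix
out = vpc_block(np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2]), theta, mix)
```

With a pure frequency input, the interference concentrates all intensity in one output bin, which is the kind of compact phase-based decision boundary the model exploits; unitarity guarantees total intensity is preserved.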

Result: VPC can decode difficult mental-state classification tasks with competitive accuracy using substantially fewer trainable parameters than standard Euclidean baselines on synthetic brain-computer interface benchmarks.

Conclusion: Unit-circle phase interference is a practical alternative to dense neural computation, and VPC can serve as both a standalone classifier and a front-end encoding layer for future hybrid phasor-quantum systems.

Abstract: We present the \textbf{Variational Phasor Circuit (VPC)}, a deterministic classical learning architecture operating on the continuous $S^1$ unit circle manifold. Inspired by variational quantum circuits, VPC replaces dense real-valued weight matrices with trainable phase shifts, local unitary mixing, and structured interference in the ambient complex space. This phase-native design provides a unified method for both binary and multi-class classification of spatially distributed signals. A single VPC block supports compact phase-based decision boundaries, while stacked VPC compositions extend the model to deeper circuits through inter-block pull-back normalization. Using synthetic brain-computer interface benchmarks, we show that VPC can decode difficult mental-state classification tasks with competitive accuracy and substantially fewer trainable parameters than standard Euclidean baselines. These results position unit-circle phase interference as a practical and mathematically principled alternative to dense neural computation, and motivate VPC as both a standalone classifier and a front-end encoding layer for future hybrid phasor-quantum systems.

[495] SLEA-RL: Step-Level Experience Augmented Reinforcement Learning for Multi-Turn Agentic Training

Prince Zizhuang Wang, Shuli Jiang

Main category: cs.LG

TL;DR: SLEA-RL is a reinforcement learning framework that retrieves relevant experiences at each decision step in multi-turn agent tasks, using step-level observation clustering and a self-evolving experience library.

DetailsMotivation: Current LLM agents operate in isolation during training and existing experience-augmented methods retrieve experiences only once based on initial task descriptions, which becomes mismatched as episodes progress in multi-turn settings where observations change at every step.

Method: SLEA-RL uses three components: 1) step-level observation clustering to group structurally equivalent environmental states for efficient cluster-indexed retrieval, 2) a self-evolving experience library that distills successful strategies and failure patterns through score-based admission and rate-limited extraction, and 3) policy optimization with step-level credit assignment for fine-grained advantage estimation across multi-turn episodes.
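
The cluster-indexed retrieval and score-gated admission in components 1 and 2 can be sketched as below. The cluster key here (grouping by the set of observation field names) is a deliberately simple stand-in for the paper's structural clustering:

```python
from collections import defaultdict

class ExperienceLibrary:
    """Sketch of step-level, cluster-indexed experience retrieval
    with score-based admission. The featurizer is hypothetical."""

    def __init__(self, admit_threshold=0.8):
        self.clusters = defaultdict(list)
        self.admit_threshold = admit_threshold

    def cluster_key(self, obs):
        # Stand-in for structural clustering: group observations by
        # their sorted field names (the paper's scheme is richer).
        return tuple(sorted(obs))

    def add(self, obs, experience, score):
        if score >= self.admit_threshold:   # score-based admission
            self.clusters[self.cluster_key(obs)].append(experience)

    def retrieve(self, obs, k=3):
        # Retrieval conditioned on the *current* step's observation,
        # not the initial task description.
        return self.clusters[self.cluster_key(obs)][-k:]
```

The key point is that `retrieve` is called at every decision step, so the experiences tracked stay matched to the current environmental state as the episode progresses.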

Result: Experiments on long-horizon multi-turn agent benchmarks demonstrate that SLEA-RL achieves superior performance compared to various reinforcement learning baselines.

Conclusion: SLEA-RL effectively addresses the limitations of static experience retrieval in multi-turn agent tasks by enabling dynamic, step-level experience retrieval and self-evolving experience libraries.

Abstract: Large Language Model (LLM) agents have shown strong results on multi-turn tool-use tasks, yet they operate in isolation during training, failing to leverage experiences accumulated across episodes. Existing experience-augmented methods address this by organizing trajectories into retrievable libraries, but they retrieve experiences only once based on the initial task description and hold them constant throughout the episode. In multi-turn settings where observations change at every step, this static retrieval becomes increasingly mismatched as episodes progress. We propose SLEA-RL (Step-Level Experience-Augmented Reinforcement Learning), a framework that retrieves relevant experiences at each decision step conditioned on the current observation. SLEA-RL operates through three components: (i) step-level observation clustering that groups structurally equivalent environmental states for efficient cluster-indexed retrieval; (ii) a self-evolving experience library that distills successful strategies and failure patterns through score-based admission and rate-limited extraction; and (iii) policy optimization with step-level credit assignment for fine-grained advantage estimation across multi-turn episodes. The experience library evolves alongside the policy through semantic analysis rather than gradient updates. Experiments on long-horizon multi-turn agent benchmarks demonstrate that SLEA-RL achieves superior performance compared to various reinforcement learning baselines.

[496] Probabilistic Federated Learning on Uncertain and Heterogeneous Data with Model Personalization

Ratun Rahman, Dinh C. Nguyen

Main category: cs.LG

TL;DR: Meta-BayFL: A personalized probabilistic federated learning method combining meta-learning with Bayesian neural networks to handle data uncertainty and heterogeneity across clients.

DetailsMotivation: Conventional federated learning suffers from training degradation due to data uncertainty and heterogeneity across clients. Probabilistic approaches like Bayesian neural networks can help but introduce runtime, latency, and bandwidth overhead that hasn't been well-studied in federated settings.

Method: Combines meta-learning with Bayesian neural networks: (1) BNN-based client models incorporate uncertainty across hidden layers, (2) meta-learning with adaptive learning rates enables personalized updates for non-IID data, (3) unified probabilistic and personalized design improves global model aggregation robustness.

Result: Outperforms state-of-the-art methods including standard and personalized FL approaches (pFedMe, Ditto, FedFomo) with up to 7.42% higher test accuracy on CIFAR-10, CIFAR-100, and Tiny-ImageNet. Provides theoretical convergence analysis and computational cost evaluation.

Conclusion: Meta-BayFL effectively addresses data uncertainty and heterogeneity in federated learning through a probabilistic-personalized approach, demonstrating superior performance while considering practical deployment constraints.

Abstract: Conventional federated learning (FL) frameworks often suffer from training degradation due to data uncertainty and heterogeneity across local clients. Probabilistic approaches such as Bayesian neural networks (BNNs) can mitigate this issue by explicitly modeling uncertainty, but they introduce additional runtime, latency, and bandwidth overhead that has rarely been studied in federated settings. To address these challenges, we propose Meta-BayFL, a personalized probabilistic FL method that combines meta-learning with BNNs to improve training under uncertain and heterogeneous data. The framework is characterized by three main features: (1) BNN-based client models incorporate uncertainty across hidden layers to stabilize training on small and noisy datasets, (2) meta-learning with adaptive learning rates enables personalized updates that enhance local training under non-IID conditions, and (3) a unified probabilistic and personalized design improves the robustness of global model aggregation. We provide a theoretical convergence analysis and characterize the upper bound of the global model over communication rounds. In addition, we evaluate computational costs (runtime, latency, and communication) and discuss the feasibility of deployment on resource-constrained devices such as edge nodes and IoT systems. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show that Meta-BayFL consistently outperforms state-of-the-art methods, including both standard and personalized FL approaches (e.g., pFedMe, Ditto, FedFomo), with up to 7.42% higher test accuracy.

[497] Enhancing Reinforcement Learning Fine-Tuning with an Online Refiner

Hao Ma, Zhiqiang Pu, Yang Liu, Xiaolin Ai

Main category: cs.LG

TL;DR: Dynamic constraints for RL fine-tuning that adapt based on model capabilities, using a reference model as online refiner to generate minimally corrected outputs, enabling automatic constraint adjustment based on output quality.

DetailsMotivation: Traditional constraints in reinforcement learning fine-tuning (RFT) inherently conflict with optimization objectives - stronger constraints limit model's ability to discover better solutions, while weaker constraints risk degenerate outputs. There's a need for constraints that can adapt to the evolving capabilities of the fine-tuned model.

Method: Proposes dynamic constraints using a reference model as an online refiner. The refiner takes responses from the fine-tuned model and generates minimally corrected versions that preserve correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output, creating constraints that automatically strengthen or relax based on output quality.
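
The refiner's "minimal correction" contract can be illustrated at the token level. In the paper the refiner is the reference model itself; here `checker` and `corrector` are hypothetical callables standing in for it:

```python
def refine(response, checker, corrector):
    """Toy minimal correction: keep each token verbatim unless
    `checker` flags it, in which case substitute `corrector`'s fix.
    Both callables stand in for the reference-model refiner."""
    return [corrector(tok) if checker(tok) else tok for tok in response]
```

The dynamic-constraint property falls out of the interface: when the response has no flagged errors, the refined target equals the response and the SFT loss toward it exerts no pull; the worse the output, the more the target diverges and the stronger the constraint becomes.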

Result: Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.

Conclusion: Dynamic constraints effectively resolve the tension between constraint strength and optimization objectives in RL fine-tuning by adapting to model capabilities, providing a more flexible and effective approach than static constraints like KL regularization.

Abstract: Constraints are essential for stabilizing reinforcement learning fine-tuning (RFT) and preventing degenerate outputs, yet they inherently conflict with the optimization objective because stronger constraints limit the ability of a fine-tuned model to discover better solutions. We propose \textit{dynamic constraints} that resolve this tension by adapting to the evolving capabilities of the fine-tuned model based on the insight that constraints should only intervene when degenerate outputs occur. We implement this by using a reference model as an \textit{online refiner} that takes the response from the fine-tuned model and generates a minimally corrected version which preserves correct content verbatim while fixing errors. A supervised fine-tuning loss then trains the fine-tuned model to produce the refined output. This mechanism yields a constraint that automatically strengthens or relaxes based on output quality. Experiments on dialogue and code generation show that dynamic constraints outperform both KL regularization and unconstrained baselines, achieving substantially higher task rewards while maintaining training stability.

[498] On the Surprising Effectiveness of a Single Global Merging in Decentralized Learning

Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang

Main category: cs.LG

TL;DR: Decentralized learning with strategic communication scheduling shows that concentrating communication in later stages and final global merging improves performance under data heterogeneity.

DetailsMotivation: Decentralized learning scales better than parameter-server approaches but suffers from limited peer-to-peer communication. The paper investigates optimal communication scheduling to improve performance, particularly under challenging conditions of high data heterogeneity.

Method: Studies communication scheduling in decentralized learning, including when and how frequently devices synchronize. Explores concentrating communication budgets in later training stages and introduces a final global merging step. Theoretically analyzes decentralized SGD with merging to match parallel SGD convergence rates.
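
The contrast between sparse gossip rounds and the single final global merge can be sketched on toy scalar "models". The ring topology and one-round run are illustrative; the paper studies general schedules:

```python
import numpy as np

def gossip_round(models, neighbors):
    """One sparse communication round: each node averages its model
    with those of its listed neighbors (plus itself)."""
    return [np.mean([models[j] for j in nb + [i]], axis=0)
            for i, nb in enumerate(neighbors)]

def global_merge(models):
    """The single fully connected merge at the final step."""
    return np.mean(models, axis=0)

# Toy run: 4 nodes with heterogeneous local parameters, ring topology.
n = 4
models = [np.array([float(i)]) for i in range(n)]
ring = [[(i - 1) % n, (i + 1) % n] for i in range(n)]
models = gossip_round(models, ring)
merged = global_merge(models)
```

Because both operations are averages, the global mean of the parameters is preserved; the paper's theoretical point is that the residual discrepancy among local models before the merge is partly constructive rather than pure noise.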

Result: Counterintuitive findings show that concentrating communication in later stages improves global test performance. A single global merging at the final step significantly boosts performance under high data heterogeneity. Theoretical analysis proves the merged model can match parallel SGD convergence rates.

Conclusion: Decentralized learning can generalize well under high data heterogeneity with limited communication when using strategic scheduling and final merging. This work opens new research avenues for model merging techniques in decentralized settings.

Abstract: Decentralized learning provides a scalable alternative to parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Counterintuitive empirical results show that concentrating communication budgets in the later stages of decentralized training remarkably improves global test performance. Surprisingly, we uncover that fully connected communication at the final step, implemented by a single global merging, can significantly improve the performance of decentralized learning under high data heterogeneity. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can match the convergence rate of parallel SGD. Technically, we reinterpret part of the discrepancy among local models, which were previously considered as detrimental noise, as constructive components essential for matching this rate. This work provides evidence that decentralized learning is able to generalize under high data heterogeneity and limited communication, while offering broad new avenues for model merging research.

[499] ARTEMIS: A Neuro Symbolic Framework for Economically Constrained Market Dynamics

Rahul D Ray

Main category: cs.LG

TL;DR: ARTEMIS is a neuro-symbolic framework for quantitative finance that combines neural operators with symbolic bottlenecks to enforce economic principles like no-arbitrage constraints, achieving state-of-the-art directional accuracy while providing interpretable trading rules.

DetailsMotivation: Current deep learning models in quantitative finance lack interpretability and fail to incorporate fundamental economic principles like no-arbitrage constraints, operating as black boxes that limit trust and practical application in financial domains.

Method: Combines a continuous-time Laplace Neural Operator encoder, neural stochastic differential equation with physics-informed losses, and a differentiable symbolic bottleneck that distills interpretable trading rules. Enforces economic plausibility via two novel regularization terms: Feynman-Kac PDE residual penalizing local no-arbitrage violations and market price of risk penalty bounding instantaneous Sharpe ratio.
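
The two regularizers can be sketched for a scalar value surface. The Black–Scholes-form residual and the Sharpe bound value are illustrative assumptions; ARTEMIS applies these penalties to its learned neural SDE, not to a closed-form `V`:

```python
import numpy as np

def fk_residual(V, s, t, r, sigma, h=1e-3):
    """Finite-difference Feynman-Kac residual
    V_t + r s V_s + 0.5 sigma^2 s^2 V_ss - r V
    for a callable V(s, t); nonzero values flag local
    no-arbitrage violations."""
    V_t = (V(s, t + h) - V(s, t - h)) / (2 * h)
    V_s = (V(s + h, t) - V(s - h, t)) / (2 * h)
    V_ss = (V(s + h, t) - 2 * V(s, t) + V(s - h, t)) / h ** 2
    return V_t + r * s * V_s + 0.5 * sigma ** 2 * s ** 2 * V_ss - r * V(s, t)

def sharpe_penalty(mu, sigma, r=0.0, bound=2.0):
    """Market-price-of-risk penalty: penalize drift/volatility pairs
    whose instantaneous Sharpe ratio |mu - r| / sigma exceeds a
    bound (the bound value here is assumed)."""
    lam = np.abs(mu - r) / np.maximum(sigma, 1e-8)
    return np.mean(np.maximum(lam - bound, 0.0) ** 2)
```

A sanity check: for the tradable asset itself, V(s, t) = s, the residual vanishes (r·s·1 − r·s = 0), which is exactly the no-arbitrage condition the penalty enforces.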

Result: Achieves state-of-the-art directional accuracy, outperforming six baselines on four datasets: DSLOB (64.96%) and Time-IMM (96.0%). Ablation study confirms each component’s importance - removing PDE loss reduces accuracy from 64.89% to 50.32%. Underperforms on Optiver due to long sequence length and volatility-focused target.

Conclusion: ARTEMIS successfully bridges the gap between deep learning’s predictive power and the transparency demanded in quantitative finance by providing interpretable, economically grounded predictions through its neuro-symbolic architecture with economic constraints.

Abstract: Deep learning models in quantitative finance often operate as black boxes, lacking interpretability and failing to incorporate fundamental economic principles such as no-arbitrage constraints. This paper introduces ARTEMIS (Arbitrage-free Representation Through Economic Models and Interpretable Symbolics), a novel neuro-symbolic framework combining a continuous-time Laplace Neural Operator encoder, a neural stochastic differential equation regularised by physics-informed losses, and a differentiable symbolic bottleneck that distils interpretable trading rules. The model enforces economic plausibility via two novel regularisation terms: a Feynman-Kac PDE residual penalising local no-arbitrage violations, and a market price of risk penalty bounding the instantaneous Sharpe ratio. We evaluate ARTEMIS against six strong baselines on four datasets: Jane Street, Optiver, Time-IMM, and DSLOB (a synthetic crash regime). Results demonstrate ARTEMIS achieves state-of-the-art directional accuracy, outperforming all baselines on DSLOB (64.96%) and Time-IMM (96.0%). A comprehensive ablation study confirms each component’s contribution: removing the PDE loss reduces directional accuracy from 64.89% to 50.32%. Underperformance on Optiver is attributed to its long sequence length and volatility-focused target. By providing interpretable, economically grounded predictions, ARTEMIS bridges the gap between deep learning’s power and the transparency demanded in quantitative finance.

[500] BoundAD: Boundary-Aware Negative Generation for Time Series Anomaly Detection

Xiancheng Wang, Lin Wang, Zhibo Zhang, Rui Wang, Minghang Zhao

Main category: cs.LG

TL;DR: A reconstruction-driven boundary negative generation framework for time series anomaly detection that creates hard negative samples from normal data using reinforcement learning-guided reconstruction optimization.

DetailsMotivation: Existing contrastive learning methods for time series anomaly detection rely on random perturbations or pseudo-anomaly injection, which struggle to preserve temporal semantic consistency and provide effective decision-boundary supervision. Most methods overlook generating hard negatives near the data manifold boundary directly from normal samples.

Method: Proposes a reconstruction-driven boundary negative generation framework: 1) Uses a reconstruction network to capture normal temporal patterns, 2) Employs reinforcement learning to adaptively adjust optimization update magnitude based on current reconstruction state, 3) Generates boundary-shifted samples along reconstruction trajectory for contrastive representation learning.
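
One way to read "boundary-shifted samples along the reconstruction trajectory" is to amplify the reconstruction residual, pushing a normal sample just off the learned manifold. That interpretation, and the fixed step size standing in for the RL-adjusted magnitude, are assumptions:

```python
import numpy as np

def boundary_negative(x, reconstruct, alpha=0.5):
    """Push a normal sample just off the learned manifold by
    amplifying its reconstruction residual. `reconstruct` stands in
    for the trained reconstruction network; `alpha` plays the role
    of the RL-adapted update magnitude (fixed here)."""
    residual = x - reconstruct(x)
    return x + alpha * residual
```

Such samples stay close to the normal manifold (small residual implies a small shift), which is what makes them hard negatives for the contrastive stage rather than trivially separable pseudo-anomalies.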

Result: Experimental results show the method effectively improves anomaly representation learning and achieves competitive detection performance on current datasets.

Conclusion: The framework successfully generates challenging boundary negatives from normal data without predefined anomaly patterns, improving contrastive learning for time series anomaly detection through reconstruction-driven negative sample generation.

Abstract: Contrastive learning methods for time series anomaly detection (TSAD) heavily depend on the quality of negative sample construction. However, existing strategies based on random perturbations or pseudo-anomaly injection often struggle to simultaneously preserve temporal semantic consistency and provide effective decision-boundary supervision. Most existing methods rely on prior anomaly injection, while overlooking the potential of generating hard negatives near the data manifold boundary directly from normal samples themselves. To address this issue, we propose a reconstruction-driven boundary negative generation framework that automatically constructs hard negatives through the reconstruction process of normal samples. Specifically, the method first employs a reconstruction network to capture normal temporal patterns, and then introduces a reinforcement learning strategy to adaptively adjust the optimization update magnitude according to the current reconstruction state. In this way, boundary-shifted samples close to the normal data manifold can be induced along the reconstruction trajectory and further used for subsequent contrastive representation learning. Unlike existing methods that depend on explicit anomaly injection, the proposed framework does not require predefined anomaly patterns, but instead mines more challenging boundary negatives from the model’s own learning dynamics. Experimental results show that the proposed method effectively improves anomaly representation learning and achieves competitive detection performance on the current dataset.

[501] Tula: Optimizing Time, Cost, and Generalization in Distributed Large-Batch Training

Sahil Tyagi, Feiyi Wang

Main category: cs.LG

TL;DR: Tula is an online service that automatically optimizes time, cost, and convergence quality for large-batch training of convolutional models by identifying optimal batch sizes through parallel-systems modeling and statistical performance prediction.

DetailsMotivation: Distributed training faces trade-offs between scaling-out (communication overhead) and scaling-up (memory/computation limits), with diminishing returns and generalization gaps at large batch sizes. The optimal batch size depends on model, data, and compute resources.

Method: Combines parallel-systems modeling with statistical performance prediction to identify optimal batch sizes. Uses online service approach to automatically optimize training time, cost, and convergence quality for convolutional models.

Result: Predicts training time and cost within 7.5-14% error across multiple models, achieves up to 20x overall speedup, and improves test accuracy by 9% on average over standard large-batch training on various vision tasks.

Conclusion: Tula successfully mitigates the generalization gap in large-batch training while accelerating training, providing an automated solution for optimizing distributed training of vision models.

Abstract: Distributed training increases the number of batches processed per iteration either by scaling-out (adding more nodes) or scaling-up (increasing the batch-size). However, the largest configuration does not necessarily yield the best performance. Horizontal scaling introduces additional communication overhead, while vertical scaling is constrained by computation cost and device memory limits. Thus, simply increasing the batch-size leads to diminishing returns: training time and cost decrease initially but eventually plateau, creating a knee-point in the time/cost versus batch-size Pareto curve. The optimal batch-size therefore depends on the underlying model, data, and available compute resources. Large batches also suffer from worse model quality due to the well-known generalization gap. In this paper, we present Tula, an online service that automatically optimizes time, cost, and convergence quality for large-batch training of convolutional models. It combines parallel-systems modeling with statistical performance prediction to identify the optimal batch-size. Tula predicts training time and cost within 7.5-14% error across multiple models, achieves up to 20x overall speedup, and improves test accuracy by 9% on average over standard large-batch training on various vision tasks, thus successfully mitigating the generalization gap and accelerating training at the same time.
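
The abstract's knee-point in the time-versus-batch-size curve can be located with a standard heuristic once measurements are in hand. The sketch below is not Tula's method (its parallel-systems model is not reproduced here); it uses the common "maximum distance to the chord" rule on synthetic diminishing-returns data, purely to illustrate the concept.

```python
def knee_point(batch_sizes, times):
    """Return the batch size at the knee of a monotone diminishing-returns curve."""
    x0, y0 = batch_sizes[0], times[0]
    x1, y1 = batch_sizes[-1], times[-1]
    best_i, best_d = 0, -1.0
    for i, (x, y) in enumerate(zip(batch_sizes, times)):
        # Perpendicular-distance numerator from (x, y) to the endpoint chord.
        d = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
        if d > best_d:
            best_i, best_d = i, d
    return batch_sizes[best_i]

batches = [64, 128, 256, 512, 1024, 2048]
times = [100.0, 55.0, 35.0, 28.0, 26.0, 25.5]  # time plateaus after ~512
knee = knee_point(batches, times)
```

On this synthetic curve the knee lands where returns start diminishing sharply, which is the batch-size regime the paper argues an auto-tuner should target.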

[502] VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models

Hefei Xu, Le Wu, Yu Wang, Min Hou, Han Wu, Zhen Zhang, Meng Wang

Main category: cs.LG

TL;DR: VC-soup: A data filtering and parameter merging framework for multi-value alignment in LLMs that addresses value conflicts through value-consistent learning and Pareto filtering.

DetailsMotivation: Aligning LLMs with multiple, potentially conflicting human values is challenging. Existing methods face limitations: training separate models for each value combination is expensive, and value conflicts degrade alignment performance, making it difficult to achieve favorable trade-offs across diverse values.

Method: Proposes VC-soup framework with: 1) Value consistency metric based on cosine similarity between reward-gap vectors and all-ones vector to quantify cross-value coherence; 2) Filtering out low-consistency preference pairs; 3) Training on remaining data to obtain smooth, value-consistent policy models; 4) Linearly combining policies with Pareto filtering across values for balanced performance.

Result: Extensive experiments and theoretical analysis demonstrate VC-soup effectively mitigates conflicts and consistently outperforms existing multi-value alignment methods.

Conclusion: VC-soup addresses multi-value alignment challenges through value-consistent learning, offering a practical solution for achieving balanced performance across diverse human values in LLMs.

Abstract: As large language models (LLMs) increasingly shape content generation, interaction, and decision-making across the Web, aligning them with human values has become a central objective in trustworthy AI. This challenge becomes even more pronounced when aligning multiple, potentially conflicting human values. Although recent approaches, such as reward reweighting, prompt-based supervised fine-tuning, and model merging, attempt to tackle multi-value alignment, they still face two major limitations: (1) training separate models for each value combination is prohibitively expensive; (2) value conflicts substantially degrade alignment performance. These limitations make it difficult to achieve favorable trade-offs across diverse human values. To address these challenges, we revisit multi-value alignment from the perspective of value consistency in data and propose VC-soup, a data filtering and parameter merging framework grounded in value-consistent learning. We first design a value consistency metric based on the cosine similarity between the reward-gap vector of each preference pair and an all-ones vector, which quantifies its cross-value coherence. We then filter out low-consistency preference pairs in each value dataset and train on the remaining data to obtain smooth, value-consistent policy models that better preserve linear mode connectivity. Finally, we linearly combine these policies and apply Pareto filtering across values to obtain solutions with balanced multi-value performance. Extensive experiments and theoretical analysis demonstrate that VC-soup effectively mitigates conflicts and consistently outperforms existing multi-value alignment methods.
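
The value-consistency metric is described concretely in the abstract: cosine similarity between a preference pair's reward-gap vector (one reward gap per value dimension) and the all-ones vector. A minimal sketch, with illustrative reward values:

```python
import math

def value_consistency(reward_gaps):
    # Cosine similarity between the reward-gap vector and the all-ones vector.
    dot = sum(reward_gaps)                       # <g, 1>
    norm_g = math.sqrt(sum(g * g for g in reward_gaps))
    norm_ones = math.sqrt(len(reward_gaps))
    return dot / (norm_g * norm_ones)

aligned = value_consistency([0.8, 0.9, 0.7])    # all value rewards agree -> near 1
conflict = value_consistency([0.8, -0.9, 0.1])  # values pull in opposite directions

# Filtering step: keep only high-consistency preference pairs (cutoff is illustrative).
pairs = [[0.8, 0.9, 0.7], [0.8, -0.9, 0.1]]
kept = [p for p in pairs if value_consistency(p) > 0.5]
```

A pair whose chosen response beats the rejected one under every value scores near 1 and is kept; a pair where the values disagree scores near 0 and is filtered out before training.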

[503] LLM-Augmented Computational Phenotyping of Long Covid

Jing Wang, Jie Shen, Amar Sra, Qiaomin Xie, Jeremy C Weiss

Main category: cs.LG

TL;DR: LLM-augmented computational phenotyping framework “Grace Cycle” discovers three distinct clinical phenotypes in Long COVID patients from longitudinal data

DetailsMotivation: Long COVID is a complex persistent condition with poorly understood clinical subphenotypes. Phenotypic characterization is essential for understanding heterogeneity in chronic diseases and guiding personalized interventions.

Method: Proposed “Grace Cycle” framework that iteratively integrates hypothesis generation, evidence extraction, and feature refinement using LLMs to discover clinically meaningful subgroups from longitudinal patient data.

Result: Identified three distinct clinical phenotypes (Protected, Responder, Refractory) from 13,511 Long COVID participants with pronounced separation in peak symptom severity, baseline disease burden, and longitudinal dose-response patterns.

Conclusion: LLMs can be integrated into principled, statistically grounded pipelines for phenotypic screening from complex longitudinal data. The framework is disease-agnostic and offers a general approach for discovering clinically interpretable subphenotypes.

Abstract: Phenotypic characterization is essential for understanding heterogeneity in chronic diseases and for guiding personalized interventions. Long COVID is a complex and persistent condition, yet its clinical subphenotypes remain poorly understood. In this work, we propose an LLM-augmented computational phenotyping framework, “Grace Cycle”, that iteratively integrates hypothesis generation, evidence extraction, and feature refinement to discover clinically meaningful subgroups from longitudinal patient data. The framework identifies three distinct clinical phenotypes, Protected, Responder, and Refractory, based on 13,511 Long COVID participants. These phenotypes exhibit pronounced separation in peak symptom severity, baseline disease burden, and longitudinal dose-response patterns, with strong statistical support across multiple independent dimensions. This study illustrates how large language models can be integrated into a principled, statistically grounded pipeline for phenotypic screening from complex longitudinal data. Note that the proposed framework is disease-agnostic and offers a general approach for discovering clinically interpretable subphenotypes.

[504] Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

Main category: cs.LG

TL;DR: Paper addresses conflict detection in policy languages when using probabilistic ML signals instead of crisp Boolean predicates, proposing solutions for embedding-based systems.

DetailsMotivation: Traditional policy language conflict detection assumes crisp Boolean predicates, but modern routing and access-control systems use probabilistic ML signals (embeddings, classifiers), which can cause silent conflicts when multiple signals clear thresholds on the same query.

Method: Characterizes the problem as a three-level decidability hierarchy: crisp conflicts (SAT-solvable), embedding conflicts (spherical cap intersection), and classifier conflicts (undecidable without distribution knowledge). For embedding conflicts, proposes replacing independent thresholding with temperature-scaled softmax to partition embedding space into Voronoi regions preventing co-firing.

Result: Implements detection and prevention mechanisms in Semantic Router DSL, a production routing language for LLM inference, showing that no model retraining is needed for the embedding case.

Conclusion: Provides practical solutions for conflict detection in policy languages using ML signals, with applications to semantic RBAC and API gateway policy, addressing a critical gap in modern routing systems.

Abstract: Conflict detection in policy languages is a solved problem – as long as every rule condition is a crisp Boolean predicate. BDDs, SMT solvers, and NetKAT all exploit that assumption. But a growing class of routing and access-control systems base their decisions on probabilistic ML signals: embedding similarities, domain classifiers, complexity estimators. Two such signals, declared over categories the author intended to be disjoint, can both clear their thresholds on the same query and silently route it to the wrong model. Nothing in the compiler warns about this. We characterize the problem as a three-level decidability hierarchy – crisp conflicts are decidable via SAT, embedding conflicts reduce to spherical cap intersection, and classifier conflicts are undecidable without distributional knowledge – and show that for the embedding case, which dominates in practice, replacing independent thresholding with a temperature-scaled softmax partitions the embedding space into Voronoi regions where co-firing is impossible. No model retraining is needed. We implement the detection and prevention mechanisms in the Semantic Router DSL, a production routing language for LLM inference, and discuss how the same ideas apply to semantic RBAC and API gateway policy.
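
The embedding-case fix can be illustrated in a few lines: independent per-category thresholds let two similarity signals co-fire on one query, while softmax routing (here reduced to its argmax winner) assigns each query to exactly one Voronoi-like region. Similarity values, category names, and the threshold below are illustrative, not from the Semantic Router DSL.

```python
import math

def threshold_route(sims, tau):
    # Independent per-category thresholds: several categories can co-fire.
    return [name for name, s in sims.items() if s >= tau]

def softmax_route(sims, temperature=0.1):
    # Temperature-scaled softmax over similarities: exactly one winner,
    # i.e. each query lands in a single region of embedding space.
    z = {k: math.exp(s / temperature) for k, s in sims.items()}
    total = sum(z.values())
    probs = {k: v / total for k, v in z.items()}
    return max(probs, key=probs.get)

sims = {"math": 0.82, "code": 0.79, "chat": 0.30}
fired = threshold_route(sims, tau=0.75)  # "math" and "code" both clear the bar
winner = softmax_route(sims)             # a single category wins
```

This is the sense in which the conflict is prevented rather than detected: co-firing is impossible by construction, and no model retraining is required.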

[505] R2-Dreamer: Redundancy-Reduced World Models without Decoders or Augmentation

Naoki Morihira, Amal Nahar, Kartik Bharadwaj, Yasuhiro Kato, Akinobu Hayashi, Tatsuya Harada

Main category: cs.LG

TL;DR: R2-Dreamer: A decoder-free model-based reinforcement learning framework with self-supervised redundancy-reduction objective that replaces data augmentation for learning robust visual representations.

DetailsMotivation: Reconstruction-based MBRL methods waste capacity on irrelevant visual details, while decoder-free methods rely on external data augmentation regularizers that limit versatility. Need an internal regularizer for robust representation learning without data augmentation.

Method: Proposes R2-Dreamer, a decoder-free MBRL framework with a self-supervised redundancy-reduction objective inspired by Barlow Twins. This serves as an internal regularizer to prevent representation collapse without data augmentation.

Result: Competitive with strong baselines (DreamerV3, TD-MPC2) on DeepMind Control Suite and Meta-World, trains 1.59x faster than DreamerV3, and shows substantial gains on DMC-Subtle with tiny task-relevant objects.

Conclusion: An effective internal regularizer can enable versatile, high-performance decoder-free MBRL without reliance on external data augmentation.

Abstract: A central challenge in image-based Model-Based Reinforcement Learning (MBRL) is to learn representations that distill essential information from irrelevant visual details. While promising, reconstruction-based methods often waste capacity on large task-irrelevant regions. Decoder-free methods instead learn robust representations by leveraging Data Augmentation (DA), but reliance on such external regularizers limits versatility. We propose R2-Dreamer, a decoder-free MBRL framework with a self-supervised objective that serves as an internal regularizer, preventing representation collapse without resorting to DA. The core of our method is a redundancy-reduction objective inspired by Barlow Twins, which can be easily integrated into existing frameworks. On DeepMind Control Suite and Meta-World, R2-Dreamer is competitive with strong baselines such as DreamerV3 and TD-MPC2 while training 1.59x faster than DreamerV3, and yields substantial gains on DMC-Subtle with tiny task-relevant objects. These results suggest that an effective internal regularizer can enable versatile, high-performance decoder-free MBRL. Code is available at https://github.com/NM512/r2dreamer.
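
The Barlow Twins-style objective the paper builds on pushes the cross-correlation matrix of two embedding views toward the identity: diagonal entries toward 1 (invariance) and off-diagonal entries toward 0 (redundancy reduction). A minimal NumPy sketch; batch size, dimensionality, and the lambda weight are illustrative, not the paper's settings:

```python
import numpy as np

def redundancy_reduction_loss(z_a, z_b, lam=5e-3):
    n, d = z_a.shape
    # Standardize each view along the batch dimension.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-8)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-8)
    c = z_a.T @ z_b / n                            # d x d cross-correlation matrix
    on_diag = ((np.diag(c) - 1.0) ** 2).sum()      # invariance term
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()  # redundancy term
    return on_diag + lam * off_diag

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 8))
loss_same = redundancy_reduction_loss(z, z)                       # identical views
loss_rand = redundancy_reduction_loss(z, rng.normal(size=(256, 8)))  # unrelated views
```

Because the loss depends only on the embeddings themselves, it acts as the internal regularizer the paper describes: no decoder and no augmented input pairs are required to prevent collapse.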

[506] Gradient-Informed Temporal Sampling Improves Rollout Accuracy in PDE Surrogate Training

Wenshuo Wang, Fan Zhang

Main category: cs.LG

TL;DR: GITS is a data sampling method for neural simulators that optimizes both model gradients and temporal coverage to improve rollout accuracy compared to uniform or other sampling strategies.

DetailsMotivation: Current data sampling methods for neural simulators either focus too much on high-information-density regions or preserve diversity but lack model-specificity, often performing no better than uniform sampling. There's a need for a principled approach to maximize rollout accuracy through optimal training data selection.

Method: Proposes Gradient-Informed Temporal Sampling (GITS) that jointly optimizes two objectives: 1) pilot-model local gradients (model specificity) and 2) set-level temporal coverage (dynamical information). This balances model-specific information needs with comprehensive temporal dynamics coverage.

Result: GITS achieves lower rollout error across multiple PDE systems, model backbones, and sample ratios compared to various sampling baselines. Ablation studies confirm the necessity and complementarity of both optimization objectives.

Conclusion: GITS provides an effective data sampling strategy for neural simulators that outperforms existing methods by balancing model-specific gradient information with temporal coverage, though it may fail on certain PDE systems and model backbones.

Abstract: Researchers train neural simulators on uniformly sampled numerical simulation data. But under the same budget, does systematically sampled data provide the most effective information? A fundamental yet unformalized problem is how to sample training data for neural simulators so as to maximize rollout accuracy. Existing data sampling methods either tend to collapse into locally high-information-density regions, or preserve diversity but remain insufficiently model-specific, often leading to performance that is no better than uniform sampling. To address this, we propose a data sampling method tailored to neural simulators, Gradient-Informed Temporal Sampling (GITS). GITS jointly optimizes pilot-model local gradients and set-level temporal coverage, thereby effectively balancing model specificity and dynamical information. Compared with multiple sampling baselines, the data selected by GITS achieves lower rollout error across multiple PDE systems, model backbones and sample ratios. Furthermore, ablation studies demonstrate the necessity and complementarity of the two optimization objectives in GITS. In addition, we analyze the successful sampling patterns of GITS as well as the typical PDE systems and model backbones on which GITS fails.
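
The two GITS objectives, pilot-model gradient magnitude and set-level temporal coverage, can be combined greedily. The sketch below is hypothetical: the trade-off weight, the coverage bonus (distance to the nearest already-selected timestep), and the greedy scheme are stand-ins for the paper's actual formulation.

```python
def gits_select(grad_norms, k, coverage_weight=1.0):
    """Greedily pick k timesteps balancing gradient score and temporal spread."""
    selected = []
    for _ in range(k):
        best_t, best_score = None, float("-inf")
        for t, g in enumerate(grad_norms):
            if t in selected:
                continue
            # Coverage bonus: distance to the nearest already-selected timestep.
            cov = min((abs(t - s) for s in selected), default=len(grad_norms))
            score = g + coverage_weight * cov
            if score > best_score:
                best_t, best_score = t, score
        selected.append(best_t)
    return sorted(selected)

# Pilot-model gradient norms along a trajectory: a high-gradient burst at t=2,3
# and a smaller one at t=7.
grads = [0.1, 0.2, 9.0, 8.5, 0.3, 0.2, 0.1, 4.0]
picks = gits_select(grads, k=3)
```

The selection keeps the high-gradient burst but also reaches the temporally distant region, rather than collapsing into the single locally high-information-density cluster, which is the failure mode the abstract attributes to purely gradient-driven sampling.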

[507] AGRI-Fidelity: Evaluating the Reliability of Listenable Explanations for Poultry Disease Detection

Sindhuja Madabushi, Arda Dogan, Jonathan Liu, Dian Chen, Dong S. Ha, Sook Shin, Sam H. Noh, Jin-Hee Cho

Main category: cs.LG

TL;DR: AGRI-Fidelity is a reliability evaluation framework for audio explanations in poultry disease detection that addresses model multiplicity and stationary artifacts in noisy farm environments.

DetailsMotivation: Existing XAI metrics measure faithfulness for single models but ignore model multiplicity where different near-optimal classifiers may rely on spurious acoustic cues. In noisy farm environments, stationary artifacts like ventilation noise can produce explanations that are faithful yet unreliable, as current masking-based metrics fail to penalize redundant shortcuts.

Method: Proposes AGRI-Fidelity framework combining cross-model consensus with cyclic temporal permutation to construct null distributions and compute False Discovery Rate (FDR). This suppresses stationary artifacts while preserving time-localized bioacoustic markers for poultry disease detection without spatial ground truth.

Result: Across real and controlled datasets, AGRI-Fidelity effectively provides reliability-aware discrimination for all data points compared to masking-based metrics, successfully suppressing stationary artifacts while preserving relevant bioacoustic markers.

Conclusion: AGRI-Fidelity addresses critical limitations in audio explanation evaluation for noisy environments by providing reliability-aware metrics that account for model multiplicity and stationary artifacts, offering more robust evaluation for listenable explanations in agricultural settings.

Abstract: Existing XAI metrics measure faithfulness for a single model, ignoring model multiplicity where near-optimal classifiers rely on different or spurious acoustic cues. In noisy farm environments, stationary artifacts such as ventilation noise can produce explanations that are faithful yet unreliable, as masking-based metrics fail to penalize redundant shortcuts. We propose AGRI-Fidelity, a reliability-oriented evaluation framework for listenable explanations in poultry disease detection without spatial ground truth. The method combines cross-model consensus with cyclic temporal permutation to construct null distributions and compute a False Discovery Rate (FDR), suppressing stationary artifacts while preserving time-localized bioacoustic markers. Across real and controlled datasets, AGRI-Fidelity effectively provides reliability-aware discrimination for all data points versus masking-based metrics.
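
The cyclic-permutation null can be illustrated with synthetic attributions: a time-localized bioacoustic marker that two models agree on loses its alignment under cyclic shifts, while a stationary artifact (e.g. constant ventilation-noise attribution) is shift-invariant and thus fails the test. Signals and the consensus statistic below are illustrative stand-ins for the framework's actual quantities.

```python
def consensus(a, b):
    # Agreement between two models' attributions at the same time alignment.
    return sum(x * y for x, y in zip(a, b))

def cyclic_p_value(attr_a, attr_b):
    observed = consensus(attr_a, attr_b)
    n = len(attr_a)
    # Null distribution: consensus under every nontrivial cyclic time shift.
    null = [consensus(attr_a, attr_b[s:] + attr_b[:s]) for s in range(1, n)]
    return sum(1 for v in null if v >= observed) / len(null)

T = 16
localized_a = [1.0 if t in (7, 8) else 0.0 for t in range(T)]  # model A's peak
localized_b = [1.0 if t in (7, 8) else 0.0 for t in range(T)]  # model B agrees
stationary = [0.5] * T                                         # flat artifact

p_marker = cyclic_p_value(localized_a, localized_b)    # low: marker survives
p_artifact = cyclic_p_value(stationary, stationary)    # high: shift-invariant
```

Only time-localized, cross-model-consistent structure yields a small p-value; thresholding such p-values across many explanations is what makes an FDR computation possible.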

[508] MolRGen: A Training and Evaluation Setting for De Novo Molecular Generation with Reasonning Models

Philippe Formont, Maxime Darrin, Ismail Ben Ayed, Pablo Piantanida

Main category: cs.LG

TL;DR: MolRGen: A benchmark and dataset for training reasoning-based LLMs on de novo molecular generation without ground-truth supervision.

DetailsMotivation: Existing reasoning LLM approaches for drug discovery require ground-truth labels or focus only on evaluation, lacking methods for de novo molecular generation where novel molecules must be optimized without prior knowledge of high-scoring candidates.

Method: Introduces MolRGen benchmark with three contributions: 1) setting for de novo molecular generation and property prediction, 2) diversity-aware top-k score metric, 3) training a 24B LLM with reinforcement learning for molecular generation.

Result: Provides a comprehensive benchmark and demonstrates training of a large-scale LLM for molecular generation, with detailed analysis of performance and limitations.

Conclusion: MolRGen bridges the gap in reasoning LLM applications for de novo molecular generation by providing a benchmark, novel evaluation metric, and training methodology without requiring ground-truth supervision.

Abstract: Recent advances in reasoning-based large language models (LLMs) have demonstrated substantial improvements in complex problem-solving tasks. Motivated by these advances, several works have explored the application of reasoning LLMs to drug discovery and molecular design. However, most existing approaches either focus on evaluation or rely on training setups that require ground-truth labels, such as molecule pairs with known property modifications. Such supervision is unavailable in de novo molecular generation, where the objective is to generate novel molecules that optimize a desirability score without prior knowledge of high-scoring candidates. To bridge this gap, we introduce MolRGen, a large-scale benchmark and dataset for training and evaluating reasoning-based LLMs on de novo molecular generation. Our contributions are threefold. First, we propose a setting to evaluate and train models for de novo molecular generation and property prediction. Second, we introduce a novel diversity-aware top-k score that captures both the quality and diversity of generated molecules. Third, we show our setting can be used to train LLMs for molecular generation, training a 24B LLM with reinforcement learning, and we provide a detailed analysis of its performance and limitations.
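
One plausible shape for a diversity-aware top-k score is sketched below: average the k best candidate scores while skipping candidates too similar to ones already counted. This is hypothetical, not the paper's metric; the `toy_sim` character-overlap function is a stand-in for a real molecular similarity (e.g. Tanimoto over fingerprints), and the cutoff is illustrative.

```python
def diversity_aware_top_k(candidates, k, sim, sim_cutoff=0.8):
    """candidates: list of (molecule, score); sim: pairwise similarity in [0, 1]."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    kept = []
    for mol, score in ranked:
        # Only count a candidate if it is dissimilar to everything kept so far.
        if all(sim(mol, m) < sim_cutoff for m, _ in kept):
            kept.append((mol, score))
        if len(kept) == k:
            break
    return sum(s for _, s in kept) / k

def toy_sim(a, b):
    # Jaccard overlap of character sets of SMILES-like strings (illustration only).
    return len(set(a) & set(b)) / len(set(a) | set(b))

# A duplicate high scorer should not be counted twice.
cands = [("CCO", 0.9), ("CCO", 0.9), ("c1ccccc1", 0.7), ("CCN", 0.6)]
score = diversity_aware_top_k(cands, k=2, sim=toy_sim)
```

A plain top-k average would reward generating the same high-scoring molecule repeatedly; any diversity-aware variant must break that incentive, as the duplicate-skipping step does here.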

[509] Discovering What You Can Control: Interventional Boundary Discovery for Reinforcement Learning

Jiaxin Liu

Main category: cs.LG

TL;DR: IBD is a causal feature selection method that identifies which observation dimensions are causally influenced by an agent’s actions using interventions, helping RL algorithms focus on relevant state features while ignoring confounded distractors.

DetailsMotivation: In RL with high-dimensional observations, many dimensions may be confounded distractors that correlate with actions but aren't causally influenced by them. Traditional observational feature selection can fail by selecting these confounded features while discarding truly causal ones, degrading RL performance.

Method: IBD applies Pearl’s do-operator to intervene on the agent’s actions and uses two-sample testing to compare observational vs. interventional distributions. This produces a binary mask over observation dimensions indicating which dimensions are causally influenced by actions, requiring no learned models.

Result: Across 12 continuous control settings with up to 100 distractor dimensions: (1) observational feature selection actively selects confounded distractors, (2) full-state RL degrades sharply when distractors outnumber relevant features ~3:1, (3) IBD closely tracks oracle performance across all distractor levels and transfers gains across SAC and TD3 algorithms.

Conclusion: IBD provides an interpretable, model-free method for causal feature selection that composes with any downstream RL algorithm, effectively identifying the agent’s causal sphere of influence and improving performance in environments with many confounded distractors.

Abstract: Selecting relevant state dimensions in the presence of confounded distractors is a causal identification problem: observational statistics alone cannot reliably distinguish dimensions that correlate with actions from those that actions cause. We formalize this as discovering the agent’s Causal Sphere of Influence and propose Interventional Boundary Discovery (IBD), which applies Pearl’s do-operator to the agent’s own actions and uses two-sample testing to produce an interpretable binary mask over observation dimensions. IBD requires no learned models and composes with any downstream RL algorithm as a preprocessing step. Across 12 continuous control settings with up to 100 distractor dimensions, we find that: (1) observational feature selection can actively select confounded distractors while discarding true causal dimensions; (2) full-state RL degrades sharply once distractors outnumber relevant features by roughly 3:1 in our benchmarks; and (3) IBD closely tracks oracle performance across all distractor levels tested, with gains transferring across SAC and TD3.
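
The core mechanism, intervene on the action, then two-sample-test each observation dimension, fits in a short sketch. The toy environment and the crude mean-difference z statistic below are illustrative stand-ins for whatever environment and two-sample test the paper actually uses; only the detect-by-intervention structure is taken from the abstract.

```python
import random
import statistics

def rollout(action, n=500):
    # Toy environment: dim 0 is causally driven by the action (a do(A=a) rollout);
    # dim 1 is a distractor drawn independently of the action.
    rng = random.Random(int(action * 100))
    obs = []
    for _ in range(n):
        causal = action + rng.gauss(0.0, 0.1)
        distractor = rng.gauss(0.0, 1.0)
        obs.append((causal, distractor))
    return obs

def influenced_mask(obs_a, obs_b, z_cutoff=4.0):
    # Per-dimension two-sample z statistic on the means: a dimension whose
    # distribution shifts between the two interventions is marked influenced.
    n = len(obs_a)
    mask = []
    for d in range(len(obs_a[0])):
        xs = [o[d] for o in obs_a]
        ys = [o[d] for o in obs_b]
        pooled = statistics.pstdev(xs + ys) or 1e-8
        se = pooled * (2.0 / n) ** 0.5
        mask.append(abs(statistics.mean(xs) - statistics.mean(ys)) / se > z_cutoff)
    return mask

mask = influenced_mask(rollout(action=0.0), rollout(action=1.0))
```

The resulting binary mask keeps the action-driven dimension and drops the distractor, and can be applied as a preprocessing step in front of any RL algorithm, as the abstract describes.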

[510] Sharpness-Aware Minimization in Logit Space Efficiently Enhances Direct Preference Optimization

Haocheng Luo, Zehang Deng, Thanh-Toan Do, Mehrtash Harandi, Dinh Phung, Trung Le

Main category: cs.LG

TL;DR: Logits-SAM improves DPO by addressing the squeezing effect through curvature regularization in logit space

DetailsMotivation: Direct Preference Optimization (DPO) suffers from the squeezing effect where probability of preferred responses decreases during training, limiting its effectiveness for aligning LLMs with human preferences

Method: Develop theoretical framework modeling coordinate-wise dynamics in logit space, then propose logits-SAM variant that perturbs only the output layer to suppress negative-gradient updates along high-curvature directions

Result: Logits-SAM consistently improves DPO effectiveness across Pythia-2.8B, Mistral-7B, and Gemma-2B-IT models on multiple datasets and benchmarks

Conclusion: Logits-SAM provides computationally efficient solution to DPO’s squeezing effect, integrates well with other DPO variants, and improves alignment performance

Abstract: Direct Preference Optimization (DPO) has emerged as a popular algorithm for aligning pretrained large language models with human preferences, owing to its simplicity and training stability. However, DPO suffers from the recently identified squeezing effect (also known as likelihood displacement), where the probability of preferred responses decreases unintentionally during training. To understand and mitigate this phenomenon, we develop a theoretical framework that models the coordinate-wise dynamics in logit space. Our analysis reveals that negative-gradient updates cause residuals to expand rapidly along high-curvature directions, which underlies the squeezing effect, whereas Sharpness-Aware Minimization (SAM) can suppress this behavior through its curvature-regularization effect. Building on this insight, we investigate logits-SAM, a computationally efficient variant that perturbs only the output layer with negligible overhead. Extensive experiments on Pythia-2.8B, Mistral-7B, and Gemma-2B-IT across multiple datasets and benchmarks demonstrate that logits-SAM consistently improves the effectiveness of DPO and integrates seamlessly with other DPO variants. Code is available at https://github.com/RitianLuo/logits-sam-dpo.
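
SAM's two-step update restricted to the output layer is the computational core of logits-SAM. The sketch below applies that restricted update to a toy least-squares problem on a logit matrix; the loss, rho, and learning rate are illustrative, and the paper's setting (inside DPO training of an LLM) is not reproduced here.

```python
import numpy as np

def sam_step_on_logit_layer(W, X, Y, rho=0.05, lr=0.1):
    # SAM restricted to the output (logit) layer's weights W:
    # 1) ascend to a nearby worst-case perturbation of W (radius rho),
    # 2) descend from the ORIGINAL W using the gradient at the perturbed point.
    def grad(Wc):
        return 2.0 * X.T @ (X @ Wc - Y) / len(X)  # gradient of mean-squared error
    g = grad(W)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return W - lr * grad(W + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))      # frozen features feeding the output layer
W_true = rng.normal(size=(4, 3))  # target logit weights
Y = X @ W_true
W = np.zeros((4, 3))
for _ in range(200):
    W = sam_step_on_logit_layer(W, X, Y)
mse = float(((X @ W - Y) ** 2).mean())
```

Because only the output layer is perturbed, the extra cost over plain gradient descent is one additional gradient evaluation of that single layer, which is the "negligible overhead" claim in the abstract.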

[511] Enactor: From Traffic Simulators to Surrogate World Models

Yash Ranjan, Rahul Sengupta, Anand Rangarajan, Sanjay Ranka

Main category: cs.LG

TL;DR: A transformer-based world model for generating physically consistent, long-horizon traffic actor trajectories that captures complex actor-actor interactions and intersection geometry.

DetailsMotivation: Existing traffic microsimulators use simplistic behavior models that fail to capture realistic actor-actor interactions, while deep learning approaches lack physical consistency over long time periods and don't adequately address complex intersection dynamics.

Method: Actor-centric generative model using transformer-based architecture inspired by World Model paradigm, capturing actor-actor interactions while understanding traffic intersection geometry. Tested in “simulation-in-the-loop” setting with SUMO for initial conditions.

Result: Model generates long-horizon, physically consistent trajectories, captures complex interactions, requires fewer training samples than traditional approaches, and outperforms baseline by more than 10x on KL-Divergence metrics.

Conclusion: The proposed framework effectively addresses limitations of existing traffic simulation methods by generating realistic, physically grounded trajectories that maintain consistency over extended time periods while capturing complex intersection dynamics.

Abstract: Traffic microsimulators are widely used to evaluate road network performance under various “what-if” conditions. However, the behavior models controlling the actions of the actors are overly simplistic and fail to capture realistic actor-actor interactions. Deep learning-based methods have been applied to model vehicles and pedestrians as “agents” responding to their surrounding “environment” (including lanes, signals, and neighboring agents). Although effective in learning actor-actor interaction, these approaches fail to generate physically consistent trajectories over long time periods, and they do not explicitly address the complex dynamics that arise at traffic intersections, which are critical locations in urban networks. Inspired by the World Model paradigm, we have developed an actor-centric generative model using a transformer-based architecture that captures actor-actor interaction while understanding the geometry of the traffic intersection, generating physically grounded trajectories based on learned behavior. Moreover, we test the model in a live “simulation-in-the-loop” setting, where we generate the initial conditions of the actors using SUMO and then let the model control the dynamics of the actors. We let the simulation run for 40,000 timesteps (4,000 seconds), testing the performance of the model over long time ranges and evaluating the trajectories on traffic-engineering metrics. Experimental results demonstrate that the proposed framework effectively captures complex actor-actor interactions and generates long-horizon, physically consistent trajectories, while requiring significantly fewer training samples than traditional agent-centric generative approaches. Our model outperforms the baseline on traffic-related as well as aggregate metrics, beating the baseline by more than 10x on KL-Divergence.

[512] Detection Is Cheap, Routing Is Learned: Why Refusal-Based Alignment Evaluation Fails

Gregory N. Frank

Main category: cs.LG

TL;DR: Paper studies political censorship in Chinese LLMs as alignment case study, revealing three-stage framework: detect, route, generate, showing current evaluation misses routing layer

DetailsMotivation: Current alignment evaluation focuses on concept detection and refusal behavior, missing the routing layer between detection and behavioral policy. Political censorship in Chinese LLMs provides natural experiment to study this routing mechanism.

Method: Used probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Examined political probes, null controls, permutation baselines, and held-out category generalization. Performed surgical ablation of political-sensitivity direction and cross-model transfer analysis.

Result: 1) Probe accuracy alone non-diagnostic - all baselines can reach 100%, held-out generalization is informative. 2) Surgical ablation reveals lab-specific routing - removing political-sensitivity direction eliminates censorship in most models, one model confabulates due to entangled architecture. 3) Refusal no longer dominant mechanism - hard refusal falls to zero while narrative steering rises, making censorship invisible to refusal-only benchmarks.

Conclusion: Supports three-stage framework: detect, route, generate. Models retain knowledge but alignment changes expression. Current evaluations miss routing mechanism that most directly determines behavior.

Abstract: Current alignment evaluation mostly measures whether models encode dangerous concepts and whether they refuse harmful requests. Both miss the layer where alignment often operates: routing from concept detection to behavioral policy. We study political censorship in Chinese-origin language models as a natural experiment, using probes, surgical ablations, and behavioral tests across nine open-weight models from five labs. Three findings follow. First, probe accuracy alone is non-diagnostic: political probes, null controls, and permutation baselines can all reach 100%, so held-out category generalization is the informative test. Second, surgical ablation reveals lab-specific routing. Removing the political-sensitivity direction eliminates censorship and restores accurate factual output in most models tested, while one model confabulates because its architecture entangles factual knowledge with the censorship mechanism. Cross-model transfer fails, indicating that routing geometry is model- and lab-specific. Third, refusal is no longer the dominant censorship mechanism. Within one model family, hard refusal falls to zero while narrative steering rises to the maximum, making censorship invisible to refusal-only benchmarks. These results support a three-stage descriptive framework: detect, route, generate. Models often retain the relevant knowledge; alignment changes how that knowledge is expressed. Evaluations that audit only detection or refusal therefore miss the routing mechanism that most directly determines behavior.

[513] On Additive Gaussian Processes for Wind Farm Power Prediction

Simon M. Brealy, Lawrence A. Bull, Daniel S. Brennan, Pauline Beltrando, Anders Sommer, Nikolaos Dervilis, Keith Worden

Main category: cs.LG

TL;DR: Using additive Gaussian processes to model wind turbine power generation at both individual turbine and farm levels for population-based structural health monitoring

DetailsMotivation: To enable information sharing between similar structures/machines in PBSHM, specifically for wind farms, by understanding variations in power generation patterns at both turbine-specific and farm-wide levels

Method: Additive Gaussian processes applied to wind farm data to model power generation, capturing both individual turbine variations and overall farm-level patterns

Result: Predictions reveal intuitive patterns in wind farm power generation that should enable more informed control and decision-making

Conclusion: Additive Gaussian processes provide a useful approach for PBSHM in wind farms by revealing multi-level power generation patterns that can inform operational decisions

Abstract: Population-based Structural Health Monitoring (PBSHM) aims to share information between similar machines or structures. This paper takes a population-level perspective, exploring the use of additive Gaussian processes to reveal variations in turbine-specific and farm-level power models over a collected wind farm dataset. The predictions illustrate patterns in wind farm power generation, which follow intuition and should enable more informed control and decision-making.
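The additive-kernel idea can be sketched in a few lines of numpy: the covariance is a sum of components with different length-scales, standing in for farm-level and turbine-specific effects (the inputs, length-scales, and toy power curve below are illustrative assumptions, not the paper's model):

```python
import numpy as np

def rbf(x1, x2, ls):
    # Squared-exponential kernel on 1-D inputs
    d = x1[:, None] - x2[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(3, 15, 40))                  # hypothetical wind speeds
y = np.tanh((X - 3) / 4) + 0.05 * rng.standard_normal(40)  # toy power curve

def k(a, b):
    # Additive kernel: a smooth farm-level component plus a
    # shorter-length-scale turbine-specific component
    return rbf(a, b, ls=4.0) + 0.3 * rbf(a, b, ls=1.0)

noise = 1e-2
K = k(X, X) + noise * np.eye(len(X))
Xs = np.linspace(3, 15, 5)
mean = k(Xs, X) @ np.linalg.solve(K, y)  # GP posterior mean at test points
```

The additive structure is what lets the fitted components be inspected separately, which is the interpretability argument the summary makes.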

[514] Path-Constrained Mixture-of-Experts

Zijin Gu, Tatiana Likhomanenko, Vimal Thilak, Jason Ramapuram, Navdeep Jaitly

Main category: cs.LG

TL;DR: PathMoE: A novel MoE architecture that shares router parameters across consecutive layers to constrain the expert path space, improving statistical efficiency and performance without auxiliary losses.

DetailsMotivation: Conventional MoE routing selects experts independently per layer, creating N^L possible expert paths that exceed typical training set sizes, leading to statistical inefficiency and failure to learn meaningful structure over the vast path space.

Method: Proposes PathMoE which shares router parameters across consecutive layers to constrain the expert path space, reducing the number of possible paths and improving statistical efficiency while eliminating the need for auxiliary load balancing losses.

Result: Experiments on 0.9B and 16B parameter models show consistent improvements in perplexity and downstream tasks over independent routing. Analysis reveals tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations.

Conclusion: PathMoE offers a more efficient MoE architecture by constraining expert paths through shared router parameters, providing better performance and interpretability while eliminating auxiliary losses. The work offers a new perspective for understanding MoE architectures through expert paths.

Abstract: Sparse Mixture-of-Experts (MoE) architectures enable efficient scaling by activating only a subset of parameters for each input. However, conventional MoE routing selects each layer’s experts independently, creating N^L possible expert paths for N experts across L layers. This far exceeds typical training set sizes, leading to statistical inefficiency as the model may not learn meaningful structure over such a vast path space. To constrain it, we propose PathMoE, which shares router parameters across consecutive layers. Experiments on 0.9B and 16B parameter models demonstrate consistent improvements on perplexity and downstream tasks over independent routing, while eliminating the need for auxiliary load balancing losses. Analysis reveals that tokens following the same path naturally cluster by linguistic function, with PathMoE producing more concentrated groups, better cross-layer consistency, and greater robustness to routing perturbations. These results offer a new perspective for understanding MoE architectures through the lens of expert paths.
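The core mechanism, one router weight matrix reused at every layer instead of an independent router per layer, can be sketched as a numpy toy (the dimensions and the stand-in expert update are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, n_layers, n_tokens = 16, 4, 3, 8

# One router weight matrix shared by all layers (the PathMoE idea);
# conventional MoE would draw a fresh matrix per layer instead.
W_shared = rng.standard_normal((d, n_experts))

def route(h, W, top_k=1):
    logits = h @ W
    return np.argsort(logits, axis=-1)[:, -top_k:]  # chosen expert indices

h = rng.standard_normal((n_tokens, d))
paths = []
for layer in range(n_layers):
    experts = route(h, W_shared)                 # same W at every layer
    paths.append(experts[:, 0])
    h = h + 0.1 * rng.standard_normal(h.shape)   # stand-in for the expert FFN

paths = np.stack(paths, axis=1)  # (n_tokens, n_layers): each token's expert path
```

Because consecutive layers score experts with the same parameters, a token's choices are correlated across depth, shrinking the effective path space relative to N^L.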

[515] Approximate Subgraph Matching with Neural Graph Representations and Reinforcement Learning

Kaiyang Li, Shihao Ji, Zhipeng Cai, Wei Li

Main category: cs.LG

TL;DR: RL-ASM: A reinforcement learning approach for approximate subgraph matching using graph transformers and PPO optimization, outperforming existing methods.

DetailsMotivation: Approximate subgraph matching (ASM) is NP-hard and critical for graph analysis applications, but existing heuristic methods cannot fully utilize graph information, leading to sub-optimal solutions.

Method: Proposes RL-ASM algorithm using graph transformers to extract graph representations and reinforcement learning policies. Built on branch-and-bound algorithm, uses Graph Transformer for feature extraction, imitation learning for initial training, and PPO for fine-tuning.

Result: Extensive experiments on synthetic and real-world datasets show RL-ASM outperforms existing methods in both effectiveness and efficiency.

Conclusion: The RL-based approach with graph transformers provides an effective solution for approximate subgraph matching, demonstrating superior performance over traditional heuristic methods.

Abstract: Approximate subgraph matching (ASM) is a task that determines the approximate presence of a given query graph in a large target graph. Being an NP-hard problem, ASM is critical in graph analysis with a myriad of applications ranging from database systems and network science to biochemistry and privacy. Existing techniques often employ heuristic search strategies, which cannot fully utilize the graph information, leading to sub-optimal solutions. This paper proposes a Reinforcement Learning based Approximate Subgraph Matching (RL-ASM) algorithm that exploits graph transformers to effectively extract graph representations and RL-based policies for ASM. Our model is built upon the branch-and-bound algorithm that selects one pair of nodes from the two input graphs at a time for potential matches. Instead of using heuristics, we exploit a Graph Transformer architecture to extract feature representations that encode the full graph information. To enhance the training of the RL policy, we use supervised signals to guide our agent in an imitation learning stage. Subsequently, the policy is fine-tuned with the Proximal Policy Optimization (PPO) that optimizes the accumulative long-term rewards over episodes. Extensive experiments on both synthetic and real-world datasets demonstrate that our RL-ASM outperforms existing methods in terms of effectiveness and efficiency. Our source code is available at https://github.com/KaiyangLi1992/RL-ASM.
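The pair-at-a-time selection loop can be illustrated with a greedy stand-in for the learned policy (the score matrix below replaces the Graph Transformer + PPO scores and is purely hypothetical):

```python
import numpy as np

def greedy_pair_matching(score):
    # Select one (query node, target node) pair at a time by highest
    # score, mimicking the branch-and-bound pair-selection loop; the
    # learned RL policy is replaced by a fixed score matrix here.
    score = score.copy().astype(float)
    mapping = {}
    for _ in range(score.shape[0]):
        q, t = np.unravel_index(np.argmax(score), score.shape)
        mapping[int(q)] = int(t)
        score[q, :] = -np.inf   # each query node is matched at most once
        score[:, t] = -np.inf   # each target node is matched at most once
    return mapping

scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.8, 0.1]])   # hypothetical pair scores
m = greedy_pair_matching(scores)       # → {0: 0, 1: 1}
```

RL-ASM's contribution is precisely that these scores come from learned graph representations rather than a fixed heuristic, but the outer selection structure is the same.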

[516] Learning to Reason with Curriculum I: Provable Benefits of Autocurriculum

Nived Rajaraman, Audrey Huang, Miro Dudik, Robert Schapire, Dylan J. Foster, Akshay Krishnamurthy

Main category: cs.LG

TL;DR: Autocurriculum training method reduces costs for reasoning models by adaptively selecting training data based on model performance, requiring exponentially fewer demonstrations than standard fine-tuning.

DetailsMotivation: Training reasoning models with chain-of-thought is extremely costly due to long reasoning traces collection and reinforcement learning post-training. The paper investigates whether these costs are fundamental or can be reduced through better algorithmic design.

Method: Proposes autocurriculum approach where the model uses its own performance to decide which problems to focus training on. For supervised fine-tuning, it focuses teacher supervision on prompts where the current model struggles. For RL fine-tuning, it decouples computational cost from reference model quality.

Result: Autocurriculum provably improves upon standard training recipes: requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning for SFT, and reduces reference model quality to a burn-in cost nearly independent of target accuracy for RL.

Conclusion: Autocurriculum offers significant cost reductions for training reasoning models through adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, without requiring assumptions about prompt distribution or difficulty.

Abstract: Chain-of-thought reasoning, where language models expend additional computation by producing thinking tokens prior to final responses, has driven significant advances in model capabilities. However, training these reasoning models is extremely costly in terms of both data and compute, as it involves collecting long traces of reasoning behavior from humans or synthetic generators and further post-training the model via reinforcement learning. Are these costs fundamental, or can they be reduced through better algorithmic design? We show that autocurriculum, where the model uses its own performance to decide which problems to focus training on, provably improves upon standard training recipes for both supervised fine-tuning (SFT) and reinforcement learning (RL). For SFT, we show that autocurriculum requires exponentially fewer reasoning demonstrations than non-adaptive fine-tuning, by focusing teacher supervision on prompts where the current model struggles. For RL fine-tuning, autocurriculum decouples the computational cost from the quality of the reference model, reducing the latter to a burn-in cost that is nearly independent of the target accuracy. These improvements arise purely from adaptive data selection, drawing on classical techniques from boosting and learning from counterexamples, and requiring no assumption on the distribution or difficulty of prompts.
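The adaptive data-selection idea, sampling training prompts in proportion to the current model's failure rate, can be sketched as follows (the success probabilities and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_prompts = 100
# Hypothetical per-prompt success probabilities of the current model
p_success = rng.uniform(0, 1, n_prompts)

def autocurriculum_batch(p_success, batch_size=10):
    # Focus teacher supervision on prompts where the model struggles:
    # weight each prompt by its estimated failure rate.
    weights = 1.0 - p_success
    weights = weights / weights.sum()
    return np.random.default_rng(2).choice(
        len(p_success), size=batch_size, replace=False, p=weights)

batch = autocurriculum_batch(p_success)
```

The paper's claim is that this kind of adaptivity alone, with no assumptions on the prompt distribution, accounts for the exponential reduction in required demonstrations.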

[517] Escaping Offline Pessimism: Vector-Field Reward Shaping for Safe Frontier Exploration

Amirhossein Roknilamouki, Arnob Ghosh, Eylem Ekici, Ness B. Shroff

Main category: cs.LG

TL;DR: Proposes vector-field reward shaping for safe boundary exploration in offline-to-online RL, using gradient-alignment and rotational-flow terms to induce continuous exploration along uncertainty boundaries without degenerate parking behavior.

DetailsMotivation: Offline RL provides reliable policies but is too pessimistic for online exploration. Safe RL explores near offline dataset boundaries, but naive boundary-seeking rewards cause degenerate parking behavior where agents stop at frontiers instead of exploring continuously.

Method: Uses uncertainty oracle trained from offline data. Reward combines: 1) gradient-alignment term attracting agent toward target uncertainty level, 2) rotational-flow term promoting motion along local tangent plane of uncertainty manifold. Integrated with Soft Actor-Critic for continuous navigation.

Result: Empirical validation on 2D continuous navigation task shows agents successfully traverse uncertainty boundaries while balancing safe, informative data collection with primary task completion. Theoretical analysis shows reward structure induces sustained exploratory behavior.

Conclusion: Vector-field reward shaping enables continuous, safe boundary exploration for non-adaptive deployed policies, preventing degenerate parking behavior while maintaining safety through uncertainty-aware exploration near offline dataset boundaries.

Abstract: While offline reinforcement learning provides reliable policies for real-world deployment, its inherent pessimism severely restricts an agent’s ability to explore and collect novel data online. Drawing inspiration from safe reinforcement learning, exploring near the boundary of regions well covered by the offline dataset and reliably modeled by the simulator allows an agent to take manageable risks, venturing into informative but moderate-uncertainty states while remaining close enough to familiar regions for safe recovery. However, naively rewarding this boundary-seeking behavior can lead to a degenerate parking behavior, where the agent simply stops once it reaches the frontier. To solve this, we propose a novel vector-field reward shaping paradigm designed to induce continuous, safe boundary exploration for non-adaptive deployed policies. Operating on an uncertainty oracle trained from offline data, our reward combines two complementary components: a gradient-alignment term that attracts the agent toward a target uncertainty level, and a rotational-flow term that promotes motion along the local tangent plane of the uncertainty manifold. Through theoretical analysis, we show that this reward structure naturally induces sustained exploratory behavior along the boundary while preventing degenerate solutions. Empirically, by integrating our proposed reward shaping with Soft Actor-Critic on a 2D continuous navigation task, we validate that agents successfully traverse uncertainty boundaries while balancing safe, informative data collection with primary task completion.
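A minimal numpy sketch of the two reward terms on a toy 2-D uncertainty oracle (the oracle, target level, and weights are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def uncertainty(x):
    # Hypothetical uncertainty oracle: grows with distance from the data center
    return np.linalg.norm(x)

def grad_uncertainty(x, eps=1e-5):
    # Central-difference gradient of the oracle
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = eps
        g[i] = (uncertainty(x + e) - uncertainty(x - e)) / (2 * eps)
    return g

def shaped_reward(x, v, u_target=1.0, w_align=1.0, w_rot=0.5):
    g = grad_uncertainty(x)
    # Gradient-alignment term: move up/down the uncertainty gradient
    # toward the target level u_target
    align = -np.sign(uncertainty(x) - u_target) * np.dot(v, g)
    # Rotational-flow term: reward velocity along the local tangent
    # (perpendicular to the gradient), preventing "parking" at the frontier
    tangent = np.array([-g[1], g[0]])
    rot = np.dot(v, tangent)
    return w_align * align + w_rot * rot

# At the frontier, a purely tangential velocity still earns reward
r = shaped_reward(np.array([2.0, 0.0]), np.array([0.0, 1.0]))  # → 0.5
```

The last line is the key behavior: a parked agent (v = 0) gets zero reward, while tangential motion along the boundary is still rewarded by the rotational term.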

[518] A Family of Adaptive Activation Functions for Mitigating Failure Modes in Physics-Informed Neural Networks

Krishna Murari

Main category: cs.LG

TL;DR: This paper introduces adaptive wavelet-based activation functions for Physics-Informed Neural Networks (PINNs) to address common failure modes, improving training stability and expressive power for solving PDEs.

DetailsMotivation: Standard PINNs often suffer from training instability and limited expressive power. The authors are motivated by the failure modes observed in traditional PINNs and the proven approximation capabilities of wavelets in computational mathematics.

Method: Developed a novel family of adaptive wavelet-based activation functions that combine trainable wavelet functions with either trainable or fixed hyperbolic tangent and softplus functions. Created five distinct activation functions within the PINN framework and systematically evaluated them across four representative classes of partial differential equations.

Result: The proposed wavelet-based activation functions significantly improved training stability and expressive power. Comprehensive comparisons using bar plots demonstrated improved robustness and accuracy compared to traditional activation functions. The approach outperformed baseline PINNs, transformer-based architectures like PINNsFormer, and other deep learning models.

Conclusion: The adaptive wavelet-based activation functions provide an effective and general solution to enhance PINN performance, addressing common failure modes while maintaining flexibility across different PDE types.

Abstract: Physics-Informed Neural Networks (PINNs) are a powerful and flexible learning framework that has gained significant attention in recent years and has demonstrated strong performance across a wide range of scientific and engineering problems. In parallel, wavelets have been extensively used as efficient computational tools due to their strong approximation capabilities. Motivated by the common failure modes observed in standard PINNs, this work introduces a novel family of adaptive wavelet-based activation functions. The proposed activation functions significantly improve training stability and expressive power by combining trainable wavelet functions with either trainable or fixed hyperbolic tangent and softplus functions. Five distinct activation functions are developed within the PINN framework and systematically evaluated across four representative classes of partial differential equations (PDEs). Comprehensive comparisons using bar plots demonstrate improved robustness and accuracy compared to traditional activation functions. Furthermore, the proposed approach is validated through direct comparisons with baseline PINNs, transformer-based architectures such as PINNsFormer, and other deep learning models, highlighting its effectiveness and generality.
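The abstract does not specify the exact wavelet family; one plausible sketch of such an activation, a trainable Morlet-style wavelet term added to a fixed tanh, is (the functional form and parameter names here are assumptions):

```python
import numpy as np

def wavelet_tanh_activation(x, a=1.0, w=1.0, b=1.0):
    # Trainable Morlet-style wavelet term plus a fixed tanh term;
    # a (amplitude), w (frequency), and b (slope) would be learned
    # per layer during PINN training.
    wavelet = a * np.cos(w * x) * np.exp(-0.5 * x ** 2)
    return wavelet + np.tanh(b * x)

x = np.linspace(-3, 3, 7)
y = wavelet_tanh_activation(x)   # y[3] corresponds to x = 0, where y = 1
```

The wavelet factor gives the activation localized oscillatory structure (useful for sharp PDE features), while the tanh term preserves the smooth global behavior of a standard activation.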

[519] Epistemic Generative Adversarial Networks

Muhammad Mubashar, Fabio Cuzzolin

Main category: cs.LG

TL;DR: Novel GAN framework using Dempster-Shafer theory to quantify uncertainty and improve output diversity through mass function predictions for each pixel.

DetailsMotivation: GANs often suffer from mode collapse and lack of output diversity, generating similar samples rather than a wide range of variations. There's a need for principled approaches to model uncertainty and improve generation variability.

Method: Proposes a generalization of GAN loss function based on Dempster-Shafer theory applied to both generator and discriminator. Introduces architectural enhancement to generator enabling prediction of mass functions for each image pixel to quantify uncertainty.

Result: Experimental evidence shows improved generation variability and provides a principled framework for modeling and interpreting uncertainty in generative processes.

Conclusion: The Dempster-Shafer based GAN framework successfully addresses diversity issues in generative models while providing uncertainty quantification capabilities.

Abstract: Generative models, particularly Generative Adversarial Networks (GANs), often suffer from a lack of output diversity, frequently generating similar samples rather than a wide range of variations. This paper introduces a novel generalization of the GAN loss function based on Dempster-Shafer theory of evidence, applied to both the generator and discriminator. Additionally, we propose an architectural enhancement to the generator that enables it to predict a mass function for each image pixel. This modification allows the model to quantify uncertainty in its outputs and leverage this uncertainty to produce more diverse and representative generations. Experimental evidence shows that our approach not only improves generation variability but also provides a principled framework for modeling and interpreting uncertainty in generative processes.
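One common Dempster-Shafer parameterization assigns each pixel belief masses over the two classes plus the ignorance set Θ; a sketch of such a per-pixel head (the three-way softmax here is an assumption for illustration, not necessarily the paper's exact design):

```python
import numpy as np

def pixel_mass_functions(logits):
    # Per-pixel mass over {class_0, class_1, Theta}; the mass assigned
    # to Theta (total ignorance) quantifies the model's uncertainty.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    m = e / e.sum(axis=-1, keepdims=True)
    uncertainty = m[..., 2]
    return m, uncertainty

logits = np.zeros((4, 4, 3))        # hypothetical 4x4 image, 3 mass logits
m, u = pixel_mass_functions(logits)  # uniform logits → mass 1/3 on Theta
```

A generator exposing the Θ mass per pixel can then be encouraged to spend that uncertainty on producing more varied outputs, which is the diversity mechanism the summary describes.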

[520] Mathematical Foundations of Deep Learning

Xiaojing Ye

Main category: cs.LG

TL;DR: A comprehensive mathematical textbook covering theoretical foundations of deep learning, including neural network approximation theory, optimal control/reinforcement learning integration, and contemporary generative models.

DetailsMotivation: To provide a rigorous mathematical foundation for modern deep learning, bridging the gap between practical applications and theoretical understanding across key areas of contemporary AI research.

Method: Mathematical textbook approach covering theoretical topics systematically: 1) Approximation capabilities of deep neural networks, 2) Optimal control and reinforcement learning theory/algorithms integrated with deep learning, 3) Contemporary generative models driving AI advances.

Result: A comprehensive draft book that systematically presents mathematical principles underlying deep learning, offering theoretical foundations for researchers and practitioners.

Conclusion: This book provides essential mathematical foundations for understanding deep learning theory, with particular relevance to generative models and reinforcement learning - key areas in multimodal AI research.

Abstract: This draft book offers a comprehensive and rigorous treatment of the mathematical principles underlying modern deep learning. The book spans core theoretical topics, from the approximation capabilities of deep neural networks, the theory and algorithms of optimal control and reinforcement learning integrated with deep learning techniques, to contemporary generative models that drive today’s advances in artificial intelligence.

[521] RE-SAC: Disentangling aleatoric and epistemic risks in bus fleet control: A stable and robust ensemble DRL approach

Yifan Zhang, Liang Zheng

Main category: cs.LG

TL;DR: RE-SAC: A robust ensemble soft actor-critic framework for bus holding control that disentangles aleatoric and epistemic uncertainties using IPM-based regularization and diversified Q-ensembles.

DetailsMotivation: Bus holding control faces challenges from stochastic traffic and passenger demand. Standard DRL actor-critic algorithms suffer from Q-value instability in volatile environments due to conflating aleatoric (irreducible noise) and epistemic (data insufficiency) uncertainties, leading to value underestimation and catastrophic policy collapse.

Method: Proposes RE-SAC with dual mechanisms: 1) Integral Probability Metric (IPM)-based weight regularization on critic network to hedge against aleatoric risk, providing smooth analytical lower bound for robust Bellman operator without expensive inner-loop perturbations; 2) Diversified Q-ensemble to penalize overconfident value estimates in sparsely covered regions, addressing epistemic risk.

Result: In realistic bidirectional bus corridor simulation, RE-SAC achieves highest cumulative reward (~-0.4e6) vs vanilla SAC (-0.55e6). Mahalanobis rareness analysis shows RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs 4343), demonstrating superior robustness under high traffic variability.

Conclusion: RE-SAC successfully disentangles aleatoric and epistemic uncertainties, preventing ensemble variance from misidentifying noise as data gaps, leading to more robust bus holding control in stochastic environments.

Abstract: Bus holding control is challenging due to stochastic traffic and passenger demand. While deep reinforcement learning (DRL) shows promise, standard actor-critic algorithms suffer from Q-value instability in volatile environments. A key source of this instability is the conflation of two distinct uncertainties: aleatoric uncertainty (irreducible noise) and epistemic uncertainty (data insufficiency). Treating these as a single risk leads to value underestimation in noisy states, causing catastrophic policy collapse. We propose a robust ensemble soft actor-critic (RE-SAC) framework to explicitly disentangle these uncertainties. RE-SAC applies Integral Probability Metric (IPM)-based weight regularization to the critic network to hedge against aleatoric risk, providing a smooth analytical lower bound for the robust Bellman operator without expensive inner-loop perturbations. To address epistemic risk, a diversified Q-ensemble penalizes overconfident value estimates in sparsely covered regions. This dual mechanism prevents the ensemble variance from misidentifying noise as a data gap, a failure mode identified in our ablation study. Experiments in a realistic bidirectional bus corridor simulation demonstrate that RE-SAC achieves the highest cumulative reward (approx. -0.4e6) compared to vanilla SAC (-0.55e6). Mahalanobis rareness analysis confirms that RE-SAC reduces Oracle Q-value estimation error by up to 62% in rare out-of-distribution states (MAE of 1647 vs. 4343), demonstrating superior robustness under high traffic variability.
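The epistemic half of the mechanism, penalizing ensemble disagreement in the critic target, can be sketched as follows (a common pessimism heuristic; RE-SAC additionally applies the IPM-based weight regularization, which is omitted here):

```python
import numpy as np

def ensemble_target(q_values, kappa=1.0):
    # Diversified Q-ensemble target: subtract the ensemble standard
    # deviation so that sparsely covered (high-disagreement) regions
    # get pessimistic value estimates, while unanimous (noisy but
    # well-covered) regions are left untouched.
    return q_values.mean(axis=0) - kappa * q_values.std(axis=0)

# 3 critics evaluated at 2 states: unanimous at the first,
# disagreeing at the second
q = np.array([[1.0, 2.0],
              [1.0, 4.0],
              [1.0, 3.0]])
t = ensemble_target(q)   # first target stays at 1.0, second is penalized
```

Keeping the two uncertainty channels separate (IPM regularization for aleatoric noise, ensemble spread for epistemic gaps) is what prevents noise from being misread as a data gap.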

[522] FlowMS: Flow Matching for De Novo Structure Elucidation from Mass Spectra

Jianan Nie, Peng Gao

Main category: cs.LG

TL;DR: FlowMS introduces discrete flow matching for spectrum-conditioned molecular structure generation from mass spectrometry data, achieving state-of-the-art performance on molecular elucidation benchmarks.

DetailsMotivation: De novo molecular structure elucidation from mass spectrometry spectra is challenging due to chemical space complexity and spectral ambiguity. While deep learning methods have made progress, diffusion-based approaches remain computationally demanding, and discrete flow matching hasn't been explored for this task.

Method: FlowMS uses discrete flow matching to generate molecular graphs through iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder.

Result: Achieves state-of-the-art performance on 5 out of 6 metrics on NPLIB1 benchmark: 9.15% top-1 accuracy (9.7% relative improvement over DiffMS) and 7.96 top-10 MCES (4.2% improvement over MS-BART). Generated molecules are structurally plausible and resemble ground truth.

Conclusion: Discrete flow matching establishes a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery.

Abstract: Mass spectrometry (MS) stands as a cornerstone analytical technique for molecular identification, yet de novo structure elucidation from spectra remains challenging due to the combinatorial complexity of chemical space and the inherent ambiguity of spectral fragmentation patterns. Recent deep learning approaches, including autoregressive sequence models, scaffold-based methods, and graph diffusion models, have made progress. However, diffusion-based generation for this task remains computationally demanding. Meanwhile, discrete flow matching, which has shown strong performance for graph generation, has not yet been explored for spectrum-conditioned structure elucidation. In this work, we introduce FlowMS, the first discrete flow matching framework for spectrum-conditioned de novo molecular generation. FlowMS generates molecular graphs through iterative refinement in probability space, enforcing chemical formula constraints while conditioning on spectral embeddings from a pretrained formula transformer encoder. Notably, it achieves state-of-the-art performance on 5 out of 6 metrics on the NPLIB1 benchmark: 9.15% top-1 accuracy (9.7% relative improvement over DiffMS) and 7.96 top-10 MCES (4.2% improvement over MS-BART). We also visualize the generated molecules, which further demonstrate that FlowMS produces structurally plausible candidates closely resembling ground truth structures. These results establish discrete flow matching as a promising paradigm for mass spectrometry-based structure elucidation in metabolomics and natural product discovery.
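The iterative refinement in probability space can be illustrated on a single categorical variable: discrete flow matching moves samples along a probability path from a source distribution (here uniform) toward the model-predicted target (the linear path and tiny state space are illustrative, not the paper's exact parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states = 5
target = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # hypothetical predicted atom type
uniform = np.full(n_states, 1.0 / n_states)

def p_t(t):
    # Linear probability path between the uniform source and the
    # (spectrum-conditioned) target distribution
    return (1.0 - t) * uniform + t * target

# Iterative refinement: sampling sharpens as t goes 0 -> 1
for t in np.linspace(0.0, 1.0, 5):
    x = rng.choice(n_states, p=p_t(t))
final = rng.choice(n_states, p=p_t(1.0))        # deterministic at t = 1
```

In FlowMS each node and edge of the molecular graph carries such a path, with the targets additionally constrained to respect the chemical formula.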

[523] Self-Tuning Sparse Attention: Multi-Fidelity Hyperparameter Optimization for Transformer Acceleration

Arundhathi Dev, Justin Zhan

Main category: cs.LG

TL;DR: AFBS-BO is an automated framework that discovers optimal sparse attention hyperparameters using Bayesian Optimization and binary search, eliminating manual tuning and accelerating transformer models.

DetailsMotivation: Sparse attention mechanisms face limited production adoption due to the need for manual hyperparameter tuning across different layers and models, creating a usability gap that hinders widespread deployment.

Method: AFBS-BO combines Bayesian Optimization for global exploration with binary search for local refinement, using multi-fidelity evaluation across sequence lengths to reduce tuning costs and discover layer- and head-specific hyperparameters automatically.

Result: On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x with 8.8x fewer evaluations than grid search, finding high-sparsity configurations that outperform existing sparse attention baselines while matching dense attention quality.

Conclusion: AFBS-BO transforms sparse attention from a manually tuned heuristic into a self-optimizing primitive, enabling plug-and-play acceleration across diverse transformer architectures and domains without human intervention.

Abstract: Sparse attention mechanisms promise to break the quadratic bottleneck of long-context transformers, yet production adoption remains limited by a critical usability gap: optimal hyperparameters vary substantially across layers and models, and current methods (e.g., SpargeAttn) rely on manual grid search to identify them. We propose AFBS-BO (Adaptive Fidelity Binary Search with Bayesian Optimization), a fully automated framework that discovers optimal layer- and head-specific hyperparameters without human intervention. Our hybrid algorithm combines Bayesian Optimization for global exploration with binary search for local refinement, leveraging multi-fidelity evaluation across sequence lengths to reduce tuning cost. On Llama-2-7B, AFBS-BO accelerates hyperparameter discovery by 3.4x with 8.8x fewer evaluations than grid search, and identifies high-sparsity configurations that outperform existing sparse attention baselines while closely matching dense attention quality. By transforming sparse attention from a manually tuned heuristic into a self-optimizing primitive, AFBS-BO enables plug-and-play acceleration across diverse transformer architectures and domains.
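The binary-search refinement stage can be sketched in isolation (the quality model and target below are hypothetical; the full method wraps this inner loop in Bayesian optimization with multi-fidelity evaluation across sequence lengths):

```python
def tune_threshold(quality_at, target_quality=0.99, lo=0.0, hi=1.0, iters=20):
    # Binary-search refinement: find the largest sparsity threshold
    # whose quality stays above target, assuming quality_at is
    # monotone decreasing in the threshold.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if quality_at(mid) >= target_quality:
            lo = mid          # still accurate enough: push sparsity higher
        else:
            hi = mid
    return lo

# Hypothetical quality model: accuracy degrades past threshold 0.6
thr = tune_threshold(lambda t: 1.0 - max(0.0, t - 0.6))
```

Each `quality_at` call is the expensive part in practice (a model evaluation), which is why the multi-fidelity scheme runs early probes at short sequence lengths before committing to full-length evaluations.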

[524] Towards Noise-Resilient Quantum Multi-Armed and Stochastic Linear Bandits

Zhuoyue Chen, Kechao Cai

Main category: cs.LG

TL;DR: Quantum bandit algorithms with noise-robust quantum Monte Carlo estimation for noisy quantum devices

DetailsMotivation: Existing quantum multi-armed bandit algorithms assume ideal noise-free quantum circuits, but current NISQ devices have significant noise that degrades performance. Need noise-robust quantum bandit algorithms that work on real quantum hardware.

Method: Develop a noise-robust quantum Monte Carlo algorithm for accurate estimation from noisy quantum reward oracles. Use this estimator to build noise-robust quantum multi-armed bandit and stochastic linear bandit algorithms that maintain quantum advantages in noisy environments.

Result: Experiments show the noise-robust approach improves estimation accuracy and reduces regret under several quantum noise models compared to non-robust quantum bandit algorithms.

Conclusion: The proposed noise-robust quantum bandit algorithms enhance performance on noisy quantum devices while preserving quantum advantages over classical methods, making quantum bandits more practical for NISQ-era quantum computing.

Abstract: Quantum multi-armed bandits (MAB) and stochastic linear bandits (SLB) have recently attracted significant attention, as their quantum counterparts can achieve quadratic speedups over classical MAB and SLB. However, most existing quantum MAB algorithms assume ideal quantum Monte Carlo (QMC) procedures on noise-free circuits, overlooking the impact of noise in current noisy intermediate-scale quantum (NISQ) devices. In this paper, we study a noise-robust QMC algorithm that improves estimation accuracy when querying quantum reward oracles. Building on this estimator, we propose noise-robust QMAB and QSLB algorithms that enhance performance in noisy environments while preserving the advantage over classical methods. Experiments show that our noise-robust approach improves QMAB estimation accuracy and reduces regret under several quantum noise models.
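As a purely classical analogue of noise-robust estimation, a median-of-means estimator tolerates a minority of corrupted reward samples (this is an illustrative stand-in, not the paper's quantum Monte Carlo procedure):

```python
import numpy as np

def median_of_means(samples, n_blocks=5):
    # The median of block means is robust to a minority of grossly
    # corrupted samples, loosely analogous in spirit to hardening
    # reward estimation against device noise.
    blocks = np.array_split(np.asarray(samples), n_blocks)
    return float(np.median([b.mean() for b in blocks]))

# Reward samples with two grossly corrupted readings
x = np.array([0.5] * 20 + [50.0, -50.0])
est = median_of_means(x)   # recovers 0.5 despite the outliers
```

A plain mean of the same samples would be dragged toward the corrupted values, which is the failure mode robust estimators of this family are built to avoid.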

[525] MLOW: Interpretable Low-Rank Frequency Magnitude Decomposition of Multiple Effects for Time Series Forecasting

Runze Yang, Longbing Cao, Xiaoming Wu, Xin You, Kun Fang, Jianxun Li, Jie Yang

Main category: cs.LG

TL;DR: MLOW is a frequency-based decomposition method for time series forecasting that uses low-rank magnitude spectrum representation and Hyperplane-NMF for interpretable multi-effect decomposition.

DetailsMotivation: Existing time series forecasting models cannot effectively learn interpretable multi-effect decomposition due to their smoothing-based temporal techniques. There's a need for methods that can separate multiple effects (trending, seasonal) in time series while maintaining interpretability.

Method: MLOW uses frequency-based decomposition where time series are represented as magnitude spectrum multiplied by phase-aware basis functions. It learns low-rank representation of magnitude spectrum to capture dominant effects, proposes Hyperplane-NMF for interpretable decomposition, and addresses frequency leakage through flexible input horizon and frequency level selection.

Result: MLOW enables interpretable and hierarchical multiple-effect decomposition that is robust to noise. It can be integrated as a plug-and-play component into existing TSF backbones with minimal architectural modifications while providing remarkable performance improvement.

Conclusion: MLOW provides an effective frequency-based approach for interpretable multi-effect decomposition in time series forecasting, addressing limitations of existing methods through novel low-rank spectrum representation and Hyperplane-NMF techniques.

Abstract: Separating multiple effects in time series is fundamental yet challenging for time-series forecasting (TSF). However, existing TSF models cannot effectively learn interpretable multi-effect decomposition with their smoothing-based temporal techniques. Here, a new interpretable frequency-based decomposition pipeline, MLOW, builds on the insight that a time series can be represented as a magnitude spectrum multiplied by the corresponding phase-aware basis functions, and that the magnitude spectrum distribution of a time series always exhibits observable patterns for different effects. MLOW learns a low-rank representation of the magnitude spectrum to capture dominant trending and seasonal effects. We explore low-rank methods, including PCA, NMF, and Semi-NMF, and find that none can simultaneously achieve interpretable, efficient and generalizable decomposition. Thus, we propose hyperplane-nonnegative matrix factorization (Hyperplane-NMF). Further, to address the frequency (spectral) leakage restricting high-quality low-rank decomposition, MLOW enables a flexible selection of input horizons and frequency levels via a mathematical mechanism. Visual analysis demonstrates that MLOW enables interpretable and hierarchical multiple-effect decomposition, robust to noise. It can also be plugged into existing TSF backbones with remarkable performance improvement and minimal architectural modifications.
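The representation MLOW starts from (a series as magnitude spectrum times phase-aware basis functions) is the standard Fourier identity, which can be checked in a few lines of NumPy. This sketch covers only that identity, not the paper's learned Hyperplane-NMF factorization:

```python
import numpy as np

def magnitude_phase_split(series):
    """Split a real series into a magnitude spectrum and phases, then
    reconstruct it: series == irfft(magnitude * exp(1j * phase))."""
    spec = np.fft.rfft(series)
    magnitude = np.abs(spec)   # the part MLOW factorizes (low-rank)
    phase = np.angle(spec)     # the phase-aware basis functions
    recon = np.fft.irfft(magnitude * np.exp(1j * phase), n=len(series))
    return magnitude, phase, recon
```

MLOW would then learn a low-rank factorization of `magnitude` across windows; that step is not reproduced here.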

[526] Discounted Beta–Bernoulli Reward Estimation for Sample-Efficient Reinforcement Learning with Verifiable Rewards

Haechan Kim, Soohyun Ryu, Gyouk Chu, Doohyuk Jang, Eunho Yang

Main category: cs.LG

TL;DR: RLVR reformulated as statistical estimation problem with Discounted Beta-Bernoulli reward estimation to reduce variance and improve sample efficiency in reasoning tasks.

DetailsMotivation: Existing group-based RLVR methods suffer from severe sample inefficiency due to reliance on point estimation from small rollouts, leading to high variance, variance collapse, and ineffective use of generated responses.

Method: Reformulate RLVR from statistical estimation perspective by modeling rewards as samples from policy-induced distribution, casting advantage computation as reward distribution estimation. Propose Discounted Beta-Bernoulli reward estimation leveraging historical reward statistics for non-stationary distribution.

Result: GRPO with DBB consistently outperforms naive GRPO across six in-distribution and three out-of-distribution reasoning benchmarks, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on 1.7B and 8B models respectively, without additional computational cost or memory usage.

Conclusion: The statistical estimation perspective and DBB reward estimation provide more efficient RLVR approach with reduced variance, stable estimation, and improved reasoning capabilities for LLMs.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta–Bernoulli (DBB) reward estimation, which leverages historical reward statistics for the non-stationary distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids estimated variance collapse, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
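A discounted Beta-Bernoulli posterior mean can be sketched in a few lines. The exact discounting rule used by DBB is not given in this summary, so the form below (geometric down-weighting of past binary rewards) is an assumption:

```python
def dbb_estimate(rewards, gamma=0.9, alpha0=1.0, beta0=1.0):
    """Hypothetical sketch of a discounted Beta-Bernoulli estimator:
    older binary rewards are geometrically down-weighted by gamma so
    the posterior tracks a non-stationary success probability."""
    alpha, beta = alpha0, beta0
    for r in rewards:  # oldest -> newest
        alpha = gamma * alpha + r        # discounted success pseudo-count
        beta = gamma * beta + (1 - r)    # discounted failure pseudo-count
    return alpha / (alpha + beta)        # posterior mean reward estimate
```

The prior pseudo-counts keep the estimate away from 0 and 1, which is what gives the reduced, stable variance relative to the naive per-group mean.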

[527] Seeking Universal Shot Language Understanding Solutions

Haoxin Liu, Harshavardhan Kamarthi, Zhiyuan Zhao, Hongjie Chen, B. Aditya Prakash

Main category: cs.LG

TL;DR: SLU-SUITE: A comprehensive training and evaluation suite for shot language understanding in film analysis, with 490K QA pairs across 33 tasks, plus two complementary VLM solutions (UniShot and AgentShots) that outperform existing approaches.

DetailsMotivation: Shot language understanding (SLU) is challenging for cinematic analysis due to diverse cinematographic dimensions and subjective expert judgment. While VLMs show strong general visual understanding, there are judgment discrepancies between VLMs and film experts on SLU tasks, creating a gap that needs to be addressed.

Method: 1) Created SLU-SUITE with 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. 2) Developed two complementary solutions: UniShot (balanced one-for-all generalist trained via dynamic-balanced data mixing) and AgentShots (prompt-routed expert cluster maximizing peak dimension performance).

Result: Models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by 22% on out-of-domain tasks. The suite provides insights into VLM bottlenecks and cross-dimensional task influences.

Conclusion: SLU-SUITE addresses the gap between VLMs and film experts in shot language understanding, providing both a comprehensive evaluation framework and effective model solutions that significantly improve performance on cinematic analysis tasks.

Abstract: Shot language understanding (SLU) is crucial for cinematic analysis but remains challenging due to its diverse cinematographic dimensions and subjective expert judgment. While vision-language models (VLMs) have shown strong ability in general visual understanding, recent studies reveal judgment discrepancies between VLMs and film experts on SLU tasks. To address this gap, we introduce SLU-SUITE, a comprehensive training and evaluation suite containing 490K human-annotated QA pairs across 33 tasks spanning six film-grounded dimensions. Using SLU-SUITE, we derive two original insights into VLM-based SLU: from the model side, diagnosing key bottlenecks in its modules; and from the data side, quantifying cross-dimensional influences among tasks. These findings motivate our universal SLU solutions from two complementary paradigms: UniShot, a balanced one-for-all generalist trained via dynamic-balanced data mixing, and AgentShots, a prompt-routed expert cluster that maximizes peak dimension performance. Extensive experiments show that our models outperform task-specific ensembles on in-domain tasks and surpass leading commercial VLMs by 22% on out-of-domain tasks.

[528] AcceRL: A Distributed Asynchronous Reinforcement Learning and World Model Framework for Vision-Language-Action Models

Chengxuan Lu, Shukuan Wang, Yanjie Li, Wei Liu, Shiji Jin, Fuyuan Qian, Peiming Li, Baigui Sun, Yang Liu

Main category: cs.LG

TL;DR: AcceRL: A fully asynchronous, decoupled RL framework with plug-and-play world model for Vision-Language-Action models, achieving SOTA performance with super-linear scaling and improved sample efficiency.

DetailsMotivation: Reinforcement learning for large-scale Vision-Language-Action models faces challenges in computational efficiency and data acquisition, requiring better frameworks to handle synchronization barriers and improve training efficiency.

Method: Proposes AcceRL, a fully asynchronous and decoupled RL framework that physically isolates training, inference, and rollouts to eliminate synchronization barriers. Integrates a plug-and-play, trainable world model into distributed asynchronous RL pipeline to generate virtual experiences.

Result: Achieves state-of-the-art performance on LIBERO benchmark, exhibits super-linear scaling in throughput, highly efficient hardware utilization, and delivers unprecedented sample efficiency and robust training stability in complex control tasks.

Conclusion: AcceRL provides an effective solution for scaling RL in Vision-Language-Action models through asynchronous architecture and world model integration, addressing key computational and data efficiency challenges.

Abstract: Reinforcement learning (RL) for large-scale Vision-Language-Action (VLA) models faces significant challenges in computational efficiency and data acquisition. We propose AcceRL, a fully asynchronous and decoupled RL framework designed to eliminate synchronization barriers by physically isolating training, inference, and rollouts. Crucially, AcceRL is the first to integrate a plug-and-play, trainable world model into a distributed asynchronous RL pipeline to generate virtual experiences. Experiments on the LIBERO benchmark demonstrate that AcceRL achieves state-of-the-art (SOTA) performance. Systematically, it exhibits super-linear scaling in throughput and highly efficient hardware utilization. Algorithmically, the world-model-augmented variant delivers unprecedented sample efficiency and robust training stability in complex control tasks.

[529] AIMER: Calibration-Free Task-Agnostic MoE Pruning

Zongfang Liu, Shengkun Tang, Yifan Shen, Huan Wang, Xin Yuan

Main category: cs.LG

TL;DR: AIMER is a calibration-free method for pruning experts in Mixture-of-Experts language models that uses absolute mean over root mean square importance scoring, eliminating the need for calibration data and reducing preprocessing time.

DetailsMotivation: Current expert pruning methods for MoE models require calibration data to estimate expert importance, making pruning outcomes sensitive to calibration set choice and adding substantial preprocessing costs. A calibration-free approach is needed.

Method: AIMER uses absolute mean over root mean square (AMoRMS) importance scoring to rank experts without calibration data. It computes importance scores directly from model parameters, providing clear score separation and expert stratification.

Result: AIMER achieves competitive or better performance than calibration-based methods across 7B to 30B MoE models at 25-50% pruning ratios on 16 benchmarks, with scoring times of only 0.22-1.27 seconds.

Conclusion: AIMER provides an effective calibration-free alternative for expert pruning in MoE models, reducing preprocessing overhead while maintaining or improving performance compared to calibration-dependent methods.

Abstract: Mixture-of-Experts (MoE) language models increase parameter capacity without proportional per-token compute, but the deployment still requires storing all experts, making expert pruning important for reducing memory and serving overhead. Existing task-agnostic expert pruning methods are typically calibration-dependent: they estimate expert importance from routing or activation statistics on a calibration set, which makes pruning outcomes sensitive to the choice of calibration set and adds substantial preprocessing cost. We introduce AIMER (Absolute mean over root mean square IMportance for Expert Ranking), a simple calibration-free criterion that yields clear within-layer score separation and distinct expert stratification. Across 7B to 30B MoE language models at 25% and 50% pruning ratios over 16 benchmarks, AIMER consistently delivers competitive or stronger overall performance against state-of-the-art calibration-based expert pruning baselines with only 0.22–1.27 seconds for scoring the experts.
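The criterion's name suggests a score of the form mean(|w|) / rms(w) computed directly from each expert's parameters. The summary does not spell out the exact formula or the pruning direction, so both are assumptions in this sketch:

```python
import numpy as np

def amorms_score(expert_weights):
    """Assumed AMoRMS-style score: absolute mean of an expert's weights
    over their RMS. Needs no calibration data; it is a pure function of
    parameters. By Cauchy-Schwarz the score lies in (0, 1]."""
    w = np.concatenate([p.ravel() for p in expert_weights])
    abs_mean = np.abs(w).mean()
    rms = np.sqrt((w ** 2).mean())
    return abs_mean / rms

def prune_experts(layer_experts, ratio=0.25):
    """Keep the top (1 - ratio) fraction of experts by score; whether
    high or low scores mark prunable experts is also an assumption."""
    scores = [amorms_score(e) for e in layer_experts]
    n_keep = max(1, int(round(len(layer_experts) * (1 - ratio))))
    order = np.argsort(scores)[::-1]
    return sorted(order[:n_keep].tolist())
```

Because the score depends only on weights, scoring an entire model is a single pass over parameters, consistent with the sub-second timings reported.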

[530] Balancing the Reasoning Load: Difficulty-Differentiated Policy Optimization with Length Redistribution for Efficient and Robust Reinforcement Learning

Yinan Xia, Haotian Zhang, Huiming Wang

Main category: cs.LG

TL;DR: DDPO is a reinforcement learning algorithm that optimizes Large Reasoning Models by differentiating between simple and complex tasks - reducing output length for simple tasks while expanding exploration for complex ones, achieving better accuracy-length trade-offs.

DetailsMotivation: Large Reasoning Models suffer from overthinking (excessively long answers) and overconfidence (overly short but incorrect answers), leading to suboptimal performance. The paper aims to address these dual issues through task-differentiated optimization.

Method: Proposes Difficulty-Differentiated Policy Optimization (DDPO), an RL algorithm that optimizes simple and complex tasks separately. For simple tasks, it reduces output length without compromising accuracy; for complex tasks, it expands exploration space. Uses theoretical conditions for maximizing expected accuracy and difficulty-level average as reference for length optimization.

Result: Extensive experiments show DDPO reduces average answer length by 12% while improving accuracy by 1.85% compared to GRPO across multiple benchmarks, achieving better accuracy-length trade-off.

Conclusion: DDPO effectively addresses overthinking and overconfidence in LRMs through difficulty-differentiated optimization, providing a practical solution for improving reasoning model efficiency and performance.

Abstract: Large Reasoning Models (LRMs) have shown exceptional reasoning capabilities, but they also suffer from the issue of overthinking, often generating excessively long and redundant answers. For problems that exceed the model’s capabilities, LRMs tend to exhibit the overconfidence phenomenon, generating overly short but incorrect answers, which may contribute to suboptimal performance. To address these issues, we propose Difficulty-Differentiated Policy Optimization (DDPO), an efficient reinforcement learning algorithm that optimizes simple and complex tasks separately based on the overconfidence phenomenon. Specifically, it reduces the output length for simple tasks without compromising accuracy, while for complex tasks, it expands the exploration space to improve performance. We further derive the theoretical conditions for maximizing expected accuracy, which require the length distribution to closely approximate the optimal length and be as concentrated as possible. Based on these conditions, we propose using the difficulty-level average as a well-founded reference for length optimization. Extensive experiments on both in-domain and out-of-domain benchmarks validate the superiority and effectiveness of DDPO. Compared to GRPO, DDPO reduces the average answer length by 12% while improving accuracy by 1.85% across multiple benchmarks, achieving a better trade-off between accuracy and length. The code is available at https://github.com/Yinan-Xia/DDPO.
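The idea of using the difficulty-level average as a length reference can be sketched as a reward-shaping term. The exact shaping DDPO uses is not specified in this summary, so the functional form, threshold, and scale below are all hypothetical:

```python
def length_bonus(length, group_lengths, group_accuracy,
                 easy_thresh=0.8, scale=0.1):
    """Hypothetical DDPO-style length shaping: for easy prompts (high
    group accuracy) reward answers shorter than the difficulty-level
    average length; for hard prompts apply no length penalty, leaving
    room for longer exploration."""
    ref = sum(group_lengths) / len(group_lengths)  # difficulty-level average
    if group_accuracy >= easy_thresh:
        return scale * (ref - length) / max(ref, 1)  # shorter => positive
    return 0.0  # hard prompt: do not discourage exploration
```

Concentrating the length distribution around this per-difficulty reference is what the paper's theoretical conditions for maximizing expected accuracy call for.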

[531] Data-efficient pre-training by scaling synthetic megadocs

Konwoo Kim, Suhas Kotha, Yejin Choi, Tatsunori Hashimoto, Nick Haber, Percy Liang

Main category: cs.LG

TL;DR: Synthetic data augmentation improves LLM pre-training efficiency, with “megadocs” (stitching multiple synthetic rephrases into longer documents) achieving better loss scaling than simple rephrasing as compute increases.

DetailsMotivation: When pre-training is data-constrained rather than compute-constrained, synthetic data augmentation offers a solution. The research aims to design synthetic data algorithms that achieve better loss scaling - improving performance not just at finite compute but especially as compute approaches infinity.

Method: Two main approaches: 1) Simple rephrasing: pre-training on web data mixed with synthetically generated rephrases from different distributions. 2) Megadocs: constructing longer documents by either stitching synthetic rephrases from the same web document or stretching documents by inserting rationales. Both methods are evaluated with optimal mixing and epoching strategies.

Result: Simple rephrasing achieves 1.48× data efficiency at 32 rephrases per document. Megadocs improve this to 1.80× data efficiency at 32 generations per document, with better i.i.d. loss, downstream benchmarks, and long-context loss. The improvement of megadocs over simple rephrasing widens as more synthetic data is generated.

Conclusion: Synthetic data algorithms can be designed to benefit more from increasing compute when data-constrained. Megadocs provide superior loss scaling compared to simple rephrasing, with the advantage growing as more synthetic data is generated.

Abstract: Synthetic data augmentation has emerged as a promising solution when pre-training is constrained by data rather than compute. We study how to design synthetic data algorithms that achieve better loss scaling: not only lowering loss at finite compute but especially as compute approaches infinity. We first show that pre-training on web data mixed with synthetically generated rephrases improves i.i.d. validation loss on the web data, despite the synthetic data coming from an entirely different distribution. With optimal mixing and epoching, loss and benchmark accuracy improve without overfitting as the number of synthetic generations grows, plateauing near 1.48× data efficiency at 32 rephrases per document. We find even better loss scaling under a new perspective: synthetic generations from the same document can form a single substantially longer megadocument instead of many short documents. We show two ways to construct megadocs: stitching synthetic rephrases from the same web document or stretching a document by inserting rationales. Both methods improve i.i.d. loss, downstream benchmarks, and especially long-context loss relative to simple rephrasing, increasing data efficiency from 1.48× to 1.80× at 32 generations per document. Importantly, the improvement of megadocs over simple rephrasing widens as more synthetic data is generated. Our results show how to design synthetic data algorithms that benefit more from increasing compute when data-constrained.
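The "stitching" construction is simple enough to sketch directly: concatenate several rephrases of one web document into one long training document rather than emitting them separately. Here `rephrase` stands in for an LLM call, and including the original document at the head is our assumption:

```python
def make_megadoc(doc, rephrase, n=4, sep="\n\n"):
    """Sketch of megadoc stitching: one web document plus n synthetic
    rephrases of it, joined into a single long training document
    instead of n+1 short ones."""
    pieces = [doc] + [rephrase(doc, seed=i) for i in range(n)]
    return sep.join(pieces)
```

The point of the construction is that the model sees the same content restated within one context window, which is what drives the reported long-context loss gains over simple rephrasing.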

[532] Beyond Passive Aggregation: Active Auditing and Topology-Aware Defense in Decentralized Federated Learning

Sheng Pan, Niansheng Tang

Main category: cs.LG

TL;DR: Active defense framework for Decentralized Federated Learning using proactive auditing metrics and topology-aware defense placement to detect adaptive backdoor attacks.

DetailsMotivation: DFL is vulnerable to adaptive backdoor attacks that bypass traditional passive defense metrics, requiring a shift to active, interventional auditing approaches.

Method: 1) Establish dynamical model for adversarial update diffusion across graph topologies; 2) Introduce proactive auditing metrics (stochastic entropy anomaly, randomized smoothing KL divergence, activation kurtosis) using private probes; 3) Implement topology-aware defense placement strategy.

Result: Framework demonstrates high competitiveness with state-of-the-art defenses in mitigating stealthy adaptive backdoors while preserving primary task utility across diverse architectures.

Conclusion: Active auditing framework effectively exposes latent backdoors invisible to conventional static detection, with theoretical convergence guarantees under co-evolving attack-defense dynamics.

Abstract: Decentralized Federated Learning (DFL) remains highly vulnerable to adaptive backdoor attacks designed to bypass traditional passive defense metrics. To address this limitation, we shift the defensive paradigm toward a novel active, interventional auditing framework. First, we establish a dynamical model to characterize the spatiotemporal diffusion of adversarial updates across complex graph topologies. Second, we introduce a suite of proactive auditing metrics: stochastic entropy anomaly, randomized smoothing Kullback-Leibler divergence, and activation kurtosis. These metrics utilize private probes to stress-test local models, effectively exposing latent backdoors that remain invisible to conventional static detection. Furthermore, we implement a topology-aware defense placement strategy to maximize global aggregation resilience. We provide theoretical guarantees for the system’s convergence under co-evolving attack and defense dynamics. Empirical evaluations across diverse architectures demonstrate that our active framework is highly competitive with state-of-the-art defenses in mitigating stealthy, adaptive backdoors while preserving primary task utility.
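Of the three auditing metrics, activation kurtosis is the easiest to illustrate. This sketch computes excess kurtosis of activations gathered on private probe inputs; the assumption (ours, not stated in the summary) is that backdoored models show heavier-tailed activations on trigger-like probes:

```python
import numpy as np

def activation_kurtosis(acts):
    """Excess kurtosis of a batch of activations: 0 for Gaussian,
    positive for heavy-tailed distributions. Used here as an
    illustrative auditing signal on private probes."""
    a = np.asarray(acts, dtype=float).ravel()
    mu, sigma = a.mean(), a.std()
    return ((a - mu) ** 4).mean() / (sigma ** 4 + 1e-12) - 3.0
```

A defender would flag a neighbor whose probe-time kurtosis deviates sharply from the population of honest peers.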

[533] GAPSL: A Gradient-Aligned Parallel Split Learning on Heterogeneous Data

Zheng Lin, Ons Aouedi, Wei Ni, Symeon Chatzinotas, Xianhao Chen

Main category: cs.LG

TL;DR: GAPSL is a gradient-aligned parallel split learning framework that addresses training divergence in federated learning by using leader gradient identification and gradient direction alignment to improve convergence on resource-constrained devices.

DetailsMotivation: The paper addresses challenges in democratizing federated learning (FL) on resource-constrained devices, where traditional FL approaches face computational burdens. Parallel split learning (PSL) helps by offloading computation to servers, but suffers from training divergence due to gradient directional inconsistency across clients since it's aggregation-free.

Method: GAPSL introduces two key components: 1) Leader Gradient Identification (LGI) - dynamically selects directionally consistent client gradients to construct a leader gradient capturing global convergence trend; 2) Gradient Direction Alignment (GDA) - uses direction-aware regularization to align each client’s gradient with the leader gradient, mitigating inter-device gradient inconsistency.

Result: Extensive experiments on a prototype computing testbed demonstrate that GAPSL consistently outperforms state-of-the-art benchmarks in both training accuracy and latency.

Conclusion: GAPSL effectively addresses the training divergence problem in parallel split learning by aligning gradient directions across clients, enabling more efficient federated learning on resource-constrained devices while maintaining convergence quality.

Abstract: The increasing complexity of neural networks poses significant challenges for democratizing FL on resource-constrained client devices. Parallel split learning (PSL) has emerged as a promising solution by offloading substantial computing workload to a server via model partitioning, shrinking client-side computing load, and eliminating the client-side model aggregation for reduced communication and deployment costs. Since PSL is aggregation-free, it suffers from severe training divergence stemming from gradient directional inconsistency across clients. To address this challenge, we propose GAPSL, a gradient-aligned PSL framework that comprises two key components: leader gradient identification (LGI) and gradient direction alignment (GDA). LGI dynamically selects a set of directionally consistent client gradients to construct a leader gradient that captures the global convergence trend. GDA employs a direction-aware regularization to align each client’s gradient with the leader gradient, thereby mitigating inter-device gradient directional inconsistency and enhancing model convergence. We evaluate GAPSL on a prototype computing testbed. Extensive experiments demonstrate that GAPSL consistently outperforms state-of-the-art benchmarks in training accuracy and latency.
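The two components can be sketched with flattened gradient vectors. "Directionally consistent" is interpreted here as positive cosine similarity with the mean gradient, and the GDA term as a (1 - cosine) penalty; both readings are assumptions about the paper's actual definitions:

```python
import numpy as np

def leader_gradient(grads, thresh=0.0):
    """Sketch of leader gradient identification (LGI): average only the
    client gradients whose cosine similarity with the mean direction
    exceeds a threshold, so outlier directions are excluded."""
    mean = np.mean(grads, axis=0)
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    consistent = [g for g in grads if cos(g, mean) > thresh]
    return np.mean(consistent, axis=0) if consistent else mean

def gda_penalty(client_grad, leader, lam=0.1):
    """Sketch of gradient direction alignment (GDA): a direction-aware
    regularizer penalizing misalignment with the leader gradient."""
    cos = client_grad @ leader / (
        np.linalg.norm(client_grad) * np.linalg.norm(leader) + 1e-12)
    return lam * (1.0 - cos)
```

Because the penalty depends only on direction, not magnitude, clients remain free to scale their local updates while being pulled toward the global trend.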

[534] HEP Statistical Inference for UAV Fault Detection: CLs, LRT, and SBI Applied to Blade Damage

Khushiyant

Main category: cs.LG

TL;DR: Particle physics statistical methods (LRT, CLs, SNPE) applied to UAV propeller fault detection using spectral features, achieving high accuracy with uncertainty quantification.

DetailsMotivation: To improve multirotor propeller fault detection by transferring robust statistical methods from particle physics that provide not just binary detection but also controlled false alarm rates and calibrated uncertainty quantification.

Method: Transfers three particle physics methods: likelihood ratio test for binary detection, CLs modified frequentist method for false alarm rate control, and sequential neural posterior estimation (SNPE) for quantitative fault characterization. Uses spectral features tied to rotor harmonic physics.

Result: Achieves AUC 0.862 on UAV-FD dataset (outperforming CUSUM, autoencoder, LSTM autoencoder), 93% detection of significant blade damage at 5% false alarm rate, and 0.986 AUC on PADRE platform. SNPE provides full posterior over fault severity with 92-100% credible interval coverage.

Conclusion: Particle physics statistical methods effectively transfer to UAV fault detection, providing superior performance with uncertainty quantification and controlled false alarm rates compared to traditional methods.

Abstract: This paper transfers three statistical methods from particle physics to multirotor propeller fault detection: the likelihood ratio test (LRT) for binary detection, the CLs modified frequentist method for false alarm rate control, and sequential neural posterior estimation (SNPE) for quantitative fault characterization. Operating on spectral features tied to rotor harmonic physics, the system returns three outputs: binary detection, controlled false alarm rates, and calibrated posteriors over fault severity and motor location. On UAV-FD, a hexarotor dataset of 18 real flights with 5% and 10% blade damage, leave-one-flight-out cross-validation gives AUC 0.862 +/- 0.007 (95% CI: 0.849–0.876), outperforming CUSUM (0.708 +/- 0.010), autoencoder (0.753 +/- 0.009), and LSTM autoencoder (0.551). At 5% false alarm rate the system detects 93% of significant and 81% of subtle blade damage. On PADRE, a quadrotor platform, AUC reaches 0.986 after refitting only the generative models. SNPE gives a full posterior over fault severity (90% credible interval coverage 92–100%, MAE 0.012), so the output includes uncertainty rather than just a point estimate or fault flag. Per-flight sequential detection achieves 100% fault detection with 94% overall accuracy.
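The first of the three transferred tools, the likelihood ratio test, reduces to comparing log-likelihoods under the healthy and faulty models. The Gaussian feature models below are an illustrative assumption; the paper fits its own generative models to spectral features:

```python
import math

def lrt_statistic(x, mu0, mu1, sigma):
    """Binary LRT on a scalar spectral feature x, assuming Gaussian
    healthy (mu0) and faulty (mu1) models with shared sigma.
    Positive values favor the fault hypothesis."""
    def loglik(mu):
        return (-0.5 * math.log(2 * math.pi * sigma ** 2)
                - (x - mu) ** 2 / (2 * sigma ** 2))
    return loglik(mu1) - loglik(mu0)
```

Thresholding this statistic gives the binary detector; the CLs machinery then sets the threshold so the false alarm rate is controlled (e.g. the 5% operating point quoted above).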

[535] SINDy-KANs: Sparse identification of non-linear dynamics through Kolmogorov-Arnold networks

Amanda A. Howard, Nicholas Zolman, Bruno Jacob, Steven L. Brunton, Panos Stinis

Main category: cs.LG

TL;DR: SINDy-KANs combines Kolmogorov-Arnold networks with sparse identification of nonlinear dynamics to improve interpretability of learned representations while maintaining function composition capabilities.

DetailsMotivation: KANs offer potential interpretability but don't guarantee sparse/parsimonious solutions, while SINDy learns sparse equations but is limited by library constraints. The authors aim to combine both approaches for better interpretability.

Method: Simultaneously trains a KAN and a SINDy-like representation, applying SINDy at the level of each activation function to maintain deep KAN composition capabilities while increasing interpretability.

Result: The method demonstrates accurate equation discovery across various symbolic regression tasks, including dynamical systems.

Conclusion: SINDy-KANs successfully enhances interpretability of KAN representations while preserving their compositional power, enabling better equation discovery for dynamical systems.

Abstract: Kolmogorov-Arnold networks (KANs) have arisen as a potential way to enhance the interpretability of machine learning. However, solutions learned by KANs are not necessarily interpretable, in the sense of being sparse or parsimonious. Sparse identification of nonlinear dynamics (SINDy) is a complementary approach that allows for learning sparse equations for dynamical systems from data; however, learned equations are limited by the library. In this work, we present SINDy-KANs, which simultaneously train a KAN and a SINDy-like representation to increase interpretability of KAN representations with SINDy applied at the level of each activation function, while maintaining the function compositions possible through deep KANs. We apply our method to a number of symbolic regression tasks, including dynamical systems, to show accurate equation discovery across a range of systems.
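The SINDy regression step that the paper applies per activation function is classically implemented as sequential thresholded least squares: fit a library of candidate terms, zero out small coefficients, and refit on the survivors. This sketch shows that standard algorithm, not the paper's KAN integration:

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequential thresholded least squares (the SINDy regression):
    theta is the (samples x terms) library matrix, dxdt the observed
    derivatives. Returns a sparse coefficient vector."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                      # enforce sparsity
        big = ~small
        if big.any():                        # refit surviving terms
            xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi
```

In SINDy-KANs this regression would run against each learned activation rather than raw trajectory data, which is what keeps deep function composition available.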

[536] Transformers Learn Robust In-Context Regression under Distributional Uncertainty

Hoang T. H. Cao, Hai D. V. Trinh, Tho Quan, Lan V. Truong

Main category: cs.LG

TL;DR: Transformers demonstrate robust in-context learning for linear regression under various distributional shifts, outperforming classical estimators despite non-Gaussian data, heavy-tailed noise, and non-i.i.d. prompts.

DetailsMotivation: Previous work on Transformers' in-context learning for linear regression relied on restrictive assumptions (i.i.d. data, Gaussian noise, Gaussian coefficients) that don't reflect real-world data distributions. The paper investigates whether Transformers can learn effectively under realistic distributional uncertainty.

Method: Study in-context learning for noisy linear regression under broad distributional shifts including non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts. Compare Transformers against classical baselines (optimal/suboptimal under maximum-likelihood criteria).

Result: Transformers consistently match or outperform classical baselines across all settings, demonstrating robust in-context adaptation beyond classical estimators.

Conclusion: Transformers exhibit strong in-context learning capabilities even under challenging distributional shifts, suggesting they can adapt effectively to real-world data patterns that violate traditional statistical assumptions.

Abstract: Recent work has shown that Transformers can perform in-context learning for linear regression under restrictive assumptions, including i.i.d. data, Gaussian noise, and Gaussian regression coefficients. However, real-world data often violate these assumptions: the distributions of inputs, noise, and coefficients are typically unknown, non-Gaussian, and may exhibit dependency across the prompt. This raises a fundamental question: can Transformers learn effectively in-context under realistic distributional uncertainty? We study in-context learning for noisy linear regression under a broad range of distributional shifts, including non-Gaussian coefficients, heavy-tailed noise, and non-i.i.d. prompts. We compare Transformers against classical baselines that are optimal or suboptimal under the corresponding maximum-likelihood criteria. Across all settings, Transformers consistently match or outperform these baselines, demonstrating robust in-context adaptation beyond classical estimators.

[537] SpecForge: A Flexible and Efficient Open-Source Training Framework for Speculative Decoding

Shenggui Li, Chao Wang, Yikai Zhu, Yubo Wang, Fan Yin, Shuai Shi, Yefei Chen, Xiaomin Dong, Qiaoling Chen, Jin Pan, Ji Li, Laixin Xie, Yineng Zhang, Lei Yu, Yonggang Wen, Ivor Tsang, Tianwei Zhang

Main category: cs.LG

TL;DR: SpecForge is an open-source framework for training speculative decoding models with EAGLE-3 support, enabling faster training and deployment of draft models for LLM inference acceleration.

DetailsMotivation: Large language models have high inference latency due to sequential autoregressive decoding. Speculative decoding can reduce this bottleneck but faces limitations due to lack of high-quality draft models and scalable training infrastructure.

Method: SpecForge framework incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines. It also introduces SpecBundle - a suite of production-grade EAGLE-3 draft models trained with SpecForge.

Result: Achieves up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B and up to 4.48x end-to-end inference speedup on SGLang with their draft models.

Conclusion: SpecForge provides a practical foundation for real-world speculative decoding deployment by addressing the scarcity of high-quality draft models and enabling efficient training infrastructure.

Abstract: Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.
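The draft-propose / target-verify loop that speculative decoding builds on can be sketched as follows. This is a toy with greedy deterministic "models" as plain functions, not SpecForge's actual API or the rejection-sampling acceptance rule used in practice.

```python
# Hedged sketch of one speculative-decoding step: the draft proposes k
# tokens, the target verifies them and accepts the longest matching
# prefix, then emits its own next token (toy greedy models).
def speculative_step(target, draft, prefix, k=4):
    # Draft proposes k tokens autoregressively.
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target verifies the proposals (in a real system, one batched pass):
    # accept the matching prefix, then append the target's own token.
    accepted, ctx = [], list(prefix)
    for t in proposal:
        if target(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)
    accepted.append(target(ctx))
    return accepted

# Toy models: the target counts up by 1; the draft agrees except on
# every third position, where it skips ahead.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 3 else ctx[-1] + 2
out = speculative_step(target, draft, [0], k=4)
```

The speedup comes from the verify pass being batched: several accepted tokens cost roughly one target forward pass, which is why high-quality draft models (the gap SpecBundle targets) matter so much.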

[538] Attack by Unlearning: Unlearning-Induced Adversarial Attacks on Graph Neural Networks

Jiahao Zhang, Yilong Wang, Suhang Wang

Main category: cs.LG

TL;DR: Unlearning corruption attacks on GNNs: Adversaries inject carefully chosen nodes into training data, then request their deletion via privacy regulations, causing significant accuracy degradation after unlearning is applied.

DetailsMotivation: Graph neural networks need to comply with privacy regulations requiring data deletion, but approximate unlearning methods can cause performance degradation that adversaries can exploit.

Method: Formulate attack as bi-level optimization: inject adversarial nodes into training graph, then request their deletion. Approximate unlearning via gradient-based updates and use surrogate model for pseudo-labels to overcome black-box unlearning and label scarcity challenges.

Result: Extensive experiments show small, carefully designed unlearning requests can induce significant accuracy degradation across benchmarks and unlearning algorithms, demonstrating vulnerability of GNN unlearning systems.

Conclusion: Unlearning corruption attacks reveal serious robustness concerns for GNN unlearning under real-world regulatory demands, raising urgent security issues for privacy-compliant graph learning systems.

Abstract: Graph neural networks (GNNs) are widely used for learning from graph-structured data in domains such as social networks, recommender systems, and financial platforms. To comply with privacy regulations like the GDPR, CCPA, and PIPEDA, approximate graph unlearning, which aims to remove the influence of specific data points from trained models without full retraining, has become an increasingly important component of trustworthy graph learning. However, approximate unlearning often incurs subtle performance degradation, which can produce negative and unintended side effects. In this work, we show that such degradations can be amplified into adversarial attacks. We introduce the notion of \textbf{unlearning corruption attacks}, where an adversary injects carefully chosen nodes into the training graph and later requests their deletion. Because deletion requests are legally mandated and cannot be denied, this attack surface is both unavoidable and stealthy: the model performs normally during training, but accuracy collapses only after unlearning is applied. Technically, we formulate this attack as a bi-level optimization problem: to overcome the challenges of black-box unlearning and label scarcity, we approximate the unlearning process via gradient-based updates and employ a surrogate model to generate pseudo-labels for the optimization. Extensive experiments across benchmarks and unlearning algorithms demonstrate that small, carefully designed unlearning requests can induce significant accuracy degradation, raising urgent concerns about the robustness of GNN unlearning under real-world regulatory demands. The source code will be released upon paper acceptance.
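The "approximate the unlearning process via gradient-based updates" step in the inner level of the attack can be sketched as a single ascent step on the forget-set loss. This is a heavily simplified stand-in: real graph unlearning methods are more involved, and the step size and parameter shapes here are illustrative only.

```python
# Hedged sketch of gradient-based approximate unlearning: moving *up* the
# forget-set loss gradient approximately removes the influence those
# deleted nodes had during training.
def approx_unlearn(theta, grad_forget, eta=0.1):
    # theta: current parameters; grad_forget: gradient of the loss on
    # the nodes requested for deletion; eta: unlearning step size.
    return [t + eta * g for t, g in zip(theta, grad_forget)]

theta_new = approx_unlearn([1.0, -1.0], [0.5, 0.5])
```

Making this inner step differentiable is what lets the attacker optimize the injected nodes in the outer level of the bi-level problem.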

[539] Elastic Weight Consolidation Done Right for Continual Learning

Xuan Liu, Xiaobin Chang

Main category: cs.LG

TL;DR: EWC-DR improves Elastic Weight Consolidation by addressing gradient vanishing and redundant protection issues through Logits Reversal operation during Fisher Information Matrix calculation.

DetailsMotivation: EWC is a foundational continual learning method but shows suboptimal performance due to gradient vanishing and inaccurate importance estimation when using Fisher Information Matrix, and its variants like MAS impose unnecessary constraints on irrelevant parameters.

Method: Proposes Logits Reversal (LR) operation that reverses logit values during FIM calculation to prevent gradient vanishing and redundant protection, creating EWC Done Right (EWC-DR).

Result: Extensive experiments across various continual learning tasks and datasets show EWC-DR significantly outperforms existing EWC and its variants.

Conclusion: The proposed Logits Reversal operation effectively addresses fundamental issues in EWC’s importance estimation, leading to improved continual learning performance.

Abstract: Weight regularization methods in continual learning (CL) alleviate catastrophic forgetting by assessing and penalizing changes to important model weights. Elastic Weight Consolidation (EWC) is a foundational and widely used approach within this framework that estimates weight importance based on gradients. However, it has consistently shown suboptimal performance. In this paper, we conduct a systematic analysis of importance estimation in EWC from a gradient-based perspective. For the first time, we find that EWC’s reliance on the Fisher Information Matrix (FIM) results in gradient vanishing and inaccurate importance estimation in certain scenarios. Our analysis also reveals that Memory Aware Synapses (MAS), a variant of EWC, imposes unnecessary constraints on parameters irrelevant to prior tasks, termed the redundant protection. Consequently, both EWC and its variants exhibit fundamental misalignments in estimating weight importance, leading to inferior performance. To tackle these issues, we propose the Logits Reversal (LR) operation, a simple yet effective modification that rectifies EWC’s importance estimation. Specifically, reversing the logit values during the calculation of FIM can effectively prevent both gradient vanishing and redundant protection. Extensive experiments across various CL tasks and datasets show that the proposed method significantly outperforms existing EWC and its variants. Therefore, we refer to it as EWC Done Right (EWC-DR).
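The quadratic penalty that EWC (and hence EWC-DR) adds to the loss can be sketched as below. The Logits Reversal step itself happens earlier, while estimating the diagonal Fisher Information Matrix, and is not reproduced here; this only shows the standard penalty that the rectified importances feed into.

```python
# Hedged sketch of the EWC quadratic penalty: each weight's movement away
# from the previous-task optimum theta* is penalized in proportion to its
# estimated (diagonal Fisher) importance F_i:
#   lam/2 * sum_i F_i * (theta_i - theta*_i)^2
def ewc_penalty(theta, theta_star, fisher_diag, lam=1.0):
    return 0.5 * lam * sum(
        f * (t - ts) ** 2
        for f, t, ts in zip(fisher_diag, theta, theta_star)
    )

# A weight with high importance (F = 100) that has not moved contributes
# nothing; a moderately important one (F = 4) moved by 1.0 is penalized.
pen = ewc_penalty([1.0, 2.0], [0.0, 2.0], fisher_diag=[4.0, 100.0])
```

The paper's point is that when gradients vanish during FIM estimation, the F_i values above are wrong, so the penalty protects the wrong weights; reversing the logits during that estimation is what EWC-DR fixes.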

[540] Evaluating Model-Free Policy Optimization in Masked-Action Environments via an Exact Blackjack Oracle

Kevin Song

Main category: cs.LG

TL;DR: Exact dynamic programming oracle for infinite-shoe blackjack provides ground-truth benchmark for evaluating sample-efficient policy recovery methods in discrete stochastic control with dynamic action masking.

DetailsMotivation: To establish a rigorous, exactly verifiable benchmark for discrete stochastic control problems with dynamically masked actions, using casino blackjack as a testbed where exact optimal policies can be computed for comparison.

Method: Derived exact dynamic programming oracle over 4,600 canonical decision cells for Vegas-style blackjack rules. Evaluated three model-free optimizers via simulated interaction: masked REINFORCE with per-cell baseline, SPSA, and CEM.

Result: REINFORCE was most sample-efficient (46.37% action-match rate, -0.04688 EV after 1M hands), outperforming CEM and SPSA. All methods showed substantial cell-conditional regret despite smooth reward convergence. Negative control confirmed optimal bet sizing collapses to table minimum under i.i.d. draws.

Conclusion: Tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging; aggregate reward curves can obscure local policy failures. Need exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.

Abstract: Infinite-shoe casino blackjack provides a rigorous, exactly verifiable benchmark for discrete stochastic control under dynamically masked actions. Under a fixed Vegas-style ruleset (S17, 3:2 payout, dealer peek, double on any two, double after split, resplit to four), an exact dynamic programming (DP) oracle was derived over 4,600 canonical decision cells. This oracle yielded ground-truth action values, optimal policy labels, and a theoretical expected value (EV) of -0.00161 per hand. To evaluate sample-efficient policy recovery, three model-free optimizers were trained via simulated interaction: masked REINFORCE with a per-cell exponential moving average baseline, simultaneous perturbation stochastic approximation (SPSA), and the cross-entropy method (CEM). REINFORCE was the most sample-efficient, achieving a 46.37% action-match rate and an EV of -0.04688 after 10^6 hands, outperforming CEM (39.46%, 7.5x10^6 evaluations) and SPSA (38.63%, 4.8x10^6 evaluations). However, all methods exhibited substantial cell-conditional regret, indicating persistent policy-level errors despite smooth reward convergence. This gap shows that tabular environments with severe state-visitation sparsity and dynamic action masking remain challenging, while aggregate reward curves can obscure critical local failures. As a negative control, it was proven and empirically confirmed that under i.i.d. draws without counting, optimal bet sizing collapses to the table minimum. In addition, larger wagers strictly increased volatility and ruin without improving expectation. These results highlight the need for exact oracles and negative controls to avoid mistaking stochastic variability for genuine algorithmic performance.
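The masked REINFORCE with per-cell EMA baseline that performed best can be sketched on a toy single-cell bandit (not the blackjack oracle; the learning rate, reward values, and step count here are illustrative):

```python
import math
import random

# Hedged sketch: masked REINFORCE with an EMA baseline for one decision
# cell. Illegal actions are masked out of the softmax and get no gradient.
def masked_softmax(logits, mask):
    z = [l if m else -1e9 for l, m in zip(logits, mask)]
    mx = max(z)
    e = [math.exp(v - mx) for v in z]
    s = sum(e)
    return [v / s for v in e]

def reinforce_step(logits, mask, action, reward, baseline, lr=0.5, beta=0.9):
    probs = masked_softmax(logits, mask)
    adv = reward - baseline
    # d/dlogits log pi(a) = one_hot(a) - probs; masked entries stay frozen.
    for i, m in enumerate(mask):
        if m:
            g = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * adv * g
    return beta * baseline + (1 - beta) * reward  # EMA baseline update

random.seed(0)
logits, mask = [0.0, 0.0, 0.0], [True, True, False]  # action 2 is masked
rewards = [1.0, 0.0, 0.0]  # action 0 is best among the legal actions
baseline = 0.0
for _ in range(200):
    probs = masked_softmax(logits, mask)
    a = random.choices(range(3), weights=probs)[0]
    baseline = reinforce_step(logits, mask, a, rewards[a], baseline)
final = masked_softmax(logits, mask)
```

Note how the mask guarantees zero probability and zero gradient for illegal actions; with 4,600 cells, most of which are rarely visited, the per-cell baselines stay stale, which is one plausible source of the cell-conditional regret the paper reports.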

[541] Enhancing Multi-Corpus Training in SSL-Based Anti-Spoofing Models: Domain-Invariant Feature Extraction

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

Main category: cs.LG

TL;DR: Multi-corpus training for speech spoofing detection doesn’t always improve performance due to dataset biases; proposed Invariant Domain Feature Extraction framework reduces corpus-specific information to enhance generalization.

DetailsMotivation: Speech spoofing detection performance varies across different training/evaluation corpora. While multi-corpus training typically helps in other speech tasks, it doesn't consistently improve spoofing detection and may even degrade performance due to dataset-specific biases that impair generalization.

Method: Proposed Invariant Domain Feature Extraction (IDFE) framework using multi-task learning with a gradient reversal layer to minimize corpus-specific information in learned embeddings, making features more domain-invariant.

Result: IDFE framework reduces average equal error rate by 20% compared to baseline when evaluated across four varied datasets, demonstrating improved generalization across different corpora.

Conclusion: Dataset biases significantly impact speech spoofing detection generalization; the IDFE framework effectively addresses this by extracting domain-invariant features, leading to more robust performance across different corpora.

Abstract: The performance of speech spoofing detection often varies across different training and evaluation corpora. Leveraging multiple corpora typically enhances robustness and performance in fields like speaker recognition and speech recognition. However, our spoofing detection experiments show that multi-corpus training does not consistently improve performance and may even degrade it. We hypothesize that dataset-specific biases impair generalization, leading to performance instability. To address this, we propose an Invariant Domain Feature Extraction (IDFE) framework, employing multi-task learning and a gradient reversal layer to minimize corpus-specific information in learned embeddings. The IDFE framework reduces the average equal error rate by 20% compared to the baseline, assessed across four varied datasets.
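The gradient reversal layer (GRL) at the heart of IDFE can be sketched without an autograd framework: it is the identity in the forward pass, and negates (and scales) the gradient in the backward pass, so the encoder is trained to remove whatever the corpus classifier can exploit. A minimal list-based sketch; `lam` is the usual reversal strength hyperparameter.

```python
# Hedged sketch of a gradient reversal layer: forward is the identity,
# backward flips the sign of the domain-classifier gradient, turning its
# minimization into adversarial maximization for the encoder.
def grl_forward(x):
    return x  # features pass through unchanged

def grl_backward(grad_from_domain_head, lam=1.0):
    # The gradient flowing back into the encoder is negated and scaled.
    return [-lam * g for g in grad_from_domain_head]

g = grl_backward([0.5, -2.0], lam=0.1)
```

In a real implementation this is a custom autograd op; the encoder then ends up minimizing the spoofing-detection loss while maximizing the corpus-classification loss, which is what pushes corpus-specific information out of the embeddings.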

[542] Revisiting Label Inference Attacks in Vertical Federated Learning: Why They Are Vulnerable and How to Defend

Yige Liu, Dexuan Xu, Zimai Guo, Yongzhi Cao, Hanpin Wang

Main category: cs.LG

TL;DR: The paper reveals that in vertical federated learning, bottom models primarily extract features while top models handle label mapping, challenging previous assumptions about label inference attacks and proposing a zero-overhead defense through layer adjustment.

DetailsMotivation: Previous studies on label inference attacks (LIAs) in vertical federated learning assumed that well-trained bottom models can effectively represent labels, but this paper demonstrates this view is misleading and exposes vulnerabilities in existing LIAs.

Method: The authors use mutual information analysis to observe the “model compensation” phenomenon, theoretically prove that mutual information between layer outputs and labels increases with depth, introduce task reassignment to test distribution alignment, and propose a zero-overhead defense based on layer adjustment.

Result: Experiments across five datasets and five model architectures show that shifting cut layers forward (increasing top model proportion) improves resistance to LIAs and enhances other defenses, while disrupting feature-label alignment causes LIA performance to decline sharply.

Conclusion: Bottom models in VFL primarily extract feature information while top models handle label mapping, challenging previous LIA assumptions; adjusting layer distribution provides effective defense without overhead.

Abstract: Vertical federated learning (VFL) allows an active party with a top model, and multiple passive parties with bottom models to collaborate. In this scenario, passive parties possessing only features may attempt to infer active party’s private labels, making label inference attacks (LIAs) a significant threat. Previous LIA studies have claimed that well-trained bottom models can effectively represent labels. However, we demonstrate that this view is misleading and exposes the vulnerability of existing LIAs. By leveraging mutual information, we present the first observation of the “model compensation” phenomenon in VFL. We theoretically prove that, in VFL, the mutual information between layer outputs and labels increases with layer depth, indicating that bottom models primarily extract feature information while the top model handles label mapping. Building on this insight, we introduce task reassignment to show that the success of existing LIAs actually stems from the distribution alignment between features and labels. When this alignment is disrupted, the performance of LIAs declines sharply or even fails entirely. Furthermore, the implications of this insight for defenses are also investigated. We propose a zero-overhead defense technique based on layer adjustment. Extensive experiments across five datasets and five representative model architectures indicate that shifting cut layers forward to increase the proportion of top model layers in the entire model not only improves resistance to LIAs but also enhances other defenses.

[543] HISR: Hindsight Information Modulated Segmental Process Rewards For Multi-turn Agentic Reinforcement Learning

Zhicong Lu, Zichuan Lin, Wei Jia, Changyuan Tian, Deheng Ye, Peiguang Li, Li Jin, Nayu Liu, Guangluan Xu, Wei Feng

Main category: cs.LG

TL;DR: HISR (Hindsight Information for Segmental Rewards) improves agentic decision-making in LLMs by using hindsight information to modulate segment-level process rewards, enhancing credit assignment reliability.

DetailsMotivation: Current LLMs struggle with complex long-horizon agentic decision-making tasks. Existing methods using reward models suffer from delayed reward propagation in sparse outcomes and unreliable credit assignment with overly fine-grained turn-level process rewards.

Method: Proposes HISR with: 1) Segment-level process reward model for sub-goal rewards (avoiding turn-level granularity), 2) Hindsight model reflecting action preference given trajectory outcomes, 3) Using likelihood ratios between hindsight and policy models to measure action importance, 4) Aggregating segment importance scores to modulate segmental process rewards.

Result: Extensive experiments on three public benchmarks demonstrate the method’s validity and effectiveness in improving agentic decision-making performance.

Conclusion: HISR successfully addresses credit assignment issues in long-horizon agentic tasks by aligning rewards with sub-goals and emphasizing significant segments through hindsight information modulation.

Abstract: While large language models excel in diverse domains, their performance on complex long-horizon agentic decision-making tasks remains limited. Most existing methods concentrate on designing effective reward models (RMs) to advance performance via multi-turn reinforcement learning. However, they suffer from delayed propagation in sparse outcome rewards and unreliable credit assignment with potentially overly fine-grained and unfocused turn-level process rewards. In this paper, we propose HISR, which exploits Hindsight Information to modulate Segmental process Rewards, closely aligning rewards with sub-goals and underscoring significant segments to enhance the reliability of credit assignment. Specifically, a segment-level process RM is presented to assign rewards for each sub-goal in the task, avoiding excessively granular allocation to turns. To emphasize significant segments in the trajectory, a hindsight model is devised to reflect the preference of performing a certain action after knowing the trajectory outcome. With this characteristic, we design the ratios of sequence likelihoods between the hindsight and policy models to measure action importance. The ratios are subsequently employed to aggregate segment importance scores, which in turn modulate segmental process rewards, enhancing credit assignment reliability. Extensive experimental results on three public benchmarks demonstrate the validity of our method.
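The hindsight/policy likelihood-ratio weighting described above can be sketched with toy scalars (in the actual method these are sequence log-likelihoods under the hindsight and policy LLMs, and the aggregation scheme here is one plausible choice, not necessarily the paper's):

```python
import math

# Hedged sketch of segment importance from hindsight/policy likelihood
# ratios: a ratio above 1 means the action looks better once the
# trajectory outcome is known.
def segment_importance(logp_hindsight, logp_policy):
    ratios = [math.exp(h - p) for h, p in zip(logp_hindsight, logp_policy)]
    return sum(ratios) / len(ratios)  # aggregate over the segment's actions

def modulated_reward(segment_reward, importance):
    # The process reward for the segment is scaled by its importance.
    return segment_reward * importance

# The hindsight model assigns the first action higher likelihood than the
# policy did, so the segment's importance rises above 1.
w = segment_importance([-0.1, -0.5], [-0.3, -0.5])
```

Segments whose actions the hindsight model retroactively prefers thus receive amplified process rewards, concentrating credit where it mattered for the outcome.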

[544] STEP: Scientific Time-Series Encoder Pretraining via Cross-Domain Distillation

Chen Zhang, Liwei Liu, Jun Tao, Xiaoyu Yang, Xuenan Xu, Kai Chen, Bowen Zhou, Wen Wu, Chao Zhang

Main category: cs.LG

TL;DR: STEP is a framework for scientific time series representation learning that transfers knowledge from multiple foundation models (audio, general time series, brain signals) via cross-domain distillation to create a unified encoder for sparse, heterogeneous scientific time series.

DetailsMotivation: Scientific time series are sparse, heterogeneous, and limited in scale, making unified representation learning challenging. Foundation models from relevant domains (audio, general time series, brain signals) contain rich knowledge but their applicability to scientific signals remains underexplored.

Method: STEP uses cross-domain distillation to integrate knowledge from multiple foundation models into a unified encoder. It introduces adaptive patching for extreme-length sequences and statistics compensation for diverse numerical scales. The framework combines complementary representations across different domains.

Result: Experiments on seven scientific time series tasks demonstrate that STEP provides both an effective structure and an effective pretraining paradigm for scientific time series representation learning.

Conclusion: STEP takes a step toward scientific time series representation learning by effectively leveraging transferable knowledge from multiple foundation models through cross-domain distillation, creating a unified encoder for scientific signals.

Abstract: Scientific time series are central to scientific AI but are typically sparse, highly heterogeneous, and limited in scale, making unified representation learning particularly challenging. Meanwhile, foundation models pretrained on relevant time series domains such as audio, general time series, and brain signals contain rich knowledge, but their applicability to scientific signals remains underexplored. In this paper, we investigate the transferability and complementarity of foundation models from relevant time series domains, and study how to effectively leverage them to build a unified encoder for scientific time series. We first systematically evaluate relevant foundation models, showing the effectiveness of knowledge transfer to scientific tasks and their complementary strengths. Based on this observation, we propose STEP, a Scientific Time Series Encoder Pretraining framework via cross-domain distillation. STEP introduces adaptive patching to handle extreme-length sequences and a statistics compensation scheme to accommodate diverse numerical scales. It further leverages cross-domain distillation to integrate knowledge from multiple foundation models into a unified encoder. By combining complementary representations across different domains, STEP learns general-purpose and transferable features tailored for scientific signals. Experiments on seven scientific time series tasks demonstrate that STEP provides both an effective structure and an effective pretraining paradigm, taking a STEP toward scientific time series representation learning.
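One plausible reading of "adaptive patching" is choosing the patch length from the sequence length so that extreme-length series always map to a bounded number of patches. The exact scheme in the paper is not specified here; this is an illustrative sketch under that assumption.

```python
# Hedged sketch of adaptive patching: pick the patch length so that any
# series, however long, produces at most max_patches patches.
def adaptive_patch(series, max_patches=4):
    plen = max(1, -(-len(series) // max_patches))  # ceil(len / max_patches)
    return [series[i:i + plen] for i in range(0, len(series), plen)]

# A length-10 series with max_patches=4 gets patch length ceil(10/4) = 3.
patches = adaptive_patch(list(range(10)), max_patches=4)
```

Keeping the token count bounded regardless of raw sequence length is what makes a single encoder practical across domains with wildly different sampling rates.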

[545] OCP: Orthogonal Constrained Projection for Sparse Scaling in Industrial Commodity Recommendation

Chen Sun, Beilin Xu, Boheng Tan, Jiacheng Wang, Yuefeng Sun, Rite Bo, Ying He, Yaqiang Zang, Pinghua Gong

Main category: cs.LG

TL;DR: OCP method uses orthogonal constraints to optimize item embeddings in recommendation systems, preventing representation collapse and improving scalability.

DetailsMotivation: Traditional Item-Id vocabularies in recommendation systems suffer from low-frequency information interference during sparse scaling, leading to representation collapse and limited expressive power for massive item sets.

Method: Proposes Orthogonal Constrained Projection (OCP) method that enforces orthogonality to constrain backpropagation manifold, aligning singular value spectrum of learned embeddings with orthogonal basis to ensure high singular entropy.

Result: OCP accelerates loss convergence, enhances model scalability, enables consistent performance gains when scaling dense layers, and in industrial deployment at JD.com achieved 12.97% increase in UCXR and 8.9% uplift in GMV.

Conclusion: OCP effectively addresses representation collapse in recommendation systems, demonstrating robust utility for scaling both sparse vocabularies and dense architectures in industrial applications.

Abstract: In industrial commodity recommendation systems, the representation quality of Item-Id vocabularies directly impacts the scalability and generalization ability of recommendation models. A key challenge is that traditional Item-Id vocabularies, when subjected to sparse scaling, suffer from low-frequency information interference, which restricts their expressive power for massive item sets and leads to representation collapse. To address this issue, we propose an Orthogonal Constrained Projection method to optimize embedding representation. By enforcing orthogonality, the projection constrains the backpropagation manifold, aligning the singular value spectrum of the learned embeddings with the orthogonal basis. This alignment ensures high singular entropy, thereby preserving isotropic generalized features while suppressing spurious correlations and overfitting to rare items. Empirical results demonstrate that OCP accelerates loss convergence and enhances the model’s scalability; notably, it enables consistent performance gains when scaling up dense layers. Large-scale industrial deployment on JD.com further confirms its efficacy, yielding a 12.97% increase in UCXR and an 8.9% uplift in GMV, highlighting its robust utility for scaling up both sparse vocabularies and dense architectures.
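The link between orthogonality and high singular entropy can be seen directly: a matrix with orthonormal columns has all singular values equal, the maximum-entropy spectrum. A minimal sketch using Gram-Schmidt (the paper enforces orthogonality as a training-time constraint, not via explicit re-orthonormalization; this only illustrates the spectral point):

```python
# Hedged sketch: Gram-Schmidt orthonormalization of a projection's
# columns. With W^T W = I, every singular value of W equals 1, so the
# singular value distribution is uniform (maximal singular entropy).
def gram_schmidt(cols):
    out = []
    for v in cols:
        w = list(v)
        for u in out:
            c = sum(a * b for a, b in zip(w, u))
            w = [a - c * b for a, b in zip(w, u)]  # remove component along u
        n = sum(a * a for a in w) ** 0.5
        out.append([a / n for a in w])             # normalize to unit length
    return out

cols = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]])
dot = sum(a * b for a, b in zip(cols[0], cols[1]))  # should be ~0
norm0 = sum(a * a for a in cols[0])                 # should be ~1
```

An isotropic (flat) spectrum is exactly what the collapsed, low-entropy embeddings of rare items lack, which is why the constraint helps sparse scaling.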

[546] CausalRM: Causal-Theoretic Reward Modeling for RLHF from Observational User Feedbacks

Hao Wang, Licheng Pan, Zhichao Chen, Chunyuan Zheng, Zhixuan Chu, Xiaoxi Li, Yuan Lu, Xinggao Liu, Haoxuan Li, Zhouchen Lin

Main category: cs.LG

TL;DR: CausalRM: A causal-theoretic framework for learning reward models from noisy and biased observational user feedback (clicks, copies, upvotes) instead of expensive experimental human feedback.

DetailsMotivation: Current RLHF reward modeling relies on costly experimental human feedback data. Observational user feedback (clicks, copies, upvotes) offers a scalable alternative but suffers from noise (annotation errors) and bias (user preference distribution shift).

Method: Proposes CausalRM with two components: (1) noise-aware surrogate loss that models annotation error generation process, provably equivalent to primal loss under noise-free conditions; (2) propensity score reweighting using probability of users providing feedback to eliminate user preference bias.

Result: Extensive experiments show CausalRM learns accurate reward signals from observational feedback, achieving 49.2% gain on WildGuardMix and 32.7% improvement on HarmBench in downstream RLHF tasks across diverse LLM backbones.

Conclusion: CausalRM provides an effective framework for scalable reward modeling from observational feedback, addressing noise and bias challenges, with significant performance improvements in RLHF alignment tasks.

Abstract: Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models, current reward modeling heavily relies on experimental feedback data collected from human annotators under controlled and costly conditions. In this work, we introduce observational reward modeling – learning reward models with observational user feedback (e.g., clicks, copies, and upvotes) – as a scalable and cost-effective alternative. We identify two fundamental challenges in this setting: (1) observational feedback is noisy due to annotation errors, causing it to deviate from the true user preference; (2) observational feedback is biased by user preference, where users preferentially provide feedback on responses they feel strongly about, which creates a distribution shift between training and inference data. To address these challenges, we propose CausalRM, a causal-theoretic reward modeling framework that aims to learn unbiased reward models from observational feedback. To tackle challenge (1), CausalRM introduces a noise-aware surrogate loss term that is provably equivalent to the primal loss under noise-free conditions by explicitly modeling the annotation error generation process. To tackle challenge (2), CausalRM uses propensity scores – the probability of a user providing feedback for a given response – to reweight training samples, yielding a loss function that eliminates user preference bias. Extensive experiments across diverse LLM backbones and benchmark datasets validate that CausalRM effectively learns accurate reward signals from noisy and biased observational feedback and delivers substantial performance improvements on downstream RLHF tasks – including a 49.2% gain on WildGuardMix and a 32.7% improvement on HarmBench. Code is available on our project website.
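The propensity-score reweighting in challenge (2) is standard inverse-propensity scoring: each observed sample's loss is weighted by 1/e_i, where e_i = P(user gives feedback | response). A minimal sketch (toy numbers; CausalRM's full surrogate loss also handles the noise term, which is omitted here):

```python
# Hedged sketch of inverse-propensity-weighted loss: samples that users
# rarely bother to rate are up-weighted so the training objective matches
# the unconditional (inference-time) distribution in expectation.
def ips_weighted_loss(losses, propensities):
    return sum(l / e for l, e in zip(losses, propensities)) / len(losses)

# A sample only 20% of users would rate counts 5x as much as one every
# user rates.
v = ips_weighted_loss([1.0, 1.0], [1.0, 0.2])
```

The usual caveat applies: small propensities blow up the variance, so practical systems typically clip or regularize the weights.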

[547] Off-Policy Learning with Limited Supply

Koichi Tanaka, Ren Kishimoto, Bushun Kawagishi, Yusuke Narita, Yasuo Yamamoto, Nobuyuki Shimizu, Yuta Saito

Main category: cs.LG

TL;DR: OPLS: Off-Policy Learning with Limited Supply for contextual bandits, addressing item constraints like budget/inventory limits that make greedy approaches suboptimal.

DetailsMotivation: Traditional off-policy learning assumes unlimited item availability, but real applications like coupon allocation and e-commerce have supply constraints. Greedy selection of highest-reward items leads to early depletion and suboptimal allocation for future users.

Method: Proposes OPLS (Off-Policy learning with Limited Supply) that selects items based on relative expected rewards compared to other users rather than absolute highest rewards, enabling more efficient allocation of constrained items.

Result: Empirical results on synthetic and real-world datasets show OPLS outperforms existing OPL methods in contextual bandit problems with limited supply constraints.

Conclusion: Limited supply settings require different approaches than unconstrained OPL, and OPLS provides an effective solution for contextual bandit problems with item constraints.

Abstract: We study off-policy learning (OPL) in contextual bandits, which plays a key role in a wide range of real-world applications such as recommendation systems and online advertising. Typical OPL in contextual bandits assumes an unconstrained environment where a policy can select the same item infinitely. However, in many practical applications, including coupon allocation and e-commerce, limited supply constrains items through budget limits on distributed coupons or inventory restrictions on products. In these settings, greedily selecting the item with the highest expected reward for the current user may lead to early depletion of that item, making it unavailable for future users who could potentially generate higher expected rewards. As a result, OPL methods that are optimal in unconstrained settings may become suboptimal in limited supply settings. To address the issue, we provide a theoretical analysis showing that conventional greedy OPL approaches may fail to maximize the policy performance, and demonstrate that policies with superior performance must exist in limited supply settings. Based on this insight, we introduce a novel method called Off-Policy learning with Limited Supply (OPLS). Rather than simply selecting the item with the highest expected reward, OPLS focuses on items with relatively higher expected rewards compared to the other users, enabling more efficient allocation of items with limited supply. Our empirical results on both synthetic and real-world datasets show that OPLS outperforms existing OPL methods in contextual bandit problems with limited supply.
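The core allocation effect, why greedy per-user selection is suboptimal under limited supply, can be shown with a two-user, two-item toy (OPLS itself learns a policy from logged data; this brute-force oracle only illustrates the gap the paper's theory describes):

```python
from itertools import permutations

# Hedged toy: user u1 arrives first. Item A is u1's best (0.9), but u1's
# second choice B is nearly as good (0.8), while u2 badly needs A
# (0.7 vs 0.1). One unit of each item exists.
rewards = {("u1", "A"): 0.9, ("u1", "B"): 0.8,
           ("u2", "A"): 0.7, ("u2", "B"): 0.1}
users, items = ["u1", "u2"], ["A", "B"]

def greedy_total():
    supply, total = set(items), 0.0
    for u in users:  # each user greedily takes their best remaining item
        best = max(supply, key=lambda i: rewards[(u, i)])
        total += rewards[(u, best)]
        supply.remove(best)
    return total

def oracle_total():
    # Best feasible one-to-one assignment (brute force over permutations).
    return max(sum(rewards[(u, i)] for u, i in zip(users, perm))
               for perm in permutations(items))

g, o = greedy_total(), oracle_total()  # greedy: 0.9 + 0.1; oracle: 0.8 + 0.7
```

Greedy gives A to u1 and strands u2 with B; the oracle (and, in spirit, OPLS) gives A to the user for whom it has the largest *relative* advantage over the alternatives.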

[548] Are complicated loss functions necessary for teaching LLMs to reason?

Gabriele Carrino, Andrea Sassella, Nicolo Brunello, Federico Toschi, Mark James Carman

Main category: cs.LG

TL;DR: RGRA simplifies GRPO by removing PPO clipping while keeping group relative advantage estimation, showing comparable or better math reasoning performance with REINFORCE-based approach.

DetailsMotivation: GRPO's complexity raises questions about which components are necessary for improving reasoning in LLMs. The authors aim to identify essential elements and propose a simpler alternative.

Method: Systematic analysis of GRPO components leads to RGRA (REINFORCE with Group Relative Advantage), which retains group relative advantage estimation but removes PPO-style clipping and policy ratio terms.

Result: RGRA achieves comparable or stronger performance than GRPO on standard mathematical benchmarks, demonstrating that simpler REINFORCE-based approaches can effectively enhance reasoning.

Conclusion: Negative feedback is essential for learning, PPO constraints aren’t necessary for math reasoning, and simpler REINFORCE-based methods offer transparent and efficient alternatives to complex RL approaches.

Abstract: Recent advances in large language models (LLMs) highlight the importance of post-training techniques for improving reasoning and mathematical ability. Group Relative Policy Optimization (GRPO) has shown promise in this domain by combining group relative advantage estimation, PPO-style clipping, and KL regularization. However, its complexity raises the question of whether all components are necessary for fostering reasoning behaviors. We conduct a systematic analysis of GRPO and identify two key findings: (1) incorporating negative feedback is essential: training solely on actions above a baseline limits learning; and (2) PPO-style constraints, such as policy ratio clipping, are not required to improve mathematical reasoning or performance. Building on these insights, we propose REINFORCE with Group Relative Advantage (RGRA), a simplified variant that retains group relative advantage estimation but removes PPO-style clipping and policy ratio terms. Experiments across standard mathematical benchmarks indicate that RGRA has the potential to achieve stronger performance than GRPO. Our results suggest that simpler REINFORCE-based approaches can effectively enhance reasoning in LLMs, offering a more transparent and efficient alternative to GRPO.
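The RGRA objective can be sketched in a few lines: group-normalized advantages feed a plain REINFORCE loss, with no policy ratio and no clipping. Toy scalar log-probs below; the per-token handling and exact normalization constants are assumptions, not taken from the paper.

```python
import math

# Hedged sketch: group relative advantages (standardize rewards within a
# group of completions for the same prompt) + plain REINFORCE.
def group_relative_advantages(rewards, eps=1e-8):
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]

def rgra_loss(logprobs, rewards):
    advs = group_relative_advantages(rewards)
    # -E[A_i * log pi(o_i)]: negative advantages (below-average
    # completions) actively push probability mass away, finding (1).
    return -sum(a * lp for a, lp in zip(advs, logprobs)) / len(logprobs)

# Two correct and two incorrect completions in a group of four.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Compared with GRPO, the terms removed are the importance ratio pi/pi_old and its clip, which is finding (2): for math reasoning those constraints did not pay for their complexity.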

[549] From ex(p) to poly: Gaussian Splatting with Polynomial Kernels

Joerg H. Mueller, Martin Winter, Markus Steinberger

Main category: cs.LG

TL;DR: Proposes a polynomial+ReLU kernel for Gaussian Splatting that maintains dataset compatibility while improving computational efficiency through more aggressive Gaussian culling.

Motivation: Recent kernel modifications in Gaussian Splatting (3DGS) improve performance but break compatibility with existing datasets optimized for the original exponential kernel, limiting adoption.

Method: Replace the original exponential kernel with a polynomial approximation combined with a ReLU function, enabling more aggressive culling of Gaussians while maintaining dataset compatibility.

Result: Achieves 4-15% performance improvement with negligible impact on image quality, and provides mathematical analysis showing benefits for NPU hardware implementations.

Conclusion: The proposed kernel offers a practical solution that balances performance gains with backward compatibility, facilitating wider adoption of 3DGS improvements.

Abstract: Recent advancements in Gaussian Splatting (3DGS) have introduced various modifications to the original kernel, resulting in significant performance improvements. However, many of these kernel changes are incompatible with existing datasets optimized for the original Gaussian kernel, presenting a challenge for widespread adoption. In this work, we address this challenge by proposing an alternative kernel that maintains compatibility with existing datasets while improving computational efficiency. Specifically, we replace the original exponential kernel with a polynomial approximation combined with a ReLU function. This modification allows for more aggressive culling of Gaussians, leading to enhanced performance across different 3DGS implementations. Our results show a notable performance improvement of 4 to 15% with negligible impact on image quality. We also provide a detailed mathematical analysis of the new kernel and discuss its potential benefits for 3DGS implementations on NPU hardware.
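
To illustrate the idea (with hypothetical coefficients, not the paper's fitted polynomial): a polynomial combined with a ReLU-style clamp has finite support, so any Gaussian whose squared distance exceeds the cutoff contributes exactly zero and can be culled outright, whereas the exponential kernel never reaches zero.

```python
import math

def gaussian_kernel(d2):
    # Original 3DGS falloff in squared (Mahalanobis) distance: never zero,
    # so every Gaussian formally overlaps every pixel.
    return math.exp(-0.5 * d2)

def poly_relu_kernel(d2, cutoff=9.0):
    # Hypothetical polynomial + ReLU stand-in: close to the Gaussian near
    # d2 = 0 but identically zero beyond `cutoff`, enabling aggressive
    # culling. The cubic and the cutoff value are illustrative choices.
    t = max(0.0, 1.0 - d2 / cutoff)  # the ReLU gives finite support
    return t ** 3
```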

[550] Enhancing the Parameterization of Reservoir Properties for Data Assimilation Using Deep VAE-GAN

Marcio Augusto Sampaio, Paulo Henrique Ranazzi, Martin Julian Blunt

Main category: cs.LG

TL;DR: VAE-GAN integrated with ESMDA for reservoir history matching, combining GAN’s geological realism with VAE’s data assimilation performance.

Motivation: Current ensemble methods for reservoir simulation have limitations: finite ensemble size and Gaussian assumptions that don't match non-Gaussian reservoir properties. GANs produce geologically plausible models but assimilate data poorly, while VAEs assimilate data better but generate less realistic models.

Method: Integrated Variational Autoencoder Generative Adversarial Network (VAE-GAN) with Ensemble Smoother with Multiple Data Assimilation (ESMDA). Applied to two case studies: one categorical and one with continuous permeability values.

Result: VAE-GAN model achieved both high-quality reservoir descriptions (like GANs) and good history matching on production curves (like VAEs) simultaneously.

Conclusion: The VAE-GAN approach successfully combines strengths of both generative models for improved reservoir characterization and history matching, addressing limitations of current ensemble methods.

Abstract: Currently, the methods called Iterative Ensemble Smoothers, especially the method called Ensemble Smoother with Multiple Data Assimilation (ESMDA) can be considered state-of-the-art for history matching in petroleum reservoir simulation. However, this approach has two important limitations: the use of an ensemble with finite size to represent the distributions and the Gaussian assumption in parameter and data uncertainties. This latter is particularly important because many reservoir properties have non-Gaussian distributions. Parameterization involves mapping non-Gaussian parameters to a Gaussian field before the update and then mapping them back to the original domain to forward the ensemble through the reservoir simulator. A promising approach to perform parameterization is through deep learning models. Recent studies have shown that Generative Adversarial Networks (GAN) performed poorly concerning data assimilation, but generated more geologically plausible realizations of the reservoir, while the Variational Autoencoder (VAE) performed better than the GAN in data assimilation, but generated less geologically realistic models. This work is innovative in combining the strengths of both to implement a deep learning model called Variational Autoencoder Generative Adversarial Network (VAE-GAN) integrated with ESMDA. The methodology was applied in two case studies, one case being categorical and the other with continuous values of permeability. Our findings demonstrate that by applying the VAE-GAN model we can obtain high quality reservoir descriptions (just like GANs) and a good history matching on the production curves (just like VAEs) simultaneously.
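
For readers unfamiliar with ESMDA: each assimilation step applies a Kalman-like update with inflated observation noise. Here it is sketched in pure Python for a single scalar parameter and datum; the full method uses ensemble covariance matrices in the same pattern, and in this paper the update acts on the Gaussian latent variables produced by the VAE-GAN parameterization.

```python
import math, random

def esmda_update(m_ens, d_ens, d_obs, var_d, alpha, rng=random):
    """One ESMDA step for a scalar parameter and a scalar datum.
    m_ens: ensemble of parameters; d_ens: simulated data per member;
    d_obs: the observation; var_d: observation-error variance;
    alpha: this step's inflation factor (the 1/alpha_i sum to 1)."""
    n = len(m_ens)
    m_bar = sum(m_ens) / n
    d_bar = sum(d_ens) / n
    c_md = sum((m - m_bar) * (d - d_bar) for m, d in zip(m_ens, d_ens)) / (n - 1)
    c_dd = sum((d - d_bar) ** 2 for d in d_ens) / (n - 1)
    gain = c_md / (c_dd + alpha * var_d)  # Kalman-like gain
    sd = math.sqrt(alpha * var_d)         # inflated observation noise
    return [m + gain * (d_obs + rng.gauss(0.0, sd) - d)
            for m, d in zip(m_ens, d_ens)]
```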

[551] Automatic Configuration of LLM Post-Training Pipelines

Channe Chwa, Xinle Wu, Yao Lu

Main category: cs.LG

TL;DR: AutoPipe: A budget-aware two-stage framework for efficient configuration selection in LLM post-training pipelines using offline learning-to-rank guidance and online Bayesian optimization with early stopping.

Motivation: LLM post-training pipelines combining supervised fine-tuning and reinforcement learning are difficult to configure due to high-dimensional heterogeneous configuration space, strong coupling between stages, and expensive end-to-end evaluations under realistic compute budgets.

Method: Two-stage framework: 1) Offline - learns dataset-conditioned learning-to-rank surrogate from historical runs to capture within-dataset preferences and provide transferable guidance; 2) Online - uses offline guidance to steer Bayesian optimization and models dataset-specific deviations with Gaussian-process residual surrogate, with early stopping and learned predictor for low-cost performance estimation.

Result: Experiments on biomedical reasoning tasks show AutoPipe consistently outperforms offline-only baselines and achieves comparable performance with strongest online HPO baselines while using less than 10% of their computational cost.

Conclusion: AutoPipe provides an efficient budget-aware framework for LLM post-training configuration selection that significantly reduces computational cost while maintaining competitive performance.

Abstract: LLM post-training pipelines that combine supervised fine-tuning and reinforcement learning are difficult to configure under realistic compute budgets: the configuration space is high-dimensional and heterogeneous, stages are strongly coupled, and each end-to-end evaluation is expensive. We propose AutoPipe, a budget-aware two-stage framework for configuration selection in LLM post-training. Offline, AutoPipe learns a dataset-conditioned learning-to-rank surrogate from historical runs, capturing within-dataset preferences and providing transferable guidance toward promising regions of the configuration space. Online, for a new dataset, AutoPipe uses the offline guidance to steer Bayesian optimization and models dataset-specific deviations with a Gaussian-process residual surrogate. To reduce evaluation cost, each trial is early-stopped and scored by a learned predictor that maps early training signals to a low-cost proxy for final post-training performance. Experiments on biomedical reasoning tasks show that AutoPipe consistently outperforms offline-only baselines and achieves comparable performance with the strongest online HPO baselines while using less than 10% of their computational cost.
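
The offline-prior-plus-residual idea can be sketched as follows. `offline_prior` stands in for the learning-to-rank surrogate trained on historical runs, and the kernel-smoothed correction is a pure-Python stand-in for the paper's Gaussian-process residual surrogate; all names and the RBF length-scale are illustrative assumptions.

```python
import math

def offline_prior(x):
    # Hypothetical offline learning-to-rank score for configuration x
    # (stands in for the surrogate trained on historical runs).
    return 1.0 - (x - 0.5) ** 2

def residual_surrogate(x, trials, ls=0.2):
    # Offline prior plus a kernel-smoothed correction fit to the new
    # dataset's observed deviations; `trials` is a list of
    # (configuration, observed_score) pairs from early-stopped runs.
    if not trials:
        return offline_prior(x)
    w = [math.exp(-0.5 * ((x - xi) / ls) ** 2) for xi, _ in trials]
    r = [yi - offline_prior(xi) for xi, yi in trials]
    correction = sum(wi * ri for wi, ri in zip(w, r)) / sum(w)
    return offline_prior(x) + correction
```

With no online trials the surrogate falls back to the transferable offline guidance; as dataset-specific observations accumulate, the correction dominates near observed configurations.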

[552] Signals of Success and Struggle: Early Prediction and Physiological Signatures of Human Performance across Task Complexity

Yufei Cao, Penny Sweetser, Ziyu Chen, Xuanying Zhu

Main category: cs.LG

TL;DR: Early ocular and cardiac signals can predict user performance in interactive systems with 86% accuracy, revealing physiological differences between high and low performers.

Motivation: To enable timely identification of users struggling with task demands by prospectively predicting performance using early physiological signals, and to understand the physiological mechanisms underlying performance differences.

Method: Conducted a within-subject experiment in a game environment with naturally unfolding complexity, using early ocular (eye tracking) and cardiac (heart rate) signals to predict later performance. Developed fusion models combining ocular and cardiac data, and examined physiological and self-reported group differences between high and low performers.

Result: Ocular-cardiac fusion model achieved 86% balanced accuracy for performance prediction. Ocular-only model showed comparable predictive power. High performers exhibited targeted gaze, adjusted visual sampling, sustained more stable cardiac activation as demands intensified, and reported more positive affective experience.

Conclusion: Early physiological signals can effectively predict user performance cross-session, providing interpretable insights into performance variation and facilitating future proactive interventions in interactive systems.

Abstract: User performance is crucial in interactive systems, capturing how effectively users engage with task execution. Prospectively predicting performance enables the timely identification of users struggling with task demands. While ocular and cardiac signals are widely used to characterise performance-relevant visual behaviour and physiological activation, their potential for early prediction and for revealing the physiological mechanisms underlying performance differences remains underexplored. We conducted a within-subject experiment in a game environment with naturally unfolding complexity, using early ocular and cardiac signals to predict later performance and to examine physiological and self-reported group differences. Results show that the ocular-cardiac fusion model achieves a balanced accuracy of 0.86, and the ocular-only model shows comparable predictive power. High performers exhibited targeted gaze and adjusted visual sampling, and sustained more stable cardiac activation as demands intensified, with a more positive affective experience. These findings demonstrate the feasibility of cross-session prediction from early physiology, providing interpretable insights into performance variation and facilitating future proactive intervention.
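
The reported 0.86 is balanced accuracy, i.e. the mean of per-class recall, which sits at 0.5 for chance regardless of how imbalanced the high- and low-performer groups are. For reference, a minimal implementation:

```python
def balanced_accuracy(y_true, y_pred):
    # Mean of per-class recall: chance level is 0.5 for two classes
    # regardless of class imbalance.
    recalls = []
    for c in set(y_true):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)
```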

[553] Seasoning Generative Models for a Generalization Aftertaste

Hisham Husain, Valentin De Bortoli, Richard Nock

Main category: cs.LG

TL;DR: Theoretical framework for refining generative models using discriminator guidance, with provable generalization improvements based on discriminator complexity.

Motivation: Discriminators have proven effective for training/improving generative models (GANs, diffusion models), but theoretical understanding of how discriminator guidance improves generalization is limited. The paper aims to provide theoretical validation and generalization guarantees for discriminator-based refinement methods.

Method: Extends strong-duality results related to f-divergences to develop a discriminator-guided recipe for refining any generative model. Analyzes generalization improvements using Rademacher complexity of the discriminator set, connecting to score-based diffusion approaches.

Result: Shows refined generative models provably improve generalization compared to non-refined counterparts, with improvement gap based on discriminator complexity. Provides theoretical validation for existing score-based diffusion methods and suggests new algorithmic avenues.

Conclusion: The work offers theoretical foundation for discriminator guidance in generative models, explains generalization improvements, validates existing empirical successes, and opens directions for new algorithms with provable guarantees.

Abstract: The use of discriminators to train or fine-tune generative models has proven to be a rather successful framework. A notable example is Generative Adversarial Networks (GANs) that minimize a loss incurred by training discriminators along with other paradigms that boost generative models via discriminators that satisfy weak learner constraints. More recently, even diffusion models have shown advantages with some kind of discriminator guidance. In this work, we extend a strong-duality result related to $f$-divergences which gives rise to a discriminator-guided recipe that allows us to \textit{refine} any generative model. We then show that the refined generative models provably improve generalization, compared to its non-refined counterpart. In particular, our analysis reveals that the gap in generalization is improved based on the Rademacher complexity of the discriminator set used for refinement. Our recipe subsumes a recently introduced score-based diffusion approach (Kim et al., 2022) that has shown great empirical success, however allows us to shed light on the generalization guarantees of this method by virtue of our analysis. Thus, our work provides a theoretical validation for existing work, suggests avenues for new algorithms, and contributes to our understanding of generalization in generative models at large.
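
The strong duality being extended is the standard variational (Fenchel) representation of an f-divergence, with T ranging over the discriminator set and f* the convex conjugate of f:

```latex
D_f(P \,\|\, Q) \;=\; \sup_{T \in \mathcal{T}} \;
\mathbb{E}_{x \sim P}\big[T(x)\big] \;-\; \mathbb{E}_{x \sim Q}\big[f^*(T(x))\big].
```

Heuristically, the learned T then reweights the generator's density, e.g. an exponential tilt $p_{\text{refined}}(x) \propto p_\theta(x)\, e^{T(x)}$ in the KL case; the precise refinement recipe depends on the chosen $f$, and this sketch is not the paper's exact construction.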

[554] A Model Ensemble-Based Post-Processing Framework for Fairness-Aware Prediction

Zhouting Zhao, Tin Lok James Ng

Main category: cs.LG

TL;DR: A post-processing framework using model ensembling to improve fairness in ML predictions while maintaining accuracy across various tasks.

Motivation: Addressing the fundamental challenge of balancing predictive performance and fairness in machine learning, seeking a solution that works across different models and fairness definitions.

Method: Proposes a post-processing framework that leverages model ensembling to facilitate fairness-aware prediction, designed to be model-agnostic and applicable to various learning tasks, architectures, and fairness definitions.

Result: Extensive experiments across classification, regression, and survival analysis show the framework effectively enhances fairness while maintaining or minimally affecting predictive accuracy.

Conclusion: The proposed model-agnostic ensembling approach provides an effective solution for improving fairness in ML predictions across diverse applications without compromising performance.

Abstract: Striking an optimal balance between predictive performance and fairness continues to be a fundamental challenge in machine learning. In this work, we propose a post-processing framework that facilitates fairness-aware prediction by leveraging model ensembling. Designed to operate independently of any specific model internals, our approach is widely applicable across various learning tasks, model architectures, and fairness definitions. Through extensive experiments spanning classification, regression, and survival analysis, we demonstrate that the framework effectively enhances fairness while maintaining, or only minimally affecting, predictive accuracy.
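
As a toy illustration of fairness-aware post-processing by ensembling (not the paper's algorithm, whose ensembling scheme and fairness criteria may differ): grid-search convex weights over two members' scores on a held-out set, trading off classification error against a demographic-parity gap. All names, the 0.5 threshold, and the trade-off parameter are assumptions.

```python
def parity_gap(preds, groups):
    # Demographic-parity gap: spread of positive-prediction rates across groups.
    rates = []
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        rates.append(sum(preds[i] for i in idx) / len(idx))
    return max(rates) - min(rates)

def pick_weight(scores_a, scores_b, y, groups, lam=1.0, steps=21):
    # Grid-search a convex combination of two members' scores, trading
    # off 0-1 error against the parity gap on a held-out set.
    best_obj, best_w = float("inf"), 0.0
    for k in range(steps):
        w = k / (steps - 1)
        preds = [int(w * a + (1 - w) * b >= 0.5)
                 for a, b in zip(scores_a, scores_b)]
        err = sum(p != yi for p, yi in zip(preds, y)) / len(y)
        obj = err + lam * parity_gap(preds, groups)
        if obj < best_obj:
            best_obj, best_w = obj, w
    return best_w
```

Because the search only touches model outputs, it is model-agnostic in the same spirit as the paper's framework.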

[555] Online Learning and Equilibrium Computation with Ranking Feedback

Mingyang Liu, Yongshan Chen, Zhiyuan Fan, Gabriele Farina, Asuman Ozdaglar, Kaiqing Zhang

Main category: cs.LG

TL;DR: Online learning with ranking feedback instead of numeric utility, showing sublinear regret is impossible in general but achievable with sublinear variation assumptions, with applications to game theory equilibrium and LLM routing.

Motivation: Traditional online learning relies on numeric utility feedback, which may be unavailable in human-in-the-loop applications or restricted by privacy concerns. The paper proposes using only ranking feedback over actions instead of numeric utilities.

Method: Studies online learning with two ranking mechanisms: instantaneous utility rankings and time-average utility rankings, under both full-information and bandit feedback settings. Develops new algorithms that achieve sublinear regret under the assumption that utility sequences have sublinear total variation.

Result: Shows sublinear regret is impossible with instantaneous-utility ranking feedback in general, and also impossible with time-average utility ranking feedback under deterministic ranking models. However, with sublinear variation assumptions, new algorithms achieve sublinear regret, leading to approximate coarse correlated equilibrium in game theory and effective LLM routing.

Conclusion: Online learning with ranking feedback is feasible under certain conditions, providing a practical alternative when numeric utilities are unavailable, with applications to equilibrium computation and real-world tasks like LLM routing.

Abstract: Online learning in arbitrary, and possibly adversarial, environments has been extensively studied in sequential decision-making, and it is closely connected to equilibrium computation in game theory. Most existing online learning algorithms rely on \emph{numeric} utility feedback from the environment, which may be unavailable in human-in-the-loop applications and/or may be restricted by privacy concerns. In this paper, we study an online learning model in which the learner only observes a \emph{ranking} over a set of proposed actions at each timestep. We consider two ranking mechanisms: rankings induced by the \emph{instantaneous} utility at the current timestep, and rankings induced by the \emph{time-average} utility up to the current timestep, under both \emph{full-information} and \emph{bandit} feedback settings. Using the standard external-regret metric, we show that sublinear regret is impossible with instantaneous-utility ranking feedback in general. Moreover, when the ranking model is relatively deterministic, \emph{i.e.}, under the Plackett-Luce model with a temperature that is sufficiently small, sublinear regret is also impossible with time-average utility ranking feedback. We then develop new algorithms that achieve sublinear regret under the additional assumption that the utility sequence has sublinear total variation. Notably, for full-information time-average utility ranking feedback, this additional assumption can be removed. As a consequence, when all players in a normal-form game follow our algorithms, repeated play yields an approximate coarse correlated equilibrium. We also demonstrate the effectiveness of our algorithms in an online large-language-model routing task.
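
The Plackett-Luce ranking model in the negative result can be sketched directly: items are drawn without replacement with probability proportional to exp(u_i / τ). As the temperature τ shrinks, the sampled ranking becomes nearly deterministic, which is the regime where the paper shows sublinear regret is impossible under time-average utility ranking feedback.

```python
import math, random

def plackett_luce_rank(utilities, temperature, rng=random):
    # Sample a ranking: repeatedly draw the next item without replacement
    # with probability proportional to exp(u_i / temperature).
    items = list(range(len(utilities)))
    ranking = []
    while items:
        weights = [math.exp(utilities[i] / temperature) for i in items]
        pick = rng.choices(items, weights=weights)[0]
        ranking.append(pick)
        items.remove(pick)
    return ranking
```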

[556] DriftGuard: Mitigating Asynchronous Data Drift in Federated Learning

Yizhou Han, Di Wu, Blesson Varghese

Main category: cs.LG

TL;DR: DriftGuard: A federated continual learning framework using Mixture-of-Experts architecture to efficiently adapt to asynchronous data drift in FL deployments by separating shared global parameters from local adaptive parameters.

Motivation: In real-world Federated Learning deployments, data distributions on devices evolve over time at different rates and directions (asynchronous data drift). Frequent retraining is computationally expensive on resource-constrained devices, while infrequent retraining degrades performance on drifting devices.

Method: Uses Mixture-of-Experts inspired architecture separating shared parameters (capture globally transferable knowledge) from local parameters (adapt to group-specific distributions). Enables two retraining strategies: global retraining (updates shared parameters for system-wide drift) and group retraining (selectively updates local parameters for device clusters identified via MoE gating patterns without sharing raw data).

Result: Matches or exceeds state-of-the-art accuracy while reducing total retraining cost by up to 83%. Achieves highest accuracy per unit retraining cost, improving over strongest baseline by up to 2.3x across multiple datasets and models.

Conclusion: DriftGuard efficiently addresses asynchronous data drift in FL through its MoE-based architecture and dual retraining strategies, significantly reducing computational costs while maintaining or improving performance.

Abstract: In real-world Federated Learning (FL) deployments, data distributions on devices that participate in training evolve over time. This leads to asynchronous data drift, where different devices shift at different times and toward different distributions. Mitigating such drift is challenging: frequent retraining incurs high computational cost on resource-constrained devices, while infrequent retraining degrades performance on drifting devices. We propose DriftGuard, a federated continual learning framework that efficiently adapts to asynchronous data drift. DriftGuard adopts a Mixture-of-Experts (MoE) inspired architecture that separates shared parameters, which capture globally transferable knowledge, from local parameters that adapt to group-specific distributions. This design enables two complementary retraining strategies: (i) global retraining, which updates the shared parameters when system-wide drift is identified, and (ii) group retraining, which selectively updates local parameters for clusters of devices identified via MoE gating patterns, without sharing raw data. Experiments across multiple datasets and models show that DriftGuard matches or exceeds state-of-the-art accuracy while reducing total retraining cost by up to 83%. As a result, it achieves the highest accuracy per unit retraining cost, improving over the strongest baseline by up to 2.3x. DriftGuard is available for download from https://github.com/blessonvar/DriftGuard.
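
The shared/local parameter split can be caricatured in a few lines (purely illustrative scalars; the real system gates over network activations and trains neural parameters):

```python
class DriftGuardSketch:
    """Illustrative scalars only: `shared` stands in for globally
    transferable parameters (updated by global retraining), `local` for
    per-expert, group-specific parameters (updated by group retraining)."""

    def __init__(self, n_experts):
        self.shared = 0.0
        self.local = [0.0] * n_experts

    def gate(self, device_stat):
        # Hypothetical gating: route a device to the expert whose local
        # parameter best matches the device's data statistic.
        return min(range(len(self.local)),
                   key=lambda e: abs(self.local[e] - device_stat))

    def group_retrain(self, expert, drifted_stat, lr=0.5):
        # Only the selected expert moves; the shared parameters and all
        # other experts are untouched, which is what keeps cost low when
        # only one group of devices drifts.
        self.local[expert] += lr * (drifted_stat - self.local[expert])
```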

[557] Authority-Level Priors: An Under-Specified Constraint in Hierarchical Predictive Processing

Marcela Palejova

Main category: cs.LG

TL;DR: The paper introduces Authority-Level Priors (ALPs) as meta-structural constraints that determine which identity-level hypotheses are admissible for autonomic and behavioral regulation, explaining why explicit belief changes don’t always alter stress responses.

Motivation: To address the asymmetry where explicit belief revision often fails to produce corresponding changes in stress reactivity or autonomic regulation, suggesting that hierarchical predictive processing leaves unspecified which identity-level hypotheses regulate autonomic and behavioral control under uncertainty.

Method: Introduces Authority-Level Priors (ALPs) as meta-structural constraints defining a regulatory-admissible subset of identity-level hypotheses. Provides computational formalization restricting policy optimization to policies generated by authorized hypotheses, compatible with variational active inference without additional inferential operators.

Result: ALPs explain why explicit belief updating modifies representational beliefs while autonomic threat responses remain stable. The model generates testable predictions concerning stress-reactivity dynamics, recovery time constants, compensatory control engagement, and behavioral persistence.

Conclusion: ALPs are advanced as an architectural hypothesis to be evaluated through computational modeling and longitudinal stress-induction paradigms, providing a boundary condition required for determinate identity-regulation mapping in predictive processing frameworks.

Abstract: Hierarchical predictive processing explains adaptive behaviour through precision-weighted inference. Explicit belief revision often fails to produce corresponding changes in stress reactivity or autonomic regulation. This asymmetry suggests the framework leaves under-specified a governance-level constraint concerning which identity-level hypotheses regulate autonomic and behavioural control under uncertainty. We introduce Authority-Level Priors (ALPs) as meta-structural constraints defining a regulatory-admissible subset (Hauth, a subset of H) of identity-level hypotheses. ALPs are not additional representational states nor hyperpriors over precision; they constrain which hypotheses are admissible for regulatory control. Precision determines influence conditional on admissibility; ALPs determine admissibility itself. This explains why explicit belief updating modifies representational beliefs while autonomic threat responses remain stable. A computational formalisation restricts policy optimisation to policies generated by authorised hypotheses, yielding testable predictions concerning stress-reactivity dynamics, recovery time constants, compensatory control engagement, and behavioural persistence. Neurobiologically, ALPs manifest through distributed prefrontal arbitration and control networks. The proposal is compatible with variational active inference and introduces no additional inferential operators, instead formalising a boundary condition required for determinate identity-regulation mapping. The model generates falsifiable predictions: governance shifts should produce measurable changes in stress-reactivity curves, recovery dynamics, compensatory cognitive effort, and behavioural change durability. ALPs are advanced as an architectural hypothesis to be evaluated through computational modelling and longitudinal stress-induction paradigms.

[558] Uniform a priori bounds and error analysis for the Adam stochastic gradient descent optimization method

Steffen Dereich, Thang Do, Arnulf Jentzen

Main category: cs.LG

TL;DR: This paper provides the first unconditional error analysis for the Adam optimizer by establishing uniform a priori bounds for strongly convex stochastic optimization problems.

Motivation: Despite Adam being the most popular optimizer for training deep neural networks, there has been no complete error analysis, especially for strongly convex stochastic optimization problems. Previous analyses relied on the conditional assumption that Adam remains uniformly bounded rather than diverging to infinity.

Method: The authors establish uniform a priori bounds for Adam when applied to strongly convex stochastic optimization problems, which allows them to provide an unconditional error analysis without relying on the assumption that Adam remains bounded.

Result: The paper provides the first unconditional error analysis for Adam for a large class of strongly convex stochastic optimization problems, overcoming the limitations of previous conditional convergence analyses.

Conclusion: This work makes a significant theoretical contribution by providing unconditional convergence guarantees for Adam, addressing a long-standing open problem in optimization theory for deep learning.

Abstract: The adaptive moment estimation (Adam) optimizer proposed by Kingma & Ba (2014) is presumably the most popular stochastic gradient descent (SGD) optimization method for the training of deep neural networks (DNNs) in artificial intelligence (AI) systems. Despite its groundbreaking success in the training of AI systems, it still remains an open research problem to provide a complete error analysis of Adam, not only for optimizing DNNs but even when applied to strongly convex stochastic optimization problems (SOPs). Previous error analysis results for strongly convex SOPs in the literature provide conditional convergence analyses that rely on the assumption that Adam does not diverge to infinity but remains uniformly bounded. It is the key contribution of this work to establish uniform a priori bounds for Adam and, thereby, to provide – for the first time – an unconditional error analysis for Adam for a large class of strongly convex SOPs.
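
For reference, the Adam recursion from Kingma & Ba (2014) whose iterates $\theta_t$ the paper shows remain uniformly bounded, with stochastic gradient $g_t$, moment estimates $m_t, v_t$, decay rates $\beta_1, \beta_2$, learning rate $\alpha$, and stabilizer $\varepsilon$:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \\
\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \alpha\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon}.
```

The difficulty the a priori bounds resolve is that nothing in this recursion by itself rules out $\theta_t$ drifting to infinity, which is why earlier analyses had to assume boundedness.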

[559] Neural Galerkin Normalizing Flow for Transition Probability Density Functions of Diffusion Models

Riccardo Saporiti, Fabio Nobile

Main category: cs.LG

TL;DR: Neural Galerkin Normalizing Flow framework for approximating diffusion process transition densities by solving Fokker-Planck equations with adaptive sampling and structure-preserving normalizing flows.

Motivation: To develop an efficient surrogate model for approximating transition probability density functions of diffusion processes that can be deployed in many-query problems like Bayesian inference, simulation, and diffusion bridge generation.

Method: Combines Neural Galerkin schemes with Normalizing Flows to solve Fokker-Planck equations parametrically with respect to initial mass location. Uses normalizing flows as structure-preserving transformations of reference process densities, derives ODEs for parameter evolution, and employs adaptive sampling for high-dimensional PDEs.

Result: The method captures key features of true solutions, enforces causal relationships between initial data and subsequent densities, and provides significantly more cost-effective online evaluation after offline training compared to solving PDEs from scratch.

Conclusion: The proposed Neural Galerkin Normalizing Flow framework serves as a promising surrogate model for stochastic differential equation applications, offering structure-preserving approximations with efficient online evaluation capabilities.

Abstract: We propose a new Neural Galerkin Normalizing Flow framework to approximate the transition probability density function of a diffusion process by solving the corresponding Fokker-Planck equation with an atomic initial distribution, parametrically with respect to the location of the initial mass. By using Normalizing Flows, we look for the solution as a transformation of the transition probability density function of a reference stochastic process, ensuring that our approximation is structure-preserving and automatically satisfies positivity and mass conservation constraints. By extending Neural Galerkin schemes to the context of Normalizing Flows, we derive a system of ODEs for the time evolution of the Normalizing Flow’s parameters. Adaptive sampling routines are used to evaluate the Fokker-Planck residual in meaningful locations, which is of vital importance to address high-dimensional PDEs. Numerical results show that this strategy captures key features of the true solution and enforces the causal relationship between the initial datum and the density function at subsequent times. After completing an offline training phase, online evaluation becomes significantly more cost-effective than solving the PDE from scratch. The proposed method serves as a promising surrogate model, which could be deployed in many-query problems associated with stochastic differential equations, like Bayesian inference, simulation, and diffusion bridge generation.
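
Concretely, for a diffusion $dX_t = b(X_t, t)\,dt + \sigma(X_t, t)\,dW_t$, the transition density $p(x, t \mid x_0)$ solves the Fokker-Planck (forward Kolmogorov) equation in standard form, with the atomic initial condition the paper parametrizes by the location $x_0$ of the initial mass:

```latex
\partial_t p = -\nabla \cdot \big( b\, p \big)
  + \tfrac{1}{2} \sum_{i,j} \partial_{x_i} \partial_{x_j}
    \big( (\sigma \sigma^{\top})_{ij}\, p \big),
\qquad p(x, 0 \mid x_0) = \delta(x - x_0).
```

The normalizing flow ansatz guarantees that the approximate $p$ stays nonnegative and integrates to one at all times, which an unconstrained neural approximation would not.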

[560] An Optimised Greedy-Weighted Ensemble Framework for Financial Loan Default Prediction

Ezekiel Nii Noye Nortey, Jones Asante-Koranteng, Marcellin Atemkeng, Theophilus Ansah-Narh, David Mensah, Rebecca Davis, Ravenhill Adjetey Laryea

Main category: cs.LG

TL;DR: An Optimised Greedy-Weighted Ensemble framework for loan default prediction that dynamically weights ML models based on performance, achieving AUC of 0.80 on Lending Club data.

Motivation: Traditional statistical models and static ensemble methods struggle with modern financial datasets characterized by nonlinear relationships, class imbalance, and evolving borrower behavior in loan default prediction.

Method: Proposes an ensemble framework that: 1) optimizes multiple ML classifiers using Particle Swarm Optimization, 2) combines predictions via regularized greedy weighting, and 3) uses neural-network-based meta-learner in stacked ensemble to capture higher-order relationships.

Result: The BlendNet ensemble achieved AUC of 0.80, macro-average F1-score of 0.73, and default recall of 0.81. Tree-based ensembles provided most reliable probability estimates, while stacked ensemble offered superior ranking capability. Key predictors identified: revolving utilization, annual income, and debt-to-income ratio.

Conclusion: Performance-driven ensemble weighting improves both predictive accuracy and interpretability in credit risk modeling, providing a scalable data-driven approach for institutional credit assessment and risk monitoring.

Abstract: Accurate prediction of loan defaults is a central challenge in credit risk management, particularly in modern financial datasets characterised by nonlinear relationships, class imbalance, and evolving borrower behaviour. Traditional statistical models and static ensemble methods often struggle to maintain reliable performance under such conditions. This study proposes an Optimised Greedy-Weighted Ensemble framework for loan default prediction that dynamically allocates model weights based on empirical predictive performance. The framework integrates multiple machine learning classifiers, with their hyperparameters first optimised using Particle Swarm Optimisation. Model predictions are then combined via a regularised greedy weighting mechanism. At the same time, a neural-network-based meta-learner is employed within a stacked ensemble to capture higher-order relationships among model outputs. Experiments conducted on the Lending Club dataset demonstrate that the proposed framework improves predictive performance compared with individual classifiers. The BlendNet ensemble achieved the strongest results with an AUC of 0.80, a macro-average F1-score of 0.73, and a default recall of 0.81. Calibration analysis further shows that tree-based ensembles such as Extra Trees and Gradient Boosting provide the most reliable probability estimates, while the stacked ensemble offers superior ranking capability. Feature analysis using Recursive Feature Elimination identifies revolving utilisation, annual income, and debt-to-income ratio as the most influential predictors of loan default. These findings demonstrate that performance-driven ensemble weighting can improve both predictive accuracy and interpretability in credit risk modelling. The proposed framework provides a scalable data-driven approach to support institutional credit assessment, risk monitoring, and financial decision-making.
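
As a sketch of what "regularised greedy weighting" can look like, here is a minimal Caruana-style forward-selection ensembler with an L2 penalty on the implied weight vector. The paper's exact mechanism, validation score, and regulariser are not specified in the abstract, so all names and details below are assumptions:

```python
import numpy as np

def greedy_weighted_ensemble(val_preds, y_val, score_fn, n_steps=100, l2=1e-3):
    """Caruana-style greedy weighting: at each step, add (with replacement)
    the model whose inclusion most improves the validation score, minus an
    L2 penalty on the implied weight vector. Returns normalized weights."""
    n_models = len(val_preds)
    counts = np.zeros(n_models)
    ensemble = np.zeros_like(np.asarray(y_val, dtype=float))
    for step in range(1, n_steps + 1):
        best_i, best_score = 0, -np.inf
        for i in range(n_models):
            # candidate running-mean prediction if model i were added
            cand = (ensemble * (step - 1) + val_preds[i]) / step
            w = (counts + (np.arange(n_models) == i)) / step
            score = score_fn(y_val, cand) - l2 * float(w @ w)
            if score > best_score:
                best_i, best_score = i, score
        counts[best_i] += 1
        ensemble = (ensemble * (step - 1) + val_preds[best_i]) / step
    return counts / n_steps
```

Models that are repeatedly selected receive larger final weights; the penalty discourages collapsing onto a single model.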

[561] Context Bootstrapped Reinforcement Learning

Saaket Agashe, Jayanth Srinivasa, Gaowen Liu, Ramana Kompella, Xin Eric Wang

Main category: cs.LG

TL;DR: CBRL improves RLVR exploration efficiency by stochastically prepending few-shot demonstrations to training prompts with a curriculum that anneals injection probability from high to zero.

DetailsMotivation: RLVR suffers from exploration inefficiency, especially for tasks requiring novel reasoning patterns or domain-specific knowledge, making it difficult to generate successful rollouts and obtain learning signals.

Method: Context Bootstrapped Reinforcement Learning (CBRL) augments RLVR by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum starting high to bootstrap exploration, then annealing to zero so models must succeed without assistance.

Result: CBRL consistently improves success rate, provides better exploration efficiency across two model families and five Reasoning Gym tasks, and is algorithm-agnostic. It also shows practical applicability on the domain-specific programming language Q.

Conclusion: CBRL effectively addresses RLVR’s exploration inefficiency by forcing policies to internalize reasoning patterns from demonstrations rather than relying on them at test time, making it a practical solution for tasks requiring novel reasoning.

Abstract: Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts. The injection probability follows a curriculum that starts high to bootstrap early exploration, then anneals to zero so the model must ultimately succeed without assistance. This forces the policy to internalize reasoning patterns from the demonstrations rather than relying on them at test time. We validate CBRL across two model families and five Reasoning Gym tasks. Our results demonstrate that CBRL consistently improves success rate, provides better exploration efficiency, and is algorithm-agnostic. We further demonstrate CBRL’s practical applicability on Q, a domain-specific programming language that diverges significantly from mainstream language conventions.
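
The injection curriculum is simple to state in code. The sketch below assumes a linear annealing schedule (the abstract does not specify the schedule's shape), and the prompt and demo strings are hypothetical:

```python
import random

def make_prompt(task_prompt, demos, step, total_steps,
                p_start=0.9, p_end=0.0, rng=random):
    """Stochastically prepend few-shot demonstrations to an RLVR training
    prompt. The injection probability anneals (linearly, as one simple
    choice) from p_start to p_end over training, so the policy must
    eventually succeed without the demonstrations."""
    frac = min(step / max(total_steps, 1), 1.0)
    p_inject = p_start + (p_end - p_start) * frac
    if rng.random() < p_inject:
        return "\n\n".join(demos) + "\n\n" + task_prompt
    return task_prompt
```

Early in training almost every rollout sees the demonstrations, bootstrapping exploration; by the end, prompts are bare, forcing internalization of the demonstrated reasoning patterns.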

[562] Balancing Performance and Fairness in Explainable AI for Anomaly Detection in Distributed Power Plants Monitoring

Corneille Niyonkuru, Marcellin Atemkeng, Gabin Maxime Nguegnang, Arnaud Nguembang Fadja

Main category: cs.LG

TL;DR: A supervised ML framework for anomaly detection in diesel generator monitoring using ensemble methods with resampling techniques, SHAP for interpretability, and fairness metrics across regional clusters.

DetailsMotivation: Reliable anomaly detection in distributed power plant monitoring is essential for operational continuity and cost reduction, but faces challenges of class imbalance, lack of interpretability, and fairness issues across regional clusters in telecom power systems.

Method: Proposes a supervised ML framework integrating ensemble methods (LightGBM, XGBoost, Random Forest, etc.) with advanced resampling techniques (SMOTE with Tomek Links and ENN) to address class imbalance. Uses SHAP for interpretability, Disparate Impact Ratio for fairness evaluation, and Maximum Mean Discrepancy for domain shift assessment across regions.

Result: Ensemble models consistently outperform baselines, with LightGBM achieving F1-score of 0.99 and minimal bias across clusters (DIR ≈ 0.95). SHAP analysis identifies fuel consumption rate and runtime per day as dominant predictors. Framework demonstrates balance between performance, interpretability, and fairness.

Conclusion: The framework successfully balances performance, interpretability, and fairness in anomaly detection for industrial power management, enabling more equitable and explainable AI systems with potential for real-time deployment via containerized services.

Abstract: Reliable anomaly detection in distributed power plant monitoring systems is essential for ensuring operational continuity and reducing maintenance costs, particularly in regions where telecom operators heavily rely on diesel generators. However, this task is challenged by extreme class imbalance, lack of interpretability, and potential fairness issues across regional clusters. In this work, we propose a supervised ML framework that integrates ensemble methods (LightGBM, XGBoost, Random Forest, CatBoost, GBDT, AdaBoost) and baseline models (Support Vector Machine, K-Nearest Neighbors, Multilayer Perceptrons, and Logistic Regression) with advanced resampling techniques (SMOTE with Tomek Links and ENN) to address imbalance in a dataset of diesel generator operations in Cameroon. Interpretability is achieved through SHAP (SHapley Additive exPlanations), while fairness is quantified using the Disparate Impact Ratio (DIR) across operational clusters. We further evaluate model generalization using Maximum Mean Discrepancy (MMD) to capture domain shifts between regions. Experimental results show that ensemble models consistently outperform baselines, with LightGBM achieving an F1-score of 0.99 and minimal bias across clusters (DIR $\approx 0.95$). SHAP analysis highlights fuel consumption rate and runtime per day as dominant predictors, providing actionable insights for operators. Our findings demonstrate that it is possible to balance performance, interpretability, and fairness in anomaly detection, paving the way for more equitable and explainable AI systems in industrial power management. Finally, beyond offline evaluation, we also discuss how the trained models can be deployed in practice for real-time monitoring. We show how containerized services can process data in real time, deliver low-latency predictions, and provide interpretable outputs for operators.
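
For reference, the Disparate Impact Ratio used here as a fairness metric is a few lines to compute. The min/max-over-groups form below is one common multi-group generalization and may differ from the paper's exact definition:

```python
import numpy as np

def disparate_impact_ratio(y_pred, groups):
    """Disparate Impact Ratio across groups: the ratio of the lowest to
    the highest positive-prediction rate. A value near 1 indicates
    parity; the conventional 'four-fifths rule' flags values below 0.8.
    Assumes every group has at least one positive prediction overall."""
    rates = [np.mean(y_pred[groups == g]) for g in np.unique(groups)]
    return min(rates) / max(rates)
```

With this convention, the reported DIR ≈ 0.95 says the least-flagged regional cluster is flagged at about 95% of the rate of the most-flagged one.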

[563] BVSIMC: Bayesian Variable Selection-Guided Inductive Matrix Completion for Improved and Interpretable Drug Discovery

Sijian Fan, Liyan Xiong, Dayuan Wang, Guoshuai Cai, Ray Bai

Main category: cs.LG

TL;DR: BVSIMC is a Bayesian model for drug discovery that performs variable selection on side features to improve prediction accuracy and interpretability in tasks like drug resistance prediction and drug repositioning.

DetailsMotivation: Drug discovery often uses side information (chemical properties, genomic data) to improve predictions, but these features can be noisy, high-dimensional, and vary in relevance. Current methods lack effective variable selection from side features.

Method: Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC) learns sparse latent embeddings through Bayesian variable selection to identify relevant side features while improving prediction performance.

Result: BVSIMC outperforms state-of-the-art methods in both synthetic and real data. In drug resistance prediction for Mycobacterium tuberculosis and drug-disease association prediction, it achieves better prediction accuracy and identifies clinically meaningful side features.

Conclusion: BVSIMC effectively handles noisy, high-dimensional side features in drug discovery through Bayesian variable selection, improving both prediction performance and interpretability by identifying relevant biological features.

Abstract: Recent advances in drug discovery have demonstrated that incorporating side information (e.g., chemical properties about drugs and genomic information about diseases) often greatly improves prediction performance. However, these side features can vary widely in relevance and are often noisy and high-dimensional. We propose Bayesian Variable Selection-Guided Inductive Matrix Completion (BVSIMC), a new Bayesian model that enables variable selection from side features in drug discovery. By learning sparse latent embeddings, BVSIMC improves both predictive accuracy and interpretability. We validate our method through simulation studies and two drug discovery applications: 1) prediction of drug resistance in Mycobacterium tuberculosis, and 2) prediction of new drug-disease associations in computational drug repositioning. On both synthetic and real data, BVSIMC outperforms several other state-of-the-art methods in terms of prediction. In our two real examples, BVSIMC further reveals the most clinically meaningful side features.
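
The "sparse latent embeddings" fit the standard inductive matrix completion template; a plausible form (generic notation, not verified against the paper) is

```latex
\widehat{M}_{ij} = x_i^{\top} U V^{\top} z_j,
```

where $x_i$ and $z_j$ are the side-feature vectors of drug $i$ and disease (or strain) $j$, and sparsity-inducing priors on the rows of $U$ and $V$ zero out irrelevant side features, which is what yields the variable selection.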

[564] Maximum-Entropy Exploration with Future State-Action Visitation Measures

Adrien Bolland, Gaspard Lambrechts, Damien Ernst

Main category: cs.LG

TL;DR: Maximum entropy RL with intrinsic rewards based on discounted state-action feature distribution entropy improves exploration within trajectories.

DetailsMotivation: To enhance exploration in reinforcement learning by maximizing the entropy of the discounted distribution of state-action features visited during future time steps, providing better exploration within individual trajectories.

Method: Proposes intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited in future time steps. Shows this is a lower bound on the entropy of the discounted distribution of state-action features in trajectories, and that the distribution can be estimated off-policy using a contraction operator.

Result: The approach leads to improved visitation of features within individual trajectories (though slightly reduced visitation in expectation over different trajectories), and improved convergence speed for exploration-only agents. Control performance remains similar across most benchmarks.

Conclusion: The proposed maximum entropy RL method with discounted feature distribution entropy provides effective exploration benefits, particularly improving within-trajectory feature visitation and learning speed for exploration-focused agents.

Abstract: Maximum entropy reinforcement learning motivates agents to explore states and actions to maximize the entropy of some distribution, typically by providing additional intrinsic rewards proportional to that entropy function. In this paper, we study intrinsic rewards proportional to the entropy of the discounted distribution of state-action features visited during future time steps. This approach is motivated by two results. First, we show that the expected sum of these intrinsic rewards is a lower bound on the entropy of the discounted distribution of state-action features visited in trajectories starting from the initial states, which we relate to an alternative maximum entropy objective. Second, we show that the distribution used in the intrinsic reward definition is the fixed point of a contraction operator and can therefore be estimated off-policy. Experiments highlight that the new objective leads to improved visitation of features within individual trajectories, in exchange for slightly reduced visitation of features in expectation over different trajectories, as suggested by the lower bound. It also leads to improved convergence speed for learning exploration-only agents. Control performance remains similar across most methods on the considered benchmarks.
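
A count-based caricature of the intrinsic reward helps fix ideas: estimate the discounted distribution of features visited along a trajectory, then reward its entropy. The paper's actual estimator is off-policy via a contraction operator; the empirical version below is only an illustrative stand-in:

```python
import math
from collections import defaultdict

def discounted_visitation(trajectory_features, gamma=0.99):
    """Empirical discounted distribution over (state, action) features
    visited along one trajectory: the visit at time t carries weight
    gamma**t; weights are normalized into a probability distribution."""
    weights = defaultdict(float)
    for t, phi in enumerate(trajectory_features):
        weights[phi] += gamma ** t
    total = sum(weights.values())
    return {phi: w / total for phi, w in weights.items()}

def entropy_bonus(dist):
    """Shannon entropy of the visitation distribution, used as an
    exploration bonus."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)
```

Because the distribution is per-trajectory and discounted, maximizing this bonus pushes the agent to spread its *future* visits within a single rollout, matching the within-trajectory improvement reported in the experiments.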

[565] Best-of-Both-Worlds Multi-Dueling Bandits: Unified Algorithms for Stochastic and Adversarial Preferences under Condorcet and Borda Objectives

S. Akash, Pratik Gajane, Jawar Singh

Main category: cs.LG

TL;DR: Best-of-both-worlds algorithms for multi-dueling bandits that work optimally in both stochastic and adversarial environments without knowing which regime they face, for both Condorcet and Borda objectives.

DetailsMotivation: Multi-dueling bandits (selecting m≥2 arms per round with only winner feedback) are important for ranking and recommendation systems, but existing algorithms require knowing whether the environment is stochastic or adversarial. The paper addresses whether a single algorithm can perform optimally in both regimes without this knowledge.

Method: For Condorcet setting: MetaDueling - a black-box reduction that converts any dueling bandit algorithm into multi-dueling by transforming multi-way winner feedback into unbiased pairwise signals. For Borda setting: AlgBorda - a specialized stochastic-and-adversarial algorithm.

Result: First best-of-both-worlds algorithms for multi-dueling bandits. For Condorcet: O(√(KT)) pseudo-regret against adversarial preferences and instance-optimal O(∑_{i≠a*} log T / Δ_i) under stochastic preferences. For Borda: O(K² log KT + K log² T + ∑_{i: Δᵢᴮ > 0} K log KT / (Δᵢᴮ)²) in stochastic environments and O(K^{1/3} T^{2/3} (log K)^{1/3} + K√(T log KT)) against adversaries.

Conclusion: Successfully demonstrates that single algorithms can perform optimally in both stochastic and adversarial multi-dueling bandit environments without regime knowledge, with matching lower bounds for Condorcet and near-optimal bounds for Borda.

Abstract: Multi-dueling bandits, where a learner selects $m \geq 2$ arms per round and observes only the winner, arise naturally in many applications including ranking and recommendation systems, yet a fundamental question has remained open: can a single algorithm perform optimally in both stochastic and adversarial environments, without knowing which regime it faces? We answer this affirmatively, providing the first best-of-both-worlds algorithms for multi-dueling bandits under both Condorcet and Borda objectives. For the Condorcet setting, we propose \texttt{MetaDueling}, a black-box reduction that converts any dueling bandit algorithm into a multi-dueling bandit algorithm by transforming multi-way winner feedback into an unbiased pairwise signal. Instantiating our reduction with \texttt{Versatile-DB} yields the first best-of-both-worlds algorithm for multi-dueling bandits: it achieves $O(\sqrt{KT})$ pseudo-regret against adversarial preferences and the instance-optimal $O\!\left(\sum_{i \neq a^\star} \frac{\log T}{\Delta_i}\right)$ pseudo-regret under stochastic preferences, both simultaneously and without prior knowledge of the regime. For the Borda setting, we propose \texttt{AlgBorda}, a stochastic-and-adversarial algorithm that achieves $O\left(K^2 \log KT + K \log^2 T + \sum_{i: \Delta_i^{\mathrm{B}} > 0} \frac{K\log KT}{(\Delta_i^{\mathrm{B}})^2}\right)$ regret in stochastic environments and $O\left(K \sqrt{T \log KT} + K^{1/3} T^{2/3} (\log K)^{1/3}\right)$ regret against adversaries, again without prior knowledge of the regime. We complement our upper bounds with matching lower bounds for the Condorcet setting. For the Borda setting, our upper bounds are near-optimal with respect to the lower bounds (within a factor of $K$) and match the best-known results in the literature.

[566] Book your room in the Turing Hotel! A symmetric and distributed Turing Test with multiple AIs and humans

Christian Di Maio, Tommaso Guidi, Luigi Quarantiello, Jack Bell, Marco Gori, Stefano Melacci, Vincenzo Lomonaco

Main category: cs.LG

TL;DR: A novel Turing Test extension called “TuringHotel” tests LLMs in mixed human-AI communities where both act as judges and respondents, revealing current models still sometimes pass as human but with identifiable “human fingerprints.”

DetailsMotivation: To extend the classical Turing Test from one-to-one interactions to group settings with mixed human-LLM communities, creating a more complex social environment to evaluate AI's ability to blend in with humans.

Method: Developed TuringHotel on the UNaIVERSE platform, creating a “World” with defined roles and interaction dynamics. Used authenticated peer-to-peer network for communication. Involved 17 human participants and 19 LLMs in time-bounded discussions where both humans and AIs served as judges and respondents.

Result: Current LLMs are still sometimes confused as humans in group settings, but there are unexpected mistakes that reveal identifiable “human fingerprints.” Despite high-quality language skills, artificial participants are not fully ambiguous in their human-like behavior.

Conclusion: This represents the first distributed experiment of its kind, demonstrating that while LLMs can sometimes pass as human in complex social settings, their “human fingerprints” remain detectable. Such initiatives could support ongoing monitoring of LLM evolution.

Abstract: In this paper, we report our experience with “TuringHotel”, a novel extension of the Turing Test based on interactions within mixed communities of Large Language Models (LLMs) and human participants. The classical one-to-one interaction of the Turing Test is reinterpreted in a group setting, where both human and artificial agents engage in time-bounded discussions and, interestingly, are both judges and respondents. This community is instantiated in the novel platform UNaIVERSE (https://unaiverse.io), creating a “World” which defines the roles and interaction dynamics, facilitated by the platform’s built-in programming tools. All communication occurs over an authenticated peer-to-peer network, ensuring that no third parties can access the exchange. The platform also provides a unified interface for humans, accessible via both mobile devices and laptops, that was a key component of the experience in this paper. Results of our experimentation involving 17 human participants and 19 LLMs revealed that current models are still sometimes confused as humans. Interestingly, there are several unexpected mistakes, suggesting that human fingerprints are still identifiable but not fully unambiguous, despite the high-quality language skills of artificial participants. We argue that this is the first experiment conducted in such a distributed setting, and that similar initiatives could be of national interest to support ongoing experiments and competitions aimed at monitoring the evolution of large language models over time.

[567] Foundations of Schrödinger Bridges for Generative Modeling

Sophia Tang

Main category: cs.LG

TL;DR: The paper provides a mathematical foundation for Schrödinger bridges as a unifying principle underlying modern generative models like diffusion models, score-based models, and flow matching, showing how they connect prior and target distributions through optimal stochastic paths.

DetailsMotivation: To establish Schrödinger bridges as a fundamental mathematical framework that unifies various generative modeling approaches (diffusion models, score-based models, flow matching) by framing the generative task as an optimal stochastic transport problem between distributions.

Method: Develops mathematical foundations drawing from optimal transport, stochastic control, and path-space optimization, focusing on dynamic formulations with direct connections to generative modeling. Builds a comprehensive toolkit for constructing Schrödinger bridges from first principles.

Result: Provides a unified theoretical framework showing how Schrödinger bridges underlie modern generative models, with generalized and task-specific computational methods derived from these constructions.

Conclusion: Schrödinger bridges offer a powerful mathematical foundation that connects and unifies various generative modeling approaches through optimal stochastic transport principles, enabling both theoretical understanding and practical computational methods.

Abstract: At the core of modern generative modeling frameworks, including diffusion models, score-based models, and flow matching, is the task of transforming a simple prior distribution into a complex target distribution through stochastic paths in probability space. Schrödinger bridges provide a unifying principle underlying these approaches, framing the problem as determining an optimal stochastic bridge between marginal distribution constraints with minimal-entropy deviations from a pre-defined reference process. This guide develops the mathematical foundations of the Schrödinger bridge problem, drawing on optimal transport, stochastic control, and path-space optimization, and focuses on its dynamic formulation with direct connections to modern generative modeling. We build a comprehensive toolkit for constructing Schrödinger bridges from first principles, and show how these constructions give rise to generalized and task-specific computational methods.
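
The dynamic problem at the center of the guide can be stated compactly (generic notation): find the path measure closest in KL divergence to a reference process, subject to the two marginal constraints,

```latex
\mathbb{P}^\star \;=\; \arg\min_{\mathbb{P}} \; \mathrm{KL}\!\left(\mathbb{P} \,\middle\|\, \mathbb{Q}\right)
\quad \text{subject to} \quad \mathbb{P}_{t=0} = \mu_{\text{prior}}, \quad \mathbb{P}_{t=1} = \mu_{\text{target}},
```

where $\mathbb{Q}$ is the law of the pre-defined reference process and the minimisation ranges over path measures. Diffusion models, score-based models, and flow matching can all be read as particular choices of $\mathbb{Q}$ and of how the constrained problem is relaxed or solved.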

[568] AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

An Luo, Jin Du, Xun Xian, Robert Specht, Fangqiao Tian, Ganghua Wang, Xuan Bi, Charles Fleming, Ashish Kundu, Jayanth Srinivasa, Mingyi Hong, Rui Zhang, Tianxi Li, Galin Jones, Jie Ding

Main category: cs.LG

TL;DR: AgentDS benchmark evaluates AI agents vs human experts on domain-specific data science tasks across 6 industries, finding current AI struggles with domain reasoning and human-AI collaboration outperforms AI-only approaches.

DetailsMotivation: While LLMs and AI agents have automated data science workflows, it's unclear how well they perform compared to human experts on domain-specific tasks and where human expertise still provides advantages.

Method: Created AgentDS benchmark with 17 challenges across 6 industries (commerce, food production, healthcare, insurance, manufacturing, retail banking). Conducted open competition with 29 teams and 80 participants to compare human-AI collaboration vs AI-only baselines.

Result: AI-only baselines performed near or below median of competition participants. Current AI agents struggle with domain-specific reasoning. Strongest solutions came from human-AI collaboration rather than fully automated AI approaches.

Conclusion: Challenges narrative of complete automation by AI, underscores enduring importance of human expertise in data science, and illuminates directions for next-generation AI development.

Abstract: Data science plays a critical role in transforming complex data into actionable insights across numerous domains. Recent developments in large language models (LLMs) and artificial intelligence (AI) agents have significantly automated data science workflow. However, it remains unclear to what extent AI agents can match the performance of human experts on domain-specific data science tasks, and in which aspects human expertise continues to provide advantages. We introduce AgentDS, a benchmark and competition designed to evaluate both AI agents and human-AI collaboration performance in domain-specific data science. AgentDS consists of 17 challenges across six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. We conducted an open competition involving 29 teams and 80 participants, enabling systematic comparison between human-AI collaborative approaches and AI-only baselines. Our results show that current AI agents struggle with domain-specific reasoning. AI-only baselines perform near or below the median of competition participants, while the strongest solutions arise from human-AI collaboration. These findings challenge the narrative of complete automation by AI and underscore the enduring importance of human expertise in data science, while illuminating directions for the next generation of AI. Visit the AgentDS website here: https://agentds.org/ and open source datasets here: https://huggingface.co/datasets/lainmn/AgentDS .

[569] When Differential Privacy Meets Wireless Federated Learning: An Improved Analysis for Privacy and Convergence

Chen Yaoling, Liang Hao, Tu Xiaotong

Main category: cs.LG

TL;DR: Analysis of privacy and convergence for differentially private wireless federated learning with non-convex objectives, showing privacy loss converges to constant rather than diverging with iterations.

DetailsMotivation: Existing differentially private wireless federated learning frameworks have open questions about precisely characterizing privacy loss, and existing convergence analyses rely on restrictive convexity assumptions or ignore gradient clipping effects.

Method: Comprehensive analysis of privacy and convergence for DPWFL with general smooth non-convex loss objectives, explicitly incorporating device selection and mini-batch sampling, with gradient clipping.

Result: Privacy loss converges to a constant rather than diverging with the number of iterations; convergence guarantees established with gradient clipping; explicit privacy-utility trade-off derived; numerical results validate theoretical findings.

Conclusion: The paper provides rigorous theoretical foundations for differentially private wireless federated learning with non-convex objectives, addressing key limitations of prior work and establishing practical privacy-utility trade-offs.

Abstract: Differentially private wireless federated learning (DPWFL) is a promising framework for protecting sensitive user data. However, foundational questions on how to precisely characterize privacy loss remain open, and existing work is further limited by convergence analyses that rely on restrictive convexity assumptions or ignore the effect of gradient clipping. To overcome these issues, we present a comprehensive analysis of privacy and convergence for DPWFL with general smooth non-convex loss objectives. Our analysis explicitly incorporates both device selection and mini-batch sampling, and shows that the privacy loss can converge to a constant rather than diverge with the number of iterations. Moreover, we establish convergence guarantees with gradient clipping and derive an explicit privacy-utility trade-off. Numerical results validate our theoretical findings.
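
The per-device privatization step that clipping-aware analyses like this one study is, in its simplest form, the Gaussian mechanism applied to a clipped update. The sketch below omits the wireless-channel modelling that is the paper's focus, and all parameter names are illustrative:

```python
import numpy as np

def privatize_update(grad, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One device's DP update: clip the gradient to a fixed L2 norm,
    then add Gaussian noise scaled to the clipping bound (the standard
    Gaussian mechanism)."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    # scale down only when the gradient exceeds the clipping bound
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    noise = rng.normal(0.0, noise_mult * clip_norm, size=grad.shape)
    return clipped + noise
```

It is exactly the interaction of this clipping with non-convex objectives, device selection, and mini-batch sampling that the paper's analysis accounts for.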

[570] Communication-Efficient and Robust Multi-Modal Federated Learning via Latent-Space Consensus

Mohamed Badi, Chaouki Ben Issaid, Mehdi Bennis

Main category: cs.LG

TL;DR: CoMFed is a communication-efficient federated learning framework for multi-modal data that uses learnable projection matrices and latent-space regularization to align heterogeneous client modalities while preserving privacy.

DetailsMotivation: Federated learning for multi-modal settings faces challenges due to client heterogeneity in modalities and model architectures, making feature space alignment difficult while preserving privacy and minimizing communication costs.

Method: CoMFed uses learnable projection matrices to generate compressed latent representations and a latent-space regularizer to align these representations across clients, improving cross-modal consistency and robustness to outliers.

Result: Experiments on human activity recognition benchmarks show CoMFed achieves competitive accuracy with minimal communication overhead.

Conclusion: CoMFed provides an effective solution for multi-modal federated learning that balances communication efficiency, privacy preservation, and model performance.

Abstract: Federated learning (FL) enables collaborative model training across distributed devices without sharing raw data, but applying FL to multi-modal settings introduces significant challenges. Clients typically possess heterogeneous modalities and model architectures, making it difficult to align feature spaces efficiently while preserving privacy and minimizing communication costs. To address this, we introduce CoMFed, a Communication-Efficient Multi-Modal Federated Learning framework that uses learnable projection matrices to generate compressed latent representations. A latent-space regularizer aligns these representations across clients, improving cross-modal consistency and robustness to outliers. Experiments on human activity recognition benchmarks show that CoMFed achieves competitive accuracy with minimal overhead.
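
A minimal sketch of the two ingredients named in the abstract: a learnable projection into a shared latent space, and an alignment regularizer over client latents. The concrete penalty used below (mean-squared deviation from the cross-client mean) is an assumption, not taken from the paper:

```python
import numpy as np

def project(features, W):
    """Compress a client's modality-specific features into the shared
    latent space with a learnable projection matrix W (d_in x d_latent).
    Only latents of size d_latent need to be communicated."""
    return features @ W

def latent_alignment_loss(latents):
    """Latent-space consensus regularizer: penalize each client's latent
    for deviating from the mean latent across clients."""
    Z = np.stack(latents)                  # (n_clients, d_latent)
    center = Z.mean(axis=0)
    return float(np.mean((Z - center) ** 2))
```

Because clients exchange only the compressed latents, the regularizer can be computed server-side without raw data or full model weights ever leaving a device.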

[571] Position: Spectral GNNs Are Neither Spectral Nor Superior for Node Classification

Qin Jiang, Chengjia Wang, Michael Lones, Dongdong Chen, Wei Pang

Main category: cs.LG

TL;DR: Spectral GNNs for node classification are built on flawed theoretical foundations - they don’t actually perform meaningful spectral filtering, and their empirical success comes from implementation details that make them equivalent to message-passing networks rather than true spectral methods.

DetailsMotivation: To critically examine the theoretical foundations of Spectral Graph Neural Networks for node classification, challenging the common belief that they perform frequency-domain filtering on graphs and explaining why they appear to work despite flawed theoretical underpinnings.

Method: Theoretical analysis identifying two key flaws: (1) graph Laplacian eigenvectors don’t form a true Fourier basis, and (2) polynomial approximations aren’t theoretically justified. Analysis of message-passing dynamics showing low/high-pass behaviors come from MPNN dynamics, not spectral formulations. Empirical analysis of directed spectral models (MagNet, HoloNet) showing their effectiveness comes from implementation issues reducing them to MPNNs.

Result: Spectral GNNs don’t meaningfully capture graph spectrum or reliably improve performance. Their competitive results are better explained by equivalence to MPNNs, sometimes aided by implementations inconsistent with intended spectral design. When implemented consistently as claimed spectral algorithms, performance becomes weak.

Conclusion: For node classification, Spectral GNNs are fundamentally flawed - they neither perform meaningful spectral filtering nor provide reliable performance improvements. Their apparent success stems from implementation details that make them effectively message-passing networks rather than true spectral methods.

Abstract: Spectral Graph Neural Networks (Spectral GNNs) for node classification promise frequency-domain filtering on graphs, yet rest on flawed foundations. Recent work shows that graph Laplacian eigenvectors do not in general have the key properties of a true Fourier basis, but leaves the empirical success of Spectral GNNs unexplained. We identify two theoretical glitches: (1) commonly used “graph Fourier bases” are not classical Fourier bases for graph signals; (2) (n-1)-degree polynomials (n = number of nodes) can exactly interpolate any spectral response via a Vandermonde system, so the usual “polynomial approximation” narrative is not theoretically justified. The effectiveness of GCN is commonly attributed to spectral low-pass filtering, yet we prove that low- and high-pass behaviors arise solely from message-passing dynamics rather than Graph Fourier Transform-based spectral formulations. We then analyze two representative directed spectral models, MagNet and HoloNet. Their reported effectiveness is not spectral: it arises from implementation issues that reduce them to powerful MPNNs. When implemented consistently with the claimed spectral algorithms, performance becomes weak. This position paper argues that: for node classification, Spectral GNNs neither meaningfully capture the graph spectrum nor reliably improve performance; competitive results are better explained by their equivalence to MPNNs, sometimes aided by implementations inconsistent with their intended design.
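
For concreteness, the "graph Fourier" filtering pipeline whose spectral interpretation the paper disputes looks like this (a standard construction, not the paper's code):

```python
import numpy as np

def graph_fourier_filter(A, x, h):
    """Apply a spectral filter h(lambda) to a signal x on a graph with
    adjacency A, via the eigendecomposition of the symmetric normalized
    Laplacian L = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt
    lam, U = np.linalg.eigh(L)           # eigenvalues of L lie in [0, 2]
    return U @ (h(lam) * (U.T @ x))      # transform, filter, invert
```

The paper's claim is that the eigenbasis U lacks the defining properties of a classical Fourier basis, so calling h a "frequency response" is not justified; the low-pass behavior attributed to such filters is argued to come from message-passing dynamics instead.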

[572] On Optimizing Multimodal Jailbreaks for Spoken Language Models

Aravind Krishnan, Karolina Stańczak, Dietrich Klakow

Main category: cs.LG

TL;DR: JAMA is a joint multimodal jailbreak attack that simultaneously optimizes text and audio perturbations to exploit vulnerabilities in Spoken Language Models, achieving 1.5x to 10x higher success rates than unimodal attacks.

DetailsMotivation: Spoken Language Models (SLMs) inherit safety vulnerabilities from LLMs and have expanded attack surfaces due to multimodal integration. Existing jailbreak attacks are largely unimodal (text-only or audio-only), leaving a gap in understanding and defending against coordinated multimodal attacks.

Method: JAMA combines Greedy Coordinate Gradient (GCG) for text optimization and Projected Gradient Descent (PGD) for audio optimization to simultaneously perturb both modalities. The framework also introduces a sequential approximation method to accelerate the attack by 4x to 6x.

Result: Evaluations across four state-of-the-art SLMs and four audio types show JAMA surpasses unimodal jailbreak rates by 1.5x to 10x. The sequential approximation method significantly improves attack efficiency without compromising effectiveness.

Conclusion: Unimodal safety approaches are insufficient for robust SLMs. The demonstrated effectiveness of joint multimodal attacks highlights the need for comprehensive multimodal safety evaluations and defenses in spoken language systems.

Abstract: As Spoken Language Models (SLMs) integrate speech and text modalities, they inherit the safety vulnerabilities of their LLM backbone and an expanded attack surface. SLMs have been previously shown to be susceptible to jailbreaking, where adversarial prompts induce harmful responses. Yet existing attacks largely remain unimodal, optimizing either text or audio in isolation. We explore gradient-based multimodal jailbreaks by introducing JAMA (Joint Audio-text Multimodal Attack), a joint multimodal optimization framework combining Greedy Coordinate Gradient (GCG) for text and Projected Gradient Descent (PGD) for audio, to simultaneously perturb both modalities. Evaluations across four state-of-the-art SLMs and four audio types demonstrate that JAMA surpasses unimodal jailbreak rate by 1.5x to 10x. We analyze the operational dynamics of this joint attack and show that a sequential approximation method makes it 4x to 6x faster. Our findings suggest that unimodal safety is insufficient for robust SLMs. The code and data are available at https://repos.lsv.uni-saarland.de/akrishnan/multimodal-jailbreak-slm
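The PGD half of such a joint attack can be sketched as projected gradient ascent on an additive audio perturbation; this is a generic illustration with a toy surrogate gradient, not the authors' implementation:

```python
import numpy as np

def pgd_audio(audio, grad_fn, eps=0.01, alpha=0.002, steps=10):
    """Projected gradient ascent: step along the loss gradient's sign,
    then project the perturbation back into an L-infinity ball of radius eps."""
    delta = np.zeros_like(audio)
    for _ in range(steps):
        g = grad_fn(audio + delta)                               # gradient of the adversarial loss
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)   # ascent step + projection
    return audio + delta

audio = np.zeros(16000)                       # 1 s of silence at 16 kHz
# Toy surrogate: loss (x - 1)^2 has gradient 2(x - 1); ascent pushes x away from 1.
adv = pgd_audio(audio, lambda x: 2.0 * (x - 1.0))
assert np.abs(adv - audio).max() <= 0.01 + 1e-9   # perturbation stays within eps
```

In JAMA this inner loop would alternate or combine with GCG updates on the text tokens; the gradient here stands in for backpropagation through an actual SLM.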

[573] From Inference Efficiency to Embodied Efficiency: Revisiting Efficiency Metrics for Vision-Language-Action Models

Zhuofan Li, Hongkun Yang, Zhenyang Chen, Yangxuan Chen, Yingyan Lin, Chaojian Li

Main category: cs.LG

TL;DR: Current VLA efficiency metrics (parameters, FLOPs, token throughput) don’t reflect real robotic performance; system-level embodied metrics like task completion time and motion quality reveal hidden trade-offs in model compression methods.

DetailsMotivation: The paper challenges the prevailing notion of efficiency in VLA models, arguing that conventional computational metrics don't translate to actual robotic performance. Real-world efficiency should be measured by system-level embodied behaviors rather than just computational metrics.

Method: Conducted controlled studies across three approaches: model compression, token sparsification, and action sequence compression. Evaluated using both conventional efficiency metrics and system-level embodied metrics like task completion time, trajectory smoothness, cumulative joint rotation, and motion energy.

Result: Found that methods reducing computation under conventional metrics often increase end-to-end execution cost or degrade motion quality while maintaining task success. System-level metrics reveal hidden performance differences in learned action policies. Common adaptation methods show only mild, metric-specific improvements with trade-offs.

Conclusion: Conventional inference efficiency metrics overlook important aspects of embodied execution. Incorporating embodied efficiency metrics provides a more complete view of policy behavior and enables fairer, more comprehensive comparisons of VLA models for real-world robotic applications.

Abstract: Vision-Language-Action (VLA) models have recently enabled embodied agents to perform increasingly complex tasks by jointly reasoning over visual, linguistic, and motor modalities. However, we find that the prevailing notion of “efficiency” in current VLA research, characterized by parameters, FLOPs, or token decoding throughput, does not reflect actual performance on robotic platforms. In real-world execution, efficiency is determined by system-level embodied behaviors such as task completion time, trajectory smoothness, cumulative joint rotation, and motion energy. Through controlled studies across model compression, token sparsification, and action sequence compression, we make several observations that challenge common assumptions. (1) Methods that reduce computation under conventional metrics often increase end-to-end execution cost or degrade motion quality, despite maintaining task success rates. (2) System-level embodied efficiency metrics reveal performance differences in the learned action policies that remain hidden under conventional evaluations. (3) Common adaptation methods such as in-context prompting or supervised fine-tuning show only mild and metric-specific improvements in embodied efficiency. While these methods can reduce targeted embodied-efficiency metrics such as jerk or action rate, the resulting gains may come with trade-offs in other metrics, such as longer completion time. Taken together, our results suggest that conventional inference efficiency metrics can overlook important aspects of embodied execution. Incorporating embodied efficiency provides a more complete view of policy behavior and practical performance, enabling fairer and more comprehensive comparisons of VLA models.
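The embodied metrics named in the abstract are straightforward to compute from a sampled joint trajectory; the exact definitions below are our own plausible choices, not the paper's:

```python
import numpy as np

def embodied_metrics(q, dt):
    """q: (T, J) joint positions sampled every dt seconds."""
    dq = np.diff(q, axis=0)
    completion_time = (len(q) - 1) * dt
    cumulative_rotation = np.abs(dq).sum()       # total joint travel (rad)
    vel = dq / dt
    jerk = np.diff(vel, n=2, axis=0) / dt**2     # third derivative of position
    smoothness = np.sqrt((jerk ** 2).mean())     # RMS jerk (lower = smoother)
    energy = (vel ** 2).sum() * dt               # kinetic-energy proxy
    return completion_time, cumulative_rotation, smoothness, energy

# Toy 2-joint trajectory over one second.
t = np.linspace(0.0, 1.0, 101)
q = np.stack([np.sin(np.pi * t), np.cos(np.pi * t)], axis=1)
ct, rot, sm, en = embodied_metrics(q, dt=0.01)
assert abs(ct - 1.0) < 1e-9
```

Two policies with identical success rates and FLOPs can differ sharply on all four numbers, which is the paper's core observation.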

[574] Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

Main category: cs.LG

TL;DR: Adaptive stock prediction framework using autoencoder for regime detection, dual transformer networks for different market conditions, and reinforcement learning controller for adaptive routing and threshold tuning.

DetailsMotivation: Stock markets show regime-dependent behavior where standard prediction models fail during volatile periods. Existing approaches treat all market states uniformly or require expensive manual regime labeling that becomes stale quickly.

Method: Three-component architecture: (1) Autoencoder trained on normal conditions identifies anomalous regimes via reconstruction error, (2) Dual node transformer networks specialized for stable vs. event-driven markets, (3) Soft Actor-Critic RL controller adaptively tunes regime detection threshold and pathway blending weights based on prediction feedback.

Result: On 20 S&P 500 stocks (1982-2025): 0.68% MAPE without RL controller, 0.59% MAPE with full adaptive system vs. 0.80% baseline. Directional accuracy reaches 72%. Maintains robust performance during high-volatility periods (MAPE <0.85% when baselines exceed 1.5%).

Conclusion: The adaptive framework effectively handles regime-dependent market behavior by automatically detecting anomalies and routing through specialized pathways, with RL enabling adaptive regime boundary learning based on prediction failure.

Abstract: Stock markets exhibit regime-dependent behavior where prediction models optimized for stable conditions often fail during volatile periods. Existing approaches typically treat all market states uniformly or require manual regime labeling, which is expensive and quickly becomes stale as market dynamics evolve. This paper introduces an adaptive prediction framework that adaptively identifies deviations from normal market conditions and routes data through specialized prediction pathways. The architecture consists of three components: (1) an autoencoder trained on normal market conditions that identifies anomalous regimes through reconstruction error, (2) dual node transformer networks specialized for stable and event-driven market conditions respectively, and (3) a Soft Actor-Critic reinforcement learning controller that adaptively tunes the regime detection threshold and pathway blending weights based on prediction performance feedback. The reinforcement learning component enables the system to learn adaptive regime boundaries, defining anomalies as market states where standard prediction approaches fail. Experiments on 20 S&P 500 stocks spanning 1982 to 2025 demonstrate that the proposed framework achieves 0.68% MAPE for one-day predictions without the reinforcement controller and 0.59% MAPE with the full adaptive system, compared to 0.80% for the baseline integrated node transformer. Directional accuracy reaches 72% with the complete framework. The system maintains robust performance during high-volatility periods, with MAPE below 0.85% when baseline models exceed 1.5%. Ablation studies confirm that each component contributes meaningfully: autoencoder routing accounts for 36% relative MAPE degradation upon removal, followed by the SAC controller at 15% and the dual-path architecture at 7%.
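The routing idea reduces to a gated blend of two predictors, where an anomaly score from reconstruction error drives the gate. A hedged toy (the threshold tau and blend sharpness are the knobs the paper's SAC controller would tune; all functions here are stand-ins):

```python
import numpy as np

def blended_prediction(x, reconstruct, stable_model, event_model, tau, k=10.0):
    err = np.linalg.norm(x - reconstruct(x))     # autoencoder reconstruction error
    w = 1.0 / (1.0 + np.exp(-k * (err - tau)))   # soft gate: w -> 1 when anomalous
    return (1 - w) * stable_model(x) + w * event_model(x)

x = np.array([1.0, 2.0, 3.0])
pred = blended_prediction(
    x,
    reconstruct=lambda v: v,          # perfect reconstruction: err = 0 (stable regime)
    stable_model=lambda v: 10.0,
    event_model=lambda v: 20.0,
    tau=0.5,
)
# With zero reconstruction error the gate is near 0, so the stable pathway dominates.
assert 10.0 <= pred < 11.0
```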

[575] Hierarchical Latent Structure Learning through Online Inference

Ines Aitsahalia, Kiyohito Iigaya

Main category: cs.LG

TL;DR: HOLMES is an online hierarchical Bayesian model that learns multiscale latent structure from sequential data through tractable trial-by-trial inference, balancing generalization and discrimination.

DetailsMotivation: Current learning systems struggle to balance generalization across experiences with discrimination of task-relevant details. Online latent-cause models support incremental inference but assume flat partitions, while hierarchical Bayesian models capture multilevel structure but typically require offline inference.

Method: Combines a variation on the nested Chinese Restaurant Process prior with sequential Monte Carlo inference to perform tractable trial-by-trial inference over hierarchical latent representations without explicit supervision over the latent structure.

Result: In simulations, HOLMES matched predictive performance of flat models while learning more compact representations that supported one-shot transfer to higher-level latent categories. In context-dependent tasks with nested temporal structure, HOLMES improved outcome prediction relative to flat models.

Conclusion: Provides a tractable computational framework for discovering hierarchical structure in sequential data through online inference.

Abstract: Learning systems must balance generalization across experiences with discrimination of task-relevant details. Effective learning therefore requires representations that support both. Online latent-cause models support incremental inference but assume flat partitions, whereas hierarchical Bayesian models capture multilevel structure but typically require offline inference. We introduce the Hierarchical Online Learning of Multiscale Experience Structure (HOLMES) model, a computational framework for hierarchical latent structure learning through online inference. HOLMES combines a variation on the nested Chinese Restaurant Process prior with sequential Monte Carlo inference to perform tractable trial-by-trial inference over hierarchical latent representations without explicit supervision over the latent structure. In simulations, HOLMES matched the predictive performance of flat models while learning more compact representations that supported one-shot transfer to higher-level latent categories. In a context-dependent task with nested temporal structure, HOLMES also improved outcome prediction relative to flat models. These results provide a tractable computational framework for discovering hierarchical structure in sequential data.
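The Chinese Restaurant Process at the heart of the model's prior is easy to state generatively: each new observation joins an existing cluster with probability proportional to its size, or opens a new one with probability proportional to alpha. A toy draw from the flat (non-nested) version:

```python
import numpy as np

def crp_assignments(n, alpha, rng):
    counts = []   # cluster sizes so far
    z = []        # cluster label per observation
    for _ in range(n):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)   # last index = open a new cluster
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        z.append(int(k))
    return z

rng = np.random.default_rng(0)
z = crp_assignments(200, alpha=1.0, rng=rng)
assert z[0] == 0            # the first observation always starts cluster 0
```

HOLMES nests this process across levels and replaces exact posterior inference with sequential Monte Carlo so assignments can be updated trial by trial.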

[576] SHAPCA: Consistent and Interpretable Explanations for Machine Learning Models on Spectroscopy Data

Mingxing Zhang, Nicola Rossberg, Simone Innocente, Katarzyna Komolibus, Rekha Gautam, Barry O’Sullivan, Luca Longo, Andrea Visentin

Main category: cs.LG

TL;DR: SHAPCA combines PCA for dimensionality reduction with SHAP for explainable ML in spectroscopy, providing interpretable feature importance in original spectral space with improved consistency.

DetailsMotivation: Spectroscopy data has high dimensionality and collinearity, making ML models difficult to interpret and trust in clinical/safety-critical settings. Existing feature extraction methods disconnect explanations from original signals.

Method: The SHAPCA pipeline uses Principal Component Analysis for dimensionality reduction followed by SHapley Additive exPlanations to map explanations back to the original input space, enabling both global and local interpretability.

Result: The framework provides interpretable results showing spectral bands driving model behavior, with greater consistency across training runs compared to standard approaches.

Conclusion: SHAPCA enables trustworthy ML for spectroscopy by providing stable, interpretable explanations in the original spectral space that practitioners can link to biological components.

Abstract: In recent years, machine learning models have been increasingly applied to spectroscopic datasets for chemical and biomedical analysis. For their successful adoption, particularly in clinical and safety-critical settings, professionals and researchers must be able to understand and trust the reasoning behind model predictions. However, the inherently high dimensionality and strong collinearity of spectroscopy data pose a fundamental challenge to model explainability. These properties not only complicate model training but also undermine the stability and consistency of explanations, leading to fluctuations in feature importance across repeated training runs. Feature extraction techniques have been used to reduce the input dimensionality, but the extracted features obscure the connection between the prediction and the original signal. This study proposes SHAPCA, an explainable machine learning pipeline that combines Principal Component Analysis (for dimensionality reduction) and SHapley Additive exPlanations (for post hoc explanation) to provide explanations in the original input space, which a practitioner can interpret and link back to the biological components. The proposed framework enables analysis from both global and local perspectives, revealing the spectral bands that drive overall model behaviour as well as the instance-specific features that influence individual predictions. Numerical analysis demonstrated the interpretability of the results and greater consistency across different runs.
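The core mapping can be sketched with a linear model, for which Shapley values are exact: attributions computed on principal-component scores are pushed back to wavelength space through the PCA loadings. All names and data below are illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # 200 spectra, 50 wavelengths
Xc = X - X.mean(axis=0)

# PCA via SVD, keeping 5 components.
U, sing, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:5]                                     # loadings: (5, 50)
scores = Xc @ W.T                              # PC scores: (200, 5)
w = rng.normal(size=5)                         # linear model weights on the PCs

# For a linear model, the exact Shapley value of PC j is w_j * (s_j - mean(s_j)).
shap_pc = w * (scores - scores.mean(axis=0))
shap_orig = shap_pc @ W                        # distribute back onto wavelengths
assert shap_orig.shape == (200, 50)

# Attributions are conserved: per spectrum they sum to the model's deviation from its mean.
pred_dev = scores @ w - (scores @ w).mean()
assert np.allclose(shap_pc.sum(axis=1), pred_dev)
```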

[577] Enhancing Pretrained Model-based Continual Representation Learning via Guided Random Projection

Ruilin Li, Heming Zou, Xiufeng Yan, Zheming Liang, Jie Yang, Chenliang Li, Xue Yang

Main category: cs.LG

TL;DR: SCL-MGSM improves continual learning by constructing projection layers via data-guided selection of target-aligned random bases instead of random initialization, enhancing expressivity under domain shifts while maintaining numerical stability.

DetailsMotivation: Random Projection Layer (RPL)-based continual learning methods struggle with severe domain gaps between pre-trained models and target domains. Random initialization has limited expressivity under large domain shifts, and scaling up RPL dimensions causes ill-conditioned feature matrices that destabilize analytic updates.

Method: Proposes Stochastic Continual Learner with MemoryGuard Supervisory Mechanism (SCL-MGSM), which constructs projection layers via a principled, data-guided mechanism that progressively selects target-aligned random bases to adapt pre-trained representations to downstream tasks, creating compact yet expressive RPLs while improving numerical stability.

Result: Extensive experiments on multiple exemplar-free Class Incremental Learning (CIL) benchmarks demonstrate that SCL-MGSM achieves superior performance compared to state-of-the-art methods.

Conclusion: SCL-MGSM provides an effective solution for continual learning under domain shifts by using data-guided projection layer construction instead of random initialization, balancing expressivity and numerical stability.

Abstract: Recent paradigms in Random Projection Layer (RPL)-based continual representation learning have demonstrated superior performance when building upon a pre-trained model (PTM). These methods insert a randomly initialized RPL after a PTM to enhance feature representation in the initial stage. Subsequently, a linear classification head is used for analytic updates in the continual learning stage. However, under severe domain gaps between pre-trained representations and target domains, a randomly initialized RPL exhibits limited expressivity under large domain shifts. While largely scaling up the RPL dimension can improve expressivity, it also induces an ill-conditioned feature matrix, thereby destabilizing the recursive analytic updates of the linear head. To this end, we propose the Stochastic Continual Learner with MemoryGuard Supervisory Mechanism (SCL-MGSM). Unlike random initialization, MGSM constructs the projection layer via a principled, data-guided mechanism that progressively selects target-aligned random bases to adapt the PTM representation to downstream tasks. This facilitates the construction of a compact yet expressive RPL while improving the numerical stability of analytic updates. Extensive experiments on multiple exemplar-free Class Incremental Learning (CIL) benchmarks demonstrate that SCL-MGSM achieves superior performance compared to state-of-the-art methods.
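The contrast with plain random initialization can be illustrated by a naive data-guided selection rule: generate a pool of random projection bases and keep those whose activations align best with the training targets. This is only a sketch of the idea; the actual MGSM selection mechanism is more involved:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 32))              # pre-trained (PTM) features
y = X[:, 0] + 0.1 * rng.normal(size=100)    # toy downstream target

pool = rng.normal(size=(32, 256))           # pool of candidate random bases
H = np.tanh(X @ pool)                       # random-feature activations
# Score each basis by how strongly its (centered) activation covaries with the target.
scores = np.abs((H - H.mean(axis=0)).T @ (y - y.mean()))
keep = np.argsort(scores)[-64:]             # retain the 64 best-aligned bases
W = pool[:, keep]                           # compact, target-aligned projection layer
assert W.shape == (32, 64)
```

Keeping the layer compact is what preserves a well-conditioned feature matrix for the subsequent analytic classifier updates.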

[578] Rigorous Error Certification for Neural PDE Solvers: From Empirical Residuals to Solution Guarantees

Amartya Mukherjee, Maxwell Fitzsimmons, David C. Del Rey Fernández, Jun Liu

Main category: cs.LG

TL;DR: Theoretical analysis of physics-informed neural networks (PINNs) establishing generalization bounds connecting residual error to solution-space error for PDEs.

DetailsMotivation: Traditional PDE uncertainty quantification relies on discretization theory, but PINNs introduce new error sources (optimization, sampling, representation, overfitting) making generalization error an open problem that needs theoretical analysis.

Method: Theoretical analysis establishing generalization bounds that connect residual control to solution-space error. Proves that when neural approximations lie in a compact subset of the solution space, vanishing residual error guarantees convergence to the true solution.

Result: Derived deterministic and probabilistic convergence results with certified generalization bounds that translate residual, boundary, and initial errors into explicit solution error guarantees.

Conclusion: Provides theoretical foundation for uncertainty quantification in physics-informed neural networks, addressing the open problem of generalization error in PDE solutions approximated by neural networks.

Abstract: Uncertainty quantification for partial differential equations is traditionally grounded in discretization theory, where solution error is controlled via mesh/grid refinement. Physics-informed neural networks fundamentally depart from this paradigm: they approximate solutions by minimizing residual losses at collocation points, introducing new sources of error arising from optimization, sampling, representation, and overfitting. As a result, the generalization error in the solution space remains an open problem. Our main theoretical contribution establishes generalization bounds that connect residual control to solution-space error. We prove that when neural approximations lie in a compact subset of the solution space, vanishing residual error guarantees convergence to the true solution. We derive deterministic and probabilistic convergence results and provide certified generalization bounds translating residual, boundary, and initial errors into explicit solution error guarantees.

[579] DyMoE: Dynamic Expert Orchestration with Mixed-Precision Quantization for Efficient MoE Inference on Edge

Yuegui Huang, Zhiyuan Fang, Weiqi Luo, Ruoyu Wu, Wuhui Chen, Zibin Zheng

Main category: cs.LG

TL;DR: DyMoE is a dynamic mixed-precision quantization framework for MoE models that enables real-time inference on edge devices by addressing memory and I/O bottlenecks through importance-aware prioritization, depth-adaptive scheduling, and look-ahead prefetching.

DetailsMotivation: MoE models face challenges for edge deployment due to excessive memory footprint and I/O overhead. Existing static methods have rigid latency-accuracy trade-offs, while expert importance is observed to be highly skewed and depth-dependent.

Method: DyMoE introduces three key techniques: (1) importance-aware prioritization to dynamically quantize experts at runtime based on their importance, (2) depth-adaptive scheduling to preserve semantic integrity in critical layers, and (3) look-ahead prefetching to overlap I/O stalls.

Result: On commercial edge hardware, DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference.

Conclusion: DyMoE successfully addresses the memory and I/O bottlenecks of MoE models for edge deployment through dynamic quantization and scheduling techniques, enabling efficient real-time inference on resource-constrained devices.

Abstract: Despite the computational efficiency of MoE models, the excessive memory footprint and I/O overhead inherent in multi-expert architectures pose formidable challenges for real-time inference on resource-constrained edge platforms. While existing static methods struggle with a rigid latency-accuracy trade-off, we observe that expert importance is highly skewed and depth-dependent. Motivated by these insights, we propose DyMoE, a dynamic mixed-precision quantization framework designed for high-performance edge inference. Leveraging insights into expert importance skewness and depth-dependent sensitivity, DyMoE introduces: (1) importance-aware prioritization to dynamically quantize experts at runtime; (2) depth-adaptive scheduling to preserve semantic integrity in critical layers; and (3) look-ahead prefetching to overlap I/O stalls. Experimental results on commercial edge hardware show that DyMoE reduces Time-to-First-Token (TTFT) by 3.44x-22.7x and achieves up to a 14.58x speedup in Time-Per-Output-Token (TPOT) compared to state-of-the-art offloading baselines, enabling real-time, accuracy-preserving MoE inference on resource-constrained edge devices.
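Importance-aware mixed-precision assignment can be sketched as a greedy budgeted allocation: rank experts by a (hypothetical) importance score and spend a per-layer bit budget on the most important ones first. Bit-widths, scores, and the budget below are illustrative, not DyMoE's actual policy:

```python
def assign_bits(importance, budget_bits, high=8, low=4):
    """Start everyone at `low` bits, then upgrade experts to `high` bits
    in descending importance order while the budget allows."""
    order = sorted(range(len(importance)), key=lambda i: -importance[i])
    bits = [low] * len(importance)
    spent = low * len(importance)
    for i in order:
        if spent + (high - low) <= budget_bits:
            bits[i] = high
            spent += high - low
    return bits

# Eight experts; the budget allows upgrading exactly three of them to 8-bit.
bits = assign_bits([0.9, 0.1, 0.5, 0.7, 0.2, 0.05, 0.3, 0.8], budget_bits=44)
assert sum(bits) <= 44
assert bits == [8, 4, 4, 8, 4, 4, 4, 8]   # the three highest-importance experts win
```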

[580] SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Edward Lin, Sahil Modi, Siva Kumar Sastry Hari, Qijing Huang, Zhifan Ye, Nestor Qin, Fengzhe Zhou, Yuan Zhang, Jingquan Wang, Sana Damani, Dheeraj Peri, Ouye Xie, Aditya Kane, Moshe Maor, Michael Behar, Triston Cao, Rishabh Mehta, Vartika Singh, Vikram Sharma Mailthody, Terry Chen, Zihao Ye, Hanfeng Chen, Tianqi Chen, Vinod Grover, Wei Chen, Wei Liu, Eric Chung, Luis Ceze, Roger Bringmann, Cyril Zeller, Michael Lightstone, Christos Kozyrakis, Humphrey Shi

Main category: cs.LG

TL;DR: SOL-ExecBench is a benchmark for GPU kernel optimization that measures performance against hardware Speed-of-Light bounds rather than software baselines, covering 235 CUDA kernels from AI models across multiple domains including audio and vision.

DetailsMotivation: Current benchmarks for AI-generated GPU kernels focus on speedup over software baselines, which is mutable and doesn't measure true hardware efficiency. There's a need for benchmarks that evaluate how close kernel optimizations get to fundamental hardware performance limits.

Method: Created SOL-ExecBench with 235 CUDA kernel optimization problems from 124 production AI models across language, diffusion, vision, audio, video, and hybrid architectures. Developed SOLAR pipeline to compute hardware-grounded Speed-of-Light (SOL) bounds analytically. Introduced SOL Score metric measuring gap closure between baseline and hardware limits. Built sandboxed evaluation harness with GPU clock locking, cache clearing, and anti-reward-hacking measures.

Result: The benchmark provides a fixed hardware-efficiency target (SOL bounds) rather than mutable software baselines. It covers diverse AI workloads including audio and vision models, with support for Blackwell GPU-specific optimizations and various precision formats (BF16, FP8, NVFP4).

Conclusion: SOL-ExecBench shifts GPU kernel benchmarking paradigm from beating software baselines to closing gaps to hardware Speed-of-Light limits, enabling more meaningful evaluation of AI-driven kernel optimization systems.

Abstract: As agentic AI systems become increasingly capable of generating and optimizing GPU kernels, progress is constrained by benchmarks that reward speedup over software baselines rather than proximity to hardware-efficient execution. We present SOL-ExecBench, a benchmark of 235 CUDA kernel optimization problems extracted from 124 production and emerging AI models spanning language, diffusion, vision, audio, video, and hybrid architectures, targeting NVIDIA Blackwell GPUs. The benchmark covers forward and backward workloads across BF16, FP8, and NVFP4, including kernels whose best performance is expected to rely on Blackwell-specific capabilities. Unlike prior benchmarks that evaluate kernels primarily relative to software implementations, SOL-ExecBench measures performance against analytically derived Speed-of-Light (SOL) bounds computed by SOLAR, our pipeline for deriving hardware-grounded SOL bounds, yielding a fixed target for hardware-efficient optimization. We report a SOL Score that quantifies how much of the gap between a release-defined scoring baseline and the hardware SOL bound a candidate kernel closes. To support robust evaluation of agentic optimizers, we additionally provide a sandboxed harness with GPU clock locking, L2 cache clearing, isolated subprocess execution, and static analysis based checks against common reward-hacking strategies. SOL-ExecBench reframes GPU kernel benchmarking from beating a mutable software baseline to closing the remaining gap to hardware Speed-of-Light.
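A gap-closure score in the spirit described is simple to state: 0 at the scoring baseline, 1 at the hardware Speed-of-Light bound. The exact scoring function in SOL-ExecBench may differ; the runtimes below are illustrative:

```python
def sol_score(t_kernel, t_baseline, t_sol):
    """All arguments are runtimes; t_sol is the analytic lower bound (t_sol <= t_kernel).
    Returns the fraction of the baseline-to-SOL gap the candidate kernel closes."""
    return (t_baseline - t_kernel) / (t_baseline - t_sol)

# A kernel at 1.2 ms closes 80% of the gap from a 2.0 ms baseline to a 1.0 ms bound.
assert abs(sol_score(1.2, 2.0, 1.0) - 0.8) < 1e-12
```

Because t_sol is fixed by hardware analysis rather than by whichever software baseline is fashionable, the target does not drift as baselines improve.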

[581] MIDST Challenge at SaTML 2025: Membership Inference over Diffusion-models-based Synthetic Tabular data

Masoumeh Shafieinejad, Xi He, Mahshid Alinoori, John Jewell, Sana Ayromlou, Wei Pang, Veronica Chatrath, Garui Sharma, Deval Pandya

Main category: cs.LG

TL;DR: The MIDST challenge evaluates privacy resilience of synthetic tabular data generated by diffusion models against membership inference attacks, developing novel attack methods for comprehensive assessment.

DetailsMotivation: Synthetic data from diffusion models is seen as a privacy solution, but its resilience to privacy attacks for tabular data remains largely unexplored, particularly for membership inference attacks.

Method: The MIDST challenge conducted quantitative evaluation of synthetic tabular data privacy, exploring multiple target models for membership inference attacks including diffusion models for single tables with mixed data types and multi-relational tables with constraints. Developed novel black-box and white-box MIAs tailored to these diffusion models.

Result: The challenge enabled comprehensive evaluation of privacy efficacy of diffusion models for synthetic tabular data generation, with novel attack methods developed as key outcomes.

Conclusion: The MIDST initiative provides important tools and frameworks for assessing privacy resilience of synthetic tabular data generated by diffusion models, addressing a critical gap in understanding their security properties.

Abstract: Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models like diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent developments of diffusion models have been effective on a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and multi-relational tables with interconnected constraints. MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models as a key outcome, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST
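The simplest membership-inference baseline such challenges start from is a loss threshold: flag a record as a training member when the model's per-record loss is low, since members tend to be fit better. A toy illustration (the challenge's actual black-box and white-box attacks on diffusion models are considerably more sophisticated):

```python
import numpy as np

def loss_threshold_mia(losses, threshold):
    """Predict membership: True where the per-record loss falls below the threshold."""
    return losses < threshold

member_losses = np.array([0.10, 0.20, 0.15])   # toy: members are fit well
nonmember_losses = np.array([0.90, 0.80, 1.10])
preds = loss_threshold_mia(np.concatenate([member_losses, nonmember_losses]), 0.5)
assert preds.tolist() == [True, True, True, False, False, False]
```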

[582] Improving RCT-Based Treatment Effect Estimation Under Covariate Mismatch via Calibrated Alignment

Amir Asiaee, Samhita Pal

Main category: cs.LG

TL;DR: CALM learns embeddings to align RCT and observational data with covariate mismatch, enabling transfer of outcome models while preserving causal identification from randomization.

DetailsMotivation: RCTs are underpowered for detecting heterogeneous treatment effects, while observational studies have covariate mismatch issues. There's a need to combine both sources effectively without imputation.

Method: CALM learns embeddings that map features from both RCT and observational data into a common representation space. Observational outcome models are transferred to RCT embedding space and calibrated using trial data. Two variants: calibration-based linear and neural embedding approaches.

Result: Theoretical analysis shows finite-sample risk bounds with alignment error, outcome-model complexity, and calibration complexity terms. Linear variant provides protection against negative transfer; neural variant can be vulnerable under severe distributional shift. Simulations across 51 settings show calibration-based methods equivalent for linear CATEs, and neural embedding variant wins all 22 nonlinear-regime settings.

Conclusion: CALM provides a principled framework for combining RCT and observational data under covariate mismatch, with theoretical guarantees and empirical advantages over imputation methods.

Abstract: Randomized controlled trials (RCTs) are the gold standard for estimating heterogeneous treatment effects, yet they are often underpowered for detecting effect heterogeneity. Large observational studies (OS) can supplement RCTs for conditional average treatment effect (CATE) estimation, but a key barrier is covariate mismatch: the two sources measure different, only partially overlapping, covariates. We propose CALM (Calibrated ALignment under covariate Mismatch), which bypasses imputation by learning embeddings that map each source’s features into a common representation space. OS outcome models are transferred to the RCT embedding space and calibrated using trial data, preserving causal identification from randomization. Finite-sample risk bounds decompose into alignment error, outcome-model complexity, and calibration complexity terms, identifying when embedding alignment outperforms imputation. Under the calibration-based linear variant, the framework provides protection against negative transfer; the neural variant can be vulnerable under severe distributional shift. Under sparse linear models, the embedding approach strictly generalizes imputation. Simulations across 51 settings confirm that (i) calibration-based methods are equivalent for linear CATEs, and (ii) the neural embedding variant wins all 22 nonlinear-regime settings with large margins.
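The calibration step can be illustrated in miniature: an outcome model fit on observational data is recalibrated with a linear adjustment on the (randomized) trial sample, so the trial anchors the final predictions. All relationships and names here are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observational "model": a biased predictor of the outcome.
os_predict = lambda z: 2.0 * z + 1.0

# Small RCT sample whose true relationship is y = 2z - 0.5.
z_rct = rng.normal(size=50)
y_rct = 2.0 * z_rct - 0.5

# Calibrate: regress trial outcomes on the transferred predictions.
p = os_predict(z_rct)
A = np.column_stack([p, np.ones_like(p)])
a, b = np.linalg.lstsq(A, y_rct, rcond=None)[0]
calibrated = lambda z: a * os_predict(z) + b

# The bias is affine here, so the calibrated model recovers the truth exactly.
assert np.allclose(calibrated(z_rct), y_rct)
```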

[583] Robustness, Cost, and Attack-Surface Concentration in Phishing Detection

Julian Allagan, Mohamed Elbakary, Zohreh Safari, Weizheng Gao, Gabrielle Morgan, Essence Morgan, Vladimir Deriglazov

Main category: cs.LG

TL;DR: Phishing detector robustness depends on feature economics, not model complexity, with most evasions concentrating on low-cost surface features.

Motivation: Despite near-perfect accuracy in static evaluation, phishing detectors may be vulnerable to post-deployment feature manipulation by attackers with limited budgets, highlighting the need to study robustness through cost-aware evasion.

Method: Introduces cost-aware evasion framework with three diagnostics: minimal evasion cost (MEC), evasion survival rate S(B), and robustness concentration index (RCI). Evaluates Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost on UCI Phishing Websites benchmark under budgeted sanitization-style evasion.

Result: All models achieve AUC ≥ 0.979 under static evaluation but show similar robustness under evasion: median MEC = 2 with full features, over 80% of successful evasions concentrate on three low-cost surface features. Feature restriction only helps when removing all dominant low-cost transitions.

Conclusion: Adversarial robustness in phishing detection is governed by feature economics rather than model complexity, with fundamental limits on raising MEC without changing feature representation or cost model.

Abstract: Phishing detectors built on engineered website features attain near-perfect accuracy under i.i.d. evaluation, yet deployment security depends on robustness to post-deployment feature manipulation. We study this gap through a cost-aware evasion framework that models discrete, monotone feature edits under explicit attacker budgets. Three diagnostics are introduced: minimal evasion cost (MEC), the evasion survival rate $S(B)$, and the robustness concentration index (RCI). On the UCI Phishing Websites benchmark (11,055 instances, 30 ternary features), Logistic Regression, Random Forests, Gradient Boosted Trees, and XGBoost all achieve $\mathrm{AUC}\ge 0.979$ under static evaluation. Under budgeted sanitization-style evasion, robustness converges across architectures: the median MEC equals 2 with full features, and over 80% of successful minimal-cost evasions concentrate on three low-cost surface features. Feature restriction improves robustness only when it removes all dominant low-cost transitions. Under strict cost schedules, infrastructure-leaning feature sets exhibit 17-19% infeasible mass for ensemble models, while the median MEC among evadable instances remains unchanged. We formalize this convergence: if a positive fraction of correctly detected phishing instances admit evasion through a single feature transition of minimal cost $c_{\min}$, no classifier can raise the corresponding MEC quantile above $c_{\min}$ without modifying the feature representation or cost model. Adversarial robustness in phishing detection is governed by feature economics rather than model complexity.
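
A brute-force sketch of the minimal evasion cost (MEC) diagnostic may help fix ideas. The detector, weights, per-feature costs, and two-feature budget below are toy assumptions, not the paper's 30-feature UCI setup: MEC is the cheapest total edit cost that flips the detector's decision on a detected phishing instance.

```python
import itertools
import numpy as np

w = np.array([2.0, 1.5, 1.0, 0.5])      # toy linear detector weights
b = -0.5
costs = np.array([3.0, 2.0, 1.0, 1.0])  # per-feature edit costs

def is_phishing(x):
    return w @ x + b > 0

def minimal_evasion_cost(x, budget_features=2):
    """Brute-force MEC: try all edits of up to `budget_features` features,
    each edited feature set to any ternary value in {-1, 0, 1}."""
    best = np.inf
    for k in range(1, budget_features + 1):
        for idx in itertools.combinations(range(len(x)), k):
            for vals in itertools.product([-1, 0, 1], repeat=k):
                z = x.copy()
                z[list(idx)] = vals
                changed = [i for i in idx if z[i] != x[i]]
                if changed and not is_phishing(z):
                    best = min(best, costs[changed].sum())
    return best

x = np.array([1, 1, 1, 1])              # correctly detected phishing instance
print(minimal_evasion_cost(x))          # → 3.0
```

In this toy example no single edit evades the detector, but editing the two cheap mid-weight features together does, so MEC = 3.0; the paper's concentration finding corresponds to most instances admitting such low-cost evasions through the same few surface features.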

[584] Unlocking Full Efficiency of Token Filtering in Large Language Model Training

Di Chai, Pengbo Li, Feiyuan Zhang, Yilun Jin, Han Tian, Kaiqiang Xu, Binhang Yuan, Dian Shen, Junxue Zhang, Kai Chen

Main category: cs.LG

TL;DR: Centrifuge is a system that accelerates LLM training by filtering inconsequential tokens through algorithm-system co-design, achieving up to 34.7% faster training while preserving or improving model utility.

Motivation: Existing token filtering methods fail to achieve real-world efficiency gains due to insufficient sparsity and incompatibility with standard ML libraries, despite the potential computational benefits of processing fewer tokens.

Method: Centrifuge uses algorithm-system co-design: (1) algorithmically filters activations of inconsequential tokens in attention backward kernels to amplify sparsity, and (2) systemically transforms sparse GEMM operations into dimension-reduced dense GEMM for optimized efficiency using standard ML libraries.

Result: Centrifuge reduces backpropagation time by up to 49.9% and end-to-end training time by up to 34.7% when filtering 50% of tokens, while preserving utility benefits and even improving model performance by up to 26.6% compared to standard training.

Conclusion: Centrifuge successfully unleashes the efficiency potential of token filtering through co-design, offering significant training acceleration with minimal code changes while maintaining or enhancing model quality.

Abstract: Token filtering has been proposed to enhance the utility of large language models (LLMs) by eliminating inconsequential tokens during training. While using fewer tokens is expected to reduce computational workloads, existing methods have not yet achieved a real-world efficiency boost. This is primarily due to two factors: (1) existing work has inadequate sparsity for speedup, and (2) token filtering operates within a sparsity range that is non-standard in existing machine learning (ML) libraries and thus cannot be efficiently supported. This paper presents Centrifuge, a system that leverages algorithm and system co-design to unleash the full efficiency of token filtering in LLM training. At the algorithm level, Centrifuge filters activations of inconsequential tokens in the attention backward kernel to amplify the sparsity in backward computation. At the system level, Centrifuge proposes an automatic workflow that transforms sparse GEMM into dimension-reduced dense GEMM for optimized efficiency using standard ML libraries. Evaluations on models at various scales, from 1.1B to 40B, demonstrate that Centrifuge reduces backpropagation time by up to 49.9% and end-to-end training time by up to 34.7% when filtering 50% of tokens. Utility assessments indicate that Centrifuge preserves the utility benefits of token filtering and significantly enhances model performance by up to 26.6% compared to standard training. Centrifuge is designed for seamless integration into existing LLM training frameworks, enabling systems already utilizing token filtering to accelerate training with just one line of code.
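
The system-level transformation can be illustrated in a few lines. This is our reading of the general idea, not Centrifuge's actual kernels: token filtering zeroes the gradient rows of dropped tokens, which standard libraries would still process as a full-size (row-sparse) GEMM, whereas gathering the surviving rows first yields a smaller, fully dense GEMM with the same result.

```python
import numpy as np

rng = np.random.default_rng(0)

T, D, H = 8, 4, 5                     # tokens, model dim, hidden dim
grad_out = rng.normal(size=(T, H))    # upstream gradient per token
acts = rng.normal(size=(T, D))        # saved activations per token
keep = np.array([0, 2, 3, 7])         # tokens surviving filtering (4 of 8)

# Sparse formulation: full-size GEMM with dropped-token rows masked to zero.
mask = np.zeros(T)
mask[keep] = 1.0
grad_w_sparse = (acts * mask[:, None]).T @ grad_out   # (D, H)

# Dimension-reduced dense formulation: gather first, then a (k, D) x (k, H)
# GEMM over only the kept tokens.
grad_w_dense = acts[keep].T @ grad_out[keep]          # same (D, H) result

assert np.allclose(grad_w_sparse, grad_w_dense)
```

The dense formulation does roughly half the FLOPs here (4 of 8 token rows) while using only standard dense GEMM, which is the shape of savings the reported 49.9% backpropagation-time reduction relies on.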

[585] Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

Main category: cs.LG

TL;DR: Discrete Transformer architecture bridges continuous representations and discrete symbolic logic for algorithm extraction from trained models, overcoming representation entanglement issues in standard Transformers.

Motivation: Standard Transformers suffer from representation entanglement (superposition), which hinders algorithm extraction, i.e., the synthesis of executable programs from models trained on algorithmic tasks, and thus blocks de novo algorithm discovery without human-written code.

Method: Proposes Discrete Transformer with temperature-annealed sampling to inject discreteness, enabling hypothesis testing and symbolic regression to extract human-readable programs from continuous representations.

Result: Achieves performance comparable to RNN-based methods while extending interpretability to continuous variable domains, with annealing dynamics showing clear exploration-to-exploitation transitions.

Conclusion: Architectural inductive biases provide fine-grained control over synthesized programs, establishing Discrete Transformer as robust framework for demonstration-free algorithm discovery and Transformer interpretability.

Abstract: Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, applying this paradigm to Transformer is hindered by representation entanglement (e.g., superposition), where entangled features encoded in overlapping directions obstruct the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework effectively leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to RNN-based methods while extending interpretability to continuous variable domains, and the annealing dynamics exhibit a clear exploration-to-exploitation transition. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a robust framework for demonstration-free algorithm discovery and Transformer interpretability.
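
Temperature-annealed sampling, the mechanism used to inject discreteness, can be sketched generically. The schedule and logits below are illustrative; the paper's exact annealing scheme is not given here. Sampling through a softmax whose temperature decays lets early training explore soft mixtures while late training commits to discrete, symbolically extractable choices.

```python
import numpy as np

def softmax(logits, temperature):
    """Temperature-scaled softmax: soft at high T, near one-hot as T -> 0."""
    z = logits / temperature
    z = z - z.max()                   # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5])
for t in [2.0, 1.0, 0.1, 0.01]:       # annealing schedule (illustrative)
    print(f"T={t}: {np.round(softmax(logits, t), 3)}")
# As T -> 0 the distribution collapses onto the argmax (index 0):
# exploration gives way to exploitation, mirroring the clear
# exploration-to-exploitation transition reported for the annealing dynamics.
```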

[586] Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

Injin Kong, Hyoungjoon Lee, Yohan Jo

Main category: cs.LG

TL;DR: Post-training autoregressive models into masked diffusion models causes systematic mechanism shifts: MDMs preserve autoregressive circuitry for local tasks but develop new bidirectional reasoning for global planning tasks, showing genuine computational reorganization rather than just parameter adjustment.

Motivation: The paper aims to understand whether converting autoregressive models to masked diffusion models through post-training actually enables genuine bidirectional reasoning or merely repackages autoregressive heuristics, addressing a gap in understanding the internal algorithmic changes induced by this shift.

Method: The researchers conduct comparative circuit analysis of autoregressive models and their masked diffusion model counterparts, examining internal algorithmic changes and computational pathways to understand the nature of the transformation.

Result: The analysis reveals a systematic “mechanism shift”: MDMs preserve autoregressive circuitry for tasks with local causal dependencies, but for global planning tasks they abandon initialized pathways and show distinct rewiring with increased early-layer processing. At the semantic level, there’s a transition from sharp, localized specialization in ARMs to distributed integration in MDMs.

Conclusion: Diffusion post-training doesn’t just adjust model parameters but fundamentally reorganizes internal computation to support non-sequential global planning, demonstrating that MDMs acquire genuine bidirectional reasoning capabilities rather than merely repackaging autoregressive heuristics.

Abstract: Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome the limitations of sequential generation. Yet the internal algorithmic changes induced by this shift remain poorly understood, leaving it unclear whether post-trained MDMs acquire genuine bidirectional reasoning or merely repackage autoregressive heuristics. We address this question through a comparative circuit analysis of ARMs and their MDM counterparts. Our analysis reveals a systematic “mechanism shift” that depends on the structural nature of the task. MDMs largely preserve autoregressive circuitry for tasks driven by local causal dependencies, but for global planning tasks they abandon initialized pathways and exhibit distinct rewiring with increased early-layer processing. At the semantic level, we observe a transition from sharp, localized specialization in ARMs to distributed integration in MDMs. These findings show that diffusion post-training does not simply adjust model parameters, but reorganizes internal computation to support non-sequential global planning.

[587] CeRA: Breaking the Linear Ceiling of Low-Rank Adaptation via Manifold Expansion

Hung-Hsuan Chen

Main category: cs.LG

TL;DR: CeRA introduces a capacity-enhanced rank adaptation method that breaks LoRA’s linear ceiling in complex reasoning tasks through SiLU gating and structural dropout, achieving superior performance with fewer parameters.

Motivation: LoRA faces a "linear ceiling" limitation in complex reasoning tasks where simply increasing rank yields diminishing returns due to intrinsic linear constraints. There is a need for more efficient parameter adaptation methods that can handle complex reasoning without exponentially increasing parameters.

Method: CeRA is a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. It enhances capacity beyond linear constraints through non-linear activation and dropout mechanisms, enabling more efficient parameter usage.

Result: On SlimOrca benchmark, CeRA at rank 64 (PPL 3.89) outperforms LoRA at rank 512 (PPL 3.90). In mathematical reasoning, CeRA at rank 64 (Pass@1 16.36%) outperforms LoRA at rank 512 (Pass@1 15.72%) on MATH dataset, achieving superior accuracy with only 1/8 parameters.

Conclusion: CeRA successfully breaks LoRA’s linear ceiling through manifold expansion, demonstrating extreme parameter efficiency in complex reasoning tasks. SVD analysis confirms it activates dormant singular value spectrum, preventing rank collapse observed in linear methods.

Abstract: Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning (PEFT). However, it faces a critical “linear ceiling” in complex reasoning tasks: simply increasing the rank yields diminishing returns due to intrinsic linear constraints. We introduce CeRA (Capacity-enhanced Rank Adaptation), a weight-level parallel adapter that injects SiLU gating and structural dropout to induce manifold expansion. On the SlimOrca benchmark, CeRA breaks this linear barrier: at rank 64 (PPL 3.89), it outperforms LoRA at rank 512 (PPL 3.90), demonstrating superior spectral efficiency. This advantage generalizes to downstream mathematical reasoning. We reveal a profound impact of task complexity: In fundamental arithmetic (GSM8K), CeRA matches standard baselines, but in the highly complex MATH dataset, CeRA demonstrates extreme parameter efficiency. Remarkably, CeRA at rank 64 (Pass@1 16.36%) outperforms LoRA at rank 512 (Pass@1 15.72%), achieving superior reasoning accuracy with only 1/8 of the parameter budget. Mechanism analysis via Singular Value Decomposition (SVD) confirms that CeRA activates the dormant tail of the singular value spectrum, effectively preventing the rank collapse observed in linear methods.
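
A gated low-rank parallel adapter in the spirit described above can be sketched as follows. The wiring (output = W0·x + B·silu(A·x), dropout on the adapter path) is our assumption from the abstract's mention of SiLU gating and structural dropout; CeRA's exact formulation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(z):
    """SiLU / swish activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

d, r = 16, 4                        # model dim, adapter rank
W0 = rng.normal(size=(d, d))        # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.1   # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-initialized
                                    # (as in LoRA) so the adapter starts as a no-op

def cera_style_forward(x, train=False, drop_p=0.1):
    h = silu(A @ x)                 # non-linearity on the low-rank path is what
                                    # would break a purely linear update
    if train:                       # structural dropout on the adapter path
        h = h * (rng.random(r) >= drop_p)
    return W0 @ x + B @ h

x = rng.normal(size=d)
# With B zero-initialized the adapted model exactly matches the base model.
assert np.allclose(cera_style_forward(x), W0 @ x)
```

Contrast with LoRA, whose update B·A·x is linear in x: the SiLU on the rank-r path makes the adapter's effective update input-dependent, which is one way to read the "manifold expansion" claim.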

[588] Improved Learning Rates for Stochastic Optimization

Shaojie Li, Pengwei Tang, Yong Liu

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2107.08686: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2107.08686&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[589] Rule Extraction in Machine Learning: Chat Incremental Pattern Constructor

Caleb Princewill Nwokocha

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2208.00335: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2208.00335&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[590] Inverse classification with logistic and softmax classifiers: efficient optimization

Miguel Á. Carreira-Perpiñán, Suryabhan Singh Hada

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2309.08945: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2309.08945&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[591] Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods

Zijian Liu, Zhengyuan Zhou

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2312.08531: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.08531&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[592] On Minimal Depth in Neural Networks

Juan L. Valerdi

Main category: cs.LG

TL;DR: Failed to fetch summary for paper 2402.15315 due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2402.15315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.15315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[593] $μ$LO: Compute-Efficient Meta-Generalization of Learned Optimizers

Benjamin Thérien, Charles-Étienne Joseph, Boris Knyazev, Edouard Oyallon, Irina Rish, Eugene Belilovsky

Main category: cs.LG

TL;DR: Paper 2406.00153: Could not fetch summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2406.00153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.00153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[594] Modeling Inverse Ellipsometry Problem via Flow Matching with a Large-Scale Dataset

Yiming Ma, Jianzhi Teng, Xinjie Li, Xin Sun, Zhiyong Wang, Yuzhou Song, Lionel Z. Wang, Bin Chen

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2407.17869: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.17869&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[595] ODE-Constrained Generative Modeling of Cardiac Dynamics for 12-Lead ECG Synthesis

Yakir Yehuda, Kira Radinsky

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2409.17833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.17833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[596] Vector Optimization with Gaussian Process Bandits

İlter Onat Korkmaz, Yaşar Cahit Yıldırım, Çağın Ararat, Cem Tekin

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2412.02484: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.02484&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[597] This looks like what? Challenges and Future Research Directions for Part-Prototype Models

Khawla Elhadri, Tomasz Michalski, Adam Wróbel, Jörg Schlötterer, Bartosz Zieliński, Christin Seifert

Main category: cs.LG

TL;DR: Unable to analyze paper 2502.09340 due to HTTP 429 error when fetching from arXiv API

Abstract: Failed to fetch summary for 2502.09340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.09340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[598] Integrating Weather Station Data and Radar for Precipitation Nowcasting: SmaAt-fUsion and SmaAt-Krige-GNet

Jie Shi, Aleksej Cornelissen, Siamak Mehrkanoon

Main category: cs.LG

TL;DR: Unable to analyze paper 2502.16116 due to HTTP 429 error when fetching from arXiv API

Abstract: Failed to fetch summary for 2502.16116: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.16116&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[599] Causal Intervention Framework for Variational Auto Encoder Mechanistic Interpretability

Dip Roy

Main category: cs.LG

TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API

Abstract: Failed to fetch summary for 2505.03530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.03530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[600] Modality Equilibrium Matters: Minor-Modality-Aware Adaptive Alternating for Cross-Modal Memory Enhancement

Xiang Shi, Rui Zhang, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2506.00030: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.00030&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[601] GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM

Kyeongjin Ahn, Sungwon Han, Seungeon Lee, Donghyun Ahn, Hyoshin Kim, Jungwon Kim, Jihee Kim, Sangyoon Park, Meeyoung Cha

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2507.13323: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.13323&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[602] Online Convex Optimization with Heavy Tails: Old Algorithms, New Regrets, and Applications

Zijian Liu

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2508.07473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.07473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[603] Physics-informed neural network for predicting fatigue life of unirradiated and irradiated austenitic and ferritic/martensitic steels under reactor-relevant conditions

Dhiraj S Kori, Abhinav Chandraker, Syed Abdur Rahman, Punit Rathore, Ankur Chauhan

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2508.17303: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17303&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[604] Fourier Learning Machines: Nonharmonic Fourier-Based Neural Networks for Scientific Machine Learning

Mominul Rubel, Adam Meyers, Gabriel Nicolosi

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2509.08759: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.08759&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[605] Pi-transformer: A prior-informed dual-attention model for multivariate time-series anomaly detection

Sepehr Maleki, Negar Pourmoazemi

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2509.19985: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.19985&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[606] Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

Shuofeng Zhang, Ard Louis

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2509.21181: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.21181&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[607] OT-MeanFlow3D: Bridging Optimal Transport and Meanflow for Efficient 3D Point Cloud Generation

Elaheh Akbari, Shansita Sharma, Ping He, Ahmadreza Moradipari, Kyungtae Han, Hamed Pirsiavash, Yikun Bai, Soheil Kolouri

Main category: cs.LG

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2509.22592: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22592&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[608] Splines-Based Feature Importance in Kolmogorov-Arnold Networks: A Framework for Supervised Tabular Data Dimensionality Reduction

Ange-Clément Akazan, Verlon Roel Mbingui

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2509.23366: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23366&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[609] Support Basis: Fast Attention Beyond Bounded Entries

Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang

Main category: cs.LG

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Abstract: Failed to fetch summary for 2510.01643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[610] An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Haolin Liu, Chen-Yu Wei, Julian Zimmert

Main category: cs.LG

TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API

Abstract: Failed to fetch summary for 2510.08882: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08882&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[611] FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching

Bernardo Perrone Ribeiro, Jana Faganeli Pucer

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2511.09731 returned HTTP 429 (rate limited).

[612] Image2Gcode: Image-to-G-code Generation for Additive Manufacturing Using Diffusion-Transformer Model

Ziyue Wang, Yayati Jadhav, Peter Pak, Amir Barati Farimani

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2511.20636 returned HTTP 429 (rate limited).

[613] Capturing reduced-order quantum many-body dynamics out of equilibrium via neural ordinary differential equations

Patrick Egenlauf, Iva Březinová, Sabine Andergassen, Miriam Klopotek

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2512.13913 returned HTTP 429 (rate limited).

[614] DeeperBrain: A Neuro-Grounded EEG Foundation Model Towards Universal BCI

Jiquan Wang, Sha Zhao, Yangxuan Zhou, Yiming Kang, Shijian Li, Gang Pan

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.06134 returned HTTP 429 (rate limited).

[615] Multimodal Machine Learning for Soft High-k Elastomers under Data Scarcity

Brijesh FNU, Viet Thanh Duy Nguyen, Ashima Sharma, Md Harun Rashid Molla, Chengyi Xu, Truong-Son Hy

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.18032 returned HTTP 429 (rate limited).

[616] A Unified Generalization Framework for Model Merging: Trade-offs, Non-Linearity, and Scaling Laws

Qinglun Li, Anke Tang, Miao Zhang, Mengzhu Wang, Quanjun Yin, Li Shen

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.21690 returned HTTP 429 (rate limited).

[617] Mixed-Precision Training and Compilation for RRAM-based Computing-in-Memory Accelerators

Rebecca Pelke, Joel Klein, Jose Cubero-Cascante, Nils Bosbach, Jan Moritz Joseph, Rainer Leupers

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.21737 returned HTTP 429 (rate limited).

[618] Koopman Autoencoders with Continuous-Time Latent Dynamics for Fluid Dynamics Forecasting

Rares Grozavescu, Pengyu Zhang, Etienne Meunier, Mark Girolami

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.02832 returned HTTP 429 (rate limited).

[619] TS-Haystack: A Multi-Scale Retrieval Benchmark for Time Series Language Models

Nicolas Zumarraga, Thomas Kaar, Ning Wang, Maxwell A. Xu, Max Rosenblattl, Markus Kreft, Kevin O’Sullivan, Paul Schmiedmayer, Patrick Langer, Robert Jakob

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.14200 returned HTTP 429 (rate limited).

[620] Causality is Key for Interpretability Claims to Generalise

Shruti Joshi, Aaron Mueller, David Klindt, Wieland Brendel, Patrik Reizinger, Dhanya Sridhar

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.16698 returned HTTP 429 (rate limited).

[621] Benchmarking State Space Models, Transformers, and Recurrent Networks for US Grid Forecasting

Sunki Hong, Jisoo Lee, Yuanyuan Shi

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.21415 returned HTTP 429 (rate limited).

[622] Improving Spatial Allocation for Energy System Coupling with Graph Neural Networks

Xuanhao Mu, Jakob Geiges, Nan Liu, Thorsten Schlachter, Veit Hagenmeyer

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.22249 returned HTTP 429 (rate limited).

[623] What You Read is What You Classify: Highlighting Attributions to Text and Text-Like Inputs

Daniel S. Berman, Brian Merritt, Stanley Ta, Dana Udwin, Amanda Ernlund, Jeremy Ratcliff, Vijay Narayan

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.24149 returned HTTP 429 (rate limited).

[624] AutoResearch-RL: Perpetual Self-Evaluating Reinforcement Learning Agents for Autonomous Neural Architecture Discovery

Nilesh Jain, Rohit Yadav, Sagar Kotian, Claude AI

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.07300 returned HTTP 429 (rate limited).

[625] Nonparametric Variational Differential Privacy via Embedding Parameter Clipping

Dina El Zein, Shashi Kumar, James Henderson

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.09583 returned HTTP 429 (rate limited).

[626] Systematic Scaling Analysis of Jailbreak Attacks in Large Language Models

Xiangwen Wang, Ananth Balashankar, Varun Chandrasekaran

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.11149 returned HTTP 429 (rate limited).

[627] A Stability-Aware Frozen Euler Autoencoder for Physics-Informed Tracking in Continuum Mechanics (SAFE-PIT-CM)

Emil Hovad

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.13280 returned HTTP 429 (rate limited).

[628] Privacy-Preserving Machine Learning for IoT: A Cross-Paradigm Survey and Future Roadmap

Zakia Zaman, Praveen Gauravaram, Mahbub Hassan, Sanjay Jha, Wen Hu

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.13570 returned HTTP 429 (rate limited).

[629] Flow Matching Policy with Entropy Regularization

Ting Gao, Stavros Orfanoudakis, Nan Lin, Elvin Isufi, Winnie Daamen, Serge Hoogendoorn

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.17685 returned HTTP 429 (rate limited).

[630] “Calibeating”: Beating Forecasters at Their Own Game

Dean P. Foster, Sergiu Hart

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2209.04892 returned HTTP 429 (rate limited).

[631] Hidden yet quantifiable: A lower bound for confounding strength using randomized trials

Piersilvio De Bartolomeis, Javier Abad, Konstantin Donhauser, Fanny Yang

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2312.03871 returned HTTP 429 (rate limited).

[632] Combining T-learning and DR-learning: a framework for oracle-efficient estimation of causal contrasts

Lars van der Laan, Marco Carone, Alex Luedtke

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2402.01972 returned HTTP 429 (rate limited).

[633] Assessing the Distributional Fidelity of Synthetic Chest X-rays using the Embedded Characteristic Score

Edric Tam, Barbara E Engelhardt

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2501.00744 returned HTTP 429 (rate limited).

[634] Multifidelity Simulation-based Inference for Computationally Expensive Simulators

Anastasia N. Krouglova, Hayden R. Johnson, Basile Confavreux, Michael Deistler, Pedro J. Gonçalves

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2502.08416 returned HTTP 429 (rate limited).

[635] Learning Transferable Friction Models and LuGre Identification Via Physics-Informed Neural Networks

Asutay Ozmen, João P. Hespanha, Katie Byl

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2504.12441 returned HTTP 429 (rate limited).

[636] Visualization Tasks for Unlabeled Graphs

Matt I. B. Oddo, Ryan Smith, Stephen Kobourov, Tamara Munzner

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2504.14115 returned HTTP 429 (rate limited).

[637] Hardware-Aware Neural Architecture Search for Encrypted Traffic Classification on Resource-Constrained Devices

Adel Chehade, Edoardo Ragusa, Paolo Gastaldo, Rodolfo Zunino

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2506.11319 returned HTTP 429 (rate limited).

[638] First-Order Sparse Convex Optimization: Better Rates with Sparse Updates

Dan Garber

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2506.19075 returned HTTP 429 (rate limited).

[639] Recurrent neural network-based robust control systems with regional properties and application to MPC design

Daniele Ravasio, Marcello Farina, Alessio La Bella, Andrea Ballarino

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2506.20334 returned HTTP 429 (rate limited).

[640] Entity-Specific Cyber Risk Assessment using InsurTech Empowered Risk Factors

Jiayi Guo, Zhiyu Quan, Linfeng Zhang

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2507.08193 returned HTTP 429 (rate limited).

[641] Transfer Learning for Neutrino Scattering: Domain Adaptation with GANs

Jose L. Bonilla, Krzysztof M. Graczyk, Artur M. Ankowski, Rwik Dharmapal Banerjee, Beata E. Kowal, Hemant Prasad, Jan T. Sobczyk

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2508.12987 returned HTTP 429 (rate limited).

[642] Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)

Nikita Kornilov, David Li, Tikhon Mavrin, Aleksei Leonov, Nikita Gushchin, Evgeny Burnaev, Iaroslav Koshelev, Alexander Korotin

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2509.22459 returned HTTP 429 (rate limited).

[643] Transformer-Based Rate Prediction for Multi-Band Cellular Handsets

Ruibin Chen, Haozhe Lei, Hao Guo, Marco Mezzavilla, Hitesh Poddar, Tomoki Yoshimura, Sundeep Rangan

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2509.25722 returned HTTP 429 (rate limited).

[644] Bridging Earth and Space: A Survey on HAPS for Non-Terrestrial Networks

G. Svistunov, A. Akhtarshenas, D. López-Pérez, M. Giordani, G. Geraci, H. Yanikomeroglu

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2510.19731 returned HTTP 429 (rate limited).

[645] Generalization of Long-Range Machine Learning Potentials in Complex Chemical Spaces

Michal Sanocki, Julija Zavadlav

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2512.10989 returned HTTP 429 (rate limited).

[646] Linear Attention for Joint Power Optimization and User-Centric Clustering in Cell-Free Networks

Irched Chafaa, Giacomo Bacci, Luca Sanguinetti

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2512.17466 returned HTTP 429 (rate limited).

[647] A Structured Nonparametric Framework for Nonlinear Accelerated Failure Time Models (KAN-AFT)

Mebin Jose, Jisha Francis, Sudheesh Kumar Kattumannil

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2512.20305 returned HTTP 429 (rate limited).

[648] Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2512.23178 returned HTTP 429 (rate limited).

[649] Studying the Role of Synthetic Data for Machine Learning-based Wireless Networks Traffic Forecasting

José Pulido, Francesc Wilhelmi, Sergio Fortes, Alfonso Fernández-Durán, Lorenzo Galati Giordano, Raquel Barco

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.07646 returned HTTP 429 (rate limited).

[650] Multi-Preconditioned LBFGS for Training Finite-Basis PINNs

Marc Salvadó-Benasco, Aymane Kssim, Alexander Heinlein, Rolf Krause, Serge Gratton, Alena Kopaničáková

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2601.08709 returned HTTP 429 (rate limited).

[651] Age-Aware Edge-Blind Federated Learning via Over-the-Air Aggregation

Ahmed M. Elshazly, Ahmed Arafa

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.02469 returned HTTP 429 (rate limited).

[652] Theory and interpretability of Quantum Extreme Learning Machines: a Pauli-transfer matrix approach

Markus Gross, Hans-Martin Rieser

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2602.18377 returned HTTP 429 (rate limited).

[653] Score Reversal Is Not Free for Quantum Diffusion Models

Ammar Fayad

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.06488 returned HTTP 429 (rate limited).

[654] Learning-to-Defer with Expert-Conditioned Advice

Yannis Montreuil, Leïna Montreuil, Axel Carlier, Lai Xing Ng, Wei Tsang Ooi

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.14324 returned HTTP 429 (rate limited).

[655] $K$-means with learned metrics

Pablo Groisman, Matthieu Jonckheere, Jordan Serres, Mariela Sued

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.14601 returned HTTP 429 (rate limited).

[656] Neural Networks as Local-to-Global Computations

Vicente Bosca, Robert Ghrist

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.14831 returned HTTP 429 (rate limited).

[657] The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery

Narjes Ansari, César Feniou, Nicolaï Gouraud, Daniele Loco, Siwar Badreddine, Baptiste Claudon, Félix Aviat, Marharyta Blazhynska, Kevin Gasperich, Guillaume Michel, Diata Traore, Corentin Villot, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal

Main category: cs.LG

TL;DR: Summary unavailable; the arXiv API request for 2603.17790 returned HTTP 429 (rate limited).

cs.MA

[658] The Provenance Paradox in Multi-Agent LLM Routing: Delegation Contracts and Attested Identity in LDP

Sunil Prakash

Main category: cs.MA

TL;DR: Multi-agent LLM systems need delegation protocols that handle dishonest quality claims, as current quality-based routing paradoxically selects worst performers when delegates inflate self-reported scores.

DetailsMotivation: Current multi-agent LLM delegation protocols fail to govern delegation under unverifiable quality claims, creating a "provenance paradox" where quality-based routing systematically selects the worst delegates when they can inflate self-reported scores.

Method: Extends LLM Delegate Protocol (LDP) with: 1) delegation contracts bounding authority through objectives, budgets, and failure policies; 2) claimed-vs-attested identity model distinguishing self-reported from verified quality; 3) typed failure semantics enabling automated recovery.

Result: Routing by self-claimed quality scores performs worse than random selection (simulated: 0.55 vs. 0.68; real Claude models: 8.90 vs. 9.30), while attested routing achieves near-optimal performance (d = 9.51, p < 0.001). A sensitivity analysis across 36 configurations confirms the paradox emerges reliably whenever dishonest delegates are present.

Conclusion: The provenance paradox in multi-agent LLM systems requires explicit delegation contracts and attested quality verification to prevent dishonest delegates from gaming quality-based routing systems, with backward-compatible extensions adding minimal overhead.

Abstract: Multi-agent LLM systems delegate tasks across trust boundaries, but current protocols do not govern delegation under unverifiable quality claims. We show that when delegates can inflate self-reported quality scores, quality-based routing produces a provenance paradox: it systematically selects the worst delegates, performing worse than random. We extend the LLM Delegate Protocol (LDP) with delegation contracts that bound authority through explicit objectives, budgets, and failure policies; a claimed-vs-attested identity model that distinguishes self-reported from verified quality; and typed failure semantics enabling automated recovery. In controlled experiments with 10 simulated delegates and validated with real Claude models, routing by self-claimed quality scores performs worse than random selection (simulated: 0.55 vs. 0.68; real models: 8.90 vs. 9.30), while attested routing achieves near-optimal performance (d = 9.51, p < 0.001). Sensitivity analysis across 36 configurations confirms the paradox emerges reliably when dishonest delegates are present. All extensions are backward-compatible with sub-microsecond validation overhead.
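
The routing failure mode can be sketched with a toy simulation (delegate names and scores here are invented for illustration, not the paper's experimental setup): when a delegate can inflate its self-reported score, routing on claimed quality selects it even though its true quality is below the random-selection average, while routing on attested scores recovers the best delegate.

```python
def route(delegates, key):
    """Select the delegate with the highest score under the given key."""
    return max(delegates, key=lambda d: d[key])

# Toy delegates: "true" is the real task quality; the dishonest one claims 1.0.
# (Illustrative numbers only, not the paper's experimental values.)
delegates = [
    {"name": "honest-good", "true": 0.90, "claimed": 0.90, "attested": 0.90},
    {"name": "honest-ok",   "true": 0.70, "claimed": 0.70, "attested": 0.70},
    {"name": "dishonest",   "true": 0.30, "claimed": 1.00, "attested": 0.30},
]

by_claim = route(delegates, "claimed")    # routing on self-reported scores
by_attest = route(delegates, "attested")  # routing on verified scores
random_quality = sum(d["true"] for d in delegates) / len(delegates)

# The claimed-score router picks the worst delegate (true quality 0.30),
# below even the random-selection average; the attested router picks the best.
```

This is the paradox in miniature: the ranking signal is supplied by the parties being ranked, so the most dishonest delegate dominates it.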

[659] A Trace-Based Assurance Framework for Agentic AI Orchestration: Contracts, Testing, and Governance

Ciprian Paduraru, Petru-Liviu Bouruc, Alin Stefanescu

Main category: cs.MA

TL;DR: A framework for testing and assuring Agentic AI systems where LLMs orchestrate multiple agents, with contracts, stress testing, fault injection, and governance components.

DetailsMotivation: Agentic AI systems using LLMs for orchestration face complex failures beyond incorrect outputs, including non-termination, role drift, propagation of unsupported claims, and security vulnerabilities from untrusted context or external channels.

Method: Instrument executions as Message-Action Traces (MAT) with explicit step and trace contracts; implement stress testing as budgeted counterexample search; use structured fault injection at service boundaries; and include runtime governance with capability limits and action mediation.

Result: Framework provides machine-checkable verdicts, localizes first violating steps, supports deterministic replay, and enables comparative evaluation across stochastic seeds, models, and configurations using trace-based metrics.

Conclusion: The framework serves as a common abstraction for testing multi-agent LLM systems, facilitating reproducible comparison across orchestration designs and improving reliability assurance in Agentic AI.

Abstract: In Agentic AI, Large Language Models (LLMs) are increasingly used in the orchestration layer to coordinate multiple agents and to interact with external services, retrieval components, and shared memory. In this setting, failures are not limited to incorrect final outputs. They also arise from long-horizon interaction, stochastic decisions, and external side effects (such as API calls, database writes, and message sends). Common failures include non-termination, role drift, propagation of unsupported claims, and attacks via untrusted context or external channels. This paper presents an assurance framework for such Agentic AI systems. Executions are instrumented as Message-Action Traces (MAT) with explicit step and trace contracts. Contracts provide machine-checkable verdicts, localize the first violating step, and support deterministic replay. The framework includes stress testing, formulated as a budgeted counterexample search over bounded perturbations. It also supports structured fault injection at service, retrieval, and memory boundaries to assess containment under realistic operational faults and degraded conditions. Finally, governance is treated as a runtime component, enforcing per-agent capability limits and action mediation (allow, rewrite, block) at the language-to-action boundary. To support comparative evaluations across stochastic seeds, models, and orchestration configurations, the paper defines trace-based metrics for task success, termination reliability, contract compliance, factuality indicators, containment rate, and governance outcome distributions. More broadly, the framework is intended as a common abstraction to support testing and evaluation of multi-agent LLM systems, and to facilitate reproducible comparison across orchestration designs and configurations.
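
The contract idea can be sketched as a trace checker (the step schema and capability contract below are hypothetical; the paper's MAT format and contract API may differ): each step is validated against a predicate, and the verdict localizes the first violating step for replay and debugging.

```python
def check_trace(trace, step_contract):
    """Return (ok, first_violation_index); index is None when all steps pass."""
    for i, step in enumerate(trace):
        if not step_contract(step):
            return False, i
    return True, None

# Hypothetical step contract: an agent may only invoke tools it is permitted to use.
ALLOWED = {"planner": {"search"}, "executor": {"search", "write_db"}}

def contract(step):
    return step["action"] in ALLOWED.get(step["agent"], set())

trace = [
    {"agent": "planner",  "action": "search"},
    {"agent": "planner",  "action": "write_db"},  # capability violation
    {"agent": "executor", "action": "write_db"},
]

ok, first_bad = check_trace(trace, contract)  # ok=False, first_bad=1
```

Localizing the first violating step, rather than just flagging the trace, is what makes deterministic replay and automated recovery practical.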

[660] Game-Theoretic Coordination for Time-Critical Missions of UAV Systems

Mikayel Aramyan, Anna Manucharyan, Lusine Poghosyan, Tigran Bakaryan, Naira Hovakimyan

Main category: cs.MA

TL;DR: Game-theoretic distributed MPC framework for coordinated UAV path following with reduced computational complexity via 1D optimization.

DetailsMotivation: Need for scalable autonomous coordination of UAVs in dynamic environments that maintains both coordination and agility while accommodating heterogeneous objectives.

Method: Combines cooperative path following with game-theoretic formulation where each UAV optimizes cost functions in 1D domain, uses distributed MPC for non-ideal scenarios with path-following errors and communication failures.

Result: For ideal systems, proves the existence of a Nash equilibrium and exponential convergence to it; simulations show effective, agile mission execution in diverse realistic scenarios.

Conclusion: Scalable distributed framework enables real-time coordination of UAV fleets in dynamic environments with reduced computational complexity and robust performance.

Abstract: Coordinated missions involving Unmanned Aerial Vehicles (UAVs) in dynamic environments pose significant challenges in maintaining both coordination and agility. In this paper, relying on the cooperative path following framework and using a game-theoretic formulation, we introduce a novel and scalable approach in which each UAV acts autonomously in different mission conditions. This formulation naturally accommodates heterogeneous and time-varying objectives across the system. In our setting, each UAV optimizes a cost function that incorporates temporal and mission-specific constraints. The optimization is performed within a one-dimensional domain, significantly reducing the computational cost and enabling real-time application to complex and dynamic scenarios. The framework is distributed in structure, enabling global, system-wide coordination (a Nash equilibrium) by using only local information. For ideal systems, we prove the existence of a Nash equilibrium and show that it exhibits exponential convergence. Furthermore, we invoke model predictive control (MPC) for non-ideal scenarios. In particular, we propose a discrete-time optimization approach that tackles path-following errors and communication failures, ensuring reliable and agile performance in dynamic and uncertain environments. Simulation results demonstrate the effectiveness and agility of the approach in ensuring successful mission execution across diverse realistic scenarios.

[661] Adaptive Accountability in Networked MAS: Tracing and Mitigating Emergent Norms at Scale

Saad Alqithami

Main category: cs.MA

TL;DR: AAF is a runtime framework for multi-agent systems that detects norm violations, attributes responsibility via causal graphs, and applies interventions to steer systems toward compliant behavior with bounded compromise guarantees.

DetailsMotivation: Large-scale networked multi-agent systems in critical infrastructure can develop undesirable emergent behaviors like collusion, resource hoarding, and unfairness, requiring automated mechanisms to detect and correct these norm violations while maintaining system functionality.

Method: The Adaptive Accountability Framework (AAF) has four components: (1) cryptographically verifiable interaction provenance recording, (2) distributional change point detection in streaming traces, (3) responsibility attribution via causal influence graphs, and (4) cost-bounded interventions including reward shaping and targeted policy patching.

Result: AAF reduces compromise ratio by median 11.9% compared to PPO baseline in 96% of regimes across 87,480 simulation runs, maintains social welfare (median change 0.4%), detects violations with median 71-step delay, and achieves 0.97 mean top-ranked attribution accuracy at 10% Byzantine rate.

Conclusion: AAF provides a practical framework for ensuring accountability in multi-agent systems with theoretical bounded-compromise guarantees and strong empirical performance across diverse scenarios, making it suitable for deployment in critical infrastructure applications.

Abstract: Large-scale networked multi-agent systems increasingly underpin critical infrastructure, yet their collective behavior can drift toward undesirable emergent norms such as collusion, resource hoarding, and implicit unfairness. We present the Adaptive Accountability Framework (AAF), an end-to-end runtime layer that (i) records cryptographically verifiable interaction provenance, (ii) detects distributional change points in streaming traces, (iii) attributes responsibility via a causal influence graph, and (iv) applies cost-bounded interventions-reward shaping and targeted policy patching-to steer the system back toward compliant behavior. We establish a bounded-compromise guarantee: if the expected cost of intervention exceeds an adversary’s expected payoff, the long-run fraction of compromised interactions converges to a value strictly below one. We evaluate AAF in a large-scale factorial simulation suite (87,480 runs across two tasks; up to 100 agents plus a 500-agent scaling sweep; full and partial observability; Byzantine rates up to 10%; 10 seeds per regime). Across 324 regimes, AAF lowers the executed compromise ratio relative to a Proximal Policy Optimization baseline in 96% of regimes (median relative reduction 11.9%) while preserving social welfare (median change 0.4%). Under adversarial injections, AAF detects norm violations with a median delay of 71 steps (interquartile range 39-177) and achieves a mean top-ranked attribution accuracy of 0.97 at 10% Byzantine rate.
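
AAF's streaming change-point component can be illustrated with a one-sided CUSUM detector, a standard technique used here only as a stand-in for the paper's actual detector (stream values, reference mean, and thresholds are illustrative): it accumulates positive deviations above a reference level and raises an alarm once the cumulative drift crosses a threshold.

```python
def cusum_alarm(stream, mean, slack, threshold):
    """Return the index at which cumulative positive drift above `mean`
    (with allowance `slack`) first exceeds `threshold`, or None."""
    s = 0.0
    for i, x in enumerate(stream):
        s = max(0.0, s + (x - mean - slack))
        if s > threshold:
            return i
    return None

# Toy compromise-ratio stream: baseline ~0.10, drifting upward after step 5.
stream = [0.10, 0.12, 0.09, 0.11, 0.10, 0.40, 0.45, 0.50, 0.48, 0.52]
alarm_at = cusum_alarm(stream, mean=0.10, slack=0.02, threshold=0.5)  # index 6
```

The slack term suppresses false alarms from baseline noise; the threshold trades detection delay against false-positive rate, which is the same trade-off behind the paper's reported median 71-step detection delay.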

[662] The Coordination Gap: Multi-Agent Alternation Metrics for Temporal Fairness in Repeated Games

Nikolaos Al. Papadopoulos, Konstantinos Psannis

Main category: cs.MA

TL;DR: Paper introduces temporally-sensitive metrics for multi-agent coordination, showing conventional metrics fail to detect poor temporal structure despite high aggregate payoffs.

DetailsMotivation: Existing metrics for multi-agent coordination are temporally blind and cannot distinguish structured coordination patterns from random or monopolistic behavior, especially as the number of agents grows.

Method: Introduces Perfect Alternation (PA) as a reference coordination regime and six novel Alternation (ALT) metrics as temporally sensitive observables. Uses Q-learning agents as a diagnostic baseline and compares against random-policy null processes in a multi-agent variant of the Battle of the Exes (BoE), formalized as a Markov game.

Result: Despite high traditional metrics (reward fairness often >0.9), learned policies perform up to 81% below random baselines under ALT metrics. This deficit exists even in two-agent case and intensifies with more agents, showing conventional metrics can severely mischaracterize emergent dynamics.

Conclusion: Temporally aware observables are necessary for analyzing coordination in multi-agent games, and random-policy baselines are essential null processes for interpreting coordination outcomes relative to chance-level behavior.

Abstract: Multi-agent coordination dilemmas expose a fundamental tension between individual optimization and collective welfare, yet characterizing such coordination requires metrics sensitive to temporal structure and collective dynamics. As a diagnostic testbed, we study a BoE-derived multi-agent variant of the Battle of the Exes, formalizing it as a Markov game in which turn-taking emerges as a periodic coordination regime. Conventional outcome-based metrics (e.g., efficiency and min/max fairness) are temporally blind (they cannot distinguish structured alternation from monopolistic or random access patterns) and fairness ratios lose discriminative power as n grows, obscuring inequities. To address this limitation, we introduce Perfect Alternation (PA) as a reference coordination regime and propose six novel Alternation (ALT) metrics designed as temporally sensitive observables of coordination quality. Using Q-learning agents as a minimal adaptive diagnostic baseline, and comparing against random-policy null processes, we uncover a clear measurement failure: despite exhibiting deceptively high traditional metrics (e.g., reward fairness often exceeding 0.9), learned policies perform up to 81% below random baselines under ALT-variant evaluation, a deficit already present in the two-agent case and intensifying as n grows. These results demonstrate, in this setting, that high aggregate payoffs can coexist with poor temporal coordination, and that conventional metrics may severely mischaracterize emergent dynamics. Our findings underscore the necessity of temporally aware observables for analyzing coordination in multi-agent games and highlight random-policy baselines as essential null processes for interpreting coordination outcomes relative to chance-level behavior.
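
A minimal two-agent illustration of why outcome metrics are temporally blind (this switch-rate is a simplified stand-in for the paper's six ALT metrics, not one of them): two winner sequences with identical reward fairness can differ maximally in alternation quality.

```python
def switch_rate(seq):
    """Fraction of consecutive rounds where the winning agent changes:
    1.0 under perfect alternation, near 0 under block monopolies."""
    if len(seq) < 2:
        return 0.0
    return sum(a != b for a, b in zip(seq, seq[1:])) / (len(seq) - 1)

def reward_fairness(seq):
    """min/max ratio of per-agent win counts -- temporally blind."""
    counts = [seq.count(a) for a in set(seq)]
    return min(counts) / max(counts)

alternating = [0, 1, 0, 1, 0, 1, 0, 1]   # perfect turn-taking
blocked     = [0, 0, 0, 0, 1, 1, 1, 1]   # same win totals, no alternation

# Both sequences score a perfect 1.0 on reward fairness, yet their
# switch rates are 1.0 versus 1/7: the temporal structure is invisible
# to the outcome-based metric.
```

This is the measurement failure the paper targets: aggregate fairness cannot distinguish structured alternation from block monopolies or random access.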

cs.MM

[663] EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities

Xinyuan Qian, Xinjia Zhu, Alessio Brutti, Dong Liang

Main category: cs.MM

TL;DR: EgoAdapt framework improves egocentric “Talking to Me” speaker detection by addressing missing visual data, using head orientation cues, and handling noisy audio through adaptive multimodal fusion.

DetailsMotivation: Traditional TTM (Talking to Me) models struggle with real-world challenges: missing visual data, ignoring head orientation cues, and background noise. These limitations hinder robust speaker detection in egocentric video scenarios.

Method: EgoAdapt introduces three modules: 1) Visual Speaker Target Recognition (VSTR) captures head orientation (non-verbal) and lip movement (verbal) cues, 2) Parallel Shared-weight Audio (PSA) encoder enhances audio feature extraction in noise, and 3) Visual Modality Missing Awareness (VMMA) dynamically adjusts system response based on modality availability.

Result: On the TTM benchmark of Ego4D dataset, EgoAdapt achieves 67.39% mAP and 62.01% Accuracy, outperforming state-of-the-art by 4.96% in Accuracy and 1.56% in mAP.

Conclusion: EgoAdapt demonstrates robust performance for egocentric speaker detection by effectively handling missing modalities and noisy environments through adaptive multimodal fusion, advancing TTM task understanding.

Abstract: TTM (Talking to Me) task is a pivotal component in understanding human social interactions, aiming to determine who is engaged in conversation with the camera-wearer. Traditional models often face challenges in real-world scenarios due to missing visual data, neglecting the role of head orientation, and background noise. This study addresses these limitations by introducing EgoAdapt, an adaptive framework designed for robust egocentric “Talking to Me” speaker detection under missing modalities. Specifically, EgoAdapt incorporates three key modules: (1) a Visual Speaker Target Recognition (VSTR) module that captures head orientation as a non-verbal cue and lip movement as a verbal cue, allowing a comprehensive interpretation of both verbal and non-verbal signals to address TTM, setting it apart from tasks focused solely on detecting speaking status; (2) a Parallel Shared-weight Audio (PSA) encoder for enhanced audio feature extraction in noisy environments; and (3) a Visual Modality Missing Awareness (VMMA) module that estimates the presence or absence of each modality at each frame to adjust the system response dynamically. Comprehensive evaluations on the TTM benchmark of the Ego4D dataset demonstrate that EgoAdapt achieves a mean Average Precision (mAP) of 67.39% and an Accuracy (Acc) of 62.01%, significantly outperforming the state-of-the-art method by 4.96% in Accuracy and 1.56% in mAP.
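
The missing-modality handling can be sketched as availability-gated fusion (a generic pattern with toy features; this is not EgoAdapt's actual VMMA architecture): per-frame features are averaged only over the modalities estimated to be present, so a dropped camera frame degrades gracefully to audio-only instead of corrupting the fused representation.

```python
def fuse(features, available):
    """features: modality -> feature vector; available: modality -> bool.
    Averages only the modalities present in this frame."""
    present = [m for m, ok in available.items() if ok]
    if not present:
        raise ValueError("no modality available for this frame")
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for m in present:
        for i, v in enumerate(features[m]):
            fused[i] += v / len(present)
    return fused

frame = {"audio": [1.0, 0.0], "visual": [0.0, 1.0]}

both       = fuse(frame, {"audio": True, "visual": True})   # [0.5, 0.5]
audio_only = fuse(frame, {"audio": True, "visual": False})  # [1.0, 0.0]
```

Renormalizing over the present modalities keeps the fused feature on a consistent scale regardless of how many streams survive a given frame.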

[664] Rethink Web Service Resilience in Space: A Radiation-Aware and Sustainable Transmission Solution

Long Chen, Hao Fang, Yi Ching Chou, Haoyuan Zhao, Xiaoyi Fan, Zhe Chen, Hengzhi Wang, Jiangchuan Liu

Main category: cs.MM

TL;DR: RALT is a radiation-aware traffic routing system for LEO satellite networks that dynamically reroutes traffic during radiation events to minimize battery degradation while maintaining web service performance.

DetailsMotivation: LEO satellite networks integrated with cloud infrastructure provide global internet backbone, but space radiation threatens resilience by degrading hardware, draining batteries, and disrupting web services. Conventional fixes consume energy and risk battery aging or service interruptions.

Method: Proposes RALT (Radiation-Aware LEO Transmission), a control-plane solution that dynamically reroutes traffic during radiation events while accounting for energy constraints to minimize battery degradation.

Result: The paper argues that radiation-aware, energy-constrained rerouting can sustain web service performance while minimizing battery degradation during radiation events, positioning this as a prerequisite for reliable global connectivity from space.

Conclusion: Unlocking space-based web services’ full potential requires radiation-aware network-layer solutions that balance energy constraints with service continuity during space weather events.

Abstract: Low Earth Orbit (LEO) satellite networks such as Starlink and Project Kuiper are increasingly integrated with cloud infrastructures, forming an important internet backbone for global web services. By extending connectivity to remote regions, oceans, and disaster zones, these networks enable reliable access to applications ranging from real-time WebRTC communication to emergency response portals. Yet the resilience of these web services is threatened by space radiation: it degrades hardware, drains batteries, and disrupts continuity, even if the space-cloud integrated providers use machine learning to analyze space weather and radiation data. Specifically, conventional fixes like altitude adjustments and thermal annealing consume energy; neglecting this energy use results in deep discharge and faster battery aging, whereas sleep modes risk abrupt web session interruptions. Efficient network-layer mitigation remains a critical gap. We propose RALT (Radiation-Aware LEO Transmission), a control-plane solution that dynamically reroutes traffic during radiation events, accounting for energy constraints to minimize battery degradation and sustain service performance. Our work shows that unlocking space-based web services’ full potential for global reliable connectivity requires rethinking resilience through the lens of the space environment itself.

[665] Modeling the Impacts of Swipe Delay on User Quality of Experience in Short Video Streaming

Duc V. Nguyen, Huyen T. T. Tran

Main category: cs.MM

TL;DR: First systematic study on swipe delay effects on user QoE in short video streaming, showing delay duration, frequency, and timing impact user experience, with proposed predictive QoE model.

DetailsMotivation: Swipe gestures are critical for user interaction with short video platforms, but delays between swipe actions and video playback can significantly impact user experience, which hasn't been systematically studied before.

Method: Conducted subjective quality assessment with 132 swipe delay patterns to analyze effects of delay duration, frequency, and temporal positioning on user QoE.

Result: Found that user experience is affected by delay duration, number of delays, and their temporal positions; single delays of 8+ seconds cause dissatisfaction; early-session delays are less harmful than late-session delays.

Conclusion: Proposed a novel QoE model that accurately predicts user experience based on swipe delay characteristics, outperforming existing models for short video streaming.

Abstract: Short video streaming platforms have gained immense popularity in recent years, transforming the way users consume video content. A critical aspect of user interaction with these platforms is the swipe gesture, which allows users to navigate through videos seamlessly. However, the delay between a user’s swipe action and the subsequent video playback can significantly impact the overall user experience. This paper presents the first systematic study investigating the effects of swipe delay on user Quality of Experience (QoE) in short video streaming. In particular, we conduct a subjective quality assessment containing 132 swipe delay patterns. The obtained results show that user experience is affected not only by the swipe delay duration, but also by the number of delays and their temporal positions. A single delay of eight seconds or longer is likely to lead to user dissatisfaction. Moreover, early-session delays are less harmful to user QoE than late-session delays. Based on the findings, we propose a novel QoE model that accurately predicts user experience based on swipe delay characteristics. The proposed model demonstrates high correlation with subjective ratings, outperforming existing models in short video streaming.
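
The reported trends suggest a penalty-style predictor; the sketch below uses a hypothetical functional form and coefficients (not the paper's fitted model): penalties grow with delay duration and count, and late-session delays are weighted more heavily than early ones.

```python
def predict_qoe(delays, session_len, base=5.0):
    """delays: list of (position_seconds, duration_seconds) swipe delays.
    Returns a score on a 1-5 scale; coefficients are illustrative only."""
    score = base
    for pos, dur in delays:
        position_weight = 0.5 + pos / session_len  # late delays hurt more
        score -= 0.3 * dur * position_weight       # longer delays hurt more
    return max(1.0, score)

early = predict_qoe([(5, 2)], session_len=60)   # short delay near the start
late  = predict_qoe([(55, 2)], session_len=60)  # same duration, near the end
long_ = predict_qoe([(30, 8)], session_len=60)  # a single eight-second delay
```

Under this toy form, the late delay scores worse than the early one, and the eight-second delay pushes the score toward the dissatisfied end of the scale, mirroring the paper's qualitative findings.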

[666] MSM-BD: Multimodal Social Media Bot Detection Using Heterogeneous Information

Tingxuan Wu, Zhaorui Ma, Yanjun Cui, Ziyi Zhou, Eric Wang

Main category: cs.MM

TL;DR: MSM-BD is a multimodal bot detection approach for social media that uses heterogeneous information (images, texts, user features) with cross-modal fusion technology to identify sophisticated AI-generated social bots.

DetailsMotivation: Social bots can be used for both constructive and malicious purposes, with AI advances making them increasingly indistinguishable from humans. There's a critical need for advanced detection techniques that can handle the heterogeneous multimodal nature of social media content.

Method: Proposes MSM-BD with specialized encoders for different data types (images, texts, user statistical features) and introduces Cross-Modal Residual Cross-Attention (CMRCA) for effective fusion of heterogeneous information.

Result: Validated effectiveness through extensive experiments using the TwiBot-22 dataset, demonstrating improved detection accuracy for sophisticated social bots.

Conclusion: Multimodal approaches using heterogeneous information and advanced fusion techniques like CMRCA are essential for detecting increasingly sophisticated AI-generated social bots on social media platforms.

Abstract: Although social bots can be engineered for constructive applications, their potential for misuse in manipulative schemes and malware distribution cannot be overlooked. This dichotomy underscores the critical need to detect social bots on social media platforms. Advances in artificial intelligence have improved the abilities of social bots, allowing them to generate content that is almost indistinguishable from human-created content. These advancements require the development of more advanced detection techniques to accurately identify these automated entities. Given the heterogeneous information landscape on social media, spanning images, texts, and user statistical features, we propose MSM-BD, a Multimodal Social Media Bot Detection approach using heterogeneous information. MSM-BD incorporates specialized encoders for heterogeneous information and introduces a cross-modal fusion technology, Cross-Modal Residual Cross-Attention (CMRCA), to enhance detection accuracy. We validate the effectiveness of our model through extensive experiments using the TwiBot-22 dataset.

[667] DuoTeach: Dual Role Self-Teaching for Coarse-to-Fine Decision Coordination in Vision–Language Models

Wei Yang, Yiran Zhu, Zilin Li, Xunjia Zhang, Jun Xia, Hongtao Wang

Main category: cs.MM

TL;DR: DuoTeach: A dual-role self-teaching framework for vision-language models that improves multi-level decision coordination in taxonomy path prediction through decision-conditioned rollouts without ground-truth labels.

DetailsMotivation: Existing benchmarks for coarse-to-fine path decision-making evaluate each taxonomic level independently, failing to capture cross-level validity and consistency issues. Vision-language models often produce invalid parent-child pairs and brittle full-path predictions due to unstable decision coordination across hierarchical levels.

Method: Proposes DuoTeach, a dual-role self-teaching distillation framework that uses the same pretrained VLM in two roles without ground-truth labels. Introduces Decision-Conditioned Rollout (DCR) to generate coherent teacher traces by conditioning each level on prior decisions, then distills this coordinated behavior into the student without additional test-time rollouts.

Result: DuoTeach improves in-domain Depth-Weighted Prefix Accuracy (alpha = 0.95) by up to 30.24 points and boosts zero-shot performance on unseen taxonomies from 17.17% to 43.66%. Gains are attributed to improved within-call multi-level decision coordination.

Conclusion: The work addresses a critical gap in evaluating vision-language models for hierarchical reasoning, showing that VLMs’ failures stem not just from incomplete knowledge but from unstable cross-level decision coordination. DuoTeach provides an effective solution through self-teaching distillation.

Abstract: Coarse-to-fine path decision-making requires predicting a valid taxonomy path in which earlier decisions constrain later ones. However, existing benchmarks score each level independently, obscuring cross-level validity and consistency. To better align evaluation with this setting, we introduce a Joint Path Decision (JPD) protocol that requires predicting the full path in one call, together with Depth-Weighted Prefix Accuracy (DWPA), a metric family that measures path reliability with tunable emphasis on deeper levels. Under JPD, strong vision-language models (VLMs) frequently produce invalid parent-child pairs and brittle full-path predictions, suggesting that their failures stem not only from incomplete taxonomic knowledge but also from unstable cross-level decision coordination. To address this problem, we propose DuoTeach, a dual-role self-teaching distillation framework that requires no ground-truth labels and reuses the same pretrained VLM in two roles. Its Decision-Conditioned Rollout (DCR) generates more coherent teacher traces by conditioning each level on prior decisions, and distills this coordinated behavior into the student without additional test-time rollouts. Across multiple taxonomy-structured benchmarks and VLM base models, DuoTeach improves in-domain DWPA (alpha = 0.95) by up to 30.24 points and boosts zero-shot performance on unseen taxonomies from 17.17% to 43.66%. Further analyses attribute these gains to improved within-call multi-level decision coordination.
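
The cross-level validity failure can be made concrete with a toy taxonomy (labels and scores are invented; the score dictionary stands in for a VLM's per-label confidence): selecting the best label at each level independently can yield an invalid parent-child pair, while conditioning each level on the previous decision, in the spirit of DCR, always yields a valid path.

```python
TAXONOMY = {                      # parent -> allowed children (toy example)
    "animal":  ["dog", "cat"],
    "vehicle": ["car", "bike"],
}

# Hypothetical per-label confidences for one input.
SCORES = {"animal": 0.6, "vehicle": 0.4,
          "dog": 0.2, "cat": 0.1, "car": 0.9, "bike": 0.3}

def independent_path():
    """Score each level separately -- may produce an invalid pair."""
    level1 = max(TAXONOMY, key=SCORES.get)
    level2 = max((c for cs in TAXONOMY.values() for c in cs), key=SCORES.get)
    return level1, level2

def conditioned_path():
    """Condition level 2 on the level-1 decision -- always a valid path."""
    level1 = max(TAXONOMY, key=SCORES.get)
    level2 = max(TAXONOMY[level1], key=SCORES.get)
    return level1, level2

bad = independent_path()   # ("animal", "car"): invalid parent-child pair
good = conditioned_path()  # ("animal", "dog"): valid taxonomy path
```

This is why per-level benchmarks can look healthy while full-path predictions break: the independent decoder is rewarded for "car" at level 2 even though it contradicts the level-1 choice.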

eess.AS

[668] PCOV-KWS: Multi-task Learning for Personalized Customizable Open Vocabulary Keyword Spotting

Jianan Pan, Kejie Huang

Main category: eess.AS

TL;DR: A lightweight multi-task learning framework for personalized keyword spotting that simultaneously performs keyword spotting and speaker verification using binary classification instead of softmax.

Motivation: As voice assistants become more prevalent with IoT, ASR, SV, and TTS technologies, there's growing demand for privacy and personalization in keyword spotting systems.

Method: Multi-task learning framework with lightweight network performing KWS and SV simultaneously; uses binary classification instead of softmax to eliminate inter-category competition; employs optimization strategy for multi-task loss weighting.

Result: Outperforms baselines on multiple datasets while requiring fewer parameters and lower computational resources.

Conclusion: The PCOV-KWS framework effectively addresses personalized keyword spotting needs with improved performance and efficiency.

Abstract: As advancements in technologies like Internet of Things (IoT), Automatic Speech Recognition (ASR), Speaker Verification (SV), and Text-to-Speech (TTS) lead to increased usage of intelligent voice assistants, the demand for privacy and personalization has escalated. In this paper, we introduce a multi-task learning framework for personalized, customizable open-vocabulary Keyword Spotting (PCOV-KWS). This framework employs a lightweight network to simultaneously perform Keyword Spotting (KWS) and SV to address personalized KWS requirements. We have integrated a training criterion distinct from softmax-based loss, transforming multi-class classification into multiple binary classifications, which eliminates inter-category competition, while an optimization strategy for multi-task loss weighting is employed during training. We evaluated our PCOV-KWS system on multiple datasets, demonstrating that it outperforms the baselines while requiring fewer parameters and lower computational resources.
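The softmax-to-binary idea in the training criterion can be sketched as follows; the exact loss is not given in the abstract, so this per-keyword sigmoid BCE is an illustrative stand-in:

```python
import numpy as np

def binary_kws_loss(logits, targets):
    """Per-keyword sigmoid BCE: each keyword acts as an independent
    binary detector, so classes do not compete as under softmax."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-9
    return -np.mean(targets * np.log(probs + eps)
                    + (1.0 - targets) * np.log(1.0 - probs + eps))

# Two keywords "active" at once: impossible under softmax, fine here
logits = np.array([[4.0, 3.0, -5.0]])
targets = np.array([[1.0, 1.0, 0.0]])
loss = binary_kws_loss(logits, targets)
```

Because each output is scored independently, raising one keyword's probability never has to lower another's, which is the inter-category competition the paper removes.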

[669] ProKWS: Personalized Keyword Spotting via Collaborative Learning of Phonemes and Prosody

Jianan Pan, Yuanming Zhang, Kejie Huang

Main category: eess.AS

TL;DR: ProKWS: A keyword spotting framework that combines phoneme-level matching with personalized prosody modeling (intonation, stress, rhythm) using a dual-stream encoder and collaborative fusion module.

Motivation: Current keyword spotting systems focus only on phoneme-level matching and ignore user-specific pronunciation traits like prosody (intonation, stress, rhythm), which limits their ability to distinguish confusable words and adapt to individual speaking styles.

Method: ProKWS uses a dual-stream encoder: one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information to enhance adaptability across acoustic environments.

Result: ProKWS delivers highly competitive performance comparable to state-of-the-art models on standard benchmarks and demonstrates strong robustness for personalized keywords with tone and intent variations.

Conclusion: Integrating prosody modeling with phoneme-level matching significantly improves keyword spotting performance, especially for personalized keywords and in varying acoustic conditions.

Abstract: Current keyword spotting systems primarily use phoneme-level matching to distinguish confusable words but ignore user-specific pronunciation traits like prosody (intonation, stress, rhythm). This paper presents ProKWS, a novel framework integrating fine-grained phoneme learning with personalized prosody modeling. We design a dual-stream encoder where one stream derives robust phonemic representations through contrastive learning, while the other extracts speaker-specific prosodic patterns. A collaborative fusion module dynamically combines phonemic and prosodic information, enhancing adaptability across acoustic environments. Experiments show ProKWS delivers highly competitive performance, comparable to state-of-the-art models on standard benchmarks and demonstrates strong robustness for personalized keywords with tone and intent variations.
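The collaborative fusion module is described only at a high level; one plausible gated form, in which a learned per-dimension gate trades off phonemic against prosodic evidence (the weight shapes and sigmoid gate are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def collaborative_fusion(phoneme_emb, prosody_emb, W, b):
    """Gated fusion: a per-dimension gate in (0, 1) decides how much
    phonemic vs. prosodic evidence to keep."""
    gate_in = np.concatenate([phoneme_emb, prosody_emb], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ W + b)))
    return gate * phoneme_emb + (1.0 - gate) * prosody_emb

d = 8
W = 0.1 * rng.normal(size=(2 * d, d))
b = np.zeros(d)
p = rng.normal(size=(1, d))  # phoneme-stream embedding
q = rng.normal(size=(1, d))  # prosody-stream embedding
fused = collaborative_fusion(p, q, W, b)
```

Since the gate is a convex combination, each fused dimension stays between the two streams' values, letting the model lean on prosody only where it helps.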

[670] ARTT: Augmented Reverberant-Target Training for Unsupervised Monaural Speech Dereverberation

Siqi Song, Fulin Wu, Zhong-Qiu Wang

Main category: eess.AS

TL;DR: Proposes ARTT, a two-stage unsupervised method for monaural speech dereverberation using augmented reverberant-target training and self-distillation.

Motivation: Monaural unsupervised speech dereverberation is challenging due to lack of clean reference signals and spatial cues, requiring novel approaches to solve this ill-posed inverse problem.

Method: Two-stage approach: 1) RTT stage further reverberates observed signal then trains DNN to recover it via discriminative training, 2) Online self-distillation using mean-teacher algorithm to improve performance.

Result: ARTT achieves strong unsupervised dereverberation performance, significantly outperforming previous baselines in evaluation results.

Conclusion: Proposed ARTT method effectively addresses unsupervised speech dereverberation challenge through innovative training approach and self-distillation mechanism.

Abstract: Due to the absence of clean reference signals and spatial cues, monaural unsupervised speech dereverberation is a challenging ill-posed inverse problem. To realize it, we propose augmented reverberant-target training (ARTT), which consists of two stages. In the first stage, reverberant-target training (RTT) is proposed to first further reverberate the observed reverberant mixture signal, and then train a deep neural network (DNN) to recover the observed reverberant mixture via discriminative training. Although the target signal to fit is reverberant, we find that the resulting DNN can effectively reduce reverberation. In the second stage, an online self-distillation mechanism based on the mean-teacher algorithm is proposed to further improve dereverberation. Evaluation results demonstrate that ARTT achieves strong unsupervised dereverberation performance, significantly outperforming previous baselines.
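The mean-teacher self-distillation in the second stage relies on an exponential moving average of the student's weights; a minimal sketch of that update (flat parameter dicts stand in for real DNN weights):

```python
def ema_update(teacher, student, decay=0.999):
    """Mean-teacher update: the teacher is an exponential moving
    average of the student's parameters."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k]
            for k in teacher}

teacher = {"w": 1.0}
student = {"w": 0.0}
for _ in range(1000):
    teacher = ema_update(teacher, student, decay=0.99)
# teacher["w"] has decayed toward the (fixed) student value
```

The slow-moving teacher provides more stable targets than the student's own noisy predictions, which is what makes online self-distillation viable without clean references.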

[671] How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation

Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee

Main category: eess.AS

TL;DR: LLMs vary in auditory knowledge from text-only pre-training, and this knowledge strongly correlates with audio-grounded performance in Large Audio Language Models.

Motivation: To understand how much auditory knowledge LLMs encode through text-only pre-training and how this affects downstream audio performance, addressing a gap in current understanding of LLMs in audio research.

Method: Three evaluation settings: (1) direct probing on AKB-2000 benchmark for auditory knowledge breadth/depth, (2) cascade evaluation using text descriptions from audio captioner, and (3) audio-grounded evaluation by fine-tuning LLMs into LALMs with audio encoder.

Result: Auditory knowledge varies substantially across LLM families, and text-only results are strongly correlated with audio performance, providing empirical grounding for understanding LLMs in audio research.

Conclusion: The study reveals important insights about LLMs’ auditory knowledge from text-only training and its impact on audio-grounded models, offering guidance for selecting LLM backbones in audio research.

Abstract: Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remains unclear. We study this gap by comparing different LLMs under two text-only and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into a Large Audio Language Model (LALM) with an audio encoder. Our findings reveal that auditory knowledge varies substantially across families, and text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.

[672] DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment

Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Main category: eess.AS

TL;DR: DeSTA2.5-Audio is a Large Audio Language Model that addresses catastrophic forgetting in LLMs when adding audio capabilities, using a self-generated cross-modal alignment strategy to preserve language proficiency while achieving strong audio understanding.

Motivation: Existing Large Audio Language Models suffer from catastrophic forgetting of the original LLM's abilities when trained on audio-instruction datasets, creating a challenge in balancing knowledge retention and audio perception.

Method: Proposes a self-generated cross-modal alignment strategy (DeSTA) where the backbone LLM generates its own training targets, and constructs DeSTA-AQA5M dataset with 5M samples from 7,000 hours of diverse audio data across 50 datasets.

Result: Achieves state-of-the-art or competitive performance on multiple audio-language benchmarks including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench, outperforming existing training strategies.

Conclusion: Carefully designed data construction is crucial for LALM development, and the self-generated strategy enables robust, general-purpose audio-language models with preserved language proficiency.

Abstract: We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model (LALM) designed for robust auditory perception and instruction-following. Recent LALMs augment Large Language Models (LLMs) with auditory capabilities by training on large-scale audio-instruction datasets. However, existing LALMs have often suffered from the catastrophic forgetting of the LLM’s original abilities. Therefore, balancing knowledge retention and audio perception has become a critical challenge. To address this, we revisit the data construction pipeline and propose a self-generated cross-modal alignment strategy in which the backbone LLM generates its own training targets, named DeSTA. This approach aims at preserving the LLM’s native language proficiency, thereby enabling zero-shot generalization without task-specific tuning. We construct DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training samples derived from 7,000 hours of audio spanning 50 diverse datasets, including speech, environmental sounds, and music. DeSTA2.5-Audio achieves state-of-the-art or competitive performance across a wide range of audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA, Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate that our self-generated strategy outperforms existing training strategies. Our findings underscore the importance of carefully designed data construction in LALM development and offer practical insights for building robust, general-purpose LALMs.

[673] MPDR Beamforming for Almost-Cyclostationary Processes

Giovanni Bologni, Martin Bo Møller, Richard Heusdens, Richard C. Hendriks

Main category: eess.AS

TL;DR: cMPDR beamformer extends MPDR to exploit spectral correlations in cyclostationary noise (like engines, fans) by using frequency-shifted filtering, achieving better noise reduction than spatial-only methods.

Motivation: Conventional beamformers assume short-time stationarity and process frequency bins independently, ignoring inter-frequency correlations. This is suboptimal for almost-periodic noise sources (engines, fans, musical instruments) which are better modeled as cyclostationary processes with statistically correlated spectral components.

Method: Introduces cyclic minimum power distortionless response (cMPDR) beamformer that extends MPDR to jointly exploit spatial and spectral correlations. Uses frequency-shifted (FRESH) filtering to suppress noise components coherent across harmonically related frequencies. For inharmonicity, estimates resonant frequencies from periodogram and derives frequency shifts from pairwise spacing.

Result: In low-SNR scenarios, cMPDR achieves up to 5dB improvement in SI-SDR over MPDR, yields consistent STOI gains, and remains effective with single microphone. When spectral correlation is absent, reduces to conventional MPDR without performance degradation. Theoretical analysis shows output power decreases monotonically with number of cyclic components.

Conclusion: Cyclic processing is a viable direction for acoustic noise reduction that deserves further investigation, especially for almost-periodic noise sources like engines and fans.

Abstract: Conventional acoustic beamformers typically assume short-time stationarity and process frequency bins independently, ignoring inter-frequency correlations. This is suboptimal for almost-periodic noise sources such as engines, fans, and musical instruments: these signals are better modeled as (almost) cyclostationary (ACS) processes with statistically correlated spectral components. This paper introduces the cyclic minimum power distortionless response (cMPDR) beamformer, which extends the conventional MPDR to jointly exploit spatial and spectral correlations. Building on frequency-shifted (FRESH) filtering, it suppresses noise components that are coherent across harmonically related frequencies, reducing residual noise beyond what spatial filtering alone achieves. To address inharmonicity, where partials deviate from exact integer multiples of a fundamental frequency, we estimate resonant frequencies from a periodogram and derive frequency shifts from their pairwise spacing. Theoretical analysis yields closed-form expressions for residual noise and proves that output power decreases monotonically with the number of cyclic components. Experiments on synthetic harmonic noise and real UAV motor recordings confirm these findings: in low-SNR scenarios, the cMPDR achieves up to 5dB improvement in SI-SDR over the MPDR, yields consistent STOI gains, and remains effective with a single microphone. When spectral correlation is absent, the method reduces to conventional MPDR and does not degrade performance. These results suggest that cyclic processing is a viable direction for acoustic noise reduction that deserves further investigation. Code is available at https://github.com/Screeen/cMPDR.
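The shift-estimation step, picking resonant frequencies from a periodogram and deriving cyclic shifts from their pairwise spacing, can be illustrated on a toy harmonic signal (the fixed top-3 peak picking is a simplification of whatever detector the paper uses):

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
f0 = 100.0
# Toy almost-periodic noise: harmonics at 100, 200, 300 Hz
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in (1, 2, 3))

# Periodogram and (simplified) peak picking: take the 3 strongest bins
spec = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(len(x), d=1 / fs)
peaks = np.sort(freqs[np.argsort(spec)[-3:]])

# Cyclic frequency shifts from the pairwise spacing of resonant peaks
shifts = np.diff(peaks)
```

For an inharmonic source the spacings would no longer be exact multiples of a fundamental, which is why the paper derives shifts from the measured pairwise spacing rather than assuming integer harmonics.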

[674] Group-Aware Partial Model Merging for Children’s Automatic Speech Recognition

Thomas Rolland, Alberto Abad

Main category: eess.AS

TL;DR: GRAPAM: Group-aware partial model merging for children’s ASR adaptation using unsupervised clustering, partial fine-tuning, and parameter merging

Motivation: Supervised fine-tuning of adult pre-trained models for children's ASR often fails to capture group-specific characteristics and variations among children, necessitating a more nuanced adaptation approach.

Method: Unsupervised clustering of children’s data by acoustic similarity, partial fine-tuning of adult pre-trained models for each group, and parameter-level merging of resulting models.

Result: Achieves 6% relative WER improvement on MyST children’s speech corpus using same data, outperforming full fine-tuning while training fewer parameters.

Conclusion: GRAPAM provides effective parameter-efficient adaptation for children’s ASR by capturing group-specific variations through clustering and partial model merging.

Abstract: While supervised fine-tuning of adult pre-trained models for children’s ASR has shown promise, it often fails to capture group-specific characteristics and variations among children. To address this, we introduce GRoup-Aware PARtial model Merging, a parameter-efficient approach that combines unsupervised clustering, partial fine-tuning, and model merging. Our approach adapts adult-pre-trained models to children by first grouping the children’s data based on acoustic similarity. Each group is used to partially fine-tune an adult pre-trained model, and the resulting models are merged at the parameter level. Experiments conducted on the MyST children’s speech corpus indicate that GRAPAM achieves a relative WER improvement of 6%, using the same amount of data, outperforming full fine-tuning while training fewer parameters.
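Parameter-level merging of the group-adapted models presumably amounts to a weighted average of corresponding parameters; a sketch under that assumption (uniform weights; real models would merge full state dicts of tensors):

```python
def merge_models(state_dicts, weights=None):
    """Parameter-level merge: weighted average of each parameter
    across the group-adapted models (uniform by default)."""
    n = len(state_dicts)
    if weights is None:
        weights = [1.0 / n] * n
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}

group_a = {"layer.w": 2.0}  # model fine-tuned on acoustic cluster A
group_b = {"layer.w": 4.0}  # model fine-tuned on acoustic cluster B
merged = merge_models([group_a, group_b])
```

Because only the partially fine-tuned parameters differ across groups, the merge touches a small fraction of the network, which is where the parameter efficiency comes from.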

[675] Affect Decoding in Phonated and Silent Speech Production from Surface EMG

Simon Pistrosch, Kleanthis Avramidis, Zhao Ren, Tiantian Feng, Jihwan Lee, Monica Gonzalez-Machorro, Anton Batliner, Tanja Schultz, Shrikanth Narayanan, Björn W. Schuller

Main category: eess.AS

TL;DR: EMG-based affect decoding from facial and neck muscle activity during speech, showing reliable frustration detection that generalizes across phonated and silent articulation modes.

Motivation: To understand the link between affect expression and articulatory execution in speech, and investigate whether EMG sensing can reveal how speech production is modulated by emotion, particularly for potential silent speech interfaces.

Method: Collected dataset of 2,780 utterances from 12 participants across 3 tasks, evaluated intra- and inter-subject decoding using various features and model embeddings, with ablation studies to analyze affective signatures in facial motor activity.

Result: EMG representations reliably discriminate frustration with up to 0.845 AUC, generalize well across articulation modes, and affective signatures persist in the absence of phonation.

Conclusion: EMG sensing shows strong potential for affect-aware silent speech interfaces, as affective information is embedded in facial motor activity and can be detected even without audible speech.

Abstract: The expression of affect is integral to spoken communication, yet, its link to underlying articulatory execution remains unclear. Measures of articulatory muscle activity such as EMG could reveal how speech production is modulated by emotion alongside acoustic speech analyses. We investigate affect decoding from facial and neck surface electromyography (sEMG) during phonated and silent speech production. For this purpose, we introduce a dataset comprising 2,780 utterances from 12 participants across 3 tasks, on which we evaluate both intra- and inter-subject decoding using a range of features and model embeddings. Our results reveal that EMG representations reliably discriminate frustration with up to 0.845 AUC, and generalize well across articulation modes. Our ablation study further demonstrates that affective signatures are embedded in facial motor activity and persist in the absence of phonation, highlighting the potential of EMG sensing for affect-aware silent speech interfaces.

eess.IV

[676] A Novel Framework Using Intuitionistic Fuzzy Logic with U-Net and U-Net++ Architecture: A Case Study of MRI Brain Image Segmentation

Hanuman Verma, Kiho Im, Akshansh Gupta, M. Tanveer

Main category: eess.IV

TL;DR: Proposes IFS U-Net and IFS U-Net++ that integrate intuitionistic fuzzy logic into U-Net architectures to handle uncertainty in brain MRI segmentation, showing improved performance on IBSR and OASIS datasets.

Motivation: Brain MRI segmentation is crucial for neurological diagnosis, but existing deep learning methods like U-Net/U-Net++ struggle with uncertainty from vague/imprecise data, partial volume effects, and boundary ambiguities.

Method: Integrates intuitionistic fuzzy logic into U-Net and U-Net++ architectures (IFS U-Net and IFS U-Net++) to accept input data in intuitionistic fuzzy representation, enabling better uncertainty management in segmentation tasks.

Result: Experiments on IBSR and OASIS MRI brain datasets show improved segmentation performance measured by Accuracy, Dice Coefficient, and IoU compared to baseline methods.

Conclusion: The proposed IFS-enhanced architectures effectively address uncertainty in brain MRI segmentation, leading to consistently better performance for neurological analysis applications.

Abstract: Accurate segmentation of brain images from magnetic resonance imaging (MRI) scans plays a pivotal role in brain image analysis and the diagnosis of neurological disorders. Deep learning algorithms, particularly U-Net and U-Net++, are widely used for image segmentation. However, they find it difficult to deal with uncertainty in images. To address this challenge, this work integrates intuitionistic fuzzy logic into U-Net and U-Net++ and proposes a novel framework, named IFS U-Net and IFS U-Net++. These models accept input data in an intuitionistic fuzzy representation to manage uncertainty arising from vagueness and imprecise data. This approach effectively handles tissue ambiguity caused by the partial volume effect and boundary uncertainties. To evaluate the effectiveness of IFS U-Net and IFS U-Net++, experiments are conducted on two publicly available MRI brain datasets: the Internet Brain Segmentation Repository (IBSR) and the Open Access Series of Imaging Studies (OASIS). Segmentation performance is quantitatively assessed using Accuracy, Dice Coefficient, and Intersection over Union (IoU). The results demonstrate that the proposed architectures consistently improve segmentation performance by effectively addressing uncertainty.
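The abstract does not specify how the intuitionistic fuzzy representation is built; a common construction pairs a membership map with a Sugeno-type non-membership generator, sketched here (the generator choice and lambda value are assumptions, not necessarily the paper's):

```python
import numpy as np

def to_ifs(image, lam=2.0):
    """Map a grayscale image to an intuitionistic fuzzy triple:
    membership mu, non-membership nu, and hesitancy pi. The
    Sugeno-type generator for nu is one common choice."""
    mu = (image - image.min()) / (image.max() - image.min() + 1e-9)
    nu = (1.0 - mu) / (1.0 + lam * mu)  # non-membership
    pi = 1.0 - mu - nu                  # hesitancy: leftover uncertainty
    return mu, nu, pi

img = np.array([[0.0, 128.0], [192.0, 255.0]])  # toy 2x2 intensities
mu, nu, pi = to_ifs(img)
```

The hesitancy channel pi is largest at intermediate intensities, which is exactly where partial volume effects make tissue membership ambiguous.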

[677] Quality assessment of brain structural MR images: Comparing generalization of deep learning versus hand-crafted feature-based machine learning methods to new sites

Prabhjot Kaur, John S. Thornton, Frederik Barkhof, Tarek A. Yousry, Sjoerd B. Vos, Hui Zhang

Main category: eess.IV

TL;DR: Comparison of two automated brain MRI quality assessment methods (MRIQC using hand-crafted features vs CNNQC using deep learning) shows both struggle with cross-site generalization, with trade-offs in accuracy vs sensitivity.

Motivation: Visual quality assessment of brain MR images is time-consuming and subjective, creating need for automated methods that can scale for large neuroimaging studies while handling motion artifacts that bias clinical estimates.

Method: Evaluated two AQA methods on 1,098 T1-weighted volumes from 17 sites: MRIQC (hand-crafted image-quality metrics with traditional ML) vs CNNQC (deep learning architecture). Used leave-one-site-out approach to test generalization to new scanners/sites.

Result: Both methods struggled to generalize to new sites/scanners. MRIQC generally achieved higher accuracy across most unseen sites, while CNNQC demonstrated higher sensitivity for detecting poor-quality scans. DL methods offer computational efficiency and avoid expensive pre-processing.

Conclusion: DL-based methods like CNNQC may be preferred for widespread deployment due to computational efficiency and no pre-processing requirements, but future work must focus on improving cross-site generalizability for both approaches.

Abstract: Quality assessment of brain structural MR images is critical for large-scale neuroimaging studies, where motion artifacts can significantly bias clinical estimates. While visual rating remains the gold standard, it is time-consuming and subjective. This study evaluates the relative performance and generalization capabilities of two prominent Automated Quality Assessment (AQA) methods: MRIQC, which uses hand-crafted image-quality metrics with traditional machine learning, and CNNQC, which utilizes a deep learning (DL) architecture. Using a heterogeneous dataset of 1,098 T1-weighted volumes from 17 different sites, we assessed performance on both seen sites and entirely new sites using a leave-one-site-out (LOSO) approach. Our results indicate that both DL and traditional ML methods struggle to generalize to new scanners or sites. While MRIQC generally achieved higher accuracy across most unseen sites, CNNQC demonstrated higher sensitivity for detecting poor-quality scans. Given that DL-based methods like CNNQC offer higher computational efficiency and do not require expensive pre-processing, they may be preferred for widespread deployment, provided that future work focuses on improving cross-site generalizability.
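The leave-one-site-out protocol itself is straightforward to sketch: each of the sites is held out in turn while the remaining sites train the classifier (toy site labels here, not the study's 17 sites):

```python
def leave_one_site_out(samples):
    """Yield (held-out site, train, test) splits; `samples` is a
    list of (site_id, volume_id) pairs."""
    sites = sorted({site for site, _ in samples})
    for held_out in sites:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test

data = [("A", 1), ("A", 2), ("B", 3), ("C", 4)]
splits = list(leave_one_site_out(data))
```

Grouping by site rather than splitting volumes randomly is what exposes the cross-scanner generalization gap the study reports.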

[678] Dual Agreement Consistency Learning with Foundation Models for Semi-Supervised Fetal Heart Ultrasound Segmentation and Diagnosis

Fangyijie Wang, Guénolé Silvestre, Kathleen M. Curran

Main category: eess.IV

TL;DR: FM-DACL is a semi-supervised dual agreement consistency learning framework for fetal heart ultrasound segmentation and diagnosis that combines a pretrained ultrasound foundation model with a convolutional network through heterogeneous co-training.

Motivation: Developing reliable AI models for congenital heart disease screening from fetal echocardiography is challenging due to limited annotations and variable image quality, requiring methods that can effectively leverage both labeled and unlabeled data.

Method: Proposes FM-DACL framework combining a pretrained ultrasound foundation model (EchoCare) with a convolutional network through heterogeneous co-training and exponential moving average teacher to better exploit unlabeled data for semi-supervised learning.

Result: Achieves Dice score of 59.66 and NSD of 42.82 on multi-center challenge dataset using heterogeneous backbones, demonstrating feasibility of the proposed semi-supervised framework for fetal cardiac ultrasound analysis.

Conclusion: FM-DACL provides a flexible approach for leveraging heterogeneous models in low-annotation fetal cardiac ultrasound analysis, showing promise for medical imaging applications with limited labeled data.

Abstract: Congenital heart disease (CHD) screening from fetal echocardiography requires accurate analysis of multiple standard cardiac views, yet developing reliable artificial intelligence models remains challenging due to limited annotations and variable image quality. In this work, we propose FM-DACL, a semi-supervised Dual Agreement Consistency Learning framework for the FETUS 2026 challenge on fetal heart ultrasound segmentation and diagnosis. The method combines a pretrained ultrasound foundation model (EchoCare) with a convolutional network through heterogeneous co-training and an exponential moving average teacher to better exploit unlabeled data. Experiments on the multi-center challenge dataset show that FM-DACL achieves a Dice score of 59.66 and NSD of 42.82 using heterogeneous backbones, demonstrating the feasibility of the proposed semi-supervised framework. These results suggest that FM-DACL provides a flexible approach for leveraging heterogeneous models in low-annotation fetal cardiac ultrasound analysis. The code is available on https://github.com/13204942/FM-DACL.
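The dual-agreement idea, trusting pseudo-labels only where the two heterogeneous models agree, can be sketched as a simple mask (this rule is a plausible reading of the abstract, not the paper's exact criterion):

```python
import numpy as np

def agreement_mask(pred_a, pred_b):
    """Dual agreement: keep a pseudo-label only where both models
    predict the same class."""
    la = pred_a.argmax(axis=-1)
    lb = pred_b.argmax(axis=-1)
    return la == lb, la

pa = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])  # e.g. foundation model
pb = np.array([[0.7, 0.3], [0.6, 0.4], [0.3, 0.7]])  # e.g. conv network
mask, labels = agreement_mask(pa, pb)
```

Heterogeneous backbones tend to make different mistakes, so requiring agreement filters out many of each model's individual errors before the pseudo-labels supervise unlabeled data.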

[679] Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

Fangyijie Wang, Tanya Akumu, Vien Ngoc Dang, Amelia Jiménez-Sánchez, Jieyun Bai, Guénolé Silvestre, Karim Lekadir, Kathleen M. Curran

Main category: eess.IV

TL;DR: Unified ultrasound foundation models often underperform task-specific ones; M2DINO framework shows aggregation effectiveness depends on training data scale and task type, with segmentation most sensitive.

Motivation: Foundation models promise unified clinical task handling but recent ultrasound studies show unified models underperform task-specific baselines. The degradation may stem from task aggregation strategies ignoring interactions between task heterogeneity and training data scale.

Method: Introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. Systematically evaluate 27 ultrasound tasks (segmentation, classification, detection, regression) under three paradigms: task-specific, clinically-grouped, and all-task unified training.

Result: Aggregation effectiveness strongly depends on training data scale. Clinically-grouped training improves performance in data-rich settings but causes negative transfer in low-data settings. All-task unified training shows more consistent performance. Segmentation tasks show largest performance drops compared to regression and classification.

Conclusion: Aggregation strategies should jointly consider training data availability and task characteristics rather than relying solely on clinical taxonomy. Provides practical guidance for ultrasound foundation models.

Abstract: Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.
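A task-conditioned Mixture-of-Experts block can be sketched with a gate that depends only on the task identity, so each task learns its own mixture over shared experts (this form is an assumption; the paper's routing may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def task_conditioned_moe(x, task_id, experts, task_gates):
    """Route by task identity: each task has its own softmax
    mixture over a shared pool of expert projections."""
    logits = task_gates[task_id]
    g = np.exp(logits - logits.max())
    g = g / g.sum()
    return sum(w * (x @ E) for w, E in zip(g, experts))

d, n_experts, n_tasks = 4, 3, 2
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
task_gates = rng.normal(size=(n_tasks, n_experts))
x = rng.normal(size=(1, d))
y_seg = task_conditioned_moe(x, 0, experts, task_gates)  # e.g. segmentation
y_cls = task_conditioned_moe(x, 1, experts, task_gates)  # e.g. classification
```

Because experts are shared but gates are per-task, tasks can borrow capacity from each other while still specializing, which is the "adaptive capacity allocation" the method targets.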

[680] Energy-Aware Frame Rate Selection for Video Coding

Geetha Ramasubbu, André Kaup, Christian Herglotz

Main category: eess.IV

TL;DR: Energy-aware video encoding method using frame rate selection to reduce encoding/decoding energy while maintaining visual quality

Motivation: To reduce energy consumption in video encoding and decoding while maintaining visual quality, as frame rate reduction can save energy but may impact quality.

Method: Two-part approach: 1) Extensive analysis of frame rate impact on energy and quality, identifying Pareto-optimal frame rates; 2) Lightweight ML-based frame rate selection using spatio-temporal features

Result: Significant energy savings: 17.46% encoding and 17.60% decoding energy reduction, plus 3.38% bitrate savings at constant quality

Conclusion: Proposed method effectively reduces video encoding/decoding energy while maintaining quality through intelligent frame rate selection

Abstract: The main contributions of this paper are twofold: First, we present an in-depth analysis of the impact of frame rate reductions on the visual quality of the video and the encoding as well as decoding energy. Second, we propose a lightweight frame rate selection method for energy- and quality-aware encoding. Concerning the first contribution, this paper performs extensive encoding and decoding measurements, followed by an investigation of the impact of temporal downsampling on the energy demand of encoding and decoding at different frame rates. Furthermore, we determine the objective visual quality of the downsampled videos. As a result of this investigation, we identify content- and quantization-setting-dependent energy-aware frame rates, i.e., the temporal downsampling factors that lead to Pareto-optimality in terms of energy and quality. We demonstrate that significant energy savings are achieved while maintaining constant visual quality. Subsequently, a subjective experiment is conducted to verify this observation regarding perceptual quality using mean opinion scores. As the second contribution, we propose an energy-aware frame rate selection method that extracts spatio-temporal features from the video sequences. Based on these features, the proposed method employs a feature-based supervised machine learning approach to predict energy-aware frame rates for a given quantization parameter and video sequence, aiming to reduce energy consumption during encoding and decoding. The experimental results demonstrate that the proposed method offers significant energy savings, with an average of 17.46% and 17.60% of encoding and decoding energy demand reduction, respectively, alongside 3.38% average bitrate savings at a constant quality.
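Identifying the Pareto-optimal frame rates in the (energy, quality) plane, the core of the first contribution, can be sketched as a dominance check (toy numbers, not measured values):

```python
def pareto_optimal_rates(candidates):
    """Keep frame rates not dominated in (energy, quality): lower
    energy and higher quality are better. `candidates` maps a
    frame rate to an (energy, quality) pair."""
    front = []
    for r, (e, q) in candidates.items():
        dominated = any(e2 <= e and q2 >= q and (e2 < e or q2 > q)
                        for r2, (e2, q2) in candidates.items() if r2 != r)
        if not dominated:
            front.append(r)
    return sorted(front)

# Toy (energy, quality) measurements per frame rate
cands = {60: (10.0, 0.95), 30: (6.0, 0.94), 15: (4.0, 0.80), 10: (5.0, 0.70)}
front = pareto_optimal_rates(cands)
```

In this example 10 fps is dominated (15 fps costs less energy and scores higher quality), so only the remaining rates are candidates for the learned selector.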

[681] SCISSR: Scribble-Conditioned Interactive Surgical Segmentation and Refinement

Haonan Ping, Jian Jiang, Cheng Yuan, Qizhen Sun, Lv Wu, Yutong Ban

Main category: eess.IV

TL;DR: SCISSR is a scribble-promptable framework for interactive surgical scene segmentation that converts freehand scribbles into dense prompt embeddings for iterative refinement, achieving state-of-the-art performance on surgical datasets.

Motivation: Surgical scene segmentation is challenging due to irregular shapes, thin structures, specularities, and occlusions. Existing SAM models use point/box prompts that are too sparse/coarse for surgical targets, requiring a more intuitive scribble-based interaction method.

Method: Introduces lightweight Scribble Encoder to convert scribbles into dense prompt embeddings compatible with SAM mask decoder. Uses Spatial Gated Fusion and LoRA adapters, interacting only through standard embedding interfaces to maintain transferability across SAM versions. Framework enables iterative refinement by drawing corrective strokes on error regions.

Result: Achieves 95.41% Dice on EndoVis 2018 with 5 interaction rounds and 96.30% Dice on CholecSeg8k with 3 rounds, outperforming iterative point prompting. Demonstrates strong in-domain performance and cross-domain robustness.

Conclusion: SCISSR provides an effective scribble-based interactive segmentation framework for surgical scenes that is model-agnostic, preserves pre-trained capabilities, and achieves superior performance through intuitive iterative refinement.

Abstract: Accurate segmentation of tissues and instruments in surgical scenes is annotation-intensive due to irregular shapes, thin structures, specularities, and frequent occlusions. While SAM models support point, box, and mask prompts, points are often too sparse and boxes too coarse to localize such challenging targets. We present SCISSR, a scribble-promptable framework for interactive surgical scene segmentation. It introduces a lightweight Scribble Encoder that converts freehand scribbles into dense prompt embeddings compatible with the mask decoder, enabling iterative refinement for a target object by drawing corrective strokes on error regions. Because all added modules (the Scribble Encoder, Spatial Gated Fusion, and LoRA adapters) interact with the backbone only through its standard embedding interfaces, the framework is not tied to a single model: we build on SAM 2 in this work, yet the same components transfer to other prompt-driven segmentation architectures such as SAM 3 without structural modification. To preserve pre-trained capabilities, we train only these lightweight additions while keeping the remaining backbone frozen. Experiments on EndoVis 2018 demonstrate strong in-domain performance, while evaluation on the out-of-distribution CholecSeg8k further confirms robustness across surgical domains. SCISSR achieves 95.41% Dice on EndoVis 2018 with five interaction rounds and 96.30% Dice on CholecSeg8k with three interaction rounds, outperforming iterative point prompting on both benchmarks.
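Before a freehand scribble can be encoded, it must be turned into a dense spatial signal. The actual Scribble Encoder is a learned module; the sketch below only illustrates the plausible preprocessing step of rasterizing a polyline scribble into a dense binary prompt map (function name and interface are hypothetical):

```python
def rasterize_scribble(points, height, width):
    """Rasterize a freehand scribble, given as a polyline of (row, col)
    vertices, into a dense binary map using Bresenham's line algorithm."""
    mask = [[0] * width for _ in range(height)]
    for (r0, c0), (r1, c1) in zip(points, points[1:]):
        dr, dc = abs(r1 - r0), abs(c1 - c0)
        sr = 1 if r1 >= r0 else -1
        sc = 1 if c1 >= c0 else -1
        err = dr - dc
        r, c = r0, c0
        while True:
            mask[r][c] = 1  # mark every pixel the stroke passes through
            if (r, c) == (r1, c1):
                break
            e2 = 2 * err
            if e2 > -dc:
                err -= dc
                r += sr
            if e2 < dr:
                err += dr
                c += sc
    return mask
```

A corrective stroke on an error region would be rasterized the same way and fed back through the encoder for the next refinement round.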

[682] UEPS: Robust and Efficient MRI Reconstruction

Xiang Zhou, Hong Shang, Zijian Zhan, Tianyu He, Jintao Meng, Dong Liang

Main category: eess.IV

TL;DR: UEPS is a deep unrolled MRI reconstruction model that eliminates coil-sensitivity-map dependency for improved robustness under domain shifts, achieving state-of-the-art performance across diverse clinical scenarios.

Motivation: Deep unrolled models (DUMs) for accelerated MRI reconstruction suffer from poor robustness under domain shifts (different anatomies, views, contrasts, vendors, etc.), primarily due to their dependency on coil sensitivity map (CSM) estimation. This limits clinical adoption where diverse acquisition conditions are common.

Method: UEPS introduces three key innovations: 1) Unrolled Expanded (UE) design that reconstructs each coil independently to eliminate CSM dependency, 2) progressive resolution using k-space-to-image mapping for coarse-to-fine refinement, and 3) sparse attention tailored to MRI’s 1D undersampling patterns. The model is evaluated on a large-scale zero-shot transfer benchmark with 10 out-of-distribution test sets.

Result: UEPS consistently and substantially outperforms existing DUMs, end-to-end methods, diffusion models, and untrained methods across all out-of-distribution tests. It achieves state-of-the-art robustness while maintaining low-latency inference suitable for real-time clinical deployment.

Conclusion: By addressing the CSM estimation bottleneck through physics-grounded architectural innovations, UEPS demonstrates superior generalization across diverse clinical domain shifts, making it a promising solution for robust, real-time MRI reconstruction in clinical settings.

Abstract: Deep unrolled models (DUMs) have become the state of the art for accelerated MRI reconstruction, yet their robustness under domain shift remains a critical barrier to clinical adoption. In this work, we identify coil sensitivity map (CSM) estimation as the primary bottleneck limiting generalization. To address this, we propose UEPS, a novel DUM architecture featuring three key innovations: (i) an Unrolled Expanded (UE) design that eliminates CSM dependency by reconstructing each coil independently; (ii) progressive resolution, which leverages k-space-to-image mapping for efficient coarse-to-fine refinement; and (iii) sparse attention tailored to MRI’s 1D undersampling nature. These physics-grounded designs enable simultaneous gains in robustness and computational efficiency. We construct a large-scale zero-shot transfer benchmark comprising 10 out-of-distribution test sets spanning diverse clinical shifts – anatomy, view, contrast, vendor, field strength, and coil configurations. Extensive experiments demonstrate that UEPS consistently and substantially outperforms existing DUM, end-to-end, diffusion, and untrained methods across all OOD tests, achieving state-of-the-art robustness with low-latency inference suitable for real-time deployment.
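UEPS reconstructs each coil image independently, which removes the CSM-estimation bottleneck. The abstract does not specify how per-coil images are merged; one standard CSM-free option is root-sum-of-squares (RSS) combination, sketched here with invented toy data:

```python
import math

def rss_combine(coil_images):
    """Root-sum-of-squares combination of per-coil complex images.

    coil_images: list of 2D lists of complex values, one per coil.
    Returns a real-valued magnitude image; unlike sensitivity-weighted
    combination, no coil sensitivity maps are required.
    """
    h, w = len(coil_images[0]), len(coil_images[0][0])
    return [
        [math.sqrt(sum(abs(img[r][c]) ** 2 for img in coil_images))
         for c in range(w)]
        for r in range(h)
    ]

# Two single-pixel "coils" seeing the same voxel with different phases:
combined = rss_combine([[[3 + 0j]], [[4j]]])
```

RSS is insensitive to per-coil phase, which is one reason CSM-free pipelines tend to be robust to the coil-configuration shifts the benchmark tests.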

[683] A Hybrid Physical–Digital Framework for Annotated Fracture Reduction Data Evaluated using Clinically Relevant 3D metrics

Basile Longo, Paul-Emmanuel Edeline, Hoel Letissier, Marc-Olivier Gauci, Aziliz Guezou-Philippe, Valérie Burdin, Guillaume Dardenne

Main category: eess.IV

TL;DR: Hybrid physical-digital framework for generating annotated fracture reduction data using 3D printing, physical reduction, and CT scanning to create realistic training data for automatic fracture reduction algorithms.

Motivation: Limited annotated data exists for evaluating automatic fracture reduction methods in Computer-Assisted Preoperative Planning (CAPP). Current approaches use synthetic fractures (lacking realism) or manual virtual reductions (time-consuming, operator-dependent, error-prone).

Method: Fragments from fracture CTs are 3D printed, physically reduced and fixed by operators, then CT scanned to recover transformation matrices. Introduces reproducible 3D fracture metrics (3D gap, 3D step-off, total gap area) for quantitative assessment.

Result: Evaluated on 11 clinical acetabular fracture cases reduced by two operators. Achieved mean improvements of 168.85 mm² in total gap area, 1.82 mm in 3D gap, and 0.81 mm in 3D step-off compared to preoperative measurements.

Conclusion: The hybrid physical-digital framework enables efficient generation of realistic, clinically relevant annotated fracture reduction data for developing and evaluating automatic fracture reduction algorithms.

Abstract: A major bottleneck in Computer-Assisted Preoperative Planning (CAPP) for fracture reduction is the limited availability of annotated data. While annotated datasets are now available for evaluating bone fracture segmentation algorithms, there is a notable lack of annotated data for the evaluation of automatic fracture reduction methods. Obtaining precise annotations of the reduced bone, which are essential for training and evaluating automatic CAPP algorithms, therefore remains a critical and underexplored challenge. Existing approaches to assess reduction methods rely either on synthetic fracture simulation, which often lacks realism, or on manual virtual reductions, which are complex, time-consuming, operator-dependent, and error-prone. To address these limitations, we propose a hybrid physical-digital framework for generating annotated fracture reduction data. Based on fracture CTs, fragments are first 3D printed, physically reduced, fixed, and CT scanned to accurately recover the transformation matrix applied to each fragment. To quantitatively assess reduction quality, we introduce a reproducible formulation of clinically relevant 3D fracture metrics, including 3D gap, 3D step-off, and total gap area. The framework was evaluated on 11 clinical acetabular fracture cases reduced by two independent operators. Compared to preoperative measurements, the proposed approach achieved mean improvements of 168.85 mm² in total gap area, 1.82 mm in 3D gap, and 0.81 mm in 3D step-off. This hybrid physical–digital framework enables the efficient generation of realistic, clinically relevant annotated fracture reduction data that can be used for the development and evaluation of automatic fracture reduction algorithms.
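One plausible (hypothetical) formulation of the 3D gap metric is the mean nearest-neighbor distance from points on one fragment's fracture surface to points on the other's; the paper's exact definitions may differ. A minimal sketch with invented point sets:

```python
import math

def gap_3d(surface_a, surface_b):
    """Mean nearest-neighbor distance from fragment A's fracture-surface
    points to fragment B's. A symmetric variant would average the metric
    computed in both directions.
    """
    return sum(
        min(math.dist(p, q) for q in surface_b) for p in surface_a
    ) / len(surface_a)

# Two toy fracture surfaces separated by a uniform 2 mm gap along z:
gap = gap_3d([(0, 0, 0), (1, 0, 0)], [(0, 0, 2), (1, 0, 2)])
```

Computed before and after applying the recovered transformation matrices, such a metric yields the kind of per-case improvement the paper reports.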

[684] GenMFSR: Generative Multi-Frame Image Restoration and Super-Resolution

Harshana Weligampola, Joshua Peter Ebenezer, Weidi Liu, Abhinau K. Venkataramanan, Sreenithy Chandran, Seok-Jun Lee, Hamid Rahim Sheikh

Main category: eess.IV

TL;DR: GenMFSR is a generative multi-frame raw-to-RGB super-resolution pipeline that uses foundation model priors for camera ISP applications, aligning multiple raw frames and preventing low-frequency artifacts.

Motivation: Camera pipelines need to process raw Bayer-format frames through denoising, demosaicing, and super-resolution. Existing adversarial methods are limited by ground truth quality, and single-frame methods cannot align multiple frames for enhanced resolution from natural hand tremors.

Method: Proposes GenMFSR, a generative multi-frame raw-to-RGB super-resolution pipeline that incorporates image priors from foundation models to obtain sub-pixel information. The method aligns multiple raw frames and uses a loss term that restricts generation to high-frequency regions in the raw domain to prevent low-frequency artifacts.

Result: GenMFSR is presented as the first generative multi-frame raw-to-RGB super-resolution pipeline that can effectively align multiple raw frames for camera ISP applications while avoiding low-frequency artifacts.

Conclusion: The proposed GenMFSR pipeline addresses limitations of existing methods by leveraging foundation model priors and multi-frame alignment for improved super-resolution in camera image signal processing pipelines.

Abstract: Camera pipelines receive raw Bayer-format frames that need to be denoised, demosaiced, and often super-resolved. Multiple frames are captured to utilize natural hand tremors and enhance resolution. Multi-frame super-resolution is therefore a fundamental problem in camera pipelines. Existing adversarial methods are constrained by the quality of ground truth. We propose GenMFSR, the first Generative Multi-Frame Raw-to-RGB Super Resolution pipeline, that incorporates image priors from foundation models to obtain sub-pixel information for camera ISP applications. GenMFSR can align multiple raw frames, unlike existing single-frame super-resolution methods, and we propose a loss term that restricts generation to high-frequency regions in the raw domain, thus preventing low-frequency artifacts.
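The paper's loss restricts generation to high-frequency regions of the raw domain. One way such a restriction could be implemented — purely a sketch with an invented gradient threshold, not the paper's actual formulation — is to derive a high-frequency mask from local gradient magnitude and penalize deviation from the reference everywhere else:

```python
def high_freq_mask(img, threshold):
    """Binary mask of pixels whose local gradient magnitude
    (|dx| + |dy|, forward differences) exceeds `threshold`."""
    h, w = len(img), len(img[0])
    mask = [[0] * w for _ in range(h)]
    for r in range(h - 1):
        for c in range(w - 1):
            grad = abs(img[r][c + 1] - img[r][c]) + abs(img[r + 1][c] - img[r][c])
            if grad > threshold:
                mask[r][c] = 1
    return mask

def low_freq_penalty(generated, reference, mask):
    """Penalize generated content only OUTSIDE the high-frequency mask,
    discouraging hallucinated low-frequency structure."""
    total = 0.0
    for r, row in enumerate(mask):
        for c, m in enumerate(row):
            if m == 0:
                total += abs(generated[r][c] - reference[r][c])
    return total
```

The generative prior is then free to synthesize sub-pixel detail where edges and texture exist, while smooth regions stay anchored to the captured frames.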

[685] Hallucination Detection in Virtually-Stained Histology: A Latent Space Baseline

Ji-Hun Oh, Kianoush Falahkheirkhah, John Cheville, Rohit Bhargava

Main category: eess.IV

TL;DR: Proposes Neural Hallucination Precursor (NHP), a post-hoc method for detecting hallucinations in virtual staining of histopathology images by analyzing generator latent spaces.

Motivation: Virtual staining offers cost reduction and workflow streamlining for histopathology, but hallucinations pose serious clinical reliability risks that need detection methods.

Method: NHP leverages the generator’s latent space to preemptively flag hallucinations in virtual staining outputs through scalable post-hoc analysis.

Result: Extensive experiments across diverse VS tasks show NHP is effective and robust. Also reveals that models with fewer hallucinations don’t necessarily offer better detectability.

Conclusion: Highlights a gap in current VS evaluation and underscores the need for hallucination detection benchmarks to ensure clinical reliability.

Abstract: Histopathologic analysis of stained tissue remains central to biomedical research and clinical care. Virtual staining (VS) offers a promising alternative, with potential to reduce costs and streamline workflows, yet hallucinations pose serious risks to clinical reliability. Here, we formalize the problem of hallucination detection in VS and propose a scalable post-hoc method: Neural Hallucination Precursor (NHP), which leverages the generator’s latent space to preemptively flag hallucinations. Extensive experiments across diverse VS tasks show NHP is both effective and robust. Critically, we also find that models with fewer hallucinations do not necessarily offer better detectability, exposing a gap in current VS evaluation and underscoring the need for hallucination detection benchmarks.

[686] ELiC: Efficient LiDAR Geometry Compression via Cross-Bit-depth Feature Propagation and Bag-of-Encoders

Junsik Kim, Gun Bang, Soowoong Kim

Main category: eess.IV

TL;DR: ELiC is a real-time hierarchical LiDAR geometry compression framework that improves compression efficiency through cross-bit-depth feature propagation, Bag-of-Encoders selection, and Morton-order-preserving hierarchy.

Motivation: Prior LiDAR compression methods treat each bit-depth independently and re-estimate local context from coordinates at every level, which limits compression efficiency and computational performance.

Method: Three key components: 1) Cross-bit-depth feature propagation reuses features from denser lower depths to support prediction at sparser higher depths; 2) Bag-of-Encoders (BoE) selects the most suitable coding network per depth from a small pool; 3) Morton-order-preserving hierarchy maintains global Z-order across depth transitions.

Result: Achieves state-of-the-art compression performance at real-time throughput on Ford and SemanticKITTI datasets, with improved entropy modeling and computation efficiency.

Conclusion: ELiC demonstrates that hierarchical LiDAR compression can be significantly improved through feature reuse, adaptive encoder selection, and efficient data ordering, enabling real-time performance with superior compression ratios.

Abstract: Hierarchical LiDAR geometry compression encodes voxel occupancies from low to high bit-depths, yet prior methods treat each depth independently and re-estimate local context from coordinates at every level, limiting compression efficiency. We present ELiC, a real-time framework that combines cross-bit-depth feature propagation, a Bag-of-Encoders (BoE) selection scheme, and a Morton-order-preserving hierarchy. Cross-bit-depth propagation reuses features extracted at denser, lower depths to support prediction at sparser, higher depths. BoE selects, per depth, the most suitable coding network from a small pool, adapting capacity to observed occupancy statistics without training a separate model for each level. The Morton hierarchy maintains global Z-order across depth transitions, eliminating per-level sorting and reducing latency. Together these components improve entropy modeling and computation efficiency, yielding state-of-the-art compression at real-time throughput on Ford and SemanticKITTI. Code and pretrained models are available at https://github.com/moolgom/ELiCv1.
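The Morton-order-preserving hierarchy exploits a standard property of Morton (Z-order) codes: shifting a voxel's code right by 3 bits yields its parent's code at the next-lower bit-depth, so a Z-ordered list stays sorted across depth transitions with no per-level re-sort. A minimal sketch of the encoding (the paper's implementation details are not shown):

```python
def morton_encode_3d(x, y, z, bit_depth):
    """Interleave the bits of (x, y, z) into a single Morton (Z-order) code."""
    code = 0
    for i in range(bit_depth):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# Parent at depth d-1 is simply `code >> 3`: a sorted list of child codes
# maps to a still-sorted (non-decreasing) list of parent codes, which is
# what eliminates per-level sorting in the hierarchy.
codes = sorted(
    morton_encode_3d(x, y, z, 2)
    for x in range(4) for y in range(4) for z in range(4)
)
parents = [c >> 3 for c in codes]
```

This is why maintaining global Z-order across depths directly reduces latency: each coarser level's traversal order falls out of the finer level's for free.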

[687] Deep Learning for Restoring MPI System Matrices Using Simulated Training Data

Artyom Tsanda, Sarah Reiss, Konrad Scheffler, Marija Boberg, Tobias Knopp

Main category: eess.IV

TL;DR: Physics-based simulated system matrices can train deep learning models for MPI system matrix restoration tasks that generalize to measured data, addressing data scarcity.

Motivation: Magnetic particle imaging relies on system matrices obtained through time-consuming, noise-prone calibration measurements. Deep learning methods for addressing system matrix imperfections face scarcity of curated training data.

Method: Generated large dataset of simulated system matrices using equilibrium magnetization model extended with uniaxial anisotropy, spanning particle/scanner/calibration parameters for 2D/3D trajectories with injected background noise. Compared deep learning models (DnCNN/RDN/SwinIR/SMRnet/PConvUNet) with classical baselines for denoising, accelerated calibration, upsampling, and inpainting tasks.

Result: Models trained solely on simulated data generalized to measured data across all tasks: denoising models outperformed DCT-F baseline by >10 dB PSNR; 2D upsampling exceeded bicubic by 20 dB PSNR; 3D accelerated calibration matched tricubic in noiseless cases and was more robust under noise; 3D inpainting maintained quality with noise.

Conclusion: Transferability of deep learning models trained on simulations to real measurements mitigates data-scarcity problem and enables development of new methods beyond current measurement capabilities.

Abstract: Magnetic particle imaging reconstructs tracer distributions using a system matrix obtained through time-consuming, noise-prone calibration measurements. Methods for addressing imperfections in measured system matrices increasingly rely on deep neural networks, yet curated training data remain scarce. This study evaluates whether physics-based simulated system matrices can be used to train deep learning models for different system matrix restoration tasks, i.e., denoising, accelerated calibration, upsampling, and inpainting, that generalize to measured data. A large dataset of system matrices was generated using an equilibrium magnetization model extended with uniaxial anisotropy. The dataset spans particle, scanner, and calibration parameters for 2D and 3D trajectories, and includes background noise injected from empty-frame measurements. For each restoration task, deep learning models were compared with classical non-learning baseline methods. The models trained solely on simulated system matrices generalized to measured data across all tasks: for denoising, DnCNN/RDN/SwinIR outperformed the DCT-F baseline by >10 dB PSNR and up to 0.1 SSIM on simulations and led to perceptually better reconstructions of real data; for 2D upsampling, SMRnet exceeded bicubic by 20 dB PSNR and 0.08 SSIM at $\times 2$-$\times 4$, although this did not transfer qualitatively to real measurements. For 3D accelerated calibration, SMRnet matched tricubic in noiseless cases and was more robust under noise, and for 3D inpainting, biharmonic inpainting was superior when noise-free but degraded with noise, while a PConvUNet maintained quality and yielded less blurry reconstructions. The demonstrated transferability of deep learning models trained on simulations to real measurements mitigates the data-scarcity problem and enables the development of new methods beyond current measurement capabilities.

Last updated: 2026-03-27
Built with Hugo, theme modified from Stack