Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 181]
- cs.CV [Total: 177]
- cs.AI [Total: 121]
- cs.SD [Total: 17]
- cs.LG [Total: 153]
- cs.MA [Total: 12]
- cs.MM [Total: 1]
- eess.AS [Total: 4]
- eess.IV [Total: 8]
cs.CL
[1] Two-dimensional early exit optimisation of LLM inference
Jan Hůla, David Adamczyk, Tomáš Filip, Martin Pavlíček, Petr Sosík
Main category: cs.CL
TL;DR: A 2D early-exit strategy coordinating layer-wise and sentence-wise exits yields multiplicative inference savings for LLM classification, with 1.4–2.3x additional speed-ups over optimal layer-wise early exit on sentiment tasks.
Abstract: We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B-8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3$\times$ over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.
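To make the two exit dimensions concrete, below is a minimal sketch of how layer-wise and sentence-wise exiting could be coordinated for classification; `encode_layers`, `adapter_heads`, and the depth schedule are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of 2D early exit: process the input sentence by sentence
# (outer loop) while progressively allowing deeper layers (inner loop), exiting
# as soon as a lightweight per-layer adapter is confident enough.
from typing import Callable, List

def two_dimensional_early_exit(
    sentences: List[str],
    encode_layers: Callable[[str, int], List[float]],            # hidden state of the prefix at a given layer (stand-in)
    adapter_heads: List[Callable[[List[float]], List[float]]],   # per-layer classification adapters returning class probabilities
    confidence_threshold: float = 0.9,
) -> int:
    prefix = ""
    for depth, sentence in enumerate(sentences):                  # sentence-wise dimension
        prefix = (prefix + " " + sentence).strip()
        max_layer = min(len(adapter_heads), 2 * (depth + 1))      # assumed schedule: deeper layers unlock as text accumulates
        for layer in range(max_layer):                            # layer-wise dimension
            probs = adapter_heads[layer](encode_layers(prefix, layer))
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= confidence_threshold:               # exit early on both dimensions at once
                return best
    probs = adapter_heads[-1](encode_layers(prefix, len(adapter_heads) - 1))
    return max(range(len(probs)), key=probs.__getitem__)          # fall back to the deepest layer on the full input
```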
[2] Probing for Reading Times
Eleftheria Tsipidi, Samuel Kiegeland, Francesco Ignazio Re, Tianyang Xu, Mario Giulianelli, Karolina Stanczak, Ryan Cotterell
Main category: cs.CL
TL;DR: Probing LLM layers for human reading times shows early-layer representations beat surprisal on early-pass eye-tracking measures, while scalar surprisal remains better for late-pass measures; the best predictor varies by language and measure.
Abstract: Probing has shown that language model representations encode rich linguistic information, but it remains unclear whether they also capture cognitive signals about human processing. In this work, we probe language model representations for human reading times. Using regularized linear regression on two eye-tracking corpora spanning five languages (English, Greek, Hebrew, Russian, and Turkish), we compare the representations from every model layer against scalar predictors – surprisal, information value, and logit-lens surprisal. We find that the representations from early layers outperform surprisal in predicting early-pass measures such as first fixation and gaze duration. The concentration of predictive power in the early layers suggests that human-like processing signatures are captured by low-level structural or lexical representations, pointing to a functional alignment between model depth and the temporal stages of human reading. In contrast, for late-pass measures such as total reading time, scalar surprisal remains superior, despite its being a much more compressed representation. We also observe performance gains when using both surprisal and early-layer representations. Overall, we find that the best-performing predictor varies strongly depending on the language and eye-tracking measure.
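As a rough illustration of the probing setup described above, the sketch below fits a regularized linear probe per layer and compares held-out fit; the arrays are random stand-ins for per-word hidden states and eye-tracking targets, and the exact probe, regularization, and scoring in the paper may differ.

```python
# Layer-wise linear probing sketch: regress a reading-time measure on each
# layer's per-word representations and compare cross-validated R^2.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, hidden_dim, n_layers = 1000, 256, 12
layer_reps = rng.normal(size=(n_layers, n_words, hidden_dim))     # stand-in for model activations
gaze_duration = rng.gamma(shape=2.0, scale=100.0, size=n_words)   # stand-in for an early-pass measure

scores = [
    cross_val_score(Ridge(alpha=10.0), layer_reps[layer], gaze_duration, cv=5, scoring="r2").mean()
    for layer in range(n_layers)
]
print("best layer by held-out R^2:", int(np.argmax(scores)))
```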
[3] Characterizing AlphaEarth Embedding Geometry for Agentic Environmental Reasoning
Mashrekur Rahman, Samuel J. Barrett, Christina Last
Main category: cs.CL
TL;DR: Characterizes the non-Euclidean geometry of AlphaEarth's 64-dimensional embeddings (effective dimensionality ~13) and builds an agentic retrieval system whose geometric tools benefit stronger reasoning models most.
Abstract: Earth observation foundation models encode land surface information into dense embedding vectors, yet the geometric structure of these representations and its implications for downstream reasoning remain underexplored. We characterize the manifold geometry of Google AlphaEarth’s 64-dimensional embeddings across 12.1 million Continental United States samples (2017–2023) and develop an agentic system that leverages this geometric understanding for environmental reasoning. The manifold is non-Euclidean: effective dimensionality is 13.3 (participation ratio) from 64 raw dimensions, with local intrinsic dimensionality of approximately 10. Tangent spaces rotate substantially, with 84% of locations exceeding 60° and local-global alignment (mean $|\cos\theta| = 0.17$) approaching the random baseline of 0.125. Supervised linear probes indicate that concept directions rotate across the manifold, and compositional vector arithmetic using both PCA-derived and probe-derived directions yields poor precision. Retrieval instead produces physically coherent results, with local geometry predicting retrieval coherence ($R^2 = 0.32$). Building on this characterization, we introduce an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database. A five-condition ablation (120 queries, three complexity tiers) shows that embedding retrieval dominates response quality ($\mu = 3.79 \pm 0.90$ vs. $3.03 \pm 0.77$ parametric-only; scale 1–5), with peak performance on multi-step comparisons ($\mu = 4.28 \pm 0.43$). A cross-model benchmark shows that geometric tools reduce Sonnet 4.5’s score by 0.12 points but improve Opus 4.6’s by 0.07, with Opus achieving higher geometric grounding (3.38 vs. 2.64), suggesting that the value of geometric characterization scales with the reasoning capability of the consuming model.
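For readers unfamiliar with the participation ratio quoted above as the effective-dimensionality estimate, here is a small sketch of one common definition based on covariance eigenvalues; the paper's exact estimator may differ.

```python
# Participation ratio: PR = (sum_i lambda_i)^2 / sum_i lambda_i^2,
# where lambda_i are eigenvalues of the embedding covariance matrix.
import numpy as np

def participation_ratio(X: np.ndarray) -> float:
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

X = np.random.default_rng(0).normal(size=(10_000, 64))  # stand-in for 64-dimensional embeddings
print(participation_ratio(X))                            # close to 64 for isotropic data; ~13 indicates strong anisotropy
```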
[4] Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Hyunjong Ok, Suho Yoo, Jaeho Lee
Main category: cs.CL
TL;DR: Introduces the first public end-turn detection dataset and SpeculativeETD, a collaborative on-device/server inference framework that improves end-turn detection accuracy at low computational cost.
Abstract: Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) – the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.
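A minimal sketch of the speculative split described above: a cheap on-device detector proposes candidate end-of-turn points, and only those candidates are escalated to a heavier server-side classifier. `local_pause_detector` and `server_turn_classifier` are hypothetical placeholders, not the paper's GRU or Wav2vec models.

```python
# Collaborative inference sketch: the expensive server call only runs on frames
# the lightweight local detector flags as possible turn ends.
from typing import Callable, List

def speculative_etd(
    frames: List[bytes],
    local_pause_detector: Callable[[bytes], bool],          # fast on-device non-speech detector (placeholder)
    server_turn_classifier: Callable[[List[bytes]], bool],  # heavier turn-end vs. hesitation classifier (placeholder)
    context: int = 10,
) -> int:
    for i, frame in enumerate(frames):
        if local_pause_detector(frame):                      # cheap check on every frame
            window = frames[max(0, i - context): i + 1]
            if server_turn_classifier(window):               # expensive check only on candidates
                return i                                     # index where the user's turn ends
    return -1                                                # no end of turn detected
```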
[5] Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP
Thanmay Jayakumar, Deepon Halder, Raj Dabre
Main category: cs.CL
TL;DR: A survey of transliteration in cross-lingual NLP, with a taxonomy of motivations, input-integration approaches, trade-offs, and practical recommendations for choosing a transliteration strategy.
Abstract: Cross-lingual transfer in NLP is often hindered by the “script barrier”, where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations for utilizing transliteration in language models and provide an overview of different approaches to incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discuss the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings in which transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.
[6] OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey
Main category: cs.CL
TL;DR: A diffusion-language-model-style non-autoregressive zero-shot TTS model covering 600+ languages, trained on 581k hours of open data, with state-of-the-art multilingual performance.
Abstract: We present OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
[7] Investigating Counterfactual Unfairness in LLMs towards Identities through Humor
Shubin Kim, Yejin Son, Junyeong Park, Keummin Ka, Seungbeen Lee, Jaeyoung Lee, Hyeju Jang, Alice Oh, Youngjae Yu
Main category: cs.CL
TL;DR: Swapping speaker and target identities in humor tasks reveals consistent counterfactual disparities in LLM refusals, intention judgments, and social-harm ratings.
Abstract: Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model’s responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.
[8] Qwen3.5-Omni Technical Report
Qwen Team
Main category: cs.CL
TL;DR: Qwen3.5-Omni scales the Omni family to hundreds of billions of parameters with 256k context, achieves SOTA audio and audio-visual results, introduces ARIA for stable streaming speech, and exhibits emergent audio-visual vibe coding.
Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
[9] Remask, Don’t Replace: Token-to-Mask Refinement in Masked Diffusion Language Models
Lin Yao
Main category: cs.CL
TL;DR: A training-free Token-to-Mask remasking rule that resets suspect tokens to the mask state instead of overwriting them, improving masked diffusion LM accuracy on exact-output tasks (up to +5.92 points on CMATH).
Abstract: Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
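A minimal sketch of the Token-to-Mask editing rule as described: suspect positions are reset to the mask token rather than overwritten, so the next denoising step re-predicts them from an in-distribution context. The simple confidence threshold below is just one possible detection heuristic, not necessarily the paper's.

```python
# T2M remasking sketch: low-confidence committed tokens go back to [MASK]
# instead of being replaced with a new guess.
from typing import List

MASK = "[MASK]"

def t2m_remask(tokens: List[str], confidence: List[float], threshold: float = 0.3) -> List[str]:
    return [MASK if c < threshold else tok for tok, c in zip(tokens, confidence)]

tokens = ["The", "answer", "is", "42", "."]
confidence = [0.99, 0.95, 0.97, 0.12, 0.98]    # a suspect final token ("last-mile" corruption)
print(t2m_remask(tokens, confidence))           # ['The', 'answer', 'is', '[MASK]', '.']
```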
[10] Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation
Abhishek Purushothama, Emma Thronson, Alexia Guo, Amir Zeldes
Main category: cs.CL
TL;DR: Adding Universal Dependencies syntax to in-context Coptic-to-English translation helps most when combined with dictionary glosses, yielding new state-of-the-art results.
Abstract: Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions for difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, achieving new state-of-the-art translation results for Coptic.
[11] Model-Agnostic Meta Learning for Class Imbalance Adaptation
Hanshu Rao, Guangzeng Han, Xiaolei Huang
Main category: cs.CL
TL;DR: HAMR combines bi-level instance weighting with neighborhood-aware resampling to handle class imbalance and data difficulty, improving minority-class performance across six datasets.
Abstract: Class imbalance is a widespread challenge in NLP tasks, significantly hindering robust performance across diverse domains and applications. We introduce Hardness-Aware Meta-Resample (HAMR), a unified framework that adaptively addresses both class imbalance and data difficulty. HAMR employs bi-level optimizations to dynamically estimate instance-level weights that prioritize genuinely challenging samples and minority classes, while a neighborhood-aware resampling mechanism amplifies training focus on hard examples and their semantically similar neighbors. We validate HAMR on six imbalanced datasets covering multiple tasks and spanning biomedical, disaster response, and sentiment domains. Experimental results show that HAMR achieves substantial improvements for minority classes and consistently outperforms strong baselines. Extensive ablation studies demonstrate that our proposed modules synergistically contribute to performance gains and highlight HAMR as a flexible and generalizable approach for class imbalance adaptation. Code is available at https://github.com/trust-nlp/ImbalanceLearning.
[12] An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
Hanrui Luo, Shreyank N Gowda
Main category: cs.CL
TL;DR: Single-output evaluation underestimates jailbreak vulnerability; moderate multi-sample auditing improves output-based jailbreak detection, with diminishing returns at larger sampling budgets.
Abstract: Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category-level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic-specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi-sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.
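A sketch of the multi-sample auditing idea paired with a lexical detector; `generate` and the labeled training texts are assumed inputs, and this only mirrors the TF-IDF-plus-sampling setup at a high level.

```python
# Multi-generation auditing sketch: sample k responses per prompt and flag the
# prompt if any sample is classified as harmful by a lexical detector.
from typing import Callable, List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_lexical_detector(texts: List[str], labels: List[int]):
    # TF-IDF features + logistic regression as a simple harmful/benign classifier
    return make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000)).fit(texts, labels)

def audit_prompt(prompt: str, generate: Callable[[str], str], detector, k: int = 8) -> bool:
    samples = [generate(prompt) for _ in range(k)]               # moderate sampling budget
    return any(detector.predict([s])[0] == 1 for s in samples)   # any harmful sample flags the prompt
```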
[13] Mango: Multi-Agent Web Navigation via Global-View Optimization
Weixi Tong, Yifeng Di, Tianyi Zhang
Main category: cs.CL
TL;DR: Mango treats starting-URL selection as a multi-armed bandit solved with Thompson Sampling plus episodic memory, beating the best baselines by 7.3% on WebVoyager and 26.8% on WebWalkerQA.
Abstract: Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website’s structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at https://github.com/VichyTong/Mango.
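To illustrate the bandit formulation, here is a minimal Thompson Sampling sketch over candidate starting URLs with Beta posteriors; the URLs, success signal, and budget are placeholders, and Mango's reward shaping and episodic memory go beyond this.

```python
# Thompson Sampling over candidate starting URLs: keep a Beta posterior over
# each URL's success probability and allocate navigation attempts by sampling.
import random
from collections import defaultdict

class URLBandit:
    def __init__(self, urls):
        self.urls = list(urls)
        self.alpha = defaultdict(lambda: 1.0)   # Beta(1, 1) prior per URL
        self.beta = defaultdict(lambda: 1.0)

    def choose(self) -> str:
        return max(self.urls, key=lambda u: random.betavariate(self.alpha[u], self.beta[u]))

    def update(self, url: str, success: bool) -> None:
        if success:
            self.alpha[url] += 1.0
        else:
            self.beta[url] += 1.0

bandit = URLBandit(["https://example.com/", "https://example.com/docs", "https://example.com/search"])
for _ in range(10):                              # navigation budget
    url = bandit.choose()
    success = random.random() < 0.3              # placeholder for "did this episode reach the target?"
    bandit.update(url, success)
```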
[14] Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
Seyedali Mohammadi, Manas Gaur, Francis Ferraro
Main category: cs.CL
TL;DR: For LLM scientific-feasibility judgments, outcome evidence is generally more reliable than experiment descriptions, which can be brittle and hurt performance when incomplete.
Abstract: Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
[15] Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
Sinan G. Aksoy, Alexandra A. Sabrio, Erik VonKaenel, Lee Burke
Main category: cs.CL
TL;DR: A needle-in-a-haystack framework shows LLM similarity scores depend on perturbation position, context relatedness, and model identity, with each model exhibiting a stable scoring fingerprint.
Abstract: We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity. This is consistent with an interpretive frame account in which topically-related context may allow models to contextualize and downweight the alterations. Third, each LLM produces a qualitatively distinct scoring distribution, a stable “fingerprint” that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM-agnostic toolkit for auditing and comparing scoring behavior across current and future models.
[16] LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Pedro Barbosa de Carvalho Neto
Main category: cs.CL
TL;DR: On a new Brazilian legal-classification benchmark, a LoRA-fine-tuned BERTimbau strongly outperforms commercial LLMs, which show a systematic bias toward the civil-law class.
Abstract: We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.
[17] Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India
Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra
Main category: cs.CL
TL;DR: A 536-hour benchmark of unscripted telephonic speech spanning 15 Indian languages and 139 regional clusters, with spelling-variation-aware transcripts and district-level error analysis.
Abstract: Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
[18] Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
Yuefei Chen, Yihao Quan, Xiaodong Lin, Ruixiang Tang
Main category: cs.CL
TL;DR: Author names are the most hallucinated citation field across 9 models; sparse field-specific neurons causally modulate citation hallucination and can be suppressed to reduce it.
Abstract: LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
[19] Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang
Main category: cs.CL
TL;DR: A benchmark for generating executable visual workflows from natural language; current LLMs capture intent but struggle to produce correct, stable workflows, and the proposed agentic framework adds up to 5.34% resolve-rate gains.
Abstract: At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
[20] Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
Mengzhao Jia, Zhihan Zhang, Meng Jiang
Main category: cs.CL
TL;DR: RLVR aggravates reasoning-answer inconsistency; a Groupwise Ranking Reward over verifier-passed trajectories improves reliability-conditioned accuracy from 47.4% to 54.7%.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and be computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.
[21] Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Manuel Israel Cazares
Main category: cs.CL
TL;DR: A prompt-engineering study finds a single-prompt accuracy ceiling (~60–79%) for equational-theory reasoning, attributed to undecidability of the TRUE case, cognitive-load effects on weaker models, and fragile ordering effects.
Abstract: We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas – a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60–79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair-prompt-engineering
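The reported interval for AN45c is consistent with a Wilson score interval for 317/400 correct; a quick check, assuming that is the method used:

```python
# Wilson 95% confidence interval for 317/400 = 79.25% accuracy.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=317, nobs=400, alpha=0.05, method="wilson")
print(f"[{low:.1%}, {high:.1%}]")  # approximately [75.0%, 82.9%]
```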
[22] LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval
He Cheng, Yifu Wu, Saksham Khatwani, Maya Kruse, Dmitriy Dligach, Timothy A. Miller, Majid Afshar, Yanjun Gao
Main category: cs.CL
TL;DR: A hardware-aligned framework for scalable, interpretable k-hop knowledge-graph retrieval on billion-edge graphs, with large efficiency gains and downstream KG-LLM analysis.
Abstract: Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at https://github.com/LARK-NLP-Lab/LogosKG, and an online demo is available at https://lark-nlp-lab-logoskg.hf.space/.
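As a toy illustration of k-hop retrieval expressed as sparse matrix operations, the sketch below propagates a seed entity through an adjacency matrix; LogosKG's decomposed subject/relation/object layout, partitioning, routing, and hardware-aligned kernels go well beyond this.

```python
# k-hop traversal as repeated sparse matrix-vector products over a toy KG.
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (1, 2), (2, 3), (0, 4)]                 # (subject, object) pairs
n = 5
rows, cols = zip(*edges)
adj = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

frontier = np.zeros(n)
frontier[0] = 1.0                                         # seed entity
reached = frontier > 0
for _ in range(2):                                        # k = 2 hops
    frontier = adj.T.dot(frontier)                        # entities reachable in one more hop
    reached |= frontier > 0
print(np.flatnonzero(reached))                            # [0 1 2 4]
```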
[23] MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
Mehul Agarwal, Aditya Aggarwal, Arnav Goel, Medha Hira, Anubha Gupta
Main category: cs.CL
TL;DR: A benchmark for gender-aware morphological rewriting in French, Arabic, and Hindi that reveals significant gaps across 15 multilingual LLMs.
Abstract: While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.
[24] Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data
Yura Yoshida, Masato Kanai, Masataka Nakayama, Haruki Ohsawa, Yukiko Uchida, Arata Yuminaga, Gakuse Hoshina, Nobuo Sayama
Main category: cs.CL
TL;DR: An LLM-based topic-modeling method and evaluation framework targeting interpretability, specificity, and polarity consistency, applied to leadership analysis of corporate reviews with higher explanatory power for external outcomes.
Abstract: Analyzing topics extracted from text data in relation to external outcomes is important across fields such as computational social science and organizational research. However, existing topic modeling methods struggle to simultaneously achieve interpretability, topic specificity (alignment with concrete actions or characteristics), and polarity stance consistency (absence of mixed positive and negative evaluations within a topic). Focusing on leadership analysis using corporate review data, this study proposes a method leveraging large language models to generate topics that satisfy these properties, along with an evaluation framework tailored to external outcome analysis. The framework explicitly incorporates topic specificity and polarity stance consistency as evaluation criteria and examines automated evaluation methods based on existing metrics. Using employee reviews from OpenWork, a major corporate review platform in Japan, the proposed method achieves improved interpretability, specificity, and polarity consistency compared to existing approaches. In analyses of external outcomes such as employee morale, it also produces topics with higher explanatory power. These results suggest that the proposed method and evaluation framework provide a generalized approach for topic analysis in applications involving external outcomes.
[25] Disparities In Negation Understanding Across Languages In Vision-Language Models
Charikleia Moraitaki, Sarah Pan, Skyler Pulling, Gwendolyn Flusche, Kumail Alhamoud, Marzyeh Ghassemi
Main category: cs.CL
TL;DR: A human-verified multilingual negation benchmark across seven languages shows CLIP at or below chance on non-Latin scripts, MultiCLIP most robust, and mitigation effectiveness varying by language.
Abstract: Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions (“X is present”) even when the correct description contains negation (“no X”). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.
[26] A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition
Jiang Xiaobo, Dinghong Lai, Song Qiu, Yadong Deng, Xinkai Zhan
Main category: cs.CL
TL;DR: Low information density is identified as the root cause of UGC NER failures; the model-agnostic WOM module densifies sparse regions via selective back-translation, adding up to 4.5% absolute F1 and new SOTA on WNUT2017.
Abstract: Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation – employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to “attention blunting,” ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
[27] Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
Ramtin Davoudi, Kartik Thakkar, Nazanin Donyapour, Tyler Derr, Hamid Karimi
Main category: cs.CL
TL;DR: The first comprehensive evaluation of modern LLMs on authorship verification, post generation, and user-attribute inference over Twitter (X) data, with new sampling frameworks, a user study, and reproducible benchmarks.
Abstract: In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate “seen-data” bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users’ perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.
[28] STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
MinJae Jung, YongTaek Lim, Chaeyun Kim, Junghwan Kim, Kihyun Kim, Minwoo Kim
Main category: cs.CL
TL;DR: A black-box multi-agent red-teaming framework that samples attack strategies over a strategy-response multiplex network, achieving higher attack success rates at lower computational cost.
Abstract: While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM’s strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar-ai/STAR-Teaming-paper.
[29] $R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin
Main category: cs.CL
TL;DR: Reducing spatial and temporal decoding redundancy in diffusion LLMs via training-free decoding rules and redundancy-aware fine-tuning cuts decoding steps by up to 75% with comparable quality.
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.
[30] When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
Ishita Kakkar, Enze Zhang, Rheeya Uppaal, Junjie Hu
Main category: cs.CL
TL;DR: A sentence-level benchmark of 56,931 annotated sentences for tracking how harm emerges within reasoning traces; existing detectors struggle with fine-grained harmful-behavior detection.
Abstract: Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces – a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviours on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is available publicly at: https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts
[31] Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection
Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung
Main category: cs.CL
TL;DR: A role-anchored multi-agent debate framework (Politician, Scientist, Judge) with adaptive early termination improves omission-aware half-truth verification while reducing reasoning cost.
Abstract: Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at https://github.com/tangyixuan/RADAR.
[32] AlignCultura: Towards Culturally Aligned Large Language Models?
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: A two-stage pipeline builds CULTURAX, a UNESCO-grounded cultural-alignment dataset, and benchmarks models on it; culturally fine-tuned models improve joint HHH by 4–6% and reduce cultural failures by 18%.
Abstract: Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity w.r.t. the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment; yet no benchmark currently enables systematic evaluation of cultural alignment in line with UNESCO’s principles of cultural diversity w.r.t. the HHH paradigm. Therefore, to address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, the HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.
[33] RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
Hanjun Cho, Jay-Yoon Lee
Main category: cs.CL
TL;DR: A redundancy-aware framework for building realistic RAG retrieval benchmarks over highly similar corpora; on the resulting RedQA, a strong retriever's 4-hop PerfRecall@10 drops from 66.4% to 5.0–27.9%.
Abstract: Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.
[34] SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
Boyan Shi, Wei Chen, Shuyuan Zhao, Junfeng Shen, Shengnan Guo, Shaojiang Wang, Huaiyu Wan
Main category: cs.CL
TL;DR: A MoE-LoRA fine-tuning framework with a semantic-aware router and task-adaptive scaling that outperforms state-of-the-art parameter-efficient baselines on multi-task benchmarks.
Abstract: The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing in the current MoE-LoRA method fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization; (2) uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms the state-of-the-art methods and holds excellent task generalization capabilities. Code is available at https://github.com/boyan-code/SAMoRA
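A minimal sketch of the general idea of semantic-aware routing over LoRA experts, where input embeddings are scored against learned expert keys and expert LoRA updates are mixed by the resulting softmax weights; names, shapes, and the omitted task-adaptive scaling are illustrative, not SAMoRA's implementation.

```python
# Routing sketch: per-expert "semantic key" vectors score the input, and the
# softmax weights mix the experts' low-rank updates.
import torch
import torch.nn as nn

class SemanticLoRARouter(nn.Module):
    def __init__(self, hidden: int, rank: int, n_experts: int):
        super().__init__()
        self.expert_keys = nn.Parameter(torch.randn(n_experts, hidden))        # per-expert semantic anchors
        self.lora_A = nn.Parameter(torch.randn(n_experts, hidden, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:                        # x: (batch, hidden)
        weights = torch.softmax(x @ self.expert_keys.T, dim=-1)                # (batch, n_experts)
        expert_out = torch.einsum("bh,ehr,erd->bed", x, self.lora_A, self.lora_B)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)                 # weighted LoRA update (added to the frozen layer's output)

print(SemanticLoRARouter(hidden=64, rank=8, n_experts=4)(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```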
[35] Cell-Based Representation of Relational Binding in Language Models
Qin Dai, Benjamin Heinzerling, Kentaro Inui
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each "cell" corresponds to an entity–relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
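Illustrative probing sketch: the Partial Least Squares regression step described in the abstract can be approximated with scikit-learn as below. File names, data shapes, and the rounding-based readout are assumptions, not the paper's pipeline.

```python
# Minimal sketch of decoding entity/relation indices from attribute-token
# activations with PLS regression; only the use of PLS follows the abstract.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# activations: (n_examples, hidden_dim) hidden states at attribute tokens
# targets: (n_examples, 2) columns = (entity_index, relation_index)
activations = np.load("attribute_token_activations.npy")   # hypothetical file
targets = np.load("entity_relation_indices.npy")            # hypothetical file

pls = PLSRegression(n_components=8)          # low-dimensional "binding" subspace
pls.fit(activations, targets.astype(float))

# Project activations into the subspace and read off predicted indices
# (held-out data should be used in practice).
pred = np.rint(pls.predict(activations))
accuracy = (pred == targets).all(axis=1).mean()
print(f"cell decoding accuracy: {accuracy:.2f}")
```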
[36] Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference
Aby Mammen Mathew
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Neural NLI models overfit to dataset artifacts instead of truly reasoning. A hypothesis-only model achieves 57.7% accuracy on SNLI, indicating strong spurious correlations, and 38.6% of the baseline's errors can be traced to these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples on which biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement drops from 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.
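Since PoE is a standard debiasing recipe, a minimal PyTorch sketch is given below; the exact placement of lambda and the frozen hypothesis-only bias model are assumptions consistent with common PoE formulations, not the paper's verified code.

```python
# Minimal sketch of Product-of-Experts (PoE) debiasing for NLI: combine a
# frozen hypothesis-only bias model with the main model in log space during
# training, and use the main model alone at test time.
import torch.nn.functional as F

def poe_loss(main_logits, bias_logits, labels, lam=1.5):
    """Examples the bias model already answers confidently contribute
    smaller gradients to the main model, reducing artifact reliance."""
    log_p_main = F.log_softmax(main_logits, dim=-1)
    log_p_bias = F.log_softmax(bias_logits.detach(), dim=-1)  # bias model stays frozen
    combined = log_p_main + lam * log_p_bias                  # product of experts in log space
    return F.nll_loss(F.log_softmax(combined, dim=-1), labels)

# Inference uses the main model alone:
# preds = main_logits.argmax(dim=-1)
```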
[37] TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
Yilun Liu, Ruihong Qiu, Zi Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.
[38] HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing
Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Lin Fan, Yilin Zhou, Zikang Wang, Xiaotao Gu, Jie Tang, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on thousand-word-scale, open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HoWToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showing that performance cannot be improved simply by piling on input-side information.
[39] SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry, Sarfraz Ahmad, Ahmed Heakl, Dani Bouch, Momina Ahsan, Muhra AlMahri, Marwa Elsaid khalil, Yuxia Wang, Salem Lahlou, Sophia Ananiadou, Veselin Stoyanov, Jimin Huang, Xueqing Peng, Preslav Nakov, Zhuohan Xie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
[40] Detoxification for LLM: From Dataset Itself
Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing detoxification methods for large language models mainly target the post-training stage or inference time, while few tackle the source of toxicity: the dataset itself. Such training-based or controllable-decoding approaches cannot completely suppress a model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity the model learns during training. We therefore detoxify raw corpora directly with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans while preserving semantics, within our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and enabling seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: https://github.com/ntsw2001/data_detox_for_llm)
[41] Do Emotions Influence Moral Judgment in Large Language Models?
Mohammad Saim, Tianyu Jiang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.
[42] Construction of Knowledge Graph based on Language Model
Qiubai Zhu, Qingwang Wang, Haibin Yuan, Wei Chen, Tao Shen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Knowledge Graphs (KGs) can effectively integrate valuable information from massive data and have therefore been rapidly developed and widely used in many fields. Traditional KG construction methods rely on manual annotation, which often consumes considerable time and manpower, while KG construction schemes based on deep learning tend to have weak generalization capabilities. With the rapid development of Pre-trained Language Models (PLMs), PLMs have shown great potential for KG construction. This paper provides a comprehensive review of recent research advances in constructing KGs with PLMs. We explain how PLMs can use their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework based on a lightweight Large Language Model (LLM), named LLHKG, and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to that of GPT-3.5.
[43] The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models
Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics – repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers (“That’s a great question!”, “Awesome!”) to pseudo-empathetic affirmations (“I completely understand your concern”, “I’m right here to catch you”) and overused vocabulary (“delve”, “tapestry”, “nuanced”). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the “alignment tax” of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
[44] ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
Kunquan Li, Yingxue Zhang, Fandong Meng, Jinsong Su
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a “think-first-then-translate” paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a “translate-first-think-later” paradigm. Our approach develops the model’s “translate-reflect-refine” capability through reinforcement learning. In the first stage, we cultivate the model’s capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model’s first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.
[45] How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
Haoyang Chen, Yi Liu, Jianzhi Shao, Tao Zhang, Chengfu Huo, Wei Hu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces; it remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Building on this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores, combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain, disorganized reading. Experiments show that our method yields consistent accuracy gains.
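A generic activation-steering sketch of the training-free recipe the abstract describes: build a steering vector from activations of high- versus low-SRQ examples and add it during decoding. The SRQ scoring itself, the layer choice, and the scaling factor are assumptions.

```python
# Generic steering-vector construction and injection; the selection of
# examples by SRQ score is the paper's contribution and is not reproduced.
import torch

def build_steering_vector(high_srq_acts, low_srq_acts):
    """Each input: (n_examples, hidden_dim) activations at a chosen layer."""
    return high_srq_acts.mean(dim=0) - low_srq_acts.mean(dim=0)

def steer_hook(steering_vec, alpha=4.0):
    """Forward hook that nudges hidden states toward benign self-reading."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vec.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# handle = model.model.layers[20].register_forward_hook(steer_hook(vec))  # hypothetical layer index
```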
[46] Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
Hongxing Pan, Yingying Guo, Wenqing Kuang, Jiashi Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size–that is, the number of distinct meanings expressed in the sampled responses–provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
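The two ingredients named in the abstract can be sketched as follows; the fusion rule, threshold, and finite-sample correction shown are simplified placeholders rather than SHADE's actual estimator.

```python
# Illustrative components: Good-Turing coverage over semantic clusters and a
# heat-kernel trace of a normalized Laplacian over an entailment-weighted graph.
import numpy as np
from scipy.linalg import expm

def good_turing_coverage(cluster_labels):
    """Coverage ~ 1 - (#singleton semantic clusters / #samples)."""
    _, counts = np.unique(cluster_labels, return_counts=True)
    return 1.0 - (counts == 1).sum() / len(cluster_labels)

def heat_kernel_trace(weights, t=1.0):
    """weights: symmetric (n, n) entailment-weighted adjacency matrix."""
    deg = weights.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(weights)) - d_inv_sqrt @ weights @ d_inv_sqrt  # normalized Laplacian
    return np.trace(expm(-t * lap))

def fused_alphabet_size(cluster_labels, weights, cov_threshold=0.8, beta=0.5):
    cov = good_turing_coverage(cluster_labels)
    gt = len(np.unique(cluster_labels)) / max(cov, 1e-6)   # coverage-adjusted richness
    hk = heat_kernel_trace(weights)
    if cov >= cov_threshold:                               # high coverage: convex combination
        return beta * gt + (1 - beta) * hk
    return np.logaddexp(gt, hk)                            # low coverage: LogSumExp-style fusion
```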
[47] SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
Bo-Jyun Wang, Ying-Jia Lin, Hung-Yu Kao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.
[48] Headlines You Won’t Forget: Can Pronoun Insertion Increase Memorability?
Selina Meyer, Magdalena Abel, Michael Roth
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted, and their immediate contexts. Additional data and fine-grained analysis are needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.
[49] Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
Clara Lachenmaier, Hannah Bultmann, Sina Zarrieß
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.
[50] ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
Xianming Li, Zongxi Li, Tsz-fung Andrew Lee, Jing Li, Haoran Xie, Qing Li
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly into individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly to produce progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.
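A schematic reading of the depth-shared shadow module, sketched in PyTorch below; the shadow dimensionality, the recurrent update, and the additive merge are assumptions about how such a module could look, not ShadowPEFT's architecture.

```python
# One small trainable module reused at every layer, evolving a parallel
# "shadow" state that refines the frozen backbone's hidden states.
import torch
import torch.nn as nn

class SharedShadow(nn.Module):
    def __init__(self, d_model, d_shadow=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_shadow)
        self.gru = nn.GRUCell(d_shadow, d_shadow)   # evolves the shadow state across depth
        self.up = nn.Linear(d_shadow, d_model)

    def step(self, hidden, shadow_state):
        """hidden: (tokens, d_model); shadow_state: (tokens, d_shadow)."""
        shadow_state = self.gru(self.down(hidden), shadow_state)
        return hidden + self.up(shadow_state), shadow_state

# Usage inside a frozen transformer forward pass, one shared module for all layers:
# shadow = SharedShadow(d_model)
# state = torch.zeros(n_tokens, 64)
# for layer in backbone.layers:
#     hidden = layer(hidden)
#     hidden, state = shadow.step(hidden, state)
```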
[51] Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework
Alessandro Maisto
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.
[52] CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Peiqin Lin, Chenyang Lyu, Wenjiang Luo, Haotian Ye, Md Mehrab Hossain, Chunlan Ma, Shaoxiong Ji, Younes Samih, Bo Zeng, Fan Jiang, Yuanbin Cao, Dilda Duisenbek, Adrian Neo Sau Xun, Daria Pozdniakova, Liubou Misevich, Nevena Marinković, Ngoc Gia Linh Nguyen, Thi Khanh Linh Do, Sarakmatak Sophy, Baotian Hu, Guanhua Chen, Gongbo Tang, Alham Fikri Aji, Longyue Wang, Weihua Luo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks – where models must reason within real-world, context-rich scenarios – largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs’ multilingual and multicultural competence on grounded tasks. CulturALL is built via a human–AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.
[53] HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Euntae Kim, Soomin Han, Buru Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models by filling incomplete drafts with dangerous content to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading performance on co-authoring tasks. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench
[54] Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady, Sivan Eiger, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Reut Tsarfaty
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models’ inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs’ responses to LocQA locale-ambiguous questions thus reveal models’ implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs’ desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.
[55] IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
Rajveer Singh Pall
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
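The bootstrap analysis mentioned in the abstract (10,000 resamples) corresponds to a standard paired bootstrap over items; the following is a generic sketch, not the benchmark's released evaluation code.

```python
# Paired bootstrap test for comparing two models' accuracy on the same items.
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """correct_a / correct_b: boolean arrays, per-item correctness of two models."""
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)                 # resample items with replacement
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    # two-sided p-value: how often the accuracy gap crosses zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), p
```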
[56] Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Xinlin Wang, Mats Brorsson
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.
[57] Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation
Eoghan Cunningham, Derek Greene, James Cross, Antonio Rago
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.
[58] Are Large Language Models Economically Viable for Industry Deployment?
Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora, Rafiq Ali, Ebad Shabbir, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics, Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density ($\rho_{sys}$), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret), capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.
[59] DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
Jinyu Guo, Zhihan Zhang, Yutong Li, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan-Zh/DASH-KV
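A schematic sketch of hashing-based KV selection in this spirit: random-hyperplane codes below stand in for the paper's learned asymmetric encoders, and the dynamic mixed-precision mechanism is omitted.

```python
# Hash cached keys and the incoming query to binary codes, then attend only
# over the keys closest in Hamming distance.
import torch

def binary_codes(x, proj):
    """x: (n, d); proj: (d, n_bits) random hyperplanes -> boolean codes."""
    return x @ proj > 0

def select_keys(q, keys, proj, top_k=256):
    q_code = binary_codes(q.unsqueeze(0), proj)          # (1, n_bits)
    k_codes = binary_codes(keys, proj)                   # (n_keys, n_bits)
    hamming = (q_code ^ k_codes).sum(dim=-1)             # distance per cached key
    return torch.topk(hamming, k=min(top_k, keys.shape[0]), largest=False).indices

def sparse_attention(q, keys, values, proj, top_k=256):
    idx = select_keys(q, keys, proj, top_k)
    k, v = keys[idx], values[idx]
    scores = (q @ k.t()) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```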
[60] Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
Niclas Doll, Jasper Schulze Buschhoff, Shalaka Satheesh, Hammam Abdelwahab, Héctor Allende-Cid, Katrin Klug
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from $7B$ to $24B$ parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances $7B$ model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately $3.5$-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized $7B$ models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.
[61] Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Sho Hoshino, Ukyo Honda, Peinan Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.
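Self-consistency itself is straightforward to reproduce; below is a generic sketch of the sampling-and-voting loop the paper evaluates. The `generate` callable and sampling settings are placeholders.

```python
# Sample several chain-of-thought answers and take the majority-vote final answer.
from collections import Counter

def self_consistency(generate, question, n_samples=10, temperature=0.7):
    """`generate` is any callable returning (reasoning, final_answer) per call."""
    answers = []
    for _ in range(n_samples):
        _, answer = generate(question, temperature=temperature)
        answers.append(answer.strip())
    vote, count = Counter(answers).most_common(1)[0]
    return vote, count / n_samples   # answer plus its agreement rate
```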
[62] Lost in Translation: Do LVLM Judges Generalize Across Languages?
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
[63] What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
Xinhao Zhang, Xi Chen, François Portet, Maxime Peyrard
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.
[64] ‘The Order in the Horse’s Heart’: A Case Study in LLM-Assisted Stylometry for the Discovery of Biblical Allusion in Modern Literary Fiction
Ewan Cameron
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a dual-track pipeline for detecting biblical allusions in literary fiction and apply it to the novels of Cormac McCarthy. A bottom-up embedding track uses inverse document frequency to identify rare vocabulary shared with the King James Bible, embeds occurrences in their local context for sense disambiguation, and passes candidate passage pairs through cascaded LLM review. A top-down register track asks an LLM to read McCarthy’s prose without pointing it to any specific biblical passage for comparison, catching allusions not distinguished by word or phrase rarity. Both tracks are cross-validated by a long-context model that holds entire novels alongside the KJV in a single pass, and every finding is checked against published scholarship. Restricting attention to allusions that carry a textual echo–shared phrasing, reworked vocabulary, or transplanted cadence–and distinguishing literary allusions proper from signposted biblical references (similes naming biblical figures, characters overtly citing scripture), the pipeline surfaces 349 allusions across the corpus. Among a target set of 115 previously documented allusions retrieved through human review of the academic literature, the pipeline independently recovers 62 (54% recall), with recall varying by connection type from 30% (transformed imagery) to 80% (register collisions). We contextualise these results with respect to the value-add from LLMs as assistants to mechanical stylometric analyses, and their potential to facilitate the statistical study of intertextuality in massive literary corpora.
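A compact sketch of the bottom-up track's first step: surfacing rare vocabulary shared between a novel and the King James Bible via inverse document frequency. File paths, the reference corpus, and the rarity threshold are placeholders, and the embedding and LLM review stages are omitted.

```python
# Surface rare shared vocabulary as candidate allusion anchors.
import math
import re
from collections import Counter

def tokens(path):
    return re.findall(r"[a-z]+", open(path, encoding="utf-8").read().lower())

def idf(word, doc_freq, n_docs):
    return math.log(n_docs / (1 + doc_freq.get(word, 0)))

novel, kjv = set(tokens("novel.txt")), set(tokens("kjv.txt"))          # hypothetical paths
doc_freq = Counter()                                                   # word -> #documents containing it
# ... populate doc_freq from a large reference corpus ...
n_docs = 10_000

shared_rare = sorted(
    (w for w in novel & kjv if idf(w, doc_freq, n_docs) > 7.0),        # rarity threshold is illustrative
    key=lambda w: -idf(w, doc_freq, n_docs),
)
print(shared_rare[:50])   # candidate allusion-bearing vocabulary for downstream review
```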
[65] LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues
Fanyu Wang, Xiaoxi Kang, Paul Burgess, Aashish Srivastava, Chetan Arora, Adnan Trakic, Lay-Ki Soon, Md Khalid Hossain, Lizhen Qu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs’ capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts. This reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neuro component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.
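The symbolic half of the framework, as described, amounts to a sparse linear classifier over discrete factor features; below is a scikit-learn sketch under that reading. The feature files and the L1 setup are illustrative, not the paper's configuration.

```python
# Sparse (L1-regularized) linear classifier over binary question-answer factor
# features, yielding interpretable per-factor weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (n_candidates, n_factors) binary matrix, 1 if the factor question is
# answered "yes" for this candidate legal issue; y: expert relevance labels.
X = np.load("factor_features.npy")     # hypothetical file
y = np.load("relevance_labels.npy")    # hypothetical file

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# The learned weights expose which analytical factors drive relevance.
for name, w in sorted(zip(range(X.shape[1]), clf.coef_[0]), key=lambda t: -abs(t[1]))[:10]:
    print(f"factor {name}: weight {w:+.3f}")
```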
[66] Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics
Dmitry Pronin, Evgeny Kazartsev
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows’s classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows’s Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus.
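Both proposed measures have standard cores that can be sketched directly; the normalization constants, the most-frequent-word list, and the centring conventions follow the paper rather than this snippet.

```python
# Illustrative distance functions: a rank-turbulence-style divergence over
# word ranks and a Jensen-Shannon divergence over relative word frequencies.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import rankdata

def rank_turbulence(freq_a, freq_b, alpha=0.5):
    """freq_a, freq_b: aligned word-frequency vectors for two texts."""
    r_a = rankdata(-freq_a, method="average")
    r_b = rankdata(-freq_b, method="average")
    core = np.abs(r_a ** -alpha - r_b ** -alpha) ** (1.0 / (alpha + 1.0))
    return core.sum()          # unnormalized; the paper's version rescales this

def js_delta(freq_a, freq_b):
    p = freq_a / freq_a.sum()
    q = freq_b / freq_b.sum()
    return jensenshannon(p, q) ** 2   # squared distance = JS divergence
```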
[67] Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
Bowen Li, Haochen Ma, Yuxin Wang, Jie Yang, Xinchi Chen, Xuanjing Huang, Yining Zheng, Xipeng Qiu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification–its arguments, questions, and critique–rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics–particularly the recall of weakness arguments–correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
[68] Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
Tonmoy Talukder, G M Shahariar
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper introduces Bangla Key2Text, a large-scale dataset of $2.6$ million Bangla keyword–text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword–text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.
[69] Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment
Tianxiang Ma, Weijie Feng, Xinyu Wang, Zhiyong Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Emotion-Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many-to-many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion-oriented semantics from cause-oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion-side and cause-side representations, and employ optimal transport to enable many-to-many and globally consistent emotion-cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state-of-the-art performance. Our codes are released at https://github.com/CoCoSphere/SCALE.
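The global alignment step can be illustrated with a generic entropic optimal-transport (Sinkhorn) routine over the two representation spaces; the cost function, marginals, and pair-decoding rule below are assumptions, not SCALE's design.

```python
# Log-domain Sinkhorn iterations producing a soft transport plan between
# emotion-side and cause-side utterance representations.
import math
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """cost: (n_emotion, n_cause) pairwise distances; returns a transport plan."""
    log_k = -cost / eps
    log_u = torch.zeros(cost.shape[0])
    log_v = torch.zeros(cost.shape[1])
    log_a = torch.full((cost.shape[0],), -math.log(cost.shape[0]))  # uniform marginals
    log_b = torch.full((cost.shape[1],), -math.log(cost.shape[1]))
    for _ in range(n_iters):
        log_u = log_a - torch.logsumexp(log_k + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_k + log_u[:, None], dim=0)
    return torch.exp(log_u[:, None] + log_k + log_v[None, :])

# emotion_emb, cause_emb: (n_e, d), (n_c, d) decoupled utterance representations
# plan = sinkhorn(torch.cdist(emotion_emb, cause_emb))
# pairs = (plan > plan.mean()).nonzero()     # threshold the plan into emotion-cause pairs
```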
[70] Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
[71] Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.
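A rough sketch of the detection recipe, assuming per-example attention maps and hallucination labels are available; the audio-ratio feature below only loosely mirrors AUDIORATIO, and the other three metrics are omitted.
```python
# Sketch of inference-time hallucination detection from attention features.
# The feature here is the share of attention mass on audio-token positions;
# the paper's exact metric definitions may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

def audio_ratio(attn, audio_positions):
    """attn: (heads, query_len, key_len) attention map for one example."""
    mass_on_audio = attn[:, :, audio_positions].sum(axis=-1)  # (heads, q)
    return mass_on_audio.mean(axis=-1)                        # one value per head

rng = np.random.default_rng(0)
n_examples, n_heads, q_len, k_len = 200, 8, 10, 50
audio_pos = np.arange(30)  # assume the first 30 key positions are audio tokens

X = np.stack([
    audio_ratio(rng.dirichlet(np.ones(k_len), size=(n_heads, q_len)), audio_pos)
    for _ in range(n_examples)
])                                   # (examples, heads) feature matrix
y = rng.integers(0, 2, n_examples)   # placeholder hallucination labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # hallucination probability per example
```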
[72] A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.
[73] Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along dimensions such as clarity, originality, and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
[74] A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry
Silvio Calderaro, Johanna Monti
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation and metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord’s theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.
[75] RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
Mircea Timpuriu, Dumitru-Clementin Cercel
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The importance of clear and correct text in legal documents cannot be overstated. Consequently, a grammatical error correction tool meant to assist legal professionals must be able to recognise and correct possible errors in the context of a legal environment, which implicitly requires training in that same environment on realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We believe that this set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.
[76] Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Kihyuk Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
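A small sketch separating the two behaviors the abstract distinguishes, output reproducibility versus semantic similarity, over one scenario's repeated generations; TF-IDF cosine is a lightweight stand-in for the paper's semantic-similarity measure, and the texts are placeholders.
```python
# Sketch: two consistency views over repeated generations for one scenario.
# TF-IDF cosine is a lightweight stand-in for the paper's semantic similarity.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consistency_profile(outputs):
    unique_ratio = len(set(outputs)) / len(outputs)          # output reproducibility
    vecs = TfidfVectorizer().fit_transform(outputs)
    sims = [cosine_similarity(vecs[i], vecs[j])[0, 0]
            for i, j in combinations(range(len(outputs)), 2)]
    return unique_ratio, sum(sims) / len(sims)               # mean pairwise similarity

# Hypothetical repeated generations (temperature=0) for one clinical scenario.
outputs = [
    "Walk 30 minutes, 5 days per week, at moderate intensity.",
    "Walk 30 minutes, 5 days per week, at moderate intensity.",
    "Brisk walking 150 minutes weekly, spread over 5 sessions.",
] * 5
print(consistency_profile(outputs))
```
A model can score high on similarity while repeating itself verbatim (low unique ratio), or produce all-unique outputs that stay semantically stable; reporting both numbers is what distinguishes the two behaviors.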
[77] The “Small World of Words” German Free-Association Norms
Samuel Aeschbach, Rui Mata, Kaidi Lõo, Simon De Deyne, Dirk U. Wulff
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Free-association norms provide essential empirical data for investigating linguistic, semantic, and cultural phenomena in the cognitive sciences. Although large-scale norms exist for languages such as English, Dutch, Spanish, and Mandarin Chinese, no comparable resource has been available for German. To address this gap, we present free-association norms for 5,877 German cue words as part of the German version of the multilingual Small World of Words (SWOW) project. We describe the data collection procedures, participant characteristics, and our comprehensive preprocessing pipeline before introducing the resulting SWOW-DE data set. Using data from three established psycholinguistic paradigms, we show that SWOW-DE norms robustly predict performance in lexical decision tasks, relatedness judgments, and psycholinguistic word ratings. Furthermore, we demonstrate that SWOW-DE responses compare favorably with existing German resources and provide a preliminary cross-linguistic comparison revealing both shared and language-specific association patterns, highlighting promising directions for future research. Overall, SWOW-DE represents the largest collection of German free associations to date and offers a unique resource for linguistic, psychological, and cross-cultural research.
[78] Micro Language Models Enable Instant Responses
Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models ($μ$LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that $μ$LMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
[79] The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
Andrew Hong, Jason Potteiger, Luis E. Zapata
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that “prompt engineering helps a little” but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.
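A tiny sketch of the within +/-1 agreement metric the study reports; the rating values below are placeholders, not data from the surveys.
```python
# Sketch: "within +/-1 agreement" between LLM-predicted and fan-reported ratings.
# All values are hypothetical.
def within_one_agreement(predicted, reported):
    hits = sum(abs(p - r) <= 1 for p, r in zip(predicted, reported))
    return hits / len(reported)

predicted = [7, 5, 9, 3, 8, 6]   # hypothetical LLM predictions from survey text
reported  = [8, 5, 6, 4, 8, 9]   # hypothetical fan-reported experience ratings
print(f"{within_one_agreement(predicted, reported):.0%}")  # e.g. 67%
```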
[80] Pause or Fabricate? Training Language Models for Grounded Reasoning
Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, Xu Tan, Yao Hu, Daoxin Zhang, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions – a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness – the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
[81] Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Nurkhan Laiyk, Gerard I. Gállego, Javier Ferrando, Fajri Koto
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English$\rightarrow$Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.
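A minimal sketch of the steering mechanics only, assuming a translation function vector has already been extracted (e.g., by averaging task-relevant attention-head outputs, as in prior FV work); the model name, layer index, and zero vector are placeholders, not the paper's setup.
```python
# Sketch: injecting a precomputed function vector (FV) into the residual stream
# at one decoder layer via a forward hook. Model, layer, and FV are placeholders;
# FV extraction is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a multilingual decoder-only LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6
fv = torch.zeros(model.config.hidden_size)  # placeholder translation FV

def add_fv(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + fv                    # add FV at every position
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(add_fv)
ids = tok("The word 'cat' in French is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=5)
handle.remove()
print(tok.decode(out[0]))
```
The ablation direction works the same way: subtracting the projection of the hidden state onto the FV instead of adding the vector.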
[82] An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA
Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.
[83] Epistemic orientation in parliamentary discourse is associated with deliberative democracy
Segun Aroyehun, Stephan Lewandowsky, David Garcia
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence–Minus–Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
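A toy sketch of how per-segment ratings could be aggregated into an Evidence-Minus-Intuition score; the cue-counting rater is a stand-in for the paper's LLM- and embedding-based scoring, and the speeches are invented.
```python
# Sketch: aggregating an Evidence-Minus-Intuition (EMI) style score.
# `rate_segment` is a placeholder for the paper's LLM / embedding-based rater,
# assumed to return evidence and intuition scores on a common scale.
def rate_segment(text):
    evidence_cues = ("data", "study", "evidence", "statistics", "report")
    intuition_cues = ("believe", "feel", "common sense", "obviously", "gut")
    t = text.lower()
    return (sum(c in t for c in evidence_cues),
            sum(c in t for c in intuition_cues))

def emi(segments):
    scores = [rate_segment(s) for s in segments]
    return sum(e - i for e, i in scores) / len(scores)

speech = [
    "The latest employment statistics and the committee's report support this bill.",
    "I simply believe, as a matter of common sense, that this policy will fail.",
]
print(emi(speech))  # positive = more evidence-oriented, negative = more intuition-oriented
```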
[84] Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views
Feihao Fang, My T. Thai, Yuanyuan Lei
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers the LLM’s reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.
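A compact sketch of the subspace-learning step, assuming paired residual activations from natural-language and symbolic reasoning chains are available; the random features, layer choice, and the simple additive steering at the end are illustrative only, not the paper's procedure.
```python
# Sketch: learning a shared low-dimensional subspace from paired activations of
# natural-language and symbolic reasoning chains with CCA. Activations here are
# random placeholders for residual-stream features at a chosen layer.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs, d_model, k = 500, 256, 8

nl_acts = rng.normal(size=(n_pairs, d_model))                         # natural-language view
sym_acts = 0.5 * nl_acts + 0.5 * rng.normal(size=(n_pairs, d_model))  # symbolic view

cca = CCA(n_components=k).fit(nl_acts, sym_acts)
nl_proj, sym_proj = cca.transform(nl_acts, sym_acts)   # coordinates in the shared subspace

# The NL-side loadings span a candidate "logical subspace"; boosting a new
# activation's component along those directions is one simple way to steer.
W = cca.x_weights_                                      # (d_model, k)
h = rng.normal(size=d_model)                            # new residual activation
h_steered = h + 0.5 * (W @ (W.T @ h))
print(nl_proj.shape, h_steered.shape)
```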
[85] Persuasion with Large Language Models: A Survey of Empirical Evidence, Study Methodologies, and Ethical Implications
Sander Noels, Alexander Rogiers, Maarten Buyl, Tijl De Bie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid rise of Large Language Models (LLMs) has created new disruptive possibilities for persuasive communication, enabling fully-automated, personalized, and interactive content generation at an unprecedented scale. In this paper, we survey the emerging field of LLM-based persuasion, reviewing empirical studies that measure the influence of LLM systems on human attitudes and behaviors. We categorize applications across domains such as politics, marketing, public health, e-commerce, and charitable giving, finding that such systems have frequently achieved human-level or even superhuman persuasiveness. Synthesizing recent evidence, we identify key factors influencing this effectiveness, including the interaction approach, model scale and capability, prompt design, personalization, and AI source disclosure. Furthermore, we critically examine the experimental designs and success metrics used to evaluate these systems, distinguishing between direct behavioral outcomes and proxy indicators. Our survey suggests that the current capabilities of LLM-based persuasion pose profound ethical and societal risks, including to information integrity, fairness and inclusion, privacy, and individual autonomy. These risks underscore the urgent need for ethical guidelines and updated regulatory frameworks to avoid the widespread deployment of irresponsible and harmful LLM systems.
[86] FoNE: Precise Single-Token Number Embeddings via Fourier Features
Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model’s performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64$\times$ less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3$\times$ and 6$\times$ fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.
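A small sketch of the Fourier-per-digit idea, assuming two dimensions (cos, sin) for each digit position, each encoding the number modulo 10, 100, 1000, and so on; FoNE's exact frequencies and scaling may differ.
```python
# Sketch of a Fourier-style number embedding: two dimensions (cos, sin) per
# digit position. Illustrative only; FoNE's exact construction may differ.
import numpy as np

def fourier_number_embedding(x, n_digits=6):
    feats = []
    for k in range(n_digits):
        period = 10 ** (k + 1)
        angle = 2 * np.pi * (x % period) / period
        feats.extend([np.cos(angle), np.sin(angle)])
    return np.array(feats)           # shape: (2 * n_digits,)

e = fourier_number_embedding(123456)
print(e.shape)                       # (12,) -> a compact single-token embedding slice

# Each (cos, sin) pair is decodable: the phase of pair k recovers x mod 10**(k+1),
# so the number survives the compression without being split into sub-tokens.
```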
[87] Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective
Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
[88] TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
Vihang Pancholi, Jainit Bafna, Tejas Anvekar, Manish Shrivastava, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating tables qualitatively and quantitatively poses a significant challenge, as standard metrics often overlook subtle structural and content-level discrepancies. To address this, we propose a rubric-based evaluation framework that integrates multi-level structural descriptors with fine-grained contextual signals, enabling more precise and consistent table comparison. Building on this, we introduce TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference and predicted tables structurally via TabAlign, then performs semantic and syntactic comparison using TabCompare, offering interpretable and granular feedback. We evaluate TabXEval on TabXBench, a diverse, multi-domain benchmark featuring realistic table perturbations and human annotations. A sensitivity-specificity analysis further demonstrates the robustness and explainability of TabXEval across varied table tasks. Code and data are available at https://coral-lab-asu.github.io/tabxeval/
[89] StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionally with simple subword-level tasks like ‘How many r’s in strawberry?’. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: github.com/anyasims/stochastok.
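A rough sketch of stochastic token splitting, assuming a Hugging Face-style tokenizer with encode/decode; StochasTok's exact splitting procedure may differ.
```python
# Sketch of stochastic token splitting: with probability p, a token id is
# replaced by sub-token sequences that decode to the same surface string.
# Tokenizer handling is simplified; StochasTok's procedure may differ.
import random

def stochastok(token_ids, tokenizer, p=0.1, seed=0):
    rng = random.Random(seed)
    out = []
    for tid in token_ids:
        text = tokenizer.decode([tid])
        if len(text) > 1 and rng.random() < p:
            cut = rng.randint(1, len(text) - 1)            # random split point
            out += tokenizer.encode(text[:cut], add_special_tokens=False)
            out += tokenizer.encode(text[cut:], add_special_tokens=False)
        else:
            out.append(tid)
    return out

# Hypothetical usage with a Hugging Face tokenizer:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# ids = tok.encode("strawberry", add_special_tokens=False)
# print(tok.decode(stochastok(ids, tok, p=1.0)))   # same text, finer-grained tokens
```
Because the surface string is unchanged, the same training text exposes the model to many alternative segmentations, which is how it gets to "see" the internal structure of words.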
[90] PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Hengzhi Li, Justin Zhang, Brendon Jiang, Alexander Naehu, Regan Song, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance in open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final answer accuracy. On PuzzleWorld, the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
[91] Improving the Distributional Alignment of LLMs using Supervision
Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The ability to accurately align LLMs with diverse population groups on subjective questions would have great value. In this work, we show that adding simple supervision can more consistently improve the alignment of LLM-generated distributions with diverse population groups, as measured across three datasets spanning public health, public opinion, and values and beliefs. Beyond evaluating average alignment, we also report how alignment varies across specific groups. Our broad findings provide insights into the distributional alignment of LLM generations with diverse populations. By conducting evaluation over many LLMs and prompting strategies, we provide a benchmark to stimulate future research.
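A minimal sketch of one way distributional alignment can be scored, comparing an LLM's answer distribution with a group's survey distribution over the same options; the total-variation-based measure and the numbers are assumptions for illustration, not necessarily the paper's metric.
```python
# Sketch: comparing an LLM's answer distribution to a group's survey
# distribution over the same options. 1 - total variation is one common
# alignment score; the paper's exact metric may differ.
import numpy as np

def alignment(llm_dist, group_dist):
    llm = np.asarray(llm_dist, dtype=float); llm /= llm.sum()
    grp = np.asarray(group_dist, dtype=float); grp /= grp.sum()
    return 1.0 - 0.5 * np.abs(llm - grp).sum()   # 1 = identical, 0 = disjoint

# Hypothetical 4-option opinion question.
llm_answers   = [0.10, 0.25, 0.40, 0.25]   # from repeated / sampled LLM responses
group_answers = [0.05, 0.30, 0.45, 0.20]   # from survey data for one group
print(f"alignment = {alignment(llm_answers, group_answers):.2f}")
```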
[92] Accelerating Prefilling via Decoding-time Contribution Sparsity
Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone. Our code is released at https://aka.ms/TriangleMix.
[93] SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance – i.e., situating a chunk’s meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
[94] Comparing energy consumption and accuracy in text classification inference
Johannes Zschache, Tilman Hartwig
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that in some contexts the best-performing model in terms of accuracy can also be energy-efficient. While LLMs tend to consume significantly more energy than traditional machine learning models, they show the same or even lower levels of accuracy in our zero-shot classification setting. We observe substantial variability in inference energy consumption ($<$mWh to $>$kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. Our findings demonstrate that energy efficiency and accuracy represent distinct evaluation dimensions that do not necessarily align. We argue that sustainable AI development requires systematic evaluation of both performance and resource efficiency.
[95] A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, checkboxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.
[96] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We propose a general-purpose approach for improving the ability of large language models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian experimental design with large language models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) with respect to a variable of interest given the responses gathered previously. We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM’s predictive distributions and provide detailed insights into key decisions in its construction and updating procedure. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20 Questions game and using the LLM to actively infer user preferences, compared to purely prompting-based design generation and other adaptive design strategies.
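A small sketch of the EIG-based question selection, with hand-written answer distributions standing in for the probabilistic model built from the LLM's predictive distributions; the hypotheses, questions, and probabilities are placeholders.
```python
# Sketch: choosing the next question by expected information gain (EIG),
#   EIG(q) = H[ mean_theta p(y | theta, q) ] - mean_theta H[ p(y | theta, q) ],
# where theta are hypotheses sampled from the current posterior and
# p(y | theta, q) comes from the LLM's predictive distribution. Both are
# stubbed with hand-written probabilities for a 20-Questions-style setup.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def eig(answer_dists):
    """answer_dists: (n_hypotheses, n_answers) rows are p(y | theta, q)."""
    marginal = answer_dists.mean(axis=0)
    return entropy(marginal) - np.mean([entropy(row) for row in answer_dists])

# Two candidate yes/no questions, evaluated under 3 sampled hypotheses.
q_informative = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])  # answer depends on theta
q_useless     = np.array([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])  # same answer regardless

print(eig(q_informative), eig(q_useless))  # the informative question scores higher
```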
[97] InsideOut: Measuring and Mitigating Insider-Outsider Bias in Interview Script Generation
Yixin Wan, Xingrun Chen, Kai-Wei Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Advancements in large language models (LLMs) have enabled a variety of downstream applications like story and interview script generation. However, recent research has raised concerns about culture-related fairness issues in LLM-generated content. In this work, we identify and systematically investigate LLMs’ insider-outsider bias, a phenomenon where models position themselves as “insiders” of mainstream cultures during generation while externalizing less dominant cultures. We propose the InsideOut benchmark with 4,000 generation prompts and three evaluation metrics to quantify this bias through a culturally situated interview script generation task, in which an LLM is positioned as a reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals that while models adopt insider tones in over 88% of US-contexted scripts on average, they disproportionately default to “outsider” stances for non-Western cultures. To mitigate these biases, we propose 2 inference-time methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of a Single-Agent (MFA-SA), a Hierarchical-Agent (MFA-HA), and an autonomous Agentic Planning (MFA-Plan) pipeline. Empirical results demonstrate that agent-based MFA methods achieve outstanding and robust performance in mitigating the insider-outsider bias: For instance, on the Cultural Alignment Gap (CAG) metric, MFA-SA reduces bias in the Llama model by 89.70% and MFA-HA mitigates bias in Qwen by 82.54%. These findings showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
[98] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae, Bilge Acun, Chien-Yu Lin, Haroun Habeeb, Seungyeon Kim, Liang Luo, Junjie Wang, Carole-Jean Wu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent progress in large language models demonstrates that hybrid architectures–combining self-attention mechanisms with structured state space models like Mamba–can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
[99] A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks
Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent’s planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. Moreover, we further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.
[100] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
Shuliang Liu, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Minghe Yu, Yu Gu, Chong Chen, Huiyuan Xie, Ge Yu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.
[101] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent’s responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
[102] VISTA: Verification In Sequential Turn-based Assessment
Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hallucination–defined here as generating statements unsupported or contradicted by available evidence or conversational context–remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
[103] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.
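A rough sketch of the pairwise reasoning comparison the abstract describes, with the judge replaced by a toy heuristic; the marker-counting judge and the example traces are illustrative assumptions, not the paper's evaluation suite.

```python
# Toy pairwise comparison of reasoning traces; the heuristic judge is a placeholder.
from typing import List, Tuple

def judge_prefers_first(reasoning_a: str, reasoning_b: str) -> bool:
    """Stand-in judge: prefer the trace with more explicit justification markers.
    In practice an LLM judge would score coherence and justification."""
    markers = ("because", "therefore", "hence", "it follows")
    count = lambda t: sum(t.lower().count(m) for m in markers)
    return count(reasoning_a) >= count(reasoning_b)

def no_tool_win_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of problems where the non-tool trace (first element) wins the comparison."""
    wins = sum(judge_prefers_first(no_tool, with_tool) for no_tool, with_tool in pairs)
    return wins / len(pairs)

pairs = [
    ("x = 3 because 2x + 1 = 7, therefore x = 3.",
     "The code printed 3, so the answer is 3."),
    ("The sum telescopes, hence it equals 1 - 1/n.",
     "Running the loop gives 0.99, so the answer is 0.99."),
]
print(no_tool_win_rate(pairs))
```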
[104] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Gen Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Wenhai Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.11793: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11793&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[105] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.02304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[106] RepIt: Steering Language Models with Concept-Specific Refusal Vectors
Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.13281: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13281&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[107] Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning
Jinlong Liu, Mohammed Bahja, Venelin Kovatchev, Mark Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.05747: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05747&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[108] TabReX : Tabular Referenceless eXplainable Evaluation
Tejas Anvekar, Junha Park, Aparna Garimella, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.15907: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15907&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[109] FaithLens: Detecting and Explaining Faithfulness Hallucination
Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.20182: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[110] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading
Kumar Satvik Chaudhary, Chengshuai Zhao, Fan Zhang, Garima Agrawal, Yuli Deng, Huan Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.20817: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20817&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[111] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation
Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.02993: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02993&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[112] Do LLMs Encode Functional Importance of Reasoning Tokens?
Janvijay Singh, Dilek Hakkani-Tür
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[113] STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning
Juntong Ni, Shiyu Wang, Qi He, Ming Jin, Wei Jin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03248: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03248&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[114] SAGE-32B: Agentic Reasoning via Iterative Distillation
Basab Jha, Firoj Paudel, Ujjwal Puri, Ethan Henkel, Zhang Yuting, Mateusz Kowalczyk, Mei Huang, Choi Donghyuk, Wang Junhao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.04237: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04237&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[115] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences
Arkadiusz Modzelewski, Paweł Golik, Anna Kołos, Giovanni Da San Martino
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.04925: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04925&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[116] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
Minda Zhao, Yilun Du, Mengyu Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.05414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[117] OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2502.16161: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.16161&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[118] Reasoning Models Will Sometimes Lie About Their Reasoning
William Walden, Miriam Wanner
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.07663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[119] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.09953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.09953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[120] LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding
Yichen Jiang, Jiakang Yuan, Chongjun Tu, Peng Ye, Tao Chen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.11913: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11913&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[121] On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation
Weichuan Wang, Mingyang Liu, Linqi Song, Chen Ma
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.13729: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13729&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[122] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.20279: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20279&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[123] Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models
Hyunjong Ok, Jaeho Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.14152: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14152&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[124] OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.18871: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18871&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[125] Sentipolis: Emotion-Aware Agents for Social Simulations
Chiyuan Fu, Lyuhao Chen, Yunze Xiao, Weihao Xuan, Carlos Busso, Mona Diab
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18027: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18027&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[126] Multi-Persona Thinking for Bias Mitigation in Large Language Models
Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.15488: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15488&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[127] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs
Tristan Williams, Franziska Weeber, Sebastian Padó, Alan Akbik
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.15755: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15755&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[128] Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning
Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[129] Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
Jingxuan Chen, Mohammad Taher Pilehvar, Jose Camacho-Collados
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.22608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[130] One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization
Franziska Weeber, Vera Neplenbroek, Jan Batzner, Sebastian Padó
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18572: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18572&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[131] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Boammani Aser Lompo, Marc Haraoui
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.07966: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.07966&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[132] Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents
Thanh Luong Tuan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.00555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[133] Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study
Ali El Lahib, Ying-Jieh Xia, Zehan Li, Yuxuan Wang, Xinyu Pi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.00758: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00758&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[134] ChemPro: A Progressive Chemistry Benchmark for Large Language Models
Aaditya Baranwal, Shruti Vyas
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.03108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[135] Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong, Aditi Raghunathan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.00161: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.00161&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[136] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Yida Ding, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Jiashuo Liu, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Pengbo Niu, Yueyan Qiu, Yanle Ren, Xinyu Shen, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Chun Zhang, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu, Shanshan Wu, Qi Zhao, Wenhao Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02368: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02368&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[137] Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models
Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.05437: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05437&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[138] Investigating the structure of emotions by analyzing similarity and association of emotion words
Fumitaka Iwaki, Tatsuji Takahashi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.06430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[139] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.09642: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09642&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[140] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
Jiale Zhao, Ke Fang, Lu Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11199: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11199&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[141] Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.19991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[142] ConFu: Contemplate the Future for Better Speculative Sampling
Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.08899: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08899&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[143] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.11665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[144] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering
Tianyi Huang, Ying Kai Deng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.16091: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16091&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[145] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection
Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jie Zhou, Jiwen Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.21298: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21298&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[146] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.24472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[147] Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval
Zhiyuan Cheng, Longying Lai, Yue Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.26815: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26815&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[148] Article and Comment Frames Shape the Quality of Online Comments
Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.27889: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27889&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[149] PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Caio Vicentino
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.29078: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29078&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[150] Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[151] CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Ahmed Heakl, Gustavo Bertolo Stahl, Sarim Hashmi, Seung Hun Eddie Han, Mukul Ranjan, Arina Kharlamova, Salman Khan, Abdulrahman Mahmoud
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.16968: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16968&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[152] VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
Bo Kang, Sander Noels, Tijl De Bie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.03261: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03261&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[153] Multilingual Language Models Encode Script Over Linguistic Structure
Aastha A K Verma, Anwoy Chatterjee, Mehak Gupta, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.05090: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05090&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[154] VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.12735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[155] NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
Cong Ming, Ruixin Shi, Yifan Hu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10401: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10401&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[156] A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11582: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11582&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[157] Evaluating Cooperation in LLM Social Groups through Elected Leadership
Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z. Leibo, Zhijing Jin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11721: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11721&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[158] Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research
Taehun Kim, Hyeryun Park, Hyeonhoon Lee, Yushin Lee, Kyungsang Kim, Hyung-Chul Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12258: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12258&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[159] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, Qipeng Guo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14164: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14164&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[160] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
Naman Ahuja, Saniya Mulla, Muhammad Ali Khan, Zaryab Bin Riaz, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[161] The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.15702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.15702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[162] Cross-Family Speculative Decoding for Polish Language Models on AppleSilicon: An Empirical Evaluation of Bielik11B with UAG-Extended MLX-LM
Krzysztof Fonal
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16368: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16368&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[163] IYKYK (But AI Doesn’t): Automated Content Moderation Does Not Capture Communities’ Heterogeneous Attitudes Towards Reclaimed Language
Christina Chance, Rebecca Pattichis, Arjun Subramonian, James He, Shruti Narayanan, Saadia Gabriel, Kai-Wei Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[164] No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs
Wei-Chi Wu, Sheng-Lun Wei, Hen-Hsen Huang, Hsin-Hsi Chen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16937: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16937&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[165] SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction
Hangxiao Zhu, Yuyu Zhang, Ping Nie, Yu Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17141: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17141&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[166] REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
Seungmin Lee, Jeonghwan Lee, Hyunkuk Lim, Sejoon Kim, Mingi Sung
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17257: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17257&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[167] Cat-DPO: Category-Adaptive Safety Alignment
Tiankai Yang, Yi Nian, Xinyuan Li, Ruiyao Xu, Kaize Ding, Yue Zhao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17299: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17299&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[168] Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Finn Schmidt, Jan Philip Wahle, Terry Ruas, Bela Gipp
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17393: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17393&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[169] On the Emergence of Syntax by Means of Local Interaction
Zichao Wei
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17857: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17857&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[170] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
Sua Lee, Sanghee Park, Jinbae Im
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18164: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18164&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[171] STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
Sungeun An, Swanand Ravindra Kadhe, Shailja Thakur, Chad DeLuca, Hima Patel
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[172] AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment
Yixuan Wang, Yue Huang, Hong Qian, Yunzhao Wei, Yifei Ding, Wenkai Wang, Zhi Liu, Zhongjing Huang, Aimin Zhou, Jiajun Guo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18398: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18398&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[173] MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation
Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18509: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18509&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[174] Position: LLM Watermarking Should Align Stakeholders’ Incentives for Practical Adoption
Yepeng Liu, Xuandong Zhao, Dawn Song, Gregory W. Wornell, Yuheng Bu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.18333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[175] ContextLeak: Auditing Leakage in Private In-Context Learning Methods
Jacob Choi, Shuying Cao, Xingjian Dong, Amin Banayeeanzade, Wang Bill Zhu, Robin Jia, Sai Praneeth Karimireddy
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.16059: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16059&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[176] DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[177] Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey
Yasmin Moslem, John D. Kelleher
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.04445: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04445&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[178] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
Alexandra Dragomir, Ioana Pintilie, Antonio Barbalau, Marius Dragoi, Florin Brad, Cristian Daniel Paduraru, Alexandru Tifrea, Elena Burceanu, Radu Tudor Ionescu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16171: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16171&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[179] Why AI Readiness Is an Organizational Learning Problem, Not a Technology Purchase
Jeanne McClure, Gregg Gerdau
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16369: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16369&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[180] Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations
Yunjia Xi, Menghui Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18146: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18146&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[181] Sessa: Selective State Space Attention
Liubomyr Horbatko
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18580: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18580&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[182] Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Xin Hu, Ke Qin, Wen Yin, Yuan-Fang Li, Ming Li, Tao He
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
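For readers unfamiliar with flow matching, the snippet below sketches how training targets are formed for the continuous (box) part of such a hybrid state under the common linear-interpolation path; the box format and the model that would consume these pairs are assumptions, not FlowSG's implementation.

```python
# Numpy sketch of flow-matching training pairs for box geometry under a linear path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0. The conditional model
# that predicts the velocity (the paper's graph Transformer) is out of scope here.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(target_boxes: np.ndarray, t: float):
    """target_boxes: (N, 4) ground-truth boxes (cx, cy, w, h), normalized to [0, 1]."""
    x0 = rng.normal(size=target_boxes.shape)      # noised graph geometry at t = 0
    x_t = (1.0 - t) * x0 + t * target_boxes       # point on the straight path at time t
    velocity = target_boxes - x0                  # d x_t / d t along this path
    return x_t, velocity

boxes = np.array([[0.5, 0.5, 0.2, 0.3],
                  [0.1, 0.8, 0.05, 0.1]])
x_t, v = flow_matching_pair(boxes, t=0.3)
# A model v_theta(x_t, t, image) would be trained to regress v with a mean-squared error.
print(x_t.shape, v.shape)
```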
[183] AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
Jiagao Hu, Daiguo Zhou, Danzhen Fu, Fuhao Li, Zepeng Wang, Fei Wang, Wenhua Liao, Jiayi Xie, Haiyang Sun
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic–structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: https://github.com/higherhu/AutoAWG
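One plausible reading of the vanishing point-anchored temporal synthesis is sketched below: a pseudo sequence is built from a static frame by cropping progressively toward the vanishing point so that forward motion is simulated. The crop schedule and parameters are guesses for illustration only, not AutoAWG's procedure.

```python
# Illustrative pseudo-sequence construction: crops shrink toward a vanishing point so the
# static frame appears to move forward. All parameters are guesses; resizing the crops back
# to a common resolution is assumed to happen afterwards.
import numpy as np

def vp_anchored_sequence(frame: np.ndarray, vp_xy, num_frames: int = 8, max_zoom: float = 0.3):
    """frame: (H, W, 3) image array; vp_xy: vanishing point in pixel coordinates."""
    h, w = frame.shape[:2]
    vx, vy = vp_xy
    crops = []
    for k in range(num_frames):
        z = 1.0 - max_zoom * k / max(num_frames - 1, 1)   # crop scale shrinking from 1.0
        cw, ch = int(w * z), int(h * z)
        # Keep the vanishing point at the same relative position inside every crop.
        x0 = int(np.clip(vx - cw * vx / w, 0, w - cw))
        y0 = int(np.clip(vy - ch * vy / h, 0, h - ch))
        crops.append(frame[y0:y0 + ch, x0:x0 + cw])
    return crops

frame = np.zeros((360, 640, 3), dtype=np.uint8)           # placeholder static frame
sequence = vp_anchored_sequence(frame, vp_xy=(320, 150))
print([c.shape for c in sequence])
```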
[184] Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses
Maximilian Haug, Christian Stippel, Lukas Pscherer, Benjamin Schwendinger, Ralph Hoch, Angel Gaydarov, Sebastian Schlund, Thilo Sauter
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Ensuring human safety is of paramount importance in warehouse environments that feature mixed traffic of human workers and autonomous mobile robots (AMRs). Current approaches often treat humans as generic dynamic obstacles, leading to conservative AMR behaviors like slowing down or detouring, even when workers are fully aware and capable of safely sharing space. This paper presents a real-time vision-based method to estimate human awareness of an AMR using a single RGB camera. We integrate state-of-the-art 3D human pose lifting with head orientation estimation to ascertain a human’s position relative to the AMR and their viewing cone, thereby determining if the human is aware of the AMR. The entire pipeline is validated using synthetically generated data within NVIDIA Isaac Sim, a robust physics-accurate robotics simulation environment. Experimental results confirm that our system reliably detects human positions and their attention in real time, enabling AMRs to safely adapt their motion based on human awareness. This enhancement is crucial for improving both safety and operational efficiency in industrial and factory automation settings.
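The core awareness test reduces to a simple geometric check: is the AMR inside the worker's viewing cone given the estimated head orientation? A minimal sketch follows, assuming ground-plane positions and a fixed cone half-angle (the threshold is an assumption, not the paper's value).

```python
# Ground-plane viewing-cone check; the 60 degree half-angle is an assumed threshold.
import math

def robot_in_view_cone(human_xy, head_yaw_rad, robot_xy, half_angle_deg: float = 60.0) -> bool:
    """human_xy, robot_xy: (x, y) positions on the ground plane; head_yaw_rad: gaze direction."""
    dx, dy = robot_xy[0] - human_xy[0], robot_xy[1] - human_xy[1]
    angle_to_robot = math.atan2(dy, dx)
    # Smallest signed difference between the gaze direction and the direction to the robot.
    diff = math.atan2(math.sin(angle_to_robot - head_yaw_rad),
                      math.cos(angle_to_robot - head_yaw_rad))
    return abs(math.degrees(diff)) <= half_angle_deg

# Worker at the origin looking along +x; a robot ahead and slightly to the side is "seen".
print(robot_in_view_cone((0.0, 0.0), head_yaw_rad=0.0, robot_xy=(3.0, 0.5)))   # True
# A robot directly behind the worker is not, so the AMR would slow down or detour.
print(robot_in_view_cone((0.0, 0.0), head_yaw_rad=0.0, robot_xy=(-2.0, 0.0)))  # False
```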
[185] COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The goal of creating intelligent, human-centered wearable systems for continuous activity understanding faces a fundamental trade-off: Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR), but their high power consumption, privacy concerns, and dependence on lighting limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient, privacy-preserving alternative, yet lack large-scale annotated datasets, leading to weaker generalization. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers semantic knowledge from video to IMU without requiring labels. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue to align the feature distributions of video and IMU embeddings. This enables the IMU encoder to inherit rich semantic structure from video while maintaining its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets show that COMODO consistently improves downstream performance, matching or surpassing fully supervised models, and demonstrating strong cross-dataset generalization. Benefiting from its simplicity and flexibility, COMODO is compatible with diverse pretrained video and time-series models, offering the potential to leverage more powerful teacher and student foundation models in future ubiquitous computing research. The code is available at this repository: https://github.com/cruiseresearchgroup/COMODO.
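A minimal numpy sketch of label-free distribution alignment over an instance queue, one common way to distill a frozen teacher into a student of another modality; temperatures, queue size, and the loss form are assumptions rather than COMODO's exact recipe.

```python
# Numpy sketch: student/teacher similarity distributions over a queue of teacher embeddings,
# matched with a cross-entropy. Temperatures and dimensions are placeholders.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def queue_distillation_loss(student_emb, teacher_emb, queue, t_student=0.1, t_teacher=0.05):
    """student_emb, teacher_emb: (B, D) L2-normalized embeddings of the same clips.
    queue: (K, D) L2-normalized teacher embeddings of past instances."""
    p_teacher = softmax(teacher_emb @ queue.T / t_teacher)   # (B, K) target distribution
    p_student = softmax(student_emb @ queue.T / t_student)   # (B, K) student distribution
    return float(-(p_teacher * np.log(p_student + 1e-9)).sum(axis=1).mean())

rng = np.random.default_rng(0)
l2 = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
queue = l2(rng.normal(size=(256, 64)))    # dynamic instance queue (teacher features)
video = l2(rng.normal(size=(8, 64)))      # frozen video-encoder outputs (teacher)
imu = l2(rng.normal(size=(8, 64)))        # trainable IMU-encoder outputs (student)
print(round(queue_distillation_loss(imu, video, queue), 4))
```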
[186] StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network
Quanling Zhao, Meng’en Qin, Yanfeng Sun, Yuan Miao, Xiaohui Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Stomata play a crucial role in regulating plant physiological processes and reflecting environmental responses. However, accurate and high-throughput stomatal phenotyping remains challenging, as conventional approaches rely on destructive sampling and manual annotation, restricting large-scale and field deployment. To overcome these limitations, a noninvasive restoration-detection integrated framework, termed StomaD2, is developed to achieve accurate and fast stomatal phenotyping under complex imaging conditions. The framework incorporates a diffusion-based restoration module to recover degraded images and a specialized rotated object detection network tailored to the small, dense, and cluttered characteristics of stomata. The proposed network enhances feature representation through three key innovations: a column-wise structure for global feature interaction, context-aware resampling and reweighting mechanism to improve multi-scale consistency, and a feature reassembly module to boost discrimination against complex backgrounds. In extensive comparisons, StomaD2 demonstrated state-of-the-art performance. On public Maize and Wheat datasets, it achieved accuracies of 0.994 and 0.992, respectively, significantly outperforming existing benchmarks. When benchmarked against ten other advanced models, including Oriented Former and YOLOv12, StomaD2 achieved a top-tier F1-score/mAP of 0.989. The framework is integrated into a user-friendly, field-operable system that supports the fast extraction of eight stomatal phenotypes, such as density and conductance. Validated on more than 130 plant species, StomaD2’s results highlight its strong generalizability and potential for large-scale phenotyping, plant physiology analysis, and precision agriculture applications.
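As an illustration of the downstream phenotype extraction such a system enables, the sketch below computes stomatal density and a rough shape proxy from rotated-box detections; the detection format, pixel scale, and formulas are assumptions, not StomaD2's own pipeline.

```python
# Downstream phenotype computation from rotated-box detections; format and scale are assumed.
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]   # (cx, cy, w, h, angle_deg) in pixels

def stomatal_density(detections: List[Box], image_hw: Tuple[int, int], um_per_pixel: float) -> float:
    """Stomata per mm^2 of imaged leaf area."""
    h, w = image_hw
    area_mm2 = (h * um_per_pixel) * (w * um_per_pixel) / 1e6
    return len(detections) / area_mm2

def mean_aspect_ratio(detections: List[Box]) -> float:
    """Mean short/long side ratio of detected stomata, a rough shape/openness proxy."""
    ratios = [min(w, h) / max(w, h) for _, _, w, h, _ in detections]
    return sum(ratios) / len(ratios) if ratios else 0.0

dets = [(100, 120, 40, 18, 15.0), (300, 240, 38, 22, 80.0)]
print(round(stomatal_density(dets, image_hw=(960, 1280), um_per_pixel=0.5), 2))
print(round(mean_aspect_ratio(dets), 3))
```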
[187] DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax
Hang Yuan, Xiaolin Hu, Yan Wan, Menglin Gao, Wenzhe Yu, Cong Huang, Fei Xu, Qing Li, Christina Dan Wang, Zhou Yu, Kai Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose Choreographic Syntax, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct DanceFlow, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce DanceCrafter, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.
[188] From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
[189] Align then Refine: Text-Guided 3D Prostate Lesion Segmentation
Cuiling Sun, Linkai Peng, Adam Murphy, Elif Keles, Hiten D. Patel, Ashley Ross, Frank Miller, Baris Turkbey, Andrea Mia Bejar, Halil Ertugrul Aktas, Gorkem Durak, Ulas Bagci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at https://github.com/NUBagciLab/Prostate-Lesion-Segmentation.
[190] Colour Extraction Pipeline for Odonates using Computer Vision
Megan Mirnalini Sundaram Rajaraman, Fons J. Verbeek, Vincent J. Kalkman, Rita Pucci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The correlation between insect morphological traits and climate has been documented in physiological studies, but such studies remain limited by the time-consuming nature of the data analysis. In particular, open-source datasets often lack annotations of species’ morphological traits, making dedicated annotation campaigns necessary; these efforts are typically local in scale and costly. In this paper, we propose a pipeline to identify and segment body parts of Odonates (dragonflies and damselflies) using deep neural networks, with the ultimate goal of extracting body parts’ colouration. The pipeline is trained on a limited annotated dataset and refined with pseudo-supervised data. We show that, by using open-source images from citizen science platforms, our approach can segment each visible subject (Odonates) into head, thorax, abdomen, and wings and then extract a colour palette for each body part. This will enable large-scale statistical analysis of ecological correlations (e.g., between colouration and climate change, habitat loss, or geolocation), which are crucial for quantifying and assessing ecosystem biodiversity status.
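The final palette-extraction step lends itself to a compact example. The snippet below is a hedged sketch of how a colour palette could be pulled from one segmented body part using k-means; the mask format, number of colours, and clustering choice are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: extracting a colour palette from one segmented body part.
# The segmentation masks themselves would come from the trained network;
# n_colors and the boolean-mask format are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(image, mask, n_colors=5):
    """Cluster the RGB pixels inside a body-part mask into a small palette.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    Returns palette colours sorted by the fraction of the part they cover."""
    pixels = image[mask].reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=n_colors)
    order = np.argsort(counts)[::-1]
    colours = km.cluster_centers_[order].astype(np.uint8)
    fractions = counts[order] / counts.sum()
    return colours, fractions

# Usage: one palette per segmented part, e.g.
# palettes = {part: extract_palette(img, m) for part, m in
#             {"head": head_mask, "thorax": thorax_mask, "wings": wing_mask}.items()}
```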
[191] Silicon Aware Neural Networks
Sebastian Fieldhouse, Kea-Tiong Tang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent work in the machine learning literature has demonstrated that deep learning can train neural networks made of discrete logic gate functions to perform simple image classification tasks at very high speeds on CPU, GPU and FPGA platforms. By virtue of being formed by discrete logic gates, these Differentiable Logic Gate Networks (DLGNs) lend themselves naturally to implementation in custom silicon - in this work we present a method to map DLGNs in a one-to-one fashion to a digital CMOS standard cell library by converting the trained model to a gate-level netlist. We also propose a novel loss function whereby the DLGN can optimize the area, and indirectly power consumption, of the resulting circuit by minimizing the expected area per neuron based on the area of the standard cells in the target standard cell library. Finally, we also show for the first time an implementation of a DLGN as a silicon circuit in simulation, performing layout of a DLGN in the SkyWater 130nm process as a custom hard macro using a Cadence standard cell library and performing post-layout power analysis. We find that our custom macro can perform classification on MNIST with 97% accuracy 41.8 million times a second at a power consumption of 83.88 mW.
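The area-aware loss is essentially an expectation over each neuron's distribution of candidate gates. Below is a minimal sketch, assuming every neuron holds trainable logits over the 16 two-input boolean functions and that per-gate cell areas are read from the target standard cell library; the table values and loss weight are placeholders, not figures from the paper.

```python
# Minimal sketch of an expected-area regularizer for a differentiable logic
# gate network. The 16-entry cell-area table and the loss weight are
# placeholders; real values would come from the target standard cell library.
import torch

def expected_area_loss(gate_logits, cell_areas):
    """gate_logits: (num_neurons, 16) trainable logits over the 16 two-input
    boolean functions; cell_areas: (16,) area of the standard cell that would
    implement each function, in library units."""
    probs = gate_logits.softmax(dim=-1)               # per-neuron gate distribution
    expected_area = (probs * cell_areas).sum(dim=-1)  # (num_neurons,)
    return expected_area.mean()

# Combined objective (weight is an assumed hyperparameter):
# total_loss = task_loss + area_weight * expected_area_loss(gate_logits, cell_areas)
```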
[192] Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control
Jay Jung, Ahmad Arrabi, Jax Luo, Scott Raymond, Safwan Wshah
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at https://github.com/marszzibros/C-arm-localization-LLMs.git
[193] Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras
Ruijun Zhang, Hang Su, Kostas Daniilidis, Ziyun Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: https://github.com/spikelab-jhu/Match-Any-Events.
[194] DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
Enrique Hernandez Noguera, Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automated segmentation of structural defects from visual inspection imagery remains challenging due to the diversity of damage types, extreme class imbalance, and the need for precise boundary delineation. This paper presents DeltaSeg, a U-shaped encoder-decoder architecture with a tiered attention strategy that integrates Squeeze-and-Excitation (SE) channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention (DDA) mechanism in the skip connections. The encoder uses depthwise separable convolutions with dilated stages to maintain spatial resolution while expanding the receptive field. Atrous Spatial Pyramid Pooling (ASPP) at the bottleneck captures multi-scale context. The DDA module refines skip connections through a dual-path scheme combining a learned delta operator for nuisance feature suppression with spatial attention gates conditioned on decoder signals. Deep supervision through multi-scale auxiliary heads further strengthens gradient flow and encourages semantically meaningful features at intermediate decoder stages. We evaluate DeltaSeg on two datasets: the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (CSDD, 9 classes). Across both benchmarks, DeltaSeg consistently outperforms 12 competing architectures including U-Net, SA-UNet, UNet3+, SegFormer, Swin-UNet, EGE-UNet, FPN, and Mobile-UNETR, demonstrating strong generalization across damage types, imaging conditions, and structural geometries.
[195] URoPE: Universal Relative Position Embedding across Geometric Spaces
Yichen Xie, Depu Meng, Chensheng Peng, Yihan Hu, Quentin Herau, Masayoshi Tomizuka, Wei Zhan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: https://urope-pe.github.io/.
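The geometric core of URoPE is sampling points along each key patch's camera ray at fixed depth anchors and projecting them into the query view, so that ordinary 2D RoPE can be applied to the projected coordinates. The sketch below illustrates that projection step under assumed pinhole intrinsics and a known relative pose; the variable names and depth anchors are illustrative, not the paper's exact implementation.

```python
# Hedged sketch of the ray-anchor projection step behind URoPE.
# Depth anchors, intrinsics, and the relative pose here are illustrative.
import torch

def project_ray_anchors(uv_key, K_key, K_query, T_query_from_key, depths):
    """uv_key: (N, 2) pixel centers of key/value patches; K_*: (3, 3) intrinsics;
    T_query_from_key: (4, 4) relative pose; depths: (D,) depth anchors.
    Returns (N, D, 2) projected pixel coordinates that standard 2D RoPE can consume."""
    ones = torch.ones(uv_key.shape[0], 1)
    rays = (torch.linalg.inv(K_key) @ torch.cat([uv_key, ones], dim=1).T).T  # (N, 3)
    pts = rays[:, None, :] * depths[None, :, None]                           # (N, D, 3)
    pts_h = torch.cat([pts, torch.ones(*pts.shape[:2], 1)], dim=-1)          # (N, D, 4)
    pts_q = (T_query_from_key @ pts_h.reshape(-1, 4).T).T[:, :3]             # (N*D, 3)
    proj = (K_query @ pts_q.T).T
    uv_query = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return uv_query.reshape(uv_key.shape[0], depths.shape[0], 2)
```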
[196] REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction
Seowung Leem, Lin Gu, Chenyu You, Kuang Gong, Ruogu Fang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The retina provides a unique, noninvasive window into Alzheimer’s disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer’s Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.
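The group-aware contrastive learning (GACL) idea can be approximated with a supervised-contrastive-style objective in which patients sharing a retinal/risk cluster count as positives. The sketch below is one reading under that assumption; the clustering step, the temperature, and the symmetric text-to-image term are omitted or assumed rather than taken from the paper.

```python
# Minimal sketch of a group-aware contrastive objective: patients assigned to
# the same cluster (by retinal morphometry + risk profile) are treated as
# positives. Clustering, temperature, and embedding sources are assumptions.
import torch
import torch.nn.functional as F

def group_contrastive_loss(img_emb, text_emb, group_ids, temperature=0.07):
    """img_emb, text_emb: (B, D) paired embeddings; group_ids: (B,) cluster labels."""
    img_emb = F.normalize(img_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = img_emb @ text_emb.T / temperature                     # (B, B)
    pos_mask = (group_ids[:, None] == group_ids[None, :]).float()   # same-group pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood over all positives of each anchor image
    # (each anchor has at least its own pair on the diagonal).
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```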
[197] CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans
Sergio Morell-Ortega, Ángela González-Cebrián, Boris Mansencal, Marien Gadea, Roberto Vivo-Hernando, Gregorio Rubio, Fernando Aparici, Maria de la Iglesia-Vaya, Gwenaelle Catheline, Pierrick Coupé, José V. Manjón
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large-scale automated morphometric analysis of brain MRI is limited by the thick-slice, anisotropic acquisitions prevalent in routine clinical practice. Existing generative super-resolution (SR) methods produce visually compelling isotropic volumes but often introduce anatomical hallucinations, systematic volumetric overestimation, and structural distortions that compromise downstream quantitative analysis and diagnostic safety. To address this, we propose CAHAL (Clinically Applicable resolution enHAncement for Low-resolution MRI scans), a hallucination-robust, physics-informed resolution enhancement framework that operates directly in the patient’s native acquisition space. CAHAL employs a deterministic bivariate Mixture of Experts (MoE) architecture routing each input through specialised residual 3D U-Net experts conditioned on both volumetric resolution and acquisition anisotropy, two independent descriptors of clinical MRI acquisition. Experts are optimised with a composite loss combining edge-penalised spatial reconstruction, Fourier-domain spectral coherence matching, and a segmentation-guided semantic consistency constraint. Training pairs are generated on-the-fly via physics-based degradation sampled from a large-scale real-world database, ensuring robust generalisation. Validated on T1-weighted and FLAIR sequences against generative baselines, CAHAL achieves state-of-the-art results, improving on the best related methods in both accuracy and efficiency.
[198] EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion
Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Anton Netchaev, Steven Sloan, Ken Pathak, Kendall N. Niles
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.
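The position-aware test-time augmentation amounts to flipping the inputs, correcting any coordinate features so they remain geometrically valid, and averaging the un-flipped predictions. Below is a hedged sketch under the assumption that the network takes an explicit normalized (x, y) coordinate tensor alongside RGB and sparse depth; the interface is illustrative, not the released code.

```python
# Hedged sketch of position-aware horizontal-flip test-time augmentation.
# The model interface and the explicit x-coordinate channel are assumptions.
import torch

def flip_tta(model, rgb, sparse_depth, coords):
    """rgb: (B, 3, H, W); sparse_depth: (B, 1, H, W);
    coords: (B, 2, H, W) normalized (x, y) coordinate channels."""
    pred = model(rgb, sparse_depth, coords)

    rgb_f = torch.flip(rgb, dims=[-1])
    depth_f = torch.flip(sparse_depth, dims=[-1])
    coords_f = torch.flip(coords, dims=[-1])
    # Correct the x-coordinate channel so positions stay geometrically valid
    # after the flip (assuming x is normalized to [0, 1]).
    coords_f[:, 0] = 1.0 - coords_f[:, 0]

    pred_f = torch.flip(model(rgb_f, depth_f, coords_f), dims=[-1])
    return 0.5 * (pred + pred_f)   # average the original and flip-corrected predictions
```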
[199] CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization
Linkai Peng, Cuiling Sun, Zheyuan Zhang, Wanying Dou, Halil Ertugrul Aktas, Andrea M Bejar, Elif Keles, Tamas Gonda, Michael B Wallace, Zongwei Zhou, Gorkem Durak, Rajesh N Keswani, Ulas Bagci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automatic pancreas segmentation is fundamental to abdominal MRI analysis, yet deep learning models trained on one MRI sequence often fail catastrophically when applied to another, a challenge that has received little systematic investigation. We introduce CrossPan, a multi-institutional benchmark comprising 1,386 3D scans across three routinely acquired sequences (T1-weighted, T2-weighted, and Out-of-Phase) from eight centers. Our experiments reveal three key findings. First, cross-sequence domain shifts are far more severe than cross-center variability: models achieving Dice scores above 0.85 in-domain collapse to near-zero (<0.02) when transferred across sequences. Second, state-of-the-art domain generalization methods provide negligible benefit under these physics-driven contrast inversions, whereas foundation models like MedSAM2 maintain moderate zero-shot performance through contrast-invariant shape priors. Third, semi-supervised learning offers gains only under stable intensity distributions and becomes unstable on sequences with high intra-organ variability. These results establish cross-sequence generalization, not model architecture or center diversity, as the primary barrier to clinically deployable pancreas MRI segmentation. Dataset and code are available at https://crosspan.netlify.app/.
[200] LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families – text-illegibility, time-reading, and object-absence – each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels – patterns that aggregate metrics obscure.
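The dual-track metrics reduce to a simple aggregation once each response has been flagged and scored. The snippet below sketches that aggregation; the record format and the idea of computing it per (model, tone level) cell are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of the dual-track aggregation: a rule-based H-Rate (fraction of
# responses that commit to a fabricated positive claim) and a judge-scored
# H-Score averaged over fabricating responses only. Record format is assumed.
from statistics import mean

def aggregate(records):
    """records: list of dicts with keys 'hallucinated' (bool, rule-based flag)
    and 'judge_score' (int 1-5 from the LLM judge, meaningful when flagged)."""
    h_rate = mean(1.0 if r["hallucinated"] else 0.0 for r in records)
    fabricated = [r["judge_score"] for r in records if r["hallucinated"]]
    h_score = mean(fabricated) if fabricated else None
    return {"H-Rate": h_rate, "H-Score": h_score}

# Usage: aggregate one cell per (model, prompt intensity level) to trace how
# fabrication incidence and intensity change across the five tone levels.
```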
[201] Geometric Decoupling: Diagnosing the Structural Instability of Latent
Yuanbang Liang, Zhengwen Chen, Yu-Kun Lai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Latent Diffusion Models (LDMs) achieve high-fidelity synthesis but suffer from latent space brittleness, causing discontinuous semantic jumps during editing. We introduce a Riemannian framework to diagnose this instability by analyzing the generative Jacobian, decomposing geometry into Local Scaling (capacity) and Local Complexity (curvature). Our study uncovers a "Geometric Decoupling": while curvature in normal generation functionally encodes image detail, OOD generation exhibits a functional decoupling where extreme curvature is wasted on unstable semantic boundaries rather than perceptible details. This geometric misallocation identifies "Geometric Hotspots" as the structural root of instability, providing a robust intrinsic metric for diagnosing generative reliability.
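For intuition, the two diagnostics can be approximated from the decoder Jacobian at a latent point: a log-volume term from its singular values for Local Scaling, and a finite-difference measure of how fast the Jacobian itself changes for Local Complexity. The sketch below is a rough, small-scale illustration under those assumptions and is only practical for low-dimensional decoders; the paper's actual estimators may differ.

```python
# Hedged sketch of Jacobian-based "local scaling" and a curvature proxy for
# "local complexity". Intended for a small decoder at low resolution; the
# exact estimators used in the paper are not reproduced here.
import torch
from torch.autograd.functional import jacobian

def local_geometry(decoder, z, eps=1e-2):
    """decoder: maps a (d,) latent to an output tensor; z: (d,) latent point."""
    J = jacobian(lambda x: decoder(x).flatten(), z)          # (out_dim, d)
    sigma = torch.linalg.svdvals(J)
    local_scaling = torch.log(sigma.clamp(min=1e-8)).sum()   # log-volume change

    # Curvature proxy: how much the Jacobian changes under a small random
    # perturbation of the latent code (finite-difference approximation).
    delta = eps * torch.randn_like(z)
    J2 = jacobian(lambda x: decoder(x).flatten(), z + delta)
    local_complexity = (J2 - J).norm() / (J.norm() * eps)
    return local_scaling, local_complexity
```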
[202] DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
Abrar Majeedi, Zhiyuan Ruan, Ziyi Zhao, Hongcheng Wang, Jianglin Lu, Yin Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at https://abrarmajeedi.github.io/dualvision.
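One plausible reading of the fusion module is a residual cross-attention block in which RGB patch tokens query the aligned IR patch tokens. The sketch below uses a standard multi-head attention layer for that purpose; the dimensions are placeholders, and the localized (patch-window) restriction described in the abstract is omitted for brevity.

```python
# Illustrative sketch of RGB-IR token fusion via cross-attention, not the
# paper's exact module. Embedding dimension and head count are placeholders.
import torch
import torch.nn as nn

class RGBIRFusion(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, ir_tokens):
        """rgb_tokens, ir_tokens: (B, N, dim) patch embeddings from the two
        aligned views. RGB patches query the IR patches and the result is added
        back residually, so degraded RGB regions can borrow IR evidence."""
        fused, _ = self.attn(query=rgb_tokens, key=ir_tokens, value=ir_tokens)
        return self.norm(rgb_tokens + fused)
```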
[203] Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model
Haiyang Wu, Juan J. Gonzales Torres, George Vosselman, Ville Lehtola
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Frame-wise semantic segmentation of indoor lidar scans is a fundamental step toward higher-level 3D scene understanding and mapping applications. However, acquiring frame-wise ground truth for training deep learning models is costly and time-consuming. This challenge is largely addressed, for imagery, by Visual Foundation Models (VFMs) which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D-to-3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes, in a frame-wise distillation manner by coupling each lidar scan with a VFM-processed camera image. The evaluation is done using indoor SLAM datasets, where pseudo-labels are used for downstream evaluation. Also, a small manually annotated lidar dataset is provided for validation, as there are no other lidar frame-wise indoor datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo-label evaluation and around 36% mIoU with real-label, demonstrating the feasibility of cross-modal distillation for indoor lidar semantic segmentation without manual annotations.
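The frame-wise 2D-to-3D distillation hinges on a simple geometric transfer: project each lidar point into the paired camera frame and adopt the VFM's pixel label as that point's pseudo-label. Below is a minimal sketch under assumed calibration conventions; the thresholds and label encoding are illustrative.

```python
# Hedged sketch of 2D-to-3D pseudo-label transfer from a VFM segmentation mask
# to lidar points. Calibration conventions and thresholds are assumptions.
import numpy as np

def transfer_labels(points, T_cam_from_lidar, K, vfm_labels):
    """points: (N, 3) lidar points; T_cam_from_lidar: (4, 4) extrinsics;
    K: (3, 3) intrinsics; vfm_labels: (H, W) integer mask from the VFM.
    Returns per-point pseudo-labels, -1 where no valid projection exists."""
    H, W = vfm_labels.shape
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    valid = cam[:, 2] > 0.1                            # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)   # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(points.shape[0], -1, dtype=np.int64)
    labels[valid] = vfm_labels[v[valid], u[valid]]
    return labels
```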
[204] Multi-Domain Learning with Global Expert Mapping
Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Oscar Mendez, Dacheng Tao, Xuelong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations, and reducing performance especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets and solves task interference in few-shot adaptation scenarios.
[205] DDF2Pol: A Dual-Domain Feature Fusion Network for PolSAR Image Classification
Mohammed Q. Alkhatib
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents DDF2Pol, a lightweight dual-domain convolutional neural network for PolSAR image classification. The proposed architecture integrates two parallel feature extraction streams, one real-valued and one complex-valued, designed to capture complementary spatial and polarimetric information from PolSAR data. To further refine the extracted features, a depth-wise convolution layer is employed for spatial enhancement, followed by a coordinate attention mechanism to focus on the most informative regions. Experimental evaluations conducted on two benchmark datasets, Flevoland and San Francisco, demonstrate that DDF2Pol achieves superior classification performance while maintaining low model complexity. Specifically, it attains an Overall Accuracy (OA) of 98.16% on the Flevoland dataset and 96.12% on the San Francisco dataset, outperforming several state-of-the-art real- and complex-valued models. With only 91,371 parameters, DDF2Pol offers a practical and efficient solution for accurate PolSAR image analysis, even when training data is limited. The source code is publicly available at https://github.com/mqalkhatib/DDF2Pol
[206] ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification
Mohammed Q. Alkhatib
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer-based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV-borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN-, Transformer-, and Mamba-based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at https://github.com/mqalkhatib/ConvVitMamba
[207] ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through inference-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLMs performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.
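The Observe-Reason-Critique-Act loop can be pictured as a small control loop around a frozen LVLM and a set of lightweight vision tools. The skeleton below is purely illustrative: the lvlm.answer / lvlm.critique / lvlm.revise calls and the vision_tools dictionary are hypothetical placeholders, not an API from the paper.

```python
# Illustrative skeleton of an Observe-Reason-Critique-Act loop. All tool and
# LVLM methods used here are hypothetical placeholders; the paper's actual
# tool suite, prompts, and stopping rule are not reproduced.
def orca_loop(image, question, lvlm, vision_tools, max_rounds=3):
    trace = []                                    # auditable reasoning trace
    answer = lvlm.answer(image, question)         # initial LVLM response
    for _ in range(max_rounds):
        # Observe: query small vision models with evidential sub-questions.
        evidence = {name: tool(image, question) for name, tool in vision_tools.items()}
        # Reason + Critique: check whether the answer conflicts with the evidence.
        critique = lvlm.critique(question, answer, evidence)
        trace.append({"answer": answer, "evidence": evidence, "critique": critique})
        if critique["consistent"]:
            break                                 # Act: keep the current answer
        answer = lvlm.revise(question, evidence, critique)  # Act: refine the answer
    return answer, trace
```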
[208] HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images
Pourya Shamsolmoali, Masoumeh Zareapoor, Michael Felsberg, Nick Pears, Yue Lu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.
[209] Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling
Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Building robust vision systems for high-stakes domains such as remote sensing requires stronger visual reasoning than what single-pass inference typically provides; yet, retraining large models is often computationally expensive and data intensive. We present Visual Reasoning Agent (VRA), a training-free agentic visual reasoning framework that orchestrates off-the-shelf large vision-language models (LVLMs) with a large reasoning model (LRM) through an iterative Think-Critique-Act loop for cross-model verification, self-critique, and recursive refinement. On the remote sensing benchmark VRSBench VQA dataset, VRA consistently outperforms multiple standalone LVLM baselines and achieves up to 40.67% improvement on challenging question types spanning both perception and reasoning tasks. In addition, integrating three LVLMs with VRA improves the overall accuracy of the standalone LVLMs from 52.8% to 78.8%, demonstrating the effectiveness of agentic reasoning with increased inference-time compute.
[210] Hierarchically Robust Zero-shot Vision-language Models
Junhao Dong, Yifei Zhang, Hao Zhu, Yew-Soon Ong, Piotr Koniusz
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.
[211] A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders
Zhongying Wang, Kevin Lane, Levi Cai, Morteza Karimzadeh, Esther Rolf
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Supervised learning with Earth observation inputs is often limited by the sparsity of high-quality labeled or in-situ measured data to use as training labels. With the abundance of geographic data products, in many cases there are variables correlated with - but different from - the variable of interest that can be leveraged. We integrate such proxy variables within a geographic prior via a trainable location encoder and introduce a proxy consistency loss (PCL) formulation to imbue proxy data into the location encoder. The first key insight behind our approach is to use the location encoder as an agile and flexible way to learn from abundantly available proxy data which can be sampled independently of training label availability. Our second key insight is that we will need to regularize the location encoder appropriately to achieve performance and robustness with limited labeled data. Our experiments on air quality prediction and poverty mapping show that integrating proxy data implicitly through the location encoder outperforms using both as input to an observation encoder and fusion strategies that use frozen, pretrained location embeddings as a geographic prior. Superior performance for in-sample prediction shows that the PCL can incorporate rich information from the proxies, and superior out-of-sample prediction shows that the learned latent embeddings help generalize to areas without training labels.
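A minimal way to picture the proxy consistency loss is as an auxiliary regression head on the location encoder, trained on proxy samples drawn independently of label availability and added to the supervised objective. The sketch below assumes a simple two-head design, MSE losses, and a single weighting term, none of which are taken from the paper.

```python
# Hedged sketch of training with a proxy consistency loss (PCL): the location
# encoder is regularized to reproduce proxy variables sampled anywhere,
# labeled or not. Heads, losses, and the weight lam are assumptions.
import torch
import torch.nn.functional as F

def training_step(obs_enc, loc_enc, proxy_head, target_head,
                  x_obs, coords_lab, y, coords_proxy, proxy_vals, lam=1.0):
    # Supervised branch: fuse observation features with location embeddings.
    z = torch.cat([obs_enc(x_obs), loc_enc(coords_lab)], dim=-1)
    sup_loss = F.mse_loss(target_head(z), y)

    # Proxy consistency: location embeddings alone must explain the proxy
    # variable at independently sampled locations.
    pcl = F.mse_loss(proxy_head(loc_enc(coords_proxy)), proxy_vals)
    return sup_loss + lam * pcl
```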
[212] Localization-Guided Foreground Augmentation in Autonomous Driving
Jiawei Yong, Deyuan Qu, Qi Chen, Kentaro Oguchi, Shintaro Fukushima
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Autonomous driving systems often degrade under adverse visibility conditions, such as rain, nighttime, or snow, where online scene geometry (e.g., lane dividers, road boundaries, and pedestrian crossings) becomes sparse or fragmented. While high-definition (HD) maps can provide missing structural context, they are costly to construct and maintain at scale. We propose Localization-Guided Foreground Augmentation (LG-FA), a lightweight and plug-and-play inference module that enhances foreground perception by enriching geometric context online. LG-FA: (i) incrementally constructs a sparse global vector layer from per-frame Bird’s-Eye View (BEV) predictions; (ii) estimates ego pose via class-constrained geometric alignment, jointly improving localization and completing missing local topology; and (iii) reprojects the augmented foreground into a unified global frame to improve per-frame predictions. Experiments on challenging nuScenes sequences demonstrate that LG-FA improves the geometric completeness and temporal stability of BEV representations, reduces localization error, and produces globally consistent lane and topology reconstructions. The module can be seamlessly integrated into existing BEV-based perception systems without backbone modification. By providing a reliable geometric context prior, LG-FA enhances temporal consistency and supplies stable structural support for downstream modules such as tracking and decision-making.
[213] Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images
Abdul Mueez, Shruti Vyas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator.
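The ASTM E112 Jeffries planimetric step that the pipeline automates is itself a short calculation: count grains wholly inside a test field plus half of those cut by its border, convert to grains per square millimetre at 1x, and map that density to the grain size number G via the standard E112 relation. The sketch below implements that relation; only the grain counts (which would come from the segmentation output) and the field area are inputs.

```python
# Worked sketch of the ASTM E112 Jeffries planimetric calculation. Grain
# counts would come from the instance segmentation; the field area is the
# true specimen area of the imaged region at 1x.
import math

def astm_grain_size(n_inside, n_boundary, field_area_mm2):
    """n_inside: grains entirely within the measurement field,
    n_boundary: grains intersected by the field border,
    field_area_mm2: true specimen area of the field in mm^2 (at 1x)."""
    n_a = (n_inside + 0.5 * n_boundary) / field_area_mm2   # grains per mm^2
    g = 3.321928 * math.log10(n_a) - 2.954                 # ASTM E112 relation
    return n_a, g

# Example: 46 interior grains and 22 border grains in a 0.5 mm^2 field give
# N_A = 114 grains/mm^2 and G ≈ 3.9.
```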
[214] Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2
Aaron Nicolson, Elizabeth J. Cooper, Hwan-Jin Yoon, Claire McCafferty, Ramya Krishnan, Michelle Craigie, Nivene Saad, Jason Dowling, Ian A. Scott, Bevan Koopman
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.
[215] AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs
Joongho Jo, Hyerin Lim, Hanjun Choi, Jongsun Park
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reducing the number of Gaussian-tile pairs is one of the most promising approaches to improve 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, the importance difference existing among Gaussian-tile pairs has never been considered in the previous works. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that the peripheral tiles located far from Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity for reducing the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves a geometric mean speedup of 13.8x over original 3D-GS on a GPU, with only about 0.5 dB degradation in PSNR on city-scale scenes.
[216] A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation
Liping Wang, Cheng Ye, Weidong Chen, Peipei Song, Bo Hu, Zhendong Mao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users’ multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.
[217] Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
Xu Chen, Shichao Xie, Zhining Gu, Lu Jia, Minghua Luo, Fei Liu, Zedong Chu, Yanfen Shen, Xiaolong Wu, Mu Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent’s movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.
[218] Generative Texture Filtering
Rongjia Zheng, Shangwei Huang, Lei Zhu, Wei-Shi Zheng, Qing Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a generative method for texture filtering, which exhibits surprisingly good performance and generalizability. Our core idea is to empower texture filtering by taking full advantage of the strong learned image prior of pre-trained generative models. To this end, we propose to fine-tune a pre-trained generative model via a two-stage strategy. Specifically, we first conduct supervised fine-tuning on a very small set of paired images, and then perform reinforcement fine-tuning on a large-scale unlabeled dataset under the guidance of a reward function that quantifies the quality of texture removal and structure preservation. Extensive experiments show that our method clearly outperforms previous methods, and is effective to deal with previously challenging cases. Our code is available at https://github.com/OnlyZZZZ/Generative_Texture_Filtering.
[219] Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
Zihao Ye, Yung Hsiang Lu, Xiao Hu, Shuai Zhang, Taotao Jing, Xin Li, Zhen Yao, Bo Lang, Zhihao Zheng, Seungmin Oh, Hankyul Kang, Seunghun Kang, Jongbin Ryu, Kexin Chen, Yuan Qi, George K Thiruvathukal, Mooi Choo Chuah
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.
[220] The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song, Jingwen Fu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.
[221] Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
Jinglin Xu, Yi Li, Chuxiong Sun, Xiao Xu, Jiangmeng Li, Fanjiang Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.
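For readers unfamiliar with the canonical GDA baseline the abstract builds on, here is a minimal NumPy sketch (not the released AdaPGC code) of category-conditional Gaussian modeling: per-class means with a shared covariance, scored by Mahalanobis distance over fused multi-modal features. Function names and toy data are purely illustrative.
```python
import numpy as np

def gda_fit(features, labels, num_classes, eps=1e-3):
    """Fit per-class means and a shared covariance (canonical GDA)."""
    d = features.shape[1]
    means = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = features - means[labels]
    cov = centered.T @ centered / len(features) + eps * np.eye(d)
    return means, np.linalg.inv(cov)

def gda_predict(features, means, cov_inv):
    """Assign each sample to the class with the smallest Mahalanobis distance."""
    diff = features[:, None, :] - means[None, :, :]            # (N, C, d)
    mahala = np.einsum('ncd,de,nce->nc', diff, cov_inv, diff)  # (N, C)
    return mahala.argmin(axis=1)

# toy usage with random "fused multi-modal" features
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16)) + np.repeat(np.arange(4), 50)[:, None]
labels = np.repeat(np.arange(4), 50)
means, cov_inv = gda_fit(feats, labels, num_classes=4)
print((gda_predict(feats, means, cov_inv) == labels).mean())
```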
[222] EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation
Ruibing Hou, Mingyue Zhou, Yuwei Gui, Mingshuang Luo, Bingpeng Ma, Hong Chang, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical \textit{reasoning-generation entanglement} challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework \textbf{EgoMotion}. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
[223] PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
Chaonan Ji, Jinwei Qi, Sheng Xu, Peng Zhang, Bang Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.
[224] BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination
Michele Grimaldi, David Nakath, Oscar Pizarro, Jonatan Scharff Willners, Ignacio Carlucho, Yvan R. Petillot
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Robust 3D reconstruction across varying environmental conditions remains a critical challenge for robotic perception, particularly when transitioning between air and water. To address this, we introduce BALTIC, a controlled benchmark designed to systematically evaluate modern 3D reconstruction methods under variations in medium and lighting. The benchmark comprises 13 datasets spanning two media (air and water) and three lighting conditions (ambient, artificial, and mixed), with additional variations in motion type, scanning pattern, and initialization trajectory, resulting in a diverse set of sequences. Our experimental setup features a custom water tank equipped with a monocular camera and an HTC Vive tracker, enabling accurate ground-truth pose estimation. We further investigate cross-domain reconstruction by augmenting underwater image sequences with a small number of in-air views captured under similar lighting conditions. We evaluate Structure-from-Motion reconstruction using COLMAP in terms of both trajectory accuracy and scene geometry, and use these reconstructions as input to Neural Radiance Fields and 3D Gaussian Splatting methods. The resulting models are assessed against ground-truth trajectories and in-air references, while rendered outputs are compared using perceptual and photometric metrics. Additionally, we perform a color restoration analysis to evaluate radiometric consistency across domains. Our results show that under controlled, texture-consistent conditions, Gaussian Splatting with simple preprocessing (e.g., white balance correction) can achieve performance comparable to specialized underwater methods, although its robustness decreases in more complex and heterogeneous real-world environments
[225] Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval
Hang Cheng, Fanhe Dong, Long Zeng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.
[226] Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Johannes Schusterbauer, Ming Gui, Yusong Li, Pingchuan Ma, Felix Krause, Björn Ommer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.
[227] ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
[228] MSDS: Deep Structural Similarity with Multiscale Representation
Danling Kang, Xue-Hua Chen, Bin Liu, Keke Zhang, Weiling Chen, Tiesong Zhao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
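A minimal sketch of the cross-scale integration idea: compute a single-scale similarity at each pyramid level and fuse the per-level scores with softmax-normalized learnable weights. The pyramid depth, average-pooling downsampler, and the toy correlation stand-in for the deep similarity are illustrative assumptions, not the paper's exact MSDS implementation.
```python
import numpy as np

def downsample(img):
    """Halve resolution with 2x2 average pooling (one pyramid level)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def multiscale_score(ref, dist, per_level_similarity, level_logits):
    """Score each pyramid level independently and fuse with learnable global weights."""
    scores = []
    for _ in range(len(level_logits)):
        scores.append(per_level_similarity(ref, dist))
        ref, dist = downsample(ref), downsample(dist)
    w = np.exp(level_logits) / np.exp(level_logits).sum()   # softmax over level weights
    return float(np.dot(w, scores))

# toy usage: plug in any single-scale deep similarity; here a stand-in correlation
toy_sim = lambda a, b: float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
rng = np.random.default_rng(0)
ref = rng.random((64, 64)); dist = ref + 0.1 * rng.random((64, 64))
print(multiscale_score(ref, dist, toy_sim, level_logits=np.zeros(4)))
```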
[229] Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement
Pritam Kar, Gouri Lakshmi S, Saptarshi Bej
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
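A rough NumPy sketch of the pipeline the abstract describes: a mean-shift-style refinement that moves embeddings toward higher-density regions, PCA reduction, and a Mahalanobis anomaly score under a single Gaussian fit to normal data. Bandwidth, iteration count, and component numbers are illustrative assumptions, not the paper's settings.
```python
import numpy as np

def mean_shift_refine(feats, bandwidth=1.0, iters=3):
    """Iteratively shift each embedding toward higher-density regions (Gaussian kernel)."""
    x = feats.copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        x = (w @ feats) / w.sum(axis=1, keepdims=True)
    return x

def fit_normal_model(train_feats, n_components=8):
    """PCA-reduce refined normal features and fit a single Gaussian."""
    mu = train_feats.mean(0)
    _, _, vt = np.linalg.svd(train_feats - mu, full_matrices=False)
    P = vt[:n_components]                                  # principal axes
    z = (train_feats - mu) @ P.T
    cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(n_components)
    return mu, P, z.mean(0), np.linalg.inv(cov)

def anomaly_score(feats, mu, P, z_mean, cov_inv):
    """Mahalanobis distance of a test embedding from the learned normal distribution."""
    z = (feats - mu) @ P.T - z_mean
    return np.einsum('nd,de,ne->n', z, cov_inv, z)

rng = np.random.default_rng(0)
normal = mean_shift_refine(rng.normal(size=(128, 32)))
model = fit_normal_model(normal)
print(anomaly_score(normal[:5], *model))
```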
[230] How Far Are Video Models from True Multimodal Reasoning?
Xiaotian Zhang, Jianhui Wei, Yuan Wang, Jie Tan, Yichen Li, Yan Zhang, Ziyi Chen, Daoan Zhang, Dezhi YU, Wei Xu, Songtao Jiang, Zuozhu Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models’ zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedback and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.
[231] Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing
Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .
[232] When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide
Hang-Cheng Dong, Yuhao Jiang, Yibo Jiao, Lu Zou, Kai Zheng, Bingguo Liu, Dong Ye, Guodong Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The deployment of AI systems in safety-critical domains, such as industrial defect inspection, autonomous driving, and medical diagnosis, is severely hampered by their lack of reliability. A single undetected erroneous prediction can lead to catastrophic outcomes. Unfortunately, there is often no alternative but to place trust in the outputs of a trained AI system, which operates without an internal safeguard to flag unreliable predictions, even in cases of high accuracy. We propose a post-hoc explanation-based indicator to detect false negatives in binary defect detection networks. To our knowledge, this is the first method to proactively identify potentially erroneous network outputs. Our core idea leverages the difference between class-specific discriminative heatmaps and class-agnostic ones. We compute the difference in their intersection over union (IoU) as a reliability score. An adversarial enhancement method is further introduced to amplify this disparity. Evaluations on two industrial defect detection benchmarks show our method effectively identifies false negatives. With adversarial enhancement, it achieves 100% recall, albeit with a trade-off for true negatives. Our work thus advocates for a new and trustworthy deployment paradigm: data-model-explanation-output, moving beyond conventional end-to-end systems to provide critical support for reliable AI in real-world applications.
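One plausible reading of the reliability score is sketched below, under the assumption that both heatmaps are thresholded at a fixed quantile before comparison; the paper may binarize differently and adds an adversarial enhancement step not shown here.
```python
import numpy as np

def binarize(heatmap, quantile=0.8):
    """Keep only the most salient region of a heatmap."""
    return heatmap >= np.quantile(heatmap, quantile)

def reliability_score(class_specific_map, class_agnostic_map, quantile=0.8):
    """IoU agreement between class-specific and class-agnostic saliency.
    Low agreement flags predictions whose explanation may be unreliable."""
    a = binarize(class_specific_map, quantile)
    b = binarize(class_agnostic_map, quantile)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

rng = np.random.default_rng(0)
cam_specific = rng.random((14, 14))
cam_agnostic = cam_specific + 0.05 * rng.random((14, 14))   # closely agreeing maps
print(reliability_score(cam_specific, cam_agnostic))        # high -> likely reliable
```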
[233] An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones
Yuezhe Zhang, Luqian Bai, Mengting Yu, Lei Wei, Shuai Wan, Yifan Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Data acquisition through mobile phones remains a challenge for 3D Gaussian Splatting (3DGS). In this work we target the object-centered scenario and enable reliable mobile acquisition by providing on-device capture guidance and recording onboard sensor signals for offline reconstruction. After the calibration step, the device orientations are aligned to a baseline frame to obtain relative poses, and the optical axis of the camera is mapped to an object-centered spherical grid for uniform viewpoint indexing. To curb polar sampling bias, we compute area-weighted spherical coverage in real-time and guide the user’s motion accordingly. We compare the proposed method with RealityScan and the free-capture strategy. Our method achieves superior reconstruction quality using fewer input images compared to free capture and RealityScan. Further analysis shows that the proposed method is able to obtain more comprehensive and uniform viewpoint coverage during object-centered acquisition.
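A small sketch of area-weighted spherical coverage on a latitude-longitude viewpoint grid. The grid resolution and the equatorial-sweep example are hypothetical, but the sine-of-latitude band weighting is the standard way to correct the polar sampling bias mentioned above.
```python
import numpy as np

def cell_area_weights(n_lat=18, n_lon=36):
    """Relative area of each cell on a latitude-longitude grid over the sphere.
    Cells near the poles are smaller, which is what causes polar sampling bias."""
    lat_edges = np.linspace(-np.pi / 2, np.pi / 2, n_lat + 1)
    band = np.sin(lat_edges[1:]) - np.sin(lat_edges[:-1])    # area per latitude band
    weights = np.repeat(band[:, None], n_lon, axis=1) / n_lon
    return weights / weights.sum()

def coverage(visited_mask, weights):
    """Area-weighted fraction of the viewing sphere covered so far."""
    return float((weights * visited_mask).sum())

weights = cell_area_weights()
visited = np.zeros_like(weights, dtype=bool)
visited[8:10, :] = True            # e.g. a sweep around the equator
print(f"area-weighted coverage: {coverage(visited, weights):.2%}")
```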
[234] Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data
Gopal Krishna Shyam, Ila Chandrakar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Crop yield prediction is one of the most important challenges for world food security and policy-making. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not capture the dynamic and intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. Unlike traditional models that rely on only one of these data sources [12, 21], the model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties. The core architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism to adaptively weight the phenological periods most important for yield, conditioned on the extracted spatial features. Experiments show that the proposed framework achieves an R^2 score of 0.89, substantially outperforming the baseline models.
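A generic sketch of the temporal attention idea, assuming per-timestep CNN features are already extracted; the projection sizes, the single learned query, and the monthly-composite example are illustrative assumptions rather than the ABMMDLF architecture.
```python
import numpy as np

def temporal_attention_pool(seq_feats, query, key_proj, value_proj):
    """Weight each time step (e.g. satellite acquisition date) by learned relevance
    and pool the sequence into a single season-level representation."""
    keys = seq_feats @ key_proj              # (T, d_k)
    values = seq_feats @ value_proj          # (T, d_v)
    logits = keys @ query / np.sqrt(keys.shape[1])
    attn = np.exp(logits - logits.max())
    attn = attn / attn.sum()                 # attention over phenological stages
    return attn @ values, attn

rng = np.random.default_rng(0)
T, d = 12, 64                                # 12 monthly composites, 64-dim CNN features
seq = rng.normal(size=(T, d))
pooled, attn = temporal_attention_pool(
    seq,
    query=rng.normal(size=32),
    key_proj=rng.normal(size=(d, 32)),
    value_proj=rng.normal(size=(d, 32)),
)
print(pooled.shape, attn.round(2))
```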
[235] Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
Quan Zhang, Jingze Wu, Jialong Wang, Xiaohua Xie, Jianhuang Lai, Hongbo Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit identities from massive annotated data rather than understanding identity-causal cues, yielding representations that are fragile under multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought into the ReID pipeline. Specifically, ReID-R consists of two stages: (i) a discriminative reasoning warm-up, where a model is trained without CoT labels to acquire identity-aware feature understanding; and (ii) efficient reinforcement learning, which uses a non-trivial sampling strategy to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with leading methods while using only 14.3K non-trivial samples (20.9% of the existing data scale). Furthermore, benefiting from its inherent reasoning, ReID-R can provide high-quality interpretations of its results.
[236] Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery
Francesco Moretti, Yi Jin, Guiqin Mario
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose \textbf{Adaptive Slicing-Assisted Hyper Inference (ASAHI)}, a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1) an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2) a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3) a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
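A toy sketch of the resolution-aware slicing rule described above. The grid layouts, area threshold, and overlap ratio are illustrative placeholders, since the paper learns its threshold rather than hard-coding it.
```python
def adaptive_slices(img_w, img_h, area_threshold=4_000_000, overlap=0.2):
    """Pick 6 (2x3) or 12 (3x4) overlapping patches based on image resolution,
    instead of a fixed slice size. Threshold and grids are illustrative values."""
    rows, cols = (3, 4) if img_w * img_h > area_threshold else (2, 3)
    slice_w, slice_h = img_w / cols, img_h / rows
    pad_w, pad_h = overlap * slice_w / 2, overlap * slice_h / 2
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = max(0, c * slice_w - pad_w)
            y0 = max(0, r * slice_h - pad_h)
            x1 = min(img_w, (c + 1) * slice_w + pad_w)
            y1 = min(img_h, (r + 1) * slice_h + pad_h)
            boxes.append((int(x0), int(y0), int(x1), int(y1)))
    return boxes

print(len(adaptive_slices(1920, 1080)))   # 6 patches for a moderate resolution
print(len(adaptive_slices(4000, 3000)))   # 12 patches for a very large image
```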
[237] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
Rui Li, Ke Hao, Yuanzhi Liang, Haibin Huang, Chi Zhang, Yun Gu, Xuelong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.
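A speculative NumPy sketch of what timestep-aware, multi-objective credit assignment could look like: group-normalized rewards per objective are mixed with per-step objective weights and scaled by a step-importance profile. All shapes, names, and weighting schemes here are assumptions, not the OTCA algorithm itself.
```python
import numpy as np

def timestep_aware_advantages(rewards, step_credit, objective_weights):
    """Combine multiple reward signals into per-timestep advantages.

    rewards: (G, K) one scalar per rollout in the group and per objective
             (e.g. visual quality, motion consistency, text alignment).
    step_credit: (T,) relative importance of each denoising step, sums to 1.
    objective_weights: (T, K) how much each objective matters at each step.
    """
    # group-relative normalization per objective (GRPO-style baseline)
    norm = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-8)   # (G, K)
    per_step = norm @ objective_weights.T                          # (G, T)
    return per_step * step_credit[None, :]                         # (G, T)

rng = np.random.default_rng(0)
T, G, K = 10, 4, 3
step_credit = np.linspace(2.0, 0.5, T); step_credit /= step_credit.sum()
obj_w = rng.random((T, K)); obj_w /= obj_w.sum(1, keepdims=True)
adv = timestep_aware_advantages(rng.random((G, K)), step_credit, obj_w)
print(adv.shape)    # (4, 10): one advantage per rollout and denoising step
```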
[238] AlloSR$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows
Zihan Wang, Xudong Huang, Junbo Qiao, Wei Li, Jie Hu, Xinghao Chen, Shaohui Lin
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Real-world image super-resolution (Real-SR) has been revolutionized by leveraging the powerful generative priors of large-scale diffusion and flow-based models. However, fine-tuning these models on limited LR-HR pairs often precipitates “prior collapse”, in which the model sacrifices its inherent generative richness to overfit specific training degradations. This issue is further exacerbated in one-step generation, where the absence of multi-step refinement leads to significant trajectory drift and artifact generation. In this paper, we propose AlloSR$^2$, a novel framework that rectifies one-step SR trajectories via allomorphic generative flows to maintain high-fidelity generative realism. Specifically, we utilize Signal-to-Noise Ratio (SNR) Guided Trajectory Initialization to establish a physically grounded starting state by aligning the degradation level of LR latent features with the optimal anchoring timestep of the pre-trained flow. To ensure a stable, curvature-free path for one-step inference, we propose Flow-Anchored Trajectory Consistency (FATC), which enforces velocity-level supervision across intermediate states. Furthermore, we develop Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AlloSR$^2$ achieves state-of-the-art performance in one-step Real-SR, offering a superior balance between restoration fidelity and generative realism while maintaining extreme efficiency.
[239] Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Hongyuan Liu, Bochao Zou, Qiankun Liu, Haochen Yu, Qi Mei, Jianfei Jiang, Chen Liu, Cheng Bi, Zhao Wang, Xueyang Zhang, Yifei Zhan, Jiansheng Chen, Huimin Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.
[240] Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection
Yuanchan Xu, Wenjun Zang, Ying Wu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-class defect detection constitutes a critical yet challenging task in industrial quality inspection, where existing approaches typically suffer from two fundamental limitations: (i) the necessity of training separate models for each defect category, resulting in substantial computational and memory overhead, and (ii) degraded robustness caused by inter-class feature perturbation when heterogeneous defect categories are jointly modeled. In this paper, we present FPFNet, a Feature Perturbation Pool-based Fusion Network that synergistically integrates a stochastic feature perturbation pool with a multi-layer feature fusion strategy to address these challenges within a unified detection framework. The feature perturbation pool enriches the training distribution by randomly injecting diverse noise patterns – including Gaussian noise, F-Noise, and F-Drop – into the extracted feature representations, thereby strengthening the model’s robustness against domain shifts and unseen defect morphologies. Concurrently, the multi-layer feature fusion module aggregates hierarchical feature representations from both the encoder and decoder through residual connections and normalization, enabling the network to capture complex cross-scale relationships while preserving fine-grained spatial details essential for precise defect localization. Built upon the UniAD architecture~\cite{you2022unified}, our method achieves state-of-the-art performance on two widely adopted benchmarks: 97.17% image-level AUROC and 96.93% pixel-level AUROC on MVTec-AD, and 91.08% image-level AUROC and 99.08% pixel-level AUROC on VisA, surpassing existing methods by notable margins while introducing no additional learnable parameters or computational complexity.
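A minimal sketch of a stochastic feature perturbation pool. The Gaussian case follows the abstract directly, while the 'F-Noise' and 'F-Drop' interpretations (multiplicative feature noise and channel dropout) are hypothetical stand-ins, since the paper does not define them here.
```python
import numpy as np

def perturbation_pool(features, rng, sigma=0.1, drop_prob=0.1):
    """Randomly pick one perturbation from a pool and apply it to a feature map.

    features: (C, H, W) feature map from the encoder.
    'f_noise' and 'f_drop' below are assumed interpretations, not the paper's exact definitions.
    """
    choice = rng.choice(["gaussian", "f_noise", "f_drop", "identity"])
    if choice == "gaussian":
        return features + rng.normal(0.0, sigma, size=features.shape)
    if choice == "f_noise":
        return features * rng.normal(1.0, sigma, size=features.shape)
    if choice == "f_drop":
        keep = rng.random(features.shape[0]) > drop_prob
        return features * keep[:, None, None]      # drop whole channels
    return features                                # leave some samples unperturbed

rng = np.random.default_rng(0)
feat = rng.normal(size=(256, 14, 14))
print(perturbation_pool(feat, rng).shape)
```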
[241] DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
Shengqin Wang, Wentao Yan, Huichi Zhou, Yihang Chen, Kun Shao, Zhizhong Zhang, Yuan Xie
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer from premature interaction collapse for two primary reasons: 1) the terminal reward, typically appended only to the last token, prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent. The framework leverages structural proximity to derive advantage signals from whole rollout trajectories across an entire batch, so that trajectories of different lengths are encouraged even when they contain the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate interaction tolerance, ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we construct a multi-step deep-reasoning dataset of 3,602 high-quality QA pairs, each requiring at least 3 reasoning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.
[242] Framelet-Based Blind Image Restoration with Minimax Concave Regularization
Heng Zhang, Reza Parvaz, Rui Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recovering corrupted images is one of the most challenging problems in image processing. Among various restoration tasks, blind image deblurring has been extensively studied due to its practical importance and inherent difficulty. In this problem, both the point spread function (PSF) and the underlying latent sharp image must be estimated simultaneously. This problem cannot be solved directly due to its ill-posed nature. One powerful tool for solving such problems is total variation (TV) regularization. The $\ell_0$-norm regularization within the TV framework has been widely adopted to promote sparsity in image gradients or transform domains, leading to improved preservation of edges and fine structures. However, the use of the $\ell_0$-norm results in a highly nonconvex and computationally intractable optimization problem, which limits its practical applicability. To overcome these difficulties, we employ the minimax concave penalty (MCP), which promotes enhanced sparsity and provides a closer approximation to the $\ell_0$-norm. In addition, a reweighted $\ell_1$-norm regularization is incorporated to further reduce estimation bias and improve the preservation of fine image details and textures. After introducing the proposed model, a numerical algorithm is developed to solve the resulting optimization problem. The effectiveness of the proposed approach is then demonstrated through experimental evaluations on several test images.
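For reference, the standard formulation of the minimax concave penalty from the sparse-regression literature (the paper's exact parameterization may differ) applied elementwise to a coefficient $t$ is $\rho_{\mathrm{MCP}}(t;\lambda,\gamma) = \lambda|t| - \frac{t^2}{2\gamma}$ for $|t| \le \gamma\lambda$, and $\rho_{\mathrm{MCP}}(t;\lambda,\gamma) = \frac{\gamma\lambda^2}{2}$ for $|t| > \gamma\lambda$, with $\gamma > 1$. The penalty grows like the $\ell_1$-norm near zero but saturates at a constant for large coefficients, which is why it approximates the $\ell_0$-norm more closely while remaining far more tractable to optimize.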
[243] Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Qi Zhang, Jixuan Chen, Kaiyi Zhang, Xinquan Yu, Antoni B. Chan, Hui Huang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-view crowd tracking estimates each person’s trajectory on the ground plane of the scene. Recent research mainly relies on CNN-based multi-view crowd tracking architectures, and most methods are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and contain only tens of frames in the evaluation stage, it is difficult to apply current methods to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, \textit{MVTrackTrans}, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. In addition, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover much larger scenes over longer time periods. Compared with existing methods on the two new large datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.
[244] PLaMo 2.1-VL Technical Report
Tommi Kerola, Yuya Masuda, Takashi Masuko, Toshiki Nakanishi, Daisuke Nishino, Kuniyuki Takahashi, Hanqin Wang, Yoshihiro Yamada
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
[245] Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data
Shijie Wang, Zijian Wang, Yadan Luo, Haojie Li, Zi Huang, Mahsa Baktashmotlagh
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Ultra-fine-grained visual categorization (Ultra-FGVC) aims to classify highly similar subcategories within fine-grained objects using limited training samples. However, holistic yet discriminative cues, such as leaf contours in extremely similar cultivars, remain under-explored in current studies, thereby limiting recognition performance. Though crucial, modeling holistic cues with complex morphological structures typically requires massive training samples, posing significant challenges in data-limited scenarios. To address this challenge, we propose a novel Divide-and-Conquer Holistic Cognition Network (DHCNet) that implements a divide-and-conquer strategy by decomposing holistic cues into spatially-associated subtle discrepancies and progressively establishing the holistic cognition process, significantly simplifying holistic cognition while reducing dependency on training data. Technically, DHCNet begins by progressively analyzing subtle discrepancies, transitioning from smaller local patches to larger ones using a self-shuffling operation on local regions. Simultaneously, it leverages the unaffected local regions to potentially guide the perception of the original topological structure among the shuffled patches, thereby aiding in the establishment of spatial associations for these discrepancies. Additionally, DHCNet incorporates the online refinement of these holistic cues discovered from local regions into the training process to iteratively improve their quality. As a result, DHCNet uses these holistic cues as supervisory signals to fine-tune the parameters of the recognition model, thus improving its sensitivity to holistic cues across the entire objects. Extensive evaluations demonstrate that DHCNet achieves remarkable performance on five widely-used Ultra-FGVC datasets.
[246] Geometry-Guided Self-Supervision for Ultra-Fine-Grained Recognition with Limited Data
Shijie Wang, Yadan Luo, Zijian Wang, Haojie Li, Zi Huang, Mahsa Baktashmotlagh
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper investigates the intrinsic geometrical features of highly similar objects and introduces a general self-supervised framework called the Geometric Attribute Exploration Network (GAEor), which is designed to address the ultra-fine-grained visual categorization (Ultra-FGVC) task in data-limited scenarios. Unlike prior work that often captures subtle yet critical distinctions, GAEor generates geometric attributes as novel alternative recognition cues. These attributes are determined by various details within the object, aligned with its geometric patterns, such as the intricate vein structures in soybean leaves. Crucially, each category exhibits distinct geometric descriptors that serve as powerful cues, even among objects with minimal visual variation – a factor largely overlooked in recent research. GAEor discovers these geometric attributes by first amplifying geometry-relevant details via visual feedback from a backbone network, then embedding the relative polar coordinates of these details into the final representation. Extensive experiments demonstrate that GAEor significantly sets new state-of-the-art records in five widely-used Ultra-FGVC benchmarks.
[247] RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow
Xunpei Sun, Zuoxun Hou, Yi Chang, Gang Chen, Wei-Shi Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at https://github.com/sunzunyi/RAFT-MSF-PlusPlus.
[248] Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms
Samyak Sanghvi, Piyush Miglani, Sarvesh Shashikumar, Kaustubh R Borgavi, Veenu Singla, Chetan Arora
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
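A small sketch of contrastive learning between RoI embeddings with hard-negative emphasis, in the spirit of component (2) above; the InfoNCE form, top-k negative mining, and temperature value are common choices assumed here, not the paper's exact loss.
```python
import numpy as np

def info_nce_with_hard_negatives(anchor, positive, negatives, temperature=0.07, top_k=5):
    """Contrastive loss over RoI embeddings: pull the positive RoI close and push
    away only the top-k hardest (most similar) negatives."""
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos_sim = float(a @ p) / temperature
    neg_sims = (n @ a) / temperature                   # similarity to each negative
    hard = np.sort(neg_sims)[-top_k:]                  # keep only the hardest negatives
    logits = np.concatenate([[pos_sim], hard])
    # loss = -log softmax of the positive, computed with a stable log-sum-exp
    return -pos_sim + np.log(np.exp(logits - logits.max()).sum()) + logits.max()

rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
positive = anchor + 0.1 * rng.normal(size=128)         # RoI of the same class
negatives = rng.normal(size=(64, 128))                 # RoIs of other classes
print(info_nce_with_hard_negatives(anchor, positive, negatives))
```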
[249] Detection of T-shirt Presentation Attacks in Face Recognition Systems
Mathias Ibsen, Loris Tim Ide, Christian Rathgeb, Christoph Busch
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Face recognition systems are often used for biometric authentication. Nevertheless, it is known that without any protective measures, face recognition systems are vulnerable to presentation attacks. To tackle this security problem, methods for detecting presentation attacks have been developed and have shown good detection performance on several benchmark datasets. However, generalising presentation attack detection methods to new and novel types of attacks is an ongoing challenge. In this work, we employ 1,608 T-shirt attacks of the T-shirt Face Presentation Attack (TFPA) database, created using 100 unique presentation attack instruments, together with 152 bona fide presentations. In a comprehensive evaluation, we show that this type of attack can compromise the security of face recognition systems. Furthermore, we propose a detection method based on spatial consistency checks in order to detect such T-shirt attacks. Specifically, state-of-the-art face and person detectors are combined to analyse the spatial positions of detected faces and persons, based on which T-shirt attacks can be reliably detected.
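A toy sketch of a spatial consistency check in the spirit of the method above: a face detection is treated as suspicious when it does not fall in the upper (head) portion of the associated person detection, as a face printed on a T-shirt typically appears near the torso instead. The head-fraction and overlap thresholds are illustrative guesses, not the paper's calibrated values.
```python
def face_position_plausible(face_box, person_box, head_fraction=0.35, min_overlap=0.8):
    """Return True if the face box lies mostly within the upper (head) region of the
    person box. Boxes are (x0, y0, x1, y1) with y increasing downwards."""
    px0, py0, px1, py1 = person_box
    head_region = (px0, py0, px1, py0 + head_fraction * (py1 - py0))
    fx0, fy0, fx1, fy1 = face_box
    ix0, iy0 = max(fx0, head_region[0]), max(fy0, head_region[1])
    ix1, iy1 = min(fx1, head_region[2]), min(fy1, head_region[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    face_area = max(1e-6, (fx1 - fx0) * (fy1 - fy0))
    return inter / face_area >= min_overlap

person = (100, 50, 300, 650)
real_face = (170, 80, 230, 160)       # inside the head region -> plausible
tshirt_face = (170, 350, 230, 430)    # on the torso -> flagged as attack candidate
print(face_position_plausible(real_face, person), face_position_plausible(tshirt_face, person))
```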
[250] Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving
Ghadah Alosaimi, Hanadi Alhamdan, Wenke E, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: https://github.com/galosaimi/Mind2Drive.
[251] IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging
Philipp Weigand, Niels Nawrot, Nikolas Ebert, Carsten Hopf, Oliver Wasenmüller
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Peak picking is a fundamental preprocessing step in Mass Spectrometry Imaging (MSI), where each sample is represented by hundreds to thousands of ion images. Existing approaches require careful dataset-specific hyperparameter tuning, and often fail to generalize across acquisition protocols. We introduce IonMorphNet, a spatial-structure-aware representation model for ion images that enables fully data-driven peak picking without any task-specific supervision. We curate 53 publicly available MSI datasets and define six structural classes capturing representative spatial patterns in ion images to train standard image backbones for structural pattern classification. Once trained, IonMorphNet can assess ion images and perform peak picking without additional hyperparameter tuning. Using a ConvNeXt V2-Tiny backbone, our approach improves peak picking performance by +7 % mSCF1 compared to state-of-the-art methods across multiple datasets. Beyond peak picking, we demonstrate that spatially informed channel reduction enables a 3D CNN for patch-based tumor classification in MSI. This approach matches or exceeds pixel-wise spectral classifiers by up to +7.3 % Balanced Accuracy on three tumor classification tasks, indicating meaningful ion image selection. The source code and model weights are available at https://github.com/CeMOS-IS/IonMorphNet.
[252] PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
Yining Pan, Shijie Li, Yuchen Wu, Xulei Yang, Na Zhao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, show that PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.
[253] Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen, Zixu Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the “small loss hypothesis”, but the unique semantic ambiguity in NTC, such as “partial matching”, invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic “representation pollution”. To address this critical challenge, we propose a novel “Expert-Proxy-Diversion” decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy “arbiter” to internalize the expert’s discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI’s matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
[254] HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition
Xiaoqi Zhuang, Jefersson A. Dos Santos, Jungong Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Satellite image composition plays a critical role in remote sensing applications such as data augmentation, disaster simulation, and urban planning. We propose HarmoniDiff-RS, a training-free diffusion-based framework for harmonizing composite satellite images under diverse domain conditions. Our method aligns the source and target domains through a Latent Mean Shift operation that transfers radiometric characteristics between them. To balance harmonization and content preservation, we introduce a Timestep-wise Latent Fusion strategy that leverages early inverted latents for strong harmonization and late latents for semantic consistency to generate a set of composite candidates. A lightweight harmony classifier is then trained to automatically select the most coherent result among them. We also construct RSIC-H, a benchmark dataset for satellite image harmonization derived from fMoW, providing 500 paired composition samples. Experiments demonstrate that our method effectively performs satellite image composition, showing strong potential for scalable remote-sensing synthesis and simulation tasks. Code is available at: https://github.com/XiaoqiZhuang/HarmoniDiff-RS.
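The Latent Mean Shift is described only at a high level; under the assumption that it amounts to channel-wise first-moment (and optionally second-moment) alignment of diffusion latents, a minimal sketch could read:

```python
import torch

def latent_mean_shift(source_latent, target_latent, match_std=False):
    """Shift source latents so their channel statistics match the target's.

    Both tensors have shape (C, H, W); the per-channel mean (and optionally
    std) of the composite/source latent is moved onto the background/target
    statistics. This is an assumed reading of the paper's Latent Mean Shift.
    """
    src_mean = source_latent.mean(dim=(1, 2), keepdim=True)
    tgt_mean = target_latent.mean(dim=(1, 2), keepdim=True)
    shifted = source_latent - src_mean
    if match_std:
        src_std = source_latent.std(dim=(1, 2), keepdim=True) + 1e-6
        tgt_std = target_latent.std(dim=(1, 2), keepdim=True)
        shifted = shifted / src_std * tgt_std
    return shifted + tgt_mean
```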
[255] VecHeart: Holistic Four-Chamber Cardiac Anatomy Modeling via Hybrid VecSets
Yihong Chen, Pascal Fua
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate cardiac anatomy modeling requires handling intricate interrelations among structures. In this paper, we propose VecHeart, a unified framework for holistic reconstruction and generation of four-chamber cardiac structures. To overcome the limitations of current feed-forward implicit methods, specifically their restriction to single-object modeling and their neglect of inter-part correlations, we introduce the Hybrid Part Transformer, which leverages part-specific learnable queries and interleaved attention to capture complex inter-chamber dependencies. Furthermore, we propose Anatomical Completion Masking and Modality Alignment strategies, enabling the model to infer complete four-chamber structures from partial, sparse, or noisy observations, even when certain anatomical parts are entirely missing. VecHeart also seamlessly extends to 3D+t dynamic mesh sequence generation, demonstrating exceptional versatility. Experiments show that our method achieves state-of-the-art performance, maintaining high-fidelity reconstruction across diverse challenging scenarios. Code will be released.
[256] HP-Edit: A Human-Preference Post-Training Framework for Image Editing
Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu, Jiaxiu Jiang, Xinran Qin, Zhikai Chen, Fenglong Song, Zhixin Wang, Renjing Pei, Wangmeng Zuo
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset spanning eight common editing tasks with balanced coverage of common object edits. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human-preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.
[257] GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes
Joshua Niemeijer, Alaa Eddine Ben Zekri, Reza Bahmanyar, Philipp M. Schmälzle, Houda Chaabouni-Chouayakh, Franz Kurz
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird’s-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.
[258] VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou, Hongfei Wang, Zhaofan Zou, Hao Sun, Xuelong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model’s response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model’s activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate their influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model’s original computational efficiency.
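The abstract only states that SVD over activation responses isolates a hallucination subspace whose influence is then attenuated; a generic, non-authoritative version projects activations away from the top singular directions of the clean-vs-perturbed activation difference (function names and the weight-editing convention are assumptions):

```python
import torch

def hallucination_projector(act_clean, act_perturbed, rank=4):
    """Build a projector that removes an assumed 'hallucination subspace'.

    act_clean, act_perturbed: (N, D) activations collected with the original
    and contrastively perturbed images. The top-`rank` right singular vectors
    of their difference span the subspace to attenuate.
    """
    diff = act_perturbed - act_clean                       # (N, D)
    _, _, vh = torch.linalg.svd(diff, full_matrices=False)
    basis = vh[:rank]                                      # (rank, D)
    return torch.eye(diff.shape[1], device=diff.device) - basis.t() @ basis  # (D, D)

def edit_weight(weight, projector):
    """Apply the projector on the output side of a layer whose outputs live
    in the same D-dimensional activation space (assumed convention)."""
    return projector @ weight
```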
[259] TESO: Online Tracking of Essential Matrix by Stochastic Optimization
Jaroslav Moravec, Radim Šára, Akihiro Sugimoto
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Maintaining long-term accuracy of stereo camera calibration parameters is important for autonomous systems’ perception. This work proposes Online Tracking of Essential Matrix by Stochastic Optimization (TESO). The core mechanisms of TESO are: 1) a robust loss function based on kernel correlation over tentative correspondences, 2) an adaptive online stochastic optimization on the essential manifold. TESO has low CPU and memory requirements, relies on a few hyperparameters, and eliminates the need for data-driven training, enabling its use in resource-constrained online perception systems. We evaluated the influence of TESO on geometric precision, rectification quality, and stereo depth consistency. On the large-scale MAN TruckScenes dataset, TESO tracks rotational calibration drift with 0.12 deg precision in the Y-axis (critical for stereo accuracy) while the X- and Z-axes are five times more precise. Tracking applied to sequences with simulated drift shows similar precision with respect to the reference as tracking applied to no-drift sequences, indicating the tracker is unbiased. On the KITTI dataset, TESO revealed systematic inconsistencies in extrinsic parameters across stereo pairs, confirming previously published findings. We verified that intrinsic decalibration affected these errors, as evidenced by the conflicting behavior of the rectification and depth metrics. After correcting the reference calibration, TESO improved its rotation precision around the Y-axis 20-fold to 0.025 deg and its depth accuracy 50-fold. Despite its lightweight design, direct optimization of the proposed TESO loss function alone achieves accuracy comparable to that of neural network-based single-frame methods.
[260] DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
Xinwei He, Yansong Zheng, Qianru Han, Zhichuan Wang, Yuxuan Cai, Yang Zhou, Jingbo Xia, Yulong Wang, Jinhai Xiang, Xiang Bai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP’s strong generalization ability, its limited fine-grained discrimination prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting to average view patterns of known classes. To combat this, we then design the Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to explicitly mitigate bias towards known categories. Under the hood, VFS leverages CLIP’s broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
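The chunking idea can be illustrated with a minimal sketch, not the paper's actual CAM: multi-view features from a frozen backbone are split into chunks, pooled within each chunk, lightly adapted, and pooled again into a single descriptor (feature dimension and chunk size are placeholders):

```python
import torch
import torch.nn as nn

class ChunkPooling(nn.Module):
    """Illustrative chunk-wise pooling over multi-view features."""

    def __init__(self, dim=768, chunk_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.adapter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, view_feats):
        # view_feats: (B, V, D) frozen per-view features, V divisible by chunk_size
        b, v, d = view_feats.shape
        chunks = view_feats.view(b, v // self.chunk_size, self.chunk_size, d)
        chunk_desc = chunks.mean(dim=2)                      # pool views within each chunk
        chunk_desc = chunk_desc + self.adapter(chunk_desc)   # lightweight residual adaptation
        return chunk_desc.mean(dim=1)                        # (B, D) final 3D descriptor
```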
[261] LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results
Xiang Chen, Hao Li, Jiangxin Dong, Jinshan Pan, Xin Li, Xin He, Naiwei Chen, Shengyuan Li, Fengning Liu, Haoyi Lv, Haowei Peng, Yilian Zhong, Yuxiang Chen, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Kaibin Chen, Xu Zhang, Xuhui Cao, Jiaqi Ma, Ziqi Wang, Shengkai Hu, Yuning Cui, Huan Zhang, Shi Chen, Bin Ren, Lefei Zhang, Guanglu Dong, Qiyao Zhao, Tianheng Zheng, Chunlei Li, Lichao Mou, Chao Ren, Wangzhi Xing, Xin Lu, Enxuan Gu, Jingxi Zhang, Diqi Chen, Qiaosi Yi, Bingcai Wei, Mingyu Liu, Pengyu Wang, Ce Liu, Miaoxin Guan, Boyu Chen, Hongyu Li, Jian Zhu, Xinrui Luo, Ziyang He, Jiayu Wang, Yichen Xiang, Huayi Qi, Haoyu Bian, Yiran Li, Sunlichen Zhou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a review of the LoViF Challenge on Real-World All-in-One Image Restoration. The challenge aimed to advance research on real-world all-in-one image restoration under diverse real-world degradation conditions, including blur, low-light, haze, rain, and snow. It provided a unified benchmark to evaluate the robustness and generalization ability of restoration models across multiple degradation categories within a common framework. The competition attracted 124 registered participants and received 9 valid final submissions with corresponding fact sheets, significantly contributing to the progress of real-world all-in-one image restoration. This report provides a detailed analysis of the submitted methods and corresponding results, emphasizing recent progress in unified real-world image restoration. The analysis highlights effective approaches and establishes a benchmark for future research in real-world low-level vision.
[262] TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
Hongyu Zhang, Yufan Deng, Zilin Pan, Peng-Tao Jiang, Bo Li, Qibin Hou, Zhiyang Dou, Zhen Dong, Daquan Zhou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
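Without the paper's exact formulation, the core idea of restricting each temporal segment of video tokens to its own event prompt can be sketched as a block cross-attention mask (token counts and layout are hypothetical):

```python
import torch

def build_event_attention_mask(frames_per_event, tokens_per_frame, prompt_lens):
    """Boolean mask (True = attend) for multi-event cross-attention.

    Video tokens of event i may only attend to the text tokens of prompt i,
    so that motion in each segment is tied to its own description.
    """
    n_events = len(prompt_lens)
    n_video = n_events * frames_per_event * tokens_per_frame
    n_text = sum(prompt_lens)
    mask = torch.zeros(n_video, n_text, dtype=torch.bool)
    text_start = 0
    for i, plen in enumerate(prompt_lens):
        v_start = i * frames_per_event * tokens_per_frame
        v_end = v_start + frames_per_event * tokens_per_frame
        mask[v_start:v_end, text_start:text_start + plen] = True
        text_start += plen
    return mask

# Two events, 2 frames each, 4 tokens per frame, prompts of length 5 and 7:
m = build_event_attention_mask(2, 4, [5, 7])
print(m.shape)  # torch.Size([16, 12])
```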
[263] Deep sprite-based image models: An analysis
Zeynep Sonat Baltacı, Romain Loiseau, Mathieu Aubry
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While foundation models drive steady progress in image segmentation and diffusion algorithms compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, explicitly identifies object categories, and fully models images in an easily interpretable way.
[264] Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram
Michael Achmann-Denkler, Mario Haim, Christian Wolff
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
[265] Evaluating Histogram Matching for Robust Deep learning-Based Grapevine Disease Detection
Ruben Pascual, Inés Hernández, Salvador Gutiérrez, Javier Tardaguila, Pedro Melo-Pinto, Daniel Paternain, Mikel Galar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Variability in illumination is a primary factor limiting deep learning robustness for field-based plant disease detection. This study evaluates Histogram Matching (HM), a technique that transforms the pixel intensity distribution of an image to match a reference profile, to mitigate this in grapevine classification, distinguishing among healthy leaves, downy mildew, and spider mite damage. We propose a dual-stage integration of HM: (i) as a preprocessing step for normalization, and (ii) as a data augmentation technique to introduce controlled training variability. Experiments using 1,469 RGB images (comprising homogeneous leaf-focused and heterogeneous canopy samples) to train ResNet-18 models demonstrate that this combination significantly enhances robustness on real-world canopy images. While leaf-focused samples showed marginal gains, the canopy subset improved markedly, indicating that balancing normalization with histogram-based diversification effectively bridges the domain gap caused by uncontrolled lighting.
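Histogram matching itself is a standard operation; a minimal sketch of the dual use described above (normalising to a fixed reference as preprocessing, matching to a randomly drawn training image as augmentation) using scikit-image, with helper names and the training-pool structure assumed for illustration:

```python
import numpy as np
from skimage.exposure import match_histograms

def hm_normalize(image, reference):
    """Preprocessing: map the image's per-channel histogram onto a fixed reference image."""
    return match_histograms(image, reference, channel_axis=-1)

def hm_augment(image, train_pool, rng=None):
    """Augmentation: match the histogram of a randomly drawn training image."""
    rng = rng or np.random.default_rng()
    ref = train_pool[rng.integers(len(train_pool))]
    return match_histograms(image, ref, channel_axis=-1)
```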
[266] Paparazzo: Active Mapping of Moving 3D Objects
Davide Allegro, Shiyao Li, Stefano Ghidoni, Vincent Lepetit
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object’s motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target’s trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: https://davidea97.github.io/paparazzo-page/
[267] EgoSelf: From Memory to Personalized Egocentric Assistant
Yanshuo Wang, Yuan Xu, Xuesong Li, Jie Hong, Yizhou Wang, Chang Wen Chen, Wentao Zhu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user’s historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at \href{https://abie-e.github.io/egoself_project/}{https://abie-e.github.io/egoself_project/}.
[268] RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation
Ahmed Marouane Djouama, Abir Belaala, Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Cosimo Distante, Abdenour Hadid
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
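Rectified flow follows a simple recipe regardless of architecture; a generic training step and few-step Euler sampler (not RF-HiT's specific conditioning; the model signature is assumed) interpolates between a noise sample and the target mask and regresses the constant straight-line velocity:

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, mask, cond_features):
    """One generic rectified-flow training step for segmentation.

    mask: (B, C, H, W) target segmentation (e.g. one-hot); cond_features is
    whatever image conditioning the model expects. The model predicts the
    velocity v(x_t, t, cond) and is trained towards the target x1 - x0.
    """
    x1 = mask
    x0 = torch.randn_like(x1)                                # simple prior sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1                            # linear interpolation
    v_target = x1 - x0
    v_pred = model(x_t, t.flatten(), cond_features)
    return F.mse_loss(v_pred, v_target)

@torch.no_grad()
def sample(model, cond_features, shape, steps=3):
    """Few-step Euler integration from noise to a segmentation estimate."""
    x = torch.randn(shape, device=cond_features.device)
    for i in range(steps):
        t = torch.full((shape[0],), i / steps, device=x.device)
        x = x + model(x, t, cond_features) / steps
    return x
```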
[269] TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing
Yanhui Chen, Jiahong Li, Jingchao Wang, Junyi Lin, Zixin Zeng, Yang Shi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Language-driven 3D Gaussian Splatting (3DGS) editing provides a more convenient approach for modifying complex scenes in VR/AR. Standard pipelines typically adopt a two-stage strategy: first editing multiple 2D views, and then optimizing the 3D representation to match these edited observations. Existing methods mainly improve view consistency through multi-view feature fusion, attention filtering, or iterative recalibration. However, they fail to explicitly address a more fundamental issue: the semantic correspondence between edited 2D evidence and 3D Gaussians. To tackle this problem, we propose TransSplat, which formulates language-driven 3DGS editing as a multi-view unbalanced semantic transport problem. Specifically, our method establishes correspondences between visible Gaussians and view-specific editing prototypes, thereby explicitly characterizing the semantic relationship between edited 2D evidence and 3D Gaussians. It further recovers a cross-view shared canonical 3D edit field to guide unified 3D appearance updates. In addition, we use transport residuals to suppress erroneous edits in non-target regions, mitigating edit leakage and improving local control precision. Qualitative and quantitative results show that, compared with existing 3D editing methods centered on enhancing view consistency, TransSplat achieves superior performance in local editing accuracy and structural consistency.
[270] SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing
Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang, Ruiyang Fan, Linxiao Shi, Qirui Yang, Jian Zhang, Chengcheng Liu, Siming Zheng, Jinwei Chen, Bo Li, Peng-Tao Jiang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.
[271] Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping
Jienan Lyu, Miao Yang, Jinchen Cai, Yiwen Hu, Guanyi Lu, Junhao Qiu, Runmin Dong
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.
[272] PC2Model: ISPRS benchmark on 3D point cloud to model registration
Mehdi Maboudi, Said Harb, Jackson Ferrao, Kourosh Khoshelham, Yelda Turkan, Karam Mawas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR).With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: https://zenodo.org/uploads/17581812.
[273] Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding
Kadir Yilmaz, Adrian Kruse, Tristan Höfer, Daan de Geus, Bastian Leibe
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data-efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state-of-the-art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain-specific 3D backbones, achieving state-of-the-art results across indoor and outdoor datasets. Finally, when used as a drop-in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general-purpose backbone for 3D scene understanding.
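Volumetric patch tokenisation mirrors ViT patch embedding in 3D; a sketch under assumed voxel-grid shapes and patch size (not the authors' code) is:

```python
import torch
import torch.nn as nn

class VolumePatchEmbed(nn.Module):
    """Turn a voxelised scene into patch tokens, as a 3D analogue of ViT patchify."""

    def __init__(self, in_channels=4, patch_size=8, dim=384):
        super().__init__()
        # A strided 3D convolution both partitions the volume into cubes
        # and linearly projects each cube to a token embedding.
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, volume):
        # volume: (B, C, D, H, W) voxel features (e.g. occupancy + colour)
        tokens = self.proj(volume)                     # (B, dim, D', H', W')
        grid = tokens.shape[2:]                        # grid used for 3D positional encoding
        return tokens.flatten(2).transpose(1, 2), grid # (B, N, dim) tokens
```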
[274] GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction
Pradyumna YM, Yuxuan Xue, Yue Chen, Nikita Kister, István Sárándi, Gerard Pons-Moll
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ${\sim}50{\times}$ lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of cases in a three-way user study. Project page: https://pradyumnaym.github.io/graft .
[275] MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation
Xuejiao Wang, Bohao Zhang, Changbo Wang, Gaoqi He
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.
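As a rough illustration of the kinds of motion attributes listed above (the exact definitions are the paper's, not shown here), one can derive them from per-frame subject and object box centres:

```python
import numpy as np

def pair_motion_attributes(subj_centers, obj_centers, fps=30.0):
    """Distance, relative speed, motion persistence and directional consistency
    for one subject-object pair over T frames (centres: (T, 2) arrays).
    The definitions below are illustrative, not MoSA's exact formulation."""
    rel = obj_centers - subj_centers                   # relative position per frame
    dist = np.linalg.norm(rel, axis=1)                 # (T,) distances
    vel = np.diff(rel, axis=0) * fps                   # (T-1, 2) relative velocity
    speed = np.linalg.norm(vel, axis=1)
    moving = speed > 1e-3
    persistence = moving.mean()                        # fraction of frames with motion
    if moving.sum() >= 2:
        dirs = vel[moving] / speed[moving, None]       # unit velocity directions
        direction_consistency = float(np.linalg.norm(dirs.mean(axis=0)))
    else:
        direction_consistency = 0.0
    return {
        "mean_distance": float(dist.mean()),
        "mean_speed": float(speed.mean()),
        "persistence": float(persistence),
        "direction_consistency": direction_consistency,
    }
```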
[276] CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
Weidong Chen, Dexiang Hong, Zhendong Mao, Yutao Cheng, Xinyan Liu, Lei Zhang, Yongdong Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, \emph{i.e.,} the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, \emph{e.g.,} achieving an overall average improvement of 23.7% across all metrics.
[277] CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation
Yanhui Chen, Baoyao Yang, Siqi Liu, Jingchao Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
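The decoupled inference can be approximated with a small sketch: per-class score maps obtained from synonymous prompts are first averaged (intra-class enhancement), then all classes compete per pixel on the shared scale (inter-class competition). The map shapes, the shared [0, 1] score scale, and the use of a simple mean are assumptions, not the paper's exact design:

```python
import numpy as np

def synonym_fused_inference(prompt_scores, class_of_prompt):
    """Fuse synonym score maps per class, then take a per-pixel argmax.

    prompt_scores: dict {prompt: (H, W) score map on a shared, comparable scale}
    class_of_prompt: dict {prompt: class name}
    Returns (class_names, (H, W) label map of indices into class_names).
    """
    per_class = {}
    for prompt, score in prompt_scores.items():
        per_class.setdefault(class_of_prompt[prompt], []).append(score)
    class_names = sorted(per_class)
    stacked = np.stack([np.mean(per_class[c], axis=0) for c in class_names])  # (K, H, W)
    return class_names, stacked.argmax(axis=0)
```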
[278] CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Synthesizing human–object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand–object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
[279] InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
[280] MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention
Zhi Chen, Runze Hu, Le Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.
[281] MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
[282] IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow
Zihao Fan, Xin Lu, Jie Xiao, Dong Li, Jie Huang, Xueyang Fu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In image restoration, single-step discriminative mappings trained by expectation learning often lack fine details, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is introduced to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at https://github.com/fanzh03/IR-Flow.
[283] Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang, Tianrun Yuan, Juntong Chen, Yongkang Zhu, Fanhu Zeng, Xuanyu Zhu, Yige Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
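The step-level evaluation uses dynamic programming to align predicted reasoning steps with reference solutions; the sketch below shows a generic gap-tolerant alignment of this kind, with the step-similarity function `sim` left as a placeholder (the exact scoring used by StepSTEM is not given here).

```python
def align_steps(pred_steps, ref_steps, sim, gap=0.0):
    """Maximize total step-level similarity between two step sequences."""
    n, m = len(pred_steps), len(ref_steps)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim(pred_steps[i - 1], ref_steps[j - 1]),  # match
                dp[i - 1][j] + gap,   # skip a predicted step
                dp[i][j - 1] + gap,   # skip a reference step
            )
    return dp[n][m]

# With several reference solutions, score against each and keep the best:
# best = max(align_steps(pred, ref, sim) for ref in reference_solutions)
```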
[284] Face Anything: 4D Face Reconstruction from Any Image Sequence
Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
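As a toy illustration of why per-pixel canonical facial coordinates make dense tracking easy, the sketch below matches pixels of two frames by nearest neighbors in canonical space (brute force, for clarity); the threshold and data layout are assumptions, not the paper's pipeline.

```python
import numpy as np

def match_by_canonical_coords(canon_a, canon_b, max_dist=0.01):
    """canon_a, canon_b: (H, W, 3) predicted canonical coordinates for two frames."""
    h, w, _ = canon_a.shape
    flat_a = canon_a.reshape(-1, 3)
    flat_b = canon_b.reshape(-1, 3)
    matches = []
    for idx_a, c in enumerate(flat_a):
        d = np.linalg.norm(flat_b - c, axis=1)   # distance to every pixel of frame B
        idx_b = int(d.argmin())
        if d[idx_b] < max_dist:                  # accept only close canonical matches
            matches.append((divmod(idx_a, w), divmod(idx_b, w)))
    return matches  # list of ((row_a, col_a), (row_b, col_b)) pixel pairs
```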
[285] SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
Zewei Zhou, Ruining Yang, Xuewei Qi, Yiluan Guo, Sherry X. Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, Jiaqi Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high action-generation latency due to their autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework integrating an autoregressive reasoning module with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that leverages the VLM's visual and reasoning guidance to plan future trajectories with a flow-matching policy conditioned on a historical-trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method that enables the VLA model not only to learn from positive driving samples but also to avoid typical negative behaviors and to learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
[286] A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems
Houchao Gan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE 37-node feeder with high photovoltaic penetration and a primal–dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.
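To illustrate the communication effect being studied, the toy loop below runs a scalar primal–dual dispatch in which the DER applies the last dual value it received (hold-last-value) while downlink packets are delayed; all constants and the scalar model are illustrative, not the paper's feeder model.

```python
def simulate_dispatch(steps, delay_steps, alpha=0.2, beta=0.2,
                      p_pref=0.5, p_ref=1.0, c=1.0):
    lam, p = 0.0, p_pref
    sent = [0.0]                      # dual values already sent over the downlink
    trajectory = []
    for k in range(steps):
        lam_seen = sent[max(0, k - delay_steps)]        # hold-last-value under delay
        p = p - alpha * (c * (p - p_pref) + lam_seen)   # DER-side primal step
        lam = lam + beta * (p - p_ref)                  # coordinator dual step
        sent.append(lam)
        trajectory.append(p)
    return trajectory

# With delay_steps=0 and small step sizes the trajectory settles near p_ref;
# increasing the delay can degrade tracking and induce oscillations.
```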
[287] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
[288] Generative Drifting for Conditional Medical Image Generation
Zirong Li, Siyuan Mei, Weiwen Wu, Andreas Maier, Lina Gölz, Yan Xia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.
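The gradient coordination rule itself is not spelled out in the abstract; as a generic stand-in, the sketch below uses a PCGrad-style projection that removes the conflicting component of one objective's gradient before summing. It only illustrates the kind of coordination meant, not GDM's actual strategy.

```python
import numpy as np

def coordinate_gradients(g_dist, g_fid):
    """Combine a distribution-level and a fidelity gradient; if they conflict,
    project the first onto the normal plane of the second before summing."""
    g1, g2 = np.asarray(g_dist, float), np.asarray(g_fid, float)
    if np.dot(g1, g2) < 0:                                   # conflicting directions
        g1 = g1 - (np.dot(g1, g2) / (np.dot(g2, g2) + 1e-12)) * g2
    return g1 + g2

# coordinate_gradients([1.0, 0.0], [-0.5, 1.0]) returns a combined update with
# the conflicting component removed from the first gradient.
```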
[289] CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
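A minimal sketch of the retrieval-as-context idea, assuming a corpus of geo-registered images indexed by camera position; the nearest references to a query pose are returned for conditioning. The retrieval criterion is an assumption for illustration, not CityRAG's.

```python
import numpy as np

def retrieve_context(corpus_positions, corpus_images, query_position, k=8):
    """corpus_positions: (N, 3) geo-registered camera positions;
    returns the k reference images closest to the queried position."""
    d = np.linalg.norm(corpus_positions - np.asarray(query_position), axis=1)
    idx = np.argsort(d)[:k]
    return [corpus_images[i] for i in idx]
```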
[290] AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond improving the generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
[291] Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.
[292] Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2304.02296).
[293] How to Teach Large Multimodal Models New Skills
Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.08564).
[294] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Huzheng Yang, James Gee, Jianbo Shi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2406.18344).
[295] On the Generalizability of Foundation Models for Crop Type Mapping
Yi-Chia Chang, Adam J. Stewart, Favyen Bastani, Piper Wolters, Shreya Kannan, George R. Huber, Jingtong Wang, Arindam Banerjee
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2409.09451).
[296] EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training
Yiying Wei, Hadi Amirpour, Jong Hwan Ko, Christian Timmerer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2411.16312).
[297] Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2411.18275).
[298] Uncertainty Quantification in Detection Transformers: Object-Level Calibration and Image-Level Reliability
Young-Jin Park, Carson Sobolewski, Navid Azizan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2412.01782).
[299] 3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography
Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O’Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2502.02779).
[300] You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging
Wentian Xu, Ziyun Liang, Harry Anthony, Yasin Ibrahim, Felix Cohen, Guang Yang, Konstantinos Kamnitsas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2503.06717).
[301] ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, Gautier Viaud
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.08620).
[302] GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations
Zeping Liu, Ni Lao, Zhangyu Wang, Junfeng Jiao, Gengchen Mai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2503.16683).
[303] RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility
Dawood Wasif, Terrence J. Moore, Jin-Hee Cho
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2503.16251).
[304] Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation
Joon Tai Kim, Tianle Chen, Ziyu Dong, Nishanth Kunchala, Alexander Guller, Daniel Ospina Acero, Roger Williams, Mrinal Kumar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2507.06321).
[305] A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Qiang Sun, Geoffrey Martin, Wei Liu, Yifan Peng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2507.09861).
[306] LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2506.09373).
[307] SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis
Chen-Chen Fan, Peiyao Guo, Linping Zhang, Kehan Qi, Haolin Huang, Yong-Qiang Mao, Yuxi Suo, Zhizhuo Jiang, Yu Liu, You He
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2508.02384).
[308] FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2509.12052).
[309] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.14630).
[310] ReefNet: A Large-Scale Dataset and Benchmark for Fine-Grained Coral Reef Recognition
Abdulwahab Felemban, Yahia Battach, Faizan Farooq Khan, Yuqian Fu, Xuhui Liu, Yesmeen M. Khattab, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.16822).
[311] Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2511.01233).
[312] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2511.12606).
[313] Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting
Shantanu Ghosh, Vedant Parthesh Joshi, Rayan Syed, Param Budhraja, Aya Kassem, Katelyn C. Morrison, Alex Tang, Ho Cheung Aiden Wong, Abhishek Varshney, Payel Basak, Weicheng Dai, Judy Wawira Gichoya, Hari M. Trivedi, Imon Banerjee, Shyam Visweswaran, Clare B. Poynton, Kayhan Batmanghelich
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.00198).
[314] Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges
Kiri L. Wagstaff
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.00676).
[315] PhotoFramer: Multi-modal Image Composition Instruction
Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.00993).
[316] Recurrent Video Masked Autoencoders
Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.13684).
[317] MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.15577).
[318] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing
Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.19302).
[319] Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.18358).
[320] Adversarial Attacks on Medical Hyperspectral Imaging Exploiting Spectral-Spatial Dependencies and Multiscale Features
Yunrui Gu, Zhenzhe Gao, Cong Kong, Jiawei Du, Zhaoxia Yin
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.07056).
[321] KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering
Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.11632).
[322] Weakly supervised framework for wildlife detection and counting in challenging Arctic environments: a case study on caribou (Rangifer tarandus)
Ghazaleh Serati, Samuel Foucher, Jerome Theau
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.18891).
[323] Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.26782).
[324] Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu, Xiaobo Xia, Fei Shen, Tat-Seng Chua
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.22737).
[325] Learning Evolution via Optimization Knowledge Adaptation
Chao Wang, Lingling Li, Licheng Jiao, Jiaxuan Zhao, Fang Liu, Shuyuan Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2501.02200).
[326] Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation
Buddhi Wijenayake, Nichula Wasalathilake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake, Vishal M. Patel
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.04749).
[327] TFusionOcc: T-Primitive Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
Zhenxing Ming, Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.06400).
[328] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.20409).
[329] AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2603.13779).
[330] MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Shiyao Li, Antoine Guédon, Shizhe Chen, Vincent Lepetit
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2603.22650).
[331] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Haojing Chen, Zhihang Liu, Yutong Li, Tao Tan, Haoyu Bian, Qiuju Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2603.28584).
[332] Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li, Anan Du, Hailong Yu, Pei Fu, Zhenbo Luo, Jian Luan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.00161).
[333] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Arezoo Borji, Gernot Kronreif, Bernhard Angermayr, Francisco Mario Calisto, Ali Abbasian Ardakani, Wolfgang Birkfellner, Inna Servetnyk, Yinyin Yuan, Sepideh Hatamikia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.01798).
[334] Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
Haocheng Tang, Xingyu Dang, Junmei Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.03476).
[335] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations
Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.01219).
[336] SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration
Xueming Fu, Lixia Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.05301).
[337] VDPP: Video Depth Post-Processing for Speed and Scalability
Daewon Yoon, Injun Baek, Sangyu Han, Yearim Kim, Nojun Kwak
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.06665).
[338] Adaptive Prompt Elicitation for Text-to-Image Generation
Xinyi Wen, Lena Hegemann, Xiaofu Jin, Shuai Ma, Antti Oulasvirta
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.04713).
[339] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, Fan Wu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.08014).
[340] LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.08475).
[341] Unsupervised Local Plasticity in a Multi-Frequency VisNet Hierarchy
Mehdi Fatan Serj, C. Alejandro Parraga, Xavier Otazu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.09734).
[342] Radar-Informed 3D Multi-Object Tracking under Adverse Conditions
Bingxue Xu, Emil Hedemalm, Ajinkya Khoche, Patric Jensfelt
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.13571).
[343] Reward-Aware Trajectory Shaping for Few-step Visual Generation
Rui Li, Bingyu Li, Yuanzhi Liang, HuangHai Bin, Chi Zhang, XueLong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.14910).
[344] The Amazing Stability of Flow Matching
Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16079).
[345] Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement
Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Jasmin Lampert, Phillipp Fanta-Jende, Radu Timofte, Marco Körner
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16177).
[346] Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering
Nirmalendu Prakash, Narmeen Fatimah Oozeer, Xin Su, Phillip Howard, Shaan Shah, Zoe Wanying He, Shuang Wu, Shivam Raval, Roy Ka-Wei Lee, Meenakshi Khosla, Amir Abdullah
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16487).
[347] BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16514).
[348] MESA: A Training-Free Multi-Exemplar Deep Framework for Restoring Ancient Inscription Textures
Vasileios Toulatzis, Sofia Theodoridou, Ioannis Fudos
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.17390).
[349] ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.03037).
[350] IncreFA: Breaking the Static Wall of Generative Model Attribution
Haotian Qin, Dongliang Chang, Yueying Gao, Yuexuan Tan, Lei Chen, Zhanyu Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.17736).
[351] Weakly-Supervised Referring Video Object Segmentation through Text Supervision
Miaojing Shi, Jun Huang, Zijie Yue, Hanli Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.17797).
[352] Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma, Lei Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.18215).
[353] UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
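For reference, the group-relative advantage that GRPO-style methods build on can be computed as below; this is the generic ingredient only, not the full UDM-GRPO objective with its clean-sample actions and forward-process trajectory reconstruction.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (num_groups, group_size) rewards for samples sharing a prompt.
    Each sample's advantage is its reward normalized within its own group."""
    r = np.asarray(rewards, dtype=float)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)

# Example: two prompts, four samples each.
# adv = group_relative_advantages([[0.2, 0.9, 0.4, 0.5], [1.0, 1.0, 0.0, 0.5]])
```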
[354] AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
Rui Qian, Chuanhang Deng, Qiang Huang, Jian Xiong, Mingxuan Li, Yingbo Zhou, Wei Zhai, Jintao Chen, Dejing Dou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.18562).
[355] MultiWorld: Scalable Multi-Agent Multi-View Video World Models
Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.18564).
[356] Personalized Embodied Navigation for Portable Object Finding
Vishnu Sashank Dorbala, Bhrij Patel, Amrit Singh Bedi, Dinesh Manocha
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2403.09905).
[357] Fast and Robust Diffusion Posterior Sampling for MR Image Reconstruction Using the Preconditioned Unadjusted Langevin Algorithm
Moritz Blumenthal, Tina Holliber, Jonathan I. Tamir, Martin Uecker
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.05791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[358] Memory Over Maps: 3D Object Localization Without Reconstruction
Rui Zhou, Xander Yap, Jianwen Cao, Allison Lau, Boyang Sun, Marc Pollefeys
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.20530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[359] On Solving the Multiple Variable Gapped Longest Common Subsequence Problem
Marko Djukanović, Nikola Balaban, Christian Blum, Aleksandar Kartelj, Sašo Džeroski, Žiga Zebec
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper addresses the Variable Gapped Longest Common Subsequence (VGLCS) problem, a generalization of the classical LCS problem involving flexible gap constraints between consecutive characters of a solution. The problem arises in molecular sequence comparison, where structural distance constraints between residues must be respected, and in time-series analysis, where events are required to occur within specified temporal delays. We propose a search framework based on the root-based state graph representation, in which the state space comprises a generally large number of rooted state subgraphs. To cope with the resulting combinatorial explosion, an iterative beam search strategy is employed that dynamically maintains a global pool of promising candidate root nodes, enabling effective control of diversification across iterations. To steer the search toward high-quality solutions, several known heuristics from the LCS literature are incorporated into the standalone beam search procedure. To the best of our knowledge, this is the first comprehensive computational study of the VGLCS problem, covering 320 synthetic instances with up to 10 input sequences and up to 500 characters. Experimental results show the robustness of the designed approach over the baseline beam search within comparable runtimes.
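As a rough illustration of iterative beam search with a global candidate pool (a simplified sketch with hypothetical expand/score helpers, not the authors' algorithm):

```python
import heapq

def iterative_beam_search(roots, expand, score, beam_width=5, iterations=3):
    """Simplified beam search keeping a global pool of promising candidates.
    `expand(node)` yields successor nodes and `score(node)` is a heuristic
    value (higher is better); both are hypothetical helpers for illustration."""
    pool = list(roots)                      # global pool of promising nodes
    best = max(pool, key=score)
    for _ in range(iterations):
        beam = heapq.nlargest(beam_width, pool, key=score)
        frontier = [child for node in beam for child in expand(node)]
        if not frontier:
            break
        best = max([best] + frontier, key=score)
        pool = heapq.nlargest(beam_width * 2, pool + frontier, key=score)
    return best

# Toy usage: states are integers, expanding adds 1 or 2, score is the value.
print(iterative_beam_search([0], lambda n: [n + 1, n + 2], lambda n: n))
```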
[360] Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
Emily Reif, Claire Yang, Jared Hwang, Deniz Nazar, Noah Smith, Jeff Heer
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we introduce GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. We evaluate across three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.
[361] ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a “Safety Mentor” that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the discovered vulnerabilities, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.
[362] AI scientists produce results without reasoning scientifically
Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. They persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
[363] Quantum inspired qubit qutrit neural networks for real time financial forecasting
Kanishk Bakshi, Kathiravan Srinivasan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This research investigates the performance and efficacy of machine learning models in stock prediction, comparing Artificial Neural Networks (ANNs), Quantum Qubit-based Neural Networks (QQBNs), and Quantum Qutrit-based Neural Networks (QQTNs). By outlining methodologies, architectures, and training procedures, the study highlights significant differences in training times and performance metrics across models. While all models demonstrate robust accuracies above 70%, the Quantum Qutrit-based Neural Network consistently outperforms with advantages in risk-adjusted returns, measured by the Sharpe ratio, greater consistency in prediction quality through the Information Coefficient, and enhanced robustness under varying market conditions. The QQTN not only surpasses its classical and qubit-based counterparts in multiple quantitative and qualitative metrics but also achieves comparable performance with significantly reduced training times. These results showcase the promising prospects of Quantum Qutrit-based Neural Networks in practical financial applications, where real-time processing is critical. By achieving superior accuracy, efficiency, and adaptability, the proposed models underscore the transformative potential of quantum-inspired approaches, paving the way for their integration into computationally intensive fields.
[364] Human-Guided Harm Recovery for Computer Use Agents
Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent’s ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods – ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
[365] From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS
Mina Gabriel, Pei Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are highly capable at language generation, but they remain unreliable when reasoning requires explicit symbolic structure, multi-step inference, and interpretable uncertainty. This paper presents a neuro-symbolic framework for translating natural-language reasoning problems into executable formal representations using first-order logic (FOL) and Narsese, the language of the Non-Axiomatic Reasoning System (NARS). To support this direction, we introduce NARS-Reasoning-v0.1, a benchmark of natural-language reasoning problems paired with FOL forms, executable Narsese programs, and three gold labels: True, False, and Uncertain. We develop a deterministic compilation pipeline from FOL to executable Narsese and validate retained examples through runtime execution in OpenNARS for Applications (ONA), ensuring that the symbolic targets are not only syntactically well formed but also behaviorally aligned with the intended answer. We further present Language-Structured Perception (LSP), a formulation in which an LLM is trained to produce reasoning-relevant symbolic structure rather than only a final verbal response. As an initial proof of concept, we also train and release a Phi-2 LoRA adapter on NARS-Reasoning-v0.1 for three-label reasoning classification, showing that the benchmark can support supervised adaptation in addition to executable evaluation. Overall, the paper positions executable symbolic generation and execution-based validation as a practical path toward more reliable neuro-symbolic reasoning systems.
[366] How Adversarial Environments Mislead Agentic AI?
Zhonghao Zhan, Huichi Zhou, Zhenhao Li, Peiyuan Jing, Krinos Li, Hamed Haddadi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking “can the agent use tools correctly” but never “what if the tools lie”. We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a “fake world” of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.
[367] Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI + Lean 4 Pipeline
George Koomullil
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a formally verified framework for patent analysis as a hybrid AI + Lean 4 pipeline. The DAG-coverage core (Algorithm 1b) is fully machine-verified once bounded match scores are fixed. Freedom-to-operate, claim-construction sensitivity, cross-claim consistency, and doctrine-of-equivalents analyses are formalized at the specification level with kernel-checked candidate certificates. Existing patent-analysis approaches rely on manual expert analysis (slow, non-scalable) or ML/NLP methods (probabilistic, opaque, non-compositional). To our knowledge, this is the first framework that applies interactive theorem proving based on dependent type theory to intellectual property analysis. Claims are encoded as DAGs in Lean 4, match strengths as elements of a verified complete lattice, and confidence scores propagate through dependencies via proven-correct monotone functions. We formalize five IP use cases (patent-to-product mapping, freedom-to-operate, claim construction sensitivity, cross-claim consistency, doctrine of equivalents) via six algorithms. Structural lemmas, the coverage-core generator, and the closed-path identity coverage = W_cov are machine-verified in Lean 4. Higher-level theorems for the other use cases remain informal proof sketches, and their proof-generation functions are architecturally mitigated (untrusted generators whose outputs are kernel-checked and sorry-free axiom-audited). Guarantees are conditional on the ML layer: they certify mathematical correctness of computations downstream of ML scores, not the accuracy of the scores themselves. A case study on a synthetic memory-module claim demonstrates weighted coverage and construction-sensitivity analysis. Validation against adjudicated cases is future work.
[368] Explicit Trait Inference for Multi-Agent Coordination
Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina, Divya Bhargavi, Monica Sunkara, Yi Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: LLM-based multi-agent systems (MAS) show promise on complex tasks but remain prone to coordination failures such as goal drift, error cascades, and misaligned behaviors. We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. ETI enables agents to infer and track partner characteristics along two established psychological dimensions–warmth (e.g., trust) and competence (e.g., skill)–from interaction histories to guide decisions. We evaluate ETI in controlled settings (economic games), where it reduces payoff loss by 45-77%, and in more realistic, complex multi-agent settings (MultiAgentBench), where it improves performance by 3-29% depending on the scenario and model, relative to a CoT baseline. Additional analysis shows that gains are closely linked to trait inference: ETI profiles predict agents’ actions, and informative profiles drive improvements. These results highlight ETI as a lightweight and robust mechanism for improving coordination in diverse multi-agent settings, and provide the first systematic evidence that LLM agents can (i) reliably infer others’ traits from interaction histories and (ii) leverage structured awareness of others’ traits for coordination.
[369] Error-free Training for MedMNIST Datasets
Bo Deng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In this paper, we introduce a new concept called Artificial Special Intelligence by which Machine Learning models for the classification problem can be trained error-free, thus acquiring the capability of not making repeated mistakes. The method is applied to 18 MedMNIST biomedical datasets. Except for three datasets, which suffer from the double-labeling problem, all are trained to perfection.
[370] Large Language Models Exhibit Normative Conformity
Mikako Bito, Keita Nishimoto, Kimitaka Asatani, Ichiro Sakata
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated “conformity” simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of “conformity,” they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how “norms” are implemented in LLMs and how they influence group dynamics.
[371] AutomationBench
Daniel Shepard, Robin Salimans
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier’s platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.
[372] Personalized Benchmarking: Evaluating LLMs by Individual Preferences
Cristina Garbacea, Heran Wang, Chenhao Tan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only $\rho = 0.04$ (57% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($\rho = 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.
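To make the per-user ranking computation concrete, here is a minimal sketch of fitting Bradley-Terry strengths from one user's pairwise preferences with the classic minorization-maximization updates; the model names and preference data are hypothetical, and the paper's exact estimation procedure may differ.

```python
from collections import defaultdict

def bradley_terry(pairwise_wins, iters=100):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs
    using the standard MM updates; illustrative sketch only."""
    models = {m for pair in pairwise_wins for m in pair}
    wins, games = defaultdict(int), defaultdict(int)
    for w, l in pairwise_wins:
        wins[w] += 1
        games[frozenset((w, l))] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (p[i] + p[j])
                        for j in models if j != i)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

# Toy preferences for a single user (hypothetical data).
prefs = [("gpt-4o", "llama-3"), ("llama-3", "claude-3.5"),
         ("claude-3.5", "gpt-4o"), ("gpt-4o", "claude-3.5")]
print(bradley_terry(prefs))
```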
[373] Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity
Farbod Zorriassatine, Ahmad Lotfi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.
[374] Reasoning Structure Matters for Safety Alignment of Reasoning Models
Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable, requiring no complex reinforcement learning (RL) training or reward design, only supervised finetuning (SFT) with a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.
[375] AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories
Xue Xia, Chengkai Yao, Mingyu Tsoi, Xinjie Mao, Wenxuan Huang, Jiaqi Wei, Hao Wu, Cheng Tan, Lang Yu, Yuejin Yang, Siqi Sun, Zhangyang Gao
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Systematic ablations are essential to attribute performance gains in AI Virtual Cells, yet they are rarely performed because biological repositories are under-standardized and tightly coupled to domain-specific data and formats. While recent coding agents can translate ideas into implementations, they typically stop at producing code and lack a verifier that can reproduce strong baselines and rigorously test which components truly matter. We introduce AblateCell, a reproduce-then-ablate agent for virtual cell repositories that closes this verification gap. AblateCell first reproduces reported baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts. It then conducts closed-loop ablation by generating a graph of isolated repository mutations and adaptively selecting experiments under a reward that trades off performance impact and execution cost. Evaluated on three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves 88.9% end-to-end workflow success (+29.9% over the human expert) and 93.3% accuracy (+53.3% over the heuristic baseline) in recovering ground-truth critical components. These results enable scalable, repository-grounded verification and attribution directly on biological codebases.
[376] DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
Ahmed G. A. H Ahmed, C. Okan Sakar
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on the hard compositional subtypes.
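As an illustration of the kind of graph-topology question such a benchmark targets, the sketch below answers a simple downstream-reachability query over a hypothetical schema whose edges mix foreign-key and lineage relations; it is not taken from DW-Bench itself.

```python
from collections import deque

def reachable_tables(edges, start):
    """Which tables are reachable downstream of `start`, following directed
    foreign-key (FK) and lineage edges alike? Hypothetical schema, for
    illustration only."""
    graph = {}
    for src, dst, _kind in edges:
        graph.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

edges = [("orders", "customers", "fk"),
         ("fact_sales", "orders", "lineage"),
         ("dashboard_revenue", "fact_sales", "lineage")]
print(reachable_tables(edges, "dashboard_revenue"))
```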
[377] UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
Yadong Li, Guoxin Wu, Haiping Hou, Biye Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is equally crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks into a single auto-regressive sequence prediction problem, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA). It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly reduces response latency and improves interruption accuracy in real-world interaction scenarios.
[378] SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution
Xiachong Feng, Yi Jiang, Xiaocheng Feng, Deyi Yin, Libo Qin, Yangfan Ye, Lei Huang, Weitao Ma, Yuxuan Gu, Chonghan Qin, Bing Qin, Lingpeng Kong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance’s strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.
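For readers unfamiliar with Shapley-based credit assignment, the following is a textbook exact computation over a toy three-utterance dialogue; the value function is hypothetical, and the paper's actual estimator (which combines Shapley values with expected utility over future trajectories) is more elaborate.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a small set of utterances. `value(coalition)`
    is a hypothetical callable giving the dialogue outcome when only that
    subset of utterances is present; illustration of the credit-assignment
    principle, not the paper's estimator."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(set(coalition) | {p}) - value(set(coalition)))
    return phi

# Toy game: the outcome is good only if both the 'offer' and 'concession'
# utterances are present; 'small talk' contributes nothing.
outcome = lambda s: 1.0 if {"offer", "concession"} <= s else 0.0
print(shapley_values(["offer", "concession", "small talk"], outcome))
```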
[379] On Accelerating Grounded Code Development for Research
Santosh Ganji
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via doc-search.dev and includes zed-fork, which enforces domain-specific rules and workflows. Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows.
[380] Plausible Reasoning and First-Order Plausible Logic
David Billington
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Defeasible statements are statements that are likely, or probable, or usually true, but may occasionally be false. Plausible reasoning makes conclusions from statements that are either facts or defeasible statements without using numbers. So there are no probabilities or suchlike involved. Seventeen principles of logics that do plausible reasoning are suggested and several important plausible reasoning examples are considered. There are 14 necessary principles and 3 desirable principles, one of which is not formally stated. A first-order logic, called Plausible Logic (PL), is defined that satisfies all but two of the desirable principles and reasons correctly with all the examples. As far as we are aware, this is the only such logic. PL has 8 reasoning algorithms because, from a given plausible reasoning situation, there are different sensible conclusions. This article is a condensation of my book “Plausible Reasoning and Plausible Logic” (PRPL), which is to be submitted. Each section of this article corresponds to a chapter in PRPL, and vice versa. The proofs of all the results are in PRPL, so they are omitted in this article.
[381] Learning Lifted Action Models from Unsupervised Visual Traces
Kai Xi, Stephen Gould, Sylvie Thiébaux
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Efficient construction of models capturing the preconditions and effects of actions is essential for applying AI planning in real-world domains. Extensive prior work has explored learning such models from high-level descriptions of state and/or action sequences. In this paper, we tackle a more challenging setting: learning lifted action models from sequences of state images, without action observation. We propose a deep learning framework that jointly learns state prediction, action prediction, and a lifted action model. We also introduce a mixed-integer linear program (MILP) to prevent prediction collapse and self-reinforcing errors among predictions. The MILP takes the predicted states, actions, and action model over a subset of traces and solves for logically consistent states, actions, and action model that are as close as possible to the original predictions. Pseudo-labels extracted from the MILP solution are then used to guide further training. Experiments across multiple domains show that integrating MILP-based correction helps the model escape local optima and converge toward globally consistent solutions.
[382] Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports
Yishu Wei, Yi Lin, Adam Flanders, George Shih, Yifan Peng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.
[383] OLLM: Options-based Large Language Models
Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a set of learned options for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight “plug-in” that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM’s option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
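The "options" plug-in can be pictured with a small PyTorch sketch: an encoder/decoder pair inserted before a frozen stand-in output head, producing one next-token distribution per discrete latent option. This is only an illustrative guess at the structure described above, not the authors' code, and all layer shapes are made up.

```python
import torch
import torch.nn as nn

class OptionHead(nn.Module):
    """Illustrative sketch (not the authors' code): a lightweight
    encoder/decoder pair maps the final hidden state plus a discrete latent
    index to per-option hidden states; a frozen stand-in output head then
    yields one next-token distribution per latent option."""
    def __init__(self, hidden_dim, num_options, vocab_size):
        super().__init__()
        self.option_emb = nn.Embedding(num_options, hidden_dim)
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.lm_head.weight.requires_grad_(False)   # frozen stand-in head

    def forward(self, h):                            # h: (batch, hidden_dim)
        opts = self.option_emb.weight                # (num_options, hidden_dim)
        z = self.encoder(h).unsqueeze(1) + opts      # (batch, num_options, hidden_dim)
        return self.lm_head(self.decoder(z))         # logits per option

head = OptionHead(hidden_dim=16, num_options=4, vocab_size=100)
print(head(torch.randn(2, 16)).shape)  # torch.Size([2, 4, 100])
```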
[384] Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
Dahyun Jung, Jaewook Lee, Heuiseok Lim
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model’s original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.
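The decode-time suppression of the model's original knowledge probabilities can be illustrated with a contrastive-style logit adjustment; this is one common way such suppression is implemented and is only an assumption about LightEdit's exact rule, not the paper's formulation.

```python
import numpy as np

def suppressed_decode(logits_with_edit, logits_original, alpha=1.0):
    """Hypothetical sketch of decode-time suppression: down-weight tokens the
    unedited model already favors, so edited knowledge supplied in context
    dominates. Not necessarily the paper's exact rule."""
    adjusted = logits_with_edit - alpha * logits_original
    probs = np.exp(adjusted - adjusted.max())
    return probs / probs.sum()

# Toy vocabulary of three tokens: the original model prefers token 0,
# the edit-conditioned pass prefers token 2.
print(suppressed_decode(np.array([1.0, 0.5, 2.0]), np.array([3.0, 0.5, 0.5])))
```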
[385] Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory
Masaki Uto
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human–human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.
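Since the paper's ceilings are stated in terms of QWK, a plain reference implementation of quadratic weighted kappa may help ground the discussion; this computes the metric itself on toy scores and does not reproduce the paper's reliability-based ceiling derivations.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer score vectors: 1 minus the
    ratio of observed to chance-expected quadratic disagreement. Illustrative
    reference implementation, independent of the paper's formulas."""
    a, b = np.asarray(a), np.asarray(b)
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(a)   # chance agreement
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1 - (W * O).sum() / (W * E).sum()

# Toy scores from a human rater and an AES model on a 0-3 scale.
human = [0, 1, 2, 3, 3, 2]
model = [0, 1, 2, 2, 3, 2]
print(round(quadratic_weighted_kappa(human, model, n_classes=4), 3))
```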
[386] Reasoning-Aware AIGC Detection via Alignment and Reinforcement
Zhao Wang, Max Xiong, Jianxun Lian, Zhicheng Dou
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy, improve logical consistency, and reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at https://aka.ms/reveal
[387] ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation
Zhiqin Yang, Zhenyuan Zhang, Xianzhang Jia, Jun Song, Wei Xue, Yonggang Zhang, Yike Guo
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it. We argue that the next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships. To this end, we propose a human-symbiotic agent paradigm. Each user owns a permanently bound agent system that collaborates on the owner’s behalf, forming a network whose nodes are humans rather than agents. This paradigm rests on three governance primitives. A layered identity architecture separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication. Scoped authorization enforces per-identity access control and escalates boundary violations to the owner. Action-level accountability logs every operation against its owner’s identity and authorization, ensuring full auditability. We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator, enabling multiple users to collaborate securely through their respective agents.
[388] Industrial Surface Defect Detection via Diffusion Generation and Asymmetric Student-Teacher Network
Shuo Feng, Runlin Zhou, Yuyang Li, Guangcan Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Industrial surface defect detection often suffers from limited defect samples, severe long-tailed distributions, and difficulties in accurately localizing subtle defects under complex backgrounds. To address these challenges, this paper proposes an unsupervised defect detection method that integrates a Denoising Diffusion Probabilistic Model (DDPM) with an asymmetric teacher-student architecture. First, at the data level, the DDPM is trained solely on normal samples. By introducing constant-variance Gaussian perturbations and Perlin noise-based masks, high-fidelity and physically consistent defect samples along with pixel-level annotations are generated, effectively alleviating the data scarcity problem. Second, at the model level, an asymmetric dual-stream network is constructed. The teacher network provides stable representations of normal features, while the student network reconstructs normal patterns and amplifies discrepancies between normal and anomalous regions. Finally, a joint optimization strategy combining cosine similarity loss and pixel-wise segmentation supervision is adopted to achieve precise localization of subtle defects. Experimental results on the MVTecAD dataset show that the proposed method achieves 98.4% image-level AUROC and 98.3% pixel-level AUROC, significantly outperforming existing unsupervised and mainstream deep learning methods. The proposed approach does not require large amounts of real defect samples and enables accurate and robust industrial defect detection and localization.
Keywords: industrial defect detection, diffusion models, data generation, teacher-student architecture, pixel-level localization
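The teacher-student discrepancy idea can be sketched in a few lines: regions where the student's features diverge from the teacher's (low cosine similarity) receive high anomaly scores. Toy tensors only; the paper's networks, losses, and Perlin-mask generation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feats, student_feats):
    """Sketch of pixel-level scoring: 1 minus the channel-wise cosine
    similarity between teacher and student feature maps, so larger values
    mean more anomalous regions. Illustrative only."""
    # feats: (batch, channels, height, width)
    sim = F.cosine_similarity(teacher_feats, student_feats, dim=1)  # (B, H, W)
    return 1.0 - sim

t = torch.randn(1, 8, 4, 4)
s = t.clone()
s[..., 2:, 2:] = torch.randn(1, 8, 2, 2)   # corrupt a corner to mimic a defect
print(anomaly_map(t, s))
```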
[389] Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, Maliheh Izadi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
[390] Towards Energy Impact on AI-Powered 6G IoT Networks: Centralized vs. Decentralized
Anjie Qiu, Donglin Wang, Sanket Partani, Andreas Weinand, Hans D. Schotten
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The emergence of sixth-generation (6G) technologies has introduced new challenges and opportunities for machine learning (ML) applications in Internet of Things (IoT) networks, particularly concerning energy efficiency. As model training and data transmission contribute significantly to energy consumption, optimizing these processes has become critical for sustainable system design. This study first analyzes the energy consumption models of both centralized and decentralized architectures and then presents a testbed deployed within the German railway infrastructure, leveraging sensor data for ML-based predictive maintenance. A comparative analysis of distributed versus Centralized Learning (CL) architectures reveals that distributed models maintain competitive predictive accuracy (~90%) while reducing overall electricity consumption by up to 70%. These findings underscore the potential of distributed ML to improve energy efficiency in real-world IoT deployments, particularly by mitigating transmission-related energy costs.
[391] GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
Ziyang Wang, Jiangfeng Xiao, Chuan Xiao, Ruoxiang Li, Rui Mao, Jianbin Qin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.
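A minimal sketch of a projected straight-through estimator with a hard global budget, assuming a 1-D vector of gate scores over prunable units (illustrative, not the authors' implementation):

```python
import torch

def budgeted_hard_mask(scores, budget):
    """Projected straight-through estimator sketch: the forward pass uses a
    hard 0/1 mask keeping exactly `budget` units (the global constraint),
    while gradients flow to the underlying gate scores as if the mask were
    the scores themselves. Illustrative only."""
    topk = torch.topk(scores, budget).indices
    hard = torch.zeros_like(scores)
    hard[topk] = 1.0
    # straight-through: hard values forward, gradients w.r.t. `scores` backward
    return hard + scores - scores.detach()

scores = torch.tensor([0.3, 1.2, -0.5, 0.8], requires_grad=True)
mask = budgeted_hard_mask(scores, budget=2)
mask.sum().backward()
print(mask.detach(), scores.grad)
```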
[392] Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Vasundra Srininvasan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure that aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis that the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal that aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
[393] Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization-gaming.
[394] CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
Jianzhi Yan, Le Liu, Buzhou Tang, Yang Xiang, Dongning Sun, Zhiming Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model’s performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model’s ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.
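A sketch of the two ingredients named in the abstract, feature-based distillation toward CoT-enriched reference states and an MMD term for distribution matching, written as a PyTorch-style loss; the Gaussian kernel, bandwidth, and weighting are illustrative assumptions rather than CoDA's actual configuration:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Pairwise RBF kernel between two batches of hidden states: (n, d) x (m, d) -> (n, m).
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd2(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Biased estimate of squared MMD between source- and target-domain latents.
    return (gaussian_kernel(source, source, sigma).mean()
            + gaussian_kernel(target, target, sigma).mean()
            - 2 * gaussian_kernel(source, target, sigma).mean())

def coda_style_loss(adapted_hidden: torch.Tensor,
                    cot_reference: torch.Tensor,
                    target_hidden: torch.Tensor,
                    alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    # Feature distillation toward CoT-enriched reference states plus MMD alignment
    # of source- and target-domain representations (illustrative weighting).
    distill = F.mse_loss(adapted_hidden, cot_reference)
    align = mmd2(adapted_hidden, target_hidden)
    return alpha * distill + beta * align
```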
[395] From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning
Beining Wu, Fuyou Mao, Jiong Lin, Cheng Yang, Jiaxuan Lu, Yifu Guo, Siyu Zhang, Yifan Wu, Ying Huang, Fu Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu-beining/MAGEO
[396] SimDiff: Depth Pruning via Similarity and Difference
Yuli Chen, Shuhao Zhang, Fanshen Meng, Bo Cheng, Jiale Han, Qiang Tong, Xiulei Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer’s average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B’s performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
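The abstract does not expand MSSD or MASD, so the statistics below are only guesses at an outlier-sensitive versus a robust measure of a layer's transformation; the sketch shows the general shape of scoring layers by both similarity and difference before choosing pruning candidates:

```python
import torch
import torch.nn.functional as F

def layer_stats(h_in: torch.Tensor, h_out: torch.Tensor) -> dict:
    """Toy per-layer statistics from hidden states of shape (tokens, d)."""
    similarity = F.cosine_similarity(h_in, h_out, dim=-1).mean().item()
    sq_diff = ((h_out - h_in) ** 2).sum(dim=-1)   # per-token transformation magnitude
    return {
        "similarity": similarity,               # high -> layer changes representations little
        "max_sq_diff": sq_diff.max().item(),    # outlier-sensitive: decisive corrections
        "mean_sq_diff": sq_diff.mean().item(),  # robust: average contribution
    }

def pruning_order(stats: list) -> list:
    # Rank layers as pruning candidates: most similar and least transformative first.
    # The real SimDiff combination rule is not given in the abstract; this is a placeholder.
    def importance(s):
        return (1.0 - s["similarity"]) + s["mean_sq_diff"] + s["max_sq_diff"]
    return sorted(range(len(stats)), key=lambda i: importance(stats[i]))
```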
[397] Revac: A Social Deduction Reasoning Agent
Mihir Shriniwas Arya, Avinash Anish, Aditya Ranjan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two-stage reasoning system into a multi-module architecture that integrates memory-based player profiling, social-graph analysis of accusations and defenses, and dynamic tone selection for communication. These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high-stakes social environments.
[398] DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
[399] Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics
Syed Sajid Ullah, Amir Khan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.
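A minimal PyTorch sketch of an attention-based LSTM classifier over wearable time series of the kind described (heart rate, HRV, oxygen saturation); layer sizes, window length, and the additive attention form are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """Sketch: LSTM with attention over time steps for binary heat-stress
    classification from wearable signals (e.g. heart rate, HRV, SpO2)."""
    def __init__(self, n_features: int = 3, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # per-time-step attention score
        self.head = nn.Linear(hidden, 1)   # heat-stress logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                            # (batch, time, hidden)
        weights = torch.softmax(self.attn(out), dim=1)   # attention over time
        context = (weights * out).sum(dim=1)             # weighted window summary
        return self.head(context).squeeze(-1)            # one logit per window

# Usage: logits = AttentionLSTM()(torch.randn(8, 120, 3))  # 8 windows of 120 steps
```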
[400] Detecting Data Contamination in Large Language Models
Juliusz Janicki, Savvas Chamezopoulos, Evangelos Kanoulas, Georgios Tsatsaronis
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIA) aim to detect those documents and whether they have been included in the training corpora of the LLMs. The black-box MIAs require a significant amount of data manipulation; therefore, their comparison is often challenging. We study state-of-the-art (SOTA) MIAs under the black-box assumptions and compare them to each other using a unified set of datasets to determine if any of them can reliably detect membership under SOTA LLMs. In addition, a new method, called the Familiarity Ranking, was developed to showcase a possible approach to black-box MIAs, thereby giving LLMs more freedom in their expression to understand their reasoning better. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate higher reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.
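The headline result, AUC-ROC near 0.5, can be reproduced conceptually in a few lines: given per-document membership scores from any black-box MIA, the evaluation reduces to a standard ROC computation (the Familiarity Ranking method itself is not sketched here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mia_auc(scores: np.ndarray, is_member: np.ndarray) -> float:
    """AUC-ROC of a membership score: ~0.5 means the attack cannot separate
    training members from non-members, which is what the paper reports."""
    return roc_auc_score(is_member, scores)

# Illustration: uninformative scores against ground-truth membership labels.
rng = np.random.default_rng(0)
print(mia_auc(rng.random(1000), rng.integers(0, 2, size=1000)))  # close to 0.5
```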
[401] Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
Chuou Xu, Liya Ji, Qifeng Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy “king”-“man”+“woman” = “queen” illustrates relational reasoning, yet replacing text with images of “king” and “man” significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that “powder” and “cake” are related by “is made of” grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots’ ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
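A sketch of the underlying arithmetic-in-embedding-space formulation, assuming precomputed image or text embeddings; the paper's contribution is getting such embeddings to carry the right concise concepts, which this snippet takes as given:

```python
import numpy as np

def _unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def semantic_arithmetic(a: np.ndarray, b: np.ndarray, c: np.ndarray,
                        candidates: dict) -> str:
    """Two-term subtraction plus addition: argmax over candidates of cos(a - b + c, x).
    With image embeddings for 'king' and 'man', the hard part the paper targets is
    making a and b encode the relevant concept rather than visual nuisance detail."""
    query = _unit(_unit(a) - _unit(b) + _unit(c))
    scores = {name: float(query @ _unit(v)) for name, v in candidates.items()}
    return max(scores, key=scores.get)
```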
[402] Time Series Augmented Generation for Financial Applications
Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent’s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent’s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.
[403] SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
[404] A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities
Aya Cherigui, Florent Guépin, Arnaud Legendre, Jean-François Couchot
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify such data using techniques such as aggregation, obfuscation, or noise addition in order to adequately protect privacy. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a major challenge and should be tackled through adversarial evaluation in accordance with current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance to the trajectory user-linking problem.
[405] A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
[406] Conjuring Semantic Similarity
Tian Yu Liu, Stefano Soatto
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The semantic similarity between sample expressions measures the distance between their latent ‘meaning’. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or ‘conjure.’ We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
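In symbols, the proposed similarity is a divergence between the image distributions the two prompts induce; a hedged sketch of the quantity and a generic Monte Carlo form follows (the exact score-based estimator derived from the reverse-time SDEs may differ from this version):

```latex
% Semantic similarity between prompts s and t via the image distributions they induce:
\[
  J(p_s, p_t) \;=\; \mathrm{KL}(p_s \,\|\, p_t) \;+\; \mathrm{KL}(p_t \,\|\, p_s),
\]
% with each KL term estimated by Monte Carlo over samples from the corresponding
% (reverse-time diffusion) generative process:
\[
  \mathrm{KL}(p_s \,\|\, p_t) \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
  \bigl[\, \log p_s(x_i) - \log p_t(x_i) \,\bigr], \qquad x_i \sim p_s .
\]
```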
[407] User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation
Krisztian Balog, ChengXiang Zhai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enabling researchers to model and analyze user behaviour, generate synthetic data for training, and evaluate interactive AI systems in a controlled and reproducible manner. Because of its broad scope, research on this topic currently remains scattered across artificial intelligence, human-computer interaction, information science, computational social science, and psychology. To address this fragmented landscape of current research, this article presents a foundational synthesis. We highlight the paradigm shift from traditional predictive models to modern generative approaches, and explicitly frame critical ethical considerations – demonstrating how controlled simulation serves not merely as a risk vector for bias, but as a powerful, proactive tool to ensure fair representation and system safety. Furthermore, we establish the theoretical connection between user simulation and the pursuit of Artificial General Intelligence, arguing that realistic simulators are indispensable catalysts for overcoming critical data and evaluation bottlenecks and optimizing personalization. Ultimately, we propose a practical, self-sustaining innovation ecosystem bridging academia and industry to advance this increasingly important technology.
[408] Epistemic Skills: Reasoning about Knowledge and Oblivion
Xiaolong Liang, Yì N. Wáng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an "epistemic skills" metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of "knowability" and "forgettability," defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.
[409] Memory Assignment for Finite-Memory Strategies in Adversarial Patrolling Games
Vojtěch Kůr, Vít Musil, Vojtěch Řehák
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Adversarial Patrolling games form a subclass of Security games where a Defender moves between locations, guarding vulnerable targets. The main algorithmic problem is constructing a strategy for the Defender that minimizes the worst damage an Attacker can cause. We focus on the class of finite-memory (also known as regular) Defender's strategies that experimentally outperformed other competing classes. A finite-memory strategy can be seen as a positional strategy on a finite set of states. Each state consists of a pair of a location and an integer value, called memory. Existing algorithms improve the transitional probabilities between the states but require that the available memory size itself be assigned at each location manually. Choosing the right memory assignment is a well-known open and hard problem that hinders the usability of finite-memory strategies. We solve this issue by developing a general method that iteratively changes the memory assignment. Our algorithm can be used in connection with any black-box strategy optimization tool. We evaluate our method on various experiments and show its robustness by solving instances of various patrolling models.
[410] MRS: Multi-Resolution Skills for HRL Agents
Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hierarchical reinforcement learning (HRL) decomposes the policy into a manager and a worker, enabling long-horizon planning but introducing a performance gap on tasks requiring agility. We identify a root cause: in subgoal-based HRL, the manager’s goal representation is typically learned without constraints on reachability or temporal distance from the current state, preventing precise local subgoal selection. We further show that the optimal subgoal distance is both task- and state-dependent: nearby subgoals enable precise control but amplify prediction noise, while distant subgoals produce smoother motion at the cost of geometric precision. We propose Multi-Resolution Skills (MRS), which learns multiple goal-prediction modules each specialized to a fixed temporal horizon, with a jointly trained meta-controller that selects among them based on the current state. MRS consistently outperforms fixed-resolution baselines and significantly reduces the performance gap between HRL and non-HRL state-of-the-art on DeepMind Control Suite, Gym-Robotics, and long-horizon AntMaze tasks. [Project page: https://sites.google.com/view/multi-res-skills/home]
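A toy sketch of the MRS idea, several goal-prediction heads each tied to a fixed temporal horizon plus a meta-controller that picks among them from the current state; dimensions, horizons, and the argmax selection rule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiResolutionManager(nn.Module):
    """Toy manager: K goal-prediction heads, each tied to a fixed temporal horizon,
    and a meta-controller that chooses which head to use from the current state."""
    def __init__(self, state_dim: int, goal_dim: int, horizons=(5, 15, 45)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(state_dim, goal_dim) for _ in horizons])
        self.meta = nn.Linear(state_dim, len(horizons))   # one score per resolution
        self.horizons = horizons

    def forward(self, state: torch.Tensor):
        idx = self.meta(state).argmax(dim=-1)                              # chosen resolution
        goals = torch.stack([head(state) for head in self.heads], dim=1)  # (B, K, goal_dim)
        subgoal = goals[torch.arange(state.size(0)), idx]                  # (B, goal_dim)
        return subgoal, idx   # subgoal for the worker plus which horizon was selected
```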
[411] SEAT: Sparse Entity-Aware Tuning for Knowledge Adaptation while Preserving Epistemic Abstention
William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Adapting LLMs with new knowledge is increasingly important, but standard fine-tuning often erodes aligned epistemic abstention: the ability to acknowledge when the model does not know. This failure mode is especially concerning in high-stakes settings, where abstention is a critical safeguard against hallucination. We present SEAT, a preventive fine-tuning method that preserves epistemic abstention while maintaining strong knowledge acquisition. SEAT combines sparse tuning, which constrains global activation drift, with entity-perturbed KL regularization, which sharpens local epistemic boundaries and prevents spillover to neighboring knowledge. Crucially, SEAT requires no alignment data, explicit boundary probing, or post-hoc re-alignment, making it attractive for lightweight and privacy-sensitive adaptation. Across models and datasets, SEAT improves human-evaluated abstention on unknown queries by 18%-101% over the strongest baseline while retaining near-perfect target knowledge acquisition, and produces coherent, context-aware abstentions after tuning. Further analyses show that both components are essential, that SEAT more cleanly separates known from unknown queries in representation space, and that it preserves downstream utility. These results identify preservation of epistemic abstention as a core objective for safe knowledge adaptation.
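A sketch of the entity-perturbed KL regularizer combined with a task loss, in PyTorch; the KL direction, the weighting, and the assumption that sparse tuning is realized by freezing most parameters are guesses for illustration, not SEAT's exact formulation:

```python
import torch
import torch.nn.functional as F

def entity_perturbed_kl(student_logits: torch.Tensor,
                        frozen_logits: torch.Tensor) -> torch.Tensor:
    """KL between the tuned model and the frozen base model on an entity-perturbed
    copy of the training query, so behavior on neighboring entities stays anchored.
    Direction KL(student || frozen) is an assumption; the paper may use the reverse."""
    log_student = F.log_softmax(student_logits, dim=-1)
    log_frozen = F.log_softmax(frozen_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(log_frozen, log_student, log_target=True, reduction="batchmean")

def seat_style_loss(task_loss: torch.Tensor,
                    student_pert_logits: torch.Tensor,
                    frozen_pert_logits: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    # New-knowledge acquisition loss plus the epistemic-boundary regularizer;
    # sparse tuning itself would be realized by leaving most parameters frozen.
    return task_loss + lam * entity_perturbed_kl(student_pert_logits, frozen_pert_logits)
```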
[412] GRAIL: Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning
Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets are available at https://github.com/Changgeww/GRAIL.
[413] GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Jun Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models’ understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs’ geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux
[414] VideoAgent: Personalized Synthesis of Scientific Videos
Xiao Liang, Bangxin Li, Zixuan Chen, Hanyue Zheng, Zhi Ma, Di Wang, Cong Tian, Quan Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2509.11253 failed with HTTP 429).
[415] Plug-and-Play Dramaturge: A Divide-and-Conquer Approach for Iterative Narrative Script Refinement via Collaborative LLM Agents
Wenda Xie, Chao Guo, Yanqing Jing, Junle Wang, Yisheng Lv, Fei-Yue Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.05188 failed with HTTP 429).
[416] StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.10074 failed with HTTP 429).
[417] Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models
Boxuan Wang, Zhuoyun Li, Xinmiao Huang, Xiaowei Huang, Yi Dong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2511.06168 failed with HTTP 429).
[418] TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards
Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fengbin Zhu, Qifan Wang, Fuli Feng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2512.07761 failed with HTTP 429).
[419] Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Lide Tan, Zheng Pan, Xin Li, Yong Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2512.22673 failed with HTTP 429).
[420] Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation
Dongyi Lv, Qiuyu Ding, Heng-Da Xu, Zhaoxu Sun, Zhi Wang, Feng Xiong, Mu Xu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.04562 failed with HTTP 429).
[421] Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Bo Shu, Yiting Zhang, Saisai Hu, Dong Shu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2403.10559 failed with HTTP 429).
[422] BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
Shiyu Liu, Yongjing Yin, Jianhao Yan, Yunbo Tang, Qinggang Zhang, Bei Li, Xin Chen, Jingang Wang, Xunliang Cai, Jinsong Su
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.11037 failed with HTTP 429).
[423] Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck
Meiru Zhang, Zaiqiao Meng, Nigel Collier
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.12499 failed with HTTP 429).
[424] Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients
Jinsheng Yuan, Yuhang Hao, Weisi Guo, Yun Wu, Chongyan Gu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2505.06335 failed with HTTP 429).
[425] Right for the Wrong Reasons: Epistemic Regret Minimization for LLM Causal Reasoning
Edward Y. Chang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2602.11675 failed with HTTP 429).
[426] Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs
Luise Ge, Yongyan Zhang, Yevgeniy Vorobeychik
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2602.15173 failed with HTTP 429).
[427] Best Agent Identification for General Game Playing
Matthew Stephenson, Alex Newcombe, Eric Piette, Dennis Soemers
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2507.00451 failed with HTTP 429).
[428] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
Zhengyang Qi, Charles Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Frederic Sala, Paroma Varma
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.01375 failed with HTTP 429).
[429] DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.12812 failed with HTTP 429).
[430] Autogenesis: A Self-Evolving Agent Protocol
Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Ming Yin, Bo An, Mengdi Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.15034 failed with HTTP 429).
[431] Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
Valentin Kriegmair, Dirk U. Wulff
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.16755 failed with HTTP 429).
[432] EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Cheng Wang, Fengyang Li, Wenkai Jin, Wanxu Liu, Zehao Bing, Bingyang Zheng, Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xianghe Pang, Yaxin Du, Tingjia Miao, Yuzhi Zhang, Ruoxue Liao, Zhaohan Ding, Linfeng Zhang, Yanfeng Wang, Weinan E, Siheng Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.17406 failed with HTTP 429).
[433] EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
Yifan Song, Xingjian Tao, Zhicheng Yang, Yihong Luo, Jing Tang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.17458 failed with HTTP 429).
[434] How does the optimizer implicitly bias the model merging loss landscape?
Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.04686 failed with HTTP 429).
[435] WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
Lingfeng Zhang, Yongan Sun, Jinpeng Hu, Hui Ma, Yang Ying, Kuien Liu, Zenglin Shi, Meng Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.17821 failed with HTTP 429).
[436] Towards Generalization of Graph Neural Networks for AC Optimal Power Flow
Olayiwola Arowolo, Jochen L. Cremer
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.06860 failed with HTTP 429).
[437] Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
Terry Leitch
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.18566 failed with HTTP 429).
[438] Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
Kevin Murphy
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.18576 failed with HTTP 429).
[439] Unifying Controller Design for Stabilizing Nonlinear Systems with Norm-Bounded Control Inputs
Ming Li, Zhiyong Sun, Siep Weiland
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2403.03030 failed with HTTP 429).
[440] Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks
Koki Shibata, Tianheng Ling, Chao Qian, Tomokazu Matsui, Hirohiko Suwa, Keiichi Yasumoto, Gregor Schiele
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.23156 failed with HTTP 429).
[441] Towards Auto-Building of Embedded FPGA-based Soft Sensors for Wastewater Flow Estimation
Tianheng Ling, Chao Qian, Gregor Schiele
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2407.05102 failed with HTTP 429).
[442] Multiclass Local Calibration with the Jensen-Shannon Distance
Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.26566 failed with HTTP 429).
[443] Idle is the New Sleep: Configuration-Aware Alternative to Powering Off FPGA-Based DL Accelerators During Inactivity
Chao Qian, Christopher Cichiwskyj, Tianheng Ling, Gregor Schiele
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2407.12027 failed with HTTP 429).
[444] Who Benefits from AI? Self-Selection, Skill Gap, and the Hidden Costs of AI Feedback
Christoph Riedl, Eric Bogert
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2409.18660 failed with HTTP 429).
[445] Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2511.22099 failed with HTTP 429).
[446] Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2506.06414 failed with HTTP 429).
[447] Graph Data Augmentation with Contrastive Learning on Covariate Distribution Shift
Fanlong Zeng, Wensheng Gan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2512.00716 failed with HTTP 429).
[448] End-to-End Large Portfolio Optimization for Variance Minimization with Neural Networks through Covariance Cleaning
Christian Bongiorno, Efstratios Manolakis, Rosario Nunzio Mantegna
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2507.01918 failed with HTTP 429).
[449] Fine-Tuning Code Language Models to Detect Cross-Language Bugs
Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2507.21954 failed with HTTP 429).
[450] Prompt to Pwn: Automated Exploit Generation for Smart Contracts
ZeKe Xiao, Qin Wang, Yuekang Li, Shiping Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2508.01371 failed with HTTP 429).
[451] Quantum spatial best-arm identification via quantum walks
Tomoki Yamagami, Etsuo Segawa, Takatomo Mihana, André Röhm, Atsushi Uchida, Ryoichi Horisaki
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2509.05890 failed with HTTP 429).
[452] SpecAgent: A Speculative Retrieval and Forecasting Agent for Code Completion
George Ma, Anurag Koul, Qi Chen, Yawen Wu, Sachit Kuhar, Yu Yu, Aritra Sengupta, Varun Kumar, Murali Krishna Ramanathan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.17925 failed with HTTP 429).
[453] Knowledge-Guided Time-Varying Causal Inference for Arctic Sea Ice Dynamics
Akila Sampath, Vandana Janeja, Jianwu Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.17647 failed with HTTP 429).
[454] ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2512.09427 failed with HTTP 429).
[455] QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2601.00679 failed with HTTP 429).
[456] Bootstrapping Code Translation with Weighted Multilanguage Exploration
Yuhan Wu, Huan Zhang, Wei Cheng, Chen Shen, Jingyue Yang, Wei Hu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2601.03512 failed with HTTP 429).
[457] On the Spatiotemporal Dynamics of Generalization in Neural Networks
Zichao Wei
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.01651 failed with HTTP 429).
[458] See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.02063 failed with HTTP 429).
[459] Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.05993 failed with HTTP 429).
[460] Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models
Freja Høgholm Petersen, Jesper Sandvig Mariegaard, Rocco Palmitessa, Allan P. Engsig-Karup
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.05416 failed with HTTP 429).
[461] Chimera: Neuro-Symbolic Attention Primitives for Trustworthy Dataplane Intelligence
Rong Fu, Xiaowen Ma, Kun Liu, Wangyu Wu, Ziyu Kong, Jia Yee Tan, Tailong Luo, Xianda Li, Zeli Su, Youjin Wang, Yongtai Liu, Simon Fong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.12851 failed with HTTP 429).
[462] Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?
Spandan Garg, Yufan Huang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.18571 failed with HTTP 429).
[463] PhysMem: Scaling Test-time Physical Memory for Robot Manipulation
Haoyang Li, Yang You, Hao Su, Leonidas Guibas
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.20323 failed with HTTP 429).
[464] Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models
Subina Khanal, Seshu Tirupathi, Merim Dzaferagic, Marco Ruffini, Torben Bach Pedersen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.16497 failed with HTTP 429).
[465] Reinforced Generation of Combinatorial Structures: Ramsey Numbers
Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.09172 failed with HTTP 429).
[466] Adapting Dijkstra for Buffers and Unlimited Transfers
Denys Katkalo, Andrii Rohovyi, Toby Walsh
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.11729 failed with HTTP 429).
[467] Early Pruning for Public Transport Routing
Andrii Rohovyi, Abdallah Abuaisha, Toby Walsh
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.12592 failed with HTTP 429).
[468] ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
Zifan Xu, Ran Gong, Maria Vittoria Minniti, Ahmet Salih Gundogdu, Eric Rosen, Kausik Sivakumar, Riedana Yan, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.15956 failed with HTTP 429).
[469] GAIN: Multiplicative Modulation for Domain Adaptation
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.04516 failed with HTTP 429).
[470] The data heat island effect: quantifying the impact of AI data centers in a warming world
Andrea Marinoni, Erik Cambria, Weisi Lin, Mauro Dalla Mura, Jocelyn Chanussot, Edoardo Ragusa, Chi Yan Tso, Yihao Zhu, Benjamin Horton
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.20897 failed with HTTP 429).
[471] MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.06798 failed with HTTP 429).
[472] Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.08608 failed with HTTP 429).
[473] THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture
Augustus Haoyang Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.11284 failed with HTTP 429).
[474] Catching Every Ripple: Enhanced Anomaly Awareness via Dynamic Concept Adaptation
Jiaqi Zhu, Shaofeng Cai, Jie Chen, Fang Deng, Beng Chin Ooi, Wenqiao Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.14726 failed with HTTP 429).
[475] Beyond the ‘Diff’: Addressing Agentic Entropy in Agentic Software Development
Matteo Casserini, Alessandro Facchini, Andrea Ferrario
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.16323 failed with HTTP 429).
[476] SCATR: Simple Calibrated Test-Time Ranking
Divya Shyamal, Marta Knežević, Lan Tran, Chanakya Ekbote, Vijay Lingam, Paul Pu Liang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.16535 failed with HTTP 429).
[477] Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration
Donghwan Lee
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.17457 failed with HTTP 429).
[478] LEPO: Latent Reasoning Policy Optimization for Large Language Models
Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.17892 failed with HTTP 429).
[479] Learning the Riccati solution operator for time-varying LQR via Deep Operator Networks
Jun Chen, Umberto Biccari, Junmin Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.18507 failed with HTTP 429).
cs.SD
[480] A Complementary Visualisation Suite for Empirical Performance Analysis: Tempographs, Histograms, Ridgeline Plots, Stacked Bar Charts, and Combination Charts Applied to Beethoven’s Piano and Cello Sonatas
Ignasi Sole
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The choice of visualisation in empirical performance analysis is not a neutral presentation decision but an analytical one: different graphical forms reveal different features of the same dataset, and reliance on any single type systematically conceals what the others expose. This paper presents and argues for a suite of five complementary visualisation tools: tempographs, histograms with spline-smoothed probability density functions, ridgeline plots, stacked bar charts, and combination charts. These are applied to bar-level beats-per-minute data from recordings of Beethoven’s five piano and cello sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2) spanning 1930–2012. Each tool is described formally, its analytical properties characterised, its implementation detailed in working Python and MATLAB code, and its specific contribution demonstrated on a worked example using two recordings of Op. 5 No. 1 (Casals/Horszowski 1930–39 and Isserlis/Levin 2012) separated by eight decades. A five-panel composite figure applies all five tools to the same two recordings simultaneously, making the complementarity argument concrete: the tempograph reveals moment-to-moment structural parallels invisible in aggregate statistics; the spline-smoothed histogram exposes bimodality and secondary peaks suppressed by binning artefacts; the ridgeline plot positions both recordings within the full distributional space; the stacked bar chart shows divergent sectional pacing concealed by identical movement means; and the combination chart integrates mean tempo, variability, and historical reference marks in a single view. The spline-CDF smoothing method, applied to histogram data via cubic spline interpolation with zero-slope boundary conditions, is presented as a novel contribution to the performance analysis toolkit. Full implementation code is publicly available.
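To make the spline-CDF smoothing idea concrete, here is a minimal sketch of the described approach (cubic spline interpolation of the empirical CDF with zero-slope boundary conditions, then differentiation to obtain a smooth PDF). It assumes SciPy's clamped cubic splines; function and variable names are illustrative, not taken from the paper's released code.
```python
# Minimal sketch of spline-CDF smoothing for tempo histograms:
# fit a clamped (zero-slope endpoints) cubic spline to the empirical CDF
# of bar-level BPM values, then differentiate to obtain a smooth PDF.
import numpy as np
from scipy.interpolate import CubicSpline

def spline_smoothed_pdf(bpm, n_bins=30):
    counts, edges = np.histogram(bpm, bins=n_bins)
    cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))
    # 'clamped' imposes a zero first derivative at both boundary knots
    spline = CubicSpline(edges, cdf, bc_type="clamped")
    grid = np.linspace(edges[0], edges[-1], 500)
    pdf = np.clip(spline(grid, 1), 0.0, None)  # derivative of the smoothed CDF
    return grid, pdf

# Toy usage with synthetic, bimodal bar-level tempi
rng = np.random.default_rng(0)
bpm = np.concatenate([rng.normal(60, 4, 200), rng.normal(72, 3, 120)])
grid, pdf = spline_smoothed_pdf(bpm)
```
Working through the CDF rather than the raw histogram is what suppresses binning artefacts while still allowing secondary peaks to survive in the differentiated density.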
[481] Towards Revised Tempo Indications for Beethoven’s Piano and Cello Sonatas: Czerny, Moscheles, Kolisch, and Recorded Practice 1930-2012
Ignasi Sole
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Historical metronome indications for Beethoven’s five piano and cello sonatas (as transmitted by Czerny, Moscheles, and Kolisch) have long been regarded as problematic by performers and scholars alike. This paper presents the first systematic empirical assessment of those indications against a corpus of over one hundred movement-level recordings spanning 1930–2012, encompassing first, second, and third movements across all five sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2). The core findings are threefold. First, Czerny’s and Moscheles’s markings are consistently and substantially exceeded by the entire recording corpus: gaps of 15–39% are documented across movements, with the largest divergences in slow Adagio movements and the smallest in fast Allegro finales. Second, Kolisch’s 1943 markings align considerably more closely with recorded practice than either Czerny’s or Moscheles’s, a striking result given that Kolisch was reasoning without corpus data. Third, the central Allegro tempo traditions for each movement are stable across eight decades, not because all performers play alike, but because three coexisting slow, mid-range, and fast traditions persist simultaneously, with the mid-range dominant throughout. Building on these findings, this paper proposes a set of revised tempo indications grounded in the statistical modal tempi of the corpus, presented as ranges reflecting the documented spectrum of expert interpretive practice rather than single prescriptive values. These indications are offered not as claims about Beethoven’s intentions but as evidence-based reference points for performers and scholars navigating the gap between historical prescription and performable reality.
[482] DASB - Discrete Audio and Speech Benchmark
Pooneh Mousavi, Jarod Duret, Darius Petermann, Artem Ploujnikov, Luca Della Libera, Anastasia Kuznetsova, Cem Subakan, Mirco Ravanelli
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.
[483] Virtual boundary integral neural network for three-dimensional exterior acoustic problems
Jiahao Li, Qiang Xi, Ilia Marchevskiy, Zhuojia Fu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a virtual boundary integral neural network (VBINN) for exterior acoustic problems in three dimensions. The method introduces a virtual boundary inside the scatterer or vibrating body and represents the associated source density with a neural network. Coupled with the acoustic fundamental solution, this representation satisfies the Sommerfeld radiation condition by construction and enables direct evaluation of the acoustic pressure and its normal derivative at arbitrary field points. Because the integration surface is separated from the physical boundary, the formulation avoids the singular and near singular kernel evaluations associated with coincident source and collocation points in conventional boundary integral learning methods. To reduce sensitivity to boundary placement, the geometric parameters of the virtual boundary are optimized jointly with the source density during training. Numerical examples for acoustic scattering, multiple body interaction, and underwater acoustic propagation show close agreement with analytical solutions and COMSOL results, and the Burton Miller extension further improves stability near characteristic frequencies. These results demonstrate the potential of VBINN for exterior acoustic analysis in three dimensions.
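One plausible reading of the construction, sketched here in hedged form (the notation $\sigma_\theta$ for the network-parameterized source density is ours, and the exact layer potential and Burton–Miller coupling should be taken from the paper), represents the radiated pressure as a superposition of fundamental solutions sourced on the virtual boundary $\Gamma_v$:
$$p(\mathbf{x}) \;=\; \int_{\Gamma_v} \sigma_\theta(\mathbf{y})\, \frac{e^{\mathrm{i}k\lvert\mathbf{x}-\mathbf{y}\rvert}}{4\pi\lvert\mathbf{x}-\mathbf{y}\rvert}\, \mathrm{d}\Gamma(\mathbf{y}).$$
Because the free-space Helmholtz kernel itself satisfies the Sommerfeld radiation condition, so does any such superposition; training then amounts to minimizing the boundary-condition residual on the physical surface, with $\Gamma_v$ kept strictly inside the body so that $\lvert\mathbf{x}-\mathbf{y}\rvert$ never vanishes at collocation points.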
[484] APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
Deshui Miao, Yameng Gu, Chao Yang, Xin Li, Haijun Zhang, Ming-Hsuan Yang
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the proposed system introduces two additional front-end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice-ASR to convert long-form spoken input into a structured textual transcript. Since audio-derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni-based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all-zero masks. Otherwise, the transcript is transformed into a segmentation-oriented prompt and fed into Sa2VA to obtain a coarse mask trajectory over the full video. Importantly, this trajectory is treated as an initial semantic hypothesis rather than a final prediction. On top of it, an agentic refinement layer evaluates query reliability, temporal relevance, anchor quality, and potential error sources, and may invoke SAM3 to improve spatial boundary precision and temporal consistency. The resulting framework explicitly decomposes the MEVIS_Audio task into audio-to-text conversion, visual existence verification, coarse video segmentation, and agent-guided refinement. Such a staged design is substantially more appropriate for audio-conditioned Ref-VOS than directly sending noisy ASR outputs into a segmentation model.
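The staged decomposition can be summarized with the control-flow sketch below. Every function it takes is a hypothetical placeholder standing in for the components named in the abstract (VibeVoice-ASR, the Omni-based judge, Sa2VA, and the SAM3-backed agentic refiner); none of these names are real APIs.
```python
# Illustrative control flow only: all injected functions are hypothetical
# stand-ins for the pipeline stages described in the abstract.
import numpy as np

def audio_ref_vos(video_frames, audio_query,
                  transcribe_audio, target_is_visible,
                  sa2va_segment, agentic_refine):
    """Staged audio-conditioned Ref-VOS pipeline (sketch)."""
    transcript = transcribe_audio(audio_query)            # speech -> structured text
    if not target_is_visible(video_frames, transcript):   # existence verification
        h, w = video_frames[0].shape[:2]
        return [np.zeros((h, w), dtype=np.uint8) for _ in video_frames]
    coarse_masks = sa2va_segment(video_frames, transcript)        # initial hypothesis
    return agentic_refine(video_frames, transcript, coarse_masks)  # refinement layer
```
Passing the stage implementations in as arguments mirrors the point of the design: each stage can be swapped or skipped without retraining the segmentation backbone.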
[485] Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features
Chenqian Le, Ruisi Li, Beatrice Fumagalli, Xupeng Chen, Amirhossein Khalilian-Gourtani, Tianyu He, Adeen Flinker, Yao Wang
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We test whether Speech Articulatory Coding (SPARC) features can linearly predict surface electromyography (sEMG) envelopes across aloud, mimed, and subvocal speech in twenty-four subjects. Using elastic-net multivariate temporal response function (mTRF) with sentence-level cross-validation, SPARC yields higher prediction accuracy than phoneme one-hot representations on nearly all electrodes and in all speech modes. Aloud and mimed speech perform comparably, and subvocal speech remains above chance, indicating detectable articulatory activity. Variance partitioning shows a substantial unique contribution from SPARC and a minimal unique contribution from phoneme features. mTRF weight patterns reveal anatomically interpretable relationships between electrode sites and articulatory movements that remain consistent across modes. This study focuses on representation/encoding analysis (not end-to-end decoding) and supports SPARC as a robust and interpretable intermediate target for sEMG-based silent-speech modeling.
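The core encoding analysis is a lagged linear regression. Here is a minimal, runnable sketch of an elastic-net mTRF fit for one electrode; the feature dimensions, lag range, and regularization values are illustrative placeholders rather than the paper's settings, and the random arrays stand in for SPARC trajectories and a rectified sEMG envelope.
```python
# Minimal mTRF-style sketch: predict one sEMG envelope channel from
# time-lagged articulatory (SPARC-like) features with an elastic net.
import numpy as np
from sklearn.linear_model import ElasticNet

def lagged_design(X, lags):
    """Stack copies of X shifted by each lag (rows: time, cols: feature x lag)."""
    T, F = X.shape
    cols = []
    for lag in lags:
        shifted = np.zeros((T, F))
        if lag >= 0:
            shifted[lag:] = X[:T - lag] if lag > 0 else X
        else:
            shifted[:lag] = X[-lag:]
        cols.append(shifted)
    return np.hstack(cols)

rng = np.random.default_rng(0)
T, F = 2000, 12                      # samples, articulatory features
X = rng.standard_normal((T, F))      # stand-in for SPARC trajectories
y = rng.standard_normal(T)           # stand-in for one rectified sEMG envelope

lags = range(-5, 21)                 # e.g. -50 ms to +200 ms at 100 Hz
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(lagged_design(X, lags), y)
weights = model.coef_.reshape(len(lags), F)   # lag-by-feature TRF weights
```
Cross-validating at the sentence level, as the abstract describes, simply means grouping rows of the lagged design matrix by sentence before splitting.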
[486] Tadabur: A Large-Scale Quran Audio Dataset
Faisal Alherran
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.
[487] ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis
Aoduo Li, Haoran Lv, Shengmin Li, Sihao Qin, Hongjian Xu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge this gap, we present ATRIE, a unified framework utilizing a Persona-Prosody Dual-Track (P2-DT) architecture. Our system disentangles generation into a static Timbre Track (via Scalar Quantization) and a dynamic Prosody Track (via Hierarchical Flow-Matching), distilled from a 14B LLM teacher. This design enables robust identity preservation (Zero-Shot Speaker Verification EER: 0.04) and rich emotional expression. Evaluated on our extended AnimeTTS-Bench (50 characters), ATRIE achieves state-of-the-art performance in both generation and cross-modal retrieval (mAP: 0.75), establishing a new paradigm for persona-driven multimedia content creation.
[488] Audio Spoof Detection with GaborNet
Waldek Maciejko
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: One direction of development in audio feature extraction is based on processing raw samples in the time domain. Such an approach appears to be effective, especially in the era of neural networks. An example is SincNet, in which the core of the neural network layer is a set of sinc functions convolved with the input signal. Due to the finite length of the sinc functions, distortions appear in the frequency domain of the convolved signal, just as when the signal is windowed. Recently, a new approach has been developed that replaces the sinc functions with Gabor filters. Because the filter outputs are complex-valued, further modifications are required, such as the squared modulus or Gaussian Lowpass Pooling. In this work, an ingestion layer based on a bank of Gabor filters, named GaborNet, and its modifications are examined in depth within the popular RawNet2 and RawGAT-ST architectures, which were developed for audio spoof detection. A further issue investigated is audio augmentation using codec conversion, room responses, and additive noise.
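For intuition, the sketch below shows what such a Gabor-filter ingestion layer computes on a raw waveform: a bank of complex Gabor filters, the squared modulus of each filter output, and a Gaussian low-pass pooling stage. All filter parameters (centers, bandwidths, pooling width) are illustrative, not the values used in the paper.
```python
# Sketch of a Gabor-filter front end on raw audio: convolve the waveform
# with a bank of complex Gabor filters, take the squared modulus, and apply
# Gaussian low-pass pooling to obtain a frame-like feature map.
import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import gaussian_filter1d

def gabor_kernel(center_hz, bandwidth_hz, sr, length=401):
    t = (np.arange(length) - length // 2) / sr
    sigma = 1.0 / (2 * np.pi * bandwidth_hz)                 # time-domain width
    envelope = np.exp(-0.5 * (t / sigma) ** 2)
    return envelope * np.exp(2j * np.pi * center_hz * t)      # complex carrier

def gabor_frontend(x, sr=16000, n_filters=40, pool_sigma=40):
    centers = np.linspace(60, sr / 2 - 500, n_filters)        # Hz
    bandwidths = np.maximum(centers / 8, 50.0)                 # Hz
    feats = []
    for c, b in zip(centers, bandwidths):
        y = fftconvolve(x, gabor_kernel(c, b, sr), mode="same")
        power = np.abs(y) ** 2                                 # squared modulus
        feats.append(gaussian_filter1d(power, sigma=pool_sigma))
    return np.stack(feats)                                     # (n_filters, T)

x = np.random.default_rng(0).standard_normal(16000)           # 1 s of noise
print(gabor_frontend(x).shape)
```
Because the Gabor envelope is Gaussian rather than a truncated sinc, its spectral sidelobes decay smoothly, which is the motivation the abstract gives for swapping the SincNet-style filters.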
[489] HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
Feiyu Zhao, Yiming Chen, Wenhuan Lu, Daipeng Zhang, Xianghu Yue, Jianguo Wei
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.
[490] Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean
Hyunjung Joo, GyeongTaek Lee
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this end, we introduce the first large-scale benchmark dataset, consisting of 10,093 manually annotated Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.
[491] BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
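A small sketch of the step-grouped tokenization idea follows: notes are quantized to uniform time steps, all events at the same (step, pitch) merge into one token, and an explicit step-boundary token makes time advance uniformly even through empty steps. The note format and token vocabulary here are illustrative assumptions, not the paper's specification.
```python
# Sketch of beat-aligned tokenization: quantize notes to uniform time steps,
# merge events at the same (step, pitch) into one token, and group tokens
# by step with an explicit step-boundary token.
from collections import defaultdict

STEP = "<step>"  # marks the start of each uniform time step

def tokenize(notes, steps_per_beat=4):
    """notes: list of (onset_in_beats, pitch, duration_in_beats)."""
    by_step = defaultdict(dict)
    for onset, pitch, dur in notes:
        step = round(onset * steps_per_beat)
        dur_steps = max(1, round(dur * steps_per_beat))
        # one token per (step, pitch); overlapping events keep the longer duration
        by_step[step][pitch] = max(by_step[step].get(pitch, 0), dur_steps)
    tokens = []
    for step in range(max(by_step) + 1):
        tokens.append(STEP)            # uniform time always advances, even if empty
        for pitch, dur_steps in sorted(by_step.get(step, {}).items()):
            tokens.append(f"P{pitch}_D{dur_steps}")
    return tokens

print(tokenize([(0.0, 60, 1.0), (0.0, 64, 1.0), (0.5, 67, 0.5)]))
```
This is essentially a sparse piano-roll encoding: positions in the token stream correspond to musical time rather than to event count.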
[492] Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model
Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong, Jiakui Li, Huimeng Wang, Haiyun Li, Liang Cao, Shiyin Kang, Zhiyong Wu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure the coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
[493] Environmental Sound Deepfake Detection Using Deep-Learning Framework
Lam Pham, Khoi Vu, Dat Tran, Phat Lam, Vu Nguyen, David Fischinger, Alexander Schindler, Martin Boyer, Son Le
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) – the task of identifying whether the sound scene and sound events in an input audio recording are fake. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, and ensembles of spectrograms or network architectures affect ESDD performance. The experimental results on the EnvSDD and ESDD-Challenge-TestSet benchmark datasets indicate that detecting deepfake sound scenes and detecting deepfake sound events should be treated as separate tasks. We also show that fine-tuning a pre-trained model is more effective than training a model from scratch for ESDD. Finally, our best model, fine-tuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieves an Accuracy of 0.98, F1 Score of 0.95, and AUC of 0.99 on the EnvSDD test subset, and an Accuracy of 0.88, F1 Score of 0.77, and AUC of 0.92 on the ESDD-Challenge-TestSet dataset.
[494] Protecting Bystander Privacy via Selective Hearing in Audio LLMs
Xiao Zhan, Guangzhi Sun, Jose Such, Phil Woodland
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Audio Large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences did not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model’s ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures, including both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. In addition, we propose Selective Efficacy (SE), a novel metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LLMs reveals substantial bystander privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we also present Bystander Privacy Fine-Tuning (BPFT), a novel training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. We show that BPFT yields substantial gains, achieving an absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro, which is the best audio LLM without BPFT. Together, SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio LLMs.
[495] Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.
[496] NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
Liumeng Xue, Weizhen Bian, Jiahao Pan, Wenxuan Wang, Yilin Ren, Boyi Kang, Jingbin Hu, Ziyang Ma, Shuai Wang, Xinyuan Qian, Hung-yi Lee, Yike Guo
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVVs controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.
cs.LG
[497] Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs
Guchan Li, Rui Tian, Hongning Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.
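The local, feedback-conditioned refinement loop can be sketched as below. `compile_proof` and `propose_local_fix` are hypothetical placeholders for the Lean compiler call and the LLM refinement step, and the sketch collapses the paper's tree search into a greedy chain purely for clarity.
```python
# Hedged sketch of verifier-guided local refinement: condition the model on
# the current proof and the compiler's structured failure modes only, rather
# than on the full history of attempts.
def refine(theorem, initial_proof, compile_proof, propose_local_fix,
           max_rounds=8):
    proof = initial_proof
    for _ in range(max_rounds):
        ok, errors = compile_proof(theorem, proof)    # compact set of failure modes
        if ok:
            return proof
        proof = propose_local_fix(theorem, proof, errors)  # local, error-conditioned edit
    return None  # budget exhausted
```
Keeping the prompt bounded by (proof, errors) instead of an ever-growing attempt history is what makes the context cost independent of the number of refinement rounds.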
[498] Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
Zhiyin Yu, Bo Zhang, Qibin Hou, Zhonghai Wu, Xiao Luo, Lei Bai
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Previous LLM-based RL studies typically follow either supervised learning with high annotation costs or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial annotation cost and issues such as model collapse or reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with few-shot labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, combining consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model’s reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
[499] FASE : A Fairness-Aware Spatiotemporal Event Graph Framework for Predictive Policing
Pronob Kumar Barman, Pronoy Kumar Barman, Plaban Kumar Barman, Rohan Mandar Salvi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Predictive policing systems that allocate patrol resources based solely on predicted crime risk can unintentionally amplify racial disparities through feedback-driven data bias. We present FASE, a Fairness-Aware Spatiotemporal Event Graph framework, which integrates spatiotemporal crime prediction with fairness-constrained patrol allocation and a closed-loop deployment feedback simulator. We model Baltimore as a graph of 25 ZIP Code Tabulation Areas and use 139,982 Part 1 crime incidents from 2017 to 2019 at hourly resolution, producing a sparse feature tensor. The prediction module combines a spatiotemporal graph neural network with a multivariate Hawkes process to capture spatial dependencies and self-exciting temporal dynamics. Outputs are modeled using a Zero-Inflated Negative Binomial distribution, suitable for overdispersed and zero-heavy crime counts. The model achieves a validation loss of 0.4800 and a test loss of 0.4857. Patrol allocation is formulated as a fairness-constrained linear optimization problem that maximizes risk-weighted coverage while enforcing a Demographic Impact Ratio constraint with deviation bounded by 0.05. Across six simulated deployment cycles, fairness remains within 0.9928 to 1.0262, and coverage ranges from 0.876 to 0.936. However, a persistent detection-rate gap of approximately 3.5 percentage points remains between minority and non-minority areas. This result shows that allocation-level fairness constraints alone do not eliminate feedback-induced bias in retraining data, highlighting the need for fairness interventions across the full pipeline.
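One way to picture the allocation step is as the small linear program sketched below: maximize risk-weighted coverage under a patrol budget while keeping a Demographic Impact Ratio within 1 ± 0.05. The exact constraint set and variable definitions in the paper may differ; the zone risks and group labels here are synthetic.
```python
# Hedged sketch of a fairness-constrained patrol allocation LP.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_zones, budget, eps = 25, 10.0, 0.05
risk = rng.gamma(2.0, 1.0, n_zones)              # predicted crime risk per zone
minority = rng.random(n_zones) < 0.4             # protected-group indicator (synthetic)

m = minority.astype(float) / minority.sum()           # mean coverage, minority zones
o = (~minority).astype(float) / (~minority).sum()     # mean coverage, other zones

# DIR constraint as two linear inequalities:
#   m.x <= (1+eps) * o.x   and   m.x >= (1-eps) * o.x
A_ub = np.vstack([m - (1 + eps) * o,
                  (1 - eps) * o - m])
b_ub = np.zeros(2)
A_eq, b_eq = np.ones((1, n_zones)), [budget]          # total patrol budget

res = linprog(c=-risk, A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * n_zones)
x = res.x
print("coverage:", risk @ x, "DIR:", (m @ x) / (o @ x))
```
The abstract's closing point is visible even in this toy form: the constraint only balances where patrols go, not what the resulting detections feed back into the training data.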
[500] Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Vin Bhaskara, Haicheng Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model’s cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it reduces to a tractable per-step form: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error and visitation-count baselines in convergence speed and final world model accuracy.
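A minimal sketch of the per-step reward described above follows: the intrinsic reward is the current world-model prediction error minus a learned per-transition baseline, with the baseline critic regressed toward observed errors (which decay toward the noise floor as the world model trains). Network sizes, the optimizer, and the flat state-action interface are illustrative assumptions.
```python
# Minimal sketch of a curiosity-critic style intrinsic reward in PyTorch.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
world_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                            nn.Linear(64, obs_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))            # scalar error-baseline estimate
opt = torch.optim.Adam([*world_model.parameters(), *critic.parameters()], lr=1e-3)

def intrinsic_reward_and_losses(obs, act, next_obs):
    sa = torch.cat([obs, act], dim=-1)
    pred_err = ((world_model(sa) - next_obs) ** 2).mean(dim=-1)   # current error
    baseline = critic(sa).squeeze(-1)                              # estimated error floor
    reward = (pred_err - baseline).detach()        # collapses toward 0 for noise-driven transitions
    model_loss = pred_err.mean()
    critic_loss = ((baseline - pred_err.detach()) ** 2).mean()     # regress the scalar error
    return reward, model_loss + critic_loss

obs = torch.randn(32, obs_dim)
act = torch.randn(32, act_dim)
nxt = torch.randn(32, obs_dim)
reward, loss = intrinsic_reward_and_losses(obs, act, nxt)
opt.zero_grad(); loss.backward(); opt.step()
```
Setting the baseline to zero recovers plain prediction-error curiosity, which is how earlier formulations appear as special cases of this scheme.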
[501] The Cost of Relaxation: Evaluating the Error in Convex Neural Network Verification
Merkouris Papamichail, Konstantinos Varsos, Giorgos Flouris, João Marques-Silva
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Many neural network (NN) verification systems represent the network’s input-output relation as a constraint program. Sound and complete representations involve integer constraints for simulating the activations. Recent works convexly relax the integer constraints, improving performance at the cost of soundness. Convex relaxations consider outputs that are unreachable by the original network. We study the worst-case divergence between the original network and its convex relaxations, both qualitatively and quantitatively. The relaxations’ space forms a lattice, where the top element corresponds to a full relaxation, with every neuron linearized. The bottom element corresponds to the original network. We provide analytical upper and lower bounds for the $\ell_\infty$-distance between the fully relaxed and original outputs. This distance grows exponentially w.r.t. the network’s depth, and linearly w.r.t. the input’s radius. The misclassification probability exhibits a step-like behavior w.r.t. the input radius. Our results are supported by experiments on MNIST, Fashion MNIST, and random networks.
[502] Discrete Tilt Matching
Yuyuan Chen, Shiyi Wang, Peter Potaptchik, Jaeyeon Kim, Michael S. Albergo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM’s annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.
[503] Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models
Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
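The evaluation logic (ablate a candidate edge, compare held-out forecast error) can be made concrete with the runnable sketch below. The paper applies this to a neural additive VAR; the Ridge forecaster, lag count, and synthetic panel here are stand-ins chosen only to illustrate the test.
```python
# Runnable sketch of forecast-necessity testing: compare held-out forecast
# error for a target series with and without the candidate driver's lags.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def lag_matrix(series, p):
    T, k = series.shape
    X = np.hstack([series[p - l - 1:T - l - 1] for l in range(p)])
    return X, series[p:]

def necessity_gap(series, target, driver, p=3, split=0.7):
    X, Y = lag_matrix(series, p)
    y = Y[:, target]
    n = int(split * len(y))
    keep = np.ones(X.shape[1], dtype=bool)
    drop = np.zeros_like(keep)
    drop[[driver + l * series.shape[1] for l in range(p)]] = True  # driver's lag columns
    errs = []
    for mask in (keep, keep & ~drop):
        model = Ridge(alpha=1.0).fit(X[:n][:, mask], y[:n])
        errs.append(mean_squared_error(y[n:], model.predict(X[n:][:, mask])))
    return errs[1] - errs[0]   # > 0 means the edge is needed for forecasting

rng = np.random.default_rng(0)
T = 400
x = rng.standard_normal((T, 3))
for t in range(1, T):                        # variable 0 genuinely drives variable 1
    x[t, 1] += 0.8 * x[t - 1, 0]
print(necessity_gap(x, target=1, driver=0))  # large positive gap
print(necessity_gap(x, target=1, driver=2))  # near zero: redundant/unnecessary edge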
[504] Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling
Andrew Wang, Ellie Pavlick, Ritambhara Singh
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient’s multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.
[505] Towards Understanding the Robustness of Sparse Autoencoders
Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
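Mechanically, inference-time integration of an SAE can be as simple as a forward hook that routes the residual stream through the sparse bottleneck, as in the sketch below. The `sae.encode`/`sae.decode` interface and the layer path in the usage comment are illustrative assumptions, not a specific library's API; gradients still flow through the SAE, matching the no-gradient-blocking setup described above.
```python
# Hedged sketch: splice a sparse autoencoder into a transformer block's output.
import torch

def attach_sae(block: torch.nn.Module, sae: torch.nn.Module):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        recon = sae.decode(sae.encode(hidden))        # sparse bottleneck reconstruction
        if isinstance(output, tuple):
            return (recon,) + output[1:]
        return recon
    return block.register_forward_hook(hook)

# Usage (illustrative): handle = attach_sae(model.model.layers[12], sae)
# ... run generation or an attack ...; handle.remove() restores the original model.
```
The defense knob the ablations vary, the SAE's L0 sparsity, lives entirely inside `sae`, so the hook itself never changes across configurations.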
[506] Multi-Level Temporal Graph Networks with Local-Global Fusion for Industrial Fault Diagnosis
Bibek Aryal, Gift Modekwe, Qiugang Lu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fault detection and diagnosis are critical for the optimal and safe operation of industrial processes. The correlations among sensors often display non-Euclidean structures, for which graph neural networks (GNNs) are widely used. However, for large-scale systems, local, global, and dynamic relations extensively exist among sensors, and traditional GNNs often overlook such complex and multi-level structures for various problems, including fault diagnosis. To address this issue, we propose a structure-aware multi-level temporal graph network with local-global feature fusion for industrial fault diagnosis. First, a correlation graph is dynamically constructed using Pearson correlation coefficients to capture relationships among process variables. Then, temporal features are extracted through a long short-term memory (LSTM)-based encoder, whereas the spatial dependencies among sensors are learned by graph convolution layers. A multi-level pooling mechanism is used to gradually coarsen and learn meaningful graph structures, to capture higher-level patterns while keeping important fault-related details. Finally, a fusion step is applied to combine both detailed local features and overall global patterns before the final prediction. Experimental evaluations on the Tennessee Eastman process (TEP) demonstrate that the proposed model achieves superior fault diagnosis performance, particularly for complex fault scenarios, outperforming various baseline methods.
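A minimal sketch of the graph-construction step the abstract mentions: sensors become nodes, and edges are drawn where the Pearson correlation of their recent readings exceeds a threshold (window size and threshold are illustrative choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.standard_normal((200, 12))        # 200 time steps x 12 process variables
corr = np.corrcoef(window, rowvar=False)       # 12 x 12 Pearson correlation matrix

threshold = 0.3
adj = (np.abs(corr) >= threshold).astype(float)
np.fill_diagonal(adj, 0.0)                     # no self-loops

# Symmetric normalization commonly used before graph convolution: D^-1/2 A D^-1/2
deg = adj.sum(axis=1)
d_inv_sqrt = np.zeros_like(deg)
nz = deg > 0
d_inv_sqrt[nz] = deg[nz] ** -0.5
adj_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
print(adj_norm.shape)
```

Recomputing this adjacency on a sliding window is what makes the graph "dynamic" before the LSTM encoder and graph convolution layers consume it.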
[507] Streaming Structured Inference with Flash-SemiCRF
Benjamin K. Johnson, Thomas Goralski, Ayush Semwal, Hui Shen, H. Josh Jang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment-level features and principled uncertainty estimates at their boundaries. However, existing implementations must materialize a large edge potential tensor whose size grows with sequence length, maximum segment length, and label count, becoming prohibitive for speech-scale state spaces and intractable at genomic scales where sequences can exceed 100,000 positions. This memory bottleneck has limited the adoption of exact segment-level inference for long sequences and large label sets. We identify that the core inefficiency is materializing edge potentials that can instead be evaluated on-the-fly from a compact prefix-sum array, and make several improvements. First, replacing the stored edge tensor with prefix-sum lookup reduces the memory footprint by a factor proportional to the product of segment length and label count. Second, a streaming forward-backward pass with checkpoint-boundary normalization keeps working memory sublinear in sequence length while preserving exact gradients. Third, zero-centered cumulative scores control numerical drift and induce an adaptive duration prior under label imbalance. We integrate these ideas into Flash-SemiCRF, a fused Triton kernel that enables exact semi-CRF inference on previously intractable problem sizes. Available at https://github.com/biobenkj/flash-semicrf.
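A minimal sketch of the core memory trick described above: a segment's content score can be read off a prefix-sum array in constant time instead of materializing a (length x max-segment-length x labels) edge-potential tensor (shapes and the scoring rule are illustrative, not the Flash-SemiCRF kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 1000, 8                               # sequence length, label count
frame_scores = rng.standard_normal((T, L))   # per-position, per-label emissions

# Prefix sums with a leading zero row: prefix[t] = sum of frame_scores[:t]
prefix = np.vstack([np.zeros((1, L)), np.cumsum(frame_scores, axis=0)])

def segment_score(start, end, label):
    """Score of labeling positions [start, end) as one segment with `label`."""
    return prefix[end, label] - prefix[start, label]

# Equivalent to frame_scores[start:end, label].sum(), without storing all segment potentials.
print(np.isclose(segment_score(10, 25, 3), frame_scores[10:25, 3].sum()))
```

The streaming forward-backward pass then evaluates these lookups on the fly inside the recursion, which is what keeps working memory sublinear in sequence length.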
[508] Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
Afsara Benazir, Felix Xiaozhu Lin
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k, scatter/gather, etc., are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to NPU, while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, which drives three key techniques: (1) static tiers for expert capacity to address dynamic expert routing; (2) grouped expert execution to mitigate NPU concurrency limits; and (3) load-aware expert compute-graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
[509] HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
Zijian Zeng, Fei Ding, Huiming Yang, Xianwei Li
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble uncertainty baselines, and its effectiveness depends critically on access to episodic memory. On LIBERO-LONG, HELM improves task success rate by 23.1 percentage points over OpenVLA (58.4% to 81.5%), while extending the context window to H=32 yields only a 5.4-point gain and same-budget LoRA adaptation remains 12.2 points below HELM. HELM also improves long-horizon performance on CALVIN and substantially boosts recovery success under controlled perturbations. Ablations and mechanism analyses isolate the contribution of each component, and we release LIBERO-Recovery as a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation.
[510] Preserving Clusters in Error-Bounded Lossy Compression of Particle Data
Congrong Ren, Sheng Di, Katrin Heitmann, Franck Cappello, Hanqi Guo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Lossy compression is widely used to reduce storage and I/O costs for large-scale particle datasets in scientific applications such as cosmology, molecular dynamics, and fluid dynamics, where clustering structures (e.g., single-linkage or Friends-of-Friends) are critical for downstream analysis; however, existing compressors typically provide only pointwise error bounds on particle positions and offer no guarantees on preserving clustering outcomes, and even small perturbations can alter cluster connectivity and compromise scientific validity. We propose a correction-based technique to preserve single-linkage clustering under lossy compression, operating on decompressed data from off-the-shelf compressors such as SZ3 and Draco. Our key contributions are threefold: (1) a clustering-aware correction algorithm that identifies vulnerable particle pairs via spatial partitioning and local neighborhood search; (2) an optimization-based formulation that enforces clustering consistency using projected gradient descent with a loss that encodes pairwise distance violations; and (3) a scalable GPU-accelerated and distributed implementation for large-scale datasets. Experiments on cosmology and molecular dynamics datasets show that our method effectively preserves clustering results while maintaining competitive compression performance compared with SZ3, ZFP, Draco, LCP, and space-filling-curve-based schemes.
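A minimal sketch of the correction idea under stated assumptions: nudge decompressed particle positions with projected gradient descent so that distances of "vulnerable" pairs stay close to their original values, while every coordinate remains inside the pointwise error bound (constants, the vulnerability rule, and the loss are illustrative simplifications of the paper's formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, link = 50, 0.05, 0.3
orig = rng.random((n, 3))
decomp = orig + rng.uniform(-eps, eps, size=(n, 3))      # what a lossy compressor returns

d_orig = np.linalg.norm(orig[:, None] - orig[None, :], axis=-1)
# Vulnerable pairs: near the linkage threshold, where small perturbations flip connectivity.
upper = np.triu(np.ones((n, n)), 1) > 0
vi, vj = np.where((np.abs(d_orig - link) < 2 * eps) & upper)

pos = decomp.copy()
for _ in range(200):
    diff = pos[vi] - pos[vj]
    d = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-12
    # Loss: squared mismatch between corrected and original pairwise distances.
    grad_pair = 2 * (d - d_orig[vi, vj][:, None]) * diff / d
    grad = np.zeros_like(pos)
    np.add.at(grad, vi, grad_pair)
    np.add.at(grad, vj, -grad_pair)
    pos -= 0.05 * grad
    pos = np.clip(pos, orig - eps, orig + eps)           # projection: respect the error bound
print(np.abs(np.linalg.norm(pos[vi] - pos[vj], axis=1) - d_orig[vi, vj]).mean())
```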
[511] A PPA-Driven 3D-IC Partitioning Selection Framework with Surrogate Models
Shang Wang, Shuai Liu, Owen Randall, Matthew E. Taylor
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: 3D-IC netlist partitioning is commonly optimized using proxy objectives, while final PPA is treated as a costly evaluation rather than an optimization signal. This proxy-driven paradigm makes it difficult to reliably translate additional PPA evaluations into better PPA outcomes. To bridge this gap, we present DOPP (D-Optimal PPA-driven partitioning selection), an approach that connects proxy objectives with true PPA metrics. Across eight 3D-IC designs, our framework improves PPA over Open3DBench (average relative improvements of 9.99% congestion, 7.87% routed wirelength, 7.75% WNS, 21.85% TNS, and 1.18% power). Compared with exhaustive evaluation over the full candidate set, DOPP achieves comparable best-found PPA while evaluating only a small fraction of candidates, substantially reducing evaluation cost. By parallelizing evaluations, our method delivers these gains while maintaining wall-clock runtime comparable to traditional baselines.
[512] Rethinking Dataset Distillation: Hard Truths about Soft Labels
Priyam Dey, Aditya Sahdev, Sunny Bhati, Konda Reddy Mopuri, R. Venkatesh Babu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.
[513] Curvature-Aware PCA with Geodesic Tangent Space Aggregation for Semi-Supervised Learning
Alexandre L. M. Levada
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Principal Component Analysis (PCA) is a fundamental tool for representation learning, but its global linear formulation fails to capture the structure of data supported on curved manifolds. In contrast, manifold learning methods model nonlinearity but often sacrifice the spectral structure and stability of PCA. We propose \emph{Geodesic Tangent Space Aggregation PCA (GTSA-PCA)}, a geometric extension of PCA that integrates curvature awareness and geodesic consistency within a unified spectral framework. Our approach replaces the global covariance operator with curvature-weighted local covariance operators defined over a $k$-nearest neighbor graph, yielding local tangent subspaces that adapt to the manifold while suppressing high-curvature distortions. We then introduce a geodesic alignment operator that combines intrinsic graph distances with subspace affinities to globally synchronize these local representations. The resulting operator admits a spectral decomposition whose leading components define a geometry-aware embedding. We further incorporate semi-supervised information to guide the alignment, improving discriminative structure with minimal supervision. Experiments on real datasets show consistent improvements over PCA, Kernel PCA, Supervised PCA and strong graph-based baselines such as UMAP, particularly in small sample size and high-curvature regimes. Our results position GTSA-PCA as a principled bridge between statistical and geometric approaches to dimensionality reduction.
[514] The High Explosives and Affected Targets (HEAT) Dataset
Bryan Kaiser, Kyle Hickmann, Sharmistha Chakrabarti, Soumi De, Sourabh Pandit, David Schodt, Jesus Pulido, Divya Banesh, Christine Sweeney
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Artificial Intelligence (AI) surrogate models provide a computationally efficient alternative to full-physics simulations, but no public datasets currently exist for training and validating models of high-explosive-driven, multi-material shock dynamics. Simulating shock propagation is challenging due to the need for material-specific equations of state (EOS) and models of plasticity, phase change, damage, fluid instabilities, and multi-material interactions. Explosive-driven shocks further require reactive material models to capture detonation physics. To address this gap, we introduce the High-Explosives and Affected Targets (HEAT) dataset, a physics-rich collection of two-dimensional, cylindrically symmetric simulations generated using an Eulerian multi-material shock-propagation code developed at Los Alamos National Laboratory. HEAT consists of two partitions: expanding shock-cylinder (CYL) simulations and Perturbed Layered Interface (PLI) simulations. Each entry includes time series of thermodynamic fields (pressure, density, temperature), kinematic fields (position, velocity), and continuum quantities such as stress. The CYL partition spans a range of materials, including metals (aluminum, copper, depleted uranium, stainless steel, tantalum), a polymer, water, gases (air, nitrogen), and a detonating material. The PLI partition explores varied geometries with fixed materials: copper, aluminum, stainless steel, polymer, and high explosive. HEAT captures key phenomena such as shock propagation, momentum transfer, plastic deformation, and thermal effects, providing a benchmark dataset for AI/ML models of multi-material shock physics.
[515] One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Chris Cameron, Wangzheng Wang, Nikita Ivanov, Ashmita Bhattacharyya, Didier Chételat, Yingxue Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Looped transformers scale computational depth without increasing parameter count by repeatedly applying a shared transformer block and can be used for iterative refinement, where each loop rewrites a full fixed-size prediction in parallel. On difficult problems, such as those that require search-like computation, reaching a highly structured solution starting from noise can require long refinement trajectories. Learning such trajectories is challenging when training specifies only the target solution and provides no supervision over the intermediate refinement path. Diffusion models tackle this issue by corrupting data with varying magnitudes of noise and training the model to reverse it in a \textit{single step}. However, this process misaligns training and testing behaviour. We introduce Denoising Recursion Models, a method that similarly corrupts data with noise but trains the model to reverse the corruption over \textit{multiple} recursive steps. This strategy provides a tractable curriculum of intermediate states, while better aligning training with testing and incentivizing non-greedy, forward-looking generation. Through extensive experiments, we show this approach outperforms the Tiny Recursion Model (TRM) on ARC-AGI, where it recently achieved breakthrough performance.
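A minimal sketch of the training signal described above: corrupt the target with noise of varying magnitude, then require a shared block, applied recursively for K steps, to recover it, with supervision only on the final state (the tiny MLP and toy task are placeholders, not the authors' architecture):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, K = 32, 4
block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(block.parameters(), lr=1e-3)

for step in range(100):
    target = torch.randn(64, dim)                # "solution" the model must reach
    noise_scale = torch.rand(64, 1)              # varying corruption magnitude
    state = target + noise_scale * torch.randn(64, dim)
    for _ in range(K):                           # multi-step recursive denoising
        state = state + block(state)             # residual refinement with shared weights
    loss = ((state - target) ** 2).mean()        # supervision only on the final state
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

Training through several recursive steps (rather than a single denoising step) is what aligns the optimization with how the model is unrolled at test time.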
[516] Task Switching Without Forgetting via Proximal Decoupling
Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, William A. P. Smith, Yue Lu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In continual learning, the primary challenge is to learn new information without forgetting old knowledge. A common solution addresses this trade-off through regularization, penalizing changes to parameters critical for previous tasks. In most cases, this regularization term is directly added to the training loss and optimized with standard gradient descent, which blends learning and retention signals into a single update and does not explicitly separate essential parameters from redundant ones. As task sequences grow, this coupling can over-constrain the model, limiting forward transfer and leading to inefficient use of capacity. We propose a different approach that separates task learning from stability enforcement via operator splitting. The learning step focuses on minimizing the current task loss, while a proximal stability step applies a sparse regularizer to prune unnecessary parameters and preserve task-relevant ones. This turns the stability-plasticity trade-off into a negotiated update between two complementary operators, rather than a conflicting gradient. We provide theoretical justification for the splitting method on the continual-learning objective, and demonstrate that our proposed solver achieves state-of-the-art results on standard benchmarks, improving both stability and adaptability without the need for replay buffers, Bayesian sampling, or meta-learning components.
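A minimal sketch of the split update under stated assumptions: a plain gradient step on the current task loss, followed by a proximal step that soft-thresholds the change relative to the previous-task parameters, so only a sparse set of parameters actually moves (quadratic losses keep the example self-contained; this is not the paper's solver):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_prev = rng.standard_normal(d)          # parameters after earlier tasks
w = w_prev.copy()
target = rng.standard_normal(d)          # optimum of the current task

lr, lam = 0.1, 0.05
for _ in range(200):
    grad = w - target                    # gradient of 0.5 * ||w - target||^2
    w_half = w - lr * grad               # (1) learning step: current task only
    delta = w_half - w_prev              # (2) proximal stability step on the *change*
    w = w_prev + np.sign(delta) * np.maximum(np.abs(delta) - lr * lam, 0.0)

print(f"nonzero parameter changes: {(np.abs(w - w_prev) > 1e-8).sum()} / {d}")
```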
[517] ParamBoost: Gradient Boosted Piecewise Cubic Polynomials
Nicolas Salvadé, Tim Hillel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generalized Additive Models (GAMs) can be used to create non-linear glass-box (i.e. explicitly interpretable) models, where the predictive function is fully observable over the complete input space. However, glass-box interpretability itself does not allow for the incorporation of expert knowledge from the modeller. In this paper, we present ParamBoost, a novel GAM whose shape functions (i.e. mappings from individual input features to the output) are learnt using a Gradient Boosting algorithm that fits cubic polynomial functions at leaf nodes. ParamBoost incorporates several constraints commonly used in parametric analysis to ensure well-refined shape functions. These constraints include: (i) continuity of the shape functions and their derivatives (up to C2); (ii) monotonicity; (iii) convexity; (iv) feature interaction constraints; and (v) model specification constraints. Empirical results show that the unconstrained ParamBoost model consistently outperforms state-of-the-art GAMs across several real-world datasets. We further demonstrate that modellers can selectively impose required constraints at a modest trade-off in predictive performance, allowing the model to be fully tailored to application-specific interpretability and parametric-analysis requirements.
[518] Subgraph Concept Networks: Concept Levels in Graph Classification
Lucie Charlotte Magister, Alexander Norcliffe, Iulia Duta, Pietro Lio
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The reasoning process of Graph Neural Networks is complex and considered opaque, limiting trust in their predictions. To alleviate this issue, prior work has proposed concept-based explanations, extracted from clusters in the model’s node embeddings. However, a limitation of concept-based explanations is that they only explain the node embedding space and are obscured by pooling in graph classification. To mitigate this issue and provide a deeper level of understanding, we propose the Subgraph Concept Network. The Subgraph Concept Network is the first graph neural network architecture that distils subgraph and graph-level concepts. It achieves this by performing soft clustering on node concept embeddings to derive subgraph and graph-level concepts. Our results show that the Subgraph Concept Network achieves competitive model accuracy while discovering meaningful concepts at different levels of the network.
[519] AC-SINDy: Compositional Sparse Identification of Nonlinear Dynamics
Peter Racioppo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present AC-SINDy, a compositional extension of the Sparse Identification of Nonlinear Dynamics (SINDy) framework that replaces explicit feature libraries with a structured representation based on arithmetic circuits. Rather than enumerating candidate basis functions, the proposed approach constructs nonlinear features through compositions of linear functions and multiplicative interactions, yielding a compact and scalable parameterization and enabling sparsity to be enforced directly over the computational graph. We also introduce a formulation that separates state estimation from dynamics identification by combining latent state inference with shared dynamics and multi-step supervision, improving robustness to noise while preserving interpretability. Experiments on nonlinear and chaotic systems demonstrate that the method recovers accurate and interpretable governing equations while scaling more favorably than standard SINDy.
[520] Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Isaac Llorente-Saguer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English evaluation, we characterise this geometry through six direction-finding strategies. Three succeed: a soft-AUC-optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class-mean probe reaches 0.98 and 0.71 at <1ms fitting cost; a supervised angular-deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction ($73^\circ$ from projection-based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held-out HarmBench and JailbreakBench with worst-case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains $\geq$0.98 and cross-variant transfer stays within 0.018 of own-direction performance. This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety-adjacent evaluation.
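A minimal sketch of the cheapest detector in the abstract, the class-mean probe: score a prompt by its projection onto the difference of mean activations between harmful and benign prompts (random vectors stand in for real residual-stream activations; the AUROC computation is a generic rank statistic, not the authors' evaluation code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 500
axis = rng.standard_normal(d)                                 # unknown "harmfulness" axis
benign = rng.standard_normal((n, d))
harmful = rng.standard_normal((n, d)) + 0.4 * axis

probe = harmful.mean(axis=0) - benign.mean(axis=0)            # class-mean direction (fit in <1 ms)
probe /= np.linalg.norm(probe)

scores = np.concatenate([benign, harmful]) @ probe
labels = np.concatenate([np.zeros(n), np.ones(n)])

# AUROC via the Mann-Whitney rank statistic (probability a harmful prompt outscores a benign one).
ranks = scores.argsort().argsort() + 1.0
auroc = (ranks[labels == 1].sum() - n * (n + 1) / 2) / (n * n)
print(f"AUROC: {auroc:.3f}")
```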
[521] Gradient-Based Program Synthesis with Neurally Interpreted Languages
Matthew V. Macfarlane, Clément Bonnet, Herke van Hoof, Levi H. S. Lelis
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. In contrast, neural networks flexibly learn from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This allows NLI to represent programs that are not bound to a constant number of computation steps, enabling it to solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, enabling the entire model to be trained end-to-end. Crucially, this same differentiability enables powerful test-time adaptation. At inference, NLI’s program inductor provides an initial program guess. This guess is then refined via gradient descent through the neural executor, enabling efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.
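A minimal sketch of the Gumbel-Softmax relaxation the abstract relies on to make discrete primitive choices differentiable: a soft (or straight-through hard) one-hot sample whose gradient flows back into the logits (vocabulary size, temperature, and the toy objective are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 16, requires_grad=True)   # scores over 16 learned primitives

# Soft sample during training (differentiable); hard straight-through variant also available.
soft = F.gumbel_softmax(logits, tau=1.0, hard=False)
hard = F.gumbel_softmax(logits, tau=1.0, hard=True)

loss = (soft * torch.arange(16.0)).sum()          # toy downstream objective
loss.backward()
print(soft.shape, hard.argmax(dim=-1), logits.grad is not None)
```

The same differentiability is what allows the initial program guess to be refined by gradient descent through the neural executor at test time.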
[522] Collaborative Contextual Bayesian Optimization
Chih-Yu Chang, Qiyuan Chen, Tianhan Gao, David Fenning, Chinedum Okwudire, Neil Dasgupta, Wei Lu, Raed Al Kontar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Discovering optimal designs through sequential data collection is essential in many real-world applications. While Bayesian Optimization (BO) has achieved remarkable success in this setting, growing attention has recently turned to context-specific optimal design, formalized as Contextual Bayesian Optimization (CBO). Unlike BO, CBO is inherently more challenging as it must approximate an entire mapping from the context space to its corresponding optimal design, requiring simultaneous exploration across contexts and exploitation within each. In many modern applications, such tasks arise across multiple potentially heterogeneous but related clients, where collaboration can significantly improve learning efficiency. We propose CCBO, Collaborative Contextual Bayesian Optimization, a unified framework enabling multiple clients to jointly perform CBO with controllable contexts, supporting both online collaboration and offline initialization from peers’ historical beliefs, with an optional privacy-preserving communication mechanism. We establish sublinear regret guarantees and demonstrate, through extensive simulations and a real-world hot rolling application, that CCBO achieves substantial improvements over existing approaches even under client heterogeneity. The code to reproduce the results can be found at https://github.com/cchihyu/Collaborative-Contextual-Bayesian-Optimization
[523] Fine-Tuning Small Reasoning Models for Quantum Field Theory
Nathaniel S. Woodward, Zhiqi Gao, Yurii Kvasiuk, Kendrick M. Smith, Frederic Sala, Moritz Münchmeyer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning, to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and $\sim$200M tokens of QFT reasoning traces.
[524] TabEmb: Joint Semantic-Structure Embedding for Table Annotation
Ehsan Hoseinzade, Ke Wang, Anandharaju Durai Raju
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Table annotation is crucial for making web and enterprise tables usable in downstream NLP applications. Unlike textual data where learning semantically rich token or sentence embeddings often suffice, tables are structured combinations of columns wherein useful representations must jointly capture each column’s semantics and the inter-column relationships. Existing models learn by linearizing the 2D table into a 1D token sequence and encoding it with pretrained language models (PLMs) such as BERT. However, this leads to limited semantic quality and weaker generalization to unseen or rare values compared to modern LLMs, and degraded structural modeling due to 2D-to-1D flattening and context-length constraints. We propose TabEmb, which directly targets these limitations by decoupling semantic encoding from structural modeling. An LLM first produces semantically rich embeddings for each column, and a graph-based module over columns then injects relationships into the embeddings, yielding joint semantic-structural representations for table annotation. Experiments show that TabEmb consistently outperforms strong baselines on different table annotation tasks. Source code and datasets are available at https://github.com/hoseinzadeehsan/TabEmb
[525] FlowForge: A Staged Local Rollout Engine for Flow-Field Prediction
Xiaowen Zhang, Ziming Zhou, Fengnian Zhao, David L. S. Hung
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep learning surrogates for CFD flow-field prediction often rely on large, complex models, which can be slow and fragile when data are noisy or incomplete. We introduce FlowForge, a staged local rollout engine that predicts future flow fields by compiling a locality-preserving update schedule and executing it with a shared lightweight local predictor. Rather than producing the next frame in a single global pass, FlowForge rewrites spatial sites stage by stage so that each update conditions only on bounded local context exposed by earlier stages. This compile-execute design aligns inference with short-range physical dependence, keeps latency predictable, and limits error amplification from global mixing. Across PDEBench, CFDBench, and BubbleML, FlowForge matches or improves upon strong baselines in pointwise accuracy, delivers consistently better robustness to noise and missing observations, and maintains stable multi-step rollout behavior while reducing per-step latency.
[526] Distillation Traps and Guards: A Calibration Knob for LLM Distillability
Weixiao Zhan, Yongcheng Jing, Leszek Rutkowski, Dacheng Tao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Knowledge distillation (KD) transfers capabilities from large language models (LLMs) to smaller students, yet it can fail unpredictably and also underpins model leakage risks. Our analysis revealed several distillation traps that distort training signals: tail noise, off-policy instability, and, most fundamentally, the teacher-student gap. These traps manifest as overconfident hallucinations, self-correction collapse, and local decoding degradation, causing distillation to fail. Motivated by these findings, we propose a post-hoc calibration method that, to the best of our knowledge, for the first time enables control over a teacher’s distillability via reinforcement fine-tuning (RFT). Our objective combines task utility, KL anchor, and across-tokenizer calibration reward. This makes distillability a practical safety lever for foundation models, connecting robust teacher-student transfer with deployment-aware model protection. Experiments across math, knowledge QA, and instruction-following tasks show that students distilled from distillable calibrated teachers outperform SFT and KD baselines, while undistillable calibrated teachers retain their task performance but cause distilled students to collapse, offering a practical knob for both better KD and model IP protection.
[527] Self-Improving Tabular Language Models via Iterative Group Alignment
Yunbo Long, Tejumade Afonja, Alexandra Brintrup, Mario Fritz
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While language models have been adapted for tabular data generation, two fundamental limitations remain: (1) static fine-tuning produces models that cannot learn from their own generated samples and adapt to self-correct, and (2) autoregressive objectives preserve local token coherence but neglect global statistical properties, degrading tabular quality. Reinforcement learning offers a potential solution but requires designing reward functions that balance competing objectives – impractical for tabular data. To fill the gap, we introduce TabGRAA (Tabular Group-Relative Advantage Alignment), the first self-improving framework for tabular data generation via automated feedback. At each iteration, TabGRAA uses an \emph{automated quality signal} – such as a two-sample distinguishability classifier or a distance-based reward – to partition newly generated samples into high- and low-quality groups, then optimizes a group-relative advantage objective that reinforces realistic patterns while penalizing artifacts. The specific signal is a modular choice rather than a fixed component of the framework. This establishes a virtuous feedback cycle, where the quality signal is re-computed against newly \emph{generated synthetic} samples at each round; the language model is only fine-tuned on these self-generated signals, so no additional real record is exposed during alignment, mitigating data-leakage risk beyond the initial supervised fine-tuning. Experiments show TabGRAA outperforms existing methods in fidelity, utility, and privacy, while matching or exceeding diffusion-based synthesizers, advancing tabular synthesis from static statistical replication to dynamic, self-improving generation.
[528] Mechanistic Anomaly Detection via Functional Attribution
Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model’s output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.
[529] Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
Yuan Zhuang, Yuexin Bian, Sihong He, Jie Feng, Qing Su, Songyang Han, Jonathan Petit, Shihao Ji, Yuanyuan Shi, Fei Miao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Scaling critic capacity is a promising direction for enhancing off-policy reinforcement learning (RL). However, larger critics are prone to overfitting and unstable in replay-buffer-based bootstrap training. This paper leverages Low-Rank Adaptation (LoRA) as a structural-sparsity regularizer for off-policy critics. Our approach freezes randomly initialized base matrices and solely optimizes low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. Built on top of SimbaV2, we further develop a LoRA formulation, compatible with SimbaV2, that preserves its hyperspherical normalization geometry under frozen-backbone training. We evaluate our method with SAC and FastTD3 on DeepMind Control locomotion and IsaacLab robotics benchmarks. LoRA consistently achieves lower critic loss during training and stronger policy performance. Extensive experiments demonstrate that adaptive low-rank updates provide a simple, scalable, and effective structural regularization for critic learning in off-policy RL.
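A minimal sketch of the structural regularizer described above: a critic layer whose dense base weight stays frozen at its random initialization while the trainable update lives entirely in a low-rank adapter (sizes, rank, and scaling are illustrative; this is not the SimbaV2-compatible formulation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)            # frozen random base matrix
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))   # zero-init: training starts at the base map
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(256, 256)
out = layer(torch.randn(4, 256))                          # works like an ordinary linear layer
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(out.shape, f"trainable fraction: {trainable / total:.3f}")
```

Confining the critic's updates to this low-dimensional subspace is the sparsity constraint the abstract credits for more stable bootstrapped training.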
[530] Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees
Xiaoyang Liu, Zineng Dong, Yifan Bai, Yantao Li, Yuntian Liu, Tao Luo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code will be released to the public soon.
[531] Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
[532] Accelerating trajectory optimization with Sobolev-trained diffusion policies
Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by $2\times$ to $20 \times$. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
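A minimal sketch of a Sobolev-style imitation loss of the kind described above: match the expert action and also the expert's feedback gains, i.e., the Jacobian of the action with respect to the state that a gradient-based TO solver returns (the tiny policy and random "expert" data are placeholders; the paper's policy is a diffusion model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, act_dim = 6, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

x = torch.randn(state_dim)
a_expert = torch.randn(act_dim)             # locally optimal action from the solver
K_expert = torch.randn(act_dim, state_dim)  # feedback gains from the solver

for _ in range(50):
    a = policy(x)
    J = torch.autograd.functional.jacobian(policy, x, create_graph=True)  # (act_dim, state_dim)
    loss = ((a - a_expert) ** 2).mean() + 0.1 * ((J - K_expert) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

Matching first-order information around each demonstration is what counteracts the compounding errors that arise when the policy drifts slightly off the locally optimal trajectories.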
[533] FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting the LLM's intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLM's IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free “plug-in” mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.
[534] Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Julian Skifstad, Xinyue Annie Yang, Glen Chou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
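A minimal sketch of the control primitive the abstract adapts: a finite-horizon discrete-time LQR solved by the backward Riccati recursion. In the paper the time-varying matrices come from layer-wise Jacobians of the locally linearized transformer; here they are random stand-ins and the setpoint is the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, H = 8, 3, 12                         # state dim, control dim, horizon (layers)
A = [np.eye(n) + 0.05 * rng.standard_normal((n, n)) for _ in range(H)]
B = [0.1 * rng.standard_normal((n, m)) for _ in range(H)]
Q, R = np.eye(n), 0.1 * np.eye(m)

# Backward pass: value matrices P_t and feedback gains K_t.
P = Q.copy()
K = [None] * H
for t in reversed(range(H)):
    G = R + B[t].T @ P @ B[t]
    K[t] = np.linalg.solve(G, B[t].T @ P @ A[t])
    P = Q + A[t].T @ P @ (A[t] - B[t] @ K[t])

# Forward pass: steer the state toward the setpoint in closed loop.
x = rng.standard_normal(n)
for t in range(H):
    u = -K[t] @ x                          # closed-loop control applied at layer t
    x = A[t] @ x + B[t] @ u
print(np.linalg.norm(x))
```

The feedback structure (u depends on the current state, not on a precomputed schedule) is what distinguishes this from open-loop steering vectors.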
[535] FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
Pingwei Sun, Yuxuan Hu, Jianchao Tan, Xue Wang, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model’s capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
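A minimal sketch of the update being refined: a gated delta-rule write into a fast-weight memory, with the learning rate promoted from a scalar to a channel-wise vector over the value dimensions (shapes, the simple scalar decay gate, and the random inputs are illustrative, not the paper's parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 16, 16, 5
S = np.zeros((d_v, d_k))                        # fast-weight memory state

for t in range(T):
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
    v = rng.standard_normal(d_v)
    alpha = 0.95                                # decay gate (scalar here for brevity)
    beta = rng.uniform(0.1, 0.9, size=d_v)      # channel-wise learning rate (vector, not scalar)

    S = alpha * S                               # forget
    pred = S @ k                                # current readout for this key
    S = S + np.outer(beta * (v - pred), k)      # per-channel delta-rule write

q = rng.standard_normal(d_k)
print((S @ q).shape)                            # output for a query
```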
[536] Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Qiang Liu, Adrienne Kline, Ermin Wei
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.
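A minimal sketch of the dual half of a generic primal-dual loop of the kind the abstract builds on: the policy is trained on a Lagrangian (helpfulness reward minus lambda times harmfulness cost), while lambda itself follows projected gradient ascent on the estimated constraint violation (the numbers below are toy stand-ins for sampled-trajectory estimates, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, budget, lr_dual = 0.0, 0.2, 0.05

for it in range(100):
    # Stand-ins for Monte Carlo estimates from responses sampled under the current policy.
    reward_est = 1.0 - 0.3 * lam + 0.05 * rng.standard_normal()
    cost_est = max(0.0, 0.6 - 0.4 * lam) + 0.05 * rng.standard_normal()

    # Primal step (not shown): policy-gradient ascent on reward_est - lam * cost_est.
    # Dual step: projected gradient ascent on the constraint violation.
    lam = max(0.0, lam + lr_dual * (cost_est - budget))

print(f"final dual variable: {lam:.2f}")
```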
[537] Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
Jeongwhan Choi, Jongwoo Kim, Woosung Kang, Noseong Park
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: One of the most challenging problems in graph machine learning is generalizing across graphs with diverse properties. Graph neural networks (GNNs) face a fundamental limitation: they require separate training for each new graph, preventing universal generalization across diverse graph datasets. A critical challenge facing GNNs lies in their reliance on labeled training data for each individual graph, a requirement that hinders the capacity for universal node classification due to the heterogeneity inherent in graphs – differences in homophily levels, community structures, and feature distributions across datasets. Inspired by the success of large language models (LLMs) that achieve in-context learning through massive-scale pre-training on diverse datasets, we introduce NodePFN. This universal node classification method generalizes to arbitrary graphs without graph-specific training. NodePFN learns posterior predictive distributions (PPDs) by training only on thousands of synthetic graphs generated from carefully designed priors. Our synthetic graph generation covers real-world graphs through the use of random networks with controllable homophily levels and structural causal models for complex feature-label relationships. We develop a dual-branch architecture combining context-query attention mechanisms with local message passing to enable graph-aware in-context learning. Extensive evaluation on 23 benchmarks demonstrates that a single pre-trained NodePFN achieves 71.27 average accuracy. These results validate that universal graph learning patterns can be effectively learned from synthetic priors, establishing a new paradigm for generalization in node classification.
[538] Intentional Updates for Streaming Reinforcement Learning
Arsalan Sharifnassab, Mohamed Elsayed, Kris De Asis, A. Rupam Mahmood, Richard S. Sutton
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In gradient-based learning, a step size chosen in parameter units does not produce a predictable per-step change in function output. This often leads to instability in the streaming setting (i.e., batch size=1), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily big or small. Instead, we propose intentional updates: first specify the intended outcome of an update and then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via Normalized Least Mean Squares algorithm, which selects a step size to yield a specified change in the function output proportional to the current error. We extend this principle to streaming deep reinforcement learning by defining appropriate intended outcomes: Intentional TD aims for a fixed fractional reduction of the TD error, and Intentional Policy Gradient aims for a bounded per-step change in the policy, limiting local KL divergence. We propose practical algorithms combining eligibility traces and diagonal scaling. Empirically, these methods yield state-of-the-art streaming performance, frequently performing on par with batch and replay-buffer approaches.
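A minimal sketch of the precedent the abstract cites, Normalized Least Mean Squares for streaming linear regression: the step size is solved so that each update removes a fixed fraction mu of the current error in function-output units (dimensions and the noiseless target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu, eps = 10, 0.5, 1e-8
w_true = rng.standard_normal(d)
w = np.zeros(d)

for t in range(500):                       # streaming: one sample per step
    x = rng.standard_normal(d)
    y = w_true @ x
    err = y - w @ x
    w += mu * err * x / (x @ x + eps)      # step size chosen so the new error is (1 - mu) * err

print(np.linalg.norm(w - w_true))
```

Intentional TD and Intentional Policy Gradient extend this "specify the intended outcome, then solve for the step size" idea to TD errors and bounded per-step policy changes, respectively.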
[539] Age-Dependent Heterogeneity in the Association Between Physical Activity and Mental Distress: A Causal Machine Learning Analysis of 3.2 Million U.S. Adults
Yuan Shan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Physical activity (PA) is widely recognized as protective against mental distress, yet whether this benefit varies systematically across population subgroups remains poorly understood. Using pooled data from ten consecutive annual waves of the U.S. Behavioral Risk Factor Surveillance System (2015-2024; n = 3,242,218), we investigate heterogeneity in the association between leisure-time PA and frequent mental distress (FMD, >=14 days/month) across age groups. Survey-weighted logistic regression reveals a striking age gradient: the adjusted odds ratio for PA ranges from 0.89 among young adults (18-24) to 0.50 among adults aged 55-64, with the protective association strengthening monotonically with age. Temporal analysis across all ten years shows that the young-adult PA effect has been eroding over the past decade, with the 18-24 OR reaching 1.01 (null) in both 2018 and 2024 – paralleling the deepening youth mental health crisis. Causal Forest via Double Machine Learning independently identifies age as the dominant driver of treatment effect heterogeneity (feature importance = 0.39, 2.5x the next predictor). E-value sensitivity analysis, propensity score overlap checks, placebo tests, and imputation comparisons confirm the robustness of the findings. These results suggest that the well-documented exercise–mental health link may not generalize to the youngest adult population, whose distress appears increasingly driven by stressors that PA alone cannot mitigate.
[540] S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection
Xuelin Zhang, Hong Chen, Yingjie Wang, Tieliang Gong, Bin Gu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address the above issues, this paper proposes a new Semi-Supervised Meta Additive Model (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including computational convergence and a statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.
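For readers unfamiliar with the empirical Laplacian penalty the abstract builds on, here is its standard form; the similarity matrix W and predictions f are placeholders, and the bilevel update of W that S$^2$MAM performs is not shown.

```python
import numpy as np

def laplacian_penalty(f, W):
    """Graph Laplacian regularizer f^T L f = 0.5 * sum_ij W_ij (f_i - f_j)^2.

    f : (n,) predictions on labeled + unlabeled points
    W : (n, n) symmetric similarity matrix
    """
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    return float(f @ L @ f)

# equivalently, the pairwise form:
# 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```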
[541] Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal
Eun-Ju Park, Youjin Shin, Simon S. Woo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As a means to balance the growth of the AI industry with the need for privacy protection, machine unlearning plays a crucial role in realizing the "right to be forgotten" in artificial intelligence. This technique enables AI systems to remove the influence of specific data while preserving the rest of the learned knowledge. Although it has been actively studied, most existing unlearning methods assume that unlearning is performed only once. In this work, we evaluate existing unlearning algorithms in a more realistic scenario where unlearning is conducted repeatedly, and in this setting, we identify two critical phenomena: (1) Knowledge Erosion, where the accuracy on retain data progressively degrades over unlearning phases, and (2) Forgetting Reversal, where previously forgotten samples become recognizable again in later phases. To address these challenges, we propose SAFER (StAbility-preserving Forgetting with Effective Regularization), a continual unlearning framework that maintains representation stability for retain data while enforcing negative logit margins for forget data. Extensive experiments show that SAFER mitigates not only knowledge erosion but also forgetting reversal, achieving stable performance across multiple unlearning phases.
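An illustrative composition of the two terms the abstract mentions, not the authors' exact objective: an MSE stability term against a frozen anchor model on retain data, plus a hinge-style penalty pushing the forget-class logit margin negative. The `features` method and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def continual_unlearning_loss(model, anchor_model, retain_x, forget_x, forget_y,
                              margin=1.0, lam=1.0):
    """SAFER-style sketch: keep retain-set representations close to a frozen
    anchor while driving the forget-set logit margin negative."""
    # representation stability on retain data (features() is a hypothetical hook
    # returning penultimate-layer representations)
    with torch.no_grad():
        anchor_feats = anchor_model.features(retain_x)
    stability = F.mse_loss(model.features(retain_x), anchor_feats)

    # negative logit margin on forget data: the true-class logit should fall
    # below the best competing class by at least `margin`
    logits = model(forget_x)
    true_logit = logits.gather(1, forget_y.unsqueeze(1)).squeeze(1)
    other_logit = logits.scatter(1, forget_y.unsqueeze(1), float("-inf")).max(dim=1).values
    forget_loss = F.relu(true_logit - other_logit + margin).mean()

    return stability + lam * forget_loss
```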
[542] LLMs Know They’re Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Manav Pandey
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: When a language model agrees with a user’s false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a “this statement is wrong” signal whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge. Edge-level path patching confirms that the same head-to-head connections drive sycophancy, factual lying, and instructed lying. Opinion-agreement, where no factual ground truth exists, reuses these head positions but writes into an orthogonal direction, ruling out a simple “truth-direction” reading of the substrate. Alignment training leaves this circuit in place: an RLHF refresh cuts sycophantic behavior roughly tenfold while the shared heads persist or grow, a pattern that replicates on an independent model family and under targeted anti-sycophancy DPO. When these models behave sycophantically, they register that the user is wrong and agree anyway.
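One common way to "silence" attention heads is to zero their contribution before the attention output projection via a forward pre-hook; a sketch follows. Module paths, the head indices, and the assumed layout of the projection input are all hypothetical and vary across model families, so this is an illustration of the intervention, not the authors' code.

```python
import torch

def make_head_silencer(head_indices, n_heads):
    """Forward pre-hook for an attention output projection (e.g. `o_proj`):
    its input is the concatenation of per-head outputs, so zeroing slices of it
    removes those heads' contribution to the residual stream."""
    def pre_hook(module, args):
        hidden = args[0]                                # (batch, seq, n_heads * head_dim), assumed layout
        b, s, d = hidden.shape
        head_dim = d // n_heads
        hidden = hidden.view(b, s, n_heads, head_dim).clone()
        hidden[:, :, head_indices, :] = 0.0
        return (hidden.view(b, s, d),) + args[1:]
    return pre_hook

# usage sketch (layer index and attribute names are hypothetical):
# handle = model.model.layers[12].self_attn.o_proj.register_forward_pre_hook(
#     make_head_silencer(head_indices=[3, 7], n_heads=32))
# ... run sycophancy evaluation ...
# handle.remove()
```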
[543] RL-ABC: Reinforcement Learning for Accelerator Beamline Control
Anwar Ibrahim, Fedor Ratnikov, Maxim Kaledin, Alexey Petrenko, Denis Derkach
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Particle accelerator beamline optimization is a high-dimensional control problem traditionally requiring significant expert intervention. We present RLABC (Reinforcement Learning for Accelerator Beamline Control), an open-source Python framework that automatically transforms standard Elegant beamline configurations into reinforcement learning environments. RLABC integrates with the widely-used Elegant beam dynamics simulation code via SDDS-based interfaces, enabling researchers to apply modern RL algorithms to beamline optimization with minimal RL-specific development. The main contribution is a general methodology for formulating beamline tuning as a Markov decision process: RLABC automatically preprocesses lattice files to insert diagnostic watch points before each tunable element, constructs a 57-dimensional state representation from beam statistics, covariance information, and aperture constraints, and provides a configurable reward function for transmission optimization. The framework supports multiple RL algorithms through Stable-Baselines3 compatibility and implements stage learning strategies for improved training efficiency. Validation on a test beamline derived from the VEPP-5 injection complex (37 control parameters across 11 quadrupoles and 4 dipoles) demonstrates that the framework successfully enables RL-based optimization, with a Deep Deterministic Policy Gradient agent achieving 70.3% particle transmission – performance matching established methods such as differential evolution. The framework’s stage learning capability allows decomposition of complex optimization problems into manageable subproblems, improving training efficiency. The complete framework, including configuration files and example notebooks, is available as open-source software to facilitate adoption and further research.
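To make the RL-facing shape concrete, here is a toy stand-in with the dimensions quoted in the abstract (57-dim state, continuous magnet-strength actions, transmission as reward) trained with Stable-Baselines3 DDPG. The simulator call is a placeholder; the real framework drives Elegant through SDDS files, and the class and reward below are not RLABC's implementation.

```python
import numpy as np
import gymnasium as gym
from stable_baselines3 import DDPG

class ToyBeamlineEnv(gym.Env):
    """Toy RLABC-style environment: 57-dim beam-statistics state, continuous
    magnet-strength actions, transmission as reward (placeholder physics)."""
    def __init__(self, n_controls=37, state_dim=57):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (state_dim,), np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, (n_controls,), np.float32)

    def _simulate(self, action):
        # placeholder: the real framework runs Elegant and reads watch-point statistics
        transmission = float(np.clip(1.0 - 0.1 * np.linalg.norm(action), 0.0, 1.0))
        state = np.random.randn(self.observation_space.shape[0]).astype(np.float32)
        return state, transmission

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        state, _ = self._simulate(np.zeros(self.action_space.shape))
        return state, {}

    def step(self, action):
        state, transmission = self._simulate(action)
        return state, transmission, True, False, {}   # one-shot episodes for simplicity

model = DDPG("MlpPolicy", ToyBeamlineEnv(), verbose=0)
model.learn(total_timesteps=1_000)
```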
[544] Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Weijie Zhao, Mingquan Liu, Bolun Wang, Simo Wu, Nuobei Xie, Rui-Jie Zhu, Peng Zhou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism’s linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer’s perplexity using up to 41.5% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.
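The Nexus-Rank layer itself is not specified in the abstract; the sketch below is one plausible reading of "a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces" with a zero-initialized growth block, and should not be taken as the authors' architecture.

```python
import torch
import torch.nn as nn

class NonlinearProjection(nn.Module):
    """Sketch of a three-stage nonlinear Q/K/V projection with two activations
    in a higher-dimensional intermediate space, plus a zero-initialized block
    that adds capacity without changing the function at initialization."""
    def __init__(self, d_model, d_head, expand=4):
        super().__init__()
        d_mid = expand * d_model
        self.up = nn.Linear(d_model, d_mid)
        self.mid = nn.Linear(d_mid, d_mid)
        self.down = nn.Linear(d_mid, d_head)
        self.act1, self.act2 = nn.SiLU(), nn.GELU()
        self.growth = nn.Linear(d_model, d_head, bias=False)
        nn.init.zeros_(self.growth.weight)   # preserves pretrained behavior when added

    def forward(self, x):
        h = self.act2(self.mid(self.act1(self.up(x))))
        return self.down(h) + self.growth(x)
```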
[545] SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, Xiaoxia Wu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design (token-wise INT4 quantization with block-diagonal Hadamard rotation) consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
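A reference-style sketch of the recipe named in the abstract: a block-diagonal Hadamard rotation followed by symmetric per-token INT4 quantization. The block size, scaling scheme, and dequantization path are assumptions for illustration; the paper's fused kernel is not reproduced here.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_blockwise(x, block=64):
    """Apply a normalized Hadamard rotation independently to each `block`-sized
    chunk of the last dimension (a block-diagonal Hadamard transform)."""
    H = hadamard(block) / np.sqrt(block)          # orthogonal and symmetric
    xr = x.reshape(*x.shape[:-1], -1, block)
    return (xr @ H).reshape(x.shape)

def quantize_int4_tokenwise(x):
    """Symmetric per-token INT4 quantization: one scale per token (row)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

# sketch: quantize a (num_tokens, head_dim) slice of the key cache
keys = np.random.randn(128, 128).astype(np.float32)
q, scale = quantize_int4_tokenwise(rotate_blockwise(keys, block=64))
keys_dequant = rotate_blockwise(q * scale, block=64)   # H is its own inverse
```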
[546] LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
Siqing Song, Chuang Wang, Yong Lang, Yi Yang, Xu-Yao Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained on only 0.016B tokens using a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.
[547] FOCAL-Attention for Heterogeneous Multi-Label Prediction
Chenghao Zhang, Qingqing Long, Ludi Wang, Wenjuan Cui, Jianjun Yu, Yi Du
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Heterogeneous graphs have attracted increasing attention for modeling multi-typed entities and relations in complex real-world systems. Multi-label node classification on heterogeneous graphs is challenging due to structural heterogeneity and the need to learn shared representations across multiple labels. Existing methods typically adopt either flexible attention mechanisms or meta-path constrained anchoring, but in heterogeneous multi-label prediction they often suffer from semantic dilution or coverage constraints. Both issues are further amplified under multi-label supervision. We present a theoretical analysis showing that as heterogeneous neighborhoods expand, the attention mass allocated to task-critical (primary) neighborhoods diminishes, and that meta-path constrained aggregation exhibits a dilemma: too few meta-paths intensify the coverage constraint, while too many re-introduce dilution. To resolve this coverage-anchoring conflict, we propose FOCAL: Fusion Of Coverage and Anchoring Learning, with two components: coverage-oriented attention (COA) for flexible, unconstrained heterogeneous context aggregation, and anchoring-oriented attention (AOA) that restricts aggregation to meta-path-induced primary semantics. Our theoretical analysis and experimental results further indicate that FOCAL outperforms other state-of-the-art methods.
[548] Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning
Xiangmeng Wang, Qian Li, Haiyang Xia, Hao Miao, Qing Li, Guandong Xu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this work, we take a novel perspective by examining recurring inductive subgraphs, empirically and theoretically showing that they act as spurious shortcuts that mislead GNNs and reinforce non-causal correlations in heterophilic graphs. To address this, we adopt a causal inference perspective to analyze and correct the biased learning behavior induced by shortcut inductive subgraphs. We propose a debiased causal graph that explicitly blocks confounding and spillover paths responsible for these shortcuts. Guided by this causal graph, we introduce Causal Disentangled GNN (CD-GNN), a principled framework that disentangles spurious inductive subgraphs from true causal subgraphs by explicitly blocking non-causal paths. By focusing on genuine causal signals, CD-GNN substantially improves the robustness and accuracy of node classification in heterophilic graphs. Extensive experiments on real-world datasets not only validate our theoretical findings but also demonstrate that our proposed CD-GNN outperforms state-of-the-art heterophily-aware baselines.
[549] The Logical Expressiveness of Topological Neural Networks
Amirreza Akbari, Amauri H. Souza, Vikas Garg
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: what is the logical expressiveness of TNNs? Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called the $k$-CCWL test. In addition, we introduce the topological counting logic (TC$_k$), an extension of standard counting logic featuring a novel pairwise counting quantifier $\exists^{N}(x_i,x_j)\,\varphi(x_i,x_j)$, which explicitly quantifies pairs $(x_i, x_j)$ satisfying property $\varphi$. We rigorously prove the exact equivalence $k\text{-CCWL} \equiv \text{TC}_{k+2} \equiv \text{Topological }(k{+}2)\text{-pebble game}$. These results establish a logical expressiveness theory for TNNs.
[550] TEMPO: Scaling Test-time Training for Large Reasoning Models
Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
[551] Debiased neural operators for estimating functionals
Konstantin Hess, Dennis Frauen, Niki Kilbertus, Stefan Feuerriegel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Neural operators are widely used to approximate solution maps of complex physical systems. In many applications, however, the goal is not to recover the full solution trajectory, but to summarize the solution trajectory via a scalar target quantity (e.g., a functional such as time spent in a target range, time above a threshold, accumulated cost, or total energy). In this paper, we introduce DOPE (debiased neural operator): a semiparametric estimator for such target quantities of solution trajectories obtained from neural operators. DOPE is broadly applicable to settings with both partial and irregular observations and can be combined with arbitrary neural operator architectures. We make three main contributions. (1) We show that, in contrast to DOPE, naive plug-in estimation can suffer from first-order bias. (2) To address this, we derive a novel one-step, Neyman-orthogonal estimator that treats the neural operator as a high-dimensional nuisance mapping between function spaces, and removes the leading bias term. For this, DOPE uses a weighting mechanism that simultaneously accounts for irregular observation designs and for how sensitive the target quantity is to perturbations of the underlying trajectory. (3) To learn the weights, we extend automatic debiased machine learning to operator-valued nuisances via Riesz regression. We demonstrate the benefits of DOPE across various numerical experiments.
[552] On the Conditioning Consistency Gap in Conditional Neural Processes
Robin Young
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Neural processes are meta-learning models that map context sets to predictive distributions. While inspired by stochastic processes, NPs do not generally satisfy the Kolmogorov consistency conditions required to define a valid stochastic process. This inconsistency is widely acknowledged but poorly understood. Practitioners note that NPs work well despite the violation, without quantifying what this means. We address this gap by defining the conditioning consistency gap, a KL divergence measuring how much a conditional neural process’s (CNP) predictions change when a point is added to the context versus conditioned upon. Our main results show that for CNPs with bounded encoders and Lipschitz decoders, the consistency gap is $O(1/n^2)$ in context size $n$, and that this rate is tight. These bounds establish the precise sense in which CNPs approximate valid stochastic processes. The inconsistency is negligible for moderate context sizes but can be significant in the few-shot regime.
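For CNPs with Gaussian outputs, the conditioning consistency gap is a KL divergence between two Gaussian predictive distributions, which has a closed form. The sketch below shows that closed form; the two model calls in the usage comment are hypothetical placeholders for the "added to the context" and "conditioned upon" routes described in the abstract.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# gap at a target x_star, with hypothetical model calls:
# mu_a, var_a = cnp(context + [(x_new, y_new)], x_star)            # point added to the context
# mu_b, var_b = condition_predictive(cnp, context, x_new, y_new, x_star)  # point conditioned upon
# gap = gaussian_kl(mu_a, var_a, mu_b, var_b)
```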
[553] RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
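A minimal Ramer-Douglas-Peucker routine on a sequence of per-layer hidden-state vectors, showing how geometric breakpoints could be turned into a layer-selection signal. The pooling of hidden states, the tolerance `eps`, and the trajectory shown are placeholders, not the paper's configuration.

```python
import numpy as np

def rdp_breakpoints(points, eps):
    """Ramer-Douglas-Peucker on a polyline of high-dimensional points.
    Returns indices of retained breakpoints (candidate layers to adapt)."""
    def perp_dist(p, a, b):
        d = b - a
        denom = np.dot(d, d)
        t = 0.0 if denom == 0 else np.clip(np.dot(p - a, d) / denom, 0.0, 1.0)
        return np.linalg.norm(p - (a + t * d))

    def recurse(lo, hi):
        if hi <= lo + 1:
            return [lo, hi]
        dists = [perp_dist(points[i], points[lo], points[hi]) for i in range(lo + 1, hi)]
        i_max = int(np.argmax(dists)) + lo + 1
        if dists[i_max - lo - 1] > eps:
            left, right = recurse(lo, i_max), recurse(i_max, hi)
            return left[:-1] + right        # drop duplicate split point
        return [lo, hi]

    return recurse(0, len(points) - 1)

# e.g. hidden_means[l] = mean-pooled hidden state after layer l for a probe batch
hidden_means = np.random.randn(36, 4096)          # placeholder trajectory
layers_to_adapt = rdp_breakpoints(hidden_means, eps=5.0)
```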
[554] Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset
Gonzalo Nápoles, Isel Grau, Yamisleydi Salgueiro
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.
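The rough-set boundary region described here can be computed by grouping identical concept profiles and checking whether they map to more than one diagnosis; a pandas sketch follows. Column names are placeholders for the seven checklist criteria, not the dataset's actual field names.

```python
import pandas as pd

def boundary_region(df, concept_cols, label_col="diagnosis"):
    """Rough-set style check: a concept profile is inconsistent (boundary region)
    if identical concept values map to more than one diagnosis label."""
    n_labels = df.groupby(concept_cols)[label_col].transform("nunique")
    return df[n_labels > 1]

# placeholder column names for the 7-point checklist criteria
concepts = ["pigment_network", "streaks", "pigmentation", "regression",
            "dots_globules", "blue_whitish_veil", "vascular_structures"]
# inconsistent = boundary_region(derm7pt_df, concepts)
# derm7pt_plus = derm7pt_df.drop(inconsistent.index)   # symmetric filtering
```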
[555] When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction
Simin Yu, Sufia Fathima
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid growth of chemical literature has generated vast amounts of unstructured data, where reaction information is particularly valuable for applications such as reaction predictions and drug design. However, the prohibitive cost of expert annotation has led to a scarcity of training data, severely hindering the performance of automatic reaction extraction. In this work, we conduct a systematic study of active learning for chemical reaction extraction. We integrate six uncertainty- and diversity-based strategies with pretrained transformer-CRF architectures, and evaluate them on product extraction and role labeling tasks. While several methods approach full-data performance with fewer labeled instances, learning curves are often non-monotonic and task-dependent. Our analysis shows that strong pretraining, structured CRF decoding, and label sparsity limit the stability of conventional active learning strategies. These findings provide practical insights for the effective use of active learning in chemical information extraction.
[556] FedSEA: Achieving Benefit of Parallelization in Federated Online Learning
Harekrushna Sahu, Pratik Jawanpuria, Pranay Sharma
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Online federated learning (OFL) has emerged as a popular framework for decentralized decision-making over continuous data streams without compromising client privacy. However, the adversary model assumed in standard OFL typically precludes any potential benefits of parallelization. Further, it fails to adequately capture the different sources of statistical variation in OFL problems. In this paper, we extend the OFL paradigm by integrating a stochastically extended adversary (SEA). Under this framework, the loss function remains fixed across clients over time. However, the adversary dynamically and independently selects the data distribution for each client at each time. We propose the FedSEA algorithm to solve this problem, which utilizes online stochastic gradient descent at the clients, along with periodic global aggregation via the server. We establish bounds on the global network regret over a time horizon $T$ for two classes of functions: (1) for smooth and convex losses, we prove an $\mathcal{O}(\sqrt{T})$ bound, and (2) for smooth and strongly convex losses, we prove an $\mathcal{O}(\log T)$ bound. Through careful analysis, we quantify the individual impact of both spatial (across clients) and temporal (over time) data heterogeneity on the regret bounds. Consequently, we identify a regime of mild temporal variation (relative to stochastic gradient variance), where the network regret improves with parallelization. Hence, in the SEA setting, our results improve the existing pessimistic worst-case results in online federated learning.
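A skeleton of the algorithmic structure described here (client-side online SGD with periodic server averaging); the adversarially chosen per-client distributions are hidden inside the gradient oracle, and the function names and defaults are illustrative.

```python
import numpy as np

def federated_online_sgd(grads, n_clients, dim, T, sync_every=10, lr=0.1):
    """Client-side online SGD with periodic global aggregation.
    `grads(i, t, w)` returns client i's stochastic gradient at round t."""
    w = np.zeros((n_clients, dim))
    for t in range(T):
        for i in range(n_clients):
            w[i] -= lr * grads(i, t, w[i])      # local online update
        if (t + 1) % sync_every == 0:
            w[:] = w.mean(axis=0)               # server averages and broadcasts
    return w.mean(axis=0)
```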
[557] Evaluation-driven Scaling for Scientific Discovery
Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, Yuzhi Xu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
[558] LASER: Learning Active Sensing for Continuum Field Reconstruction
Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang, Xiaokang Yang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate "what-if" sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.
[559] FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition
Rudolf Debelak
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The evaluation of machine learning models typically relies mainly on performance metrics based on loss functions, which risk overlooking changes in performance in relevant subgroups. Auditing tools such as SliceFinder and SliceLine were proposed to detect such groups, but usually have conceptual disadvantages, such as the inability to directly address continuous covariates. In this paper, we introduce FairTree, a novel algorithm adapted from psychometric invariance testing. Unlike SliceFinder and related algorithms, FairTree directly handles continuous, categorical, and ordinal features without discretization. It further decomposes performance disparities into systematic bias and variance, allowing a categorization of changes in algorithm performance. We propose and evaluate two variations of the algorithm: a permutation-based approach, which is conceptually closer to SliceFinder, and a fluctuation test. Through simulation studies that include a direct comparison with SliceLine, we demonstrate that both approaches have a satisfactory rate of false-positive results, but that the fluctuation approach has relatively higher power. We further illustrate the method on the UCI Adult Census dataset. The proposed algorithms provide a flexible framework for the statistical evaluation of the performance and aspects of fairness of machine learning models in a wide range of applications, even with relatively small datasets.
[560] TACENR: Task-Agnostic Contrastive Explanations for Node Representations
Vasiliki Papanikou, Evaggelia Pitoura
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods primarily focus on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also proximity and structural ones that contribute the most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space, revealing which features play an important role in the representation of a node. While our focus is on task-agnostic explanations, TACENR can be applied to supervised scenarios as well. Experimental results demonstrate that proximity and structural features play a significant role in shaping node representations and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.
[561] Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?
Yi Zhao, Di Yuan, Tao Deng, Suzhi Cao, Ying Dong
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Federated learning (FL) is a key paradigm for distributed model learning across decentralized data sources. Communication in each FL round typically consists of two phases: (i) distributing the global model from a server to clients, and (ii) collecting updated local models from clients to the server for aggregation. This paper focuses on a type of FL where communication between a client and the server is relay-based over dynamic networks, making routing optimization essential. A typical scenario is in-orbit FL, where satellites act as clients and communicate with a server (which can be a satellite, ground station, or aerial platform) via multi-hop inter-satellite links. This paper presents a comprehensive tractability analysis of routing optimization for in-orbit FL under different settings. For global model distribution, these include the number of models, the objective function, and routing schemes (unicast versus multicast, and splittable versus unsplittable flow). For local model collection, the settings consider the number of models, client selection, and flow splittability. For each case, we rigorously prove whether the global optimum is obtainable in polynomial time or the problem is NP-hard. Together, our analysis draws clear boundaries between tractable and intractable regimes for a broad spectrum of routing problems for in-orbit FL. For tractable cases, the derived efficient algorithms are directly applicable in practice. For intractable cases, we provide fundamental insights into their inherent complexity. These contributions fill a critical yet unexplored research gap, laying a foundation for principled routing design, evaluation, and deployment in satellite-based FL or similar distributed learning systems.
[562] Revisiting Catastrophic Forgetting in Continual Knowledge Graph Embedding
Gerard Pons, Carlos Escolano, Besim Bilalli, Anna Queralt
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Knowledge Graph Embeddings (KGEs) support a wide range of downstream tasks over Knowledge Graphs (KGs). In practice, KGs evolve as new entities and facts are added, motivating Continual Knowledge Graph Embedding (CKGE) methods that update embeddings over time. Current CKGE approaches address catastrophic forgetting (i.e., the performance degradation on previously learned tasks) primarily by limiting changes to existing embeddings. However, we show that this view is incomplete. When new entities are introduced, their embeddings can interfere with previously learned ones, causing the model to predict them in place of previously correct answers. This phenomenon, which we call entity interference, has been largely overlooked and is not accounted for in current CKGE evaluation protocols. As a result, the assessment of catastrophic forgetting becomes misleading, and the performance of CKGE methods is systematically overestimated. To address this issue, we introduce a corrected CKGE evaluation protocol that accounts for entity interference. Through experiments on multiple benchmarks, we show that ignoring this effect can lead to performance overestimation of up to 25%, particularly in scenarios with significant entity growth. We further analyze how different CKGE methods and KGE models are affected by the different sources of forgetting, and introduce a catastrophic forgetting metric tailored to CKGE.
[563] Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Thomas Zollo, Jimmy Wang, Richard Zemel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.
[564] Heterogeneity-Aware Personalized Federated Learning for Industrial Predictive Analytics
Yuhan Hu, Xiaolei Fang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Federated prognostics enable clients (e.g., companies, factories, and production lines) to collaboratively develop a failure time prediction model while keeping each client’s data local and confidential. However, traditional federated models often assume homogeneity in the degradation processes across clients, an assumption that may not hold in many industrial settings. To overcome this, this paper proposes a personalized federated prognostic model designed to accommodate clients with heterogeneous degradation processes, allowing them to build tailored prognostic models. The prognostic model iteratively facilitates the underlying pairwise collaborations between clients with similar degradation patterns, which enhances the performance of personalized federated learning. To estimate parameters jointly using decentralized datasets, we develop a federated parameter estimation algorithm based on proximal gradient descent. The proposed approach addresses the limitations of existing federated prognostic models by simultaneously achieving model personalization, preserving data privacy, and providing comprehensive failure time distributions. The superiority of the proposed model is validated through extensive simulation studies and a case study using the turbofan engine degradation dataset from the NASA repository.
[565] ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
Suvinava Basak
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Batch Normalization (BN) is a cornerstone of deep learning, yet it fundamentally breaks down in micro-batch regimes (e.g., 3D medical imaging) and non-IID Federated Learning. Removing BN from deep architectures, however, often leads to catastrophic training failures such as vanishing gradients and dying channels. We identify that standard activation functions, like Swish and ReLU, exacerbate this instability in BN-free networks due to their non-zero-centered nature, which causes compounding activation mean-shifts as network depth increases. In this technical communication, we propose Zero-Centered Swish (ZC-Swish), a drop-in activation function parameterized to dynamically anchor activation means near zero. Through targeted stress-testing on BN-free convolutional networks at depths 8, 16, and 32, we demonstrate that while standard Swish collapses to near-random performance at depth 16 and beyond, ZC-Swish maintains stable layer-wise activation dynamics and achieves the highest test accuracy at depth 16 (51.5%) with seed 42. ZC-Swish thus provides a robust, parameter-efficient solution for stabilizing deep networks in memory-constrained and privacy-preserving applications where traditional normalization is unviable.
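The abstract does not give the exact parameterization of ZC-Swish; the sketch below is one plausible reading of "dynamically anchor activation means near zero", subtracting a running estimate of the Swish output mean, and may differ from the paper's formulation.

```python
import torch
import torch.nn as nn

class ZeroCenteredSwish(nn.Module):
    """Illustrative zero-centered Swish: apply Swish/SiLU, then subtract a running
    estimate of its mean so activations stay roughly centered without BatchNorm."""
    def __init__(self, momentum=0.01):
        super().__init__()
        self.momentum = momentum
        self.register_buffer("running_mean", torch.zeros(1))

    def forward(self, x):
        y = x * torch.sigmoid(x)                          # Swish / SiLU
        if self.training:
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * y.mean())
        return y - self.running_mean
```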
[566] EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
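A compact sketch of the gating rule described here: compute batch-level explained variance and fall back to the batch-mean baseline when the critic does not reduce variance. GAE, clipping, and the rest of the PPO/GRPO machinery are omitted, and the threshold and function names are illustrative.

```python
import numpy as np

def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns).
    EV <= 0 means the critic adds at least as much noise as signal."""
    var_r = np.var(returns)
    return float("nan") if var_r == 0 else 1.0 - np.var(returns - values) / var_r

def adaptive_advantages(returns, values):
    """EVPO-style gating (sketch): use the critic baseline only when it has
    positive explained variance on the current batch."""
    ev = explained_variance(returns, values)
    if np.isfinite(ev) and ev > 0:
        return returns - values            # critic-based (PPO-style)
    return returns - returns.mean()        # batch-mean baseline (GRPO-style)
```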
[567] When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift
Saket Maganti
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset’s topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
[568] Accelerating Optimization and Machine Learning through Decentralization
Ziqin Chen, Zuang Wang, Yongqiang Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Decentralized optimization enables multiple devices to learn a global machine learning model while each individual device only has access to its local dataset. By avoiding the need for training data to leave individual users’ devices, it enhances privacy and scalability compared to conventional centralized learning, where all data has to be aggregated to a central server. However, decentralized optimization has traditionally been viewed as a necessary compromise, used only when centralized processing is impractical due to communication constraints or data privacy concerns. In this study, we show that decentralization can paradoxically accelerate convergence, outperforming centralized methods in the number of iterations needed to reach optimal solutions. Through examples in logistic regression and neural network training, we demonstrate that distributing data and computation across multiple agents can lead to faster learning than centralized approaches, even when each iteration is assumed to take the same amount of time, whether performed centrally on the full dataset or decentrally on local subsets. This finding challenges longstanding assumptions and reveals decentralization as a strategic advantage, offering new opportunities for more efficient optimization and machine learning.
[569] Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments
Jianyang Gao, Yutong Gou, Yuexuan Xu, Jifan Shi, Yongyi Yang, Shuolin Li, Raymond Chi-Wing Wong, Cheng Long
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empirical performance, using a reproducible, transparent, and symmetric setup. Our results show that, despite the claimed advantage of TurboQuant, TurboQuant does not provide a consistent improvement over RaBitQ in directly comparable settings; in many tested configurations, it performs worse than RaBitQ. We further find that several reported runtime and recall results in the TurboQuant paper could not be reproduced from the released implementation under the stated configuration. Overall, this note clarifies the shared structure and genuine differences between the two lines of work, while documenting reproducibility issues in the experimental results reported by the TurboQuant paper.
[570] Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
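One way to realize the mechanism described here is to draw multinomial counts from the softmax weights and renormalize, with the total count acting as the concentration parameter; where exactly this replaces softmax inside a given architecture is an assumption of the sketch.

```python
import torch
from torch.distributions import Multinomial

def stochastic_attention_weights(scores, concentration=64):
    """Replace softmax attention weights with normalized multinomial samples.
    scores: (..., n_queries, n_keys) pre-softmax attention scores.
    Larger `concentration` keeps the samples closer to the softmax weights."""
    probs = torch.softmax(scores, dim=-1)
    counts = Multinomial(total_count=concentration, probs=probs).sample()
    return counts / concentration

# predictive ensemble at inference time (no retraining):
# outputs = [model_with_stochastic_attention(x) for _ in range(n_samples)]
```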
[571] Separating Geometry from Probability in the Analysis of Generalization
Maxim Raginsky, Benjamin Recht
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset $S$ and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be $S$ (in which case we speak of "in-sample" performance) or some entirely new $S'$ (in which case we speak of "out-of-sample" performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d. draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used ex post to characterize the situations when this error term is small (either on average or with high probability).
[572] Structure-guided molecular design with contrastive 3D protein-ligand learning
Carles Navarro, Philipp Tholke, Gianni de Fabritiis
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Structure-based drug discovery faces the dual challenge of accurately capturing 3D protein-ligand interactions while navigating ultra-large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)-equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero-shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target-specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.
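The SE(3)-equivariant encoders are the paper's contribution and are not shown; the contrastive objective tying pocket and ligand embeddings together is typically a symmetric InfoNCE, sketched below as an assumption (the paper's exact loss may differ).

```python
import torch
import torch.nn.functional as F

def pocket_ligand_infonce(pocket_emb, ligand_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched pocket/ligand pairs.
    Row i of each tensor is assumed to be a true binding pair."""
    p = F.normalize(pocket_emb, dim=-1)
    l = F.normalize(ligand_emb, dim=-1)
    logits = p @ l.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))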
[573] Lyapunov-Certified Direct Switching Theory for Q-Learning
Donghwan Lee
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Q-learning is one of the most fundamental algorithms in reinforcement learning. We analyze constant-stepsize Q-learning through a direct stochastic switching system representation. The key observation is that the Bellman maximization error can be represented exactly by a stochastic policy. Therefore, the Q-learning error admits a switched linear conditional-mean recursion with martingale-difference noise. The intrinsic drift rate is the joint spectral radius (JSR) of the direct switching family, which can be strictly smaller than the standard row-sum rate. Using this representation, we derive a finite-time final-iterate bound via a JSR-induced Lyapunov function and then give a computable quadratic-certificate version.
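For concreteness, a toy constant-stepsize Q-learning iteration of the kind the analysis targets is sketched below; the MDP, stepsize, and horizon are arbitrary illustrative choices, and the switched-system/Lyapunov analysis itself is not implemented here:

```python
import numpy as np

# Toy constant-stepsize synchronous Q-learning on a random 2-state, 2-action MDP.
# The paper's analysis views the error Q_t - Q* as a switched linear recursion;
# only the iteration itself is illustrated here.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a] = next-state distribution
R = rng.uniform(size=(2, 2))                 # deterministic rewards
gamma, alpha = 0.9, 0.1

Q_star = np.zeros((2, 2))                    # reference Q* via value iteration
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

Q = np.zeros((2, 2))
for _ in range(3000):
    s_next = np.array([[rng.choice(2, p=P[s, a]) for a in range(2)] for s in range(2)])
    target = R + gamma * Q[s_next].max(axis=-1)   # sampled Bellman target
    Q += alpha * (target - Q)                     # constant-stepsize update
print("max |Q - Q*| after training:", float(np.abs(Q - Q_star).max()))
```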
[574] An Efficient Black-Box Reduction from Online Learning to Multicalibration, and a New Route to $Φ$-Regret Minimization
Gabriele Farina, Juan Carlos Perdomo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We give a Gordon-Greenwald-Marks (GGM) style black-box reduction from online learning to online multicalibration. Concretely, we show that to achieve high-dimensional multicalibration with respect to a class of functions H, it suffices to combine any no-regret learner over H with an expected variational inequality (EVI) solver. We also prove a converse statement showing that efficient multicalibration implies efficient EVI solving, highlighting how EVIs in multicalibration mirror the role of fixed points in the GGM result for $Φ$-regret. This first set of results resolves the main open question in Garg, Jung, Reingold, and Roth (SODA '24), showing that oracle-efficient online multicalibration with $\sqrt{T}$-type guarantees is possible in full generality. Furthermore, our GGM-style reduction unifies the analyses of existing online multicalibration algorithms, enables new algorithms for challenging environments with delayed observations or censored outcomes, and yields the first efficient black-box reduction between online learning and multiclass omniprediction. Our second main result is a fine-grained reduction from high-dimensional online multicalibration to (contextual) $Φ$-regret minimization. Together with our first result, this establishes a new route from external regret to $Φ$-regret that bypasses sophisticated fixed-point or semi-separation machinery, dramatically simplifies a result of Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25) while improving rates, and yields new algorithms that are robust to richer deviation classes, such as those belonging to any reproducing kernel Hilbert space.
[575] SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets
Inhyeok Choi, Hyuncheol Park
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Edge-cloud hybrid inference offloads difficult inputs to a powerful remote model, but the uplink channel imposes hard per-request constraints on the number of bits that can be transmitted. We show that selecting transmitted content based solely on attention-based importance, the standard approach in collaborative inference, is inherently limited under hard budgets. Two findings support this claim. First, replacing high-importance units with low-importance but complementary ones improves server accuracy. This shows that what matters is not individual importance but how well the transmitted set covers diverse aspects of the input. Second, spatially uniform selection without any content information achieves competitive accuracy at moderate budgets. This confirms that spatial coverage alone carries independent value. Based on this analysis, we propose SAGE (Semantic Attention-Guided Evidence), a principled, training-free method that combines importance filtering with embedding-diversity sampling. SAGE achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units on ImageNet-1K, substantially outperforming importance-only composition.
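A hedged sketch of the two-stage selection the abstract describes, with an importance pre-filter followed by diversity-driven (farthest-point) sampling under a hard budget; the pre-filter ratio and the specific diversity criterion are illustrative assumptions rather than SAGE's exact choices:

```python
import numpy as np

def select_evidence(importance, embeddings, budget, prefilter_ratio=2.0):
    # Importance pre-filter followed by farthest-point (diversity) sampling
    # under a hard budget; the ratio and diversity criterion are illustrative.
    keep = min(len(importance), int(budget * prefilter_ratio))
    candidates = list(np.argsort(importance)[::-1][:keep])   # importance filter
    chosen = [candidates[0]]                                  # most important unit
    while len(chosen) < min(budget, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        dists = [min(np.linalg.norm(embeddings[c] - embeddings[s]) for s in chosen)
                 for c in remaining]
        chosen.append(remaining[int(np.argmax(dists))])       # most complementary
    return sorted(chosen)

rng = np.random.default_rng(1)
print(select_evidence(rng.uniform(size=20), rng.normal(size=(20, 8)), budget=5))
```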
[576] Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification
Xudong Jian, Charikleia Stoura, Simon Scandella, Eleni Chatzi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Damage identification is a core task in structural health monitoring. In practice, however, its reliability is often compromised by confounding non-damage effects, such as variations in excitation and environmental conditions, which can induce changes comparable to or larger than those caused by structural damage. To address this challenge, this study proposes a self-supervised label-free disentangled representation learning framework for robust vibration-based structural damage identification. The proposed framework employs an autoencoder with two latent representations to learn directly from raw vibration acceleration signals. A self-supervised invariance regularization, implemented via Variance-Invariance-Covariance Regularization (VICReg), is imposed on one latent representation using baseline data where structural damage is assumed constant but operational and environmental conditions vary. In addition, a frequency-domain constraint is introduced to enforce agreement between the power spectral density reconstructed from the latent representation and that computed from the corresponding input time series. Together, these mechanisms promote disentanglement, enabling the learned representation to be sensitive to damage-related characteristics while remaining invariant to nuisance variability. The framework is trained in a fully end-to-end and label-free manner, requiring no prior information on damage, excitation, or environmental conditions, making it well-suited for real-world applications. Its effectiveness is validated on two distinct real-world vibration datasets, including a bridge and a gearbox. The results demonstrate robustness to operational variability, strong generalization capability, and good performance in both damage detection and quantification.
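The invariance regularizer named above, VICReg, has a standard form that can be sketched directly; the coefficients below are the usual VICReg defaults and the synthetic embeddings are placeholders, not the paper's autoencoder latents:

```python
import numpy as np

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, eps=1e-4):
    # Variance-Invariance-Covariance Regularization on paired embeddings:
    # invariance (MSE), variance (hinge on per-dimension std), covariance
    # (off-diagonal suppression). Coefficients are the common VICReg defaults.
    inv = ((z1 - z2) ** 2).mean()
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.maximum(0.0, 1.0 - std).mean()
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (len(z) - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return (off_diag ** 2).sum() / z.shape[1]
    return lam * inv + mu * (var_term(z1) + var_term(z2)) + nu * (cov_term(z1) + cov_term(z2))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(64, 8))                 # baseline-condition embeddings
z2 = z1 + 0.1 * rng.normal(size=(64, 8))      # same damage state, varied conditions
print(round(vicreg_loss(z1, z2), 3))
```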
[577] HardNet++: Nonlinear Constraint Enforcement in Neural Networks
Andrea Goertzen, Kaveh Alim, Navid Azizan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Enforcing constraint satisfaction in neural network outputs is critical for safety, reliability, and physical fidelity in many control and decision-making applications. While soft-constrained methods penalize constraint violations during training, they do not guarantee constraint adherence during inference. Other approaches guarantee constraint satisfaction via specific parameterizations or a projection layer, but are tailored to specific forms (e.g., linear constraints), limiting their utility in other general problem settings. Many real-world problems of interest are nonlinear, motivating the development of methods that can enforce general nonlinear constraints. To this end, we introduce HardNet++, a constraint-enforcement method that simultaneously satisfies linear and nonlinear equality and inequality constraints. Our approach iteratively adjusts the network output via damped local linearizations. Each iteration is differentiable, admitting an end-to-end training framework, where the constraint satisfaction layer is active during training. We show that under certain regularity conditions, this procedure can enforce nonlinear constraint satisfaction to arbitrary tolerance. Finally, we demonstrate tight constraint adherence without loss of optimality in a learning-for-optimization context, where we apply this method to a model predictive control problem with nonlinear state constraints.
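The core operation, iteratively adjusting the output via damped local linearizations of the constraints, can be sketched generically; the code below is a Gauss-Newton-style correction on a toy constraint, not the exact HardNet++ layer, and the damping and iteration count are illustrative:

```python
import numpy as np

def enforce_constraints(y, h, jac, steps=20, damping=0.5):
    # Generic damped Gauss-Newton-style correction toward h(y) = 0:
    # linearize the constraints at y and take a damped least-squares step.
    for _ in range(steps):
        y = y - damping * np.linalg.pinv(jac(y)) @ h(y)
    return y

# Toy nonlinear equality constraint: output must lie on the unit circle.
h = lambda y: np.array([y[0] ** 2 + y[1] ** 2 - 1.0])
jac = lambda y: np.array([[2.0 * y[0], 2.0 * y[1]]])
raw_output = np.array([1.5, 0.8])             # hypothetical raw network output
corrected = enforce_constraints(raw_output, h, jac)
print(corrected, h(corrected))
```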
[578] Budgeted Online Influence Maximization
Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach better models the real-world setting where the cost of influencers varies and advertisers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge-level semi-bandit feedback, and provide both theoretical and experimental results. Our analysis is also valid for the cardinality constraint setting and improves the state-of-the-art regret bound in this case.
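To make the budgeted objective concrete, the sketch below runs a cost-aware greedy heuristic on a synthetic independent-cascade graph; the paper's contribution is the online semi-bandit algorithm, which must additionally learn the edge probabilities and is not modeled in this offline sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
p_edge = rng.uniform(0.0, 0.1, size=(n, n))   # independent-cascade probabilities
costs = rng.uniform(0.5, 2.0, size=n)         # heterogeneous influencer costs

def estimated_spread(seeds, runs=100):
    # Monte Carlo estimate of independent-cascade spread from a seed set.
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            new = []
            for u in frontier:
                for v in range(n):
                    if v not in active and rng.random() < p_edge[u, v]:
                        active.add(v)
                        new.append(v)
            frontier = new
        total += len(active)
    return total / runs

# Cost-aware greedy selection under a total budget (offline heuristic only).
budget, seeds, spent = 4.0, [], 0.0
while True:
    base = estimated_spread(seeds)
    best, best_ratio = None, 0.0
    for v in range(n):
        if v in seeds or spent + costs[v] > budget:
            continue
        ratio = (estimated_spread(seeds + [v]) - base) / costs[v]
        if ratio > best_ratio:
            best, best_ratio = v, ratio
    if best is None:
        break
    seeds.append(best)
    spent += costs[best]
print("seeds:", seeds, "total cost:", round(spent, 2))
```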
[579] PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models
Salvatore Greco, Jacek Karolczak, Roman Słowiński, Jerzy Stefanowski
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Explainable artificial intelligence (XAI) has predominantly focused on generating model-centric explanations that approximate the behavior of black-box models. However, such explanations often overlook a fundamental aspect of interpretability: different users require different explanations depending on their goals, preferences, and cognitive constraints. Although recent work has explored user-centric and personalized explanations, most existing approaches rely on heuristic adaptations or implicit user modeling, lacking a principled framework for representing and learning individual preferences. In this paper, we consider Preference-Based Explainable Artificial Intelligence (PREF-XAI), a novel perspective that reframes explanation as a preference-driven decision problem. Within PREF-XAI, explanations are not treated as fixed outputs, but as alternatives to be evaluated and selected according to user-specific criteria. Under this perspective, we propose a methodology that combines rule-based explanations with formal preference learning. User preferences are elicited through a ranking of a small set of candidate explanations and modeled via an additive utility function inferred using robust ordinal regression. Experimental results on real-world datasets show that PREF-XAI can accurately reconstruct user preferences from limited feedback, identify highly relevant explanations, and discover novel explanatory rules not initially considered by the user. Beyond the proposed methodology, this work establishes a connection between XAI and preference learning, opening new directions for interactive and adaptive explanation systems.
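A simplified instance of the preference-learning step: fitting a linear additive utility to a user-supplied ranking with a maximum-margin LP. This stands in for the full robust ordinal regression machinery, and the explanation features and ranking below are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def fit_additive_utility(explanation_features, ranking):
    # Fit a linear additive utility u(x) = w.x that reproduces the ranking
    # with maximal margin (simplified UTA-style LP, not the paper's full
    # robust ordinal regression).
    X = np.asarray(explanation_features, dtype=float)
    d = X.shape[1]
    A_ub, b_ub = [], []
    for better, worse in zip(ranking[:-1], ranking[1:]):
        diff = X[better] - X[worse]
        A_ub.append(np.append(-diff, 1.0))       # -w.diff + eps <= 0
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(d), 0.0)]           # weights sum to one
    c = np.zeros(d + 1)
    c[-1] = -1.0                                  # maximize the margin eps
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0],
                  bounds=[(0, None)] * d + [(None, 1.0)])
    return res.x[:d], res.x[-1]

# Three candidate rule explanations described by (coverage, precision, length),
# ranked by a hypothetical user from most to least preferred.
feats = [[0.8, 0.9, 0.2], [0.6, 0.95, 0.1], [0.9, 0.7, 0.6]]
weights, margin = fit_additive_utility(feats, ranking=[1, 0, 2])
print(np.round(weights, 3), round(margin, 3))
```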
[580] Planning in entropy-regularized Markov decision processes and games
Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Ménard, Rémi Munos, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\tilde{\mathcal{O}}(1/ε^4)$ for a desired accuracy $ε$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.
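The smoothness the abstract relies on comes from entropy regularization of the Bellman operator; a toy tabular sketch of the resulting soft (log-sum-exp) backup is shown below. The planner itself, which works from a generative model, is not reproduced, and the MDP and temperature are illustrative:

```python
import numpy as np

def soft_bellman_backup(Q, P, R, gamma, tau):
    # Entropy-regularized Bellman backup: the hard max is replaced by a
    # temperature-tau log-sum-exp, which is smooth in Q.
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))   # soft value per state
    return R + gamma * P @ V

rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] over next states
R = rng.uniform(size=(S, A))
Q = np.zeros((S, A))
for _ in range(200):
    Q = soft_bellman_backup(Q, P, R, gamma=0.9, tau=0.5)
print(np.round(Q, 3))
```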
[581] On two ways to use determinantal point processes for Monte Carlo integration
Guillaume Gautier, Rémi Bardenet, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The standard Monte Carlo estimator $\widehat{I}_N^{\mathrm{MC}}$ of $\int f\,dω$ relies on independent samples from $ω$ and has variance of order $1/N$. Replacing the samples with a determinantal point process (DPP), a repulsive distribution, makes the estimator consistent, with variance rates that depend on how the DPP is adapted to $f$ and $ω$. We examine two existing DPP-based estimators: one, by Bardenet & Hardy (2020), achieves a rate of $\mathcal{O}(N^{-(1+1/d)})$ for smooth $f$ but relies on a fixed DPP; the other, by Ermakov & Zolotukhin (1960), is unbiased with a rate of order $1/N$, like Monte Carlo, but its DPP is tailored to $f$. We revisit these estimators, generalize them to continuous settings, and provide sampling algorithms.
[582] Ultrametric OGP - parametric RDT symmetric binary perceptron connection
Mihailo Stojnic
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In [97,99,100], an fl-RDT framework is introduced to characterize statistical computational gaps (SCGs). Studying symmetric binary perceptrons (SBPs), [100] obtained an algorithmic threshold estimate $α_a\approx α_c^{(7)}\approx 1.6093$ at the 7th lifting level (for $κ=1$ margin), closely approaching the $1.58$ local entropy (LE) prediction [18]. In this paper, we further connect parametric RDT to overlap gap properties (OGPs), another key geometric feature of the solution space. Specifically, for any positive integer $s$, we consider $s$-level ultrametric OGPs ($ult_s$-OGPs) and rigorously upper-bound the associated constraint densities $α_{ult_s}$. To achieve this, we develop an analytical union-bounding program consisting of combinatorial and probabilistic components. By casting the combinatorial part as a convex problem and the probabilistic part as a nested integration, we conduct numerical evaluations and obtain that the tightest bounds at the first two levels, $\bar{α}_{ult_1} \approx 1.6578$ and $\bar{α}_{ult_2} \approx 1.6219$, closely approach the 3rd and 4th lifting level parametric RDT estimates, $α_c^{(3)} \approx 1.6576$ and $α_c^{(4)} \approx 1.6218$. We also observe excellent agreement across other key parameters, including overlap values and the relative sizes of ultrametric clusters. Based on these observations, we propose several conjectures linking $ult$-OGP and parametric RDT. Specifically, we conjecture that the algorithmic threshold satisfies $α_a=\lim_{s\rightarrow\infty} α_{ult_s} = \lim_{s\rightarrow\infty} \bar{α}_{ult_s} = \lim_{r\rightarrow\infty} α_{c}^{(r)}$, and that $α_{ult_s} \leq α_{c}^{(s+2)}$ (with possible equality for some, maybe even all, $s$). Finally, we discuss the potential existence of a full isomorphism connecting all key parameters of $ult$-OGP and parametric RDT.
[583] Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes
Jake Lee
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique – which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm – we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical $O(N)$ time complexity reductions compared to the $O(N \log N)$ exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.
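A hedged sketch of skewness-adaptive mean/standard-deviation cutpoints: the shrink rule and clipping below are illustrative assumptions, not the exact AMSD formula, and serve only to show how intervals narrow as skewness grows:

```python
import numpy as np

def amsd_cutpoints(x, base_k=1.0, shrink=0.25):
    # Skewness-adaptive mean/std cutpoints: the standard MSD cutpoints
    # mean +/- k*std are narrowed as |skewness| grows. The shrink rule and
    # clipping are illustrative assumptions, not the exact AMSD formula.
    mu, sd = x.mean(), x.std()
    skewness = ((x - mu) ** 3).mean() / sd ** 3
    k = max(0.25, base_k - shrink * abs(skewness))
    return mu - k * sd, mu + k * sd

rng = np.random.default_rng(0)
print("symmetric:", np.round(amsd_cutpoints(rng.normal(size=2000)), 3))
print("skewed:   ", np.round(amsd_cutpoints(rng.exponential(size=2000)), 3))
```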
[584] Benign Overfitting in Adversarial Training for Vision Transformers
Jiaming Zhang, Meng Ding, Shaopeng Fu, Jingfeng Zhang, Di Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signal-to-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as benign overfitting, previously only observed in CNNs (with adversarial training). Experiments on both synthetic and real-world datasets further validate our theoretical findings.
[585] FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning
Abdulmoneam Ali, Ahmed Arafa
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Personalized Federated Learning (PFL) aims to learn multiple task-specific models rather than a single global model across heterogeneous data distributions. Existing PFL approaches typically rely on iterative optimization, such as model update trajectories, to cluster users that need to accomplish the same tasks together. However, these learning-dynamics-based methods are inherently vulnerable to low-quality data and noisy labels, as corrupted updates distort clustering decisions and degrade personalization performance. To tackle this, we propose FB-NLL, a feature-centric framework that decouples user clustering from iterative training dynamics. By exploiting the intrinsic heterogeneity of local feature spaces, FB-NLL characterizes each user through the spectral structure of the covariances of their feature representations and leverages subspace similarity to identify task-consistent user groupings. This geometry-aware clustering is label-agnostic and is performed in a one-shot manner prior to training, significantly reducing communication overhead and computational costs compared to iterative baselines. Complementing this, we introduce a feature-consistency-based detection and correction strategy to address noisy labels within clusters. By leveraging directional alignment in the learned feature space and assigning labels based on class-specific feature subspaces, our method mitigates corrupted supervision without requiring estimation of stochastic noise transition matrices. In addition, FB-NLL is model-independent and integrates seamlessly with existing noise-robust training techniques. Extensive experiments across diverse datasets and noise regimes demonstrate that our framework consistently outperforms state-of-the-art baselines in terms of average accuracy and performance stability.
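The one-shot, label-agnostic clustering signal, subspace similarity between the spectral structures of per-user feature covariances, can be sketched on synthetic features; the backbone producing the users' local representations is assumed rather than implemented:

```python
import numpy as np

def user_subspace(features, k=3):
    # Top-k eigenvector subspace of a user's feature covariance matrix.
    cov = np.cov(features, rowvar=False)
    _, vecs = np.linalg.eigh(cov)
    return vecs[:, -k:]                           # leading eigenvectors as columns

def subspace_similarity(U, V):
    # Similarity from principal angles: mean squared cosine between subspaces.
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float((s ** 2).mean())

# Synthetic stand-in: users 1 and 2 share a latent task basis, user 3 does not.
rng = np.random.default_rng(0)
basis_a, basis_b = rng.normal(size=(10, 3)), rng.normal(size=(10, 3))
user1 = rng.normal(size=(200, 3)) @ basis_a.T + 0.1 * rng.normal(size=(200, 10))
user2 = rng.normal(size=(200, 3)) @ basis_a.T + 0.1 * rng.normal(size=(200, 10))
user3 = rng.normal(size=(200, 3)) @ basis_b.T + 0.1 * rng.normal(size=(200, 10))
U1, U2, U3 = map(user_subspace, (user1, user2, user3))
print("same task:     ", round(subspace_similarity(U1, U2), 3))
print("different task:", round(subspace_similarity(U1, U3), 3))
```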
[586] FASTER: Value-Guided Sampling for Fast RL
Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .
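A hedged sketch of value-guided filtering during denoising: after each step a value function scores the partially denoised candidates and only the top ones are kept. The denoiser, value function, and keep schedule below are placeholders, not FASTER's trained models:

```python
import numpy as np

def value_guided_denoise(x_T, denoise_step, value_fn, n_steps, keep_schedule):
    # Progressive candidate filtering: denoise, score with a value function,
    # and keep only the top candidates at each step.
    candidates = x_T
    for t, keep in zip(range(n_steps, 0, -1), keep_schedule):
        candidates = denoise_step(candidates, t)
        scores = value_fn(candidates, t)
        candidates = candidates[np.argsort(scores)[::-1][:keep]]
    return candidates[0]                          # best remaining action

rng = np.random.default_rng(0)
target = np.array([0.3, -0.7])                    # stand-in "good" action
denoise_step = lambda x, t: x + 0.2 * (target - x) + 0.05 * rng.normal(size=x.shape)
value_fn = lambda x, t: -np.linalg.norm(x - target, axis=1)   # placeholder value
action = value_guided_denoise(rng.normal(size=(16, 2)), denoise_step, value_fn,
                              n_steps=4, keep_schedule=[8, 4, 2, 1])
print(action)
```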
[587] Safe Continual Reinforcement Learning in Non-stationary Environments
Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro, Gautam Biswas
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system’s lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.
[588] Generalization at the Edge of Stability
Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the "sharpness dimension", and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
[589] Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs
Tianheng Ling, Chao Qian, Gregor Schiele
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2410.03294: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.03294&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[590] Quantum Non-Linear Bandit Optimization
Zakaria Shams Siam, Chaowen Guan, Chong Liu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2503.03023: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.03023&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[591] AutoNFS: Automatic Neural Feature Selection
Witold Wydmański, Marek Śmieja
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2503.13304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.13304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[592] Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels
Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, Christopher G. Brinton
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.18186: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18186&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[593] IMPACT: Importance-Aware Activation Space Reconstruction
Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.03828: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03828&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[594] Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles
Cas Oude Hoekstra, Floris den Hengst
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.08080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.08080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[595] Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
Lorenzo Livi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.12121: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12121&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[596] Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management
Tianheng Ling, Vipin Singh, Chao Qian, Felix Biessmann, Gregor Schiele
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.13905: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13905&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[597] Benchmarking Physics-Informed Neural Networks and Boundary Elements Methods for Wave Scattering
Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.12483: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12483&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[598] Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.26238: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26238&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[599] Conditional Diffusion Modeling with Attention for Probabilistic Battery Capacity Prediction under Real-World Condition
Chunlin Jiang, Hequn Li, Zhongwei Deng, Jie Shao, Zhansheng Ning
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.17414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[600] Physics-Informed Neural Operators for Cardiac Electrophysiology
Hannah Lydon, Milad Kazemi, Martin Bishop, Nicola Paoletti
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.08418: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08418&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[601] Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings
Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang, Ananth Balashankar, Amir Aminifar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.21893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[602] Bayesian Event-Based Model for Disease Subtype and Stage Inference
Hongtao Hao, Joseph L. Austerweil
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.03467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[603] Optimized Architectures for Kolmogorov-Arnold Networks
James Bagrow, Josh Bongard
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.12448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
Kirandeep Kaur, Vinayak Gupta, Aditya Gupta, Chirag Shah
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.09926: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.09926&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families
Lennon Shikhman
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.11428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] MapPFN: Learning Causal Perturbation Maps in Context
Marvin Sextro, Weronika Kłos, Gabriel Dernbach
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.21092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] TreeGrad-Ranker: Feature Ranking via $O(L)$-Time Gradients for Decision Trees
Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11623: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11623&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning
Jon Irureta, Gorka Azkune, Jon Imaz, Aizea Lojo, Javier Fernandez-Marques
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.12708: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12708&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] Drift Localization using Conformal Predictions
Fabian Hinder, Valerie Vaquet, Johannes Brinkrolf, Barbara Hammer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.19790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] Tackling multiphysics problems via finite element-guided physics-informed operator learning
Yusuke Yamazaki, Reza Najian Asl, Markus Apel, Mayu Muramatsu, Shahed Rezaei
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.01420: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01420&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Luca Ambrogioni
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.20092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] A Heterogeneous Long-Micro Scale Cascading Architecture for General Aviation Health Management
Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Wei Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.22885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] Semantic Interaction Information mediates compositional generalization in latent space
John Schwarcz
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.27134: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27134&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
Simon Zhang, Ryan P. DeMilt, Kun Jin, Cathy H. Xia
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] LLM-Extracted Covariates for Clinical Causal Inference: Rethinking Integration Strategies
Lei Liu, Jialin Chen, Kathy Macropol
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] Towards Real-Time ECG and EMG Modeling on $μ$NPUs
Josh Millar, Ashok Samraj Thangarajan, Soumyajit Chatterjee, Hamed Haddadi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] Whispers in the Machine: Confidentiality in Agentic Systems
Jonathan Evertz, Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2402.06922: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.06922&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] Finite-dimensional approximations of push-forwards on locally analytic functionals
Isao Ishikawa
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2404.10769: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.10769&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] Latent Linear Quadratic Regulator for Robotic Control Tasks
Yuan Zhang, Shaohui Yang, Toshiyuki Ohtsuka, Colin Jones, Joschka Boedecker
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2407.11107: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.11107&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] Byzantine-tolerant distributed learning of finite mixture models
Qiong Zhang, Yan Shuo Tan, Jiahua Chen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2407.13980: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.13980&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] Regression with Large Language Models for Materials and Molecular Property Prediction
Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2409.06080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.06080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation
Yilin Xie, Shiqiang Zhang, Joel A. Paulson, Calvin Tsay
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2410.16893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.16893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] The Data-Driven Censored Newsvendor Problem
Chamsi Hssaine, Sean R. Sinclair
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2412.01763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.01763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] Opinion de-polarization in social networks with GNNs
Konstantinos Mylonas, Thrasyvoulos Spyropoulos
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2412.09404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.09404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] DADA: Dual Averaging with Distance Adaptation
Mohammad Moshtaghifar, Anton Rodomanov, Daniil Vankov, Sebastian Stich
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2501.10258: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.10258&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data
Antony Sikorski, Michael Ivanitskiy, Nathan Lenssen, Douglas Nychka, Daniel McKenzie
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.09803: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09803&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[627] Highly Efficient and Effective LLMs with Multi-Boolean Architectures
Ba-Hien Tran, Van Minh Nguyen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.22811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[628] ASVSim (AirSim for Surface Vehicles): A High-Fidelity Simulation Framework for Autonomous Surface Vehicle Research
Bavo Lesy, Siemen Herremans, Robin Kerstens, Jan Steckel, Walter Daems, Siegfried Mercelis, Ali Anwar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.22174: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.22174&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[629] VoteGCL: Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
Minh-Anh Nguyen, Bao Nguyen, Ha Lan N.T., Tuan Anh Hoang, Duc-Trong Le, Dung D. Le
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.21563: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21563&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[630] QTMRL: An Agent for Quantitative Trading Decision-Making Based on Multi-Indicator Guided Reinforcement Learning
Jingfeng Pan, Jiahao Chen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.20467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[631] Energy-Weighted Flow Matching: Unlocking Continuous Normalizing Flows for Efficient and Scalable Boltzmann Sampling
Niclas Dern, Lennart Redl, Sebastian Pfister, Marcel Kollovieh, David Lüdke, Stephan Günnemann
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.03726: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03726&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[632] When Langevin Monte Carlo Meets Randomization: Non-asymptotic Error Bounds beyond Log-Concavity and Gradient Lipschitzness
Xiaojie Wang, Bin Yang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.25630: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25630&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[633] Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
Patrick Forré, Abel Jansma
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.05786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[634] Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization
Simon Idoko, Prajyot Jadhav, Arun Kumar Singh
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.09204: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09204&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[635] Efficient Autoregressive Inference for Transformer Probabilistic Models
Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.09477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[636] Quantifying Data Similarity Using Cross Learning
Shudong Sun, Hao Helen Zhang, Joseph C Watkins
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.10866: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10866&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[637] PriorGuide: Test-Time Prior Adaptation for Simulation-Based Inference
Yang Yang, Severi Rissanen, Paul E. Chang, Nasrulloh Loka, Daolang Huang, Arno Solin, Markus Heinonen, Luigi Acerbi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.13763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[638] Nonmonotone subgradient methods based on a local descent lemma
Francisco J. Aragón-Artacho, Rubén Campoy, Pedro Pérez-Aros, David Torregrosa-Belén
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.19341: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19341&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[639] StrikeWatch: Wrist-worn Gait Recognition with Compact Time-series Models on Low-power FPGAs
Tianheng Ling, Chao Qian, Peter Zdankin, Torben Weis, Gregor Schiele
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.24738: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24738&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[640] Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
Lars van der Laan, Nathan Kallus
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.23805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[641] Local Updates in Distributed Optimization: Provable Acceleration and Topology Effects
Zuang Wang, Yongqiang Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[642] Enforcing Reciprocity in Operator Learning for Seismic Wave Propagation
Caifeng Zou, Yaozhong Shi, Zachary E. Ross, Robert W. Clayton, Kamyar Azizzadenesheli
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11631: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11631&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[643] GaiaFlow: Semantic-Guided Diffusion Tuning for Carbon-Frugal Search
Rong Fu, Jia Yee Tan, Chunlei Meng, Shuo Yin, Xiaowen Ma, Wangyu Wu, Muge Qi, Guangzhen Yao, Zhaolu Kang, Zeli Su, Simon Fong
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2602.15423 returned HTTP 429 (rate limited).
[644] LiveGraph: Active-Structure Neural Re-ranking for Exercise Recommendation
Rong Fu, Zijian Zhang, Haiyun Wei, Jiekai Wu, Kun Liu, Xianda Li, Haoyu Zhao, Yang Li, Yongtai Liu, Ziming Wang, Rui Lu, Simon Fong
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2602.17036 returned HTTP 429 (rate limited).
[645] Spatiotemporal-Aware Bit-Flip Injection on DNN-based Advanced Driver Assistance Systems (extended version)
Taibiao Zhao, Xiang Zhang, Mingxuan Sun, Ruyi Ding, Xugui Zhou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.03753 returned HTTP 429 (rate limited).
[646] An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing
Philipp Röchner, Clarissa Krämer, Johannes U Mayer, Franz Rothlauf, Steffen Albrecht, Maximilian Sprang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.04981 returned HTTP 429 (rate limited).
[647] Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
Hongyi Jin, Bohan Hou, Guanjie Wang, Ruihang Lai, Jinqi Chen, Zihao Ye, Yaxing Cai, Yixin Dong, Xinhao Cheng, Zhihao Zhang, Yilong Zhao, Yingyi Huang, Lijie Yang, Jinchen Jiang, Gabriele Oliaro, Jianan Ji, Xupeng Miao, Vinod Grover, Todd C. Mowry, Zhihao Jia, Tianqi Chen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.13327 returned HTTP 429 (rate limited).
[648] ExoNet: Multimodal Deep Learning for TESS Exoplanet Candidate Identification via Phase-Folded Light Curves, Stellar Parameters, and Multi-Head Attention Fusion
Md. Rashadul Islam
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.15560 returned HTTP 429 (rate limited).
[649] Q-SINDy: Quantum-Kernel Sparse Identification of Nonlinear Dynamics with Provable Coefficient Debiasing
Samrendra Roy, Syed Bahauddin Alam
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.16779 returned HTTP 429 (rate limited).
cs.MA
[650] Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems
Namyoung So, Seokgyu Jang, Taeuk Kim
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Adaptive multi-agent systems (MAS) are increasingly adopted to tackle complex problems. However, the narrow task coverage of their optimization raises the question of whether they can function as general-purpose systems. To address this gap, we conduct an extensive empirical study of adaptive MAS, revealing two key findings: (1) topological overfitting – they fail to generalize across different domains; and (2) illusory coordination – they achieve reasonable surface-level accuracy while the underlying agent interactions diverge from ideal MAS behavior, raising concerns about their practical utility. These findings highlight the pressing need to prioritize generalization in MAS development and motivate evaluation protocols that extend beyond simple final-answer correctness.
[651] Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game
HuaDong Jian, Chenghao Li, Haoyu Wang, Jiajia Shuai, Jinyu Guo, Yang Yang, Chaoning Zhang
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In long-horizon open-world multi-agent systems, existing methods often treat local anomalies as automatic triggers for communication. This default design introduces coordination noise, interrupts local execution, and overuses public interaction in cases that could be resolved locally. To address this issue, we propose a partitioned information architecture for MLLM agents that explicitly separates private execution states from public coordination states. Building on this design, we introduce two key mechanisms. First, we develop an event-triggered working memory based on system-verified outcomes to maintain compact and low-noise local state representations. Second, we propose a cost-sensitive gated escalation mechanism that determines whether cross-region communication should be initiated by jointly considering node criticality, local recovery cost, and downstream task impact. In this way, communication is transformed from a default reaction into a selective decision. Experiments conducted on long-term construction tasks in open environments demonstrate that, compared to baseline models based on strong communication and planned structures, the introduction of gated communication and a partitioned information architecture results in superior performance in terms of blueprint completion quality and execution chain length. It also improves local self-recovery, reduces ineffective escalations, and increases the utility of public communication.
[652] ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies
Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou, Wenjing Lou
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Autonomous AI agents live or die by the API tokens they consume: without paid inference capacity they cannot reason, act, or delegate. Compute-token cost has become the binding resource of the emerging agent economy, yet it is non-transferable: it is account-bound, vendor-specific, and absent from on-chain ledgers. Existing payment rails such as x402 move fiat-backed value between agents, but they do not represent the quantity agents actually burn. As a result, agents can transport purchasing power but cannot quote, escrow, or settle workflows in a unit aligned with compute cost. We present ClawCoin, a tokenized, compute-cost-indexed unit of account and settlement asset for decentralized agent economies. ClawCoin combines four layers: a robust basket index over standardized prices; an oracle publishing signed fresh attestations; a NAV-based mint/redeem vault with coverage thresholds and rate limits; and an on-chain settlement layer for multi-hop delegations. We implement a prototype on an Ethereum-compatible L2 and evaluate it using a multi-agent simulator and the OpenClaw testbed. Across single-agent, multi-agent, workflow, and procurement experiments, ClawCoin stabilizes execution capacity under cost shocks, reduces cross-agent quote dispersion, eliminates partial settlements, and sustains cooperative market dynamics that fiat-denominated baselines cannot. These results suggest that compute-indexed units of account can improve decentralized agent coordination.
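To make the basket-index idea above concrete, here is a minimal Python sketch of a trimmed, weighted index over provider compute prices. The provider names, weights, and trimming rule are illustrative assumptions, not ClawCoin's published specification; the oracle, vault, and settlement layers are omitted.

```python
import numpy as np

def robust_basket_index(prices: dict, weights: dict, trim: float = 0.2) -> float:
    """Trimmed, weighted average over standardized per-unit compute prices from
    several providers, so a single stale or manipulated quote cannot move the index."""
    names = sorted(prices)
    p = np.array([prices[n] for n in names], dtype=float)
    w = np.array([weights.get(n, 1.0) for n in names], dtype=float)
    k = int(trim * len(p))
    order = np.argsort(p)
    keep = order[k:len(p) - k] if len(p) > 2 * k else order  # drop extreme quotes
    return float(np.average(p[keep], weights=w[keep]))

# Example: the value a signed oracle attestation might carry for one epoch
# (provider names and prices are made up; "p5" is an outlier quote that gets trimmed).
index = robust_basket_index(
    {"p1": 2.10, "p2": 2.05, "p3": 2.20, "p4": 1.95, "p5": 9.90},
    {n: 1.0 for n in ["p1", "p2", "p3", "p4", "p5"]},
)
```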
[653] Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
Hongwei Xu
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Teams of LLM agents increasingly collaborate on tasks spanning days or weeks: multi-day data-generation sprints where generator, reviewer, and auditor agents coordinate in real time on overlapping batches; specialists carrying findings forward across session restarts; product decisions compounding over many review rounds. This requires agents to share, evaluate, and combine each other’s cognitive state in real time across sessions. We call this cross-session agent-to-agent cognitive collaboration, distinct from parallel agent execution. To enable it, three problems must be solved together. (P1) Each agent decides field by field what to accept from peers, not accept or reject whole messages. (P2) Every claim is traceable to source, so returning claims are recognised as echoes of the receiver’s own prior thinking. (P3) Memory that survives session restarts is relevant because of how it was stored, not how it is retrieved. These are protocol-level properties at the semantic layer of agent communication, distinct from tool-access and task-delegation protocols at lower layers. We call this missing protocol layer “semantic infrastructure,” and the Mesh Memory Protocol (MMP) specifies it. Four composable primitives work together: CAT7, a fixed seven-field schema for every Cognitive Memory Block (CMB); SVAF, which evaluates each field against the receiver’s role-indexed anchors and realises P1; inter-agent lineage, carried as parents and ancestors of content-hash keys and realising P2; and remix, which stores only the receiver’s own role-evaluated understanding of each accepted CMB, never the raw peer signal, realising P3. MMP is specified, shipped, and running in production across three reference deployments, where each session runs an autonomous agent as a mesh peer with its own identity and memory, collaborating with other agents across the network for collective intelligence.
[654] FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization
Haoran Yin, Zhiyuan Wen, Jiannong Cao, Bo Yuan, Ruosong Yang
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under A→B→A task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.
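As an illustration of the filter-plan-log cascade described above, here is a minimal Python sketch. The Event fields, routing heuristics, and the stubbed VLM call are placeholders, not FOCAL's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    screenshot_id: str
    window_title: str
    ocr_text: str

@dataclass
class TaskLog:
    task_id: str
    notes: list = field(default_factory=list)

def filter_agent(event: Event) -> bool:
    """Cheap noise gate: drop idle/empty frames before any expensive VLM call."""
    return bool(event.ocr_text.strip())            # placeholder heuristic

def brain_agent(event: Event, logs: dict) -> str:
    """Text-only task attribution from window titles / OCR; no visual model involved."""
    for task_id in logs:
        if task_id.lower() in event.window_title.lower():
            return task_id
    return event.window_title or "misc"            # open a new task bucket

def record_agent(event: Event) -> str:
    """Selective visual reasoning, called only for events that survive the filter
    (a stub stands in for the VLM here)."""
    return f"[visual summary of {event.screenshot_id}]"

def memory_agent(logs: dict, task_id: str, note: str) -> None:
    """Task-isolated memory: each task keeps its own context to avoid cross-task pollution."""
    logs.setdefault(task_id, TaskLog(task_id)).notes.append(note)

def focal_pipeline(stream):
    logs: dict = {}
    for event in stream:
        if not filter_agent(event):                # 1. filter
            continue
        task_id = brain_agent(event, logs)         # 2. plan / attribute
        note = record_agent(event)                 # 3. log with selective visual reasoning
        memory_agent(logs, task_id, note)          # 4. task-isolated summarization memory
    return logs
```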
[655] TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems
Jiale Liu, Victor S. Bursztyn, Lin Ai, Haoliang Wang, Sunav Choudhary, Saayan Mitra, Qingyun Wu
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In open-ended domains, teams must reconcile diverse viewpoints to produce strong deliverables. Answer aggregation approaches commonly used in closed domains are ill-suited to this setting, as they tend to suppress minority perspectives rather than resolve underlying disagreements. We present TeamFusion, a multi-agent system designed to support teamwork in open-ended domains by: 1. Instantiating a proxy agent for each team member conditioned on their expressed preferences; 2. Conducting a structured discussion to surface agreements and disagreements; and 3. Synthesizing more consensus-oriented deliverables that feed into new iterations of discussion and refinement. We evaluate TeamFusion on two teamwork tasks where team members can assess how well their individual views are represented in team decisions and how consensually strong the final deliverables are, finding that it outperforms direct aggregation baselines across metrics, tasks, and team configurations.
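A minimal sketch of the proxy-discussion-synthesis loop described above, assuming a generic call_llm wrapper; the prompts and round structure are illustrative, not the authors' system.

```python
def call_llm(prompt: str) -> str:
    """Placeholder LLM call; swap in any chat-completion API."""
    return f"[response to: {prompt[:60]}...]"

def make_proxy(member: str, preferences: str):
    """Step 1: one proxy agent per team member, conditioned on their stated preferences."""
    def proxy(topic: str, transcript: list) -> str:
        return call_llm(
            f"You represent {member}, whose preferences are: {preferences}\n"
            "Discussion so far:\n" + "\n".join(transcript) +
            f"\nState your position on '{topic}', noting agreements and disagreements."
        )
    return proxy

def team_fusion(topic: str, members: dict, rounds: int = 3) -> str:
    proxies = {name: make_proxy(name, prefs) for name, prefs in members.items()}
    deliverable = ""
    for _ in range(rounds):
        transcript = []
        for name, proxy in proxies.items():        # Step 2: structured discussion
            transcript.append(f"{name}: {proxy(topic, transcript)}")
        deliverable = call_llm(                    # Step 3: consensus-oriented synthesis
            "Synthesize a deliverable that resolves the disagreements below rather than "
            "averaging them away:\n" + "\n".join(transcript) +
            (f"\nPrevious draft:\n{deliverable}" if deliverable else "")
        )
    return deliverable
```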
[656] Cost-Aware Distributed Online Learning with Strict Rejection Behavior against Adversarial Agents
Yuhan Suo, Senchun Chai, Xudong Zhao, Yuanqing Xia, and Runqi Chai
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Distributed online learning in multi-agent systems is highly vulnerable to adversarial influence, especially when malicious agents cannot be fully isolated during the transient stage. While existing studies mainly pursue resilient consensus or secure fusion, they pay much less attention to the learning inefficiency and extra evolution cost accumulated during the defense process. This paper addresses this gap by developing a cost-aware distributed online learning framework with strict rejection behavior against adversarial agents. Under this mechanism, the state evolution cost of online adaptation is formulated and the cost amplification effect caused by adversarial interactions is theoretically characterized. To balance robustness, convergence efficiency, and long-term cost, we propose an adaptive adjustment mechanism for the state-evolution rate. The resulting outer-layer update can be equivalently viewed as a constrained online optimization problem. We further establish the well-posedness and regularity of the associated periodic Riccati layer, and show that the outer-layer update ensures feasibility and controlled variation. Based on these properties, closed-loop practical stability is rigorously established via a two-time-scale Lyapunov framework. Simulations demonstrate that the proposed method achieves robust and low-cost convergence under adversarial disturbances. Furthermore, a multi-satellite target tracking scenario with malicious interference demonstrates the practical effectiveness of the strict rejection behavior.
[657] AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations
Jenny Ma, Riya Sahni, Karthik Sreedhar, Lydia B. Chilton
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user-specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called nudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain their notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing mechanics and dynamics of multi-agent simulations.
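The nudging mechanism can be illustrated with a toy control loop: milestones are checked each step, and when progress stalls, the system appends gentle guidance to the shared context rather than restarting the simulation. All names, thresholds, and the simulation stub below are hypothetical.

```python
import random

def run_step(state: dict) -> dict:
    """Placeholder for one simulation step; a real system would advance agent actions here."""
    state["progress"] += random.uniform(0.0, 0.1)
    return state

def nudge(state: dict, note: str) -> dict:
    """Gentle intervention: add guidance to the shared context instead of resetting the run."""
    state["shared_context"].append(f"[nudge] {note}")
    return state

def simulate_with_nudging(milestones, max_steps: int = 200, patience: int = 10):
    state = {"progress": 0.0, "shared_context": []}
    next_ms, stalled = 0, 0
    for _ in range(max_steps):
        state = run_step(state)
        if next_ms < len(milestones) and state["progress"] >= milestones[next_ms]:
            next_ms, stalled = next_ms + 1, 0      # milestone reached on schedule
        else:
            stalled += 1
        if stalled > patience:                     # dynamics deviating from intended outcomes
            state = nudge(state, f"Refocus agents on milestone {next_ms}.")
            stalled = 0
    return state

final_state = simulate_with_nudging([0.3, 0.6, 0.9])
```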
[658] OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration
Shijun Li, Hilaf Hasson, Joydeep Ghosh
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on code generation, arithmetic reasoning, and general reasoning tasks against state-of-the-art approaches.
[659] CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation
Kuo Tian, Pengfei Sun, Zhen Wu, Junran Ding, Xinyu Dai
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The autonomous synthesis of deep research reports represents a critical frontier for Large Language Models (LLMs), demanding sophisticated information orchestration and non-linear narrative logic. Current approaches rely on rigid predefined linear workflows, which cause error accumulation, preclude global restructuring from subsequent insights, and ultimately limit in-depth multimodal fusion and report quality. We propose CogGen, a Cognitively inspired recursive framework for deep research report Generation. Leveraging a Hierarchical Recursive Architecture to simulate cognitive writing, CogGen enables flexible planning and global restructuring. To extend this recursivity to multimodal content, we introduce Abstract Visual Representation (AVR): a concise intent-driven language that iteratively refines visual-text layouts without pixel-level regeneration overhead. We further present CLEF, a Cognitive Load Evaluation Framework, and curate a new benchmark from Our World in Data (OWID). Extensive experiments show CogGen achieves state-of-the-art results among open-source systems, generating reports comparable to professional analysts’ outputs and surpassing Gemini Deep Research. Our code and dataset are available at https://github.com/NJUNLP/CogGen.
[660] Multi-UAV Path Following using Vector-Field Guidance
Gautam Kumar, Amit Shivam, Ashwini Ratnoo
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a decentralized, collision-free framework for path following guidance of multiple uncrewed aerial vehicles (UAVs), while maintaining uniform spacing along a reference path. A vector field-based guidance law is employed to drive each UAV toward the reference path. A rotational repulsion mechanism, utilizing relative distance and bearing between UAVs, is proposed to avoid collisions during convergence to the path, and an inter-UAV spacing error-based velocity control law is presented to achieve uniform separation along the path. Analytical guarantees are established for collision avoidance and convergence of the inter-UAV spacing errors to zero, ensuring uniform separation along the path. Numerical simulations demonstrate the efficacy of the proposed method.
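For intuition, a minimal numpy sketch of vector-field path following toward a circular reference path, with a rotational repulsion term that steers away from close neighbours rather than pushing straight back. The specific field, gains, and the absence of spacing control are illustrative simplifications, not the paper's guidance law.

```python
import numpy as np

def vector_field(p, center=np.zeros(2), radius=10.0, k=1.0):
    """Guidance field for a circular reference path: follow the tangent while
    correcting the signed distance error toward the path."""
    r = p - center
    dist = np.linalg.norm(r) + 1e-9
    radial = r / dist
    tangent = np.array([-radial[1], radial[0]])
    e = dist - radius                      # signed path error
    v = tangent - k * e * radial
    return v / (np.linalg.norm(v) + 1e-9)

def rotational_repulsion(p_i, p_j, safe_dist=2.0, gain=2.0):
    """Rotate the commanded direction away from a nearby UAV instead of pushing
    straight back, which helps avoid deadlock while converging to the path."""
    d = p_i - p_j
    dist = np.linalg.norm(d) + 1e-9
    if dist > safe_dist:
        return np.zeros(2)
    perp = np.array([-d[1], d[0]]) / dist
    return gain * (safe_dist - dist) * perp

def velocity_command(p_i, neighbours, speed=1.0):
    u = vector_field(p_i)
    for p_j in neighbours:
        u = u + rotational_repulsion(p_i, p_j)
    return speed * u / (np.linalg.norm(u) + 1e-9)
```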
[661] Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation
Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Qingyun Zou, Qian Wang, Bingsheng He
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-agent systems (MAS) are increasingly used for open-ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS-based ideation across three bottom-up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per-sample quality. At the cognition level, authority-driven dynamics suppress semantic diversity compared to junior-dominated groups. At the system level, group-size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra-Computing/MAS_Diversity.
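One simple way to quantify the diversity collapse discussed above is a mean pairwise cosine distance over idea embeddings; the sketch below is a generic proxy, not necessarily the paper's exact metric.

```python
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over a set of idea embeddings of shape (n, d).
    Values near 0 indicate collapse onto a single idea; larger values indicate a
    more spread-out solution space."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    x = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    sims = x @ x.T
    iu = np.triu_indices(n, k=1)
    return float(np.mean(1.0 - sims[iu]))

# Tracking collapse across discussion rounds (embeddings from any sentence encoder):
# for r, ideas in enumerate(rounds_of_embeddings):
#     print(r, semantic_diversity(ideas))
```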
cs.MM
[662] Smiling Regulates Emotion During Traumatic Recollection
Marcus Ma, Emily Zhou, Leonard Ludwig, Julia Hörath, Christina Winkler, Kleanthis Avramidis, Tiantian Feng, Gabor Toth, Alina Bothe, Shrikanth Narayanan
Main category: cs.MM
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We study when, where, and why 978 Holocaust survivors smile in video testimonies. We create an automatic smile detection model from facial features with an F1 of 85% and annotate detected smiles under two established taxonomies of smiling. We produce narrative features on 1,083,417 transcript sentences as well as emotional valence from three different modalities: audio, eye gaze, and text transcript. Smiling rates are significantly correlated with specific semantic topics, narrative structures, and temporal syntaxes across the entire corpus. Smiles often occur during periods of intense negative affect; these negative-affect smiles improve the valence trajectory of surrounding sentences significantly across all three modalities. Smiling reduces eye dynamics and blink rates, and the strength of both of these effects is also modulated by narrative valence. Taken together, we conclude that smiling plays a critical role in regulating emotion and social interaction during traumatic recollection.
eess.AS
[663] Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara, Naoto Wakatsuki
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cutoff frequency and the signal high-pass cutoff frequency through a single RC time constant, creating a trade-off between noise reduction and signal bandwidth. This paper proposes PDS-Amp (Photoelectric DC Servo Amplifier), a circuit technique that replaces the gate-bias resistor with a photoelectric element functioning as an ultra-high-impedance current source. A DC servo loop using lag-lead compensation feeds back the preamplifier output through an LED to control the photocurrent, thereby stabilizing the gate bias while decoupling the noise and signal cutoff frequencies. A custom photosensor based on the external photoelectric effect of a zinc photocathode was fabricated to achieve sub-picoampere dark current, overcoming the limitations of commercial semiconductor photodiodes. Combined with a cascode JFET preamplifier that minimizes input capacitance through bootstrap action, PDS-Amp achieved a self-noise of 11 dBA with a 12 pF dummy microphone. Despite using a small-diameter ECM capsule, this performance is comparable to that of large-diaphragm condenser microphones costing several thousand dollars. Recording experiments with an actual ECM capsule qualitatively confirmed a significant reduction in background noise. The proposed technique is applicable not only to microphones but broadly to capacitive sensors including accelerometers, pressure sensors, and pyroelectric sensors.
[664] Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
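A minimal PyTorch sketch of mode-consistency regularization: the same batch is passed through the model in offline and streaming modes, and a symmetric KL term between the two output lattices is added to the transducer loss. The model and loss interfaces are assumptions, and the paper's efficient Triton kernel is not reproduced here.

```python
import torch
import torch.nn.functional as F

def mode_consistency_loss(logits_offline: torch.Tensor,
                          logits_streaming: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between the offline and streaming output lattices.
    Both tensors are assumed to have shape (B, T, U, V), as produced by an RNNT joint."""
    log_off = F.log_softmax(logits_offline, dim=-1)
    log_str = F.log_softmax(logits_streaming, dim=-1)
    kl_a = F.kl_div(log_str, log_off, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(log_off, log_str, log_target=True, reduction="batchmean")
    return 0.5 * (kl_a + kl_b)

def unified_training_step(model, batch, rnnt_loss_fn, alpha: float = 0.1):
    # Two forward passes over the same utterances: full context vs. chunk-limited attention.
    logits_off = model(batch, streaming=False)     # hypothetical mode flag
    logits_str = model(batch, streaming=True)
    loss = (rnnt_loss_fn(logits_off, batch)
            + rnnt_loss_fn(logits_str, batch)
            + alpha * mode_consistency_loss(logits_off, logits_str))
    return loss
```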
[665] Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Jianbo Ma, Richard Cartwright
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.
[666] Computational Narrative Understanding for Expressive Text-to-Speech
Gaspard Michel, Elena V. Epure, Christophe Cerisara
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building from this observation, we introduce LibriQuote, a large-scale dataset of 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., “he whispered softly”). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances the expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expressive speech. We publicly release the dataset, code, and evaluation resources to facilitate reproducibility. Audio samples can be found at https://libriquote.github.io/.
eess.IV
[667] A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation
Nichula Wasalathilaka, Dineth Perera, Oshadha Samarakoon, Buddhi Wijenayake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.
[668] VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport
Yetao He, Wenhan Guo, Deliang Wei, Evan Bel, Ji Yi, Yu Sun
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Three-dimensional (3D) wide-field fluorescence microscopy is a widely used modality for volumetric imaging, but suffers from characteristic out-of-focus blur. Existing reconstruction methods either struggle to operate on high-dimensional volumes or fail to provide credibility characterization of the reconstruction. In this work, we introduce Volumetric Transport (VOLT), a 3D-native probabilistic framework for wide-field fluorescence microscopy reconstruction. VOLT combines a transport-based formulation that maps degraded measurements to clean volumes via stochastic interpolants with a 3D-native anisotropic network that separates lateral and axial processing. This design operates directly in voxel space and achieves improved scalability to large volumes without relying on slice-wise approximations. We develop both stochastic (SDE) and deterministic (ODE) variants within the same framework. We validate VOLT on simulated wide-field microscopy datasets. Our results show that VOLT significantly improves reconstruction quality in both lateral and axial directions while providing voxel-wise credibility estimates.
[669] ExplainS2A: Explainable Spectral-Spatial Duality Model for Fast Transforming Sentinel-2 Image to AVIRIS-Level Hyperspectral Image
Chia-Hsiang Lin, Zi-Chao Leng
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Mainstream optical satellites often acquire multispectral, multi-resolution images, which have limited material identifiability compared to hyperspectral images (HSIs). Thus, spectrally super-resolving a multispectral image (MSI) into its hyperspectral counterpart greatly facilitates remote material identification and the downstream tasks. However, spectrally super-resolving the MSI into an HSI is often constrained by the multi-resolution nature of the sensor. Specifically, due to the presence of some low-resolution (LR) bands in the MSI, the initial spectral super-resolution results often appear to be spatially blurry, resulting in an LR HSI. To overcome this bottleneck, we then leverage some high-resolution (HR) band inherent in the acquired MSI to spatially guide the reconstruction procedure, thereby yielding the desired HR HSI. This fusion procedure elegantly coincides with a widely known spatial super-resolution problem in satellite remote sensing. Hence, we have reformulated the tough spectral super-resolution problem into a more widely investigated spatial super-resolution problem, referred to as the spectral-spatial duality theory. Accordingly, we propose ExplainS2A, consisting of a deep unfolding network and an explainable fusion network, that unifies spectral recovery and spatial fusion into a single explainable framework. Unlike conventional black-box models, ExplainS2A offers interpretability and operates as a linear-time algorithm. Remarkably, it can process a million-scale Sentinel-2 image in less than one second, yielding a high-fidelity HSI over the same scene, and upgrades the blind source separation results. Although demonstrated on the Sentinel-2 and AVIRIS sensors, ExplainS2A also serves as a general framework applicable to various sensor pairs with different resolution configurations, and has experimentally demonstrated cross-region and cross-season generalization ability. Source codes: https://github.com/IHCLab/ExplainS2A.
[670] Deep Image Prior for photoacoustic tomography can mitigate limited-view artifacts
Hanna Pulkkinen, Jenni Poimala, Leonid Kunyansky, Janek Gröhl, Andreas Hauptmann
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We study the deep image prior (DIP) framework applied to photoacoustic tomography (PAT) as an unsupervised reconstruction approach to mitigate limited-view artifacts and noise commonly encountered in experimental settings. Efficient implementation is achieved by employing recently published fast forward and adjoint algorithms for circular measurement geometries. Initialization via a fast inverse and total variation (TV) regularization are applied to further suppress noise and mitigate overfitting. For comparison, we compute a classical TV reconstruction. Our experiments comprise simulated PAT measurements under limited-view geometries and varying levels of added noise, as well as experimental measurements evaluated with a digital twin for quality assessment. Our findings suggest that the DIP framework provides an effective unsupervised strategy for robust PAT reconstruction, even in the challenging case of a limited-view geometry, providing improvement in several quantitative measures over total variation reconstructions.
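A minimal PyTorch sketch of the deep image prior loop with TV regularization and a fast-inverse warm start; forward_op stands in for the fast PAT forward operator, and the tiny CNN and residual initialization are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn

def tv_loss(x: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation of an image batch shaped (N, C, H, W)."""
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    return dh + dw

def dip_reconstruct(y, forward_op, x_init, iters: int = 2000, lam: float = 1e-3, lr: float = 1e-3):
    """y: measured PAT data; forward_op: image -> predicted data (e.g. a fast
    circular-geometry operator); x_init: fast-inverse reconstruction used as a warm start."""
    net = nn.Sequential(                            # placeholder generator network
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )
    z = torch.randn_like(x_init)                    # fixed noise input, as in DIP
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        x = net(z) + x_init                         # residual over the warm start (an assumption)
        loss = ((forward_op(x) - y) ** 2).mean() + lam * tv_loss(x)
        loss.backward()
        opt.step()
    return (net(z) + x_init).detach()
```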
[671] Harmonizing MR Images Across 100+ Scanners: Multi-site Validation with Traveling Subjects and Real-world Protocols
Savannah P. Hays, Lianrui Zuo, Muhammad Faizyab Ali Chaudhary, Kathleen M. Bartz, Samuel W. Remedios, Jinwei Zhang, Jiachen Zhuo, Murat Bilgel, Shiv Saidha, Ellen M. Mowry, Scott D. Newsome, Jerry L. Prince, Blake E. Dewey, Aaron Carass
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reliable harmonization of heterogeneous magnetic resonance (MR) image datasets, especially those acquired in pragmatic clinical trials, is critical to advance multi-center neuroimaging studies and translational machine learning in healthcare. We present an enhanced and rigorously validated version of the HACA3 harmonization algorithm, which we refer to as HACA3+, incorporating key methodological enhancements: (1) an improved artifact encoder to better isolate and mitigate image artifacts, (2) background and foreground-sensitive attention mechanisms to increase harmonization specificity, and (3) extensive training using data spanning 100+ scanners from 64 independent sites, providing a broader diversity of scanners than other harmonization methods. Our study focuses on four commonly acquired MR image contrasts (T1-weighted, T2-weighted, proton density, and fluid-attenuated inversion recovery), reflecting realistic clinical protocols. We perform inter-site harmonization experiments using traveling subjects to assess the generalization and robustness of the harmonization model. We compare the results of the publicly available version of HACA3 and our implementation, HACA3+. Downstream relevance is further established through whole brain segmentation and image imputation. Finally, we justify each enhancement through an ablation experiment. Pre-trained weights and code for HACA3+ are made publicly available at https://github.com/shays15/haca3-plus.
[672] Defining Robust Ultrasound Quality Metrics via an Ultrasound Foundation Model
Ziyang Huang, Bingyan Li, Chen Ma, Tianyi Liu, Yihui Zhai, Hong Xu, Yi Guo, Zeju Li, Yuanyuan Wang
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Clinicians lack a principled framework to quantify diagnostic utility in ultrasound reconstructions. Existing standards like PSNR and VGG-LPIPS are inadequate, failing to account for modality-specific physics or the structural nuances of acoustic imaging. We close this gap with a TinyUSFM-based evaluation framework featuring two distinct metrics: TinyUSFM-uLPIPS, a full-reference perceptual distance based on multi-layer token relations, and TinyUSFM-NRQ, a deployable no-reference quality score utilizing clean-manifold modeling and worst-region aggregation to detect localized harmful artifacts. We demonstrate that the presented metrics have four unique advantages: 1) Task-linked quality, where TinyUSFM-uLPIPS achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail; 2) Cross-organ comparability, maintaining stable scoring scales and consistent severity rankings across diverse anatomical sites and domain-shifted data; 3) PSNR-consistent sensitivity, with TinyUSFM-NRQ providing a reliable quality score without ground-truth images that remains consistent with traditional fidelity benchmarks (i.e. PSNR); and 4) Clinical utility, improving the prediction of expert preference from 47.2% to 72.8% accuracy and producing super-resolution reconstructions preferred by sonographers. By integrating these advantages into a unified assessment and optimization loop, this work establishes a modality-aligned standard that finally bridges the gap between algorithmic performance and diagnostic utility. https://github.com/sextant-fable/US-Metrics
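For orientation, an LPIPS-style sketch of a full-reference perceptual distance accumulated over multi-layer foundation-model features; the feature extractor and layer weights are placeholders, and the paper's token-relation and worst-region mechanisms are not reproduced.

```python
import torch
import torch.nn.functional as F

def multilayer_perceptual_distance(feats_x, feats_y, weights=None) -> torch.Tensor:
    """feats_x / feats_y: lists of per-layer feature maps (B, C, H, W) extracted from
    the same backbone for the reconstructed and reference images, respectively."""
    weights = weights or [1.0] * len(feats_x)
    total = 0.0
    for w, fx, fy in zip(weights, feats_x, feats_y):
        fx = F.normalize(fx, dim=1)                 # unit-normalize channels, as in LPIPS
        fy = F.normalize(fy, dim=1)
        total = total + w * ((fx - fy) ** 2).mean()
    return total

# Usage with a hypothetical extractor returning intermediate feature maps:
# dist = multilayer_perceptual_distance(extract(recon), extract(reference))
```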
[673] Synthetic Abundance Maps for Unsupervised Super-Resolution of Hyperspectral Remote Sensing Images
Xinxin Xu, Yann Gousseau, Christophe Kervazo, Saïd Ladjal
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hyperspectral single image super-resolution (HS-SISR) aims to enhance the spatial resolution of hyperspectral images to fully exploit their spectral information. While considerable progress has been made in this field, most existing methods are supervised and require ground truth data for training, data that is often unavailable in practice. To overcome this limitation, we propose a novel unsupervised training framework for HS-SISR, based on synthetic abundance data, where no high-resolution ground-truth reference is required for training. The approach begins by unmixing the hyperspectral image into endmembers and abundances. A neural network is then trained to perform abundance super-resolution using synthetic abundances only. These synthetic abundance maps are generated from a dead leaves model whose characteristics are inherited from the low-resolution image to be super-resolved and from the known point spread function (PSF) of the hyperspectral sensor. This trained network is subsequently used to enhance the spatial resolution of the original image’s abundances, and the final super-resolution hyperspectral image is reconstructed by combining them with the endmembers. Experimental results demonstrate both the training value of the synthetic data and the effectiveness of the proposed method across 3 datasets, 3 scaling factors, and several evaluation metrics. The code is available at https://github.com/xinxinxu99/SISR-DL.git
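A minimal numpy sketch of a dead leaves abundance generator: disks are drawn in random order, later disks occlude earlier ones, and per-pixel labels become one-hot abundance maps. In practice the disk statistics would be tuned to the low-resolution image and the maps degraded with the known sensor PSF, which this sketch does not do.

```python
import numpy as np

def dead_leaves_abundances(size: int = 128, n_endmembers: int = 4, n_disks: int = 300,
                           r_min: float = 3.0, r_max: float = 25.0, seed: int = 0):
    """Synthetic abundance maps from a dead leaves model: random disks painted in
    sequence, each occluding earlier ones. Returns one-hot abundances of shape
    (n_endmembers, size, size)."""
    rng = np.random.default_rng(seed)
    label = rng.integers(0, n_endmembers, size=(size, size))    # random background material
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(n_disks):
        cx, cy = rng.uniform(0, size, 2)
        r = rng.uniform(r_min, r_max)
        mat = rng.integers(0, n_endmembers)
        label[(xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2] = mat  # later disks occlude earlier ones
    return np.stack([(label == k).astype(np.float32) for k in range(n_endmembers)])

# hr_abund = dead_leaves_abundances()
# lr_abund = degrade_with_psf(hr_abund)   # hypothetical blur + downsample using the known PSF
```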
[674] AI-Based Detection of Temporal Changes in MR-Linac Images Acquired During Routine Prostate Radiotherapy
Seungbin Park, Peilin Wang, Ryan Pennell, Emily S. Weg, Himanshu Nagar, Timothy McClure, Mert R. Sabuncu, Daniel Margolis, Heejong Kim
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Purpose: To investigate whether an AI-based method can detect subtle inter-fraction changes in MR-Linac images acquired during radiotherapy and explore the broader potential of MR-Linac imaging. Methods: This retrospective study included longitudinal 0.35T MR-Linac images from 761 patients. To identify temporal changes, we employed a deep learning model using temporal ordering via pairwise comparison, previously shown effective for longitudinal imaging studies. The model was trained using first-to-last fraction pairs (F1-FL) and all pairs (All-pairs). Performance was assessed using quantitative metrics (accuracy and AUC) and compared against a radiologist’s performance. Qualitative evaluation was performed using saliency maps, which identify anatomical regions associated with temporal imaging changes. Results: The F1-FL model demonstrated high performance (AUC=0.99, accuracy=0.95) and outperformed the radiologist in the temporal ordering task. The All-pairs model also showed high performance (AUC=0.97, accuracy=0.91). Regions contributing to predictions included the prostate, bladder, and pubic symphysis. Performance was correlated with fraction intervals and was reduced for non-radiation-exposed timepoints (Sim and F1), suggesting that observed changes may reflect both temporal variation and radiation exposure. Conclusion: MR-Linac imaging appears capable of capturing subtle changes during prostate radiotherapy that can be detected by AI models, even over approximately two-day intervals. The model’s high performance, together with quantitative and qualitative analyses, supports a potential role for MR-Linac in clinical applications beyond image guidance.