Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 181]
- cs.CV [Total: 177]
- cs.AI [Total: 121]
- cs.SD [Total: 17]
- cs.LG [Total: 153]
- cs.MA [Total: 12]
- cs.MM [Total: 1]
- eess.AS [Total: 4]
- eess.IV [Total: 8]
cs.CL
[1] Two-dimensional early exit optimisation of LLM inference
Jan Hůla, David Adamczyk, Tomáš Filip, Martin Pavlíček, Petr Sosík
Main category: cs.CL
TL;DR: A 2D early-exit strategy coordinating layer-wise and sentence-wise exits yields multiplicative inference savings for LLM classification, with 1.4–2.3x additional speed-ups over optimal layer-wise early exit on sentiment tasks.
Abstract: We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively activating deeper layers, our method achieves multiplicative computational savings that exceed those from optimizing either dimension independently. Experimental evaluation across four state-of-the-art LLMs (Llama 3.1, Llama 3.2, Gemma, Qwen; 3B-8B parameters) on three sentiment classification datasets demonstrates additional speed-ups of 1.4–2.3$\times$ over optimal layer-wise early exit for simpler tasks with vanilla models, with graceful degradation on complex multi-class problems. Fine-tuning reduces but does not eliminate this advantage. The approach is model-agnostic, requires only lightweight classification adapters, and is orthogonal to complementary efficiency methods such as quantization and pruning. Our findings indicate that 2D early exit strategies excel when semantic information accumulates predictably across input structure, suggesting possible applicability to sequence-processing tasks beyond sentiment classification.
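To make the two exit dimensions concrete, below is a minimal sketch of how layer-wise and sentence-wise exiting could be coordinated for classification; `encode_layers`, `adapter_heads`, and the depth schedule are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hypothetical sketch of 2D early exit: process the input sentence by sentence
# (outer loop) while progressively allowing deeper layers (inner loop), exiting
# as soon as a lightweight per-layer adapter is confident enough.
from typing import Callable, List

def two_dimensional_early_exit(
    sentences: List[str],
    encode_layers: Callable[[str, int], List[float]],            # hidden state of the prefix at a given layer (stand-in)
    adapter_heads: List[Callable[[List[float]], List[float]]],   # per-layer classification adapters returning class probabilities
    confidence_threshold: float = 0.9,
) -> int:
    prefix = ""
    for depth, sentence in enumerate(sentences):                  # sentence-wise dimension
        prefix = (prefix + " " + sentence).strip()
        max_layer = min(len(adapter_heads), 2 * (depth + 1))      # assumed schedule: deeper layers unlock as text accumulates
        for layer in range(max_layer):                            # layer-wise dimension
            probs = adapter_heads[layer](encode_layers(prefix, layer))
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= confidence_threshold:               # exit early on both dimensions at once
                return best
    probs = adapter_heads[-1](encode_layers(prefix, len(adapter_heads) - 1))
    return max(range(len(probs)), key=probs.__getitem__)          # fall back to the deepest layer on the full input
```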
[2] Probing for Reading Times
Eleftheria Tsipidi, Samuel Kiegeland, Francesco Ignazio Re, Tianyang Xu, Mario Giulianelli, Karolina Stanczak, Ryan Cotterell
Main category: cs.CL
TL;DR: Probing LLM layers for human reading times shows early-layer representations beat surprisal on early-pass eye-tracking measures, while scalar surprisal remains better for late-pass measures; the best predictor varies by language and measure.
Abstract: Probing has shown that language model representations encode rich linguistic information, but it remains unclear whether they also capture cognitive signals about human processing. In this work, we probe language model representations for human reading times. Using regularized linear regression on two eye-tracking corpora spanning five languages (English, Greek, Hebrew, Russian, and Turkish), we compare the representations from every model layer against scalar predictors – surprisal, information value, and logit-lens surprisal. We find that the representations from early layers outperform surprisal in predicting early-pass measures such as first fixation and gaze duration. The concentration of predictive power in the early layers suggests that human-like processing signatures are captured by low-level structural or lexical representations, pointing to a functional alignment between model depth and the temporal stages of human reading. In contrast, for late-pass measures such as total reading time, scalar surprisal remains superior, despite its being a much more compressed representation. We also observe performance gains when using both surprisal and early-layer representations. Overall, we find that the best-performing predictor varies strongly depending on the language and eye-tracking measure.
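As a rough illustration of the probing setup described above, the sketch below fits a regularized linear probe per layer and compares held-out fit; the arrays are random stand-ins for per-word hidden states and eye-tracking targets, and the exact probe, regularization, and scoring in the paper may differ.

```python
# Layer-wise linear probing sketch: regress a reading-time measure on each
# layer's per-word representations and compare cross-validated R^2.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_words, hidden_dim, n_layers = 1000, 256, 12
layer_reps = rng.normal(size=(n_layers, n_words, hidden_dim))     # stand-in for model activations
gaze_duration = rng.gamma(shape=2.0, scale=100.0, size=n_words)   # stand-in for an early-pass measure

scores = [
    cross_val_score(Ridge(alpha=10.0), layer_reps[layer], gaze_duration, cv=5, scoring="r2").mean()
    for layer in range(n_layers)
]
print("best layer by held-out R^2:", int(np.argmax(scores)))
```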
[3] Characterizing AlphaEarth Embedding Geometry for Agentic Environmental Reasoning
Mashrekur Rahman, Samuel J. Barrett, Christina Last
Main category: cs.CL
TL;DR: Characterizes the non-Euclidean geometry of AlphaEarth's 64-dimensional embeddings (effective dimensionality ~13) and builds an agentic retrieval system whose geometric tools benefit stronger reasoning models most.
Abstract: Earth observation foundation models encode land surface information into dense embedding vectors, yet the geometric structure of these representations and its implications for downstream reasoning remain underexplored. We characterize the manifold geometry of Google AlphaEarth’s 64-dimensional embeddings across 12.1 million Continental United States samples (2017–2023) and develop an agentic system that leverages this geometric understanding for environmental reasoning. The manifold is non-Euclidean: effective dimensionality is 13.3 (participation ratio) from 64 raw dimensions, with local intrinsic dimensionality of approximately 10. Tangent spaces rotate substantially, with 84% of locations exceeding 60° and local-global alignment (mean $|\cos\theta| = 0.17$) approaching the random baseline of 0.125. Supervised linear probes indicate that concept directions rotate across the manifold, and compositional vector arithmetic using both PCA-derived and probe-derived directions yields poor precision. Retrieval instead produces physically coherent results, with local geometry predicting retrieval coherence ($R^2 = 0.32$). Building on this characterization, we introduce an agentic system with nine specialized tools that decomposes environmental queries into reasoning chains over a FAISS-indexed embedding database. A five-condition ablation (120 queries, three complexity tiers) shows that embedding retrieval dominates response quality ($\mu = 3.79 \pm 0.90$ vs. $3.03 \pm 0.77$ parametric-only; scale 1–5), with peak performance on multi-step comparisons ($\mu = 4.28 \pm 0.43$). A cross-model benchmark shows that geometric tools reduce Sonnet 4.5’s score by 0.12 points but improve Opus 4.6’s by 0.07, with Opus achieving higher geometric grounding (3.38 vs. 2.64), suggesting that the value of geometric characterization scales with the reasoning capability of the consuming model.
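For readers unfamiliar with the participation ratio quoted above as the effective-dimensionality estimate, here is a small sketch of one common definition based on covariance eigenvalues; the paper's exact estimator may differ.

```python
# Participation ratio: PR = (sum_i lambda_i)^2 / sum_i lambda_i^2,
# where lambda_i are eigenvalues of the embedding covariance matrix.
import numpy as np

def participation_ratio(X: np.ndarray) -> float:
    eigvals = np.clip(np.linalg.eigvalsh(np.cov(X, rowvar=False)), 0.0, None)
    return float(eigvals.sum() ** 2 / (eigvals ** 2).sum())

X = np.random.default_rng(0).normal(size=(10_000, 64))  # stand-in for 64-dimensional embeddings
print(participation_ratio(X))                            # close to 64 for isotropic data; ~13 indicates strong anisotropy
```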
[4] Speculative End-Turn Detector for Efficient Speech Chatbot Assistant
Hyunjong Ok, Suho Yoo, Jaeho Lee
Main category: cs.CL
TL;DR: Introduces the first public end-turn detection dataset and SpeculativeETD, a collaborative on-device/server inference framework that improves end-turn detection accuracy at low computational cost.
Abstract: Spoken dialogue systems powered by large language models have demonstrated remarkable abilities in understanding human speech and generating appropriate spoken responses. However, these systems struggle with end-turn detection (ETD) – the ability to distinguish between user turn completion and hesitation. This limitation often leads to premature or delayed responses, disrupting the flow of spoken conversations. In this paper, we introduce the ETD Dataset, the first public dataset for end-turn detection. The ETD dataset consists of both synthetic speech data generated with text-to-speech models and real-world speech data collected from web sources. We also propose SpeculativeETD, a novel collaborative inference framework that balances efficiency and accuracy to improve real-time ETD in resource-constrained environments. Our approach jointly employs a lightweight GRU-based model, which rapidly detects the non-speaking units in real-time on local devices, and a high-performance Wav2vec-based model running on the server to make a more challenging classification of distinguishing turn ends from mere pauses. Experiments demonstrate that the proposed SpeculativeETD significantly improves ETD accuracy while keeping the required computations low. Datasets and code will be available after the review.
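A minimal sketch of the speculative split described above: a cheap on-device detector proposes candidate end-of-turn points, and only those candidates are escalated to a heavier server-side classifier. `local_pause_detector` and `server_turn_classifier` are hypothetical placeholders, not the paper's GRU or Wav2vec models.

```python
# Collaborative inference sketch: the expensive server call only runs on frames
# the lightweight local detector flags as possible turn ends.
from typing import Callable, List

def speculative_etd(
    frames: List[bytes],
    local_pause_detector: Callable[[bytes], bool],          # fast on-device non-speech detector (placeholder)
    server_turn_classifier: Callable[[List[bytes]], bool],  # heavier turn-end vs. hesitation classifier (placeholder)
    context: int = 10,
) -> int:
    for i, frame in enumerate(frames):
        if local_pause_detector(frame):                      # cheap check on every frame
            window = frames[max(0, i - context): i + 1]
            if server_turn_classifier(window):               # expensive check only on candidates
                return i                                     # index where the user's turn ends
    return -1                                                # no end of turn detected
```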
[5] Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP
Thanmay Jayakumar, Deepon Halder, Raj Dabre
Main category: cs.CL
TL;DR: A survey of transliteration in cross-lingual NLP, with a taxonomy of motivations, input-integration approaches, trade-offs, and practical recommendations for choosing a transliteration strategy.
Abstract: Cross-lingual transfer in NLP is often hindered by the “script barrier”, where differences in writing systems inhibit transfer learning between languages. Transliteration, the process of converting the script, has emerged as a powerful technique to bridge this gap by increasing lexical overlap. This paper provides a comprehensive survey of the application of transliteration in cross-lingual NLP. We present a taxonomy of key motivations for utilizing transliteration in language models and provide an overview of different approaches to incorporating transliterations as input. We analyze the evolution and effectiveness of these methods, discuss the critical trade-offs involved, and contextualize their need in modern LLMs. The review explores various settings in which transliteration is beneficial, including handling code-mixed text, leveraging language family relatedness, and pragmatic gains in inference efficiency. Based on this analysis, we provide concrete recommendations for researchers on selecting and implementing the most appropriate transliteration strategy based on their specific language, task, and resource constraints.
[6] OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey
Main category: cs.CL
TL;DR: A diffusion-language-model-style non-autoregressive zero-shot TTS model covering 600+ languages, trained on 581k hours of open data, with state-of-the-art multilingual performance.
Abstract: We present OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
[7] Investigating Counterfactual Unfairness in LLMs towards Identities through Humor
Shubin Kim, Yejin Son, Junyeong Park, Keummin Ka, Seungbeen Lee, Jaeyoung Lee, Hyeju Jang, Alice Oh, Youngjae Yu
Main category: cs.CL
TL;DR: Swapping speaker and target identities in humor tasks reveals consistent counterfactual disparities in LLM refusals, intention judgments, and social-harm ratings.
Abstract: Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model’s responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.
[8] Qwen3.5-Omni Technical Report
Qwen Team
Main category: cs.CL
TL;DR: Qwen3.5-Omni scales the Omni family to hundreds of billions of parameters with 256k context, achieves SOTA audio and audio-visual results, introduces ARIA for stable streaming speech, and exhibits emergent audio-visual vibe coding.
Abstract: In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
[9] Remask, Don’t Replace: Token-to-Mask Refinement in Masked Diffusion Language Models
Lin Yao
Main category: cs.CL
TL;DR: A training-free Token-to-Mask remasking rule that resets suspect tokens to the mask state instead of overwriting them, improving masked diffusion LM accuracy on exact-output tasks (up to +5.92 points on CMATH).
Abstract: Masked diffusion language models such as LLaDA2.1 rely on Token-to-Token (T2T) editing to correct their own generation errors: whenever a different token crosses a confidence threshold, the committed token is overwritten. We identify three structural failure modes of this rule. The trigger cannot fire when no single alternative is confident enough; the replacement is computed under a context that may itself contain errors; and the uniform perturbations used to train the T2T stream do not resemble the coherent, semantically plausible mistakes that the model actually makes at inference. As an alternative, we propose Token-to-Mask (T2M) remasking. Rather than overwriting a suspect token with a new guess, T2M resets the position to the mask state, so that the next denoising step re-predicts it from an in-distribution context. The method is training-free, modifies only the editing rule, and introduces no new parameters. We pair it with three detection heuristics and give a short theoretical account of why a mask is a better conditioning signal than an erroneous token. Across 8 benchmarks, T2M improves accuracy on tasks that require exact token-level output. Its largest gain is +5.92 points on CMATH, where we attribute 79.9% of baseline errors to last-mile corruption (correct reasoning followed by a garbled final answer); T2M repairs 41.3% of these cases.
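A minimal sketch of the Token-to-Mask editing rule as described: suspect positions are reset to the mask token rather than overwritten, so the next denoising step re-predicts them from an in-distribution context. The simple confidence threshold below is just one possible detection heuristic, not necessarily the paper's.

```python
# T2M remasking sketch: low-confidence committed tokens go back to [MASK]
# instead of being replaced with a new guess.
from typing import List

MASK = "[MASK]"

def t2m_remask(tokens: List[str], confidence: List[float], threshold: float = 0.3) -> List[str]:
    return [MASK if c < threshold else tok for tok, c in zip(tokens, confidence)]

tokens = ["The", "answer", "is", "42", "."]
confidence = [0.99, 0.95, 0.97, 0.12, 0.98]    # a suspect final token ("last-mile" corruption)
print(t2m_remask(tokens, confidence))           # ['The', 'answer', 'is', '[MASK]', '.']
```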
[10] Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation
Abhishek Purushothama, Emma Thronson, Alexia Guo, Amir Zeldes
Main category: cs.CL
TL;DR: Adding Universal Dependencies syntax to in-context Coptic-to-English translation helps most when combined with dictionary glosses, yielding new state-of-the-art results.
Abstract: Low-resource machine translation requires methods that differ from those used for high-resource languages. This paper proposes a novel in-context learning approach to support low-resource machine translation of the Coptic language to English, with syntactic augmentation from Universal Dependencies parses of input sentences. Building on existing work using bilingual dictionaries to support inference for vocabulary items, we add several representations of syntactic analyses to our inputs, specifically exploring the inclusion of raw parser outputs, verbalizations of parses in plain English, and targeted instructions for difficult constructions identified in sub-trees and how they can be translated. Our results show that while syntactic information alone is not as useful as dictionary-based glosses, combining retrieved dictionary items with syntactic information achieves significant gains across model sizes, achieving new state-of-the-art translation results for Coptic.
[11] Model-Agnostic Meta Learning for Class Imbalance Adaptation
Hanshu Rao, Guangzeng Han, Xiaolei Huang
Main category: cs.CL
TL;DR: HAMR combines bi-level instance weighting with neighborhood-aware resampling to handle class imbalance and data difficulty, improving minority-class performance across six datasets.
Abstract: Class imbalance is a widespread challenge in NLP tasks, significantly hindering robust performance across diverse domains and applications. We introduce Hardness-Aware Meta-Resample (HAMR), a unified framework that adaptively addresses both class imbalance and data difficulty. HAMR employs bi-level optimizations to dynamically estimate instance-level weights that prioritize genuinely challenging samples and minority classes, while a neighborhood-aware resampling mechanism amplifies training focus on hard examples and their semantically similar neighbors. We validate HAMR on six imbalanced datasets covering multiple tasks and spanning biomedical, disaster response, and sentiment domains. Experimental results show that HAMR achieves substantial improvements for minority classes and consistently outperforms strong baselines. Extensive ablation studies demonstrate that our proposed modules synergistically contribute to performance gains and highlight HAMR as a flexible and generalizable approach for class imbalance adaptation. Code is available at https://github.com/trust-nlp/ImbalanceLearning.
[12] An Empirical Study of Multi-Generation Sampling for Jailbreak Detection in Large Language Models
Hanrui Luo, Shreyank N Gowda
Main category: cs.CL
TL;DR: Single-output evaluation underestimates jailbreak vulnerability; moderate multi-sample auditing improves output-based jailbreak detection, with diminishing returns at larger sampling budgets.
Abstract: Detecting jailbreak behaviour in large language models remains challenging, particularly when strongly aligned models produce harmful outputs only rarely. In this work, we present an empirical study of output-based jailbreak detection under realistic conditions using the JailbreakBench Behaviors dataset and multiple generator models with varying alignment strengths. We evaluate both a lexical TF-IDF detector and a generation-inconsistency-based detector across different sampling budgets. Our results show that single-output evaluation systematically underestimates jailbreak vulnerability, as increasing the number of sampled generations reveals additional harmful behaviour. The most significant improvements occur when moving from a single generation to moderate sampling, while larger sampling budgets yield diminishing returns. Cross-generator experiments demonstrate that detection signals partially generalise across models, with stronger transfer observed within related model families. A category-level analysis further reveals that lexical detectors capture a mixture of behavioural signals and topic-specific cues, rather than purely harmful behaviour. Overall, our findings suggest that moderate multi-sample auditing provides a more reliable and practical approach for estimating model vulnerability and improving jailbreak detection in large language models. Code will be released.
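A sketch of the multi-sample auditing idea paired with a lexical detector; `generate` and the labeled training texts are assumed inputs, and this only mirrors the TF-IDF-plus-sampling setup at a high level.

```python
# Multi-generation auditing sketch: sample k responses per prompt and flag the
# prompt if any sample is classified as harmful by a lexical detector.
from typing import Callable, List
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def fit_lexical_detector(texts: List[str], labels: List[int]):
    # TF-IDF features + logistic regression as a simple harmful/benign classifier
    return make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000)).fit(texts, labels)

def audit_prompt(prompt: str, generate: Callable[[str], str], detector, k: int = 8) -> bool:
    samples = [generate(prompt) for _ in range(k)]               # moderate sampling budget
    return any(detector.predict([s])[0] == 1 for s in samples)   # any harmful sample flags the prompt
```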
[13] Mango: Multi-Agent Web Navigation via Global-View Optimization
Weixi Tong, Yifeng Di, Tianyi Zhang
Main category: cs.CL
TL;DR: Mango treats starting-URL selection as a multi-armed bandit solved with Thompson Sampling plus episodic memory, beating the best baselines by 7.3% on WebVoyager and 26.8% on WebWalkerQA.
Abstract: Existing web agents typically initiate exploration from the root URL, which is inefficient for complex websites with deep hierarchical structures. Without a global view of the website’s structure, agents frequently fall into navigation traps, explore irrelevant branches, or fail to reach target information within a limited budget. We propose Mango, a multi-agent web navigation method that leverages the website structure to dynamically determine optimal starting points. We formulate URL selection as a multi-armed bandit problem and employ Thompson Sampling to adaptively allocate the navigation budget across candidate URLs. Furthermore, we introduce an episodic memory component to store navigation history, enabling the agent to learn from previous attempts. Experiments on WebVoyager demonstrate that Mango achieves a success rate of 63.6% when using GPT-5-mini, outperforming the best baseline by 7.3%. Furthermore, on WebWalkerQA, Mango attains a 52.5% success rate, surpassing the best baseline by 26.8%. We also demonstrate the generalizability of Mango using both open-source and closed-source models as backbones. Our data and code are open-source and available at https://github.com/VichyTong/Mango.
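To illustrate the bandit formulation, here is a minimal Thompson Sampling sketch over candidate starting URLs with Beta posteriors; the URLs, success signal, and budget are placeholders, and Mango's reward shaping and episodic memory go beyond this.

```python
# Thompson Sampling over candidate starting URLs: keep a Beta posterior over
# each URL's success probability and allocate navigation attempts by sampling.
import random
from collections import defaultdict

class URLBandit:
    def __init__(self, urls):
        self.urls = list(urls)
        self.alpha = defaultdict(lambda: 1.0)   # Beta(1, 1) prior per URL
        self.beta = defaultdict(lambda: 1.0)

    def choose(self) -> str:
        return max(self.urls, key=lambda u: random.betavariate(self.alpha[u], self.beta[u]))

    def update(self, url: str, success: bool) -> None:
        if success:
            self.alpha[url] += 1.0
        else:
            self.beta[url] += 1.0

bandit = URLBandit(["https://example.com/", "https://example.com/docs", "https://example.com/search"])
for _ in range(10):                              # navigation budget
    url = bandit.choose()
    success = random.random() < 0.3              # placeholder for "did this episode reach the target?"
    bandit.update(url, success)
```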
[14] Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
Seyedali Mohammadi, Manas Gaur, Francis Ferraro
Main category: cs.CL
TL;DR: For LLM scientific-feasibility judgments, outcome evidence is generally more reliable than experiment descriptions, which can be brittle and hurt performance when incomplete.
Abstract: Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
[15] Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
Sinan G. Aksoy, Alexandra A. Sabrio, Erik VonKaenel, Lee Burke
Main category: cs.CL
TL;DR: A needle-in-a-haystack framework shows LLM similarity scores depend on perturbation position, context relatedness, and model identity, with each model exhibiting a stable scoring fingerprint.
Abstract: We propose a scalable, multifactorial experimental framework that systematically probes LLM sensitivity to subtle semantic changes in pairwise document comparison. We analogize this as a needle-in-a-haystack problem: a single semantically altered sentence (the needle) is embedded within surrounding context (the hay), and we vary the perturbation type (negation, conjunction swap, named entity replacement), context type (original vs. topically unrelated), needle position, and document length across all combinations, testing five LLMs on tens of thousands of document pairs. Our analysis reveals several striking findings. First, LLMs exhibit a within-document positional bias distinct from previously studied candidate-order effects: most models penalize semantic differences more harshly when they occur earlier in a document. Second, when the altered sentence is surrounded by topically unrelated context, it systematically lowers similarity scores and induces bipolarized scores that indicate either very low or very high similarity. This is consistent with an interpretive frame account in which topically-related context may allow models to contextualize and downweight the alterations. Third, each LLM produces a qualitatively distinct scoring distribution, a stable “fingerprint” that is invariant to perturbation type, yet all models share a universal hierarchy in how leniently they treat different perturbation types. Together, these results demonstrate that LLM semantic similarity scores are sensitive to document structure, context coherence, and model identity in ways that go beyond the semantic change itself, and that the proposed framework offers a practical, LLM-agnostic toolkit for auditing and comparing scoring behavior across current and future models.
[16] LegalBench-BR: A Benchmark for Evaluating Large Language Models on Brazilian Legal Decision Classification
Pedro Barbosa de Carvalho Neto
Main category: cs.CL
TL;DR: On a new Brazilian legal-classification benchmark, a LoRA-fine-tuned BERTimbau strongly outperforms commercial LLMs, which show a systematic bias toward the civil-law class.
Abstract: We introduce LegalBench-BR, the first public benchmark for evaluating language models on Brazilian legal text classification. The dataset comprises 3,105 appellate proceedings from the Santa Catarina State Court (TJSC), collected via the DataJud API (CNJ) and annotated across five legal areas through LLM-assisted labeling with heuristic validation. On a class-balanced test set, BERTimbau-LoRA, updating only 0.3% of model parameters, achieves 87.6% accuracy and 0.87 macro-F1 (+22pp over Claude 3.5 Haiku, +28pp over GPT-4o mini). The gap is most striking on administrativo (administrative law): GPT-4o mini scores F1 = 0.00 and Claude 3.5 Haiku scores F1 = 0.08 on this class, while the fine-tuned model reaches F1 = 0.91. Both commercial LLMs exhibit a systematic bias toward civel (civil law), absorbing ambiguous classes rather than discriminating them, a failure mode that domain-adapted fine-tuning eliminates. These results demonstrate that general-purpose LLMs cannot substitute for domain-adapted models in Brazilian legal classification, even when the task is a simple 5-class problem, and that LoRA fine-tuning on a consumer GPU closes the gap at zero marginal inference cost. We release the full dataset, model, and pipeline to enable reproducible research in Portuguese legal NLP.
[17] Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India
Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra
Main category: cs.CL
TL;DR: A 536-hour benchmark of unscripted telephonic speech spanning 15 Indian languages and 139 regional clusters, with spelling-variation-aware transcripts and district-level error analysis.
Abstract: Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
[18] Where Fake Citations Are Made: Tracing Field-Level Hallucination to Specific Neurons in LLMs
Yuefei Chen, Yihao Quan, Xiaodong Lin, Ruixiang Tang
Main category: cs.CL
TL;DR: Author names are the most hallucinated citation field across 9 models; sparse field-specific neurons causally modulate citation hallucination and can be suppressed to reduce it.
Abstract: LLMs frequently generate fictitious yet convincing citations, often expressing high confidence even when the underlying reference is wrong. We study this failure across 9 models and 108,000 generated references, and find that author names fail far more often than other fields across all models and settings. Citation style has no measurable effect, while reasoning-oriented distillation degrades recall. Probes trained on one field transfer at near-chance levels to the others, suggesting that hallucination signals do not generalize across fields. Building on this finding, we apply elastic-net regularization with stability selection to neuron-level CETT values of Qwen2.5-32B-Instruct and identify a sparse set of field-specific hallucination neurons (FH-neurons). Causal intervention further confirms their role: amplifying these neurons increases hallucination, while suppressing them improves performance across fields, with larger gains in some fields. These results suggest a lightweight approach to detecting and mitigating citation hallucination using internal model signals alone.
[19] Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language
Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang
Main category: cs.CL
TL;DR: A benchmark for generating executable visual workflows from natural language; current LLMs capture intent but struggle to produce correct, stable workflows, and the proposed agentic framework adds up to 5.34% resolve-rate gains.
Abstract: At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve, making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic framework to mitigate recurrent execution errors. Chat2Workflow is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially under complex or changing requirements. Although our agentic framework yields up to 5.34% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.
[20] Prioritizing the Best: Incentivizing Reliable Multimodal Reasoning by Rewarding Beyond Answer Correctness
Mengzhao Jia, Zhihan Zhang, Meng Jiang
Main category: cs.CL
TL;DR: RLVR aggravates reasoning-answer inconsistency; a Groupwise Ranking Reward over verifier-passed trajectories improves reliability-conditioned accuracy from 47.4% to 54.7%.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) improves multimodal reasoning by rewarding verifiable final answers. Yet answer-correct trajectories may still rely on incomplete derivations, weak evidence, or statements that contradict their conclusions. This gap between answer correctness and reasoning validity, which we call reasoning-answer inconsistency, motivates trajectory supervision in multimodal RL. We compare two main approaches: reward models (RMs) and Generative Rewards (GRs). RMs are efficient and help early in training, but their gains weaken as the policy distribution shifts; GRs improve performance, but may give unstable rewards and be computationally expensive. We therefore propose Groupwise Ranking Reward, which ranks verifier-passed trajectories for the same prompt in one pass and redistributes reward accordingly. Groupwise comparison better separates stronger and weaker correct trajectories with lower judge overhead than GRs. Experiments show that RLVR aggravates reasoning-answer inconsistency, while trajectory supervision alleviates it. Groupwise Ranking Reward performs best overall, improving reliability-conditioned accuracy from 47.4% to 54.7% over RLVR.
[21] Less Is More: Cognitive Load and the Single-Prompt Ceiling in LLM Mathematical Reasoning
Manuel Israel Cazares
Main category: cs.CL
TL;DR: A prompt-engineering study finds a single-prompt accuracy ceiling (~60–79%) for equational-theory reasoning, attributed to undecidability of the TRUE case, cognitive-load effects on weaker models, and fragile ordering effects.
Abstract: We present a systematic empirical study of prompt engineering for formal mathematical reasoning in the context of the SAIR Equational Theories Stage 1 competition. The task requires deciding whether one equational law implies another over all magmas – a problem that is undecidable in general but decidable for FALSE via finite model search. Over five weeks, we designed, tested, and analyzed more than 40 prompt variants, ranging from 0 to 4,878 bytes, across four evaluation splits and three language models (gpt-oss-120b, Llama 3.3 70B, Gemma 4 31B). Our central finding is a single-prompt ceiling: despite substantial engineering effort, balanced hard accuracy plateaus in an empirical saturation region of approximately 60–79% for gpt-oss-120b, compared to a 59.75% no-cheatsheet baseline. We identify three mechanisms underlying this ceiling: (1) the mathematical undecidability of the TRUE case limits what any finite prompt can encode; (2) complex rule systems decrease performance on weaker models (Llama 3.3 70B collapses to 0% TRUE recall with prompts exceeding 2KB); and (3) prompt ordering effects interact with model attention in fragile, non-monotonic ways. Our best submission (AN45c, 2,252 bytes) achieves 79.25% accuracy on hard3 (n=400; 95% CI: [75.0%, 82.9%]), with TRUE recall of 95.9% and FALSE recall of 63.4%, representing a +19.5 percentage-point improvement over the no-cheatsheet baseline (59.75%). We release all prompt variants, evaluation scripts, and results at https://github.com/israelcazares/sair-prompt-engineering
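The reported interval for AN45c is consistent with a Wilson score interval for 317/400 correct; a quick check, assuming that is the method used:

```python
# Wilson 95% confidence interval for 317/400 = 79.25% accuracy.
from statsmodels.stats.proportion import proportion_confint

low, high = proportion_confint(count=317, nobs=400, alpha=0.05, method="wilson")
print(f"[{low:.1%}, {high:.1%}]")  # approximately [75.0%, 82.9%]
```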
[22] LogosKG: Hardware-Optimized Scalable and Interpretable Knowledge Graph Retrieval
He Cheng, Yifu Wu, Saksham Khatwani, Maya Kruse, Dmitriy Dligach, Timothy A. Miller, Majid Afshar, Yanjun Gao
Main category: cs.CL
TL;DR: A hardware-aligned framework for scalable, interpretable k-hop knowledge-graph retrieval on billion-edge graphs, with large efficiency gains and downstream KG-LLM analysis.
Abstract: Knowledge graphs (KGs) are increasingly integrated with large language models (LLMs) to provide structured, verifiable reasoning. A core operation in this integration is multi-hop retrieval, yet existing systems struggle to balance efficiency, scalability, and interpretability. We introduce LogosKG, a novel, hardware-aligned framework that enables scalable and interpretable k-hop retrieval on large KGs by building on symbolic KG formulations and executing traversal as hardware-efficient operations over decomposed subject, object, and relation representations. To scale to billion-edge graphs, LogosKG integrates degree-aware partitioning, cross-graph routing, and on-demand caching. Experiments show substantial efficiency gains over CPU and GPU baselines without loss of retrieval fidelity. With proven performance in KG retrieval, a downstream two-round KG-LLM interaction demonstrates how LogosKG enables large-scale, evidence-grounded analysis of how KG topology, such as hop distribution and connectivity, shapes the alignment between structured biomedical knowledge and LLM diagnostic reasoning, thereby opening the door for next-generation KG-LLM integration. The source code is publicly available at https://github.com/LARK-NLP-Lab/LogosKG, and an online demo is available at https://lark-nlp-lab-logoskg.hf.space/.
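As a toy illustration of k-hop retrieval expressed as sparse matrix operations, the sketch below propagates a seed entity through an adjacency matrix; LogosKG's decomposed subject/relation/object layout, partitioning, routing, and hardware-aligned kernels go well beyond this.

```python
# k-hop traversal as repeated sparse matrix-vector products over a toy KG.
import numpy as np
from scipy.sparse import csr_matrix

edges = [(0, 1), (1, 2), (2, 3), (0, 4)]                 # (subject, object) pairs
n = 5
rows, cols = zip(*edges)
adj = csr_matrix((np.ones(len(edges)), (rows, cols)), shape=(n, n))

frontier = np.zeros(n)
frontier[0] = 1.0                                         # seed entity
reached = frontier > 0
for _ in range(2):                                        # k = 2 hops
    frontier = adj.T.dot(frontier)                        # entities reachable in one more hop
    reached |= frontier > 0
print(np.flatnonzero(reached))                            # [0 1 2 4]
```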
[23] MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
Mehul Agarwal, Aditya Aggarwal, Arnav Goel, Medha Hira, Anubha Gupta
Main category: cs.CL
TL;DR: A benchmark for gender-aware morphological rewriting in French, Arabic, and Hindi that reveals significant gaps across 15 multilingual LLMs.
Abstract: While multilingual large language models (LLMs) perform well on high-level tasks like translation and question answering, their ability to handle grammatical gender and morphological agreement remains underexplored. In morphologically rich languages, gender influences verb conjugation, pronouns, and even first-person constructions with explicit and implicit mentions of gender. We introduce MORPHOGEN, a morphologically grounded large-scale benchmark dataset for evaluating gender-aware generation in three typologically diverse grammatically gendered languages: French, Arabic, and Hindi. The core task, GENFORM, requires models to rewrite a first-person sentence in the opposite gender while preserving its meaning and structure. We construct a high-quality synthetic dataset spanning these three languages and benchmark 15 popular multilingual LLMs (2B-70B) on their ability to perform this transformation. Our results reveal significant gaps and interesting insights into how current models handle morphological gender. MORPHOGEN provides a focused diagnostic lens for gender-aware language modeling and lays the groundwork for future research on inclusive and morphology-sensitive NLP.
[24] Proposing Topic Models and Evaluation Frameworks for Analyzing Associations with External Outcomes: An Application to Leadership Analysis Using Large-Scale Corporate Review Data
Yura Yoshida, Masato Kanai, Masataka Nakayama, Haruki Ohsawa, Yukiko Uchida, Arata Yuminaga, Gakuse Hoshina, Nobuo Sayama
Main category: cs.CL
TL;DR: An LLM-based topic-modeling method and evaluation framework targeting interpretability, specificity, and polarity consistency, applied to leadership analysis of corporate reviews with higher explanatory power for external outcomes.
Abstract: Analyzing topics extracted from text data in relation to external outcomes is important across fields such as computational social science and organizational research. However, existing topic modeling methods struggle to simultaneously achieve interpretability, topic specificity (alignment with concrete actions or characteristics), and polarity stance consistency (absence of mixed positive and negative evaluations within a topic). Focusing on leadership analysis using corporate review data, this study proposes a method leveraging large language models to generate topics that satisfy these properties, along with an evaluation framework tailored to external outcome analysis. The framework explicitly incorporates topic specificity and polarity stance consistency as evaluation criteria and examines automated evaluation methods based on existing metrics. Using employee reviews from OpenWork, a major corporate review platform in Japan, the proposed method achieves improved interpretability, specificity, and polarity consistency compared to existing approaches. In analyses of external outcomes such as employee morale, it also produces topics with higher explanatory power. These results suggest that the proposed method and evaluation framework provide a generalized approach for topic analysis in applications involving external outcomes.
[25] Disparities In Negation Understanding Across Languages In Vision-Language Models
Charikleia Moraitaki, Sarah Pan, Skyler Pulling, Gwendolyn Flusche, Kumail Alhamoud, Marzyeh Ghassemi
Main category: cs.CL
TL;DR: A human-verified multilingual negation benchmark across seven languages shows CLIP at or below chance on non-Latin scripts, MultiCLIP most robust, and mitigation effectiveness varying by language.
Abstract: Vision-language models (VLMs) exhibit affirmation bias: a systematic tendency to select positive captions (“X is present”) even when the correct description contains negation (“no X”). While prior work has documented this failure mode in English and proposed solutions, negation manifests differently across languages through varying morphology, word order, and cliticization patterns, raising the question of whether these solutions serve all linguistic communities equitably. We introduce the first human-verified multilingual negation benchmark, spanning seven typologically diverse languages: English, Mandarin Chinese, Arabic, Greek, Russian, Tagalog, and Spanish. Evaluating three VLMs - CLIP, SigLIP, and MultiCLIP - we find that standard CLIP performs at or below chance on non-Latin-script languages, while MultiCLIP achieves the highest and most uniform accuracy. We also evaluate SpaceVLM, a proposed negation correction, and find that it produces substantial improvements for several languages - particularly English, Greek, Spanish, and Tagalog - while showing varied effectiveness across typologically different languages. This variation reveals that linguistic properties like morphology, script, and negation structure interact with model improvements in fairness-relevant ways. As VLMs are deployed globally, multilingual benchmarks are essential for understanding not just whether solutions work, but for whom.
[26] A Mechanism and Optimization Study on the Impact of Information Density on User-Generated Content Named Entity Recognition
Jiang Xiaobo, Dinghong Lai, Song Qiu, Yadong Deng, Xinkai Zhan
Main category: cs.CL
TL;DR: Low information density is identified as the root cause of UGC NER failures; the model-agnostic WOM module densifies sparse regions via selective back-translation, adding up to 4.5% absolute F1 and new SOTA on WNUT2017.
Abstract: Named Entity Recognition (NER) models trained on clean, high-resource corpora exhibit catastrophic performance collapse when deployed on noisy, sparse User-Generated Content (UGC), such as social media. Prior research has predominantly focused on point-wise symptom remediation – employing customized fine-tuning to address issues like neologisms, alias drift, non-standard orthography, long-tail entities, and class imbalance. However, these improvements often fail to generalize because they overlook the structural sparsity inherent in UGC. This study reveals that surface-level noise symptoms share a unified root cause: low Information Density (ID). Through hierarchical confounding-controlled resampling experiments (specifically controlling for entity rarity and annotation consistency), this paper identifies ID as an independent key factor. We introduce Attention Spectrum Analysis (ASA) to quantify how reduced ID causally leads to “attention blunting,” ultimately degrading NER performance. Informed by these mechanistic insights, we propose the Window-Aware Optimization Module (WOM), an LLM-empowered, model-agnostic framework. WOM identifies information-sparse regions and utilizes selective back-translation to directionally enhance semantic density without altering model architecture. Deployed atop mainstream architectures on standard UGC datasets (WNUT2017, Twitter-NER, WNUT2016), WOM yields up to 4.5% absolute F1 improvement, demonstrating robustness and achieving new state-of-the-art (SOTA) results on WNUT2017.
[27] Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest
Ramtin Davoudi, Kartik Thakkar, Nazanin Donyapour, Tyler Derr, Hamid Karimi
Main category: cs.CL
TL;DR: The first comprehensive evaluation of modern LLMs on authorship verification, post generation, and user-attribute inference over Twitter (X) data, with new sampling frameworks, a user study, and reproducible benchmarks.
Abstract: In this study, we present the first comprehensive evaluation of modern LLMs - including GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT - across three core social media analytics tasks on a Twitter (X) dataset: (I) Social Media Authorship Verification, (II) Social Media Post Generation, and (III) User Attribute Inference. For the authorship verification, we introduce a systematic sampling framework over diverse user and post selection strategies and evaluate generalization on newly collected tweets from January 2024 onward to mitigate “seen-data” bias. For post generation, we assess the ability of LLMs to produce authentic, user-like content using comprehensive evaluation metrics. Bridging Tasks I and II, we conduct a user study to measure real users’ perceptions of LLM-generated posts conditioned on their own writing. For attribute inference, we annotate occupations and interests using two standardized taxonomies (IAB Tech Lab 2023 and 2018 U.S. SOC) and benchmark LLMs against existing baselines. Overall, our unified evaluation provides new insights and establishes reproducible benchmarks for LLM-driven social media analytics. The code and data are provided in the supplementary material and will also be made publicly available upon publication.
[28] STAR-Teaming: A Strategy-Response Multiplex Network Approach to Automated LLM Red Teaming
MinJae Jung, YongTaek Lim, Chaeyun Kim, Junghwan Kim, Kihyun Kim, Minwoo Kim
Main category: cs.CL
TL;DR: A black-box multi-agent red-teaming framework that samples attack strategies over a strategy-response multiplex network, achieving higher attack success rates at lower computational cost.
Abstract: While Large Language Models (LLMs) are widely used, they remain susceptible to jailbreak prompts that can elicit harmful or inappropriate responses. This paper introduces STAR-Teaming, a novel black-box framework for automated red teaming that effectively generates such prompts. STAR-Teaming integrates a Multi-Agent System (MAS) with a Strategy-Response Multiplex Network and employs network-driven optimization to sample effective attack strategies. This network-based approach recasts the intractable high-dimensional embedding space into a tractable structure, yielding two key advantages: it enhances the interpretability of the LLM’s strategic vulnerabilities, and it streamlines the search for effective strategies by organizing the search space into semantic communities, thereby preventing redundant exploration. Empirical results demonstrate that STAR-Teaming significantly surpasses existing methods, achieving a higher attack success rate (ASR) at a lower computational cost. Extensive experiments validate the effectiveness and explainability of the Multiplex Network. The code is available at https://github.com/selectstar-ai/STAR-Teaming-paper.
[29] $R^2$-dLLM: Accelerating Diffusion Large Language Models via Spatio-Temporal Redundancy Reduction
Zhenbang Du, Kejing Xia, Xinrui Zhong, Yonggan Fu, Nicolai Oswald, Binfei Ji, Brucek Khailany, Pavlo Molchanov, Yingyan Lin
Main category: cs.CL
TL;DR: Reducing spatial and temporal decoding redundancy in diffusion LLMs via training-free decoding rules and redundancy-aware fine-tuning cuts decoding steps by up to 75% with comparable quality.
Abstract: Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to autoregressive generation by enabling parallel token prediction. However, practical dLLM decoding still suffers from high inference latency, which limits deployment. In this work, we observe that a substantial part of this inefficiency comes from recurring redundancy in the decoding process, including spatial redundancy caused by confidence clusters and positional ambiguity, and temporal redundancy caused by repeatedly remasking predictions that have already stabilized. Motivated by these patterns, we propose $R^2$-dLLM, a unified framework for reducing decoding redundancy from both inference and training perspectives. At inference time, we introduce training-free decoding rules that aggregate local confidence and token predictions, and finalize temporally stable tokens to avoid redundant decoding steps. We further propose a redundancy-aware supervised fine-tuning pipeline that aligns the model with efficient decoding trajectories and reduces reliance on manually tuned thresholds. Experiments demonstrate that $R^2$-dLLM consistently reduces the number of decoding steps by up to 75% compared to existing decoding strategies, while maintaining competitive generation quality across different models and tasks. These results validate that decoding redundancy is a central bottleneck in dLLMs, and that explicitly reducing it yields substantial practical efficiency gains.
[30] When Safety Fails Before the Answer: Benchmarking Harmful Behavior Detection in Reasoning Chains
Ishita Kakkar, Enze Zhang, Rheeya Uppaal, Junjie Hu
Main category: cs.CL
TL;DR: A sentence-level benchmark of 56,931 annotated sentences for tracking how harm emerges within reasoning traces; existing detectors struggle with fine-grained harmful-behavior detection.
Abstract: Large reasoning models (LRMs) produce complex, multi-step reasoning traces, yet safety evaluation remains focused on final outputs, overlooking how harm emerges during reasoning. When jailbroken, harm does not appear instantaneously but unfolds through distinct behavioral steps such as suppressing refusal, rationalizing compliance, decomposing harmful tasks, and concealing risk. However, no existing benchmark captures this process at sentence-level granularity within reasoning traces – a key step toward reliable safety monitoring, interventions, and systematic failure diagnosis. To address this gap, we introduce HarmThoughts, a benchmark for step-wise safety evaluation of reasoning traces. HarmThoughts is built on our proposed harm taxonomy of 16 harmful reasoning behaviors across four functional groups that characterize how harm propagates rather than what harm is produced. The dataset consists of 56,931 sentences from 1,018 reasoning traces generated by four model families, each annotated with fine-grained sentence-level behavioral labels. Using HarmThoughts, we analyze harm propagation patterns across reasoning traces, identifying common behavioral trajectories and drift points where reasoning transitions from safe to unsafe. Finally, we systematically compare white-box and black-box detectors on the task of identifying harmful reasoning behaviours on HarmThoughts. Our results show that existing detectors struggle with fine-grained behavior detection in reasoning traces, particularly for nuanced categories within harm emergence and execution, highlighting a critical gap in process-level safety monitoring. HarmThoughts is available publicly at: https://huggingface.co/datasets/ishitakakkar-10/HarmThoughts
[31] Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection
Yixuan Tang, Yirui Zhang, Hang Feng, Anthony K. H. Tung
Main category: cs.CL
TL;DR: A role-anchored multi-agent debate framework (Politician, Scientist, Judge) with adaptive early termination improves omission-aware half-truth verification while reducing reasoning cost.
Abstract: Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a Politician and a Scientist, who reason adversarially over shared retrieved evidence, moderated by a neutral Judge. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at https://github.com/tangyixuan/RADAR.
[32] AlignCultura: Towards Culturally Aligned Large Language Models?
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: A two-stage pipeline builds CULTURAX, a UNESCO-grounded cultural-alignment dataset, and benchmarks models on it; culturally fine-tuned models improve joint HHH by 4–6% and reduce cultural failures by 18%.
Abstract: Cultural alignment in Large Language Models (LLMs) is essential for producing contextually aware, respectful, and trustworthy outputs. Without it, models risk generating stereotyped, insensitive, or misleading responses that fail to reflect cultural diversity w.r.t. the Helpful, Harmless, and Honest (HHH) paradigm. Existing benchmarks represent early steps toward cultural alignment; yet no benchmark currently enables systematic evaluation of cultural alignment in line with UNESCO’s principles of cultural diversity w.r.t. the HHH paradigm. Therefore, to address this gap, we built Align-Cultura, a two-stage pipeline for cultural alignment. Stage I constructs CULTURAX, the HHH-English dataset grounded in the UNESCO cultural taxonomy, through Query Construction, which reclassifies prompts, expands underrepresented domains (or labels), and prevents data leakage with SimHash. Then, Response Generation pairs prompts with culturally grounded responses via two-stage rejection sampling. The final dataset contains 1,500 samples spanning 30 subdomains of tangible and intangible cultural forms. Stage II benchmarks CULTURAX on general-purpose models, culturally fine-tuned models, and open-weight LLMs (Qwen3-8B and DeepSeek-R1-Distill-Qwen-7B). Empirically, culturally fine-tuned models improve joint HHH by 4%-6%, reduce cultural failures by 18%, achieve 10%-12% efficiency gains, and limit leakage to 0.3%.
[33] RARE: Redundancy-Aware Retrieval Evaluation Framework for High-Similarity Corpora
Hanjun Cho, Jay-Yoon Lee
Main category: cs.CL
TL;DR: A redundancy-aware framework for building realistic RAG retrieval benchmarks over highly similar corpora; on the resulting RedQA, a strong retriever's 4-hop PerfRecall@10 drops from 66.4% to 5.0–27.9%.
Abstract: Existing QA benchmarks typically assume distinct documents with minimal overlap, yet real-world retrieval-augmented generation (RAG) systems operate on corpora such as financial reports, legal codes, and patents, where information is highly redundant and documents exhibit strong inter-document similarity. This mismatch undermines evaluation validity: retrievers can be unfairly undervalued even when they retrieve documents that provide sufficient evidence, because redundancy across documents is not accounted for in evaluation. On the other hand, retrievers that perform well on standard benchmarks often generalize poorly to real-world corpora with highly similar and redundant documents. We present RARE (Redundancy-Aware Retrieval Evaluation), a framework for constructing realistic benchmarks by (i) decomposing documents into atomic facts to enable precise redundancy tracking and (ii) enhancing LLM-based data generation with CRRF. RAG benchmark data usually requires multiple quality criteria, but LLMs often yield trivial outputs. CRRF scores criteria separately and fuses decisions by rank, improving the reliability of generated data. Applying RARE to Finance, Legal, and Patent corpora, we introduce RedQA, where a strong retriever baseline drops from 66.4% PerfRecall@10 on 4-hop General-Wiki to 5.0-27.9% PerfRecall@10 at 4-hop depth, revealing robustness gaps that current benchmarks fail to capture. RARE enables practitioners to build domain-specific RAG evaluations that faithfully reflect real-world deployment conditions.
[34] SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
Boyan Shi, Wei Chen, Shuyuan Zhao, Junfeng Shen, Shengnan Guo, Shaojiang Wang, Huaiyu Wan
Main category: cs.CL
TL;DR: A MoE-LoRA fine-tuning framework with a semantic-aware router and task-adaptive scaling that outperforms state-of-the-art parameter-efficient baselines on multi-task benchmarks.
Abstract: The combination of Mixture-of-Experts (MoE) and Low-Rank Adaptation (LoRA) has shown significant potential for enhancing the multi-task learning capabilities of Large Language Models. However, existing methods face two primary challenges: (1) imprecise routing in the current MoE-LoRA method fails to explicitly match input semantics with expert capabilities, leading to weak expert specialization; (2) uniform weight fusion strategies struggle to provide adaptive update strengths, overlooking the varying complexity of different tasks. To address these limitations, we propose SAMoRA (Semantic-Aware Mixture of LoRA Experts), a novel parameter-efficient fine-tuning framework tailored for task-adaptive learning. Specifically, a Semantic-Aware Router is proposed to explicitly align textual semantics with the most suitable experts for precise routing. A Task-Adaptive Scaling mechanism is designed to dynamically regulate expert contributions based on specific task requirements. In addition, a novel regularization objective is proposed to jointly promote expert specialization and effective scaling. Extensive experiments on multiple multi-task benchmarks demonstrate that SAMoRA significantly outperforms the state-of-the-art methods and holds excellent task generalization capabilities. Code is available at https://github.com/boyan-code/SAMoRA
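A minimal sketch of the general idea of semantic-aware routing over LoRA experts, where input embeddings are scored against learned expert keys and expert LoRA updates are mixed by the resulting softmax weights; names, shapes, and the omitted task-adaptive scaling are illustrative, not SAMoRA's implementation.

```python
# Routing sketch: per-expert "semantic key" vectors score the input, and the
# softmax weights mix the experts' low-rank updates.
import torch
import torch.nn as nn

class SemanticLoRARouter(nn.Module):
    def __init__(self, hidden: int, rank: int, n_experts: int):
        super().__init__()
        self.expert_keys = nn.Parameter(torch.randn(n_experts, hidden))        # per-expert semantic anchors
        self.lora_A = nn.Parameter(torch.randn(n_experts, hidden, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(n_experts, rank, hidden))

    def forward(self, x: torch.Tensor) -> torch.Tensor:                        # x: (batch, hidden)
        weights = torch.softmax(x @ self.expert_keys.T, dim=-1)                # (batch, n_experts)
        expert_out = torch.einsum("bh,ehr,erd->bed", x, self.lora_A, self.lora_B)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)                 # weighted LoRA update (added to the frozen layer's output)

print(SemanticLoRARouter(hidden=64, rank=8, n_experts=4)(torch.randn(4, 64)).shape)  # torch.Size([4, 64])
```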
[35] Cell-Based Representation of Relational Binding in Language Models
Qin Dai, Benjamin Heinzerling, Kentaro Inui
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding a discourse requires tracking entities and the relations that hold between them. While Large Language Models (LLMs) perform well on relational reasoning, the mechanism by which they bind entities, relations, and attributes remains unclear. We study discourse-level relational binding and show that LLMs encode it via a Cell-based Binding Representation (CBR): a low-dimensional linear subspace in which each "cell" corresponds to an entity–relation index pair, and bound attributes are retrieved from the corresponding cell during inference. Using controlled multi-sentence data annotated with entity and relation indices, we identify the CBR subspace by decoding these indices from attribute-token activations with Partial Least Squares regression. Across domains and two model families, the indices are linearly decodable and form a grid-like geometry in the projected space. We further find that context-specific CBR representations are related by translation vectors in activation space, enabling cross-context transfer. Finally, activation patching shows that manipulating this subspace systematically changes relational predictions and that perturbing it disrupts performance, providing causal evidence that LLMs rely on CBR for relational binding.
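Illustrative probing sketch: the Partial Least Squares regression step described in the abstract can be approximated with scikit-learn as below. File names, data shapes, and the rounding-based readout are assumptions, not the paper's pipeline.

```python
# Minimal sketch of decoding entity/relation indices from attribute-token
# activations with PLS regression; only the use of PLS follows the abstract.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# activations: (n_examples, hidden_dim) hidden states at attribute tokens
# targets: (n_examples, 2) columns = (entity_index, relation_index)
activations = np.load("attribute_token_activations.npy")   # hypothetical file
targets = np.load("entity_relation_indices.npy")            # hypothetical file

pls = PLSRegression(n_components=8)          # low-dimensional "binding" subspace
pls.fit(activations, targets.astype(float))

# Project activations into the subspace and read off predicted indices
# (held-out data should be used in practice).
pred = np.rint(pls.predict(activations))
accuracy = (pred == targets).all(axis=1).mean()
print(f"cell decoding accuracy: {accuracy:.2f}")
```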
[36] Product-of-Experts Training Reduces Dataset Artifacts in Natural Language Inference
Aby Mammen Mathew
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Neural NLI models overfit to dataset artifacts instead of truly reasoning. A hypothesis-only model achieves 57.7% accuracy on SNLI, indicating strong spurious correlations, and 38.6% of the baseline's errors can be traced to these artifacts. We propose Product-of-Experts (PoE) training, which downweights examples on which biased models are overconfident. PoE nearly preserves accuracy (89.10% vs. 89.30%) while cutting bias reliance by 4.71% (bias agreement drops from 49.85% to 45%). An ablation finds that lambda = 1.5 best balances debiasing and accuracy. Behavioral tests still reveal issues with negation and numerical reasoning.
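Since PoE is a standard debiasing recipe, a minimal PyTorch sketch is given below; the exact placement of lambda and the frozen hypothesis-only bias model are assumptions consistent with common PoE formulations, not the paper's verified code.

```python
# Minimal sketch of Product-of-Experts (PoE) debiasing for NLI: combine a
# frozen hypothesis-only bias model with the main model in log space during
# training, and use the main model alone at test time.
import torch.nn.functional as F

def poe_loss(main_logits, bias_logits, labels, lam=1.5):
    """Examples the bias model already answers confidently contribute
    smaller gradients to the main model, reducing artifact reliance."""
    log_p_main = F.log_softmax(main_logits, dim=-1)
    log_p_bias = F.log_softmax(bias_logits.detach(), dim=-1)  # bias model stays frozen
    combined = log_p_main + lam * log_p_bias                  # product of experts in log space
    return F.nll_loss(F.log_softmax(combined, dim=-1), labels)

# Inference uses the main model alone:
# preds = main_logits.argmax(dim=-1)
```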
[37] TRN-R1-Zero: Text-rich Network Reasoning via LLMs with Reinforcement Learning Only
Yilun Liu, Ruihong Qiu, Zi Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Zero-shot reasoning on text-rich networks (TRNs) remains a challenging frontier, as models must integrate textual semantics with relational structure without task-specific supervision. While graph neural networks rely on fixed label spaces and supervised objectives, recent large language model (LLM)-based approaches often overlook graph context or depend on distillation from larger models, limiting generalisation. We propose TRN-R1-Zero, a post-training framework for TRN reasoning trained solely via reinforcement learning. TRN-R1-Zero directly optimises base LLMs using a Neighbour-aware Group Relative Policy Optimisation objective that dynamically adjusts rewards based on a novel margin gain metric for the informativeness of neighbouring signals, effectively guiding the model toward relational reasoning. Unlike prior methods, TRN-R1-Zero requires no supervised fine-tuning or chain-of-thought data generated from large reasoning models. Extensive experiments across citation, hyperlink, social and co-purchase TRN benchmarks demonstrate the superiority and robustness of TRN-R1-Zero. Moreover, relying strictly on node-level training, TRN-R1-Zero achieves zero-shot inference on edge- and graph-level tasks, extending beyond cross-domain transfer. The codebase is publicly available at https://github.com/superallen13/TRN-R1-Zero.
[38] HoWToBench: Holistic Evaluation for LLM’s Capability in Human-level Writing using Tree of Writing
Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Lin Fan, Yilin Zhou, Zikang Wang, Xiaotao Gu, Jie Tang, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating the writing capabilities of large language models (LLMs) remains a significant challenge due to the multidimensional nature of writing skills and the limitations of existing metrics. LLM performance on thousand-word-scale, open-ended writing is inadequately assessed by traditional reference-based metrics or modern LLM-as-a-judge methods. We propose Tree-of-Writing (ToW) to resolve the implicit inconsistency often found when LLM-as-a-judge aggregates all sub-features in text evaluation. ToW incorporates a tree-structured workflow that explicitly models the aggregation weights of sub-features. We also present HoWToBench, a large-scale Chinese writing benchmark encompassing 12 genres and 1302 instructions across three task categories: contextual completion, outline-guided writing, and open-ended generation. ToW successfully mitigates the biases, achieving a 0.93 Pearson correlation with human judgments. Furthermore, we find that both overlap-based text generation metrics and popular LLM-as-a-judge practices are vulnerable to textual disturbances, while ToW is robust to them. We also uncover a negative correlation between input length and content-related scores in the Guide task, showing that performance cannot be improved simply by piling on input-side information.
[39] SAHM: A Benchmark for Arabic Financial and Shari’ah-Compliant Reasoning
Rania Elbadry, Sarfraz Ahmad, Ahmed Heakl, Dani Bouch, Momina Ahsan, Muhra AlMahri, Marwa Elsaid khalil, Yuxia Wang, Salem Lahlou, Sophia Ananiadou, Veselin Stoyanov, Jimin Huang, Xueqing Peng, Preslav Nakov, Zhuohan Xie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: English financial NLP has progressed rapidly through benchmarks for sentiment, document understanding, and financial question answering, while Arabic financial NLP remains comparatively under-explored despite strong practical demand for trustworthy finance and Islamic-finance assistants. We introduce SAHM, a document-grounded benchmark and instruction-tuning dataset for Arabic financial NLP and Shari’ah-compliant reasoning. SAHM contains 14,380 expert-verified instances spanning seven tasks: AAOIFI standards QA, fatwa-based QA/MCQ, accounting and business exams, financial sentiment analysis, extractive summarization, and event-cause reasoning, curated from authentic regulatory, juristic, and corporate sources. We evaluate 19 strong open and proprietary LLMs using task-specific metrics and rubric-based scoring for open-ended outputs, and find that Arabic fluency does not reliably translate to evidence-grounded financial reasoning: models are substantially stronger on recognition-style tasks than on generation and causal reasoning, with the largest gaps on event-cause reasoning. We release the benchmark, evaluation framework, and an instruction-tuned model to support future research on trustworthy Arabic financial NLP.
[40] Detoxification for LLM: From Dataset Itself
Wei Shao, Yihang Wang, Gaoyu Zhu, Ziqiang Cheng, Lei Yu, Jiafeng Guo, Xueqi Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing detoxification methods for large language models mainly target the post-training stage or inference time, while few tackle the source of toxicity: the dataset itself. Such training-based or controllable-decoding approaches cannot completely suppress a model's inherent toxicity, whereas detoxifying the pretraining dataset can fundamentally reduce the toxicity the model learns during training. We therefore detoxify raw corpora directly with SoCD (Soft Contrastive Decoding), which guides an LLM to localize and rewrite toxic spans while preserving semantics, within our proposed HSPD (Hierarchical Semantic-Preserving Detoxification) pipeline, yielding a detoxified corpus that can serve as a drop-in replacement for the original in fine-tuning or other training. On GPT2-XL, HSPD attains state-of-the-art detoxification, reducing Toxicity Probability (TP) from 0.42 to 0.18 and Expected Maximum Toxicity (EMT) from 0.43 to 0.20. We further validate consistent best-in-class results on LLaMA2-7B, OPT-6.7B, and Falcon-7B. These findings show that semantics-preserving, corpus-level rewriting with HSPD effectively suppresses downstream toxicity while retaining data utility and enabling seamless source-level mitigation, thereby reducing the cost of later model behavior adjustment. (Code is available at: https://github.com/ntsw2001/data_detox_for_llm)
[41] Do Emotions Influence Moral Judgment in Large Language Models?
Mohammad Saim, Tianyu Jiang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models have been extensively studied for emotion recognition and moral reasoning as distinct capabilities, yet the extent to which emotions influence moral judgment remains underexplored. In this work, we develop an emotion-induction pipeline that infuses emotion into moral situations and evaluate shifts in moral acceptability across multiple datasets and LLMs. We observe a directional pattern: positive emotions increase moral acceptability and negative emotions decrease it, with effects strong enough to reverse binary moral judgments in up to 20% of cases, and with susceptibility scaling inversely with model capability. Our analysis further reveals that specific emotions can sometimes behave contrary to what their valence would predict (e.g., remorse paradoxically increases acceptability). A complementary human annotation study shows humans do not exhibit these systematic shifts, indicating an alignment gap in current LLMs.
[42] Construction of Knowledge Graph based on Language Model
Qiubai Zhu, Qingwang Wang, Haibin Yuan, Wei Chen, Tao Shen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Knowledge Graphs (KGs) can effectively integrate valuable information from massive data and have therefore been rapidly developed and widely used in many fields. Traditional KG construction methods rely on manual annotation, which often consumes considerable time and manpower, while KG construction schemes based on deep learning tend to have weak generalization capabilities. With the rapid development of Pre-trained Language Models (PLMs), PLMs have shown great potential for KG construction. This paper provides a comprehensive review of recent research advances in constructing KGs with PLMs. We explain how PLMs can use their language understanding and generation capabilities to automatically extract key information for KGs, such as entities and relations, from textual data. In addition, we propose a new Hyper-Relational Knowledge Graph construction framework based on a lightweight Large Language Model (LLM), named LLHKG, and compare it with previous methods. Under our framework, the KG construction capability of a lightweight LLM is comparable to that of GPT-3.5.
[43] The Rise of Verbal Tics in Large Language Models: A Systematic Analysis Across Frontier Models
Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang, Ran Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As Large Language Models (LLMs) continue to evolve through alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, a growing and increasingly conspicuous phenomenon has emerged: the proliferation of verbal tics – repetitive, formulaic linguistic patterns that pervade model outputs. These range from sycophantic openers (“That’s a great question!”, “Awesome!”) to pseudo-empathetic affirmations (“I completely understand your concern”, “I’m right here to catch you”) and overused vocabulary (“delve”, “tapestry”, “nuanced”). In this paper, we present a systematic analysis of the verbal tic phenomenon across eight state-of-the-art LLMs: GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro, Grok 4.2, Doubao-Seed-2.0-pro, Kimi K2.5, DeepSeek V3.2, and MiMo-V2-Pro. Utilizing a custom evaluation framework for standardized API-based evaluation, we assess 10,000 prompts across 10 task categories in both English and Chinese, yielding 160,000 model responses. We introduce the Verbal Tic Index (VTI), a composite metric quantifying tic prevalence, and analyze its correlation with sycophancy, lexical diversity, and human-perceived naturalness. Our findings reveal significant inter-model variation: Gemini 3.1 Pro exhibits the highest VTI (0.590), while DeepSeek V3.2 achieves the lowest (0.295). We further demonstrate that verbal tics accumulate over multi-turn conversations, are amplified in subjective tasks, and show distinct cross-lingual patterns. Human evaluation (N = 120) confirms a strong inverse relationship between sycophancy and perceived naturalness (r = -0.87, p < 0.001). These results underscore the “alignment tax” of current training paradigms and highlight the urgent need for more authentic human-AI interaction frameworks.
[44] ReflectMT: Internalizing Reflection for Efficient and High-Quality Machine Translation
Kunquan Li, Yingxue Zhang, Fandong Meng, Jinsong Su
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent years have witnessed growing interest in applying Large Reasoning Models (LRMs) to Machine Translation (MT). Existing approaches predominantly adopt a “think-first-then-translate” paradigm. Although explicit reasoning trajectories significantly enhance translation quality, they incur prohibitive inference costs and latency. To address these limitations, we propose ReflectMT, a two-stage reflection internalization algorithm for machine translation that employs a “translate-first-think-later” paradigm. Our approach develops the model’s “translate-reflect-refine” capability through reinforcement learning. In the first stage, we cultivate the model’s capacity for high-quality reflection and refinement, thereby enhancing its semantic comprehension and task-specific knowledge. In the second stage, we train the model to internalize the knowledge acquired during reflection. As a result, during inference, ReflectMT operates in a direct translation mode, producing high-quality translations on the first attempt without any explicit reasoning steps. Experimental results on datasets such as WMT24 demonstrate that our model’s first-pass translations during inference outperform multi-step reasoning LRMs such as DeepSeek-R1 in both automatic metrics and GPT-based evaluation, achieving a 2.16-point improvement in GPT-based translation quality evaluation while reducing token consumption by 94.33%.
[45] How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning
Haoyang Chen, Yi Liu, Jianzhi Shao, Tao Zhang, Chengfu Huo, Wei Hu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Thinking LLMs produce reasoning traces before answering. Prior activation steering work mainly targets shaping these traces; it remains less understood how answer tokens actually read and integrate the reasoning to produce reliable outcomes. Focusing on quantitative reasoning, we analyze answer-to-reasoning attention and observe a benign self-reading pattern aligned with correctness, characterized by a forward drift of the reading focus along the reasoning trace and a persistent concentration on key semantic anchors, whereas incorrect solutions exhibit diffuse and irregular attention patterns. We interpret this as internal certainty during answer decoding, where the model commits to a viable solution branch and integrates key evidence. Building on this, we propose a training-free steering method driven by Self-Reading Quality (SRQ) scores, combining geometric metrics for process control with semantic metrics for content monitoring. SRQ selects data to build steering vectors that guide inference toward benign self-reading and away from uncertain, disorganized reading. Experiments show that our method yields consistent accuracy gains.
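A generic activation-steering sketch of the training-free recipe the abstract describes: build a steering vector from activations of high- versus low-SRQ examples and add it during decoding. The SRQ scoring itself, the layer choice, and the scaling factor are assumptions.

```python
# Generic steering-vector construction and injection; the selection of
# examples by SRQ score is the paper's contribution and is not reproduced.
import torch

def build_steering_vector(high_srq_acts, low_srq_acts):
    """Each input: (n_examples, hidden_dim) activations at a chosen layer."""
    return high_srq_acts.mean(dim=0) - low_srq_acts.mean(dim=0)

def steer_hook(steering_vec, alpha=4.0):
    """Forward hook that nudges hidden states toward benign self-reading."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * steering_vec.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# handle = model.model.layers[20].register_forward_hook(steer_hook(vec))  # hypothetical layer index
```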
[46] Mind the Unseen Mass: Unmasking LLM Hallucinations via Soft-Hybrid Alphabet Estimation
Hongxing Pan, Yingying Guo, Wenqing Kuang, Jiashi Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper studies uncertainty quantification for large language models (LLMs) under black-box access, where only a small number of responses can be sampled for each query. In this setting, estimating the effective semantic alphabet size–that is, the number of distinct meanings expressed in the sampled responses–provides a useful proxy for downstream risk. However, frequency-based estimators tend to undercount rare semantic modes when the sample size is small, while graph-spectral quantities alone are not designed to estimate semantic occupancy accurately. To address this issue, we propose SHADE (Soft-Hybrid Alphabet Dynamic Estimator), a simple and interpretable estimator that combines Generalized Good-Turing coverage with a heat-kernel trace of the normalized Laplacian constructed from an entailment-weighted graph over sampled responses. The estimated coverage adaptively determines the fusion rule: under high coverage, SHADE uses a convex combination of the two signals, while under low coverage it applies a LogSumExp fusion to emphasize missing or weakly observed semantic modes. A finite-sample correction is then introduced to stabilize the resulting cardinality estimate before converting it into a coverage-adjusted semantic entropy score. Experiments on pooled semantic alphabet-size estimation against large-sample references and on QA incorrectness detection show that SHADE achieves the strongest improvements in the most sample-limited regime, while the performance gap narrows as the number of samples increases. These results suggest that hybrid semantic occupancy estimation is particularly beneficial when black-box uncertainty quantification must operate under tight sampling budgets.
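The two ingredients named in the abstract can be sketched as follows; the fusion rule, threshold, and finite-sample correction shown are simplified placeholders rather than SHADE's actual estimator.

```python
# Illustrative components: Good-Turing coverage over semantic clusters and a
# heat-kernel trace of a normalized Laplacian over an entailment-weighted graph.
import numpy as np
from scipy.linalg import expm

def good_turing_coverage(cluster_labels):
    """Coverage ~ 1 - (#singleton semantic clusters / #samples)."""
    _, counts = np.unique(cluster_labels, return_counts=True)
    return 1.0 - (counts == 1).sum() / len(cluster_labels)

def heat_kernel_trace(weights, t=1.0):
    """weights: symmetric (n, n) entailment-weighted adjacency matrix."""
    deg = weights.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    lap = np.eye(len(weights)) - d_inv_sqrt @ weights @ d_inv_sqrt  # normalized Laplacian
    return np.trace(expm(-t * lap))

def fused_alphabet_size(cluster_labels, weights, cov_threshold=0.8, beta=0.5):
    cov = good_turing_coverage(cluster_labels)
    gt = len(np.unique(cluster_labels)) / max(cov, 1e-6)   # coverage-adjusted richness
    hk = heat_kernel_trace(weights)
    if cov >= cov_threshold:                               # high coverage: convex combination
        return beta * gt + (1 - beta) * hk
    return np.logaddexp(gt, hk)                            # low coverage: LogSumExp-style fusion
```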
[47] SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization
Bo-Jyun Wang, Ying-Jia Lin, Hung-Yu Kao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Small language models (SLMs), such as BART, can achieve summarization performance comparable to large language models (LLMs) via distillation. However, existing LLM-based ranking strategies for summary candidates suffer from instability, while classical metrics (e.g., ROUGE) are insufficient to rank high-quality summaries. To address these issues, we introduce SCURank, a framework that enhances summarization by leveraging Summary Content Units (SCUs). Instead of relying on unstable comparisons or surface-level overlap, SCURank evaluates summaries based on the richness and semantic importance of information content. We investigate the effectiveness of SCURank in distilling summaries from multiple diverse LLMs. Experimental results demonstrate that SCURank outperforms traditional metrics and LLM-based ranking methods across evaluation measures and datasets. Furthermore, our findings show that incorporating diverse LLM summaries enhances model abstractiveness and overall distilled model performance, validating the benefits of information-centric ranking in multi-LLM distillation. The code for SCURank is available at https://github.com/IKMLab/SCURank.
[48] Headlines You Won’t Forget: Can Pronoun Insertion Increase Memorability?
Selina Meyer, Magdalena Abel, Michael Roth
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: For news headlines to influence beliefs and drive action, relevant information needs to be retained and retrievable from memory. In this probing study we draw on experiment designs from cognitive psychology to examine how a specific linguistic feature, namely direct address through first- and second-person pronouns, affects memorability and to what extent it is feasible to use large language models for the targeted insertion of such a feature into existing text without changing its core meaning. Across three controlled memorization experiments with a total of 240 participants, yielding 7,680 unique memory judgments, we show that pronoun insertion has mixed effects on memorability. Exploratory analyses indicate that effects differ based on headline topic, how pronouns are inserted, and their immediate contexts. Additional data and fine-grained analysis are needed to draw definitive conclusions on these mediating factors. We further show that automatic revisions by LLMs are not always appropriate: Crowdsourced evaluations find many of them to be lacking in content accuracy and emotion retention or resulting in unnatural writing style. We make our collected data available for future work.
[49] Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
Clara Lachenmaier, Hannah Bultmann, Sina Zarrieß
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Repair, an important resource for resolving trouble in human-human conversation, remains underexplored in human-LLM interaction. In this study, we investigate how LLMs engage in the interactive process of repair in multi-turn dialogues around solvable and unsolvable math questions. We examine whether models initiate repair themselves and how they respond to user-initiated repair. Our results show strong differences across models: reactions range from being almost completely resistant to (appropriate) repair attempts to being highly susceptible and easily manipulated. We further demonstrate that once conversations extend beyond a single turn, model behavior becomes more distinctive and less predictable across systems. Overall, our findings indicate that each tested LLM exhibits its own characteristic form of unreliability in the context of repair.
[50] ShadowPEFT: Shadow Network for Parameter-Efficient Fine-Tuning
Xianming Li, Zongxi Li, Tsz-fung Andrew Lee, Jing Li, Haoran Xie, Qing Li
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Parameter-efficient fine-tuning (PEFT) reduces the training cost of full-parameter fine-tuning for large language models (LLMs) by training only a small set of task-specific parameters while freezing the pretrained backbone. However, existing approaches, such as Low-Rank Adaptation (LoRA), achieve adaptation by inserting independent low-rank perturbations directly into individual weights, resulting in a local parameterization of adaptation. We propose ShadowPEFT, a centralized PEFT framework that instead performs layer-level refinement through a depth-shared shadow module. At each transformer layer, ShadowPEFT maintains a parallel shadow state and evolves it repeatedly to produce progressively richer hidden states. This design shifts adaptation from distributed weight-space perturbations to a shared layer-space refinement process. Since the shadow module is decoupled from the backbone, it can be reused across depth, independently pretrained, and optionally deployed in a detached mode, benefiting edge computing scenarios. Experiments on generation and understanding benchmarks show that ShadowPEFT matches or outperforms LoRA and DoRA under comparable trainable-parameter budgets. Additional analyses on shadow pretraining, cross-dataset transfer, parameter scaling, inference latency, and system-level evaluation suggest that centralized layer-space adaptation is a competitive and flexible alternative to conventional low-rank PEFT.
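A schematic reading of the depth-shared shadow module, sketched in PyTorch below; the shadow dimensionality, the recurrent update, and the additive merge are assumptions about how such a module could look, not ShadowPEFT's architecture.

```python
# One small trainable module reused at every layer, evolving a parallel
# "shadow" state that refines the frozen backbone's hidden states.
import torch
import torch.nn as nn

class SharedShadow(nn.Module):
    def __init__(self, d_model, d_shadow=64):
        super().__init__()
        self.down = nn.Linear(d_model, d_shadow)
        self.gru = nn.GRUCell(d_shadow, d_shadow)   # evolves the shadow state across depth
        self.up = nn.Linear(d_shadow, d_model)

    def step(self, hidden, shadow_state):
        """hidden: (tokens, d_model); shadow_state: (tokens, d_shadow)."""
        shadow_state = self.gru(self.down(hidden), shadow_state)
        return hidden + self.up(shadow_state), shadow_state

# Usage inside a frozen transformer forward pass, one shared module for all layers:
# shadow = SharedShadow(d_model)
# state = torch.zeros(n_tokens, 64)
# for layer in backbone.layers:
#     hidden = layer(hidden)
#     hidden, state = shadow.step(hidden, state)
```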
[51] Towards a Linguistic Evaluation of Narratives: A Quantitative Stylistic Framework
Alessandro Maisto
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The evaluation of narrative quality remains a complex challenge, as it involves subjective factors such as plot, character development, and emotional impact. This work proposes a quantitative approach to narrative assessment by focusing on the linguistic dimension as a primary indicator of quality. The paper presents a methodology for the automatic evaluation of narrative based on the extraction of a comprehensive set of 33 quantitative linguistic features categorized into lexical, syntactic, and semantic groups. To test the model, an experiment was conducted on a specialized corpus of 23 books, including canonical masterpieces and self-published works. Through a similarity matrix, the system successfully clustered the narratives, distinguishing almost perfectly between professionally edited and self-published texts. Furthermore, the methodology was validated against a human-annotated dataset; it significantly outperforms traditional story-level evaluation metrics, demonstrating the effectiveness of quantitative linguistic features in assessing narrative quality.
[52] CulturALL: Benchmarking Multilingual and Multicultural Competence of LLMs on Grounded Tasks
Peiqin Lin, Chenyang Lyu, Wenjiang Luo, Haotian Ye, Md Mehrab Hossain, Chunlan Ma, Shaoxiong Ji, Younes Samih, Bo Zeng, Fan Jiang, Yuanbin Cao, Dilda Duisenbek, Adrian Neo Sau Xun, Daria Pozdniakova, Liubou Misevich, Nevena Marinković, Ngoc Gia Linh Nguyen, Thi Khanh Linh Do, Sarakmatak Sophy, Baotian Hu, Guanhua Chen, Gongbo Tang, Alham Fikri Aji, Longyue Wang, Weihua Luo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are now deployed worldwide, inspiring a surge of benchmarks that measure their multilingual and multicultural abilities. However, these benchmarks prioritize generic language understanding or superficial cultural trivia, leaving the evaluation of grounded tasks – where models must reason within real-world, context-rich scenarios – largely unaddressed. To fill this gap, we present CulturALL, a comprehensive and challenging benchmark to assess LLMs’ multilingual and multicultural competence on grounded tasks. CulturALL is built via a human–AI collaborative framework: expert annotators ensure appropriate difficulty and factual accuracy, while LLMs lighten the manual workload. By incorporating diverse sources, CulturALL ensures comprehensive scenario coverage. Each item is carefully designed to present a high level of difficulty, making CulturALL challenging. CulturALL contains 2,610 samples in 14 languages from 51 regions, distributed across 16 topics to capture the full breadth of grounded tasks. Experiments show that the best LLM achieves 44.48% accuracy on CulturALL, underscoring substantial room for improvement.
[53] HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Euntae Kim, Soomin Han, Buru Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models by filling incomplete drafts with dangerous content to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading performance on co-authoring tasks. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our new benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench
[54] Location Not Found: Exposing Implicit Local and Global Biases in Multilingual LLMs
Guy Mor-Lan, Omer Goldman, Matan Eyal, Adi Mayrav Gilady, Sivan Eiger, Idan Szpektor, Avinatan Hassidim, Yossi Matias, Reut Tsarfaty
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multilingual large language models (LLMs) have minimized the fluency gap between languages. This advancement, however, exposes models to the risk of biased behavior, as knowledge and norms may propagate across languages. In this work, we aim to quantify models’ inter- and intra-lingual biases, via their ability to answer locale-ambiguous questions. To this end, we present LocQA, a test set containing 2,156 questions in 12 languages, referring to various locale-dependent facts such as laws, dates, and measurements. The questions do not contain indications of the locales they relate to, other than the querying language itself. LLMs’ responses to LocQA locale-ambiguous questions thus reveal models’ implicit priors. We used LocQA to evaluate 32 models, and detected two types of structural biases. Inter-lingually, we show a global bias towards answers relevant to the US-locale, even when models are asked in languages other than English. Moreover, we discovered that this global bias is exacerbated in models that underwent instruction tuning, compared to their base counterparts. Intra-lingually, we show that when multiple locales are relevant for the same language, models act as demographic probability engines, prioritizing locales with larger populations. Taken together, insights from LocQA may help in shaping LLMs’ desired local behavior, and in quantifying the impact of various training phases on different kinds of biases.
[55] IndiaFinBench: An Evaluation Benchmark for Large Language Model Performance on Indian Financial Regulatory Text
Rajveer Singh Pall
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce IndiaFinBench, to our knowledge the first publicly available evaluation benchmark for assessing large language model (LLM) performance on Indian financial regulatory text. Existing financial NLP benchmarks draw exclusively from Western financial corpora (SEC filings, US earnings reports, and English-language financial news), leaving a significant gap in coverage of non-Western regulatory frameworks. IndiaFinBench addresses this gap with 406 expert-annotated question-answer pairs drawn from 192 documents sourced from the Securities and Exchange Board of India (SEBI) and the Reserve Bank of India (RBI), spanning four task types: regulatory interpretation (174 items), numerical reasoning (92 items), contradiction detection (62 items), and temporal reasoning (78 items). Annotation quality is validated through a model-based secondary pass (kappa=0.918 on contradiction detection) and a 60-item human inter-annotator agreement evaluation (kappa=0.611; 76.7% overall agreement). We evaluate twelve models under zero-shot conditions, with accuracy ranging from 70.4% (Gemma 4 E4B) to 89.7% (Gemini 2.5 Flash). All models substantially outperform a non-specialist human baseline of 60.0%. Numerical reasoning is the most discriminative task, with a 35.9 percentage-point spread across models. Bootstrap significance testing (10,000 resamples) reveals three statistically distinct performance tiers. The dataset, evaluation code, and all model outputs are available at https://github.com/rajveerpall/IndiaFinBench
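The bootstrap analysis mentioned in the abstract (10,000 resamples) corresponds to a standard paired bootstrap over items; the following is a generic sketch, not the benchmark's released evaluation code.

```python
# Paired bootstrap test for comparing two models' accuracy on the same items.
import numpy as np

def paired_bootstrap(correct_a, correct_b, n_resamples=10_000, seed=0):
    """correct_a / correct_b: boolean arrays, per-item correctness of two models."""
    correct_a, correct_b = np.asarray(correct_a), np.asarray(correct_b)
    rng = np.random.default_rng(seed)
    n = len(correct_a)
    diffs = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)                 # resample items with replacement
        diffs[i] = correct_a[idx].mean() - correct_b[idx].mean()
    # two-sided p-value: how often the accuracy gap crosses zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return diffs.mean(), p
```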
[56] Rethinking Scale: Deployment Trade-offs of Small Language Models under Agent Paradigms
Xinlin Wang, Mats Brorsson
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the impressive capabilities of large language models, their substantial computational costs, latency, and privacy risks hinder their widespread deployment in real-world applications. Small Language Models (SLMs) with fewer than 10 billion parameters present a promising alternative; however, their inherent limitations in knowledge and reasoning curtail their effectiveness. Existing research primarily focuses on enhancing SLMs through scaling laws or fine-tuning strategies while overlooking the potential of using agent paradigms, such as tool use and multi-agent collaboration, to systematically compensate for the inherent weaknesses of small models. To address this gap, this paper presents the first large-scale, comprehensive study of <10B open-source models under three paradigms: (1) the base model, (2) a single agent equipped with tools, and (3) a multi-agent system with collaborative capabilities. Our results show that single-agent systems achieve the best balance between performance and cost, while multi-agent setups add overhead with limited gains. Our findings highlight the importance of agent-centric design for efficient and trustworthy deployment in resource-constrained settings.
[57] Evaluating LLM-Driven Summarisation of Parliamentary Debates with Computational Argumentation
Eoghan Cunningham, Derek Greene, James Cross, Antonio Rago
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding how policy is debated and justified in parliament is a fundamental aspect of the democratic process. However, the volume and complexity of such debates mean that outside audiences struggle to engage. Meanwhile, Large Language Models (LLMs) have been shown to enable automated summarisation at scale. While summaries of debates can make parliamentary procedures more accessible, evaluating whether these summaries faithfully communicate argumentative content remains challenging. Existing automated summarisation metrics have been shown to correlate poorly with human judgements of consistency (i.e., faithfulness or alignment between summary and source). In this work, we propose a formal framework for evaluating parliamentary debate summaries that grounds argument structures in the contested proposals up for debate. Our novel approach, driven by computational argumentation, focuses the evaluation on formal properties concerning the faithful preservation of the reasoning presented to justify or oppose policy outcomes. We demonstrate our methods using a case-study of debates from the European Parliament and associated LLM-driven summaries.
[58] Are Large Language Models Economically Viable for Industry Deployment?
Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora, Rafiq Ali, Ebad Shabbir, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics, Economic Break-Even (Nbreak), Intelligence-Per-Watt (IPW), System Density ($\rho_{sys}$), Cold-Start Tax (Ctax), and Quantization Fidelity (Qret), capturing profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3x higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7x for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.
[59] DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
Jinyu Guo, Zhihan Zhang, Yutong Li, Jiehui Xie, Md. Tamim Iqbal, Dongshen Han, Lik-Hang Lee, Sung-Ho Bae, Jie Zou, Yang Yang, Chaoning Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The quadratic computational complexity of the standard attention mechanism constitutes a fundamental bottleneck for large language models in long-context inference. While existing KV cache compression methods alleviate memory pressure, they often sacrifice generation quality and fail to address the high overhead of floating-point arithmetic. This paper introduces DASH-KV, an innovative acceleration framework that reformulates attention as approximate nearest-neighbor search via asymmetric deep hashing. Under this paradigm, we design an asymmetric encoding architecture that differentially maps queries and keys to account for their distinctions in precision and reuse characteristics. To balance efficiency and accuracy, we further introduce a dynamic mixed-precision mechanism that adaptively retains full-precision computation for critical tokens. Extensive experiments on LongBench demonstrate that DASH-KV significantly outperforms state-of-the-art baseline methods while matching the performance of full attention, all while reducing inference complexity from O(N^2) to linear O(N). The code is available at https://github.com/Zhihan-Zh/DASH-KV
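A schematic sketch of hashing-based KV selection in this spirit: random-hyperplane codes below stand in for the paper's learned asymmetric encoders, and the dynamic mixed-precision mechanism is omitted.

```python
# Hash cached keys and the incoming query to binary codes, then attend only
# over the keys closest in Hamming distance.
import torch

def binary_codes(x, proj):
    """x: (n, d); proj: (d, n_bits) random hyperplanes -> boolean codes."""
    return x @ proj > 0

def select_keys(q, keys, proj, top_k=256):
    q_code = binary_codes(q.unsqueeze(0), proj)          # (1, n_bits)
    k_codes = binary_codes(keys, proj)                   # (n_keys, n_bits)
    hamming = (q_code ^ k_codes).sum(dim=-1)             # distance per cached key
    return torch.topk(hamming, k=min(top_k, keys.shape[0]), largest=False).indices

def sparse_attention(q, keys, values, proj, top_k=256):
    idx = select_keys(q, keys, proj, top_k)
    k, v = keys[idx], values[idx]
    scores = (q @ k.t()) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v
```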
[60] Can Continual Pre-training Bridge the Performance Gap between General-purpose and Specialized Language Models in the Medical Domain?
Niclas Doll, Jasper Schulze Buschhoff, Shalaka Satheesh, Hammam Abdelwahab, Héctor Allende-Cid, Katrin Klug
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from $7B$ to $24B$ parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances $7B$ model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately $3.5$-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized $7B$ models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.
[61] Does Self-Consistency Improve the Recall of Encyclopedic Knowledge?
Sho Hoshino, Ukyo Honda, Peinan Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While self-consistency is known to improve performance on symbolic reasoning, its effect on the recall of encyclopedic knowledge is unclear due to a lack of targeted evaluation grounds. To address this, we establish such a knowledge recall split for the popular MMLU benchmark by applying a data-driven heuristic from prior work. We validate this split by showing that the performance patterns on the symbolic reasoning and knowledge recall subsets mirror those of GSM8K and MedMCQA, respectively. Using this solid ground, we find that self-consistency consistently improves performance across both symbolic reasoning and knowledge recall, even though its underlying CoT prompting is primarily effective for symbolic reasoning. As a result, we achieve an 89% accuracy on MMLU, the best performance to date with the use of GPT-4o.
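Self-consistency itself is straightforward to reproduce; below is a generic sketch of the sampling-and-voting loop the paper evaluates. The `generate` callable and sampling settings are placeholders.

```python
# Sample several chain-of-thought answers and take the majority-vote final answer.
from collections import Counter

def self_consistency(generate, question, n_samples=10, temperature=0.7):
    """`generate` is any callable returning (reasoning, final_answer) per call."""
    answers = []
    for _ in range(n_samples):
        _, answer = generate(question, temperature=temperature)
        answers.append(answer.strip())
    vote, count = Counter(answers).most_common(1)[0]
    return vote, count / n_samples   # answer plus its agreement rate
```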
[62] Lost in Translation: Do LVLM Judges Generalize Across Languages?
Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Mir Tafseer Nayeem, Amran Bhuiyan, Mizanur Rahman, Shafiq Joty, Enamul Hoque, Jimmy Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automatic evaluators such as reward models play a central role in the alignment and evaluation of large vision-language models (LVLMs). Despite their growing importance, these evaluators are almost exclusively assessed on English-centric benchmarks, leaving open the question of how well these evaluators generalize across languages. To answer this question, we introduce MM-JudgeBench, the first large-scale benchmark for multilingual and multimodal judge model evaluation, which includes over 60K pairwise preference instances spanning 25 typologically diverse languages. MM-JudgeBench integrates two complementary subsets: a general vision-language preference evaluation subset extending VL-RewardBench, and a chart-centric visual-text reasoning subset derived from OpenCQA, enabling systematic analysis of reward models (i.e., LVLM judges) across diverse settings. We additionally release a multilingual training set derived from MM-RewardBench, disjoint from our evaluation data, to support domain adaptation. By evaluating 22 LVLMs (15 open-source, 7 proprietary), we uncover substantial cross-lingual performance variance in our proposed benchmark. Our analysis further shows that model size and architecture are poor predictors of multilingual robustness, and that even state-of-the-art LVLM judges exhibit inconsistent behavior across languages. Together, these findings expose fundamental limitations of current reward modeling and underscore the necessity of multilingual, multimodal benchmarks for developing reliable automated evaluators.
[63] What Makes an LLM a Good Optimizer? A Trajectory Analysis of LLM-Guided Evolutionary Search
Xinhao Zhang, Xi Chen, François Portet, Maxime Peyrard
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent work has demonstrated the promise of orchestrating large language models (LLMs) within evolutionary and agentic optimization systems. However, the mechanisms driving these optimization gains remain poorly understood. In this work, we present a large-scale study of LLM-guided evolutionary search, collecting optimization trajectories for 15 LLMs across 8 tasks. Although zero-shot problem-solving ability correlates with final optimization outcomes, it explains only part of the variance: models with similar initial capability often induce dramatically different search trajectories and outcomes. By analyzing these trajectories, we find that strong LLM optimizers behave as local refiners, producing frequent incremental improvements while progressively localizing the search in semantic space. Conversely, weaker optimizers exhibit large semantic drift, with sporadic breakthroughs followed by stagnation. Notably, various measures of solution novelty do not predict final performance; novelty is beneficial only when the search remains sufficiently localized around high-performing regions of the solution space. Our results highlight the importance of trajectory analysis for understanding and improving LLM-based optimization systems and provide actionable insights for their design and training.
[64] ‘The Order in the Horse’s Heart’: A Case Study in LLM-Assisted Stylometry for the Discovery of Biblical Allusion in Modern Literary Fiction
Ewan Cameron
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a dual-track pipeline for detecting biblical allusions in literary fiction and apply it to the novels of Cormac McCarthy. A bottom-up embedding track uses inverse document frequency to identify rare vocabulary shared with the King James Bible, embeds occurrences in their local context for sense disambiguation, and passes candidate passage pairs through cascaded LLM review. A top-down register track asks an LLM to read McCarthy’s prose without pointing it to any specific biblical passage for comparison, catching allusions not distinguished by word or phrase rarity. Both tracks are cross-validated by a long-context model that holds entire novels alongside the KJV in a single pass, and every finding is checked against published scholarship. Restricting attention to allusions that carry a textual echo–shared phrasing, reworked vocabulary, or transplanted cadence–and distinguishing literary allusions proper from signposted biblical references (similes naming biblical figures, characters overtly citing scripture), the pipeline surfaces 349 allusions across the corpus. Among a target set of 115 previously documented allusions retrieved through human review of the academic literature, the pipeline independently recovers 62 (54% recall), with recall varying by connection type from 30% (transformed imagery) to 80% (register collisions). We contextualise these results with respect to the value-add from LLMs as assistants to mechanical stylometric analyses, and their potential to facilitate the statistical study of intertextuality in massive literary corpora.
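A compact sketch of the bottom-up track's first step: surfacing rare vocabulary shared between a novel and the King James Bible via inverse document frequency. File paths, the reference corpus, and the rarity threshold are placeholders, and the embedding and LLM review stages are omitted.

```python
# Surface rare shared vocabulary as candidate allusion anchors.
import math
import re
from collections import Counter

def tokens(path):
    return re.findall(r"[a-z]+", open(path, encoding="utf-8").read().lower())

def idf(word, doc_freq, n_docs):
    return math.log(n_docs / (1 + doc_freq.get(word, 0)))

novel, kjv = set(tokens("novel.txt")), set(tokens("kjv.txt"))          # hypothetical paths
doc_freq = Counter()                                                   # word -> #documents containing it
# ... populate doc_freq from a large reference corpus ...
n_docs = 10_000

shared_rare = sorted(
    (w for w in novel & kjv if idf(w, doc_freq, n_docs) > 7.0),        # rarity threshold is illustrative
    key=lambda w: -idf(w, doc_freq, n_docs),
)
print(shared_rare[:50])   # candidate allusion-bearing vocabulary for downstream review
```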
[65] LePREC: Reasoning as Classification over Structured Factors for Assessing Relevance of Legal Issues
Fanyu Wang, Xiaoxi Kang, Paul Burgess, Aashish Srivastava, Chetan Arora, Adnan Trakic, Lay-Ki Soon, Md Khalid Hossain, Lizhen Qu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: More than half of the global population struggles to meet their civil justice needs due to limited legal resources. While Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, significant challenges remain even at the foundational step of legal issue identification. To investigate LLMs’ capabilities in this task, we constructed a dataset from 769 real-world Malaysian Contract Act court cases, using GPT-4o to extract facts and generate candidate legal issues, annotated by senior legal experts. This reveals a critical limitation: while LLMs generate diverse issue candidates, their precision remains inadequate (GPT-4o achieves only 62%). To address this gap, we propose LePREC (Legal Professional-inspired Reasoning Elicitation and Classification), a neuro-symbolic framework combining neural generation with structured statistical reasoning. LePREC consists of: (1) a neuro component that leverages LLMs to transform legal descriptions into question-answer pairs representing diverse analytical factors, and (2) a symbolic component that applies sparse linear models over these discrete features, learning explicit algebraic weights that identify the most informative reasoning factors. Unlike end-to-end neural approaches, LePREC achieves interpretability through transparent feature weighting while maintaining data efficiency through correlation-based statistical classification. Experiments show a 30-40% improvement over advanced LLM baselines, including GPT-4o and Claude, confirming that correlation-based factor-issue analysis offers a more data-efficient solution for relevance decisions.
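The symbolic half of the framework, as described, amounts to a sparse linear classifier over discrete factor features; below is a scikit-learn sketch under that reading. The feature files and the L1 setup are illustrative, not the paper's configuration.

```python
# Sparse (L1-regularized) linear classifier over binary question-answer factor
# features, yielding interpretable per-factor weights.
import numpy as np
from sklearn.linear_model import LogisticRegression

# X: (n_candidates, n_factors) binary matrix, 1 if the factor question is
# answered "yes" for this candidate legal issue; y: expert relevance labels.
X = np.load("factor_features.npy")     # hypothetical file
y = np.load("relevance_labels.npy")    # hypothetical file

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

# The learned weights expose which analytical factors drive relevance.
for name, w in sorted(zip(range(X.shape[1]), clf.coef_[0]), key=lambda t: -abs(t[1]))[:10]:
    print(f"factor {name}: weight {w:+.3f}")
```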
[66] Rank-Turbulence Delta and Interpretable Approaches to Stylometric Delta Metrics
Dmitry Pronin, Evgeny Kazartsev
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This article introduces two new measures for authorship attribution - Rank-Turbulence Delta and Jensen-Shannon Delta - which generalise Burrows’s classical Delta by applying distance functions designed for probabilistic distributions. We first set out the theoretical basis of the measures, contrasting centred and uncentred z-scoring of word-frequency vectors and re-casting the uncentred vectors as probability distributions. Building on this representation, we develop a token-level decomposition that renders every Delta distance numerically interpretable, thereby facilitating close reading and the validation of results. The effectiveness of the methods is assessed on four literary corpora in English, German, French and Russian. The English, German and French datasets are compiled from Project Gutenberg, whereas the Russian benchmark is the SOCIOLIT corpus containing 755 works by 180 authors spanning the eighteenth to the twenty-first centuries. Rank-Turbulence Delta attains attribution accuracy comparable with Cosine Delta; Jensen-Shannon Delta consistently matches or exceeds the performance of canonical Burrows’s Delta. Finally, several established attribution algorithms are re-evaluated on the extended SOCIOLIT corpus.
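Both proposed measures have standard cores that can be sketched directly; the normalization constants, the most-frequent-word list, and the centring conventions follow the paper rather than this snippet.

```python
# Illustrative distance functions: a rank-turbulence-style divergence over
# word ranks and a Jensen-Shannon divergence over relative word frequencies.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import rankdata

def rank_turbulence(freq_a, freq_b, alpha=0.5):
    """freq_a, freq_b: aligned word-frequency vectors for two texts."""
    r_a = rankdata(-freq_a, method="average")
    r_b = rankdata(-freq_b, method="average")
    core = np.abs(r_a ** -alpha - r_b ** -alpha) ** (1.0 / (alpha + 1.0))
    return core.sum()          # unnormalized; the paper's version rescales this

def js_delta(freq_a, freq_b):
    p = freq_a / freq_a.sum()
    q = freq_b / freq_b.sum()
    return jensenshannon(p, q) ** 2   # squared distance = JS divergence
```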
[67] Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews
Bowen Li, Haochen Ma, Yuxin Wang, Jie Yang, Xinchi Chen, Xuanjing Huang, Yining Zheng, Xipeng Qiu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification–its arguments, questions, and critique–rather than a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics–particularly the recall of weakness arguments–correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
[68] Bangla Key2Text: Text Generation from Keywords for a Low Resource Language
Tonmoy Talukder, G M Shahariar
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper introduces Bangla Key2Text, a large-scale dataset of $2.6$ million Bangla keyword–text pairs designed for keyword-driven text generation in a low-resource language. The dataset is constructed using a BERT-based keyword extraction pipeline applied to millions of Bangla news texts, transforming raw articles into structured keyword–text pairs suitable for supervised learning. To establish baseline performance on this new benchmark, we fine-tune two sequence-to-sequence models, mT5 and BanglaT5, and evaluate them using multiple automatic metrics and human judgments. Experimental results show that task-specific fine-tuning substantially improves keyword-conditioned text generation in Bangla compared to zero-shot large language models. The dataset, trained models, and code are publicly released to support future research in Bangla natural language generation and keyword-to-text generation tasks.
[69] Emotion-Cause Pair Extraction in Conversations via Semantic Decoupling and Graph Alignment
Tianxiang Ma, Weijie Feng, Xinyu Wang, Zhiyong Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Emotion-Cause Pair Extraction in Conversations (ECPEC) aims to identify the set of causal relations between emotion utterances and their triggering causes within a dialogue. Most existing approaches formulate ECPEC as an independent pairwise classification task, overlooking the distinct semantics of emotion diffusion and cause explanation, and failing to capture globally consistent many-to-many conversational causality. To address these limitations, we revisit ECPEC from a semantic perspective and seek to disentangle emotion-oriented semantics from cause-oriented semantics, mapping them into two complementary representation spaces to better capture their distinct conversational roles. Building on this semantic decoupling, we naturally formulate ECPEC as a global alignment problem between the emotion-side and cause-side representations, and employ optimal transport to enable many-to-many and globally consistent emotion-cause matching. Based on this perspective, we propose a unified framework SCALE that instantiates the above semantic decoupling and alignment principle within a shared conversational structure. Extensive experiments on several benchmark datasets demonstrate that SCALE consistently achieves state-of-the-art performance. Our codes are released at https://github.com/CoCoSphere/SCALE.
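The global alignment step can be illustrated with a generic entropic optimal-transport (Sinkhorn) routine over the two representation spaces; the cost function, marginals, and pair-decoding rule below are assumptions, not SCALE's design.

```python
# Log-domain Sinkhorn iterations producing a soft transport plan between
# emotion-side and cause-side utterance representations.
import math
import torch

def sinkhorn(cost, n_iters=50, eps=0.1):
    """cost: (n_emotion, n_cause) pairwise distances; returns a transport plan."""
    log_k = -cost / eps
    log_u = torch.zeros(cost.shape[0])
    log_v = torch.zeros(cost.shape[1])
    log_a = torch.full((cost.shape[0],), -math.log(cost.shape[0]))  # uniform marginals
    log_b = torch.full((cost.shape[1],), -math.log(cost.shape[1]))
    for _ in range(n_iters):
        log_u = log_a - torch.logsumexp(log_k + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_k + log_u[:, None], dim=0)
    return torch.exp(log_u[:, None] + log_k + log_v[None, :])

# emotion_emb, cause_emb: (n_e, d), (n_c, d) decoupled utterance representations
# plan = sinkhorn(torch.cdist(emotion_emb, cause_emb))
# pairs = (plan > plan.mean()).nonzero()     # threshold the plan into emotion-cause pairs
```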
[70] Taming Actor-Observer Asymmetry in Agents via Dialectical Alignment
Bobo Li, Rui Wu, Zibo Ji, Meishan Zhang, Hao Fei, Min Zhang, Mong-Li Lee, Wynne Hsu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Model agents have rapidly evolved from static text generators into dynamic systems capable of executing complex autonomous workflows. To enhance reliability, multi-agent frameworks assigning specialized roles are increasingly adopted to enable self-reflection and mutual auditing. While such role-playing effectively leverages domain expert knowledge, we find it simultaneously induces a human-like cognitive bias known as Actor-Observer Asymmetry (AOA). Specifically, an agent acting as an actor (during self-reflection) tends to attribute failures to external factors, whereas an observer (during mutual auditing) attributes the same errors to internal faults. We quantify this using our new Ambiguous Failure Benchmark, which reveals that simply swapping perspectives triggers the AOA effect in over 20% of cases for most models. To tame this bias, we introduce ReTAS (Reasoning via Thesis-Antithesis-Synthesis), a model trained through dialectical alignment to enforce perspective-invariant reasoning. By integrating dialectical chain-of-thought with Group Relative Policy Optimization, ReTAS guides agents to synthesize conflicting viewpoints into an objective consensus. Experiments demonstrate that ReTAS effectively mitigates attribution inconsistency and significantly improves fault resolution rates in ambiguous scenarios.
[71] Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps
Jonas Waldendorf, Bashar Awwad Shiekh Hasan, Evgenii Tsymbalov
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hallucinations in Speech Large Language Models (SpeechLLMs) pose significant risks, yet existing detection methods typically rely on gold-standard outputs that are costly or impractical to obtain. Moreover, hallucination detection methods developed for text-based LLMs do not directly capture audio-specific signals. We investigate four attention-derived metrics: AUDIORATIO, AUDIOCONSISTENCY, AUDIOENTROPY, and TEXTENTROPY, designed to capture pathological attention patterns associated with hallucination, and train lightweight logistic regression classifiers on these features for efficient inference-time detection. Across automatic speech recognition and speech-to-text translation tasks, evaluations on Qwen-2-Audio and Voxtral-3B show that our approach outperforms uncertainty-based and prior attention-based baselines on in-domain data, achieving improvements of up to +0.23 PR-AUC, and generalises to out-of-domain ASR settings. We further find that strong performance can be achieved with approximately 100 attention heads, improving out-of-domain generalisation compared to using all heads. While effectiveness is model-dependent and task-specific training is required, our results demonstrate that attention patterns provide a valuable tool for hallucination detection in SpeechLLMs.
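A rough sketch of the detection recipe, assuming per-example attention maps and hallucination labels are available; the audio-ratio feature below only loosely mirrors AUDIORATIO, and the other three metrics are omitted.
```python
# Sketch of inference-time hallucination detection from attention features.
# The feature here is the share of attention mass on audio-token positions;
# the paper's exact metric definitions may differ.
import numpy as np
from sklearn.linear_model import LogisticRegression

def audio_ratio(attn, audio_positions):
    """attn: (heads, query_len, key_len) attention map for one example."""
    mass_on_audio = attn[:, :, audio_positions].sum(axis=-1)  # (heads, q)
    return mass_on_audio.mean(axis=-1)                        # one value per head

rng = np.random.default_rng(0)
n_examples, n_heads, q_len, k_len = 200, 8, 10, 50
audio_pos = np.arange(30)  # assume the first 30 key positions are audio tokens

X = np.stack([
    audio_ratio(rng.dirichlet(np.ones(k_len), size=(n_heads, q_len)), audio_pos)
    for _ in range(n_examples)
])                                   # (examples, heads) feature matrix
y = rng.integers(0, 2, n_examples)   # placeholder hallucination labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:3])[:, 1])  # hallucination probability per example
```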
[72] A Self-Evolving Framework for Efficient Terminal Agents via Observational Context Compression
Jincheng Ren, Siwei Wu, Yizhi Li, Kang Zhu, Shu Xu, Boyu Feng, Ruibin Yuan, Wei Zhang, Riza Batista-Navarro, Jian Yang, Chenghua Lin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As model capabilities advance, research has increasingly shifted toward long-horizon, multi-turn terminal-centric agentic tasks, where raw environment feedback is often preserved in the interaction history to support future decisions. However, repeatedly retaining such feedback introduces substantial redundancy and causes cumulative token cost to grow quadratically with the number of steps, hindering long-horizon reasoning. Although observation compression can mitigate this issue, the heterogeneity of terminal environments makes heuristic-based or fixed-prompt methods difficult to generalize. We propose TACO, a plug-and-play, self-evolving Terminal Agent Compression framework that automatically discovers and refines compression rules from interaction trajectories for existing terminal agents. Experiments on TerminalBench (TB 1.0 and TB 2.0) and four additional terminal-related benchmarks (i.e., SWE-Bench Lite, CompileBench, DevEval, and CRUST-Bench) show that TACO consistently improves performance across mainstream agent frameworks and strong backbone models. With MiniMax-2.5, it improves performance on most benchmarks while reducing token overhead by around 10%. On TerminalBench, it brings consistent gains of 1%-4% across strong agentic models, and further improves accuracy by around 2%-3% under the same token budget. These results demonstrate the effectiveness and generalization of self-evolving, task-aware compression for terminal agents.
[73] Impact of large language models on peer review opinions from a fine-grained perspective: Evidence from top conference proceedings in AI
Wenqing Wu, Chengzhi Zhang, Yi Zhao, Tong Bao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: With the rapid advancement of Large Language Models (LLMs), the academic community has faced unprecedented disruptions, particularly in the realm of academic communication. The primary function of peer review is to improve the quality of academic manuscripts along dimensions such as clarity, originality, and other evaluation aspects. Although prior studies suggest that LLMs are beginning to influence peer review, it remains unclear whether they are altering its core evaluative functions. Moreover, the extent to which LLMs affect the linguistic form, evaluative focus, and recommendation-related signals of peer-review reports has yet to be systematically examined. In this study, we examine the changes in peer review reports for academic articles following the emergence of LLMs, emphasizing variations at a fine-grained level. Specifically, we investigate linguistic features such as the length and complexity of words and sentences in review comments, while also automatically annotating the evaluation aspects of individual review sentences. We also use a previously established maximum likelihood estimation method to identify review reports that have potentially been modified or generated by LLMs. Finally, we assess the impact of evaluation aspects mentioned in LLM-assisted review reports on the informativeness of recommendations for paper decision-making. The results indicate that following the emergence of LLMs, peer review texts have become longer and more fluent, with increased emphasis on summaries and surface-level clarity, as well as more standardized linguistic patterns, particularly among reviewers with lower confidence scores. At the same time, attention to deeper evaluative dimensions, such as originality, replicability, and nuanced critical reasoning, has declined.
[74] A Bolu: A Structured Dataset for the Computational Analysis of Sardinian Improvisational Poetry
Silvio Calderaro, Johanna Monti
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The growing interest of Natural Language Processing (NLP) in minority languages has not yet bridged the gap in the preservation of oral linguistic heritage. In particular, extemporaneous poetry - a performative genre based on real-time improvisation and metrical-rhetorical competence - remains a largely unexplored area of computational linguistics. This methodological gap necessitates the creation of specific resources to document and analyse the structures of improvised poetry. This is the context in which A Bolu was created, the first structured corpus of extemporaneous poetry dedicated to cantada logudorese, a variant of the Sardinian language. The dataset comprises 2,835 stanzas for a total of 141,321 tokens. The study presents the architecture of the corpus and applies a multidimensional analysis combining descriptive statistical indices and computational linguistics techniques to map the characteristics of the poetic text. The results indicate that the production of Sardinian extemporaneous poets is characterised by recurring patterns that support Parry and Lord’s theory of formulaicity. This evidence not only provides a new key to understanding oral creativity, but also offers a significant contribution to the development of NLP tools that are more inclusive and sensitive to the specificities of less widely spoken languages.
[75] RoLegalGEC: Legal Domain Grammatical Error Detection and Correction Dataset for Romanian
Mircea Timpuriu, Dumitru-Clementin Cercel
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The importance of clear and correct text in legal documents cannot be overstated. Consequently, a grammatical error correction tool meant to assist legal professionals must be able to recognise and correct possible errors in the context of a legal environment, which implicitly requires training in that same environment on realistic legal data. However, the manually annotated data required by such a process is in short supply for languages such as Romanian, much less for a niche domain. The most common approach is the synthetic generation of parallel data; however, it requires a structured understanding of Romanian grammar. In this paper, we introduce, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that transform the dataset into a valuable tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We believe that this set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.
[76] Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Kihyuk Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating that its high similarity score derived from text duplication rather than consistent reasoning. Identical decoding settings thus yielded fundamentally different consistency profiles, a distinction that single-output evaluations cannot capture. Safety expression reached ceiling levels across all models, confirming its limited utility as a differentiating metric. These results indicate that model selection constitutes a clinical rather than merely technical decision, and that output behavior under repeated generation conditions should be treated as a core criterion for reliable deployment of LLM-based exercise prescription systems.
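A small sketch separating the two behaviors the abstract distinguishes, output reproducibility versus semantic similarity, over one scenario's repeated generations; TF-IDF cosine is a lightweight stand-in for the paper's semantic-similarity measure, and the texts are placeholders.
```python
# Sketch: two consistency views over repeated generations for one scenario.
# TF-IDF cosine is a lightweight stand-in for the paper's semantic similarity.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def consistency_profile(outputs):
    unique_ratio = len(set(outputs)) / len(outputs)          # output reproducibility
    vecs = TfidfVectorizer().fit_transform(outputs)
    sims = [cosine_similarity(vecs[i], vecs[j])[0, 0]
            for i, j in combinations(range(len(outputs)), 2)]
    return unique_ratio, sum(sims) / len(sims)               # mean pairwise similarity

# Hypothetical repeated generations (temperature=0) for one clinical scenario.
outputs = [
    "Walk 30 minutes, 5 days per week, at moderate intensity.",
    "Walk 30 minutes, 5 days per week, at moderate intensity.",
    "Brisk walking 150 minutes weekly, spread over 5 sessions.",
] * 5
print(consistency_profile(outputs))
```
A model can score high on similarity while repeating itself verbatim (low unique ratio), or produce all-unique outputs that stay semantically stable; reporting both numbers is what distinguishes the two behaviors.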
[77] The “Small World of Words” German Free-Association Norms
Samuel Aeschbach, Rui Mata, Kaidi Lõo, Simon De Deyne, Dirk U. Wulff
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Free-association norms provide essential empirical data for investigating linguistic, semantic, and cultural phenomena in the cognitive sciences. Although large-scale norms exist for languages such as English, Dutch, Spanish, and Mandarin Chinese, no comparable resource has been available for German. To address this gap, we present free-association norms for 5,877 German cue words as part of the German version of the multilingual Small World of Words (SWOW) project. We describe the data collection procedures, participant characteristics, and our comprehensive preprocessing pipeline before introducing the resulting SWOW-DE data set. Using data from three established psycholinguistic paradigms, we show that SWOW-DE norms robustly predict performance in lexical decision tasks, relatedness judgments, and psycholinguistic word ratings. Furthermore, we demonstrate that SWOW-DE responses compare favorably with existing German resources and provide a preliminary cross-linguistic comparison revealing both shared and language-specific association patterns, highlighting promising directions for future research. Overall, SWOW-DE represents the largest collection of German free associations to date and offers a unique resource for linguistic, psychological, and cross-cultural research.
[78] Micro Language Models Enable Instant Responses
Wen Cheng, Tuochao Chen, Karim Helwani, Sriram Srinivasan, Luke Zettlemoyer, Shyamnath Gollakota
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Edge devices such as smartwatches and smart glasses cannot continuously run even the smallest 100M-1B parameter language models due to power and compute constraints, yet cloud inference introduces multi-second latencies that break the illusion of a responsive assistant. We introduce micro language models ($μ$LMs): ultra-compact models (8M-30M parameters) that instantly generate the first 4-8 words of a contextually grounded response on-device, while a cloud model completes it, thus masking the cloud latency. We show that useful language generation survives at this extreme scale, with our models matching several existing 70M-256M-class models. We design a collaborative generation framework that reframes the cloud model as a continuator rather than a respondent, achieving seamless mid-sentence handoffs and structured graceful recovery via three error correction methods when the local opener goes wrong. Empirical results show that $μ$LMs can initiate responses that larger models complete seamlessly, demonstrating that orders-of-magnitude asymmetric collaboration is achievable and unlocking responsive AI for extremely resource-constrained devices. The model checkpoint and demo are available at https://github.com/Sensente/micro_language_model_swen_project.
[79] The signal is the ceiling: Measurement limits of LLM-predicted experience ratings from open-ended survey text
Andrew Hong, Jason Potteiger, Luis E. Zapata
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: An earlier paper (Hong, Potteiger, and Zapata 2026) established that an unoptimized GPT 4.1 prompt predicts fan-reported experience ratings within one point 67% of the time from open-ended survey text. This paper tests the relative impact of prompt design and model selection on that performance. We compared four configurations on approximately 10,000 post-game surveys from five MLB teams: the original baseline prompt and a moderately customized version, crossed with three GPT models (4.1, 4.1-mini, 5.2). Prompt customization added roughly two percentage points of within +/-1 agreement on GPT 4.1 (from 67% to 69%). Both model swaps from that best configuration degraded performance: GPT 5.2 returned to the baseline, and GPT 4.1-mini fell six percentage points below it. Both levers combined were dwarfed by the input itself: across capable configurations, accuracy varied more than an order of magnitude more by the linguistic character of the text than by the choice of prompt or model. The ceiling has two parts. One is a bias in how the model reads text, which prompt design can correct. The other is a difference between what fans write about and what they actually decide, which no engineering can close because the missing information is not in the text. Prompt customization moved the first part; model selection moved neither reliably. The result is not that “prompt engineering helps a little” but that prompt engineering helps in a specific and predictable way, on the part of the ceiling it can reach.
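A tiny sketch of the within +/-1 agreement metric the study reports; the rating values below are placeholders, not data from the surveys.
```python
# Sketch: "within +/-1 agreement" between LLM-predicted and fan-reported ratings.
# All values are hypothetical.
def within_one_agreement(predicted, reported):
    hits = sum(abs(p - r) <= 1 for p, r in zip(predicted, reported))
    return hits / len(reported)

predicted = [7, 5, 9, 3, 8, 6]   # hypothetical LLM predictions from survey text
reported  = [8, 5, 6, 4, 8, 9]   # hypothetical fan-reported experience ratings
print(f"{within_one_agreement(predicted, reported):.0%}")  # e.g. 67%
```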
[80] Pause or Fabricate? Training Language Models for Grounded Reasoning
Yiwen Qiu, Linjuan Wu, Yizhou Liu, Yuchen Yan, Jin Ma, Xu Tan, Yao Hu, Daoxin Zhang, Wenqi Zhang, Weiming Lu, Jun Xiao, Yongliang Shen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models have achieved remarkable progress on complex reasoning tasks. However, they often implicitly fabricate information when inputs are incomplete, producing confident but unreliable conclusions – a failure mode we term ungrounded reasoning. We argue that this issue arises not from insufficient reasoning capability, but from the lack of inferential boundary awareness – the ability to recognize when the necessary premises for valid inference are missing. To address this issue, we propose Grounded Reasoning via Interactive Reinforcement Learning (GRIL), a multi-turn reinforcement learning framework for grounded reasoning under incomplete information. GRIL decomposes the reasoning process into two stages: clarify and pause, which identifies whether the available information is sufficient, and grounded reasoning, which performs task solving once the necessary premises are established. We design stage-specific rewards to penalize hallucinations, enabling models to detect gaps, stop proactively, and resume reasoning after clarification. Experiments on GSM8K-Insufficient and MetaMATH-Insufficient show that GRIL significantly improves premise detection (up to 45%), leading to a 30% increase in task success while reducing average response length by over 20%. Additional analyses confirm robustness to noisy user responses and generalization to out-of-distribution tasks.
[81] Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation
Nurkhan Laiyk, Gerard I. Gállego, Javier Ferrando, Fajri Koto
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Function vectors (FVs) are vector representations of tasks extracted from model activations during in-context learning. While prior work has shown that multilingual model representations can be language-agnostic, it remains unclear whether the same holds for function vectors. We study whether FVs exhibit language-agnosticity, using machine translation as a case study. Across three decoder-only multilingual LLMs, we find that translation FVs extracted from a single English$\rightarrow$Target direction transfer to other target languages, consistently improving the rank of correct translation tokens across multiple unseen languages. Ablation results show that removing the FV degrades translation across languages with limited impact on unrelated tasks. We further show that base-model FVs transfer to instruction-tuned variants and partially generalize from word-level to sentence-level translation.
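A minimal sketch of the steering mechanics only, assuming a translation function vector has already been extracted (e.g., by averaging task-relevant attention-head outputs, as in prior FV work); the model name, layer index, and zero vector are placeholders, not the paper's setup.
```python
# Sketch: injecting a precomputed function vector (FV) into the residual stream
# at one decoder layer via a forward hook. Model, layer, and FV are placeholders;
# FV extraction is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a multilingual decoder-only LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

layer_idx = 6
fv = torch.zeros(model.config.hidden_size)  # placeholder translation FV

def add_fv(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + fv                    # add FV at every position
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(add_fv)
ids = tok("The word 'cat' in French is", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=5)
handle.remove()
print(tok.decode(out[0]))
```
The ablation direction works the same way: subtracting the projection of the hidden state onto the FV instead of adding the vector.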
[82] An Answer is just the Start: Related Insight Generation for Open-Ended Document-Grounded QA
Saransh Sharma, Pritika Ramu, Aparna Garimella, Koyel Mukherjee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Answering open-ended questions remains challenging for AI systems because it requires synthesis, judgment, and exploration beyond factual retrieval, and users often refine answers through multiple iterations rather than accepting a single response. Existing QA benchmarks do not explicitly support this refinement process. To address this gap, we introduce a new task, document-grounded related insight generation, where the goal is to generate additional insights from a document collection that help improve, extend, or rethink an initial answer to an open-ended question, ultimately supporting richer user interaction and a better overall question answering experience. We curate and release SCOpE-QA (Scientific Collections for Open-Ended QA), a dataset of 3,000 open-ended questions across 20 research collections. We present InsightGen, a two-stage approach that first constructs a thematic representation of the document collection using clustering, and then selects related context based on neighborhood selection from the thematic graph to generate diverse and relevant insights using LLMs. Extensive evaluation on 3,000 questions using two generation models and two evaluation settings shows that InsightGen consistently produces useful, relevant, and actionable insights, establishing a strong baseline for this new task.
[83] Epistemic orientation in parliamentary discourse is associated with deliberative democracy
Segun Aroyehun, Stephan Lewandowsky, David Garcia
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The pursuit of truth is central to democratic deliberation and governance, yet political discourse reflects varying epistemic orientations, ranging from evidence-based reasoning grounded in verifiable information to intuition-based reasoning rooted in beliefs and subjective interpretation. We introduce a scalable approach to measure epistemic orientation using the Evidence–Minus–Intuition (EMI) score, derived from large language model (LLM) ratings and embedding-based semantic similarity. Applying this approach to 15 million parliamentary speech segments spanning 1946 to 2025 across seven countries, we examine temporal patterns in discourse and its association with deliberative democracy and governance. We find that EMI is positively associated with deliberative democracy within countries over time, with consistent relationships in both contemporaneous and lagged analyses. EMI is also positively associated with the transparency and predictable implementation of laws as a dimension of governance. These findings suggest that the epistemic nature of political discourse is crucial for both the quality of democracy and governance.
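A toy sketch of how per-segment ratings could be aggregated into an Evidence-Minus-Intuition score; the cue-counting rater is a stand-in for the paper's LLM- and embedding-based scoring, and the speeches are invented.
```python
# Sketch: aggregating an Evidence-Minus-Intuition (EMI) style score.
# `rate_segment` is a placeholder for the paper's LLM / embedding-based rater,
# assumed to return evidence and intuition scores on a common scale.
def rate_segment(text):
    evidence_cues = ("data", "study", "evidence", "statistics", "report")
    intuition_cues = ("believe", "feel", "common sense", "obviously", "gut")
    t = text.lower()
    return (sum(c in t for c in evidence_cues),
            sum(c in t for c in intuition_cues))

def emi(segments):
    scores = [rate_segment(s) for s in segments]
    return sum(e - i for e, i in scores) / len(scores)

speech = [
    "The latest employment statistics and the committee's report support this bill.",
    "I simply believe, as a matter of common sense, that this policy will fail.",
]
print(emi(speech))  # positive = more evidence-oriented, negative = more intuition-oriented
```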
[84] Discovering a Shared Logical Subspace: Steering LLM Logical Reasoning via Alignment of Natural-Language and Symbolic Views
Feihao Fang, My T. Thai, Yuanyuan Lei
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) still struggle with multi-step logical reasoning. Existing approaches either purely refine the reasoning chain in natural language form or attach a symbolic solver as an external module. In this work, we instead ask whether LLMs contain a shared internal logical subspace that simultaneously aligns natural-language and symbolic-language views of the reasoning process. Our hypothesis is that this logical subspace captures logical reasoning capabilities in LLMs that are shared across views while remaining independent of surface forms. To verify this, we employ Canonical Correlation Analysis on the paired residual activations from natural-language and symbolic-language reasoning chains, learning a low-dimensional subspace with maximum cross-view correlation. Furthermore, we design a training-free approach that steers the LLM’s reasoning chain along this logical subspace, thereby leveraging the complementary reasoning signals from both views. Experiments on four logical reasoning benchmarks demonstrate the effectiveness of our approach, improving accuracy by up to 11 percentage points and generalizing well on out-of-domain problems.
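A compact sketch of the subspace-learning step, assuming paired residual activations from natural-language and symbolic reasoning chains are available; the random features, layer choice, and the simple additive steering at the end are illustrative only, not the paper's procedure.
```python
# Sketch: learning a shared low-dimensional subspace from paired activations of
# natural-language and symbolic reasoning chains with CCA. Activations here are
# random placeholders for residual-stream features at a chosen layer.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_pairs, d_model, k = 500, 256, 8

nl_acts = rng.normal(size=(n_pairs, d_model))                         # natural-language view
sym_acts = 0.5 * nl_acts + 0.5 * rng.normal(size=(n_pairs, d_model))  # symbolic view

cca = CCA(n_components=k).fit(nl_acts, sym_acts)
nl_proj, sym_proj = cca.transform(nl_acts, sym_acts)   # coordinates in the shared subspace

# The NL-side loadings span a candidate "logical subspace"; boosting a new
# activation's component along those directions is one simple way to steer.
W = cca.x_weights_                                      # (d_model, k)
h = rng.normal(size=d_model)                            # new residual activation
h_steered = h + 0.5 * (W @ (W.T @ h))
print(nl_proj.shape, h_steered.shape)
```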
[85] Persuasion with Large Language Models: A Survey of Empirical Evidence, Study Methodologies, and Ethical Implications
Sander Noels, Alexander Rogiers, Maarten Buyl, Tijl De Bie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid rise of Large Language Models (LLMs) has created new disruptive possibilities for persuasive communication, enabling fully-automated, personalized, and interactive content generation at an unprecedented scale. In this paper, we survey the emerging field of LLM-based persuasion, reviewing empirical studies that measure the influence of LLM systems on human attitudes and behaviors. We categorize applications across domains such as politics, marketing, public health, e-commerce, and charitable giving, finding that such systems have frequently achieved human-level or even superhuman persuasiveness. Synthesizing recent evidence, we identify key factors influencing this effectiveness, including the interaction approach, model scale and capability, prompt design, personalization, and AI source disclosure. Furthermore, we critically examine the experimental designs and success metrics used to evaluate these systems, distinguishing between direct behavioral outcomes and proxy indicators. Our survey suggests that the current capabilities of LLM-based persuasion pose profound ethical and societal risks, including to information integrity, fairness and inclusion, privacy, and individual autonomy. These risks underscore the urgent need for ethical guidelines and updated regulatory frameworks to avoid the widespread deployment of irresponsible and harmful LLM systems.
[86] FoNE: Precise Single-Token Number Embeddings via Fourier Features
Tianyi Zhou, Deqing Fu, Mahdi Soltanolkotabi, Robin Jia, Vatsal Sharan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) typically represent numbers using multiple tokens, which requires the model to aggregate these tokens to interpret numerical values. This fragmentation makes both training and inference less efficient and adversely affects the model’s performance on number-related tasks. Inspired by the observation that pre-trained LLMs internally learn Fourier-like features for number tokens, we propose Fourier Number Embedding (FoNE), a novel method that directly maps numbers into the embedding space with their Fourier features. FoNE encodes each number as a single token with only two embedding dimensions per digit, effectively capturing numerical values without fragmentation. This compact representation accelerates both training and inference. Compared to traditional subword and digit-wise embeddings, FoNE not only reduces computational overhead but also achieves higher accuracy across various numerical tasks including addition, subtraction and multiplication. On 6-digit decimal addition, FoNE requires 64$\times$ less data to achieve 99% accuracy than subword and digit-wise embeddings while using 3$\times$ and 6$\times$ fewer tokens per number, respectively. Furthermore, FoNE is the only method that yields 100% accuracy on over 100,000 test examples for addition, subtraction, and multiplication. The codes and visualization are available at https://fouriernumber.github.io/.
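A small sketch of the Fourier-per-digit idea, assuming two dimensions (cos, sin) for each digit position, each encoding the number modulo 10, 100, 1000, and so on; FoNE's exact frequencies and scaling may differ.
```python
# Sketch of a Fourier-style number embedding: two dimensions (cos, sin) per
# digit position. Illustrative only; FoNE's exact construction may differ.
import numpy as np

def fourier_number_embedding(x, n_digits=6):
    feats = []
    for k in range(n_digits):
        period = 10 ** (k + 1)
        angle = 2 * np.pi * (x % period) / period
        feats.extend([np.cos(angle), np.sin(angle)])
    return np.array(feats)           # shape: (2 * n_digits,)

e = fourier_number_embedding(123456)
print(e.shape)                       # (12,) -> a compact single-token embedding slice

# Each (cos, sin) pair is decodable: the phase of pair k recovers x mod 10**(k+1),
# so the number survives the compression without being split into sub-tokens.
```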
[87] Rethinking Information Synthesis in Multimodal Question Answering A Multi-Agent Perspective
Krishna Singh Rajput, Tejas Anvekar, Chitta Baral, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in multimodal question answering have primarily focused on combining heterogeneous modalities or fine-tuning multimodal large language models. While these approaches have shown strong performance, they often rely on a single, generalized reasoning strategy, overlooking the unique characteristics of each modality, ultimately limiting both accuracy and interpretability. To address these limitations, we propose MAMMQA, a multi-agent QA framework for multimodal inputs spanning text, tables, and images. Our system includes two Visual Language Model (VLM) agents and one text-based Large Language Model (LLM) agent. The first VLM decomposes the user query into sub-questions and sequentially retrieves partial answers from each modality. The second VLM synthesizes and refines these results through cross-modal reasoning. Finally, the LLM integrates the insights into a cohesive answer. This modular design enhances interpretability by making the reasoning process transparent and allows each agent to operate within its domain of expertise. Experiments on diverse multimodal QA benchmarks demonstrate that our cooperative, multi-agent framework consistently outperforms existing baselines in both accuracy and robustness.
[88] TabXEval: Why this is a Bad Table? An eXhaustive Rubric for Table Evaluation
Vihang Pancholi, Jainit Bafna, Tejas Anvekar, Manish Shrivastava, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating tables qualitatively and quantitatively poses a significant challenge, as standard metrics often overlook subtle structural and content-level discrepancies. To address this, we propose a rubric-based evaluation framework that integrates multi-level structural descriptors with fine-grained contextual signals, enabling more precise and consistent table comparison. Building on this, we introduce TabXEval, an eXhaustive and eXplainable two-phase evaluation framework. TabXEval first aligns reference and predicted tables structurally via TabAlign, then performs semantic and syntactic comparison using TabCompare, offering interpretable and granular feedback. We evaluate TabXEval on TabXBench, a diverse, multi-domain benchmark featuring realistic table perturbations and human annotations. A sensitivity-specificity analysis further demonstrates the robustness and explainability of TabXEval across varied table tasks. Code and data are available at https://coral-lab-asu.github.io/tabxeval/
[89] StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Anya Sims, Thom Foster, Klara Kaleb, Tuan-Duy H. Nguyen, Joseph Lee, Jakob N. Foerster, Yee Whye Teh, Cong Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still struggle disproportionally with simple subword-level tasks like ‘How many r’s in strawberry?’. A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to ‘see’ their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs’ downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok’s simplicity allows seamless integration at any stage of the training pipeline; and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: github.com/anyasims/stochastok.
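A rough sketch of stochastic token splitting, assuming a Hugging Face-style tokenizer with encode/decode; StochasTok's exact splitting procedure may differ.
```python
# Sketch of stochastic token splitting: with probability p, a token id is
# replaced by sub-token sequences that decode to the same surface string.
# Tokenizer handling is simplified; StochasTok's procedure may differ.
import random

def stochastok(token_ids, tokenizer, p=0.1, seed=0):
    rng = random.Random(seed)
    out = []
    for tid in token_ids:
        text = tokenizer.decode([tid])
        if len(text) > 1 and rng.random() < p:
            cut = rng.randint(1, len(text) - 1)            # random split point
            out += tokenizer.encode(text[:cut], add_special_tokens=False)
            out += tokenizer.encode(text[cut:], add_special_tokens=False)
        else:
            out.append(tid)
    return out

# Hypothetical usage with a Hugging Face tokenizer:
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# ids = tok.encode("strawberry", add_special_tokens=False)
# print(tok.decode(stochastok(ids, tok, p=1.0)))   # same text, finer-grained tokens
```
Because the surface string is unchanged, the same training text exposes the model to many alternative segmentations, which is how it gets to "see" the internal structure of words.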
[90] PuzzleWorld: A Benchmark for Multimodal, Open-Ended Reasoning in Puzzlehunts
Hengzhi Li, Justin Zhang, Brendon Jiang, Alexander Naehu, Regan Song, Megan Tjandrasuwita, Chanakya Ekbote, Steven-Shine Chen, Adithya Balachandran, Wei Dai, Rebecca Chang, Paul Pu Liang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Puzzlehunts are a genre of complex, multi-step puzzles lacking well-defined problem definitions. In contrast to conventional reasoning benchmarks consisting of tasks with clear instructions and constrained environments, puzzlehunts require discovering the underlying problem structure from multimodal evidence and iterative reasoning, mirroring real-world domains such as scientific discovery, exploratory data analysis, or investigative problem-solving. Despite progress in foundation models, their performance in open-ended settings remains largely untested. We introduce PuzzleWorld, a comprehensive benchmark of 667 puzzlehunt-style problems designed to assess step-by-step, open-ended, and creative multimodal reasoning. Each puzzle is annotated with the final solution, detailed reasoning traces, and cognitive skill labels, enabling holistic benchmarking and fine-grained diagnostic analysis. Most state-of-the-art models achieve only 1-4% final answer accuracy. On PuzzleWorld, the best model solves only 18% of puzzles and reaches 40% stepwise accuracy, matching human puzzle novices but falling significantly behind puzzle enthusiasts. To demonstrate the value of our reasoning annotations, we show that fine-tuning a small model on reasoning traces boosts stepwise accuracy from 4% to 11%, which translates to improvements in downstream visual reasoning tasks. Our detailed error analysis reveals that current models exhibit myopic reasoning, are bottlenecked by the limitations of language-based inference, and lack sketching capabilities crucial for visual and spatial reasoning. We release PuzzleWorld at https://github.com/MIT-MI/PuzzleWorld to support future work on building more general, open-ended, and creative reasoning systems.
[91] Improving the Distributional Alignment of LLMs using Supervision
Gauri Kambhatla, Sanjana Gautam, Angela Zhang, Alex Liu, Ravi Srinivasan, Junyi Jessy Li, Matthew Lease
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The ability to accurately align LLMs with diverse population groups on subjective questions would have great value. In this work, we show that adding simple supervision can more consistently improve the alignment of LLM-generated distributions with diverse population groups, as measured across three datasets spanning public health, public opinion, and values and beliefs. Beyond evaluating average alignment, we also report how alignment varies across specific groups. Our broad findings provide insights into the distributional alignment of LLM generations with diverse populations. By conducting evaluation over many LLMs and prompting strategies, we provide a benchmark to stimulate future research.
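A minimal sketch of one way distributional alignment can be scored, comparing an LLM's answer distribution with a group's survey distribution over the same options; the total-variation-based measure and the numbers are assumptions for illustration, not necessarily the paper's metric.
```python
# Sketch: comparing an LLM's answer distribution to a group's survey
# distribution over the same options. 1 - total variation is one common
# alignment score; the paper's exact metric may differ.
import numpy as np

def alignment(llm_dist, group_dist):
    llm = np.asarray(llm_dist, dtype=float); llm /= llm.sum()
    grp = np.asarray(group_dist, dtype=float); grp /= grp.sum()
    return 1.0 - 0.5 * np.abs(llm - grp).sum()   # 1 = identical, 0 = disjoint

# Hypothetical 4-option opinion question.
llm_answers   = [0.10, 0.25, 0.40, 0.25]   # from repeated / sampled LLM responses
group_answers = [0.05, 0.30, 0.45, 0.20]   # from survey data for one group
print(f"alignment = {alignment(llm_answers, group_answers):.2f}")
```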
[92] Accelerating Prefilling via Decoding-time Contribution Sparsity
Zhiyuan He, Yike Zhang, Chengruidong Zhang, Huiqiang Jiang, Yuqing Yang, Lili Qiu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) incur quadratic attention complexity with input length, creating a major time bottleneck in the prefilling stage. Existing acceleration methods largely exploit attention score sparsity by estimating blocks with high attention scores and applying dynamic sparse attention. In this work, we identify another untapped form of sparsity in the prefilling stage, namely decoding-time contribution sparsity, where many attention blocks exhibit nontrivial attention scores during prefilling yet contribute negligibly to subsequent decoding, as indicated by gradient-based analysis. Building on this observation, we propose TriangleMix, a training-free static attention pattern that uses dense attention in a subset of layers and switches to Triangle attention in the others. Extensive experiments show that TriangleMix preserves nearly lossless performance relative to dense attention while substantially reducing attention overhead in Triangle layers. For 128K inputs, Triangle attention achieves a 15.3x speedup in attention computation, significantly exceeding the acceleration of typical dynamic sparse methods (1.9x to 3.4x). Furthermore, TriangleMix can be seamlessly combined with dynamic sparsity approaches, delivering an additional 6% to 19% reduction in TTFT over using dynamic sparsity alone. Our code is released at https://aka.ms/TriangleMix.
[93] SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
Junjie Wu, Jiangnan Li, Yuqing Li, Lemao Liu, Liyan Xu, Jiwei Li, Dit-Yan Yeung, Jie Zhou, Mo Yu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Retrieval-augmented generation (RAG) over long documents typically involves splitting the text into smaller chunks, which serve as the basic units for retrieval. However, due to dependencies across the original document, contextual information is often essential for accurately interpreting each chunk. To address this, prior work has explored encoding longer context windows to produce embeddings for longer chunks. Despite these efforts, gains in retrieval and downstream tasks remain limited. This is because (1) longer chunks strain the capacity of embedding models due to the increased amount of information they must encode, and (2) many real-world applications still require returning localized evidence due to constraints on model or human bandwidth. We propose an alternative approach to this challenge by representing short chunks in a way that is conditioned on a broader context window to enhance retrieval performance – i.e., situating a chunk’s meaning within its context. We further show that existing embedding models are not well-equipped to encode such situated context effectively, and thus introduce a new training paradigm and develop the situated embedding models (SitEmb). To evaluate our method, we curate a book-plot retrieval dataset specifically designed to assess situated retrieval capabilities. On this benchmark, our SitEmb-v1 model based on BGE-M3 substantially outperforms state-of-the-art embedding models, including several with up to 7-8B parameters, with only 1B parameters. Our 8B SitEmb-v1.5 model further improves performance by over 10% and shows strong results across different languages and several downstream applications.
[94] Comparing energy consumption and accuracy in text classification inference
Johannes Zschache, Tilman Hartwig
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The increasing deployment of large language models (LLMs) in natural language processing (NLP) tasks raises concerns about energy efficiency and sustainability. While prior research has largely focused on energy consumption during model training, the inference phase has received comparatively less attention. This study systematically evaluates the trade-offs between model accuracy and energy consumption in text classification inference across various model architectures and hardware configurations. Our empirical analysis shows that in some contexts the best-performing model in terms of accuracy can also be energy-efficient. While LLMs tend to consume significantly more energy than traditional machine learning models, they show the same or even lower levels of accuracy in our zero-shot classification setting. We observe substantial variability in inference energy consumption ($<$mWh to $>$kWh), influenced by model type, model size, and hardware specifications. Additionally, we find a strong correlation between inference energy consumption and model runtime, indicating that execution time can serve as a practical proxy for energy usage in settings where direct measurement is not feasible. Our findings demonstrate that energy efficiency and accuracy represent distinct evaluation dimensions that do not necessarily align. We argue that sustainable AI development requires systematic evaluation of both performance and resource efficiency.
[95] A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains
Xianren Zhang, Shreyas Prasad, Di Wang, Qiuhai Zeng, Suhang Wang, Wenbo Yan, Mat Hans
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Web agents have shown great promise in performing many tasks on e-commerce websites. To assess their capabilities, several benchmarks have been introduced. However, current benchmarks in the e-commerce domain face two major problems. First, they primarily focus on product search tasks (e.g., Find an Apple Watch), failing to capture the broader range of functionalities offered by real-world e-commerce platforms such as Amazon, including account management and gift card operations. Second, existing benchmarks typically evaluate whether the agent completes the user query, but ignore the potential risks involved. In practice, web agents can make unintended changes that negatively impact the user account or status. For instance, an agent might purchase the wrong item, delete a saved address, or incorrectly configure an auto-reload setting. To address these gaps, we propose a new benchmark called Amazon-Bench. To generate user queries that cover a broad range of tasks, we propose a data generation pipeline that leverages webpage content and interactive elements (e.g., buttons, checkboxes) to create diverse, functionality-grounded user queries covering tasks such as address management, wish list management, and brand store following. To improve the agent evaluation, we propose an automated evaluation framework that assesses both the performance and the safety of web agents. We systematically evaluate different agents, finding that current agents struggle with complex queries and pose safety risks. These results highlight the need for developing more robust and reliable web agents.
[96] BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design
Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We propose a general-purpose approach for improving the ability of large language models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian experimental design with large language models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) with respect to a variable of interest given the responses gathered previously. We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM’s predictive distributions and provide detailed insights into key decisions in its construction and updating procedure. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20 Questions game and using the LLM to actively infer user preferences, compared to purely prompting-based design generation and other adaptive design strategies.
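A small sketch of the EIG-based question selection, with hand-written answer distributions standing in for the probabilistic model built from the LLM's predictive distributions; the hypotheses, questions, and probabilities are placeholders.
```python
# Sketch: choosing the next question by expected information gain (EIG),
#   EIG(q) = H[ mean_theta p(y | theta, q) ] - mean_theta H[ p(y | theta, q) ],
# where theta are hypotheses sampled from the current posterior and
# p(y | theta, q) comes from the LLM's predictive distribution. Both are
# stubbed with hand-written probabilities for a 20-Questions-style setup.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + 1e-12))

def eig(answer_dists):
    """answer_dists: (n_hypotheses, n_answers) rows are p(y | theta, q)."""
    marginal = answer_dists.mean(axis=0)
    return entropy(marginal) - np.mean([entropy(row) for row in answer_dists])

# Two candidate yes/no questions, evaluated under 3 sampled hypotheses.
q_informative = np.array([[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]])  # answer depends on theta
q_useless     = np.array([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]])  # same answer regardless

print(eig(q_informative), eig(q_useless))  # the informative question scores higher
```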
[97] InsideOut: Measuring and Mitigating Insider-Outsider Bias in Interview Script Generation
Yixin Wan, Xingrun Chen, Kai-Wei Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Advancements in large language models (LLMs) have enabled a variety of downstream applications like story and interview script generation. However, recent research has raised concerns about culture-related fairness issues in LLM-generated content. In this work, we identify and systematically investigate LLMs’ insider-outsider bias, a phenomenon where models position themselves as “insiders” of mainstream cultures during generation while externalizing less dominant cultures. We propose the InsideOut benchmark with 4,000 generation prompts and three evaluation metrics to quantify this bias through a culturally situated interview script generation task, in which an LLM is positioned as a reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals that while models adopt insider tones in over 88% of US-contexted scripts on average, they disproportionately default to “outsider” stances for non-Western cultures. To mitigate these biases, we propose 2 inference-time methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of a Single-Agent (MFA-SA), a Hierarchical-Agent (MFA-HA), and an autonomous Agentic Planning (MFA-Plan) pipeline. Empirical results demonstrate that agent-based MFA methods achieve outstanding and robust performance in mitigating the insider-outsider bias: For instance, on the Cultural Alignment Gap (CAG) metric, MFA-SA reduces bias in the Llama model by 89.70% and MFA-HA mitigates bias in Qwen by 82.54%. These findings showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
[98] Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
Sangmin Bae, Bilge Acun, Chien-Yu Lin, Haroun Habeeb, Seungyeon Kim, Liang Luo, Junjie Wang, Carole-Jean Wu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent progress in large language models demonstrates that hybrid architectures–combining self-attention mechanisms with structured state space models like Mamba–can achieve a compelling balance between modeling quality and computational efficiency, particularly for long-context tasks. While these hybrid models show promising performance, systematic comparisons of hybridization strategies and analyses on the key factors behind their effectiveness have not been clearly shared to the community. In this work, we present a holistic evaluation of hybrid architectures based on inter-layer (sequential) or intra-layer (parallel) fusion. We comprehensively evaluate these designs across multiple dimensions: language modeling and downstream task performance, long-context capabilities, scaling analysis, and training and inference efficiency. By investigating the core characteristics of their computational primitive, we identify the most critical elements for each hybridization strategy and further propose optimal design recipes for hybrid models. Our comprehensive analysis provides practical guidance and valuable insights for developing hybrid language models, facilitating the optimization of architectural configurations.
[99] A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks
Shuzheng Si, Haozhe Zhao, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agents based on large language models (LLMs) struggle with brainless trial-and-error and generating hallucinatory actions due to a lack of global planning in long-horizon tasks. In this paper, we introduce a plan-and-execute framework and propose EAGLET, an efficient and effective planner training method to enhance the executor agent’s planning abilities without human effort. Specifically, we train a plug-and-play global planner through a two-step process: we first synthesize high-quality plans from an advanced LLM using our proposed homologous consensus filtering strategy, and apply fine-tuning as a cold start. Moreover, we further improve the planner with a rule-based reinforcement learning stage using a novel executor capability gain reward, ensuring it can handle task instructions of varying difficulty. Experiments on three long-horizon agent tasks show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance. Meanwhile, EAGLET reduces training costs by 8x compared to RL-based baselines, and it does not require manual effort or extra training data, offering an efficient and effective solution.
[100] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling
Shuliang Liu, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Minghe Yu, Yu Gu, Chong Chen, Huiyuan Xie, Ge Yu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at https://github.com/NEUIR/Genii.
[101] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent’s responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
[102] VISTA: Verification In Sequential Turn-based Assessment
Ashley Lewis, Andrew Perrault, Eric Fosler-Lussier, Michael White
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hallucination–defined here as generating statements unsupported or contradicted by available evidence or conversational context–remains a major obstacle to deploying conversational AI systems in settings that demand factual reliability. Existing metrics either evaluate isolated responses or treat unverifiable content as errors, limiting their use for multi-turn dialogue. We introduce VISTA (Verification In Sequential Turn-based Assessment), a framework for evaluating conversational factuality through claim-level verification and sequential consistency tracking. VISTA decomposes each assistant turn into atomic factual claims, verifies them against trusted sources and dialogue history, and categorizes unverifiable statements (subjective, contradicted, lacking evidence, or abstaining). Across eight large language models and four dialogue factuality benchmarks (AIS, BEGIN, FAITHDIAL, and FADE), VISTA substantially improves hallucination detection over FACTSCORE and LLM-as-Judge baselines. Human evaluation confirms that VISTA’s decomposition improves annotator agreement and reveals inconsistencies in existing benchmarks. By modeling factuality as a dynamic property of conversation, VISTA offers a more transparent, human-aligned measure of truthfulness in dialogue systems.
[103] From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models
Farima Fatahi Bayat, Pouya Pezeshkpour, Estevam Hruschka
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Tool-augmented Language Models (TaLMs) can invoke external tools to solve problems beyond their parametric capacity. However, it remains unclear whether these tool-enabled gains reflect trustworthy reasoning. Focusing on the Code Interpreter tool, we show that even when tools are selected and executed correctly, TaLMs treat tool outputs as substitutes for reasoning, producing solutions that appear correct but lack coherent justification. We term this failure mode Tool-Induced Myopia (TIM), and study it using PYMATH, a benchmark of 1,679 competition-level mathematical problems for which Python code is helpful but not sufficient. We further develop a multi-dimensional evaluation suite to quantify reasoning degradation in TaLMs relative to their non-tool counterparts. Our findings reveal that while TaLMs achieve up to a 19.3 percentage point gain in final-answer accuracy, their reasoning behavior consistently deteriorates (e.g., non-tool LLMs win up to 41.5% more often in pairwise comparisons of the reasoning process). This degradation intensifies with tool use; the more frequently a model invokes tools, the less coherent its reasoning becomes. Moreover, tool use shifts errors from arithmetic mistakes toward global reasoning failures (logic, assumption, creativity), with TIM present in ~55% of high-risk cases. Finally, we propose a preference-optimization-based framework that realigns TaLMs to use tools as assistive evidence, improving both final-answer accuracy and reasoning depth under tool use. Codes and data are available at: https://github.com/megagonlabs/TIM.
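A rough sketch of the pairwise reasoning comparison the abstract describes, with the judge replaced by a toy heuristic; the marker-counting judge and the example traces are illustrative assumptions, not the paper's evaluation suite.

```python
# Toy pairwise comparison of reasoning traces; the heuristic judge is a placeholder.
from typing import List, Tuple

def judge_prefers_first(reasoning_a: str, reasoning_b: str) -> bool:
    """Stand-in judge: prefer the trace with more explicit justification markers.
    In practice an LLM judge would score coherence and justification."""
    markers = ("because", "therefore", "hence", "it follows")
    count = lambda t: sum(t.lower().count(m) for m in markers)
    return count(reasoning_a) >= count(reasoning_b)

def no_tool_win_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of problems where the non-tool trace (first element) wins the comparison."""
    wins = sum(judge_prefers_first(no_tool, with_tool) for no_tool, with_tool in pairs)
    return wins / len(pairs)

pairs = [
    ("x = 3 because 2x + 1 = 7, therefore x = 3.",
     "The code printed 3, so the answer is 3."),
    ("The sum telescopes, hence it equals 1 - 1/n.",
     "Running the loop gives 0.99, so the answer is 0.99."),
]
print(no_tool_win_rate(pairs))
```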
[104] MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling
MiroMind Team, Song Bai, Lidong Bing, Carson Chen, Guanzheng Chen, Yuntao Chen, Zhe Chen, Ziyi Chen, Jifeng Dai, Xuan Dong, Wenhan Dou, Yue Deng, Yunjie Fu, Junqi Ge, Chenxia Han, Tammy Huang, Zhenhang Huang, Jerry Jiao, Shilei Jiang, Tianyu Jiao, Xiaoqi Jian, Lei Lei, Ruilin Li, Gen Luo, Tiantong Li, Xiang Lin, Ziyuan Liu, Zhiqi Li, Jie Ni, Qiang Ren, Pax Sun, Shiqian Su, Chenxin Tao, Bin Wang, Wenhai Wang, Haonan Wang, James Wang, Jin Wang, Jojo Wang, Letian Wang, Shizun Wang, Weizhi Wang, Zixuan Wang, Jinfan Xu, Sen Xing, Chenyu Yang, Hai Ye, Jiaheng Yu, Yue Yu, Muyan Zhong, Tianchen Zhao, Xizhou Zhu, Yanpeng Zhou, Yifan Zhang, Zhi Zhu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.11793: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11793&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[105] When Does Verification Pay Off? A Closer Look at LLMs as Solution Verifiers
Jack Lu, Ryan Teehan, Jinran Jin, Mengye Ren
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.02304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[106] RepIt: Steering Language Models with Concept-Specific Refusal Vectors
Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, Chenguang Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.13281: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.13281&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[107] Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning
Jinlong Liu, Mohammed Bahja, Venelin Kovatchev, Mark Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.05747: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05747&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[108] TabReX : Tabular Referenceless eXplainable Evaluation
Tejas Anvekar, Junha Park, Aparna Garimella, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.15907: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15907&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[109] FaithLens: Detecting and Explaining Faithfulness Hallucination
Shuzheng Si, Qingyi Wang, Haozhe Zhao, Yuzhuo Bai, Guanqiao Chen, Kangyang Luo, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.20182: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20182&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[110] EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading
Kumar Satvik Chaudhary, Chengshuai Zhao, Fan Zhang, Garima Agrawal, Yuli Deng, Huan Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.20817: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20817&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[111] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation
Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.02993: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02993&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[112] Do LLMs Encode Functional Importance of Reasoning Tokens?
Janvijay Singh, Dilek Hakkani-Tür
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03066: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03066&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[113] STReasoner: Empowering LLMs for Spatio-Temporal Reasoning in Time Series via Spatial-Aware Reinforcement Learning
Juntong Ni, Shiyu Wang, Qi He, Ming Jin, Wei Jin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03248: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03248&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[114] SAGE-32B: Agentic Reasoning via Iterative Distillation
Basab Jha, Firoj Paudel, Ujjwal Puri, Ethan Henkel, Zhang Yuting, Mateusz Kowalczyk, Mei Huang, Choi Donghyuk, Wang Junhao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.04237: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04237&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[115] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences
Arkadiusz Modzelewski, Paweł Golik, Anna Kołos, Giovanni Da San Martino
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.04925: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04925&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[116] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
Minda Zhao, Yilun Du, Mengyu Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.05414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[117] OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2502.16161: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.16161&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[118] Reasoning Models Will Sometimes Lie About Their Reasoning
William Walden, Miriam Wanner
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.07663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[119] Take Out Your Calculators: Estimating the Real Difficulty of Question Items with LLM Student Simulations
Christabel Acquaye, Yi Ting Huang, Marine Carpuat, Rachel Rudinger
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.09953: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.09953&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[120] LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding
Yichen Jiang, Jiakang Yuan, Chongjun Tu, Peng Ye, Tao Chen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.11913: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11913&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[121] On Temperature-Constrained Non-Deterministic Machine Translation: Potential and Evaluation
Weichuan Wang, Mingyang Liu, Linqi Song, Chen Ma
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.13729: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13729&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[122] VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Shijie Zhou, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.20279: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.20279&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[123] Lost in the Prompt Order: Revealing the Limitations of Causal Attention in Language Models
Hyunjong Ok, Jaeho Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.14152: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14152&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[124] OmniGen2: Towards Instruction-Aligned Multimodal Generation
Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, Zheng Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.18871: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18871&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[125] Sentipolis: Emotion-Aware Agents for Social Simulations
Chiyuan Fu, Lyuhao Chen, Yunze Xiao, Weihao Xuan, Carlos Busso, Mona Diab
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18027: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18027&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[126] Multi-Persona Thinking for Bias Mitigation in Large Language Models
Yuxing Chen, Guoqing Luo, Zijun Wu, Lili Mou
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.15488: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15488&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[127] Beyond Marginal Distributions: A Framework to Evaluate the Representativeness of Demographic-Aligned LLMs
Tristan Williams, Franziska Weeber, Sebastian Padó, Alan Akbik
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.15755: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15755&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[128] Temp-R1: A Unified Autonomous Agent for Complex Temporal KGQA via Reverse Curriculum Reinforcement Learning
Zhaoyan Gong, Zhiqiang Liu, Songze Li, Xiaoke Guo, Yuanxiang Liu, Xinle Deng, Zhizhen Liu, Lei Liang, Huajun Chen, Wen Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18296: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18296&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[129] Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
Jingxuan Chen, Mohammad Taher Pilehvar, Jose Camacho-Collados
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.22608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[130] One Persona, Many Cues, Different Results: How Sociodemographic Cues Impact LLM Personalization
Franziska Weeber, Vera Neplenbroek, Jan Batzner, Sebastian Padó
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.18572: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18572&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[131] Visual-TableQA: Open-Domain Benchmark for Reasoning over Table Images
Boammani Aser Lompo, Marc Haraoui
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.07966: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.07966&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[132] Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents
Thanh Luong Tuan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.00555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[133] Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Retrospective Forecasting Case Study
Ali El Lahib, Ying-Jieh Xia, Zehan Li, Yuxuan Wang, Xinyu Pi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.00758: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00758&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[134] ChemPro: A Progressive Chemistry Benchmark for Large Language Models
Aaditya Baranwal, Shruti Vyas
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.03108: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03108&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[135] Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs
Ziqian Zhong, Aditi Raghunathan
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.00161: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.00161&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[136] Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation
Xue Liu, Xin Ma, Yuxin Ma, Yongchang Peng, Duo Wang, Zhoufutu Wen, Ge Zhang, Kaiyuan Zhang, Xinyu Chen, Yida Ding, Tianci He, Jiani Hou, Liang Hu, Ziyun Huang, Yongzhe Hui, Jianpeng Jiao, Chennan Ju, Yingru Kong, Yiran Li, Jiashuo Liu, Mengyun Liu, Luyao Ma, Fei Ni, Yiqing Ni, Pengbo Niu, Yueyan Qiu, Yanle Ren, Xinyu Shen, Zilin Shi, Zaiyuan Wang, Wenjie Yue, Chun Zhang, Shiyu Zhang, Xinyi Zhang, Kaiwen Zhao, Zhenwei Zhu, Shanshan Wu, Qi Zhao, Wenhao Huang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02368: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02368&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[137] Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models
Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.05437: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05437&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[138] Investigating the structure of emotions by analyzing similarity and association of emotion words
Fumitaka Iwaki, Tatsuji Takahashi
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.06430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[139] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering
Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.09642: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09642&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[140] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification
Jiale Zhao, Ke Fang, Lu Cheng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11199: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11199&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[141] Cross-lingual Matryoshka Representation Learning across Speech and Text
Yaya Sy, Dioula Doucouré, Christophe Cerisara, Irina Illina
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.19991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[142] ConFu: Contemplate the Future for Better Speculative Sampling
Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.08899: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08899&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[143] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge
Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.11665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[144] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering
Tianyi Huang, Ying Kai Deng
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.16091: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16091&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[145] More Than Sum of Its Parts: Deciphering Intent Shifts in Multimodal Hate Speech Detection
Runze Sun, Yu Zheng, Zexuan Xiong, Zhongjin Qu, Lei Chen, Jie Zhou, Jiwen Lu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.21298: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21298&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[146] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?
Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.24472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[147] Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval
Zhiyuan Cheng, Longying Lai, Yue Liu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.26815: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26815&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[148] Article and Comment Frames Shape the Quality of Online Comments
Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.27889: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27889&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[149] PolarQuant: Optimal Gaussian Weight Quantization via Hadamard Rotation for LLM Compression
Caio Vicentino
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.29078: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29078&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[150] Council Mode: Mitigating Hallucination and Bias in LLMs via Multi-Agent Consensus
Shuai Wu, Xue Li, Yanna Feng, Yufang Li, Zhijun Wang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.02923: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02923&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[151] CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark
Ahmed Heakl, Gustavo Bertolo Stahl, Sarim Hashmi, Seung Hun Eddie Han, Mukul Ranjan, Arina Kharlamova, Salman Khan, Abdulrahman Mahmoud
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.16968: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16968&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[152] VIGIL: An Extensible System for Real-Time Detection and Mitigation of Cognitive Bias Triggers
Bo Kang, Sander Noels, Tijl De Bie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.03261: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03261&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[153] Multilingual Language Models Encode Script Over Linguistic Structure
Aastha A K Verma, Anwoy Chatterjee, Mehak Gupta, Tanmoy Chakraborty
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.05090: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05090&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[154] VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.12735: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12735&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[155] NameBERT: Scaling Name-Based Nationality Classification with LLM-Augmented Open Academic Data
Cong Ming, Ruixin Shi, Yifan Hu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.10401: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10401&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[156] A Triadic Suffix Tokenization Scheme for Numerical Reasoning
Olga Chetverina
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11582: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11582&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[157] Evaluating Cooperation in LLM Social Groups through Elected Leadership
Ryan Faulkner, Anushka Deshpande, David Guzman Piedrahita, Joel Z. Leibo, Zhijing Jin
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.11721: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11721&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[158] Coding-Free and Privacy-Preserving Agentic Framework for Data-Driven Clinical Research
Taehun Kim, Hyeryun Park, Hyeonhoon Lee, Yushin Lee, Kyungsang Kim, Hyung-Chul Lee
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.12258: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12258&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[159] How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data
Zixian Huang, Kaichen Yang, Xu Huang, Feiyang Hao, Qiming Ge, Bowen Li, He Du, Kai Chen, Qipeng Guo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14164: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14164&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[160] EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
Naman Ahuja, Saniya Mulla, Muhammad Ali Khan, Zaryab Bin Riaz, Kaneez Zahra Rubab Khakwani, Mohamad Bassam Sonbol, Irbaz Bin Riaz, Vivek Gupta
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.14165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.14165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[161] The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring
Jon-Paul Cacioli
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.15702: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.15702&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[162] Cross-Family Speculative Decoding for Polish Language Models on AppleSilicon: An Empirical Evaluation of Bielik11B with UAG-Extended MLX-LM
Krzysztof Fonal
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16368: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16368&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[163] IYKYK (But AI Doesn’t): Automated Content Moderation Does Not Capture Communities’ Heterogeneous Attitudes Towards Reclaimed Language
Christina Chance, Rebecca Pattichis, Arjun Subramonian, James He, Shruti Narayanan, Saadia Gabriel, Kai-Wei Chang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[164] No One Fits All: From Fixed Prompting to Learned Routing in Multilingual LLMs
Wei-Chi Wu, Sheng-Lun Wei, Hen-Hsen Huang, Hsin-Hsi Chen
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16937: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16937&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[165] SciImpact: A Multi-Dimensional, Multi-Field Benchmark for Scientific Impact Prediction
Hangxiao Zhu, Yuyu Zhang, Ping Nie, Yu Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17141: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17141&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[166] REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning
Seungmin Lee, Jeonghwan Lee, Hyunkuk Lim, Sejoon Kim, Mingi Sung
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17257: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17257&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[167] Cat-DPO: Category-Adaptive Safety Alignment
Tiankai Yang, Yi Nian, Xinyuan Li, Ruiyao Xu, Kaize Ding, Yue Zhao
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17299: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17299&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[168] Who Watches the Watchmen? Humans Disagree With Translation Metrics on Unseen Domains
Finn Schmidt, Jan Philip Wahle, Terry Ruas, Bela Gipp
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17393: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17393&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[169] On the Emergence of Syntax by Means of Local Interaction
Zichao Wei
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17857: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17857&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[170] MM-JudgeBias: A Benchmark for Evaluating Compositional Biases in MLLM-as-a-Judge
Sua Lee, Sanghee Park, Jinbae Im
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18164: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18164&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[171] STaD: Scaffolded Task Design for Identifying Compositional Skill Gaps in LLMs
Sungeun An, Swanand Ravindra Kadhe, Shailja Thakur, Chad DeLuca, Hima Patel
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[172] AlphaContext: An Evolutionary Tree-based Psychometric Context Generator for Creativity Assessment
Yixuan Wang, Yue Huang, Hong Qian, Yunzhao Wei, Yifei Ding, Wenkai Wang, Zhi Liu, Zhongjing Huang, Aimin Zhou, Jiajun Guo
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18398: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18398&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[173] MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation
Xingchen Xiao, Heyan Huang, Runheng Liu, Jincheng Xie
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18509: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18509&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[174] Position: LLM Watermarking Should Align Stakeholders’ Incentives for Practical Adoption
Yepeng Liu, Xuandong Zhao, Dawn Song, Gregory W. Wornell, Yuheng Bu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.18333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[175] ContextLeak: Auditing Leakage in Private In-Context Learning Methods
Jacob Choi, Shuying Cao, Xingjian Dong, Amin Banayeeanzade, Wang Bill Zhu, Robin Jia, Sai Praneeth Karimireddy
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.16059: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16059&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[176] DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
Haokun Lin, Xinle Jia, Haobo Xu, Bingchen Yao, Xianglong Guo, Yichen Wu, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.17789: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.17789&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[177] Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey
Yasmin Moslem, John D. Kelleher
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.04445: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04445&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[178] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
Alexandra Dragomir, Ioana Pintilie, Antonio Barbalau, Marius Dragoi, Florin Brad, Cristian Daniel Paduraru, Alexandru Tifrea, Elena Burceanu, Radu Tudor Ionescu
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16171: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16171&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[179] Why AI Readiness Is an Organizational Learning Problem, Not a Technology Purchase
Jeanne McClure, Gregg Gerdau
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16369: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16369&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[180] Modular Representation Compression: Adapting LLMs for Efficient and Effective Recommendations
Yunjia Xi, Menghui Zhu, Jianghao Lin, Bo Chen, Ruiming Tang, Yong Yu, Weinan Zhang
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18146: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18146&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[181] Sessa: Selective State Space Attention
Liubomyr Horbatko
Main category: cs.CL
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18580: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18580&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[182] Can We Build Scene Graphs, Not Classify Them? FlowSG: Progressive Image-Conditioned Scene Graph Generation with Flow Matching
Xin Hu, Ke Qin, Wen Yin, Yuan-Fang Li, Ming Li, Tao He
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Scene Graph Generation (SGG) unifies object localization and visual relationship reasoning by predicting boxes and subject-predicate-object triples. Yet most pipelines treat SGG as a one-shot, deterministic classification problem rather than a genuinely progressive, generative task. We propose FlowSG, which recasts SGG as continuous-time transport on a hybrid discrete-continuous state: starting from a noised graph, the model progressively grows an image-conditioned scene graph through constraint-aware refinements that jointly synthesize nodes (objects) and edges (predicates). Specifically, we first leverage a VQ-VAE to quantize a scene graph (e.g., continuous visual features) into compact, predictable tokens; a graph Transformer then (i) predicts a conditional velocity field to transport continuous geometry (boxes) and (ii) updates discrete posteriors for categorical tokens (object features and predicate labels), coupling semantics and geometry via flow-conditioned message aggregation. Training combines flow-matching losses for geometry with a discrete-flow objective for tokens, yielding few-step inference and plug-and-play compatibility with standard detectors and segmenters. Extensive experiments on VG and PSG under closed- and open-vocabulary protocols show consistent gains in predicate R/mR and graph-level metrics, validating the mixed discrete-continuous generative formulation over one-shot classification baselines, with an average improvement of about 3 points over the state-of-the-art USG-Par.
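For readers unfamiliar with flow matching, the snippet below sketches how training targets are formed for the continuous (box) part of such a hybrid state under the common linear-interpolation path; the box format and the model that would consume these pairs are assumptions, not FlowSG's implementation.

```python
# Numpy sketch of flow-matching training pairs for box geometry under a linear path
# x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0. The conditional model
# that predicts the velocity (the paper's graph Transformer) is out of scope here.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_pair(target_boxes: np.ndarray, t: float):
    """target_boxes: (N, 4) ground-truth boxes (cx, cy, w, h), normalized to [0, 1]."""
    x0 = rng.normal(size=target_boxes.shape)      # noised graph geometry at t = 0
    x_t = (1.0 - t) * x0 + t * target_boxes       # point on the straight path at time t
    velocity = target_boxes - x0                  # d x_t / d t along this path
    return x_t, velocity

boxes = np.array([[0.5, 0.5, 0.2, 0.3],
                  [0.1, 0.8, 0.05, 0.1]])
x_t, v = flow_matching_pair(boxes, t=0.3)
# A model v_theta(x_t, t, image) would be trained to regress v with a mean-squared error.
print(x_t.shape, v.shape)
```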
[183] AutoAWG: Adverse Weather Generation with Adaptive Multi-Controls for Automotive Videos
Jiagao Hu, Daiguo Zhou, Danzhen Fu, Fuhao Li, Zepeng Wang, Fei Wang, Wenhua Liao, Jiayi Xie, Haiyang Sun
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Perception robustness under adverse weather remains a critical challenge for autonomous driving, with the core bottleneck being the scarcity of real-world video data in adverse weather. Existing weather generation approaches struggle to balance visual quality and annotation reusability. We present AutoAWG, a controllable Adverse Weather video Generation framework for Autonomous driving. Our method employs a semantics-guided adaptive fusion of multiple controls to balance strong weather stylization with high-fidelity preservation of safety-critical targets; leverages a vanishing point-anchored temporal synthesis strategy to construct training sequences from static images, thereby reducing reliance on synthetic data; and adopts masked training to enhance long-horizon generation stability. On the nuScenes validation set, AutoAWG significantly outperforms prior state-of-the-art methods: without first-frame conditioning, FID and FVD are relatively reduced by 50.0% and 16.1%; with first-frame conditioning, they are further reduced by 8.7% and 7.2%, respectively. Extensive qualitative and quantitative results demonstrate advantages in style fidelity, temporal consistency, and semantic–structural integrity, underscoring the practical value of AutoAWG for improving downstream perception in autonomous driving. Our code is available at: https://github.com/higherhu/AutoAWG
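One plausible reading of the vanishing point-anchored temporal synthesis is sketched below: a pseudo sequence is built from a static frame by cropping progressively toward the vanishing point so that forward motion is simulated. The crop schedule and parameters are guesses for illustration only, not AutoAWG's procedure.

```python
# Illustrative pseudo-sequence construction: crops shrink toward a vanishing point so the
# static frame appears to move forward. All parameters are guesses; resizing the crops back
# to a common resolution is assumed to happen afterwards.
import numpy as np

def vp_anchored_sequence(frame: np.ndarray, vp_xy, num_frames: int = 8, max_zoom: float = 0.3):
    """frame: (H, W, 3) image array; vp_xy: vanishing point in pixel coordinates."""
    h, w = frame.shape[:2]
    vx, vy = vp_xy
    crops = []
    for k in range(num_frames):
        z = 1.0 - max_zoom * k / max(num_frames - 1, 1)   # crop scale shrinking from 1.0
        cw, ch = int(w * z), int(h * z)
        # Keep the vanishing point at the same relative position inside every crop.
        x0 = int(np.clip(vx - cw * vx / w, 0, w - cw))
        y0 = int(np.clip(vy - ch * vy / h, 0, h - ch))
        crops.append(frame[y0:y0 + ch, x0:x0 + cw])
    return crops

frame = np.zeros((360, 640, 3), dtype=np.uint8)           # placeholder static frame
sequence = vp_anchored_sequence(frame, vp_xy=(320, 150))
print([c.shape for c in sequence])
```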
[184] Vision-Based Human Awareness Estimation for Enhanced Safety and Efficiency of AMRs in Industrial Warehouses
Maximilian Haug, Christian Stippel, Lukas Pscherer, Benjamin Schwendinger, Ralph Hoch, Angel Gaydarov, Sebastian Schlund, Thilo Sauter
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Ensuring human safety is of paramount importance in warehouse environments that feature mixed traffic of human workers and autonomous mobile robots (AMRs). Current approaches often treat humans as generic dynamic obstacles, leading to conservative AMR behaviors like slowing down or detouring, even when workers are fully aware and capable of safely sharing space. This paper presents a real-time vision-based method to estimate human awareness of an AMR using a single RGB camera. We integrate state-of-the-art 3D human pose lifting with head orientation estimation to ascertain a human’s position relative to the AMR and their viewing cone, thereby determining if the human is aware of the AMR. The entire pipeline is validated using synthetically generated data within NVIDIA Isaac Sim, a robust physics-accurate robotics simulation environment. Experimental results confirm that our system reliably detects human positions and their attention in real time, enabling AMRs to safely adapt their motion based on human awareness. This enhancement is crucial for improving both safety and operational efficiency in industrial and factory automation settings.
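The core awareness test reduces to a simple geometric check: is the AMR inside the worker's viewing cone given the estimated head orientation? A minimal sketch follows, assuming ground-plane positions and a fixed cone half-angle (the threshold is an assumption, not the paper's value).

```python
# Ground-plane viewing-cone check; the 60 degree half-angle is an assumed threshold.
import math

def robot_in_view_cone(human_xy, head_yaw_rad, robot_xy, half_angle_deg: float = 60.0) -> bool:
    """human_xy, robot_xy: (x, y) positions on the ground plane; head_yaw_rad: gaze direction."""
    dx, dy = robot_xy[0] - human_xy[0], robot_xy[1] - human_xy[1]
    angle_to_robot = math.atan2(dy, dx)
    # Smallest signed difference between the gaze direction and the direction to the robot.
    diff = math.atan2(math.sin(angle_to_robot - head_yaw_rad),
                      math.cos(angle_to_robot - head_yaw_rad))
    return abs(math.degrees(diff)) <= half_angle_deg

# Worker at the origin looking along +x; a robot ahead and slightly to the side is "seen".
print(robot_in_view_cone((0.0, 0.0), head_yaw_rad=0.0, robot_xy=(3.0, 0.5)))   # True
# A robot directly behind the worker is not, so the AMR would slow down or detour.
print(robot_in_view_cone((0.0, 0.0), head_yaw_rad=0.0, robot_xy=(-2.0, 0.0)))  # False
```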
[185] COMODO: Cross-Modal Video-to-IMU Distillation for Efficient Egocentric Human Activity Recognition
Baiyu Chen, Wilson Wongso, Zechen Li, Yonchanok Khaokaew, Hao Xue, Flora Salim
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The goal of creating intelligent, human-centered wearable systems for continuous activity understanding faces a fundamental trade-off: Egocentric video-based models capture rich semantic information and have demonstrated strong performance in human activity recognition (HAR), but their high power consumption, privacy concerns, and dependence on lighting limit their feasibility for continuous on-device recognition. In contrast, inertial measurement unit (IMU) sensors offer an energy-efficient, privacy-preserving alternative, yet lack large-scale annotated datasets, leading to weaker generalization. To bridge this gap, we propose COMODO, a cross-modal self-supervised distillation framework that transfers semantic knowledge from video to IMU without requiring labels. COMODO leverages a pretrained and frozen video encoder to construct a dynamic instance queue to align the feature distributions of video and IMU embeddings. This enables the IMU encoder to inherit rich semantic structure from video while maintaining its efficiency for real-world applications. Experiments on multiple egocentric HAR datasets show that COMODO consistently improves downstream performance, matching or surpassing fully supervised models, and demonstrating strong cross-dataset generalization. Benefiting from its simplicity and flexibility, COMODO is compatible with diverse pretrained video and time-series models, offering the potential to leverage more powerful teacher and student foundation models in future ubiquitous computing research. The code is available at this repository: https://github.com/cruiseresearchgroup/COMODO.
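A minimal numpy sketch of label-free distribution alignment over an instance queue, one common way to distill a frozen teacher into a student of another modality; temperatures, queue size, and the loss form are assumptions rather than COMODO's exact recipe.

```python
# Numpy sketch: student/teacher similarity distributions over a queue of teacher embeddings,
# matched with a cross-entropy. Temperatures and dimensions are placeholders.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def queue_distillation_loss(student_emb, teacher_emb, queue, t_student=0.1, t_teacher=0.05):
    """student_emb, teacher_emb: (B, D) L2-normalized embeddings of the same clips.
    queue: (K, D) L2-normalized teacher embeddings of past instances."""
    p_teacher = softmax(teacher_emb @ queue.T / t_teacher)   # (B, K) target distribution
    p_student = softmax(student_emb @ queue.T / t_student)   # (B, K) student distribution
    return float(-(p_teacher * np.log(p_student + 1e-9)).sum(axis=1).mean())

rng = np.random.default_rng(0)
l2 = lambda a: a / np.linalg.norm(a, axis=1, keepdims=True)
queue = l2(rng.normal(size=(256, 64)))    # dynamic instance queue (teacher features)
video = l2(rng.normal(size=(8, 64)))      # frozen video-encoder outputs (teacher)
imu = l2(rng.normal(size=(8, 64)))        # trainable IMU-encoder outputs (student)
print(round(queue_distillation_loss(imu, video, queue), 4))
```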
[186] StomaD2: An All-in-One System for Intelligent Stomatal Phenotype Analysis via Diffusion-Based Restoration Detection Network
Quanling Zhao, Meng’en Qin, Yanfeng Sun, Yuan Miao, Xiaohui Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Stomata play a crucial role in regulating plant physiological processes and reflecting environmental responses. However, accurate and high-throughput stomatal phenotyping remains challenging, as conventional approaches rely on destructive sampling and manual annotation, restricting large-scale and field deployment. To overcome these limitations, a noninvasive restoration-detection integrated framework, termed StomaD2, is developed to achieve accurate and fast stomatal phenotyping under complex imaging conditions. The framework incorporates a diffusion-based restoration module to recover degraded images and a specialized rotated object detection network tailored to the small, dense, and cluttered characteristics of stomata. The proposed network enhances feature representation through three key innovations: a column-wise structure for global feature interaction, context-aware resampling and reweighting mechanism to improve multi-scale consistency, and a feature reassembly module to boost discrimination against complex backgrounds. In extensive comparisons, StomaD2 demonstrated state-of-the-art performance. On public Maize and Wheat datasets, it achieved accuracies of 0.994 and 0.992, respectively, significantly outperforming existing benchmarks. When benchmarked against ten other advanced models, including Oriented Former and YOLOv12, StomaD2 achieved a top-tier F1-score/mAP of 0.989. The framework is integrated into a user-friendly, field-operable system that supports the fast extraction of eight stomatal phenotypes, such as density and conductance. Validated on more than 130 plant species, StomaD2’s results highlight its strong generalizability and potential for large-scale phenotyping, plant physiology analysis, and precision agriculture applications.
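As an illustration of the downstream phenotype extraction such a system enables, the sketch below computes stomatal density and a rough shape proxy from rotated-box detections; the detection format, pixel scale, and formulas are assumptions, not StomaD2's own pipeline.

```python
# Downstream phenotype computation from rotated-box detections; format and scale are assumed.
from typing import List, Tuple

Box = Tuple[float, float, float, float, float]   # (cx, cy, w, h, angle_deg) in pixels

def stomatal_density(detections: List[Box], image_hw: Tuple[int, int], um_per_pixel: float) -> float:
    """Stomata per mm^2 of imaged leaf area."""
    h, w = image_hw
    area_mm2 = (h * um_per_pixel) * (w * um_per_pixel) / 1e6
    return len(detections) / area_mm2

def mean_aspect_ratio(detections: List[Box]) -> float:
    """Mean short/long side ratio of detected stomata, a rough shape/openness proxy."""
    ratios = [min(w, h) / max(w, h) for _, _, w, h, _ in detections]
    return sum(ratios) / len(ratios) if ratios else 0.0

dets = [(100, 120, 40, 18, 15.0), (300, 240, 38, 22, 80.0)]
print(round(stomatal_density(dets, image_hw=(960, 1280), um_per_pixel=0.5), 2))
print(round(mean_aspect_ratio(dets), 3))
```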
[187] DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax
Hang Yuan, Xiaolin Hu, Yan Wan, Menglin Gao, Wenzhe Yu, Cong Huang, Fei Xu, Qing Li, Christina Dan Wang, Zhou Yu, Kai Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Text-driven controllable dance generation remains under-explored, primarily due to the severe scarcity of high-quality datasets and the inherent difficulty of articulating complex choreographies. Characterizing dance is particularly challenging owing to its intricate spatial dynamics, strong directionality, and the highly decoupled movements of distinct body parts. To overcome these bottlenecks, we bridge principles from dance studies, human anatomy, and biomechanics to propose Choreographic Syntax, a novel theoretical framework with a tailored annotation system. Grounded in this syntax, we combine professional dance archives with high-fidelity motion capture data to construct DanceFlow, the most fine-grained dance dataset to date. It encompasses 41 hours of high-quality motions paired with 6.34 million words of detailed descriptions. At the model level, we introduce DanceCrafter, a tailored motion transformer built upon the Momentum Human Rig. To circumvent optimization instabilities, we construct a continuous manifold motion representation paired with a hybrid normalization strategy. Furthermore, we design an anatomy-aware loss to explicitly regulate the decoupled nature of body parts. Together, these adaptations empower DanceCrafter to achieve the high-fidelity and stable generation of complex dance sequences. Extensive evaluations and user studies demonstrate our state-of-the-art performance in motion quality, fine-grained controllability, and generation naturalness.
[188] From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents
Niu Lian, Yuting Wang, Hanshu Yao, Jinpeng Wang, Bin Chen, Yaowei Wang, Min Zhang, Shu-Tao Xia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory, we derive a Semantic Information Bottleneck objective and introduce SIB-GRPO to optimize the trade-off between memory compression and task-relevant information retention. In inference, we design an entropy-driven top-down memory retrieval strategy. Extensive experiments across 4 benchmarks confirm that MM-Mem achieves state-of-the-art performance on both offline and streaming tasks, demonstrating robust generalization and validating the effectiveness of cognition-inspired memory organization. Code and associated configurations are publicly available at https://github.com/EliSpectre/MM-Mem.
[189] Align then Refine: Text-Guided 3D Prostate Lesion Segmentation
Cuiling Sun, Linkai Peng, Adam Murphy, Elif Keles, Hiten D. Patel, Ashley Ross, Frank Miller, Baris Turkbey, Andrea Mia Bejar, Halil Ertugrul Aktas, Gorkem Durak, Ulas Bagci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automated 3D segmentation of prostate lesions from biparametric MRI (bp-MRI) is essential for reliable algorithmic analysis, but achieving high precision remains challenging. Volumetric methods must combine multiple modalities while ensuring anatomical consistency, but current models struggle to integrate cross-modal information reliably. While vision-language models (VLMs) are replacing the currently used architectural designs, they still lack the fine-grained, lesion-level semantics required for effective localized guidance. To address these limitations, we propose a new multi-encoder U-Net architecture incorporating three key innovations: (1) an alignment loss that enhances foreground text-image similarity to inject lesion semantics; (2) a heatmap loss that calibrates the similarity map and suppresses spurious background activations; and (3) a final-stage, confidence-gated multi-head cross-attention refiner that performs localized boundary edits in high-confidence regions. A phase-scheduled training regime stabilizes the optimization of these components. Our method consistently outperforms prior approaches, establishing a new state-of-the-art on the PI-CAI dataset through enhanced multi-modal fusion and localized text guidance. Our code is available at https://github.com/NUBagciLab/Prostate-Lesion-Segmentation.
[190] Colour Extraction Pipeline for Odonates using Computer Vision
Megan Mirnalini Sundaram Rajaraman, Fons J. Verbeek, Vincent J. Kalkman, Rita Pucci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The correlation between insect morphological traits and climate has been documented in physiological studies, but such studies remain limited by the time-consuming nature of the data analysis. In particular, open-source datasets often lack annotations of species’ morphological traits, making dedicated annotation campaigns necessary; these efforts are typically local in scale and costly. In this paper, we propose a pipeline to identify and segment body parts of Odonates (dragonflies and damselflies) using deep neural networks, with the ultimate goal of extracting body parts’ colouration. The pipeline is trained on a limited annotated dataset and refined with pseudo-supervised data. We show that, by using open-source images from citizen science platforms, our approach can segment each visible subject (Odonates) into head, thorax, abdomen, and wings and then extract a colour palette for each body part. This will enable large-scale statistical analysis of ecological correlations (e.g., between colouration and climate change, habitat loss, or geolocation), which are crucial for quantifying and assessing ecosystem biodiversity status.
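The final palette-extraction step lends itself to a compact example. The snippet below is a hedged sketch of how a colour palette could be pulled from one segmented body part using k-means; the mask format, number of colours, and clustering choice are illustrative assumptions rather than the paper's exact procedure.

```python
# Hedged sketch: extracting a colour palette from one segmented body part.
# The segmentation masks themselves would come from the trained network;
# n_colors and the boolean-mask format are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(image, mask, n_colors=5):
    """Cluster the RGB pixels inside a body-part mask into a small palette.

    image: (H, W, 3) uint8 array; mask: (H, W) boolean array.
    Returns palette colours sorted by the fraction of the part they cover."""
    pixels = image[mask].reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    counts = np.bincount(km.labels_, minlength=n_colors)
    order = np.argsort(counts)[::-1]
    colours = km.cluster_centers_[order].astype(np.uint8)
    fractions = counts[order] / counts.sum()
    return colours, fractions

# Usage: one palette per segmented part, e.g.
# palettes = {part: extract_palette(img, m) for part, m in
#             {"head": head_mask, "thorax": thorax_mask, "wings": wing_mask}.items()}
```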
[191] Silicon Aware Neural Networks
Sebastian Fieldhouse, Kea-Tiong Tang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent work in the machine learning literature has demonstrated that deep learning can train neural networks made of discrete logic gate functions to perform simple image classification tasks at very high speeds on CPU, GPU and FPGA platforms. By virtue of being formed by discrete logic gates, these Differentiable Logic Gate Networks (DLGNs) lend themselves naturally to implementation in custom silicon - in this work we present a method to map DLGNs in a one-to-one fashion to a digital CMOS standard cell library by converting the trained model to a gate-level netlist. We also propose a novel loss function whereby the DLGN can optimize the area, and indirectly power consumption, of the resulting circuit by minimizing the expected area per neuron based on the area of the standard cells in the target standard cell library. Finally, we also show for the first time an implementation of a DLGN as a silicon circuit in simulation, performing layout of a DLGN in the SkyWater 130nm process as a custom hard macro using a Cadence standard cell library and performing post-layout power analysis. We find that our custom macro can perform classification on MNIST with 97% accuracy 41.8 million times a second at a power consumption of 83.88 mW.
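The area-aware loss is essentially an expectation over each neuron's distribution of candidate gates. Below is a minimal sketch, assuming every neuron holds trainable logits over the 16 two-input boolean functions and that per-gate cell areas are read from the target standard cell library; the table values and loss weight are placeholders, not figures from the paper.

```python
# Minimal sketch of an expected-area regularizer for a differentiable logic
# gate network. The 16-entry cell-area table and the loss weight are
# placeholders; real values would come from the target standard cell library.
import torch

def expected_area_loss(gate_logits, cell_areas):
    """gate_logits: (num_neurons, 16) trainable logits over the 16 two-input
    boolean functions; cell_areas: (16,) area of the standard cell that would
    implement each function, in library units."""
    probs = gate_logits.softmax(dim=-1)               # per-neuron gate distribution
    expected_area = (probs * cell_areas).sum(dim=-1)  # (num_neurons,)
    return expected_area.mean()

# Combined objective (weight is an assumed hyperparameter):
# total_loss = task_loss + area_weight * expected_area_loss(gate_logits, cell_areas)
```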
[192] Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control
Jay Jung, Ahmad Arrabi, Jax Luo, Scott Raymond, Safwan Wshah
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Purpose: Automated C-arm positioning ensures timely treatment in patients requiring emergent interventions. When a conventional Deep Learning (DL) approach for C-arm control fails, clinicians must revert to manual operation, resulting in additional delays. Consequently, an agentic C-arm control framework based on multimodal large language models (MLLMs) is highly desirable, as it can incorporate clinician feedback and use reasoning to make adjustments toward more accurate positioning. Skeletal landmark localization is essential for C-arm control, and we investigate adapting MLLMs for autonomous landmark localization. Methods: We used an annotated synthetic X-ray dataset and a real X-ray dataset. Each X-ray in both datasets is paired with several skeletal landmarks. We fine-tuned two MLLMs and tasked them with retrieving the closest landmarks from each X-ray. Quantitative evaluations of landmark localization were performed and compared against a leading DL approach. We further conducted qualitative experiments demonstrating: (1) how an MLLM can correct an initially incorrect prediction through reasoning, and (2) how the MLLM can sequentially navigate the C-arm toward a target location. Results: On both datasets, fine-tuned MLLMs demonstrate competitive performance across all localization tasks when compared with the DL approach. In the qualitative experiments, the MLLMs provide evidence of reasoning and spatial awareness. Conclusion: This study shows that fine-tuned MLLMs achieve accurate skeletal landmark localization and hold promise for agentic autonomous C-arm control. Our code is available at https://github.com/marszzibros/C-arm-localization-LLMs.git
[193] Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras
Ruijun Zhang, Hang Su, Kostas Daniilidis, Ziyun Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: https://github.com/spikelab-jhu/Match-Any-Events.
[194] DeltaSeg: Tiered Attention and Deep Delta Learning for Multi-Class Structural Defect Segmentation
Enrique Hernandez Noguera, Md Meftahul Ferdaus, Elias Ioup, Mahdi Abdelguerfi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automated segmentation of structural defects from visual inspection imagery remains challenging due to the diversity of damage types, extreme class imbalance, and the need for precise boundary delineation. This paper presents DeltaSeg, a U-shaped encoder-decoder architecture with a tiered attention strategy that integrates Squeeze-and-Excitation (SE) channel attention in the encoder, Coordinate Attention at the bottleneck and decoder, and a novel Deep Delta Attention (DDA) mechanism in the skip connections. The encoder uses depthwise separable convolutions with dilated stages to maintain spatial resolution while expanding the receptive field. Atrous Spatial Pyramid Pooling (ASPP) at the bottleneck captures multi-scale context. The DDA module refines skip connections through a dual-path scheme combining a learned delta operator for nuisance feature suppression with spatial attention gates conditioned on decoder signals. Deep supervision through multi-scale auxiliary heads further strengthens gradient flow and encourages semantically meaningful features at intermediate decoder stages. We evaluate DeltaSeg on two datasets: the S2DS dataset (7 classes) and the Culvert-Sewer Defect Dataset (CSDD, 9 classes). Across both benchmarks, DeltaSeg consistently outperforms 12 competing architectures including U-Net, SA-UNet, UNet3+, SegFormer, Swin-UNet, EGE-UNet, FPN, and Mobile-UNETR, demonstrating strong generalization across damage types, imaging conditions, and structural geometries.
[195] URoPE: Universal Relative Position Embedding across Geometric Spaces
Yichen Xie, Depu Meng, Chensheng Peng, Yihan Hu, Quentin Herau, Masayoshi Tomizuka, Wei Zhan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Relative position embedding has become a standard mechanism for encoding positional information in Transformers. However, existing formulations are typically limited to a fixed geometric space, namely 1D sequences or regular 2D/3D grids, which restricts their applicability to many computer vision tasks that require geometric reasoning across camera views or between 2D and 3D spaces. To address this limitation, we propose URoPE, a universal extension of Rotary Position Embedding (RoPE) to cross-view or cross-dimensional geometric spaces. For each key/value image patch, URoPE samples 3D points along the corresponding camera ray at predefined depth anchors and projects them into the query image plane. Standard 2D RoPE can then be applied using the projected pixel coordinates. URoPE is a parameter-free and intrinsics-aware relative position embedding that is invariant to the choice of global coordinate systems, while remaining fully compatible with existing RoPE-optimized attention kernels. We evaluate URoPE as a plug-in positional encoding for transformer architectures across a diverse set of tasks, including novel view synthesis, 3D object detection, object tracking, and depth estimation, covering 2D-2D, 2D-3D, and temporal scenarios. Experiments show that URoPE consistently improves the performance of transformer-based models across all tasks, demonstrating its effectiveness and generality for geometric reasoning. Our project website is: https://urope-pe.github.io/.
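The geometric core of URoPE is sampling points along each key patch's camera ray at fixed depth anchors and projecting them into the query view, so that ordinary 2D RoPE can be applied to the projected coordinates. The sketch below illustrates that projection step under assumed pinhole intrinsics and a known relative pose; the variable names and depth anchors are illustrative, not the paper's exact implementation.

```python
# Hedged sketch of the ray-anchor projection step behind URoPE.
# Depth anchors, intrinsics, and the relative pose here are illustrative.
import torch

def project_ray_anchors(uv_key, K_key, K_query, T_query_from_key, depths):
    """uv_key: (N, 2) pixel centers of key/value patches; K_*: (3, 3) intrinsics;
    T_query_from_key: (4, 4) relative pose; depths: (D,) depth anchors.
    Returns (N, D, 2) projected pixel coordinates that standard 2D RoPE can consume."""
    ones = torch.ones(uv_key.shape[0], 1)
    rays = (torch.linalg.inv(K_key) @ torch.cat([uv_key, ones], dim=1).T).T  # (N, 3)
    pts = rays[:, None, :] * depths[None, :, None]                           # (N, D, 3)
    pts_h = torch.cat([pts, torch.ones(*pts.shape[:2], 1)], dim=-1)          # (N, D, 4)
    pts_q = (T_query_from_key @ pts_h.reshape(-1, 4).T).T[:, :3]             # (N*D, 3)
    proj = (K_query @ pts_q.T).T
    uv_query = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    return uv_query.reshape(uv_key.shape[0], depths.shape[0], 2)
```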
[196] REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction
Seowung Leem, Lin Gu, Chenyu You, Kuang Gong, Ruogu Fang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The retina provides a unique, noninvasive window into Alzheimer’s disease (AD) and dementia, capturing early structural changes through morphometric features, while systemic and lifestyle risk factors reflect well-established contributors to disease susceptibility long before clinical symptom onset. However, current retinal analysis frameworks typically model imaging and risk factors separately, limiting their ability to capture joint multimodal patterns critical for early risk prediction. Moreover, existing methods rarely incorporate mechanisms to organize or align patients with similar retinal and clinical characteristics, constraining the learning of coherent cross-modal associations. To address these limitations, we introduce REVEAL (REtinal-risk Vision-Language Early Alzheimer’s Learning), a framework that aligns color fundus photographs with individualized disease-specific risk profiles for predicting incident AD and dementia, on average 8 years before diagnosis (range: 1-11 years). Because real-world risk factors are structured questionnaire data, we translate them into clinically interpretable narratives compatible with pretrained vision-language models (VLMs). We further propose a group-aware contrastive learning (GACL) strategy that clusters patients with similar retinal morphometry and risk factors as positive pairs, strengthening multimodal alignment. This unified representation learning framework substantially outperforms state-of-the-art retinal imaging models paired with clinical text encoders, as well as general-purpose VLMs, demonstrating the value of jointly modeling retinal biomarkers and clinical risk factors. By providing a generalizable and noninvasive approach for early AD and dementia risk stratification, REVEAL has the potential to enable earlier intervention and improve preventive care at the population level.
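The group-aware contrastive learning (GACL) idea can be approximated with a supervised-contrastive-style objective in which patients sharing a retinal/risk cluster count as positives. The sketch below is one reading under that assumption; the clustering step, the temperature, and the symmetric text-to-image term are omitted or assumed rather than taken from the paper.

```python
# Minimal sketch of a group-aware contrastive objective: patients assigned to
# the same cluster (by retinal morphometry + risk profile) are treated as
# positives. Clustering, temperature, and embedding sources are assumptions.
import torch
import torch.nn.functional as F

def group_contrastive_loss(img_emb, text_emb, group_ids, temperature=0.07):
    """img_emb, text_emb: (B, D) paired embeddings; group_ids: (B,) cluster labels."""
    img_emb = F.normalize(img_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = img_emb @ text_emb.T / temperature                     # (B, B)
    pos_mask = (group_ids[:, None] == group_ids[None, :]).float()   # same-group pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood over all positives of each anchor image
    # (each anchor has at least its own pair on the diagonal).
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```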
[197] CAHAL: Clinically Applicable resolution enHAncement for Low-resolution MRI scans
Sergio Morell-Ortega, Ángela González-Cebrián, Boris Mansencal, Marien Gadea, Roberto Vivo-Hernando, Gregorio Rubio, Fernando Aparici, Maria de la Iglesia-Vaya, Gwenaelle Catheline, Pierrick Coupé, José V. Manjón
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large-scale automated morphometric analysis of brain MRI is limited by the thick-slice, anisotropic acquisitions prevalent in routine clinical practice. Existing generative super-resolution (SR) methods produce visually compelling isotropic volumes but often introduce anatomical hallucinations, systematic volumetric overestimation, and structural distortions that compromise downstream quantitative analysis and diagnostic safety. To address this, we propose CAHAL (Clinically Applicable resolution enHAncement for Low-resolution MRI scans), a hallucination-robust, physics-informed resolution enhancement framework that operates directly in the patient’s native acquisition space. CAHAL employs a deterministic bivariate Mixture of Experts (MoE) architecture routing each input through specialised residual 3D U-Net experts conditioned on both volumetric resolution and acquisition anisotropy, two independent descriptors of clinical MRI acquisition. Experts are optimised with a composite loss combining edge-penalised spatial reconstruction, Fourier-domain spectral coherence matching, and a segmentation-guided semantic consistency constraint. Training pairs are generated on-the-fly via physics-based degradation sampled from a large-scale real-world database, ensuring robust generalisation. Validated on T1-weighted and FLAIR sequences against generative baselines, CAHAL achieves state-of-the-art results, improving on the best related methods in both accuracy and efficiency.
[198] EfficientPENet: Real-Time Depth Completion from Sparse LiDAR via Lightweight Multi-Modal Fusion
Johny J. Lopez, Md Meftahul Ferdaus, Mahdi Abdelguerfi, Anton Netchaev, Steven Sloan, Ken Pathak, Kendall N. Niles
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Depth completion from sparse LiDAR measurements and corresponding RGB images is a prerequisite for accurate 3D perception in robotic systems. Existing methods achieve high accuracy on standard benchmarks but rely on heavy backbone architectures that preclude real-time deployment on embedded hardware. We present EfficientPENet, a two-branch depth completion network that replaces the conventional ResNet encoder with a modernized ConvNeXt backbone, introduces sparsity-invariant convolutions for the depth stream, and refines predictions through a Convolutional Spatial Propagation Network (CSPN). The RGB branch leverages ImageNet-pretrained ConvNeXt blocks with Layer Normalization, 7x7 depthwise convolutions, and stochastic depth regularization. Features from both branches are merged via late fusion and decoded through a multi-scale deep supervision strategy. We further introduce a position-aware test-time augmentation scheme that corrects coordinate tensors during horizontal flipping, yielding consistent error reduction at inference. On the KITTI depth completion benchmark, EfficientPENet achieves an RMSE of 631.94 mm with 36.24M parameters and a latency of 20.51 ms, operating at 48.76 FPS. This represents a 3.7 times reduction in parameters and a 23 times speedup relative to BP-Net, while maintaining competitive accuracy. These results establish EfficientPENet as a practical solution for real-time depth completion on resource-constrained edge platforms such as the NVIDIA Jetson.
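The position-aware test-time augmentation amounts to flipping the inputs, correcting any coordinate features so they remain geometrically valid, and averaging the un-flipped predictions. Below is a hedged sketch under the assumption that the network takes an explicit normalized (x, y) coordinate tensor alongside RGB and sparse depth; the interface is illustrative, not the released code.

```python
# Hedged sketch of position-aware horizontal-flip test-time augmentation.
# The model interface and the explicit x-coordinate channel are assumptions.
import torch

def flip_tta(model, rgb, sparse_depth, coords):
    """rgb: (B, 3, H, W); sparse_depth: (B, 1, H, W);
    coords: (B, 2, H, W) normalized (x, y) coordinate channels."""
    pred = model(rgb, sparse_depth, coords)

    rgb_f = torch.flip(rgb, dims=[-1])
    depth_f = torch.flip(sparse_depth, dims=[-1])
    coords_f = torch.flip(coords, dims=[-1])
    # Correct the x-coordinate channel so positions stay geometrically valid
    # after the flip (assuming x is normalized to [0, 1]).
    coords_f[:, 0] = 1.0 - coords_f[:, 0]

    pred_f = torch.flip(model(rgb_f, depth_f, coords_f), dims=[-1])
    return 0.5 * (pred + pred_f)   # average the original and flip-corrected predictions
```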
[199] CrossPan: A Comprehensive Benchmark for Cross-Sequence Pancreas MRI Segmentation and Generalization
Linkai Peng, Cuiling Sun, Zheyuan Zhang, Wanying Dou, Halil Ertugrul Aktas, Andrea M Bejar, Elif Keles, Tamas Gonda, Michael B Wallace, Zongwei Zhou, Gorkem Durak, Rajesh N Keswani, Ulas Bagci
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automatic pancreas segmentation is fundamental to abdominal MRI analysis, yet deep learning models trained on one MRI sequence often fail catastrophically when applied to another, a challenge that has received little systematic investigation. We introduce CrossPan, a multi-institutional benchmark comprising 1,386 3D scans across three routinely acquired sequences (T1-weighted, T2-weighted, and Out-of-Phase) from eight centers. Our experiments reveal three key findings. First, cross-sequence domain shifts are far more severe than cross-center variability: models achieving Dice scores above 0.85 in-domain collapse to near-zero (<0.02) when transferred across sequences. Second, state-of-the-art domain generalization methods provide negligible benefit under these physics-driven contrast inversions, whereas foundation models like MedSAM2 maintain moderate zero-shot performance through contrast-invariant shape priors. Third, semi-supervised learning offers gains only under stable intensity distributions and becomes unstable on sequences with high intra-organ variability. These results establish cross-sequence generalization, not model architecture or center diversity, as the primary barrier to clinically deployable pancreas MRI segmentation. Dataset and code are available at https://crosspan.netlify.app/.
[200] LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models
Zhiyuan Jiang, Weihao Hong, Xinlei Guan, Tejaswi Dhandu, Miles Q. Li, Meng Xu, Kuan Huang, Umamaheswara Rao Tida, Bingyu Shen, Daehan Kwak, Boyang Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language Models (VLMs) are increasingly deployed in settings where reliable visual grounding carries operational consequences, yet their behavior under progressively coercive prompt phrasing remains undercharacterized. Existing hallucination benchmarks predominantly rely on neutral prompts and binary detection, leaving open how both the incidence and the intensity of fabrication respond to graded linguistic pressure across structurally distinct task types. We present Ghost-100, a procedurally constructed benchmark of 800 synthetically generated images spanning eight categories across three task families – text-illegibility, time-reading, and object-absence – each designed under a negative-ground-truth principle that guarantees the queried target is absent, illegible, or indeterminate by construction. Every image is paired with five prompts drawn from a structured 5-Level Prompt Intensity Framework, holding the image and task identity fixed while varying only directive force, so that tone is isolated as the sole independent variable. We adopt a dual-track evaluation protocol: a rule-based H-Rate measuring the proportion of responses in which a model crosses from grounded refusal into unsupported positive commitment, and a GPT-4o-mini-judged H-Score on a 1-5 scale characterizing the confidence and specificity of fabrication once it occurs. We additionally release a three-stage automated validation workflow, which retrospectively confirms 717 of 800 images as strictly compliant. Evaluating nine open-weight VLMs, we find that H-Rate and H-Score dissociate substantially across model families, reading-style and presence-detection subsets respond to prompt pressure in qualitatively different ways, and several models exhibit non-monotonic sensitivity peaking at intermediate tone levels – patterns that aggregate metrics obscure.
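The dual-track metrics reduce to a simple aggregation once each response has been flagged and scored. The snippet below sketches that aggregation; the record format and the idea of computing it per (model, tone level) cell are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch of the dual-track aggregation: a rule-based H-Rate (fraction of
# responses that commit to a fabricated positive claim) and a judge-scored
# H-Score averaged over fabricating responses only. Record format is assumed.
from statistics import mean

def aggregate(records):
    """records: list of dicts with keys 'hallucinated' (bool, rule-based flag)
    and 'judge_score' (int 1-5 from the LLM judge, meaningful when flagged)."""
    h_rate = mean(1.0 if r["hallucinated"] else 0.0 for r in records)
    fabricated = [r["judge_score"] for r in records if r["hallucinated"]]
    h_score = mean(fabricated) if fabricated else None
    return {"H-Rate": h_rate, "H-Score": h_score}

# Usage: aggregate one cell per (model, prompt intensity level) to trace how
# fabrication incidence and intensity change across the five tone levels.
```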
[201] Geometric Decoupling: Diagnosing the Structural Instability of Latent
Yuanbang Liang, Zhengwen Chen, Yu-Kun Lai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Latent Diffusion Models (LDMs) achieve high-fidelity synthesis but suffer from latent space brittleness, causing discontinuous semantic jumps during editing. We introduce a Riemannian framework to diagnose this instability by analyzing the generative Jacobian, decomposing geometry into Local Scaling (capacity) and Local Complexity (curvature). Our study uncovers a "Geometric Decoupling": while curvature in normal generation functionally encodes image detail, OOD generation exhibits a functional decoupling where extreme curvature is wasted on unstable semantic boundaries rather than perceptible details. This geometric misallocation identifies "Geometric Hotspots" as the structural root of instability, providing a robust intrinsic metric for diagnosing generative reliability.
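For intuition, the two diagnostics can be approximated from the decoder Jacobian at a latent point: a log-volume term from its singular values for Local Scaling, and a finite-difference measure of how fast the Jacobian itself changes for Local Complexity. The sketch below is a rough, small-scale illustration under those assumptions and is only practical for low-dimensional decoders; the paper's actual estimators may differ.

```python
# Hedged sketch of Jacobian-based "local scaling" and a curvature proxy for
# "local complexity". Intended for a small decoder at low resolution; the
# exact estimators used in the paper are not reproduced here.
import torch
from torch.autograd.functional import jacobian

def local_geometry(decoder, z, eps=1e-2):
    """decoder: maps a (d,) latent to an output tensor; z: (d,) latent point."""
    J = jacobian(lambda x: decoder(x).flatten(), z)          # (out_dim, d)
    sigma = torch.linalg.svdvals(J)
    local_scaling = torch.log(sigma.clamp(min=1e-8)).sum()   # log-volume change

    # Curvature proxy: how much the Jacobian changes under a small random
    # perturbation of the latent code (finite-difference approximation).
    delta = eps * torch.randn_like(z)
    J2 = jacobian(lambda x: decoder(x).flatten(), z + delta)
    local_complexity = (J2 - J).norm() / (J.norm() * eps)
    return local_scaling, local_complexity
```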
[202] DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
Abrar Majeedi, Zhiyuan Ruan, Ziyi Zhao, Hongcheng Wang, Jianglin Lu, Yin Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal large language models (MLLMs) have achieved impressive performance on visual perception and reasoning tasks with RGB imagery, yet they remain fragile under common degradations, such as fog, blur, or low-light conditions. Infrared (IR) imaging, a well-established complement to RGB, offers inherent robustness in these conditions, but its integration into MLLMs remains underexplored. To bridge this gap, we propose DUALVISION, a lightweight fusion module that efficiently incorporates IR-RGB information into MLLMs via patch-level localized cross-attention. To support training and evaluation and to facilitate future research, we also introduce DV-204K, a dataset of ~25K publicly available aligned IR-RGB image pairs with 204K modality-specific QA annotations, and DV-500, a benchmark of 500 IR-RGB image pairs with 500 QA pairs designed for evaluating cross-modal reasoning. Leveraging these datasets, we benchmark both open- and closed-source MLLMs and demonstrate that DUALVISION delivers strong empirical performance under a wide range of visual degradations. Our code and dataset are available at https://abrarmajeedi.github.io/dualvision.
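One plausible reading of the fusion module is a residual cross-attention block in which RGB patch tokens query the aligned IR patch tokens. The sketch below uses a standard multi-head attention layer for that purpose; the dimensions are placeholders, and the localized (patch-window) restriction described in the abstract is omitted for brevity.

```python
# Illustrative sketch of RGB-IR token fusion via cross-attention, not the
# paper's exact module. Embedding dimension and head count are placeholders.
import torch
import torch.nn as nn

class RGBIRFusion(nn.Module):
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens, ir_tokens):
        """rgb_tokens, ir_tokens: (B, N, dim) patch embeddings from the two
        aligned views. RGB patches query the IR patches and the result is added
        back residually, so degraded RGB regions can borrow IR evidence."""
        fused, _ = self.attn(query=rgb_tokens, key=ir_tokens, value=ir_tokens)
        return self.norm(rgb_tokens + fused)
```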
[203] Feasibility of Indoor Frame-Wise Lidar Semantic Segmentation via Distillation from Visual Foundation Model
Haiyang Wu, Juan J. Gonzales Torres, George Vosselman, Ville Lehtola
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Frame-wise semantic segmentation of indoor lidar scans is a fundamental step toward higher-level 3D scene understanding and mapping applications. However, acquiring frame-wise ground truth for training deep learning models is costly and time-consuming. This challenge is largely addressed, for imagery, by Visual Foundation Models (VFMs) which segment image frames. The same VFMs may be used to train a lidar scan frame segmentation model via a 2D-to-3D distillation pipeline. The success of such distillation has been shown for autonomous driving scenes, but not yet for indoor scenes. Here, we study the feasibility of repeating this success for indoor scenes, in a frame-wise distillation manner by coupling each lidar scan with a VFM-processed camera image. The evaluation is done using indoor SLAM datasets, where pseudo-labels are used for downstream evaluation. Also, a small manually annotated lidar dataset is provided for validation, as there are no other lidar frame-wise indoor datasets with semantics. Results show that the distilled model achieves up to 56% mIoU under pseudo-label evaluation and around 36% mIoU with real-label, demonstrating the feasibility of cross-modal distillation for indoor lidar semantic segmentation without manual annotations.
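The frame-wise 2D-to-3D distillation hinges on a simple geometric transfer: project each lidar point into the paired camera frame and adopt the VFM's pixel label as that point's pseudo-label. Below is a minimal sketch under assumed calibration conventions; the thresholds and label encoding are illustrative.

```python
# Hedged sketch of 2D-to-3D pseudo-label transfer from a VFM segmentation mask
# to lidar points. Calibration conventions and thresholds are assumptions.
import numpy as np

def transfer_labels(points, T_cam_from_lidar, K, vfm_labels):
    """points: (N, 3) lidar points; T_cam_from_lidar: (4, 4) extrinsics;
    K: (3, 3) intrinsics; vfm_labels: (H, W) integer mask from the VFM.
    Returns per-point pseudo-labels, -1 where no valid projection exists."""
    H, W = vfm_labels.shape
    pts_h = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)
    cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]
    valid = cam[:, 2] > 0.1                            # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)   # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.full(points.shape[0], -1, dtype=np.int64)
    labels[valid] = vfm_labels[v[valid], u[valid]]
    return labels
```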
[204] Multi-Domain Learning with Global Expert Mapping
Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Oscar Mendez, Dacheng Tao, Xuelong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human perception generalizes well across different domains, but most vision models struggle beyond their training data. This gap motivates multi-dataset learning, where a single model is trained on diverse datasets to improve robustness under domain shifts. However, unified training remains challenging due to inconsistencies in data distributions and label semantics. Mixture-of-Experts (MoE) models provide a scalable solution by routing inputs to specialized subnetworks (experts). Yet, existing MoEs often fail to specialize effectively, as their load-balancing mechanisms enforce uniform input distribution across experts. This fairness conflicts with domain-aware routing, causing experts to learn redundant representations, and reducing performance especially on rare or out-of-distribution domains. We propose GEM (Global Expert Mapping), a planner-compiler framework that replaces the learned router with a global scheduler. Our planner, based on linear programming relaxation, computes a fractional assignment of datasets to experts, while the compiler applies hierarchical rounding to convert this soft plan into a deterministic, capacity-aware mapping. Unlike prior MoEs, GEM avoids balancing loss, resolves the conflict between fairness and specialization, and produces interpretable routing. Experiments show that GEM-DINO achieves state-of-the-art performance on the UODB benchmark, with notable gains on underrepresented datasets and solves task interference in few-shot adaptation scenarios.
[205] DDF2Pol: A Dual-Domain Feature Fusion Network for PolSAR Image Classification
Mohammed Q. Alkhatib
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents DDF2Pol, a lightweight dual-domain convolutional neural network for PolSAR image classification. The proposed architecture integrates two parallel feature extraction streams, one real-valued and one complex-valued, designed to capture complementary spatial and polarimetric information from PolSAR data. To further refine the extracted features, a depth-wise convolution layer is employed for spatial enhancement, followed by a coordinate attention mechanism to focus on the most informative regions. Experimental evaluations conducted on two benchmark datasets, Flevoland and San Francisco, demonstrate that DDF2Pol achieves superior classification performance while maintaining low model complexity. Specifically, it attains an Overall Accuracy (OA) of 98.16% on the Flevoland dataset and 96.12% on the San Francisco dataset, outperforming several state-of-the-art real- and complex-valued models. With only 91,371 parameters, DDF2Pol offers a practical and efficient solution for accurate PolSAR image analysis, even when training data is limited. The source code is publicly available at https://github.com/mqalkhatib/DDF2Pol
[206] ConvVitMamba: Efficient Multiscale Convolution, Transformer, and Mamba-Based Sequence modelling for Hyperspectral Image Classification
Mohammed Q. Alkhatib
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hyperspectral image (HSI) classification remains challenging due to high spectral dimensionality, redundancy, and limited labeled data. Although convolutional neural networks (CNNs) and Vision Transformers (ViTs) achieve strong performance by exploiting spectral-spatial information and long-range dependencies, they often incur high computational cost and large model size, limiting practical use. To address these limitations, a unified hybrid framework, termed ConvVitMamba, is proposed for efficient HSI classification. The architecture integrates three components: a multiscale convolutional feature extractor to capture local spectral, spatial, and joint patterns; a Vision Transformer-based tokenization and encoding stage to model global contextual relationships; and a lightweight Mamba-inspired gated sequence mixing module for efficient content-aware refinement without quadratic self-attention. Principal Component Analysis (PCA) is used as preprocessing to reduce redundancy and improve efficiency. Experiments on four benchmark datasets, including Houston and three UAV-borne QUH datasets (Pingan, Qingyun, and Tangdaowan), demonstrate that ConvVitMamba consistently outperforms CNN-, Transformer-, and Mamba-based methods while maintaining a favorable balance between accuracy, model size, and inference efficiency. Ablation studies confirm the complementary contributions of all components. The results indicate that the proposed framework provides an effective and efficient solution for HSI classification in diverse scenarios. The source code is publicly available at https://github.com/mqalkhatib/ConvVitMamba
[207] ORCA: An Agentic Reasoning Framework for Hallucination and Adversarial Robustness in Vision-Language Models
Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Vision-Language Models (LVLMs) exhibit strong multimodal capabilities but remain vulnerable to hallucinations from intrinsic errors and adversarial attacks from external exploitations, limiting their reliability in real-world applications. We present ORCA, an agentic reasoning framework that improves the factual accuracy and adversarial robustness of pretrained LVLMs through inference-time structured inference reasoning with a suite of small vision models (less than 3B parameters). ORCA operates via an Observe-Reason-Critique-Act loop, querying multiple visual tools with evidential questions, validating cross-model inconsistencies, and refining predictions iteratively without access to model internals or retraining. ORCA also stores intermediate reasoning traces, which supports auditable decision-making. Though designed primarily to mitigate object-level hallucinations, ORCA also exhibits emergent adversarial robustness without requiring adversarial training or defense mechanisms. We evaluate ORCA across three settings: (1) clean images on hallucination benchmarks, (2) adversarially perturbed images without defense, and (3) adversarially perturbed images with defense applied. On the POPE hallucination benchmark, ORCA improves standalone LVLMs performance by +3.64% to +40.67% across different subsets. Under adversarial perturbations on POPE, ORCA achieves an average accuracy gain of +20.11% across LVLMs. When combined with defense techniques on adversarially perturbed AMBER images, ORCA further improves standalone LVLM performance, with gains ranging from +1.20% to +48.00% across metrics. These results demonstrate that ORCA offers a promising path toward building more reliable and robust multimodal systems.
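The Observe-Reason-Critique-Act loop can be pictured as a small control loop around a frozen LVLM and a set of lightweight vision tools. The skeleton below is purely illustrative: the lvlm.answer / lvlm.critique / lvlm.revise calls and the vision_tools dictionary are hypothetical placeholders, not an API from the paper.

```python
# Illustrative skeleton of an Observe-Reason-Critique-Act loop. All tool and
# LVLM methods used here are hypothetical placeholders; the paper's actual
# tool suite, prompts, and stopping rule are not reproduced.
def orca_loop(image, question, lvlm, vision_tools, max_rounds=3):
    trace = []                                    # auditable reasoning trace
    answer = lvlm.answer(image, question)         # initial LVLM response
    for _ in range(max_rounds):
        # Observe: query small vision models with evidential sub-questions.
        evidence = {name: tool(image, question) for name, tool in vision_tools.items()}
        # Reason + Critique: check whether the answer conflicts with the evidence.
        critique = lvlm.critique(question, answer, evidence)
        trace.append({"answer": answer, "evidence": evidence, "critique": critique})
        if critique["consistent"]:
            break                                 # Act: keep the current answer
        answer = lvlm.revise(question, evidence, critique)  # Act: refine the answer
    return answer, trace
```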
[208] HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images
Pourya Shamsolmoali, Masoumeh Zareapoor, Michael Felsberg, Nick Pears, Yue Lu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a global expert assignment layer that uses latent geographic embeddings to route datasets to specialized processing modules, and a local scene decomposition mechanism that allocates image subregions to region-specific sub-modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method offers an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, regional specialization, and open-category detection.
[209] Visual Reasoning Agent: Robust Vision Systems in Remote Sensing via Inference-Time Scaling
Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Building robust vision systems for high-stakes domains such as remote sensing requires stronger visual reasoning than what single-pass inference typically provides; yet, retraining large models is often computationally expensive and data intensive. We present Visual Reasoning Agent (VRA), a training-free agentic visual reasoning framework that orchestrates off-the-shelf large vision-language models (LVLMs) with a large reasoning model (LRM) through an iterative Think-Critique-Act loop for cross-model verification, self-critique, and recursive refinement. On the remote sensing benchmark VRSBench VQA dataset, VRA consistently outperforms multiple standalone LVLM baselines and achieves up to 40.67% improvement on challenging question types spanning both perception and reasoning tasks. In addition, integrating three LVLMs with VRA improves the overall accuracy of the standalone LVLMs from 52.8% to 78.8%, demonstrating the effectiveness of agentic reasoning with increased inference-time compute.
[210] Hierarchically Robust Zero-shot Vision-language Models
Junhao Dong, Yifei Zhang, Hao Zhu, Yew-Soon Ong, Piotr Koniusz
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language Models (VLMs) can perform zero-shot classification but are susceptible to adversarial attacks. While robust fine-tuning improves their robustness, existing approaches align fixed text embeddings with an image embedding, sacrificing natural performance and robustness. A robustness degradation also occurs when a model faces adversarial attacks targeting superclasses (parent classes, e.g., mammal) in addition to their base (leaf) classes (e.g., cat). Thus, to enhance adversarial robustness and leverage the inherent hierarchical properties of class space, we propose a novel adversarial fine-tuning framework based on hierarchical embeddings and several levels of adversarially robust alignment of image-text modalities. Additional mechanisms place visual embeddings at the desired depth of hierarchy, and we provide a theoretical connection between the depth of embedding in the hierarchy and the maximum viable margin size. Our model naturally realizes several margin sizes, boosting generalization of adversaries for robustification. As various trees with different parent labels can share the same leaf labels, we also consider aligning over multiple trees to boost semantic variety. Experiments across several datasets are performed.
[211] A Proxy Consistency Loss for Grounded Fusion of Earth Observation and Location Encoders
Zhongying Wang, Kevin Lane, Levi Cai, Morteza Karimzadeh, Esther Rolf
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Supervised learning with Earth observation inputs is often limited by the sparsity of high-quality labeled or in-situ measured data to use as training labels. With the abundance of geographic data products, in many cases there are variables correlated with - but different from - the variable of interest that can be leveraged. We integrate such proxy variables within a geographic prior via a trainable location encoder and introduce a proxy consistency loss (PCL) formulation to imbue proxy data into the location encoder. The first key insight behind our approach is to use the location encoder as an agile and flexible way to learn from abundantly available proxy data which can be sampled independently of training label availability. Our second key insight is that we will need to regularize the location encoder appropriately to achieve performance and robustness with limited labeled data. Our experiments on air quality prediction and poverty mapping show that integrating proxy data implicitly through the location encoder outperforms using both as input to an observation encoder and fusion strategies that use frozen, pretrained location embeddings as a geographic prior. Superior performance for in-sample prediction shows that the PCL can incorporate rich information from the proxies, and superior out-of-sample prediction shows that the learned latent embeddings help generalize to areas without training labels.
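A minimal way to picture the proxy consistency loss is as an auxiliary regression head on the location encoder, trained on proxy samples drawn independently of label availability and added to the supervised objective. The sketch below assumes a simple two-head design, MSE losses, and a single weighting term, none of which are taken from the paper.

```python
# Hedged sketch of training with a proxy consistency loss (PCL): the location
# encoder is regularized to reproduce proxy variables sampled anywhere,
# labeled or not. Heads, losses, and the weight lam are assumptions.
import torch
import torch.nn.functional as F

def training_step(obs_enc, loc_enc, proxy_head, target_head,
                  x_obs, coords_lab, y, coords_proxy, proxy_vals, lam=1.0):
    # Supervised branch: fuse observation features with location embeddings.
    z = torch.cat([obs_enc(x_obs), loc_enc(coords_lab)], dim=-1)
    sup_loss = F.mse_loss(target_head(z), y)

    # Proxy consistency: location embeddings alone must explain the proxy
    # variable at independently sampled locations.
    pcl = F.mse_loss(proxy_head(loc_enc(coords_proxy)), proxy_vals)
    return sup_loss + lam * pcl
```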
[212] Localization-Guided Foreground Augmentation in Autonomous Driving
Jiawei Yong, Deyuan Qu, Qi Chen, Kentaro Oguchi, Shintaro Fukushima
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Autonomous driving systems often degrade under adverse visibility conditions, such as rain, nighttime, or snow, where online scene geometry (e.g., lane dividers, road boundaries, and pedestrian crossings) becomes sparse or fragmented. While high-definition (HD) maps can provide missing structural context, they are costly to construct and maintain at scale. We propose Localization-Guided Foreground Augmentation (LG-FA), a lightweight and plug-and-play inference module that enhances foreground perception by enriching geometric context online. LG-FA: (i) incrementally constructs a sparse global vector layer from per-frame Bird’s-Eye View (BEV) predictions; (ii) estimates ego pose via class-constrained geometric alignment, jointly improving localization and completing missing local topology; and (iii) reprojects the augmented foreground into a unified global frame to improve per-frame predictions. Experiments on challenging nuScenes sequences demonstrate that LG-FA improves the geometric completeness and temporal stability of BEV representations, reduces localization error, and produces globally consistent lane and topology reconstructions. The module can be seamlessly integrated into existing BEV-based perception systems without backbone modification. By providing a reliable geometric context prior, LG-FA enhances temporal consistency and supplies stable structural support for downstream modules such as tracking and decision-making.
[213] Bridging Foundation Models and ASTM Metallurgical Standards for Automated Grain Size Estimation from Microscopy Images
Abdul Mueez, Shruti Vyas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Extracting standardized metallurgical metrics from microscopy images remains challenging due to complex grain morphology and the data demands of supervised segmentation. To bridge foundational computer vision with practical metallurgical evaluation, we propose an automated pipeline for dense instance segmentation and grain size estimation that adapts Cellpose-SAM to microstructures and integrates its topology-aware gradient tracking with an ASTM E112 Jeffries planimetric module. We systematically benchmark this pipeline against a classical convolutional network (U-Net), an adaptive-prompting vision foundation model (MatSAM) and a contemporary vision-language model (Qwen2.5-VL-7B). Our evaluations reveal that while the out-of-the-box vision-language model struggles with the localized spatial reasoning required for dense microscopic counting and MatSAM suffers from over-segmentation despite its domain-specific prompt generation, our adapted pipeline successfully maintains topological separation. Furthermore, experiments across progressively reduced training splits demonstrate exceptional few-shot scalability; utilizing only two training samples, the proposed system predicts the ASTM grain size number (G) with a mean absolute percentage error (MAPE) as low as 1.50%, while robustness testing across varying target grain counts empirically validates the ASTM 50-grain sampling minimum. These results highlight the efficacy of application-level foundation model integration for highly accurate, automated materials characterization. Our project repository is available at https://github.com/mueez-overflow/ASTM-Grain-Size-Estimator.
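The ASTM E112 Jeffries planimetric step that the pipeline automates is itself a short calculation: count grains wholly inside a test field plus half of those cut by its border, convert to grains per square millimetre at 1x, and map that density to the grain size number G via the standard E112 relation. The sketch below implements that relation; only the grain counts (which would come from the segmentation output) and the field area are inputs.

```python
# Worked sketch of the ASTM E112 Jeffries planimetric calculation. Grain
# counts would come from the instance segmentation; the field area is the
# true specimen area of the imaged region at 1x.
import math

def astm_grain_size(n_inside, n_boundary, field_area_mm2):
    """n_inside: grains entirely within the measurement field,
    n_boundary: grains intersected by the field border,
    field_area_mm2: true specimen area of the field in mm^2 (at 1x)."""
    n_a = (n_inside + 0.5 * n_boundary) / field_area_mm2   # grains per mm^2
    g = 3.321928 * math.log10(n_a) - 2.954                 # ASTM E112 relation
    return n_a, g

# Example: 46 interior grains and 22 border grains in a 0.5 mm^2 field give
# N_A = 114 grains/mm^2 and G ≈ 3.9.
```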
[214] Toward Clinically Acceptable Chest X-ray Report Generation: A Qualitative Retrospective Pilot Study of CXRMate-2
Aaron Nicolson, Elizabeth J. Cooper, Hwan-Jin Yoon, Claire McCafferty, Ramya Krishnan, Michelle Craigie, Nivene Saad, Jason Dowling, Ian A. Scott, Bevan Koopman
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Chest X-ray (CXR) radiology report generation (RRG) models have shown rapid progress, yet their clinical utility remains uncertain due to limited evaluation by radiologists. We present CXRMate-2, a state-of-the-art CXR RRG model that integrates structured multimodal conditioning and reinforcement learning with a composite reward for semantic alignment with radiologist reports. Across the MIMIC-CXR, CheXpert Plus, and ReXgradient datasets, CXRMate-2 achieves statistically significant improvements over strong benchmarks, including gains of 11.2% and 24.4% in GREEN and RadGraph-XL, respectively, on MIMIC-CXR relative to MedGemma 1.5 (4B). To directly compare CXRMate-2 against radiologist reporting, we conduct a blinded, randomised qualitative retrospective evaluation. Three consultant radiologists compare generated and radiologist reports across 120 studies from the MIMIC-CXR test set. Generated reports were deemed acceptable (defined as preferred or rated equally to radiologist reports) in 45% of ratings, with no statistically significant difference in preference rates between radiologist reports and acceptable generated reports for seven of the eight analysed findings. Preference for radiologist reports was driven primarily by higher recall, while generated reports were often preferred for readability. Together, these results suggest a credible pathway to clinically acceptable CXR RRG. Improvements in recall, alongside better detection of subtle findings (e.g., pulmonary congestion), are likely sufficient to achieve non-inferiority to radiologist reporting. With these targeted advances, CXR RRG systems may be ready for prospective evaluation in assistive roles within radiologist-led workflows.
[215] AdaGScale: Viewpoint-Adaptive Gaussian Scaling in 3D Gaussian Splatting to Reduce Gaussian-Tile Pairs
Joongho Jo, Hyerin Lim, Hanjun Choi, Jongsun Park
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reducing the number of Gaussian-tile pairs is one of the most promising approaches to improve 3D Gaussian Splatting (3D-GS) rendering speed on GPUs. However, the importance difference existing among Gaussian-tile pairs has never been considered in the previous works. In this paper, we propose AdaGScale, a novel viewpoint-adaptive Gaussian scaling technique for reducing the number of Gaussian-tile pairs. AdaGScale is based on the observation that the peripheral tiles located far from Gaussian center contribute negligibly to pixel color accumulation. This suggests an opportunity for reducing the number of Gaussian-tile pairs based on color contribution. AdaGScale efficiently estimates the color contribution in the peripheral region of each Gaussian during a preprocessing stage and adaptively scales its size based on the peripheral score. As a result, Gaussians with lower importance intersect with fewer tiles during the intersection test, which improves rendering speed while maintaining image quality. The adjusted size is used only for tile intersection test, and the original size is retained during color accumulation to preserve visual fidelity. Experimental results show that AdaGScale achieves a geometric mean speedup of 13.8x over original 3D-GS on a GPU, with only about 0.5 dB degradation in PSNR on city-scale scenes.
[216] A Multi-Agent Framework with Structured Reasoning and Reflective Refinement for Multimodal Empathetic Response Generation
Liping Wang, Cheng Ye, Weidong Chen, Peipei Song, Bo Hu, Zhendong Mao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal empathetic response generation (MERG) aims to generate emotionally engaging and empathetic responses based on users’ multimodal contexts. Existing approaches usually rely on an implicit one-pass generation paradigm from multimodal context to the final response, which overlooks two intrinsic characteristics of MERG: (1) Human perception of emotional cues is inherently structured rather than a direct mapping. The conventional paradigm neglects the hierarchical progression of emotion perception, leading to distorted emotional judgments. (2) Given the inherent complexity and ambiguity of human emotions, the conventional paradigm is prone to significant emotional biases, ultimately resulting in suboptimal empathy. In this paper, we propose a multi-agent framework for MERG, which enhances empathy through structured reasoning and reflective refinement. Specifically, we first introduce a structured empathetic reasoning-to-generation module that explicitly decomposes response generation via multimodal perception, consistency-aware emotion forecasting, pragmatic strategy planning, and strategy-guided response generation, providing a clearer intermediate path from multimodal evidence to response realization. Besides, we develop a global reflection and refinement module, in which a global reflection agent performs step-wise auditing over intermediate states and the generated response, eliminating existing emotional biases and empathy errors, and triggering targeted regeneration. Overall, such a closed-loop framework enables our model to gradually improve the accuracy of emotion perception and eliminate emotion biases during the iteration process. Experiments on several benchmarks, e.g., IEMOCAP and MELD, demonstrate that our model has superior empathic response generation capabilities compared to state-of-the-art methods.
[217] Explore Like Humans: Autonomous Exploration with Online SG-Memo Construction for Embodied Agents
Xu Chen, Shichao Xie, Zhining Gu, Lu Jia, Minghua Luo, Fei Liu, Zedong Chu, Yanfen Shen, Xiaolong Wu, Mu Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Constructing structured spatial memory is essential for enabling long-horizon reasoning in complex embodied navigation tasks. Current memory construction predominantly relies on a decoupled, two-stage paradigm: agents first aggregate environmental data through exploration, followed by the offline reconstruction of spatial memory. However, this post-hoc and geometry-centric approach precludes agents from leveraging high-level semantic intelligence, often causing them to overlook navigationally critical landmarks (e.g., doorways and staircases) that serve as fundamental semantic anchors in human cognitive maps. To bridge this gap, we propose ABot-Explorer, a novel active exploration framework that unifies memory construction and exploration into an online, RGB-only process. At its core, ABot-Explorer leverages Large Vision-Language Models (VLMs) to distill Semantic Navigational Affordances (SNA), which act as cognitive-aligned anchors to guide the agent’s movement. By dynamically integrating these SNAs into a hierarchical SG-Memo, ABot-Explorer mirrors human-like exploratory logic by prioritizing structural transit nodes to facilitate efficient coverage. To support this framework, we contribute a large-scale dataset extending InteriorGS with SNA and SG-Memo annotations. Experimental results demonstrate that ABot-Explorer significantly outperforms current state-of-the-art methods in both exploration efficiency and environment coverage, while the resulting SG-Memo is shown to effectively support diverse downstream tasks.
[218] Generative Texture Filtering
Rongjia Zheng, Shangwei Huang, Lei Zhu, Wei-Shi Zheng, Qing Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a generative method for texture filtering, which exhibits surprisingly good performance and generalizability. Our core idea is to empower texture filtering by taking full advantage of the strong learned image prior of pre-trained generative models. To this end, we propose to fine-tune a pre-trained generative model via a two-stage strategy. Specifically, we first conduct supervised fine-tuning on a very small set of paired images, and then perform reinforcement fine-tuning on a large-scale unlabeled dataset under the guidance of a reward function that quantifies the quality of texture removal and structure preservation. Extensive experiments show that our method clearly outperforms previous methods, and is effective to deal with previously challenging cases. Our code is available at https://github.com/OnlyZZZZ/Generative_Texture_Filtering.
[219] Evaluation of Winning Solutions of 2025 Low Power Computer Vision Challenge
Zihao Ye, Yung Hsiang Lu, Xiao Hu, Shuai Zhang, Taotao Jing, Xin Li, Zhen Yao, Bo Lang, Zhihao Zheng, Seungmin Oh, Hankyul Kang, Seunghun Kang, Jongbin Ryu, Kexin Chen, Yuan Qi, George K Thiruvathukal, Mooi Choo Chuah
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The IEEE Low-Power Computer Vision Challenge (LPCVC) aims to promote the development of efficient vision models for edge devices, balancing accuracy with constraints such as latency, memory capacity, and energy use. The 2025 challenge featured three tracks: (1) Image classification under various lighting conditions and styles, (2) Open-Vocabulary Segmentation with Text Prompt, and (3) Monocular Depth Estimation. This paper presents the design of LPCVC 2025, including its competition structure and evaluation framework, which integrates the Qualcomm AI Hub for consistent and reproducible benchmarking. The paper also introduces the top-performing solutions from each track and outlines key trends and observations. The paper concludes with suggestions for future computer vision competitions.
[220] The Essence of Balance for Self-Improving Agents in Vision-and-Language Navigation
Zhen Liu, Yuhan Liu, Jinjun Wang, Jianyi Liu, Wei Song, Jingwen Fu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In vision-and-language navigation (VLN), self-improvement from policy-induced experience, using only standard VLN action supervision, critically depends on balancing behavioral diversity and learning stability, which governs whether the agent can extract a reliable learning signal for improvement. Increasing behavioral diversity is necessary to expose alternative action hypotheses but can destabilize policy-induced learning signals, whereas overly conservative stability constraints suppress exploration and induce early commitment, making reliable self-improvement difficult. To address this challenge, we propose Stability-Diversity Balance (SDB), a plug-and-play mechanism for balanced self-improvement in VLN. SDB expands each decision step into multiple latent behavioral hypotheses by applying controlled shifts in the instruction-conditioned hidden states, and then performs reliability-aware soft evaluation and aggregation to retain diverse yet instruction-consistent alternatives during learning. An explicit regularizer further constrains hypothesis interactions, preventing excessive drift or premature collapse of hypothesis diversity and stabilizing self-improvement without discarding training signals. Experiments on R2R, SOON, and REVERIE show consistent improvements; for example, on REVERIE val-unseen, SDB improves SPL from 33.73 to 35.93 and OSR from 51.07 to 54.25.
[221] Multi-modal Test-time Adaptation via Adaptive Probabilistic Gaussian Calibration
Jinglin Xu, Yi Li, Chuxiong Sun, Xiao Xu, Jiangmeng Li, Fanjiang Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-modal test-time adaptation (TTA) enhances the resilience of benchmark multi-modal models against distribution shifts by leveraging the unlabeled target data during inference. Despite the documented success, the advancement of multi-modal TTA methodologies has been impeded by a persistent limitation, i.e., the lack of explicit modeling of category-conditional distributions, which is crucial for yielding accurate predictions and reliable decision boundaries. Canonical Gaussian discriminant analysis (GDA) provides a vanilla modeling of category-conditional distributions and achieves moderate advancement in uni-modal contexts. However, in multi-modal TTA scenario, the inherent modality distribution asymmetry undermines the effectiveness of modeling the category-conditional distribution via the canonical GDA. To this end, we introduce a tailored probabilistic Gaussian model for multi-modal TTA to explicitly model the category-conditional distributions, and further propose an adaptive contrastive asymmetry rectification technique to counteract the adverse effects arising from modality asymmetry, thereby deriving calibrated predictions and reliable decision boundaries. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts. The code is available at https://github.com/XuJinglinn/AdaPGC.
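For readers unfamiliar with the canonical GDA baseline the abstract builds on, here is a minimal NumPy sketch (not the released AdaPGC code) of category-conditional Gaussian modeling: per-class means with a shared covariance, scored by Mahalanobis distance over fused multi-modal features. Function names and toy data are purely illustrative.
```python
import numpy as np

def gda_fit(features, labels, num_classes, eps=1e-3):
    """Fit per-class means and a shared covariance (canonical GDA)."""
    d = features.shape[1]
    means = np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = features - means[labels]
    cov = centered.T @ centered / len(features) + eps * np.eye(d)
    return means, np.linalg.inv(cov)

def gda_predict(features, means, cov_inv):
    """Assign each sample to the class with the smallest Mahalanobis distance."""
    diff = features[:, None, :] - means[None, :, :]            # (N, C, d)
    mahala = np.einsum('ncd,de,nce->nc', diff, cov_inv, diff)  # (N, C)
    return mahala.argmin(axis=1)

# toy usage with random "fused multi-modal" features
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16)) + np.repeat(np.arange(4), 50)[:, None]
labels = np.repeat(np.arange(4), 50)
means, cov_inv = gda_fit(feats, labels, num_classes=4)
print((gda_predict(feats, means, cov_inv) == labels).mean())
```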
[222] EgoMotion: Hierarchical Reasoning and Diffusion for Egocentric Vision-Language Motion Generation
Ruibing Hou, Mingyue Zhou, Yuwei Gui, Mingshuang Luo, Bingpeng Ma, Hong Chang, Shiguang Shan, Xilin Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has achieved significant advances, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation. This task requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical \textit{reasoning-generation entanglement} challenge: the simultaneous optimization of semantic reasoning and kinematic modeling introduces gradient conflicts. These conflicts systematically degrade the fidelity of multimodal grounding and motion quality. To address this challenge, we propose a hierarchical generative framework \textbf{EgoMotion}. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, effectively bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as expressive conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to existing approaches.
[223] PortraitDirector: A Hierarchical Disentanglement Framework for Controllable and Real-time Facial Reenactment
Chaonan Ji, Jinwei Qi, Sheng Xu, Peng Zhang, Bang Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing facial reenactment methods struggle with a trade-off between expressiveness and fine-grained controllability. Holistic facial reenactment models often sacrifice granular control for expressiveness, while methods designed for control may struggle with fidelity and robust disentanglement. Instead of treating facial motion as a monolithic signal, we explore an alternative compositional perspective. In this paper, we introduce PortraitDirector, a novel framework that formulates face reenactment as a hierarchical composition task, achieving high-fidelity and controllable results. We employ a Hierarchical Motion Disentanglement and Composition strategy, deconstructing facial motion into a Spatial Layer for physical movements and a Semantic Layer for emotional content. The Spatial Layer comprises: (i) global head pose, managed via a dedicated representation and injection pathway; (ii) spatially separated local facial expressions, distilled from cropped facial regions and purged of emotional cues via an Emotion-Filtering Module leveraging an information bottleneck. The Semantic Layer contains a derived global emotion. The disentangled components are then recomposed into an expressive motion latent. Furthermore, we engineer the framework for real-time performance through a suite of optimizations, including diffusion distillation, causal attention and VAE acceleration. PortraitDirector achieves streaming, high-fidelity, controllable 512 x 512 face reenactment at 20 FPS with an end-to-end latency of 800 ms on a single 5090 GPU.
[224] BALTIC: A Benchmark and Cross-Domain Strategy for 3D Reconstruction Across Air and Underwater Domains Under Varying Illumination
Michele Grimaldi, David Nakath, Oscar Pizarro, Jonatan Scharff Willners, Ignacio Carlucho, Yvan R. Petillot
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Robust 3D reconstruction across varying environmental conditions remains a critical challenge for robotic perception, particularly when transitioning between air and water. To address this, we introduce BALTIC, a controlled benchmark designed to systematically evaluate modern 3D reconstruction methods under variations in medium and lighting. The benchmark comprises 13 datasets spanning two media (air and water) and three lighting conditions (ambient, artificial, and mixed), with additional variations in motion type, scanning pattern, and initialization trajectory, resulting in a diverse set of sequences. Our experimental setup features a custom water tank equipped with a monocular camera and an HTC Vive tracker, enabling accurate ground-truth pose estimation. We further investigate cross-domain reconstruction by augmenting underwater image sequences with a small number of in-air views captured under similar lighting conditions. We evaluate Structure-from-Motion reconstruction using COLMAP in terms of both trajectory accuracy and scene geometry, and use these reconstructions as input to Neural Radiance Fields and 3D Gaussian Splatting methods. The resulting models are assessed against ground-truth trajectories and in-air references, while rendered outputs are compared using perceptual and photometric metrics. Additionally, we perform a color restoration analysis to evaluate radiometric consistency across domains. Our results show that under controlled, texture-consistent conditions, Gaussian Splatting with simple preprocessing (e.g., white balance correction) can achieve performance comparable to specialized underwater methods, although its robustness decreases in more complex and heterogeneous real-world environments
[225] Diff-SBSR: Learning Multimodal Feature-Enhanced Diffusion Models for Zero-Shot Sketch-Based 3D Shape Retrieval
Hang Cheng, Fanhe Dong, Long Zeng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents the first exploration of text-to-image diffusion models for zero-shot sketch-based 3D shape retrieval (ZS-SBSR). Existing sketch-based 3D shape retrieval methods struggle in zero-shot settings due to the absence of category supervision and the extreme sparsity of sketch inputs. Our key insight is that large-scale pretrained diffusion models inherently exhibit open-vocabulary capability and strong shape bias, making them well suited for zero-shot visual retrieval. We leverage a frozen Stable Diffusion backbone to extract and aggregate discriminative representations from intermediate U-Net layers for both sketches and rendered 3D views. Diffusion models struggle with sketches due to their extreme abstraction and sparsity, compounded by a significant domain gap from natural images. To address this limitation without costly retraining, we introduce a multimodal feature-enhanced strategy that conditions the frozen diffusion backbone with complementary visual and textual cues from CLIP, explicitly enhancing the ability of semantic context capture and concentrating on sketch contours. Specifically, we inject global and local visual features derived from a pretrained CLIP visual encoder, and incorporate enriched textual guidance by combining learnable soft prompts with hard textual descriptions generated by BLIP. Furthermore, we employ the Circle-T loss to dynamically strengthen positive-pair attraction once negative samples are sufficiently separated, thereby adapting to sketch noise and enabling more effective sketch-3D alignment. Extensive experiments on two public benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches in ZS-SBSR.
[226] Denoising, Fast and Slow: Difficulty-Aware Adaptive Sampling for Image Generation
Johannes Schusterbauer, Ming Gui, Yusong Li, Pingchuan Ma, Felix Krause, Björn Ommer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion- and flow-based models usually allocate compute uniformly across space, updating all patches with the same timestep and number of function evaluations. While convenient, this ignores the heterogeneity of natural images: some regions are easy to denoise, whereas others benefit from more refinement or additional context. Motivated by this, we explore patch-level noise scales for image synthesis. We find that naively varying timesteps across image tokens performs poorly, as it exposes the model to overly informative training states that do not occur at inference. We therefore introduce a timestep sampler that explicitly controls the maximum patch-level information available during training, and show that moving from global to patch-level timesteps already improves image generation over standard baselines. By further augmenting the model with a lightweight per-patch difficulty head, we enable adaptive samplers that allocate compute dynamically where it is most needed. Combined with noise levels varying over both space and diffusion time, this yields Patch Forcing (PF), a framework that advances easier regions earlier so they can provide context for harder ones. PF achieves superior results on class-conditional ImageNet, remains orthogonal to representation alignment and guidance methods, and scales to text-to-image synthesis. Our results suggest that patch-level denoising schedules provide a promising foundation for adaptive image generation.
[227] ST-Prune: Training-Free Spatio-Temporal Token Pruning for Vision-Language Models in Autonomous Driving
Lin Sha, Haiyun Guo, Tao Wang, Cong Zhang, Min Huang, Jinqiao Wang, Qinghai Miao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language Models (VLMs) have become central to autonomous driving systems, yet their deployment is severely bottlenecked by the massive computational overhead of multi-view camera and multi-frame video input. Existing token pruning methods, primarily designed for single-image inputs, treat each frame or view in isolation and thus fail to exploit the inherent spatio-temporal redundancies in driving scenarios. To bridge this gap, we propose ST-Prune, a training-free, plug-and-play framework comprising two complementary modules: Motion-aware Temporal Pruning (MTP) and Ring-view Spatial Pruning (RSP). MTP addresses temporal redundancy by encoding motion volatility and temporal recency as soft constraints within the diversity selection objective, prioritizing dynamic trajectories and current-frame content over static historical background. RSP further resolves spatial redundancy by exploiting the ring-view camera geometry to penalize bilateral cross-view similarity, eliminating duplicate projections and residual background that temporal pruning alone cannot suppress. These two modules together constitute a complete spatio-temporal pruning process, preserving key scene information under strict compression. Validated across four benchmarks spanning perception, prediction, and planning, ST-Prune establishes new state-of-the-art for training-free token pruning. Notably, even at 90% token reduction, ST-Prune achieves near-lossless performance with certain metrics surpassing the full-model baseline, while maintaining inference speeds comparable to existing pruning approaches.
[228] MSDS: Deep Structural Similarity with Multiscale Representation
Danling Kang, Xue-Hua Chen, Bin Liu, Keke Zhang, Weiling Chen, Tiesong Zhao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep-feature-based perceptual similarity models have demonstrated strong alignment with human visual perception in Image Quality Assessment (IQA). However, most existing approaches operate at a single spatial scale, implicitly assuming that structural similarity at a fixed resolution is sufficient. The role of spatial scale in deep-feature similarity modeling thus remains insufficiently understood. In this letter, we isolate spatial scale as an independent factor using a minimal multiscale extension of DeepSSIM, referred to as Deep Structural Similarity with Multiscale Representation (MSDS). The proposed framework decouples deep feature representation from cross-scale integration by computing DeepSSIM independently across pyramid levels and fusing the resulting scores with a lightweight set of learnable global weights. Experiments on multiple benchmark datasets demonstrate consistent and statistically significant improvements over the single-scale baseline, while introducing negligible additional complexity. The results empirically confirm spatial scale as a non-negligible factor in deep perceptual similarity, isolated here via a minimal testbed.
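A minimal sketch of the cross-scale integration idea: compute a single-scale similarity at each pyramid level and fuse the per-level scores with softmax-normalized learnable weights. The pyramid depth, average-pooling downsampler, and the toy correlation stand-in for the deep similarity are illustrative assumptions, not the paper's exact MSDS implementation.
```python
import numpy as np

def downsample(img):
    """Halve resolution with 2x2 average pooling (one pyramid level)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def multiscale_score(ref, dist, per_level_similarity, level_logits):
    """Score each pyramid level independently and fuse with learnable global weights."""
    scores = []
    for _ in range(len(level_logits)):
        scores.append(per_level_similarity(ref, dist))
        ref, dist = downsample(ref), downsample(dist)
    w = np.exp(level_logits) / np.exp(level_logits).sum()   # softmax over level weights
    return float(np.dot(w, scores))

# toy usage: plug in any single-scale deep similarity; here a stand-in correlation
toy_sim = lambda a, b: float(np.corrcoef(a.ravel(), b.ravel())[0, 1])
rng = np.random.default_rng(0)
ref = rng.random((64, 64)); dist = ref + 0.1 * rng.random((64, 64))
print(multiscale_score(ref, dist, toy_sim, level_logits=np.zeros(4)))
```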
[229] Improved Anomaly Detection in Medical Images via Mean Shift Density Enhancement
Pritam Kar, Gouri Lakshmi S, Saptarshi Bej
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Anomaly detection in medical imaging is essential for identifying rare pathological conditions, particularly when annotated abnormal samples are limited. We propose a hybrid anomaly detection framework that integrates self-supervised representation learning with manifold-based density estimation, a combination that remains largely unexplored in this domain. Medical images are first embedded into a latent feature space using pretrained, potentially domain-specific, backbones. These representations are then refined via Mean Shift Density Enhancement (MSDE), an iterative manifold-shifting procedure that moves samples toward regions of higher likelihood. Anomaly scores are subsequently computed using Gaussian density estimation in a PCA-reduced latent space, where Mahalanobis distance measures deviation from the learned normal distribution. The framework follows a one-class learning paradigm and requires only normal samples for training. Extensive experiments on seven medical imaging datasets demonstrate state-of-the-art performance. MSDE achieves the highest AUC on four datasets and the highest Average Precision on five datasets, including near-perfect performance on brain tumor detection (0.981 AUC/AP). These results underscore the potential of the proposed framework as a scalable clinical decision-support tool for early disease detection, screening in low-label settings, and robust deployment across diverse imaging modalities.
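A rough NumPy sketch of the pipeline the abstract describes: a mean-shift-style refinement that moves embeddings toward higher-density regions, PCA reduction, and a Mahalanobis anomaly score under a single Gaussian fit to normal data. Bandwidth, iteration count, and component numbers are illustrative assumptions, not the paper's settings.
```python
import numpy as np

def mean_shift_refine(feats, bandwidth=1.0, iters=3):
    """Iteratively shift each embedding toward higher-density regions (Gaussian kernel)."""
    x = feats.copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - feats[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        x = (w @ feats) / w.sum(axis=1, keepdims=True)
    return x

def fit_normal_model(train_feats, n_components=8):
    """PCA-reduce refined normal features and fit a single Gaussian."""
    mu = train_feats.mean(0)
    _, _, vt = np.linalg.svd(train_feats - mu, full_matrices=False)
    P = vt[:n_components]                                  # principal axes
    z = (train_feats - mu) @ P.T
    cov = np.cov(z, rowvar=False) + 1e-6 * np.eye(n_components)
    return mu, P, z.mean(0), np.linalg.inv(cov)

def anomaly_score(feats, mu, P, z_mean, cov_inv):
    """Mahalanobis distance of a test embedding from the learned normal distribution."""
    z = (feats - mu) @ P.T - z_mean
    return np.einsum('nd,de,ne->n', z, cov_inv, z)

rng = np.random.default_rng(0)
normal = mean_shift_refine(rng.normal(size=(128, 32)))
model = fit_normal_model(normal)
print(anomaly_score(normal[:5], *model))
```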
[230] How Far Are Video Models from True Multimodal Reasoning?
Xiaotian Zhang, Jianhui Wei, Yuan Wang, Jie Tan, Yichen Li, Yan Zhang, Ziyi Chen, Daoan Zhang, Dezhi YU, Wei Xu, Songtao Jiang, Zuozhu Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite remarkable progress toward general-purpose video models, a critical question remains unanswered: how far are these models from achieving true multimodal reasoning? Existing benchmarks fail to address this question rigorously, as they remain constrained by straightforward task designs and fragmented evaluation metrics that neglect complex multimodal reasoning. To bridge this gap, we introduce CLVG-Bench, an evaluation framework designed to probe video models’ zero-shot reasoning capabilities via Context Learning in Video Generation. CLVG-Bench comprises more than 1,000 high-quality, manually annotated metadata entries across 6 categories and 47 subcategories, covering complex scenarios including physical simulation, logical reasoning, and interactive contexts. To enable rigorous and scalable assessment, we further propose an Adaptive Video Evaluator (AVE) that aligns with human expert perception using minimal annotations, delivering interpretable textual feedback across diverse video context tasks. Extensive experiments reveal a striking answer to our central question: while state-of-the-art (SOTA) video models, such as Seedance 2.0, demonstrate competence on certain understanding and reasoning subtasks, they fall substantially short on logically grounded and interactive generation tasks (achieving success rates <25% and ~0%, respectively), exposing multimodal reasoning and physical grounding as critical bottlenecks. By systematically quantifying these limitations, the proposed method provides actionable feedback and a clear roadmap toward truly robust, general-purpose video models. CLVG-Bench and code are released here.
[231] Benchmarking Vision Foundation Models for Domain-Generalizable Face Anti-Spoofing
Mika Feng, Pierre Gallin-Martel, Koichi Ito, Takafumi Aoki
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Face Anti-Spoofing (FAS) remains challenging due to the requirement for robust domain generalization across unseen environments. While recent trends leverage Vision-Language Models (VLMs) for semantic supervision, these multimodal approaches often demand prohibitive computational resources and exhibit high inference latency. Furthermore, their efficacy is inherently limited by the quality of the underlying visual features. This paper revisits the potential of vision-only foundation models to establish a highly efficient and robust baseline for FAS. We conduct a systematic benchmarking of 15 pre-trained models, such as supervised CNNs, supervised ViTs, and self-supervised ViTs, under severe cross-domain scenarios including the MICO and Limited Source Domains (LSD) protocols. Our comprehensive analysis reveals that self-supervised vision models, particularly DINOv2 with Registers, significantly suppress attention artifacts and capture critical, fine-grained spoofing cues. Combined with Face Anti-Spoofing Data Augmentation (FAS-Aug), Patch-wise Data Augmentation (PDA) and Attention-weighted Patch Loss (APL), our proposed vision-only baseline achieves state-of-the-art performance in the MICO protocol. This baseline outperforms existing methods under the data-constrained LSD protocol while maintaining superior computational efficiency. This work provides a definitive vision-only baseline for FAS, demonstrating that optimized self-supervised vision transformers can serve as a backbone for both vision-only and future multimodal FAS systems. The project page is available at: https://gsisaoki.github.io/FAS-VFMbenchmark-CVPRW2026/ .
[232] When Can We Trust Deep Neural Networks? Towards Reliable Industrial Deployment with an Interpretability Guide
Hang-Cheng Dong, Yuhao Jiang, Yibo Jiao, Lu Zou, Kai Zheng, Bingguo Liu, Dong Ye, Guodong Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The deployment of AI systems in safety-critical domains, such as industrial defect inspection, autonomous driving, and medical diagnosis, is severely hampered by their lack of reliability. A single undetected erroneous prediction can lead to catastrophic outcomes. Unfortunately, there is often no alternative but to place trust in the outputs of a trained AI system, which operates without an internal safeguard to flag unreliable predictions, even in cases of high accuracy. We propose a post-hoc explanation-based indicator to detect false negatives in binary defect detection networks. To our knowledge, this is the first method to proactively identify potentially erroneous network outputs. Our core idea leverages the difference between class-specific discriminative heatmaps and class-agnostic ones. We compute the difference in their intersection over union (IoU) as a reliability score. An adversarial enhancement method is further introduced to amplify this disparity. Evaluations on two industrial defect detection benchmarks show our method effectively identifies false negatives. With adversarial enhancement, it achieves 100% recall, albeit with a trade-off for true negatives. Our work thus advocates for a new and trustworthy deployment paradigm: data-model-explanation-output, moving beyond conventional end-to-end systems to provide critical support for reliable AI in real-world applications.
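One plausible reading of the reliability score is sketched below, under the assumption that both heatmaps are thresholded at a fixed quantile before comparison; the paper may binarize differently and adds an adversarial enhancement step not shown here.
```python
import numpy as np

def binarize(heatmap, quantile=0.8):
    """Keep only the most salient region of a heatmap."""
    return heatmap >= np.quantile(heatmap, quantile)

def reliability_score(class_specific_map, class_agnostic_map, quantile=0.8):
    """IoU agreement between class-specific and class-agnostic saliency.
    Low agreement flags predictions whose explanation may be unreliable."""
    a = binarize(class_specific_map, quantile)
    b = binarize(class_agnostic_map, quantile)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

rng = np.random.default_rng(0)
cam_specific = rng.random((14, 14))
cam_agnostic = cam_specific + 0.05 * rng.random((14, 14))   # closely agreeing maps
print(reliability_score(cam_specific, cam_agnostic))        # high -> likely reliable
```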
[233] An Object-Centered Data Acquisition Method for 3D Gaussian Splatting using Mobile Phones
Yuezhe Zhang, Luqian Bai, Mengting Yu, Lei Wei, Shuai Wan, Yifan Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Data acquisition through mobile phones remains a challenge for 3D Gaussian Splatting (3DGS). In this work we target the object-centered scenario and enable reliable mobile acquisition by providing on-device capture guidance and recording onboard sensor signals for offline reconstruction. After the calibration step, the device orientations are aligned to a baseline frame to obtain relative poses, and the optical axis of the camera is mapped to an object-centered spherical grid for uniform viewpoint indexing. To curb polar sampling bias, we compute area-weighted spherical coverage in real-time and guide the user’s motion accordingly. We compare the proposed method with RealityScan and the free-capture strategy. Our method achieves superior reconstruction quality using fewer input images compared to free capture and RealityScan. Further analysis shows that the proposed method is able to obtain more comprehensive and uniform viewpoint coverage during object-centered acquisition.
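A small sketch of area-weighted spherical coverage on a latitude-longitude viewpoint grid. The grid resolution and the equatorial-sweep example are hypothetical, but the sine-of-latitude band weighting is the standard way to correct the polar sampling bias mentioned above.
```python
import numpy as np

def cell_area_weights(n_lat=18, n_lon=36):
    """Relative area of each cell on a latitude-longitude grid over the sphere.
    Cells near the poles are smaller, which is what causes polar sampling bias."""
    lat_edges = np.linspace(-np.pi / 2, np.pi / 2, n_lat + 1)
    band = np.sin(lat_edges[1:]) - np.sin(lat_edges[:-1])    # area per latitude band
    weights = np.repeat(band[:, None], n_lon, axis=1) / n_lon
    return weights / weights.sum()

def coverage(visited_mask, weights):
    """Area-weighted fraction of the viewing sphere covered so far."""
    return float((weights * visited_mask).sum())

weights = cell_area_weights()
visited = np.zeros_like(weights, dtype=bool)
visited[8:10, :] = True            # e.g. a sweep around the equator
print(f"area-weighted coverage: {coverage(visited, weights):.2%}")
```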
[234] Attention-based Multi-modal Deep Learning Model of Spatio-temporal Crop Yield Prediction with Satellite, Soil and Climate Data
Gopal Krishna Shyam, Ila Chandrakar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Crop yield prediction is one of the most important challenges for world food security and policy-making. Conventional forecasting techniques are limited in accuracy because they rely on static data sources that do not capture the dynamic and intricate relationships among environmental variables over time [5,13]. This paper presents the Attention-Based Multi-Modal Deep Learning Framework (ABMMDLF) for high-accuracy spatio-temporal crop yield prediction. Unlike traditional models that rely on only one of these data sources [12, 21], the model combines multi-year satellite imagery, high-resolution meteorological time series, and initial soil properties. The core architecture uses Convolutional Neural Networks (CNNs) to extract spatial features and a Temporal Attention Mechanism to adaptively weight the phenological periods most important for yield, conditioned on the extracted spatial features. Experiments show that the proposed framework achieves an R^2 score of 0.89, substantially outperforming the baseline models.
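A generic sketch of the temporal attention idea, assuming per-timestep CNN features are already extracted; the projection sizes, the single learned query, and the monthly-composite example are illustrative assumptions rather than the ABMMDLF architecture.
```python
import numpy as np

def temporal_attention_pool(seq_feats, query, key_proj, value_proj):
    """Weight each time step (e.g. satellite acquisition date) by learned relevance
    and pool the sequence into a single season-level representation."""
    keys = seq_feats @ key_proj              # (T, d_k)
    values = seq_feats @ value_proj          # (T, d_v)
    logits = keys @ query / np.sqrt(keys.shape[1])
    attn = np.exp(logits - logits.max())
    attn = attn / attn.sum()                 # attention over phenological stages
    return attn @ values, attn

rng = np.random.default_rng(0)
T, d = 12, 64                                # 12 monthly composites, 64-dim CNN features
seq = rng.normal(size=(T, d))
pooled, attn = temporal_attention_pool(
    seq,
    query=rng.normal(size=32),
    key_proj=rng.normal(size=(d, 32)),
    value_proj=rng.normal(size=(d, 32)),
)
print(pooled.shape, attn.round(2))
```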
[235] Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
Quan Zhang, Jingze Wu, Jialong Wang, Xiaohua Xie, Jianhuang Lai, Hongbo Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Learning identity-discriminative representations with multi-scene generality has become a critical objective in person re-identification (ReID). However, mainstream perception-driven paradigms tend to fit identities from massive annotated data rather than understanding identity-causal cues, yielding representations that are fragile under multiple disruptions. In this work, ReID-R is proposed as a novel reasoning-driven paradigm that achieves explicit identity understanding and reasoning by incorporating chain-of-thought into the ReID pipeline. Specifically, ReID-R consists of two stages: (i) a discriminative reasoning warm-up, where a model is trained without CoT labels to acquire identity-aware feature understanding; and (ii) efficient reinforcement learning, which uses a non-trivial sampling strategy to construct scene-generalizable data. On this basis, ReID-R leverages high-quality reward signals to guide the model toward ID-related cues, achieving accurate reasoning and correct responses. Extensive experiments on multiple ReID benchmarks demonstrate that ReID-R achieves identity discrimination competitive with leading methods while using only 14.3K non-trivial samples (20.9% of the existing data scale). Furthermore, benefiting from its inherent reasoning, ReID-R can provide high-quality interpretations of its results.
[236] Adaptive Slicing-Assisted Hyper Inference for Enhanced Small Object Detection in High-Resolution Imagery
Francesco Moretti, Yi Jin, Guiqin Mario
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep learning-based object detectors have achieved remarkable success across numerous computer vision applications, yet they continue to struggle with small object detection in high-resolution aerial and satellite imagery, where dense object distributions, variable shooting angles, diminutive target sizes, and substantial inter-class variability pose formidable challenges. Existing slicing strategies that partition high-resolution images into manageable patches have demonstrated promising results for enlarging the effective receptive field of small targets; however, their reliance on fixed slice dimensions introduces significant redundant computation, inflating inference cost and undermining detection speed. In this paper, we propose \textbf{Adaptive Slicing-Assisted Hyper Inference (ASAHI)}, a novel slicing framework that shifts the paradigm from prescribing a fixed slice size to adaptively determining the optimal number of slices according to image resolution, thereby substantially mitigating redundant computation while preserving beneficial overlap between adjacent patches. ASAHI integrates three synergistic components: (1) an adaptive resolution-aware slicing algorithm that dynamically generates 6 or 12 overlapping patches based on a learned threshold, (2) a slicing-assisted fine-tuning (SAF) strategy that constructs augmented training data comprising both full-resolution and sliced image patches, and (3) a Cluster-DIoU-NMS (CDN) post-processing module that combines the geometric merging efficiency of Cluster-NMS with the center-distance-aware suppression of DIoU-NMS to achieve robust duplicate elimination in crowded scenes. Extensive experiments on VisDrone2019 and xView demonstrate that ASAHI achieves state-of-the-art performance with 56.8% on VisDrone2019-DET-val and 22.7% on xView-test, while reducing inference time by 20-25% compared to the baseline SAHI method.
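A toy sketch of the resolution-aware slicing rule described above. The grid layouts, area threshold, and overlap ratio are illustrative placeholders, since the paper learns its threshold rather than hard-coding it.
```python
def adaptive_slices(img_w, img_h, area_threshold=4_000_000, overlap=0.2):
    """Pick 6 (2x3) or 12 (3x4) overlapping patches based on image resolution,
    instead of a fixed slice size. Threshold and grids are illustrative values."""
    rows, cols = (3, 4) if img_w * img_h > area_threshold else (2, 3)
    slice_w, slice_h = img_w / cols, img_h / rows
    pad_w, pad_h = overlap * slice_w / 2, overlap * slice_h / 2
    boxes = []
    for r in range(rows):
        for c in range(cols):
            x0 = max(0, c * slice_w - pad_w)
            y0 = max(0, r * slice_h - pad_h)
            x1 = min(img_w, (c + 1) * slice_w + pad_w)
            y1 = min(img_h, (r + 1) * slice_h + pad_h)
            boxes.append((int(x0), int(y0), int(x1), int(y1)))
    return boxes

print(len(adaptive_slices(1920, 1080)))   # 6 patches for a moderate resolution
print(len(adaptive_slices(4000, 3000)))   # 12 patches for a very large image
```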
[237] Learning to Credit the Right Steps: Objective-aware Process Optimization for Visual Generation
Rui Li, Ke Hao, Yuanzhi Liang, Haibin Huang, Chi Zhang, Yun Gu, Xuelong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning, particularly Group Relative Policy Optimization (GRPO), has emerged as an effective framework for post-training visual generative models with human preference signals. However, its effectiveness is fundamentally limited by coarse reward credit assignment. In modern visual generation, multiple reward models are often used to capture heterogeneous objectives, such as visual quality, motion consistency, and text alignment. Existing GRPO pipelines typically collapse these rewards into a single static scalar and propagate it uniformly across the entire diffusion trajectory. This design ignores the stage-specific roles of different denoising steps and produces mistimed or incompatible optimization signals. To address this issue, we propose Objective-aware Trajectory Credit Assignment (OTCA), a structured framework for fine-grained GRPO training. OTCA consists of two key components. Trajectory-Level Credit Decomposition estimates the relative importance of different denoising steps. Multi-Objective Credit Allocation adaptively weights and combines multiple reward signals throughout the denoising process. By jointly modeling temporal credit and objective-level credit, OTCA converts coarse reward supervision into a structured, timestep-aware training signal that better matches the iterative nature of diffusion-based generation. Extensive experiments show that OTCA consistently improves both image and video generation quality across evaluation metrics.
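A speculative NumPy sketch of what timestep-aware, multi-objective credit assignment could look like: group-normalized rewards per objective are mixed with per-step objective weights and scaled by a step-importance profile. All shapes, names, and weighting schemes here are assumptions, not the OTCA algorithm itself.
```python
import numpy as np

def timestep_aware_advantages(rewards, step_credit, objective_weights):
    """Combine multiple reward signals into per-timestep advantages.

    rewards: (G, K) one scalar per rollout in the group and per objective
             (e.g. visual quality, motion consistency, text alignment).
    step_credit: (T,) relative importance of each denoising step, sums to 1.
    objective_weights: (T, K) how much each objective matters at each step.
    """
    # group-relative normalization per objective (GRPO-style baseline)
    norm = (rewards - rewards.mean(0)) / (rewards.std(0) + 1e-8)   # (G, K)
    per_step = norm @ objective_weights.T                          # (G, T)
    return per_step * step_credit[None, :]                         # (G, T)

rng = np.random.default_rng(0)
T, G, K = 10, 4, 3
step_credit = np.linspace(2.0, 0.5, T); step_credit /= step_credit.sum()
obj_w = rng.random((T, K)); obj_w /= obj_w.sum(1, keepdims=True)
adv = timestep_aware_advantages(rng.random((G, K)), step_credit, obj_w)
print(adv.shape)    # (4, 10): one advantage per rollout and denoising step
```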
[238] AlloSR$^2$: Rectifying One-Step Super-Resolution to Stay Real via Allomorphic Generative Flows
Zihan Wang, Xudong Huang, Junbo Qiao, Wei Li, Jie Hu, Xinghao Chen, Shaohui Lin
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Real-world image super-resolution (Real-SR) has been revolutionized by leveraging the powerful generative priors of large-scale diffusion and flow-based models. However, fine-tuning these models on limited LR-HR pairs often precipitates “prior collapse”, in which the model sacrifices its inherent generative richness to overfit specific training degradations. This issue is further exacerbated in one-step generation, where the absence of multi-step refinement leads to significant trajectory drift and artifact generation. In this paper, we propose AlloSR$^2$, a novel framework that rectifies one-step SR trajectories via allomorphic generative flows to maintain high-fidelity generative realism. Specifically, we utilize Signal-to-Noise Ratio (SNR) Guided Trajectory Initialization to establish a physically grounded starting state by aligning the degradation level of LR latent features with the optimal anchoring timestep of the pre-trained flow. To ensure a stable, curvature-free path for one-step inference, we propose Flow-Anchored Trajectory Consistency (FATC), which enforces velocity-level supervision across intermediate states. Furthermore, we develop Allomorphic Trajectory Matching (ATM), a self-adversarial alignment strategy that minimizes the distributional discrepancy between the SR flow and the generative flow in a unified vector field. Extensive experiments on both synthetic and real-world benchmarks demonstrate that AlloSR$^2$ achieves state-of-the-art performance in one-step Real-SR, offering a superior balance between restoration fidelity and generative realism while maintaining extreme efficiency.
[239] Unposed-to-3D: Learning Simulation-Ready Vehicles from Real-World Images
Hongyuan Liu, Bochao Zou, Qiankun Liu, Haochen Yu, Qi Mei, Jianfei Jiang, Chen Liu, Cheng Bi, Zhao Wang, Xueyang Zhang, Yifei Zhan, Jiansheng Chen, Huimin Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Creating realistic and simulation-ready 3D assets is crucial for autonomous driving research and virtual environment construction. However, existing 3D vehicle generation methods are often trained on synthetic data with significant domain gaps from real-world distributions. The generated models often exhibit arbitrary poses and undefined scales, resulting in poor visual consistency when integrated into driving scenes. In this paper, we present Unposed-to-3D, a novel framework that learns to reconstruct 3D vehicles from real-world driving images using image-only supervision. Our approach consists of two stages. In the first stage, we train an image-to-3D reconstruction network using posed images with known camera parameters. In the second stage, we remove camera supervision and use a camera prediction head that directly estimates the camera parameters from unposed images. The predicted pose is then used for differentiable rendering to provide self-supervised photometric feedback, enabling the model to learn 3D geometry purely from unposed images. To ensure simulation readiness, we further introduce a scale-aware module to predict real-world size information, and a harmonization module that adapts the generated vehicles to the target driving scene with consistent lighting and appearance. Extensive experiments demonstrate that Unposed-to-3D effectively reconstructs realistic, pose-consistent, and harmonized 3D vehicle models from real-world images, providing a scalable path toward creating high-quality assets for driving scene simulation and digital twin environments.
[240] Feature Perturbation Pool-based Fusion Network for Unified Multi-Class Industrial Defect Detection
Yuanchan Xu, Wenjun Zang, Ying Wu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-class defect detection constitutes a critical yet challenging task in industrial quality inspection, where existing approaches typically suffer from two fundamental limitations: (i) the necessity of training separate models for each defect category, resulting in substantial computational and memory overhead, and (ii) degraded robustness caused by inter-class feature perturbation when heterogeneous defect categories are jointly modeled. In this paper, we present FPFNet, a Feature Perturbation Pool-based Fusion Network that synergistically integrates a stochastic feature perturbation pool with a multi-layer feature fusion strategy to address these challenges within a unified detection framework. The feature perturbation pool enriches the training distribution by randomly injecting diverse noise patterns – including Gaussian noise, F-Noise, and F-Drop – into the extracted feature representations, thereby strengthening the model’s robustness against domain shifts and unseen defect morphologies. Concurrently, the multi-layer feature fusion module aggregates hierarchical feature representations from both the encoder and decoder through residual connections and normalization, enabling the network to capture complex cross-scale relationships while preserving fine-grained spatial details essential for precise defect localization. Built upon the UniAD architecture~\cite{you2022unified}, our method achieves state-of-the-art performance on two widely adopted benchmarks: 97.17% image-level AUROC and 96.93% pixel-level AUROC on MVTec-AD, and 91.08% image-level AUROC and 99.08% pixel-level AUROC on VisA, surpassing existing methods by notable margins while introducing no additional learnable parameters or computational complexity.
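A minimal sketch of a stochastic feature perturbation pool. The Gaussian case follows the abstract directly, while the 'F-Noise' and 'F-Drop' interpretations (multiplicative feature noise and channel dropout) are hypothetical stand-ins, since the paper does not define them here.
```python
import numpy as np

def perturbation_pool(features, rng, sigma=0.1, drop_prob=0.1):
    """Randomly pick one perturbation from a pool and apply it to a feature map.

    features: (C, H, W) feature map from the encoder.
    'f_noise' and 'f_drop' below are assumed interpretations, not the paper's exact definitions.
    """
    choice = rng.choice(["gaussian", "f_noise", "f_drop", "identity"])
    if choice == "gaussian":
        return features + rng.normal(0.0, sigma, size=features.shape)
    if choice == "f_noise":
        return features * rng.normal(1.0, sigma, size=features.shape)
    if choice == "f_drop":
        keep = rng.random(features.shape[0]) > drop_prob
        return features * keep[:, None, None]      # drop whole channels
    return features                                # leave some samples unperturbed

rng = np.random.default_rng(0)
feat = rng.normal(size=(256, 14, 14))
print(perturbation_pool(feat, rng).shape)
```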
[241] DR-MMSearchAgent: Deepening Reasoning in Multimodal Search Agents
Shengqin Wang, Wentao Yan, Huichi Zhou, Yihang Chen, Kun Shao, Zhizhong Zhang, Yuan Xie
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agentic multimodal models have garnered significant attention for their ability to leverage external tools to tackle complex tasks. However, such agents often suffer from premature interaction collapse for two primary reasons: 1) the terminal reward, typically appended only to the last token, prevents the advantage from distinguishing trajectories with exploratory behavior; 2) excessively redundant context hinders the agent from absorbing useful feedback. To address these issues, we propose the Deepening Reasoning MMSearchAgent. The framework leverages structural proximity to derive advantage signals from whole rollout trajectories across an entire batch, so that trajectories of different lengths are encouraged even when they contain the same correct answer. Additionally, differentiated Gaussian rewards are employed to dynamically calibrate interaction tolerance, ensuring information reliability and reducing redundancy. To support multi-turn interaction training, we construct a multi-step deep-reasoning dataset of 3,602 high-quality QA pairs, each requiring at least 3 reasoning steps. Extensive experiments demonstrate that our method achieves state-of-the-art performance, outperforming MMSearch-R1 by 8.4% on FVQA-test.
[242] Framelet-Based Blind Image Restoration with Minimax Concave Regularization
Heng Zhang, Reza Parvaz, Rui Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recovering corrupted images is one of the most challenging problems in image processing. Among various restoration tasks, blind image deblurring has been extensively studied due to its practical importance and inherent difficulty. In this problem, both the point spread function (PSF) and the underlying latent sharp image must be estimated simultaneously. This problem cannot be solved directly due to its ill-posed nature. One powerful tool for solving such problems is total variation (TV) regularization. The $\ell_0$-norm regularization within the TV framework has been widely adopted to promote sparsity in image gradients or transform domains, leading to improved preservation of edges and fine structures. However, the use of the $\ell_0$-norm results in a highly nonconvex and computationally intractable optimization problem, which limits its practical applicability. To overcome these difficulties, we employ the minimax concave penalty (MCP), which promotes enhanced sparsity and provides a closer approximation to the $\ell_0$-norm. In addition, a reweighted $\ell_1$-norm regularization is incorporated to further reduce estimation bias and improve the preservation of fine image details and textures. After introducing the proposed model, a numerical algorithm is developed to solve the resulting optimization problem. The effectiveness of the proposed approach is then demonstrated through experimental evaluations on several test images.
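For reference, the standard formulation of the minimax concave penalty from the sparse-regression literature (the paper's exact parameterization may differ) applied elementwise to a coefficient $t$ is $\rho_{\mathrm{MCP}}(t;\lambda,\gamma) = \lambda|t| - \frac{t^2}{2\gamma}$ for $|t| \le \gamma\lambda$, and $\rho_{\mathrm{MCP}}(t;\lambda,\gamma) = \frac{\gamma\lambda^2}{2}$ for $|t| > \gamma\lambda$, with $\gamma > 1$. The penalty grows like the $\ell_1$-norm near zero but saturates at a constant for large coefficients, which is why it approximates the $\ell_0$-norm more closely while remaining far more tractable to optimize.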
[243] Multi-view Crowd Tracking Transformer with View-Ground Interactions Under Large Real-World Scenes
Qi Zhang, Jixuan Chen, Kaiyi Zhang, Xinquan Yu, Antoni B. Chan, Hui Huang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-view crowd tracking estimates each person’s trajectory on the ground plane of the scene. Recent research mainly relies on CNN-based multi-view crowd tracking architectures, and most methods are evaluated and compared on relatively small datasets, such as Wildtrack and MultiviewX. Since these two datasets are collected in small scenes and contain only tens of frames in the evaluation stage, it is difficult to apply current methods to real-world applications where scene size and occlusion are more complicated. In this paper, we propose a Transformer-based multi-view crowd tracking model, \textit{MVTrackTrans}, which adopts interactions between camera views and the ground plane for enhanced multi-view tracking performance. In addition, for better evaluation, we collect and label two large real-world multi-view tracking datasets, MVCrowdTrack and CityTrack, which cover much larger scenes over longer time periods. Compared with existing methods on the two new large datasets, the proposed MVTrackTrans model achieves better performance, demonstrating the advantages of the model design in dealing with large scenes. We believe the proposed datasets and model will push the frontiers of the task to more practical scenarios, and the datasets and code are available at: https://github.com/zqyq/MVTrackTrans.
[244] PLaMo 2.1-VL Technical Report
Tommi Kerola, Yuya Masuda, Takashi Masuko, Toshiki Nakanishi, Daisuke Nishino, Kuniyuki Takahashi, Hanqin Wang, Yoshihiro Yamada
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce PLaMo 2.1-VL, a lightweight Vision Language Model (VLM) for autonomous devices, available in 8B and 2B variants and designed for local and edge deployment with Japanese-language operation. Focusing on Visual Question Answering (VQA) and Visual Grounding as its core capabilities, we develop and evaluate the models for two real-world application scenarios: factory task analysis via tool recognition, and infrastructure anomaly detection. We also develop a large-scale synthetic data generation pipeline and comprehensive Japanese training and evaluation resources. PLaMo 2.1-VL outperforms comparable open models on Japanese and English benchmarks, achieving 61.5 ROUGE-L on JA-VG-VQA-500 and 85.2% accuracy on Japanese Ref-L4. For the two application scenarios, it achieves 53.9% zero-shot accuracy on factory task analysis, and fine-tuning on power plant data improves anomaly detection bbox + label F1-score from 39.7 to 64.9.
[245] Divide-and-Conquer Approach to Holistic Cognition in High-Similarity Contexts with Limited Data
Shijie Wang, Zijian Wang, Yadan Luo, Haojie Li, Zi Huang, Mahsa Baktashmotlagh
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Ultra-fine-grained visual categorization (Ultra-FGVC) aims to classify highly similar subcategories within fine-grained objects using limited training samples. However, holistic yet discriminative cues, such as leaf contours in extremely similar cultivars, remain under-explored in current studies, thereby limiting recognition performance. Though crucial, modeling holistic cues with complex morphological structures typically requires massive training samples, posing significant challenges in data-limited scenarios. To address this challenge, we propose a novel Divide-and-Conquer Holistic Cognition Network (DHCNet) that implements a divide-and-conquer strategy by decomposing holistic cues into spatially-associated subtle discrepancies and progressively establishing the holistic cognition process, significantly simplifying holistic cognition while reducing dependency on training data. Technically, DHCNet begins by progressively analyzing subtle discrepancies, transitioning from smaller local patches to larger ones using a self-shuffling operation on local regions. Simultaneously, it leverages the unaffected local regions to potentially guide the perception of the original topological structure among the shuffled patches, thereby aiding in the establishment of spatial associations for these discrepancies. Additionally, DHCNet incorporates the online refinement of these holistic cues discovered from local regions into the training process to iteratively improve their quality. As a result, DHCNet uses these holistic cues as supervisory signals to fine-tune the parameters of the recognition model, thus improving its sensitivity to holistic cues across the entire objects. Extensive evaluations demonstrate that DHCNet achieves remarkable performance on five widely-used Ultra-FGVC datasets.
[246] Geometry-Guided Self-Supervision for Ultra-Fine-Grained Recognition with Limited Data
Shijie Wang, Yadan Luo, Zijian Wang, Haojie Li, Zi Huang, Mahsa Baktashmotlagh
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper investigates the intrinsic geometrical features of highly similar objects and introduces a general self-supervised framework called the Geometric Attribute Exploration Network (GAEor), which is designed to address the ultra-fine-grained visual categorization (Ultra-FGVC) task in data-limited scenarios. Unlike prior work that often captures subtle yet critical distinctions, GAEor generates geometric attributes as novel alternative recognition cues. These attributes are determined by various details within the object, aligned with its geometric patterns, such as the intricate vein structures in soybean leaves. Crucially, each category exhibits distinct geometric descriptors that serve as powerful cues, even among objects with minimal visual variation – a factor largely overlooked in recent research. GAEor discovers these geometric attributes by first amplifying geometry-relevant details via visual feedback from a backbone network, then embedding the relative polar coordinates of these details into the final representation. Extensive experiments demonstrate that GAEor significantly sets new state-of-the-art records in five widely-used Ultra-FGVC benchmarks.
[247] RAFT-MSF++: Temporal Geometry-Motion Feature Fusion for Self-Supervised Monocular Scene Flow
Xunpei Sun, Zuoxun Hou, Yi Chang, Gang Chen, Wei-Shi Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Monocular scene flow estimation aims to recover dense 3D motion from image sequences, yet most existing methods are limited to two-frame inputs, restricting temporal modeling and robustness to occlusions. We propose RAFT-MSF++, a self-supervised multi-frame framework that recurrently fuses temporal features to jointly estimate depth and scene flow. Central to our approach is the Geometry-Motion Feature (GMF), which compactly encodes coupled motion and geometry cues and is iteratively updated for effective temporal reasoning. To ensure the robustness of this temporal fusion against occlusions, we incorporate relative positional attention to inject spatial priors and an occlusion regularization module to propagate reliable motion from visible regions. These components enable the GMF to effectively propagate information even in ambiguous areas. Extensive experiments show that RAFT-MSF++ achieves 24.14% SF-all on the KITTI Scene Flow benchmark, with a 30.99% improvement over the baseline and better robustness in occluded regions. The code is available at https://github.com/sunzunyi/RAFT-MSF-PlusPlus.
[248] Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms
Samyak Sanghvi, Piyush Miglani, Sarvesh Shashikumar, Kaustubh R Borgavi, Veenu Singla, Chetan Arora
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision Transformers $(\texttt{ViT})$ have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) Region of interest $(\texttt{RoI})$ based token reduction using an object detection model to guide attention; (2) contrastive learning between selected $\texttt{RoI}$ to enhance fine-grained discrimination through hard-negative based training; and (3) a $\texttt{DINOv2}$ pretrained $\texttt{ViT}$ that captures localization-aware, fine-grained features instead of global $\texttt{CLIP}$ representations. Experiments on public mammography datasets demonstrate that our method achieves superior performance over existing baselines, establishing its effectiveness and potential clinical utility for large-scale breast cancer screening. Our code is available for reproducibility here: https://aih-iitd.github.io/publications/attend-what-matters
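A small sketch of contrastive learning between RoI embeddings with hard-negative emphasis, in the spirit of component (2) above; the InfoNCE form, top-k negative mining, and temperature value are common choices assumed here, not the paper's exact loss.
```python
import numpy as np

def info_nce_with_hard_negatives(anchor, positive, negatives, temperature=0.07, top_k=5):
    """Contrastive loss over RoI embeddings: pull the positive RoI close and push
    away only the top-k hardest (most similar) negatives."""
    def norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos_sim = float(a @ p) / temperature
    neg_sims = (n @ a) / temperature                   # similarity to each negative
    hard = np.sort(neg_sims)[-top_k:]                  # keep only the hardest negatives
    logits = np.concatenate([[pos_sim], hard])
    # loss = -log softmax of the positive, computed with a stable log-sum-exp
    return -pos_sim + np.log(np.exp(logits - logits.max()).sum()) + logits.max()

rng = np.random.default_rng(0)
anchor = rng.normal(size=128)
positive = anchor + 0.1 * rng.normal(size=128)         # RoI of the same class
negatives = rng.normal(size=(64, 128))                 # RoIs of other classes
print(info_nce_with_hard_negatives(anchor, positive, negatives))
```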
[249] Detection of T-shirt Presentation Attacks in Face Recognition Systems
Mathias Ibsen, Loris Tim Ide, Christian Rathgeb, Christoph Busch
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Face recognition systems are often used for biometric authentication. Nevertheless, it is known that without any protective measures, face recognition systems are vulnerable to presentation attacks. To tackle this security problem, methods for detecting presentation attacks have been developed and have shown good detection performance on several benchmark datasets. However, generalising presentation attack detection methods to new and novel types of attacks is an ongoing challenge. In this work, we employ 1,608 T-shirt attacks of the T-shirt Face Presentation Attack (TFPA) database, created using 100 unique presentation attack instruments, together with 152 bona fide presentations. In a comprehensive evaluation, we show that this type of attack can compromise the security of face recognition systems. Furthermore, we propose a detection method based on spatial consistency checks in order to detect such T-shirt attacks. Specifically, state-of-the-art face and person detectors are combined to analyse the spatial positions of detected faces and persons, based on which T-shirt attacks can be reliably detected.
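A toy sketch of a spatial consistency check in the spirit of the method above: a face detection is treated as suspicious when it does not fall in the upper (head) portion of the associated person detection, as a face printed on a T-shirt typically appears near the torso instead. The head-fraction and overlap thresholds are illustrative guesses, not the paper's calibrated values.
```python
def face_position_plausible(face_box, person_box, head_fraction=0.35, min_overlap=0.8):
    """Return True if the face box lies mostly within the upper (head) region of the
    person box. Boxes are (x0, y0, x1, y1) with y increasing downwards."""
    px0, py0, px1, py1 = person_box
    head_region = (px0, py0, px1, py0 + head_fraction * (py1 - py0))
    fx0, fy0, fx1, fy1 = face_box
    ix0, iy0 = max(fx0, head_region[0]), max(fy0, head_region[1])
    ix1, iy1 = min(fx1, head_region[2]), min(fy1, head_region[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    face_area = max(1e-6, (fx1 - fx0) * (fy1 - fy0))
    return inter / face_area >= min_overlap

person = (100, 50, 300, 650)
real_face = (170, 80, 230, 160)       # inside the head region -> plausible
tshirt_face = (170, 350, 230, 430)    # on the torso -> flagged as attack candidate
print(face_position_plausible(real_face, person), face_position_plausible(tshirt_face, person))
```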
[250] Mind2Drive: Predicting Driver Intentions from EEG in Real-world On-Road Driving
Ghadah Alosaimi, Hanadi Alhamdan, Wenke E, Stamos Katsigiannis, Amir Atapour-Abarghouei, Toby P. Breckon
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Predicting driver intention from neurophysiological signals offers a promising pathway for enhancing proactive safety in advanced driver assistance systems, yet remains challenging in real-world driving due to EEG signal non-stationarity and the complexity of cognitive-motor preparation. This study proposes and evaluates an EEG-based driver intention prediction framework using a synchronised multi-sensor platform integrated into a real electric vehicle. A real-world on-road dataset was collected across 32 driving sessions, and 12 deep learning architectures were evaluated under consistent experimental conditions. Among the evaluated architectures, TSCeption achieved the highest average accuracy (0.907) and Macro-F1 score (0.901). The proposed framework demonstrates strong temporal stability, maintaining robust decoding performance up to 1000 ms before manoeuvre execution with minimal degradation. Furthermore, additional analyses reveal that minimal EEG preprocessing outperforms artefact-handling pipelines, and prediction performance peaks within a 400-600 ms interval, corresponding to a critical neural preparatory phase preceding driving manoeuvres. Overall, these findings support the feasibility of early and stable EEG-based driver intention decoding under real-world on-road conditions. Code: https://github.com/galosaimi/Mind2Drive.
[251] IonMorphNet: Generalizable Learning of Ion Image Morphologies for Peak Picking in Mass Spectrometry Imaging
Philipp Weigand, Niels Nawrot, Nikolas Ebert, Carsten Hopf, Oliver Wasenmüller
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Peak picking is a fundamental preprocessing step in Mass Spectrometry Imaging (MSI), where each sample is represented by hundreds to thousands of ion images. Existing approaches require careful dataset-specific hyperparameter tuning, and often fail to generalize across acquisition protocols. We introduce IonMorphNet, a spatial-structure-aware representation model for ion images that enables fully data-driven peak picking without any task-specific supervision. We curate 53 publicly available MSI datasets and define six structural classes capturing representative spatial patterns in ion images to train standard image backbones for structural pattern classification. Once trained, IonMorphNet can assess ion images and perform peak picking without additional hyperparameter tuning. Using a ConvNeXt V2-Tiny backbone, our approach improves peak picking performance by +7 % mSCF1 compared to state-of-the-art methods across multiple datasets. Beyond peak picking, we demonstrate that spatially informed channel reduction enables a 3D CNN for patch-based tumor classification in MSI. This approach matches or exceeds pixel-wise spectral classifiers by up to +7.3 % Balanced Accuracy on three tumor classification tasks, indicating meaningful ion image selection. The source code and model weights are available at https://github.com/CeMOS-IS/IonMorphNet.
[252] PanDA: Unsupervised Domain Adaptation for Multimodal 3D Panoptic Segmentation in Autonomous Driving
Yining Pan, Shijie Li, Yuchen Wu, Xulei Yang, Na Zhao
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents the first study on Unsupervised Domain Adaptation (UDA) for multimodal 3D panoptic segmentation (mm-3DPS), aiming to improve generalization under domain shifts commonly encountered in real-world autonomous driving. A straightforward solution is to employ a pseudo-labeling strategy, which is widely used in UDA to generate supervision for unlabeled target data, combined with an mm-3DPS backbone. However, existing supervised mm-3DPS methods rely heavily on strong cross-modal complementarity between LiDAR and RGB inputs, making them fragile under domain shifts where one modality degrades (e.g., poor lighting or adverse weather). Moreover, conventional pseudo-labeling typically retains only high-confidence regions, leading to fragmented masks and incomplete object supervision, which are issues particularly detrimental to panoptic segmentation. To address these challenges, we propose PanDA, the first UDA framework specifically designed for multimodal 3D panoptic segmentation. To improve robustness against single-sensor degradation, we introduce an asymmetric multimodal augmentation that selectively drops regions to simulate domain shifts and improve robust representation learning. To enhance pseudo-label completeness and reliability, we further develop a dual-expert pseudo-label refinement module that extracts domain-invariant priors from both 2D and 3D modalities. Extensive experiments across diverse domain shifts, spanning time, weather, location, and sensor variations, show that PanDA significantly surpasses state-of-the-art UDA baselines for 3D semantic segmentation.
[253] Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
Zhiheng Fu, Yupeng Hu, Qianyun Yang, Shiqi Zhang, Zhiwei Chen, Zixu Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Composed Image Retrieval (CIR) has attracted significant attention due to its flexible multimodal query method, yet its development is severely constrained by the Noisy Triplet Correspondence (NTC) problem. Most existing robust learning methods rely on the “small loss hypothesis”, but the unique semantic ambiguity in NTC, such as “partial matching”, invalidates this assumption, leading to unreliable noise identification. This entraps the model in a self dependent vicious cycle where the learner is intertwined with the arbiter, ultimately causing catastrophic “representation pollution”. To address this critical challenge, we propose a novel “Expert-Proxy-Diversion” decoupling paradigm, named Air-Know (ArbIteR calibrated Knowledge iNternalizing rObust netWork). Air-Know incorporates three core modules: (1) External Prior Arbitration (EPA), which utilizes Multimodal Large Language Models (MLLMs) as an offline expert to construct a high precision anchor dataset; (2) Expert Knowledge Internalization (EKI), which efficiently guides a lightweight proxy “arbiter” to internalize the expert’s discriminative logic; (3) Dual Stream Reconciliation (DSR), which leverages the EKI’s matching confidence to divert the training data, achieving a clean alignment stream and a representation feedback reconciliation stream. Extensive experiments on multiple CIR benchmark datasets demonstrate that Air-Know significantly outperforms existing SOTA methods under the NTC setting, while also showing strong competitiveness in traditional CIR.
[254] HarmoniDiff-RS: Training-Free Diffusion Harmonization for Satellite Image Composition
Xiaoqi Zhuang, Jefersson A. Dos Santos, Jungong Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Satellite image composition plays a critical role in remote sensing applications such as data augmentation, disaster simulation, and urban planning. We propose HarmoniDiff-RS, a training-free diffusion-based framework for harmonizing composite satellite images under diverse domain conditions. Our method aligns the source and target domains through a Latent Mean Shift operation that transfers radiometric characteristics between them. To balance harmonization and content preservation, we introduce a Timestep-wise Latent Fusion strategy that leverages early inverted latents for strong harmonization and late latents for semantic consistency to generate a set of composite candidates. A lightweight harmony classifier is then trained to automatically select the most coherent result among them. We also construct RSIC-H, a benchmark dataset for satellite image harmonization derived from fMoW, providing 500 paired composition samples. Experiments demonstrate that our method effectively performs satellite image composition, showing strong potential for scalable remote-sensing synthesis and simulation tasks. Code is available at: https://github.com/XiaoqiZhuang/HarmoniDiff-RS.
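The Latent Mean Shift is described only at a high level; under the assumption that it amounts to channel-wise first-moment (and optionally second-moment) alignment of diffusion latents, a minimal sketch could read:

```python
import torch

def latent_mean_shift(source_latent, target_latent, match_std=False):
    """Shift source latents so their channel statistics match the target's.

    Both tensors have shape (C, H, W); the per-channel mean (and optionally
    std) of the composite/source latent is moved onto the background/target
    statistics. This is an assumed reading of the paper's Latent Mean Shift.
    """
    src_mean = source_latent.mean(dim=(1, 2), keepdim=True)
    tgt_mean = target_latent.mean(dim=(1, 2), keepdim=True)
    shifted = source_latent - src_mean
    if match_std:
        src_std = source_latent.std(dim=(1, 2), keepdim=True) + 1e-6
        tgt_std = target_latent.std(dim=(1, 2), keepdim=True)
        shifted = shifted / src_std * tgt_std
    return shifted + tgt_mean
```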
[255] VecHeart: Holistic Four-Chamber Cardiac Anatomy Modeling via Hybrid VecSets
Yihong Chen, Pascal Fua
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate cardiac anatomy modeling requires handling intricate interrelations among structures. In this paper, we propose VecHeart, a unified framework for holistic reconstruction and generation of four-chamber cardiac structures. To overcome the limitations of current feed-forward implicit methods, specifically their restriction to single-object modeling and their neglect of inter-part correlations, we introduce the Hybrid Part Transformer, which leverages part-specific learnable queries and interleaved attention to capture complex inter-chamber dependencies. Furthermore, we propose Anatomical Completion Masking and Modality Alignment strategies, enabling the model to infer complete four-chamber structures from partial, sparse, or noisy observations, even when certain anatomical parts are entirely missing. VecHeart also seamlessly extends to 3D+t dynamic mesh sequence generation, demonstrating exceptional versatility. Experiments show that our method achieves state-of-the-art performance, maintaining high-fidelity reconstruction across diverse challenging scenarios. Code will be released.
[256] HP-Edit: A Human-Preference Post-Training Framework for Image Editing
Fan Li, Chonghuinan Wang, Lina Lei, Yuping Qiu, Jiaqi Xu, Jiaxiu Jiang, Xinran Qin, Zhikai Chen, Fenglong Song, Zhixin Wang, Renjing Pei, Wangmeng Zuo
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Common image editing tasks typically adopt powerful generative diffusion models as the leading paradigm for real-world content editing. Meanwhile, although reinforcement learning (RL) methods such as Diffusion-DPO and Flow-GRPO have further improved generation quality, efficiently applying Reinforcement Learning from Human Feedback (RLHF) to diffusion-based editing remains largely unexplored, due to a lack of scalable human-preference datasets and frameworks tailored to diverse editing needs. To fill this gap, we propose HP-Edit, a post-training framework for Human Preference-aligned Editing, and introduce RealPref-50K, a real-world dataset spanning eight common editing tasks with balanced coverage of common object edits. Specifically, HP-Edit leverages a small amount of human-preference scoring data and a pretrained visual large language model (VLM) to develop HP-Scorer, an automatic, human-preference-aligned evaluator. We then use HP-Scorer both to efficiently build a scalable preference dataset and to serve as the reward function for post-training the editing model. We also introduce RealPref-Bench, a benchmark for evaluating real-world editing performance. Extensive experiments demonstrate that our approach significantly enhances models such as Qwen-Image-Edit-2509, aligning their outputs more closely with human preference.
[257] GOLD-BEV: GrOund and aeriaL Data for Dense Semantic BEV Mapping of Dynamic Scenes
Joshua Niemeijer, Alaa Eddine Ben Zekri, Reza Bahmanyar, Philipp M. Schmälzle, Houda Chaabouni-Chouayakh, Franz Kurz
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding road scenes in a geometrically consistent, scene-centric representation is crucial for planning and mapping. We present GOLD-BEV, a framework that learns dense bird’s-eye-view (BEV) semantic environment maps, including dynamic agents, from ego-centric sensors, using time-synchronized aerial imagery as supervision only during training. BEV-aligned aerial crops provide an intuitive target space, enabling dense semantic annotation with minimal manual effort and avoiding the ambiguity of ego-only BEV labeling. Crucially, strict aerial-ground synchronization allows overhead observations to supervise moving traffic participants and mitigates the temporal inconsistencies inherent to non-synchronized overhead sources. To obtain scalable dense targets, we generate BEV pseudo-labels using domain-adapted aerial teachers, and jointly train BEV segmentation with optional pseudo-aerial BEV reconstruction for interpretability. Finally, we extend beyond aerial coverage by learning to synthesize pseudo-aerial BEV images from ego sensors, which support lightweight human annotation and uncertainty-aware pseudo-labeling on unlabeled drives.
[258] VCE: A zero-cost hallucination mitigation method of LVLMs via visual contrastive editing
Yanbin Huang, Yisen Li, Guiyao Tie, Xiaoye Qu, Pan Zhou, Hongfei Wang, Zhaofan Zou, Hao Sun, Xuelong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large vision-language models (LVLMs) frequently suffer from Object Hallucination (OH), wherein they generate descriptions containing objects that are not actually present in the input image. This phenomenon is particularly problematic in real-world applications such as medical imaging and autonomous driving, where accuracy is critical. Recent studies suggest that the hallucination problem may stem from language priors: biases learned during pretraining that cause LVLMs to generate words based on their statistical co-occurrence. To mitigate this problem, we propose Visual Contrastive Editing (VCE), a novel post-hoc method that identifies and suppresses hallucinatory tendencies by analyzing the model’s response to contrastive visual perturbations. Using Singular Value Decomposition (SVD), we decompose the model’s activation patterns to isolate hallucination subspaces and apply targeted parameter edits to attenuate their influence. Unlike existing approaches that require fine-tuning or labeled data, VCE operates as a label-free intervention, making it both scalable and practical for deployment in resource-constrained settings. Experimental results demonstrate that VCE effectively reduces object hallucination across multiple benchmarks while maintaining the model’s original computational efficiency.
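The abstract only states that SVD over activation responses isolates a hallucination subspace whose influence is then attenuated; a generic, non-authoritative version projects activations away from the top singular directions of the clean-vs-perturbed activation difference (function names and the weight-editing convention are assumptions):

```python
import torch

def hallucination_projector(act_clean, act_perturbed, rank=4):
    """Build a projector that removes an assumed 'hallucination subspace'.

    act_clean, act_perturbed: (N, D) activations collected with the original
    and contrastively perturbed images. The top-`rank` right singular vectors
    of their difference span the subspace to attenuate.
    """
    diff = act_perturbed - act_clean                       # (N, D)
    _, _, vh = torch.linalg.svd(diff, full_matrices=False)
    basis = vh[:rank]                                      # (rank, D)
    return torch.eye(diff.shape[1], device=diff.device) - basis.t() @ basis  # (D, D)

def edit_weight(weight, projector):
    """Apply the projector on the output side of a layer whose outputs live
    in the same D-dimensional activation space (assumed convention)."""
    return projector @ weight
```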
[259] TESO: Online Tracking of Essential Matrix by Stochastic Optimization
Jaroslav Moravec, Radim Šára, Akihiro Sugimoto
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Maintaining long-term accuracy of stereo camera calibration parameters is important for autonomous systems’ perception. This work proposes Online Tracking of Essential Matrix by Stochastic Optimization (TESO). The core mechanisms of TESO are: 1) a robust loss function based on kernel correlation over tentative correspondences, 2) an adaptive online stochastic optimization on the essential manifold. TESO has low CPU and memory requirements, relies on a few hyperparameters, and eliminates the need for data-driven training, enabling its use in resource-constrained online perception systems. We evaluated the influence of TESO on geometric precision, rectification quality, and stereo depth consistency. On the large-scale MAN TruckScenes dataset, TESO tracks rotational calibration drift with 0.12 deg precision in the Y-axis (critical for stereo accuracy) while the X- and Z-axes are five times more precise. Tracking applied to sequences with simulated drift shows similar precision with respect to the reference as tracking applied to no-drift sequences, indicating the tracker is unbiased. On the KITTI dataset, TESO revealed systematic inconsistencies in extrinsic parameters across stereo pairs, confirming previously published findings. We verified that intrinsic decalibration affected these errors, as evidenced by the conflicting behavior of the rectification and depth metrics. After correcting the reference calibration, TESO improved its rotation precision around the Y-axis 20-fold to 0.025 deg and its depth accuracy 50-fold. Despite its lightweight design, direct optimization of the proposed TESO loss function alone achieves accuracy comparable to that of neural network-based single-frame methods.
[260] DINO Eats CLIP: Adapting Beyond Knowns for Open-set 3D Object Retrieval
Xinwei He, Yansong Zheng, Qianru Han, Zhichuan Wang, Yuxuan Cai, Yang Zhou, Jingbo Xia, Yulong Wang, Jinhai Xiang, Xiang Bai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision foundation models have shown great promise for open-set 3D object retrieval (3DOR) through efficient adaptation to multi-view images. Leveraging a semantically aligned latent space, previous work typically adapts the CLIP encoder to build view-based 3D descriptors. Despite CLIP’s strong generalization ability, its limited fine-grained discrimination prompted us to explore the potential of a more recent self-supervised encoder, DINO. To this end, we propose DINO Eats CLIP (DEC), a novel framework for dynamic multi-view integration that is regularized by synthesizing data for unseen classes. We first find that simply mean-pooling over view features from a frozen DINO backbone gives decent performance. Yet, further adaptation causes severe overfitting to average view patterns of known classes. To combat this, we then design the Chunking and Adapting Module (CAM). It segments multi-view images into chunks and dynamically integrates local view relations, yielding more robust features than the standard pooling strategy. Finally, we propose a Virtual Feature Synthesis (VFS) module to explicitly mitigate bias towards known categories. Under the hood, VFS leverages CLIP’s broad, pre-aligned vision-language space to synthesize virtual features for unseen classes. By exposing DEC to these virtual features, we greatly enhance its open-set discrimination capacity. Extensive experiments on standard open-set 3DOR benchmarks demonstrate its superior efficacy.
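The chunking idea can be illustrated with a minimal sketch, not the paper's actual CAM: multi-view features from a frozen backbone are split into chunks, pooled within each chunk, lightly adapted, and pooled again into a single descriptor (feature dimension and chunk size are placeholders):

```python
import torch
import torch.nn as nn

class ChunkPooling(nn.Module):
    """Illustrative chunk-wise pooling over multi-view features."""

    def __init__(self, dim=768, chunk_size=4):
        super().__init__()
        self.chunk_size = chunk_size
        self.adapter = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, view_feats):
        # view_feats: (B, V, D) frozen per-view features, V divisible by chunk_size
        b, v, d = view_feats.shape
        chunks = view_feats.view(b, v // self.chunk_size, self.chunk_size, d)
        chunk_desc = chunks.mean(dim=2)                      # pool views within each chunk
        chunk_desc = chunk_desc + self.adapter(chunk_desc)   # lightweight residual adaptation
        return chunk_desc.mean(dim=1)                        # (B, D) final 3D descriptor
```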
[261] LoViF 2026 Challenge on Real-World All-in-One Image Restoration: Methods and Results
Xiang Chen, Hao Li, Jiangxin Dong, Jinshan Pan, Xin Li, Xin He, Naiwei Chen, Shengyuan Li, Fengning Liu, Haoyi Lv, Haowei Peng, Yilian Zhong, Yuxiang Chen, Shibo Yin, Yushun Fang, Xilei Zhu, Yahui Wang, Chen Lu, Kaibin Chen, Xu Zhang, Xuhui Cao, Jiaqi Ma, Ziqi Wang, Shengkai Hu, Yuning Cui, Huan Zhang, Shi Chen, Bin Ren, Lefei Zhang, Guanglu Dong, Qiyao Zhao, Tianheng Zheng, Chunlei Li, Lichao Mou, Chao Ren, Wangzhi Xing, Xin Lu, Enxuan Gu, Jingxi Zhang, Diqi Chen, Qiaosi Yi, Bingcai Wei, Mingyu Liu, Pengyu Wang, Ce Liu, Miaoxin Guan, Boyu Chen, Hongyu Li, Jian Zhu, Xinrui Luo, Ziyang He, Jiayu Wang, Yichen Xiang, Huayi Qi, Haoyu Bian, Yiran Li, Sunlichen Zhou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a review of the LoViF Challenge on Real-World All-in-One Image Restoration. The challenge aimed to advance research on real-world all-in-one image restoration under diverse real-world degradation conditions, including blur, low-light, haze, rain, and snow. It provided a unified benchmark to evaluate the robustness and generalization ability of restoration models across multiple degradation categories within a common framework. The competition attracted 124 registered participants and received 9 valid final submissions with corresponding fact sheets, significantly contributing to the progress of real-world all-in-one image restoration. This report provides a detailed analysis of the submitted methods and corresponding results, emphasizing recent progress in unified real-world image restoration. The analysis highlights effective approaches and establishes a benchmark for future research in real-world low-level vision.
[262] TS-Attn: Temporal-wise Separable Attention for Multi-Event Video Generation
Hongyu Zhang, Yufan Deng, Zilin Pan, Peng-Tao Jiang, Bo Li, Qibin Hou, Zhiyang Dou, Zhen Dong, Daquan Zhou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generating high-quality videos from complex temporal descriptions that contain multiple sequential actions is a key unsolved problem. Existing methods are constrained by an inherent trade-off: using multiple short prompts fed sequentially into the model improves action fidelity but compromises temporal consistency, while a single complex prompt preserves consistency at the cost of prompt-following capability. We attribute this problem to two primary causes: 1) temporal misalignment between video content and the prompt, and 2) conflicting attention coupling between motion-related visual objects and their associated text conditions. To address these challenges, we propose a novel, training-free attention mechanism, Temporal-wise Separable Attention (TS-Attn), which dynamically rearranges attention distribution to ensure temporal awareness and global coherence in multi-event scenarios. TS-Attn can be seamlessly integrated into various pre-trained text-to-video models, boosting StoryEval-Bench scores by 33.5% and 16.4% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B with only a 2% increase in inference time. It also supports plug-and-play usage across models for multi-event image-to-video generation. The source code and project page are available at https://github.com/Hong-yu-Zhang/TS-Attn.
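Without the paper's exact formulation, the core idea of restricting each temporal segment of video tokens to its own event prompt can be sketched as a block cross-attention mask (token counts and layout are hypothetical):

```python
import torch

def build_event_attention_mask(frames_per_event, tokens_per_frame, prompt_lens):
    """Boolean mask (True = attend) for multi-event cross-attention.

    Video tokens of event i may only attend to the text tokens of prompt i,
    so that motion in each segment is tied to its own description.
    """
    n_events = len(prompt_lens)
    n_video = n_events * frames_per_event * tokens_per_frame
    n_text = sum(prompt_lens)
    mask = torch.zeros(n_video, n_text, dtype=torch.bool)
    text_start = 0
    for i, plen in enumerate(prompt_lens):
        v_start = i * frames_per_event * tokens_per_frame
        v_end = v_start + frames_per_event * tokens_per_frame
        mask[v_start:v_end, text_start:text_start + plen] = True
        text_start += plen
    return mask

# Two events, 2 frames each, 4 tokens per frame, prompts of length 5 and 7:
m = build_event_attention_mask(2, 4, [5, 7])
print(m.shape)  # torch.Size([16, 12])
```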
[263] Deep sprite-based image models: An analysis
Zeynep Sonat Baltacı, Romain Loiseau, Mathieu Aubry
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While foundation models drive steady progress in image segmentation and diffusion algorithms compose ever more realistic images, the seemingly simple problem of identifying recurrent patterns in a collection of images remains very much open. In this paper, we focus on sprite-based image decomposition models, which have shown some promise for clustering and image decomposition and are appealing because of their high interpretability. These models come in different flavors, need to be tailored to specific datasets, and struggle to scale to images with many objects. We dive into the details of their design, identify their core components, and perform an extensive analysis on clustering benchmarks. We leverage this analysis to propose a deep sprite-based image decomposition method that performs on par with state-of-the-art unsupervised class-aware image segmentation methods on the standard CLEVR benchmark, scales linearly with the number of objects, explicitly identifies object categories, and fully models images in an easily interpretable way.
[264] Seeing Candidates at Scale: Multimodal LLMs for Visual Political Communication on Instagram
Michael Achmann-Denkler, Mario Haim, Christian Wolff
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a computational case study that evaluates the capabilities of specialized machine learning models and emerging multimodal large language models for Visual Political Communication (VPC) analysis. Focusing on concentrated visibility in Instagram stories and posts during the 2021 German federal election campaign, we compare the performance of traditional computer vision models (FaceNet512, RetinaFace, Google Cloud Vision) with a multimodal large language model (GPT-4o) in identifying front-runner politicians and counting individuals in images. GPT-4o outperformed the other models, achieving a macro F1-score of 0.89 for face recognition and 0.86 for person counting in stories. These findings demonstrate the potential of advanced AI systems to scale and refine visual content analysis in political communication while highlighting methodological considerations for future research.
[265] Evaluating Histogram Matching for Robust Deep learning-Based Grapevine Disease Detection
Ruben Pascual, Inés Hernández, Salvador Gutiérrez, Javier Tardaguila, Pedro Melo-Pinto, Daniel Paternain, Mikel Galar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Variability in illumination is a primary factor limiting deep learning robustness for field-based plant disease detection. This study evaluates Histogram Matching (HM), a technique that transforms the pixel intensity distribution of an image to match a reference profile, to mitigate this in grapevine classification, distinguishing among healthy leaves, downy mildew, and spider mite damage. We propose a dual-stage integration of HM: (i) as a preprocessing step for normalization, and (ii) as a data augmentation technique to introduce controlled training variability. Experiments using 1,469 RGB images (comprising homogeneous leaf-focused and heterogeneous canopy samples) to train ResNet-18 models demonstrate that this combination significantly enhances robustness on real-world canopy images. While leaf-focused samples showed marginal gains, the canopy subset improved markedly, indicating that balancing normalization with histogram-based diversification effectively bridges the domain gap caused by uncontrolled lighting.
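Histogram matching itself is a standard operation; a minimal sketch of the dual use described above (normalising to a fixed reference as preprocessing, matching to a randomly drawn training image as augmentation) using scikit-image, with helper names and the training-pool structure assumed for illustration:

```python
import numpy as np
from skimage.exposure import match_histograms

def hm_normalize(image, reference):
    """Preprocessing: map the image's per-channel histogram onto a fixed reference image."""
    return match_histograms(image, reference, channel_axis=-1)

def hm_augment(image, train_pool, rng=None):
    """Augmentation: match the histogram of a randomly drawn training image."""
    rng = rng or np.random.default_rng()
    ref = train_pool[rng.integers(len(train_pool))]
    return match_histograms(image, ref, channel_axis=-1)
```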
[266] Paparazzo: Active Mapping of Moving 3D Objects
Davide Allegro, Shiyao Li, Stefano Ghidoni, Vincent Lepetit
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Current 3D mapping pipelines generally assume static environments, which limits their ability to accurately capture and reconstruct moving objects. To address this limitation, we introduce the novel task of active mapping of moving objects, in which a mapping agent must plan its trajectory while compensating for the object’s motion. Our approach, Paparazzo, provides a learning-free solution that robustly predicts the target’s trajectory and identifies the most informative viewpoints from which to observe it, to plan its own path. We also contribute a comprehensive benchmark designed for this new task. Through extensive experiments, we show that Paparazzo significantly improves 3D reconstruction completeness and accuracy compared to several strong baselines, marking an important step toward dynamic scene understanding. Project page: https://davidea97.github.io/paparazzo-page/
[267] EgoSelf: From Memory to Personalized Egocentric Assistant
Yanshuo Wang, Yuan Xu, Xuesong Li, Jie Hong, Yizhou Wang, Chang Wen Chen, Wentao Zhu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Egocentric assistants often rely on first-person view data to capture user behavior and context for personalized services. Since different users exhibit distinct habits, preferences, and routines, such personalization is essential for truly effective assistance. However, effectively integrating long-term user data for personalization remains a key challenge. To address this, we introduce EgoSelf, a system that includes a graph-based interaction memory constructed from past observations and a dedicated learning task for personalization. The memory captures temporal and semantic relationships among interaction events and entities, from which user-specific profiles are derived. The personalized learning task is formulated as a prediction problem where the model predicts possible future interactions from individual user’s historical behavior recorded in the graph. Extensive experiments demonstrate the effectiveness of EgoSelf as a personalized egocentric assistant. Code is available at \href{https://abie-e.github.io/egoself_project/}{https://abie-e.github.io/egoself_project/}.
[268] RF-HiT: Rectified Flow Hierarchical Transformer for General Medical Image Segmentation
Ahmed Marouane Djouama, Abir Belaala, Abdellah Zakaria Sellam, Salah Eddine Bekhouche, Cosimo Distante, Abdenour Hadid
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate medical image segmentation requires both long-range contextual reasoning and precise boundary delineation, a task where existing transformer- and diffusion-based paradigms are frequently bottlenecked by quadratic computational complexity and prohibitive inference latency. We propose RF-HiT, a Rectified Flow Hierarchical Transformer that integrates an hourglass transformer backbone with a multi-scale hierarchical encoder for anatomically guided feature conditioning. Unlike prior diffusion-based approaches, RF-HiT leverages rectified flow with efficient transformer blocks to achieve linear complexity while requiring only a few discretization steps. The model further fuses conditioning features across resolutions via learnable interpolation, enabling effective multi-scale representation with minimal computational overhead. As a result, RF-HiT achieves a strong efficiency-performance trade-off, requiring only 10.14 GFLOPs, 13.6M parameters, and inference in as few as three steps. Despite its compact design, RF-HiT attains 91.27% mean Dice on ACDC and 87.40% on BraTS 2021, achieving performance comparable to or exceeding that of significantly more intensive architectures. This demonstrates its strong potential as a robust, computationally efficient foundation for real-time clinical segmentation.
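Rectified flow follows a simple recipe regardless of architecture; a generic training step and few-step Euler sampler (not RF-HiT's specific conditioning; the model signature is assumed) interpolates between a noise sample and the target mask and regresses the constant straight-line velocity:

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, mask, cond_features):
    """One generic rectified-flow training step for segmentation.

    mask: (B, C, H, W) target segmentation (e.g. one-hot); cond_features is
    whatever image conditioning the model expects. The model predicts the
    velocity v(x_t, t, cond) and is trained towards the target x1 - x0.
    """
    x1 = mask
    x0 = torch.randn_like(x1)                                # simple prior sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1                            # linear interpolation
    v_target = x1 - x0
    v_pred = model(x_t, t.flatten(), cond_features)
    return F.mse_loss(v_pred, v_target)

@torch.no_grad()
def sample(model, cond_features, shape, steps=3):
    """Few-step Euler integration from noise to a segmentation estimate."""
    x = torch.randn(shape, device=cond_features.device)
    for i in range(steps):
        t = torch.full((shape[0],), i / steps, device=x.device)
        x = x + model(x, t, cond_features) / steps
    return x
```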
[269] TransSplat: Unbalanced Semantic Transport for Language-Driven 3DGS Editing
Yanhui Chen, Jiahong Li, Jingchao Wang, Junyi Lin, Zixin Zeng, Yang Shi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Language-driven 3D Gaussian Splatting (3DGS) editing provides a more convenient approach for modifying complex scenes in VR/AR. Standard pipelines typically adopt a two-stage strategy: first editing multiple 2D views, and then optimizing the 3D representation to match these edited observations. Existing methods mainly improve view consistency through multi-view feature fusion, attention filtering, or iterative recalibration. However, they fail to explicitly address a more fundamental issue: the semantic correspondence between edited 2D evidence and 3D Gaussians. To tackle this problem, we propose TransSplat, which formulates language-driven 3DGS editing as a multi-view unbalanced semantic transport problem. Specifically, our method establishes correspondences between visible Gaussians and view-specific editing prototypes, thereby explicitly characterizing the semantic relationship between edited 2D evidence and 3D Gaussians. It further recovers a cross-view shared canonical 3D edit field to guide unified 3D appearance updates. In addition, we use transport residuals to suppress erroneous edits in non-target regions, mitigating edit leakage and improving local control precision. Qualitative and quantitative results show that, compared with existing 3D editing methods centered on enhancing view consistency, TransSplat achieves superior performance in local editing accuracy and structural consistency.
[270] SmartPhotoCrafter: Unified Reasoning, Generation and Optimization for Automatic Photographic Image Editing
Ying Zeng, Miaosen Luo, Guangyuan Li, Yang Yang, Ruiyang Fan, Linxiao Shi, Qirui Yang, Jian Zhang, Chengcheng Liu, Siming Zheng, Jinwei Chen, Bo Li, Peng-Tao Jiang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Traditional photographic image editing typically requires users to possess sufficient aesthetic understanding to provide appropriate instructions for adjusting image quality and camera parameters. However, this paradigm relies on explicit human instruction of aesthetic intent, which is often ambiguous, incomplete, or inaccessible to non-expert users. In this work, we propose SmartPhotoCrafter, an automatic photographic image editing method which formulates image editing as a tightly coupled reasoning-to-generation process. The proposed model first performs image quality comprehension and identifies deficiencies by the Image Critic module, and then the Photographic Artist module realizes targeted edits to enhance image appeal, eliminating the need for explicit human instructions. A multi-stage training pipeline is adopted: (i) Foundation pretraining to establish basic aesthetic understanding and editing capabilities, (ii) Adaptation with reasoning-guided multi-edit supervision to incorporate rich semantic guidance, and (iii) Coordinated reasoning-to generation reinforcement learning to jointly optimize reasoning and generation. During training, SmartPhotoCrafter emphasizes photo-realistic image generation, while supporting both image restoration and retouching tasks with consistent adherence to color- and tone-related semantics. We also construct a stage-specific dataset, which progressively builds reasoning and controllable generation, effective cross-module collaboration, and ultimately high-quality photographic enhancement. Experiments demonstrate that SmartPhotoCrafter outperforms existing generative models on the task of automatic photographic enhancement, achieving photo-realistic results while exhibiting higher tonal sensitivity to retouching instructions. Project page: https://github.com/vivoCameraResearch/SmartPhotoCrafter.
[271] Structure-Semantic Decoupled Modulation of Global Geospatial Embeddings for High-Resolution Remote Sensing Mapping
Jienan Lyu, Miao Yang, Jinchen Cai, Yiwen Hu, Guanyi Lu, Junhao Qiu, Runmin Dong
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fine-grained high-resolution remote sensing mapping typically relies on localized visual features, which restricts cross-domain generalizability and often leads to fragmented predictions of large-scale land covers. While global geospatial foundation models offer powerful, generalizable representations, directly fusing their high-dimensional implicit embeddings with high-resolution visual features frequently triggers feature interference and spatial structure degradation due to a severe semantic-spatial gap. To overcome these limitations, we propose a Structure-Semantic Decoupled Modulation (SSDM) framework, which decouples global geospatial representations into two complementary cross-modal injection pathways. First, the structural prior modulation branch introduces the macroscopic receptive field priors from global representations into the self-attention modules of the high-resolution encoder. By guiding local feature extraction with holistic structural constraints, it effectively suppresses prediction fragmentation caused by high-frequency detail noise and excessive intra-class variance. Second, the global semantic injection branch explicitly aligns holistic context with the deep high-resolution feature space and directly supplements global semantics via cross-modal integration, thereby significantly enhancing the semantic consistency and category-level discrimination of complex land covers. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared to existing cross-modal fusion approaches. By unleashing the potential of global embeddings, SSDM consistently improves high-resolution mapping accuracy across diverse scenarios, providing a universal and effective paradigm for integrating geospatial foundation models into high-resolution vision tasks.
[272] PC2Model: ISPRS benchmark on 3D point cloud to model registration
Mehdi Maboudi, Said Harb, Jackson Ferrao, Kourosh Khoshelham, Yelda Turkan, Karam Mawas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Point cloud registration involves aligning one point cloud with another or with a three-dimensional (3D) model, enabling the integration of multimodal data into a unified representation. This is essential in applications such as construction monitoring, autonomous driving, robotics, and virtual or augmented reality (VR/AR).With the increasing accessibility of point cloud acquisition technologies, such as Light Detection and Ranging (LiDAR) and structured light scanning, along with recent advances in deep learning, the research focus has increasingly shifted towards downstream tasks, particularly point cloud-to-model (PC2Model) registration. While data-driven methods aim to automate this process, they struggle with sparsity, noise, clutter, and occlusions in real-world scans, which limit their performance. To address these challenges, this paper introduces the PC2Model benchmark, a publicly available dataset designed to support the training and evaluation of both classical and data-driven methods. Developed under the leadership of ICWG II/Ib, the PC2Model benchmark adopts a hybrid design that combines simulated point clouds with, in some cases, real-world scans and their corresponding 3D models. Simulated data provide precise ground truth and controlled conditions, while real-world data introduce sensor and environmental artefacts. This design supports robust training and evaluation across domains and enables the systematic analysis of model transferability from simulated to real-world scenarios. The dataset is publicly accessible at: https://zenodo.org/uploads/17581812.
[273] Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding
Kadir Yilmaz, Adrian Kruse, Tristan Höfer, Daan de Geus, Bastian Leibe
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Transformers have become a common foundation across deep learning, yet 3D scene understanding still relies on specialized backbones with strong domain priors. This keeps the field isolated from the broader Transformer ecosystem, limiting the transfer of new advances as well as the benefits of increasingly optimized software and hardware stacks. To bridge this gap, we adapt the vanilla Transformer encoder to 3D scenes with minimal modifications. Given an input 3D scene, we partition it into volumetric patch tokens, process them with full global self-attention, and inject positional information via a 3D extension of rotary positional embeddings. We call the resulting model the Volume Transformer (Volt) and apply it to 3D semantic segmentation. Naively training Volt on standard 3D benchmarks leads to shortcut learning, highlighting the limited scale of current 3D supervision. To overcome this, we introduce a data-efficient training recipe based on strong 3D augmentations, regularization, and distillation from a convolutional teacher, making Volt competitive with state-of-the-art methods. We then scale supervision through joint training on multiple datasets and show that Volt benefits more from increased scale than domain-specific 3D backbones, achieving state-of-the-art results across indoor and outdoor datasets. Finally, when used as a drop-in backbone in a standard 3D instance segmentation pipeline, Volt again sets a new state of the art, highlighting its potential as a simple, scalable, general-purpose backbone for 3D scene understanding.
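Volumetric patch tokenisation mirrors ViT patch embedding in 3D; a sketch under assumed voxel-grid shapes and patch size (not the authors' code) is:

```python
import torch
import torch.nn as nn

class VolumePatchEmbed(nn.Module):
    """Turn a voxelised scene into patch tokens, as a 3D analogue of ViT patchify."""

    def __init__(self, in_channels=4, patch_size=8, dim=384):
        super().__init__()
        # A strided 3D convolution both partitions the volume into cubes
        # and linearly projects each cube to a token embedding.
        self.proj = nn.Conv3d(in_channels, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, volume):
        # volume: (B, C, D, H, W) voxel features (e.g. occupancy + colour)
        tokens = self.proj(volume)                     # (B, dim, D', H', W')
        grid = tokens.shape[2:]                        # grid used for 3D positional encoding
        return tokens.flatten(2).transpose(1, 2), grid # (B, N, dim) tokens
```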
[274] GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction
Pradyumna YM, Yuxuan Xue, Yue Chen, Nikita Kister, István Sárándi, Gerard Pons-Moll
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reconstructing physically plausible 3D human-scene interactions (HSI) from a single image currently presents a trade-off: optimization-based methods offer accurate contact but are slow (~20s), while feed-forward approaches are fast yet lack explicit interaction reasoning, producing floating and interpenetration artifacts. Our key insight is that geometry-based human–scene fitting can be amortized into fast feed-forward inference. We present GRAFT (Geometric Refinement And Fitting Transformer), a learned HSI prior that predicts Interaction Gradients: corrective parameter updates that iteratively refine human meshes by reasoning about their 3D relationship to the surrounding scene. GRAFT encodes the interaction state into compact body-anchored tokens, each grounded in the scene geometry via Geometric Probes that capture spatial relationships with nearby surfaces. A lightweight transformer recurrently updates human meshes and re-probes the scene, ensuring the final pose aligns with both learned priors and observed geometry. GRAFT operates either as an end-to-end reconstructor using image features, or with geometry alone as a transferable plug-and-play HSI prior that improves feed-forward methods without retraining. Experiments show GRAFT improves interaction quality by up to 113% over state-of-the-art feed-forward methods and matches optimization-based interaction quality at ${\sim}50{\times}$ lower runtime, while generalizing seamlessly to in-the-wild multi-person scenes and being preferred in 64.8% of cases in a three-way user study. Project page: https://pradyumnaym.github.io/graft .
[275] MOSA: Motion-Guided Semantic Alignment for Dynamic Scene Graph Generation
Xuejiao Wang, Bohao Zhang, Changbo Wang, Gaoqi He
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Dynamic Scene Graph Generation (DSGG) aims to structurally model objects and their dynamic interactions in video sequences for high-level semantic understanding. However, existing methods struggle with fine-grained relationship modeling, semantic representation utilization, and the ability to model tail relationships. To address these issues, this paper proposes a motion-guided semantic alignment method for DSGG (MoSA). First, a Motion Feature Extractor (MFE) encodes object-pair motion attributes such as distance, velocity, motion persistence, and directional consistency. Then, these motion attributes are fused with spatial relationship features through the Motion-guided Interaction Module (MIM) to generate motion-aware relationship representations. To further enhance semantic discrimination capabilities, the cross-modal Action Semantic Matching (ASM) mechanism aligns visual relationship features with text embeddings of relationship categories. Finally, a category-weighted loss strategy is introduced to emphasize learning of tail relationships. Extensive and rigorous testing shows that MoSA performs optimally on the Action Genome dataset.
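As a rough illustration of the kinds of motion attributes listed above (the exact definitions are the paper's, not shown here), one can derive them from per-frame subject and object box centres:

```python
import numpy as np

def pair_motion_attributes(subj_centers, obj_centers, fps=30.0):
    """Distance, relative speed, motion persistence and directional consistency
    for one subject-object pair over T frames (centres: (T, 2) arrays).
    The definitions below are illustrative, not MoSA's exact formulation."""
    rel = obj_centers - subj_centers                   # relative position per frame
    dist = np.linalg.norm(rel, axis=1)                 # (T,) distances
    vel = np.diff(rel, axis=0) * fps                   # (T-1, 2) relative velocity
    speed = np.linalg.norm(vel, axis=1)
    moving = speed > 1e-3
    persistence = moving.mean()                        # fraction of frames with motion
    if moving.sum() >= 2:
        dirs = vel[moving] / speed[moving, None]       # unit velocity directions
        direction_consistency = float(np.linalg.norm(dirs.mean(axis=0)))
    else:
        direction_consistency = 0.0
    return {
        "mean_distance": float(dist.mean()),
        "mean_speed": float(speed.mean()),
        "persistence": float(persistence),
        "direction_consistency": direction_consistency,
    }
```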
[276] CreatiParser: Generative Image Parsing of Raster Graphic Designs into Editable Layers
Weidong Chen, Dexiang Hong, Zhendong Mao, Yutao Cheng, Xinyan Liu, Lei Zhang, Yongdong Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Graphic design images consist of multiple editable layers, such as text, background, and decorative elements, while most generative models produce rasterized outputs without explicit layer structures, limiting downstream editing. Existing graphic design parsing methods typically rely on multi-stage pipelines combining layout prediction, matting, and inpainting, which suffer from error accumulation and limited controllability. We propose a hybrid generative framework for raster-to-layer graphic design parsing that decomposes a design image into editable text, background, and sticker layers. Text regions are parsed using a vision-language model into a text rendering protocol, enabling faithful reconstruction and flexible re-editing, while background and sticker layers are generated using a multi-branch diffusion architecture with RGBA support. We further introduce ParserReward and integrate it with Group Relative Policy Optimization to align generation quality with human design preferences. Extensive experiments on two challenging datasets, \emph{i.e.,} the Parser-40K and Crello datasets, demonstrate superior performance over existing methods, \emph{e.g.,} achieving an overall average improvement of 23.7% across all metrics.
[277] CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation
Yanhui Chen, Baoyao Yang, Siqi Liu, Jingchao Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.
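The decoupled inference can be approximated with a small sketch: per-class score maps obtained from synonymous prompts are first averaged (intra-class enhancement), then all classes compete per pixel on the shared scale (inter-class competition). The map shapes, the shared [0, 1] score scale, and the use of a simple mean are assumptions, not the paper's exact design:

```python
import numpy as np

def synonym_fused_inference(prompt_scores, class_of_prompt):
    """Fuse synonym score maps per class, then take a per-pixel argmax.

    prompt_scores: dict {prompt: (H, W) score map on a shared, comparable scale}
    class_of_prompt: dict {prompt: class name}
    Returns (class_names, (H, W) label map of indices into class_names).
    """
    per_class = {}
    for prompt, score in prompt_scores.items():
        per_class.setdefault(class_of_prompt[prompt], []).append(score)
    class_names = sorted(per_class)
    stacked = np.stack([np.mean(per_class[c], axis=0) for c in class_names])  # (K, H, W)
    return class_names, stacked.argmax(axis=0)
```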
[278] CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation
Xiangyang Luo, Xiaozhe Xin, Tao Feng, Xu Guo, Meiguang Jin, Junfeng Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Synthesizing human–object interaction (HOI) videos has broad practical value in e-commerce, digital advertising, and virtual marketing. However, current diffusion models, despite their photorealistic rendering capability, still frequently fail on (i) the structural stability of sensitive regions such as hands and faces and (ii) physically plausible contact (e.g., avoiding hand–object interpenetration). We present CoInteract, an end-to-end framework for HOI video synthesis conditioned on a person reference image, a product reference image, text prompts, and speech audio. CoInteract introduces two complementary designs embedded into a Diffusion Transformer (DiT) backbone. First, we propose a Human-Aware Mixture-of-Experts (MoE) that routes tokens to lightweight, region-specialized experts via spatially supervised routing, improving fine-grained structural fidelity with minimal parameter overhead. Second, we propose Spatially-Structured Co-Generation, a dual-stream training paradigm that jointly models an RGB appearance stream and an auxiliary HOI structure stream to inject interaction geometry priors. During training, the HOI stream attends to RGB tokens and its supervision regularizes shared backbone weights; at inference, the HOI branch is removed for zero-overhead RGB generation. Experimental results demonstrate that CoInteract significantly outperforms existing methods in structural stability, logical consistency, and interaction realism.
[279] InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
Nikita Kister, Pradyumna YM, István Sárándi, Jiayi Wang, Anna Khoreva, Gerard Pons-Moll
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Training embodied agents to understand 3D scenes as humans do requires large-scale data of people meaningfully interacting with diverse environments, yet such data is scarce. Real-world motion capture is costly and limited to controlled settings, while existing synthetic datasets rely on simple geometric heuristics that ignore rich scene context. In contrast, 2D foundation models trained on internet-scale data have implicitly acquired commonsense knowledge of human-environment interactions. To transfer this knowledge into 3D, we introduce InHabit, a fully automatic and scalable data generator for populating 3D scenes with interacting humans. InHabit follows a render-generate-lift principle: given a rendered 3D scene, a vision-language model proposes contextually meaningful actions, an image-editing model inserts a human, and an optimization procedure lifts the edited result into physically plausible SMPL-X bodies aligned with the scene geometry. Applied to Habitat-Matterport3D, InHabit produces the first large-scale photorealistic 3D human-scene interaction dataset, containing 78K samples across 800 building-scale scenes with complete 3D geometry, SMPL-X bodies, and RGB images. Augmenting standard training data with our samples improves RGB-based 3D human-scene reconstruction and contact estimation, and in a perceptual user study our data is preferred in 78% of cases over the state of the art.
[280] MedFlowSeg: Flow Matching for Medical Image Segmentation with Frequency-Aware Attention
Zhi Chen, Runze Hu, Le Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Flow matching has recently emerged as a principled framework for learning continuous-time transport maps, enabling efficient deterministic generation without relying on stochastic diffusion processes. While generative modeling has shown promise for medical image segmentation, particularly in capturing uncertainty and complex anatomical variability, existing approaches are predominantly built upon diffusion models, which incur substantial computational overhead due to iterative sampling and are often constrained by UNet-based parameterizations. In this work, we introduce MedFlowSeg, a conditional flow matching framework that formulates medical image segmentation as learning a time-dependent vector field that transports a simple prior distribution to the target segmentation distribution. This formulation enables one-step deterministic inference while preserving the expressiveness of generative modeling. We further develop a dual-conditioning mechanism to incorporate structured priors into the learned flow. Specifically, we propose a Dual-Branch Spatial Attention module that injects multi-scale structural information into the flow field, and a Frequency-Aware Attention module that models cross-domain interactions between spatial and spectral representations via discrepancy-aware fusion and time-dependent modulation. Together, these components provide an effective parameterization of conditional flows that capture both global anatomical structure and fine-grained boundary details. We provide extensive empirical validation across multiple medical imaging modalities, demonstrating that MedFlowSeg achieves state-of-the-art performance while significantly reducing computational cost compared to diffusion-based methods. Our results highlight the potential of flow matching as a theoretically grounded and computationally efficient alternative for generative medical image segmentation.
[281] MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation
Liyang Li, Wen Wang, Canyu Zhao, Tianjian Feng, Zhiyue Zhao, Hao Chen, Chunhua Shen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control. This restricts comprehensive controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism. It incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
[282] IR-Flow: Bridging Discriminative and Generative Image Restoration via Rectified Flow
Zihao Fan, Xin Lu, Jie Xiao, Dong Li, Jie Huang, Xueyang Fu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In image restoration, single-step discriminative mappings trained by expectation learning often lack fine details, whereas generative paradigms suffer from inefficient multi-step sampling and noise-residual coupling. To address this dilemma, we propose IR-Flow, a novel image restoration method based on Rectified Flow that serves as a unified framework bridging the gap between discriminative and generative paradigms. Specifically, we first construct multilevel data distribution flows, which expand the ability of models to learn from and adapt to various levels of degradation. Subsequently, cumulative velocity fields are proposed to learn transport trajectories across varying degradation levels, guiding intermediate states toward the clean target, while a multi-step consistency constraint is introduced to enforce trajectory coherence and boost few-step restoration performance. We show that directly establishing a linear transport flow between degraded and clean image domains not only enables fast inference but also improves adaptability to out-of-distribution degradations. Extensive evaluations on deraining, denoising and raindrop removal tasks demonstrate that IR-Flow achieves competitive quantitative results with only a few sampling steps, offering an efficient and flexible framework that maintains an excellent distortion-perception balance. Our code is available at https://github.com/fanzh03/IR-Flow.
[283] Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
Jing Jin, Hao Liu, Yan Bai, Yihang Lou, Zhenke Wang, Tianrun Yuan, Juntong Chen, Yongkang Zhu, Fanhu Zeng, Xuanyu Zhu, Yige Xu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
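The step-level evaluation uses dynamic programming to align predicted reasoning steps with reference solutions; the sketch below shows a generic gap-tolerant alignment of this kind, with the step-similarity function `sim` left as a placeholder (the exact scoring used by StepSTEM is not given here).

```python
def align_steps(pred_steps, ref_steps, sim, gap=0.0):
    """Maximize total step-level similarity between two step sequences."""
    n, m = len(pred_steps), len(ref_steps)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + sim(pred_steps[i - 1], ref_steps[j - 1]),  # match
                dp[i - 1][j] + gap,   # skip a predicted step
                dp[i][j - 1] + gap,   # skip a reference step
            )
    return dp[n][m]

# With several reference solutions, score against each and keep the best:
# best = max(align_steps(pred, ref, sim) for ref in reference_solutions)
```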
[284] Face Anything: 4D Face Reconstruction from Any Image Sequence
Umut Kocasari, Simon Giebenhain, Richard Shaw, Matthias Nießner
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method enables accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation using a transformer-based model that jointly predicts depth and canonical facial coordinates, trained using multi-view geometry data that non-rigidly warps into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3$\times$ lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
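As a toy illustration of why per-pixel canonical facial coordinates make dense tracking easy, the sketch below matches pixels of two frames by nearest neighbors in canonical space (brute force, for clarity); the threshold and data layout are assumptions, not the paper's pipeline.

```python
import numpy as np

def match_by_canonical_coords(canon_a, canon_b, max_dist=0.01):
    """canon_a, canon_b: (H, W, 3) predicted canonical coordinates for two frames."""
    h, w, _ = canon_a.shape
    flat_a = canon_a.reshape(-1, 3)
    flat_b = canon_b.reshape(-1, 3)
    matches = []
    for idx_a, c in enumerate(flat_a):
        d = np.linalg.norm(flat_b - c, axis=1)   # distance to every pixel of frame B
        idx_b = int(d.argmin())
        if d[idx_b] < max_dist:                  # accept only close canonical matches
            matches.append((divmod(idx_a, w), divmod(idx_b, w)))
    return matches  # list of ((row_a, col_a), (row_b, col_b)) pixel pairs
```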
[285] SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
Zewei Zhou, Ruining Yang, Xuewei Qi, Yiluan Guo, Sherry X. Chen, Tao Feng, Kateryna Pistunova, Yishan Shen, Lili Su, Jiaqi Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language-Action (VLA) models offer a promising autonomous driving paradigm for leveraging world knowledge and reasoning capabilities, especially in long-tail scenarios. However, existing VLA models often suffer from high action-generation latency due to their autoregressive generation framework and exhibit limited robustness. In this paper, we propose SpanVLA, a novel end-to-end autonomous driving framework integrating an autoregressive reasoning module with a flow-matching action expert. First, SpanVLA introduces an efficient bridge that leverages the VLM's visual and reasoning guidance to plan future trajectories with a flow-matching policy conditioned on a historical-trajectory initialization, which significantly reduces inference time. Second, to further improve the performance and robustness of the SpanVLA model, we propose a GRPO-based post-training method that enables the VLA model not only to learn from positive driving samples but also to avoid typical negative behaviors and to learn recovery behaviors. We further introduce mReasoning, a new real-world driving reasoning dataset focusing on complex, reasoning-demanding scenarios and negative-recovery samples. Extensive experiments on NAVSIM (v1 and v2) demonstrate the competitive performance of the SpanVLA model. Additionally, qualitative results across diverse scenarios highlight the planning performance and robustness of our model.
[286] A Network-Aware Evaluation of Distributed Energy Resource Control in Smart Distribution Systems
Houchao Gan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Distribution networks with high penetration of Distributed Energy Resources (DERs) increasingly rely on communication networks to coordinate grid-interactive control. While many distributed control schemes have been proposed, they are often evaluated under idealized communication assumptions, making it difficult to assess their performance under realistic network conditions. This work presents an implementation-driven evaluation of a representative virtual power plant (VPP) dispatch algorithm using a co-simulation framework that couples a linearized distribution-system model with packet-level downlink emulation in ns-3. The study considers a modified IEEE 37-node feeder with high photovoltaic penetration and a primal–dual VPP dispatch that simultaneously targets feeder-head active power tracking and voltage regulation. Communication effects are introduced only on the downlink path carrying dual-variable updates, where per-DER packet delays and a hold-last-value strategy are modeled. Results show that, under ideal communication, the dispatch achieves close tracking of the feeder-head power reference while maintaining voltages within the prescribed limits at selected buses. When realistic downlink delay is introduced, the same controller exhibits large oscillations in feeder-head power and more frequent voltage limit violations. These findings highlight that distributed DER control performance can be strongly influenced by communication behavior and motivate evaluation frameworks that explicitly incorporate network dynamics into the assessment of grid-interactive control schemes.
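To illustrate the communication effect being studied, the toy loop below runs a scalar primal–dual dispatch in which the DER applies the last dual value it received (hold-last-value) while downlink packets are delayed; all constants and the scalar model are illustrative, not the paper's feeder model.

```python
def simulate_dispatch(steps, delay_steps, alpha=0.2, beta=0.2,
                      p_pref=0.5, p_ref=1.0, c=1.0):
    lam, p = 0.0, p_pref
    sent = [0.0]                      # dual values already sent over the downlink
    trajectory = []
    for k in range(steps):
        lam_seen = sent[max(0, k - delay_steps)]        # hold-last-value under delay
        p = p - alpha * (c * (p - p_pref) + lam_seen)   # DER-side primal step
        lam = lam + beta * (p - p_ref)                  # coordinator dual step
        sent.append(lam)
        trajectory.append(p)
    return trajectory

# With delay_steps=0 and small step sizes the trajectory settles near p_ref;
# increasing the delay can degrade tracking and induce oscillations.
```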
[287] ReImagine: Rethinking Controllable High-Quality Human Video Generation via Image-First Synthesis
Zhengwentai Sun, Keru Zheng, Chenghong Li, Hongjie Liao, Xihe Yang, Heyuan Li, Yihao Zhi, Shuliang Ning, Shuguang Cui, Xiaoguang Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary model for compositional human image synthesis. Code and data are publicly available at https://github.com/Taited/ReImagine.
[288] Generative Drifting for Conditional Medical Image Generation
Zirong Li, Siyuan Mei, Weiwen Wu, Andreas Maier, Lina Gölz, Yan Xia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Conditional medical image generation plays an important role in many clinically relevant imaging tasks. However, existing methods still face a fundamental challenge in balancing inference efficiency, patient-specific fidelity, and distribution-level plausibility, particularly in high-dimensional 3D medical imaging. In this work, we propose GDM, a generative drifting framework that reformulates deterministic medical image prediction as a multi-objective learning problem to jointly promote distribution-level plausibility and patient-specific fidelity while retaining one-step inference. GDM extends drifting to 3D medical imaging through an attractive-repulsive drift that minimizes the discrepancy between the generator pushforward and the target distribution. To enable stable drifting-based learning in 3D volumetric data, GDM constructs a multi-level feature bank from a medical foundation encoder to support reliable affinity estimation and drifting field computation across complementary global, local, and spatial representations. In addition, a gradient coordination strategy in the shared output space improves optimization balance under competing distribution-level and fidelity-oriented objectives. We evaluate the proposed framework on two representative tasks, MRI-to-CT synthesis and sparse-view CT reconstruction. Experimental results show that GDM consistently outperforms a wide range of baselines, including GAN-based, flow-matching-based, and SDE-based generative models, as well as supervised regression methods, while improving the balance among anatomical fidelity, quantitative reliability, perceptual realism, and inference efficiency. These findings suggest that GDM provides a practical and effective framework for conditional 3D medical image generation.
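The gradient coordination rule itself is not spelled out in the abstract; as a generic stand-in, the sketch below uses a PCGrad-style projection that removes the conflicting component of one objective's gradient before summing. It only illustrates the kind of coordination meant, not GDM's actual strategy.

```python
import numpy as np

def coordinate_gradients(g_dist, g_fid):
    """Combine a distribution-level and a fidelity gradient; if they conflict,
    project the first onto the normal plane of the second before summing."""
    g1, g2 = np.asarray(g_dist, float), np.asarray(g_fid, float)
    if np.dot(g1, g2) < 0:                                   # conflicting directions
        g1 = g1 - (np.dot(g1, g2) / (np.dot(g2, g2) + 1e-12)) * g2
    return g1 + g2

# coordinate_gradients([1.0, 0.0], [-0.5, 1.0]) returns a combined update with
# the conflicting component removed from the first gradient.
```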
[289] CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.
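A minimal sketch of the retrieval-as-context idea, assuming a corpus of geo-registered images indexed by camera position; the nearest references to a query pose are returned for conditioning. The retrieval criterion is an assumption for illustration, not CityRAG's.

```python
import numpy as np

def retrieve_context(corpus_positions, corpus_images, query_position, k=8):
    """corpus_positions: (N, 3) geo-registered camera positions;
    returns the k reference images closest to the queried position."""
    d = np.linalg.norm(corpus_positions - np.asarray(query_position), axis=1)
    idx = np.argsort(d)[:k]
    return [corpus_images[i] for i in idx]
```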
[290] AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model
Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond improving the generative model, we also find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. Thus, we introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. To ensure efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce quadratic complexity. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
[291] Tstars-Tryon 1.0: Robust and Realistic Virtual Try-On for Diverse Fashion Items
Mengting Chen, Zhengrui Chen, Yongchao Du, Zuan Gao, Taihang Hu, Jinsong Lan, Chao Lin, Yefeng Shen, Xingjian Wang, Zhao Wang, Zhengtao Wu, Xiaoli Xu, Zhengze Xu, Hao Yan, Mingzhou Zhang, Jun Zheng, Qinye Zhou, Xiaoyong Zhu, Bo Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in image generation and editing have opened new opportunities for virtual try-on. However, existing methods still struggle to meet complex real-world demands. We present Tstars-Tryon 1.0, a commercial-scale virtual try-on system that is robust, realistic, versatile, and highly efficient. First, our system maintains a high success rate across challenging cases like extreme poses, severe illumination variations, motion blur, and other in-the-wild conditions. Second, it delivers highly photorealistic results with fine-grained details, faithfully preserving garment texture, material properties, and structural characteristics, while largely avoiding common AI-generated artifacts. Third, beyond apparel try-on, our model supports flexible multi-image composition (up to 6 reference images) across 8 fashion categories, with coordinated control over person identity and background. Fourth, to overcome the latency bottlenecks of commercial deployment, our system is heavily optimized for inference speed, delivering near real-time generation for a seamless user experience. These capabilities are enabled by an integrated system design spanning end-to-end model architecture, a scalable data engine, robust infrastructure, and a multi-stage training paradigm. Extensive evaluation and large-scale product deployment demonstrate that Tstars-Tryon 1.0 achieves leading overall performance. To support future research, we also release a comprehensive benchmark. The model has been deployed at an industrial scale on the Taobao App, serving millions of users with tens of millions of requests.
[292] Data Leakage Detection and De-duplication in Large Scale Geospatial Image Datasets
Yeshwanth Kumar Adimoolam, Charalambos Poullis, Melinos Averkiou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2304.02296).
[293] How to Teach Large Multimodal Models New Skills
Zhen Zhu, Yiming Gong, Yao Xiao, Yaoyao Liu, Derek Hoiem
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.08564).
[294] AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
Huzheng Yang, James Gee, Jianbo Shi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2406.18344).
[295] On the Generalizability of Foundation Models for Crop Type Mapping
Yi-Chia Chang, Adam J. Stewart, Favyen Bastani, Piper Wolters, Shreya Kannan, George R. Huber, Jingtong Wang, Arindam Banerjee
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2409.09451).
[296] EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training
Yiying Wei, Hadi Amirpour, Jong Hwan Ko, Christian Timmerer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2411.16312).
[297] Visual Adversarial Attack on Vision-Language Models for Autonomous Driving
Tianyuan Zhang, Lu Wang, Xinwei Zhang, Yitong Zhang, Boyi Jia, Siyuan Liang, Shengshan Hu, Qiang Fu, Aishan Liu, Xianglong Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2411.18275).
[298] Uncertainty Quantification in Detection Transformers: Object-Level Calibration and Image-Level Reliability
Young-Jin Park, Carson Sobolewski, Navid Azizan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2412.01782).
[299] 3D Foundation Model for Generalizable Disease Detection in Head Computed Tomography
Weicheng Zhu, Haoxu Huang, Huanze Tang, Rushabh Musthyala, Boyang Yu, Long Chen, Emilio Vega, Thomas O’Donnell, Seena Dehkharghani, Jennifer A. Frontera, Arjun V. Masurkar, Kara Melmed, Narges Razavian
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2502.02779).
[300] You Point, I Learn: Online Adaptation of Interactive Segmentation Models for Handling Distribution Shifts in Medical Imaging
Wentian Xu, Ziyun Liang, Harry Anthony, Yasin Ibrahim, Felix Cohen, Guang Yang, Konstantinos Kamnitsas
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2503.06717).
[301] ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, Gautier Viaud
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.08620).
[302] GAIR: Location-Aware Self-Supervised Contrastive Pre-Training with Geo-Aligned Implicit Representations
Zeping Liu, Ni Lao, Zhangyu Wang, Junfeng Jiao, Gengchen Mai
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2503.16683).
[303] RESFL: An Uncertainty-Aware Framework for Responsible Federated Learning by Balancing Privacy, Fairness and Utility
Dawood Wasif, Terrence J. Moore, Jin-Hee Cho
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2503.16251).
[304] Centralized Copy-Paste: Enhanced Data Augmentation Strategy for Wildland Fire Semantic Segmentation
Joon Tai Kim, Tianle Chen, Ziyu Dong, Nishanth Kunchala, Alexander Guller, Daniel Ospina Acero, Roger Williams, Mrinal Kumar
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2507.06321).
[305] A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends
Yihao Ding, Siwen Luo, Yue Dai, Yanbei Jiang, Zechuan Li, Qiang Sun, Geoffrey Martin, Wei Liu, Yifan Peng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2507.09861).
[306] LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization
Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2506.09373).
[307] SMART-Ship: A Comprehensive Synchronized Multi-modal Aligned Remote Sensing Targets Dataset and Benchmark for Berthed Ships Analysis
Chen-Chen Fan, Peiyao Guo, Linping Zhang, Kehan Qi, Haolin Huang, Yong-Qiang Mao, Yuxi Suo, Zhizhuo Jiang, Yu Liu, You He
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2508.02384).
[308] FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling
Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2509.12052).
[309] Adapting Self-Supervised Representations as a Latent Space for Efficient Generation
Ming Gui, Johannes Schusterbauer, Timy Phan, Felix Krause, Josh Susskind, Miguel Angel Bautista, Björn Ommer
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.14630).
[310] ReefNet: A Large-Scale Dataset and Benchmark for Fine-Grained Coral Reef Recognition
Abdulwahab Felemban, Yahia Battach, Faizan Farooq Khan, Yuqian Fu, Xuhui Liu, Yesmeen M. Khattab, Yousef A. Radwan, Xiang Li, Fabio Marchese, Sara Beery, Burton H. Jones, Francesca Benzoni, Mohamed Elhoseiny
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.16822).
[311] Towards Reliable Human Evaluations in Gesture Generation: Insights from a Community-Driven State-of-the-Art Benchmark
Rajmund Nagy, Hendric Voss, Thanh Hoang-Minh, Mihail Tsakov, Teodor Nikolov, Zeyi Zhang, Tenglong Ao, Sicheng Yang, Shaoli Huang, Yongkang Cheng, M. Hamza Mughal, Rishabh Dabral, Kiran Chhatre, Christian Theobalt, Libin Liu, Stefan Kopp, Rachel McDonnell, Michael Neff, Taras Kucherenko, Youngwoo Yoon, Gustav Eje Henter
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2511.01233).
[312] Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2511.12606).
[313] Mammo-FM: Breast-specific foundational model for Integrated Mammographic Diagnosis, Prognosis, and Reporting
Shantanu Ghosh, Vedant Parthesh Joshi, Rayan Syed, Param Budhraja, Aya Kassem, Katelyn C. Morrison, Alex Tang, Ho Cheung Aiden Wong, Abhishek Varshney, Payel Basak, Weicheng Dai, Judy Wawira Gichoya, Hari M. Trivedi, Imon Banerjee, Shyam Visweswaran, Clare B. Poynton, Kayhan Batmanghelich
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.00198).
[314] Realistic Handwritten Multi-Digit Writer (MDW) Number Recognition Challenges
Kiri L. Wagstaff
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.00676).
[315] PhotoFramer: Multi-modal Image Composition Instruction
Zhiyuan You, Ke Wang, He Zhang, Xin Cai, Jinjin Gu, Tianfan Xue, Chao Dong, Zhoutong Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.00993).
[316] Recurrent Video Masked Autoencoders
Daniel Zoran, Nikhil Parthasarathy, Yi Yang, Drew A Hudson, Joao Carreira, Andrew Zisserman
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.13684).
[317] MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
Zhipeng Du, Duolikun Danier, Jan Eric Lenssen, Hakan Bilen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.15577).
[318] Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Optical Remote Sensing
Xu Zhang, Junyao Ge, Yang Zheng, Kaitai Guo, Jimin Liang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2512.19302).
[319] Ensembling Pruned Attention Heads For Uncertainty-Aware Efficient Transformers
Firas Gabetni, Giuseppe Curci, Andrea Pilzer, Subhankar Roy, Elisa Ricci, Gianni Franchi
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.18358).
[320] Adversarial Attacks on Medical Hyperspectral Imaging Exploiting Spectral-Spatial Dependencies and Multiscale Features
Yunrui Gu, Zhenzhe Gao, Cong Kong, Jiawei Du, Zhaoxia Yin
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.07056).
[321] KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering
Zhiyang Li, Ao Ke, Yukun Cao, Xike Xie
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.11632).
[322] Weakly supervised framework for wildlife detection and counting in challenging Arctic environments: a case study on caribou (Rangifer tarandus)
Ghazaleh Serati, Samuel Foucher, Jerome Theau
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.18891).
[323] Cloning Deterministic Worlds: The Critical Role of Latent Geometry in Long-Horizon World Models
Zaishuo Xia, Yukuan Lu, Xinyi Li, Yifan Xu, Yubei Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2510.26782).
[324] Lingua-SafetyBench: A Benchmark for Safety Evaluation of Multilingual Vision-Language Models
Enyi Shi, Pengyang Shao, Yanxin Zhang, Chenhang Cui, Jiayi Lyu, Xiaobo Xia, Fei Shen, Tat-Seng Chua
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2601.22737).
[325] Learning Evolution via Optimization Knowledge Adaptation
Chao Wang, Lingling Li, Licheng Jiao, Jiaxuan Zhao, Fang Liu, Shuyuan Yang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2501.02200).
[326] Mitigating Long-Tail Bias via Prompt-Controlled Diffusion Augmentation
Buddhi Wijenayake, Nichula Wasalathilake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake, Vishal M. Patel
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.04749).
[327] TFusionOcc: T-Primitive Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
Zhenxing Ming, Yaoqi Huang, Julie Stephany Berrio, Mao Shan, Stewart Worrall
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.06400).
[328] CLIPoint3D: Language-Grounded Few-Shot Unsupervised 3D Point Cloud Domain Adaptation
Mainak Singha, Sarthak Mehrotra, Paolo Casari, Subhasis Chaudhuri, Elisa Ricci, Biplab Banerjee
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.20409).
[329] AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison
Xi Jiang, Yue Guo, Jian Li, Yong Liu, Bin-Bin Gao, Hanqiu Deng, Jun Liu, Heng Zhao, Chengjie Wang, Feng Zheng
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2603.13779).
[330] MAGICIAN: Efficient Long-Term Planning with Imagined Gaussians for Active Mapping
Shiyao Li, Antoine Guédon, Shizhe Chen, Vincent Lepetit
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2603.22650).
[331] ORSIFlow: Saliency-Guided Rectified Flow for Optical Remote Sensing Salient Object Detection
Haojing Chen, Zhihang Liu, Yutong Li, Tao Tan, Haoyu Bian, Qiuju Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2603.28584).
[332] Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
Longwei Xu, Feng Feng, Shaojie Zhang, Xin Chen, Hang Li, Anan Du, Hailong Yu, Pei Fu, Zhenbo Luo, Jian Luan
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.00161).
[333] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Arezoo Borji, Gernot Kronreif, Bernhard Angermayr, Francisco Mario Calisto, Ali Abbasian Ardakani, Wolfgang Birkfellner, Inna Servetnyk, Yinyin Yuan, Sepideh Hatamikia
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.01798).
[334] Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
Haocheng Tang, Xingyu Dang, Junmei Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.03476).
[335] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations
Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.01219).
[336] SmokeGS-R: Physics-Guided Pseudo-Clean 3DGS for Real-World Multi-View Smoke Restoration
Xueming Fu, Lixia Han
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.05301).
[337] VDPP: Video Depth Post-Processing for Speed and Scalability
Daewon Yoon, Injun Baek, Sangyu Han, Yearim Kim, Nojun Kwak
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.06665).
[338] Adaptive Prompt Elicitation for Text-to-Image Generation
Xinyi Wen, Lena Hegemann, Xiaofu Jin, Shuai Ma, Antti Oulasvirta
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2602.04713).
[339] Bridging Time and Space: Decoupled Spatio-Temporal Alignment for Video Grounding
Xuezhen Tu, Jingyu Wu, Fangyu Kang, Qingpeng Nong, Kaijin Zhang, Chaoyue Niu, Fan Wu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.08014).
[340] LAMP: Lift Image-Editing as General 3D Priors for Open-world Manipulation
Jingjing Wang, Zhengdong Hong, Chong Bao, Yuke Zhu, Junhan Sun, Guofeng Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.08475).
[341] Unsupervised Local Plasticity in a Multi-Frequency VisNet Hierarchy
Mehdi Fatan Serj, C. Alejandro Parraga, Xavier Otazu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.09734).
[342] Radar-Informed 3D Multi-Object Tracking under Adverse Conditions
Bingxue Xu, Emil Hedemalm, Ajinkya Khoche, Patric Jensfelt
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.13571).
[343] Reward-Aware Trajectory Shaping for Few-step Visual Generation
Rui Li, Bingyu Li, Yuanzhi Liang, HuangHai Bin, Chi Zhang, XueLong Li
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.14910).
[344] The Amazing Stability of Flow Matching
Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16079).
[345] Winner of CVPR2026 NTIRE Challenge on Image Shadow Removal: Semantic and Geometric Guidance for Shadow Removal via Cascaded Refinement
Lorenzo Beltrame, Jules Salzinger, Filip Svoboda, Jasmin Lampert, Phillipp Fanta-Jende, Radu Timofte, Marco Körner
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16177).
[346] Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering
Nirmalendu Prakash, Narmeen Fatimah Oozeer, Xin Su, Phillip Howard, Shaan Shah, Zoe Wanying He, Shuang Wu, Shivam Raval, Roy Ka-Wei Lee, Meenakshi Khosla, Amir Abdullah
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16487).
[347] BARD: Bridging AutoRegressive and Diffusion Vision-Language Models Via Highly Efficient Progressive Block Merging and Stage-Wise Distillation
Baoyou Chen, Hanchen Xia, Peng Tu, Haojun Shi, Shan Mu, Weihao Yuan, Siyu Zhu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.16514).
[348] MESA: A Training-Free Multi-Exemplar Deep Framework for Restoring Ancient Inscription Textures
Vasileios Toulatzis, Sofia Theodoridou, Ioannis Fudos
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.17390).
[349] ARM: Advantage Reward Modeling for Long-Horizon Manipulation
Yiming Mao, Zixi Yu, Weixin Mao, Yinhao Li, Qirui Hu, Zihan Lan, Minzhao Zhu, Hua Chen
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.03037).
[350] IncreFA: Breaking the Static Wall of Generative Model Attribution
Haotian Qin, Dongliang Chang, Yueying Gao, Yuexuan Tan, Lei Chen, Zhanyu Ma
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.17736).
[351] Weakly-Supervised Referring Video Object Segmentation through Text Supervision
Miaojing Shi, Jun Huang, Zijie Yue, Hanli Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.17797).
[352] Memorize When Needed: Decoupled Memory Control for Spatially Consistent Long-Horizon Video Generation
Yanjun Guo, Zhengqiang Zhang, Pengfei Wang, Xinyue Liang, Zhiyuan Ma, Lei Zhang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.18215).
[353] UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
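For reference, the group-relative advantage that GRPO-style methods build on can be computed as below; this is the generic ingredient only, not the full UDM-GRPO objective with its clean-sample actions and forward-process trajectory reconstruction.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """rewards: (num_groups, group_size) rewards for samples sharing a prompt.
    Each sample's advantage is its reward normalized within its own group."""
    r = np.asarray(rewards, dtype=float)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)

# Example: two prompts, four samples each.
# adv = group_relative_advantages([[0.2, 0.9, 0.4, 0.5], [1.0, 1.0, 0.0, 0.5]])
```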
[354] AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
Rui Qian, Chuanhang Deng, Qiang Huang, Jian Xiong, Mingxuan Li, Yingbo Zhou, Wei Zhai, Jintao Chen, Dejing Dou
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.18562).
[355] MultiWorld: Scalable Multi-Agent Multi-View Video World Models
Haoyu Wu, Jiwen Yu, Yingtian Zou, Xihui Liu
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2604.18564).
[356] Personalized Embodied Navigation for Portable Object Finding
Vishnu Sashank Dorbala, Bhrij Patel, Amrit Singh Bedi, Dinesh Manocha
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Summary unavailable (arXiv:2403.09905).
[357] Fast and Robust Diffusion Posterior Sampling for MR Image Reconstruction Using the Preconditioned Unadjusted Langevin Algorithm
Moritz Blumenthal, Tina Holliber, Jonathan I. Tamir, Martin Uecker
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.05791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[358] Memory Over Maps: 3D Object Localization Without Reconstruction
Rui Zhou, Xander Yap, Jianwen Cao, Allison Lau, Boyang Sun, Marc Pollefeys
Main category: cs.CV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.20530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[359] On Solving the Multiple Variable Gapped Longest Common Subsequence Problem
Marko Djukanović, Nikola Balaban, Christian Blum, Aleksandar Kartelj, Sašo Džeroski, Žiga Zebec
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper addresses the Variable Gapped Longest Common Subsequence (VGLCS) problem, a generalization of the classical LCS problem involving flexible gap constraints between consecutive characters of a solution. The problem arises in molecular sequence comparison, where structural distance constraints between residues must be respected, and in time-series analysis, where events are required to occur within specified temporal delays. We propose a search framework based on the root-based state graph representation, in which the state space comprises a generally large number of rooted state subgraphs. To cope with the resulting combinatorial explosion, an iterative beam search strategy is employed that dynamically maintains a global pool of promising candidate root nodes, enabling effective control of diversification across iterations. To steer the search toward high-quality solutions, several known heuristics from the LCS literature are incorporated into the standalone beam search procedure. To the best of our knowledge, this is the first comprehensive computational study of the VGLCS problem, covering 320 synthetic instances with up to 10 input sequences and up to 500 characters. Experimental results show the robustness of the designed approach over the baseline beam search within comparable runtimes.
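As a rough illustration of iterative beam search with a global candidate pool (a simplified sketch with hypothetical expand/score helpers, not the authors' algorithm):

```python
import heapq

def iterative_beam_search(roots, expand, score, beam_width=5, iterations=3):
    """Simplified beam search keeping a global pool of promising candidates.
    `expand(node)` yields successor nodes and `score(node)` is a heuristic
    value (higher is better); both are hypothetical helpers for illustration."""
    pool = list(roots)                      # global pool of promising nodes
    best = max(pool, key=score)
    for _ in range(iterations):
        beam = heapq.nlargest(beam_width, pool, key=score)
        frontier = [child for node in beam for child in expand(node)]
        if not frontier:
            break
        best = max([best] + frontier, key=score)
        pool = heapq.nlargest(beam_width * 2, pool + frontier, key=score)
    return best

# Toy usage: states are integers, expanding adds 1 or 2, score is the value.
print(iterative_beam_search([0], lambda n: [n + 1, n + 2], lambda n: n))
```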
[360] Beyond One Output: Visualizing and Comparing Distributions of Language Model Generations
Emily Reif, Claire Yang, Jared Hwang, Deniz Nazar, Noah Smith, Jeff Heer
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Users typically interact with and evaluate language models via single outputs, but each output is just one sample from a broad distribution of possible completions. This interaction hides distributional structure such as modes, uncommon edge cases, and sensitivity to small prompt changes, leading users to over-generalize from anecdotes when iterating on prompts for open-ended tasks. Informed by a formative study with researchers who use LMs (n=13) examining when stochasticity matters in practice, how they reason about distributions over language, and where current workflows break down, we introduce GROVE. GROVE is an interactive visualization that represents multiple LM generations as overlapping paths through a text graph, revealing shared structure, branching points, and clusters while preserving access to raw outputs. We evaluate across three crowdsourced user studies (N=47, 44, and 40 participants) targeting complementary distributional tasks. Our results support a hybrid workflow: graph summaries improve structural judgments such as assessing diversity, while direct output inspection remains stronger for detail-oriented questions.
[361] ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna, Rahul Gupta, Kai-Wei Chang, Aram Galstyan, Charith Peris
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement Learning from Human Feedback (RLHF) is central to aligning Large Language Models (LLMs), yet it introduces a critical vulnerability: an imperfect Reward Model (RM) can become a single point of failure when it fails to penalize unsafe behaviors. While existing red-teaming approaches primarily target policy-level weaknesses, they overlook what we term systemic weaknesses: cases where both the core LLM and the RM fail in tandem. We present ARES, a framework that systematically discovers and mitigates such dual vulnerabilities. ARES employs a “Safety Mentor” that dynamically composes semantically coherent adversarial prompts by combining structured component types (topics, personas, tactics, goals) and generates corresponding malicious and safe responses. This dual-targeting approach exposes weaknesses in both the core LLM and the RM simultaneously. Using the discovered vulnerabilities, ARES implements a two-stage repair process: first fine-tuning the RM to better detect harmful content, then leveraging the improved RM to optimize the core model. Experiments across multiple adversarial safety benchmarks demonstrate that ARES substantially enhances safety robustness while preserving model capabilities, establishing a new paradigm for comprehensive RLHF safety alignment.
[362] AI scientists produce results without reasoning scientifically
Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning patterns appear whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. They persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.
[363] Quantum inspired qubit qutrit neural networks for real time financial forecasting
Kanishk Bakshi, Kathiravan Srinivasan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This research investigates the performance and efficacy of machine learning models in stock prediction, comparing Artificial Neural Networks (ANNs), Quantum Qubit-based Neural Networks (QQBNs), and Quantum Qutrit-based Neural Networks (QQTNs). By outlining methodologies, architectures, and training procedures, the study highlights significant differences in training times and performance metrics across models. While all models demonstrate robust accuracies above 70%, the Quantum Qutrit-based Neural Network consistently outperforms with advantages in risk-adjusted returns, measured by the Sharpe ratio, greater consistency in prediction quality through the Information Coefficient, and enhanced robustness under varying market conditions. The QQTN not only surpasses its classical and qubit-based counterparts in multiple quantitative and qualitative metrics but also achieves comparable performance with significantly reduced training times. These results showcase the promising prospects of Quantum Qutrit-based Neural Networks in practical financial applications, where real-time processing is critical. By achieving superior accuracy, efficiency, and adaptability, the proposed models underscore the transformative potential of quantum-inspired approaches, paving the way for their integration into computationally intensive fields.
[364] Human-Guided Harm Recovery for Computer Use Agents
Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,150 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent’s ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods – ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
[365] From Natural Language to Executable Narsese: A Neuro-Symbolic Benchmark and Pipeline for Reasoning with NARS
Mina Gabriel, Pei Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are highly capable at language generation, but they remain unreliable when reasoning requires explicit symbolic structure, multi-step inference, and interpretable uncertainty. This paper presents a neuro-symbolic framework for translating natural-language reasoning problems into executable formal representations using first-order logic (FOL) and Narsese, the language of the Non-Axiomatic Reasoning System (NARS). To support this direction, we introduce NARS-Reasoning-v0.1, a benchmark of natural-language reasoning problems paired with FOL forms, executable Narsese programs, and three gold labels: True, False, and Uncertain. We develop a deterministic compilation pipeline from FOL to executable Narsese and validate retained examples through runtime execution in OpenNARS for Applications (ONA), ensuring that the symbolic targets are not only syntactically well formed but also behaviorally aligned with the intended answer. We further present Language-Structured Perception (LSP), a formulation in which an LLM is trained to produce reasoning-relevant symbolic structure rather than only a final verbal response. As an initial proof of concept, we also train and release a Phi-2 LoRA adapter on NARS-Reasoning-v0.1 for three-label reasoning classification, showing that the benchmark can support supervised adaptation in addition to executable evaluation. Overall, the paper positions executable symbolic generation and execution-based validation as a practical path toward more reliable neuro-symbolic reasoning systems.
[366] How Adversarial Environments Mislead Agentic AI?
Zhonghao Zhan, Huichi Zhou, Zhenhao Li, Peiyuan Jing, Krinos Li, Hamed Haddadi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking “can the agent use tools correctly” but never “what if the tools lie”. We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a “fake world” of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.
[367] Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI + Lean 4 Pipeline
George Koomullil
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present a formally verified framework for patent analysis as a hybrid AI + Lean 4 pipeline. The DAG-coverage core (Algorithm 1b) is fully machine-verified once bounded match scores are fixed. Freedom-to-operate, claim-construction sensitivity, cross-claim consistency, and doctrine-of-equivalents analyses are formalized at the specification level with kernel-checked candidate certificates. Existing patent-analysis approaches rely on manual expert analysis (slow, non-scalable) or ML/NLP methods (probabilistic, opaque, non-compositional). To our knowledge, this is the first framework that applies interactive theorem proving based on dependent type theory to intellectual property analysis. Claims are encoded as DAGs in Lean 4, match strengths as elements of a verified complete lattice, and confidence scores propagate through dependencies via proven-correct monotone functions. We formalize five IP use cases (patent-to-product mapping, freedom-to-operate, claim construction sensitivity, cross-claim consistency, doctrine of equivalents) via six algorithms. Structural lemmas, the coverage-core generator, and the closed-path identity coverage = W_cov are machine-verified in Lean 4. Higher-level theorems for the other use cases remain informal proof sketches, and their proof-generation functions are architecturally mitigated (untrusted generators whose outputs are kernel-checked and sorry-free axiom-audited). Guarantees are conditional on the ML layer: they certify mathematical correctness of computations downstream of ML scores, not the accuracy of the scores themselves. A case study on a synthetic memory-module claim demonstrates weighted coverage and construction-sensitivity analysis. Validation against adjudicated cases is future work.
[368] Explicit Trait Inference for Multi-Agent Coordination
Suhaib Abdurahman, Etsuko Ishii, Katerina Margatina, Divya Bhargavi, Monica Sunkara, Yi Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: LLM-based multi-agent systems (MAS) show promise on complex tasks but remain prone to coordination failures such as goal drift, error cascades, and misaligned behaviors. We propose Explicit Trait Inference (ETI), a psychologically grounded method for improving coordination. ETI enables agents to infer and track partner characteristics along two established psychological dimensions–warmth (e.g., trust) and competence (e.g., skill)–from interaction histories to guide decisions. We evaluate ETI in controlled settings (economic games), where it reduces payoff loss by 45-77%, and in more realistic, complex multi-agent settings (MultiAgentBench), where it improves performance by 3-29% depending on the scenario and model, relative to a CoT baseline. Additional analysis shows that gains are closely linked to trait inference: ETI profiles predict agents’ actions, and informative profiles drive improvements. These results highlight ETI as a lightweight and robust mechanism for improving coordination in diverse multi-agent settings, and provide the first systematic evidence that LLM agents can (i) reliably infer others’ traits from interaction histories and (ii) leverage structured awareness of others’ traits for coordination.
[369] Error-free Training for MedMNIST Datasets
Bo Deng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In this paper, we introduce a new concept called Artificial Special Intelligence by which Machine Learning models for the classification problem can be trained error-free, thus acquiring the capability of not making repeated mistakes. The method is applied to 18 MedMNIST biomedical datasets. Except for three datasets, which suffer from the double-labeling problem, all are trained to perfection.
[370] Large Language Models Exhibit Normative Conformity
Mikako Bito, Keita Nishimoto, Kimitaka Asatani, Ichiro Sakata
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The conformity bias exhibited by large language models (LLMs) can pose a significant challenge to decision-making in LLM-based multi-agent systems (LLM-MAS). While many prior studies have treated “conformity” simply as a matter of opinion change, this study introduces the social psychological distinction between informational conformity and normative conformity in order to understand LLM conformity at the mechanism level. Specifically, we design new tasks to distinguish between informational conformity, in which participants in a discussion are motivated to make accurate judgments, and normative conformity, in which participants are motivated to avoid conflict or gain acceptance within a group. We then conduct experiments based on these task settings. The experimental results show that, among the six LLMs evaluated, up to five exhibited tendencies toward not only informational conformity but also normative conformity. Furthermore, intriguingly, we demonstrate that by manipulating subtle aspects of the social context, it may be possible to control the target toward which a particular LLM directs its normative conformity. These findings suggest that decision-making in LLM-MAS may be vulnerable to manipulation by a small number of malicious users. In addition, through analysis of internal vectors associated with informational and normative conformity, we suggest that although both behaviors appear externally as the same form of “conformity,” they may in fact be driven by distinct internal mechanisms. Taken together, these results may serve as an initial milestone toward understanding how “norms” are implemented in LLMs and how they influence group dynamics.
[371] AutomationBench
Daniel Shepard, Robin Salimans
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Existing AI benchmarks for software automation rarely combine cross-application coordination, autonomous API discovery, and policy adherence. Real business workflows demand all three: a single task may span a CRM, inbox, calendar, and messaging platform - requiring the agent to find the right endpoints, follow a policy document, and write correct data to each system. To address this gap, we introduce AutomationBench, a benchmark for evaluating AI agents on cross-application workflow orchestration via REST APIs. Drawing on real workflow patterns from Zapier’s platform, tasks span Sales, Marketing, Operations, Support, Finance, and HR domains. Agents must discover relevant endpoints themselves, follow layered business rules, and navigate environments with irrelevant and sometimes misleading records. Grading is programmatic and end-state only: whether the correct data ended up in the right systems. Even the best frontier models currently score below 10%. AutomationBench provides a challenging, realistic measure of where current models stand relative to the agentic capabilities businesses actually need.
[372] Personalized Benchmarking: Evaluating LLMs by Individual Preferences
Cristina Garbacea, Heran Wang, Chenhao Tan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: With the rise in capabilities of large language models (LLMs) and their deployment in real-world tasks, evaluating LLM alignment with human preferences has become an important challenge. Current benchmarks average preferences across all users to compute aggregate ratings, overlooking individual user preferences when establishing model rankings. Since users have varying preferences in different contexts, we call for personalized LLM benchmarks that rank models according to individual needs. We compute personalized model rankings using ELO ratings and Bradley-Terry coefficients for 115 active Chatbot Arena users and analyze how user query characteristics (topics and writing style) relate to LLM ranking variations. We demonstrate that individual rankings of LLM models diverge dramatically from aggregate LLM rankings, with Bradley-Terry correlations averaging only $\rho = 0.04$ (57% of users show near-zero or negative correlation) and ELO ratings showing moderate correlation ($\rho = 0.43$). Through topic modeling and style analysis, we find users exhibit substantial heterogeneity in topical interests and communication styles, influencing their model preferences. We further show that a compact combination of topic and style features provides a useful feature space for predicting user-specific model rankings. Our results provide strong quantitative evidence that aggregate benchmarks fail to capture individual preferences for most users, and highlight the importance of developing personalized benchmarks that rank LLM models according to individual user preferences.
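To make the per-user ranking computation concrete, here is a minimal sketch of fitting Bradley-Terry strengths from one user's pairwise preferences with the classic minorization-maximization updates; the model names and preference data are hypothetical, and the paper's exact estimation procedure may differ.

```python
from collections import defaultdict

def bradley_terry(pairwise_wins, iters=100):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs
    using the standard MM updates; illustrative sketch only."""
    models = {m for pair in pairwise_wins for m in pair}
    wins, games = defaultdict(int), defaultdict(int)
    for w, l in pairwise_wins:
        wins[w] += 1
        games[frozenset((w, l))] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[frozenset((i, j))] / (p[i] + p[j])
                        for j in models if j != i)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}
    return p

# Toy preferences for a single user (hypothetical data).
prefs = [("gpt-4o", "llama-3"), ("llama-3", "claude-3.5"),
         ("claude-3.5", "gpt-4o"), ("gpt-4o", "claude-3.5")]
print(bradley_terry(prefs))
```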
[373] Integrating Anomaly Detection into Agentic AI for Proactive Risk Management in Human Activity
Farbod Zorriassatine, Ahmad Lotfi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agentic AI, with goal-directed, proactive, and autonomous decision-making capabilities, offers a compelling opportunity to address movement-related risks in human activity, including the persistent hazard of falls among elderly populations. Despite numerous approaches to fall mitigation through fall prediction and detection, existing systems have not yet functioned as universal solutions across care pathways and safety-critical environments. This is largely due to limitations in consistently handling real-world complexity, particularly poor context awareness, high false alarm rates, environmental noise, and data scarcity. We argue that fall detection and fall prediction can usefully be formulated as anomaly detection problems and more effectively addressed through an agentic AI system. More broadly, this perspective enables the early identification of subtle deviations in movement patterns associated with increased risk, whether arising from age-related decline, fatigue, or environmental factors. While technical requirements for immediate deployment are beyond the scope of this paper, we propose a conceptual framework that highlights potential value. This framework promotes a well-orchestrated approach to risk management by dynamically selecting relevant tools and integrating them into adaptive decision-making workflows, rather than relying on static configurations tailored to narrowly defined scenarios.
[374] Reasoning Structure Matters for Safety Alignment of Reasoning Models
Yeonjun In, Wonjoong Kim, Sangwu Park, Chanyoung Park
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large reasoning models (LRMs) achieve strong performance on complex reasoning tasks but often generate harmful responses to malicious user queries. This paper investigates the underlying cause of these safety risks and shows that the issue lies in the reasoning structure itself. Based on this insight, we claim that effective safety alignment can be achieved by altering the reasoning structure. We propose AltTrain, a simple yet effective post-training method that explicitly alters the reasoning structure of LRMs. AltTrain is both practical and generalizable, requiring no complex reinforcement learning (RL) training or reward design, only supervised finetuning (SFT) with a lightweight set of 1K training examples. Experiments across LRM backbones and model sizes demonstrate strong safety alignment, along with robust generalization across reasoning, QA, summarization, and multilingual settings.
[375] AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories
Xue Xia, Chengkai Yao, Mingyu Tsoi, Xinjie Mao, Wenxuan Huang, Jiaqi Wei, Hao Wu, Cheng Tan, Lang Yu, Yuejin Yang, Siqi Sun, Zhangyang Gao
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Systematic ablations are essential to attribute performance gains in AI Virtual Cells, yet they are rarely performed because biological repositories are under-standardized and tightly coupled to domain-specific data and formats. While recent coding agents can translate ideas into implementations, they typically stop at producing code and lack a verifier that can reproduce strong baselines and rigorously test which components truly matter. We introduce AblateCell, a reproduce-then-ablate agent for virtual cell repositories that closes this verification gap. AblateCell first reproduces reported baselines end-to-end by auto-configuring environments, resolving dependency and data issues, and rerunning official evaluations while emitting verifiable artifacts. It then conducts closed-loop ablation by generating a graph of isolated repository mutations and adaptively selecting experiments under a reward that trades off performance impact and execution cost. Evaluated on three single-cell perturbation prediction repositories (CPA, GEARS, BioLORD), AblateCell achieves 88.9% end-to-end workflow success (+29.9% over the human expert) and 93.3% accuracy (+53.3% over the heuristic baseline) in recovering ground-truth critical components. These results enable scalable, repository-grounded verification and attribution directly on biological codebases.
[376] DW-Bench: Benchmarking LLMs on Data Warehouse Graph Topology Reasoning
Ahmed G. A. H Ahmed, C. Okan Sakar
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper introduces DW-Bench, a new benchmark that evaluates large language models (LLMs) on graph-topology reasoning over data warehouse schemas, explicitly integrating both foreign-key (FK) and data-lineage edges. The benchmark comprises 1,046 automatically generated, verifiably correct questions across five schemas. Experiments show that tool-augmented methods substantially outperform static approaches but plateau on the hard compositional subtypes.
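As an illustration of the kind of graph-topology question such a benchmark targets, the sketch below answers a simple downstream-reachability query over a hypothetical schema whose edges mix foreign-key and lineage relations; it is not taken from DW-Bench itself.

```python
from collections import deque

def reachable_tables(edges, start):
    """Which tables are reachable downstream of `start`, following directed
    foreign-key (FK) and lineage edges alike? Hypothetical schema, for
    illustration only."""
    graph = {}
    for src, dst, _kind in edges:
        graph.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen - {start}

edges = [("orders", "customers", "fk"),
         ("fact_sales", "orders", "lineage"),
         ("dashboard_revenue", "fact_sales", "lineage")]
print(reachable_tables(edges, "dashboard_revenue"))
```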
[377] UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
Yadong Li, Guoxin Wu, Haiping Hou, Biye Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Full-duplex speech interaction, as the most natural and intuitive mode of human communication, is driving artificial intelligence toward more human-like conversational systems. Traditional cascaded speech processing pipelines suffer from critical limitations, including accumulated latency, information loss, and error propagation across modules. To address these issues, recent efforts focus on end-to-end audio large language models (LLMs) like GPT-4o, which primarily unify speech understanding and generation tasks. However, most of these models are inherently half-duplex and rely on a suite of separate, task-specific front-end components, such as voice activity detection (VAD) and turn-taking detection (TD). In our development of a speech assistant, we observed that optimizing the speech front-end is equally crucial as advancing the back-end unified model for achieving seamless, responsive interactions. To bridge this gap, we propose the first unified audio front-end LLM (UAF) tailored for full-duplex speech systems. Our model reformulates diverse audio front-end tasks into a single auto-regressive sequence prediction problem, including VAD, TD, speaker recognition (SR), automatic speech recognition (ASR), and question answering (QA). It takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, leverages a reference audio prompt to anchor the target speaker at the beginning, and autoregressively generates discrete tokens encoding both semantic content and system-level state controls (e.g., interruption signals). Experiments demonstrate that our model achieves leading performance across multiple audio front-end tasks and significantly reduces response latency and improves interruption accuracy in real-world interaction scenarios.
[378] SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution
Xiachong Feng, Yi Jiang, Xiaocheng Feng, Deyi Yin, Libo Qin, Yangfan Ye, Lei Huang, Weitao Ma, Yuxuan Gu, Chonghan Qin, Bing Qin, Lingpeng Kong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance’s strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.
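For readers unfamiliar with Shapley-based credit assignment, the following is a textbook exact computation over a toy three-utterance dialogue; the value function is hypothetical, and the paper's actual estimator (which combines Shapley values with expected utility over future trajectories) is more elaborate.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values for a small set of utterances. `value(coalition)`
    is a hypothetical callable giving the dialogue outcome when only that
    subset of utterances is present; illustration of the credit-assignment
    principle, not the paper's estimator."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for coalition in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi[p] += weight * (value(set(coalition) | {p}) - value(set(coalition)))
    return phi

# Toy game: the outcome is good only if both the 'offer' and 'concession'
# utterances are present; 'small talk' contributes nothing.
outcome = lambda s: 1.0 if {"offer", "concession"} <= s else 0.0
print(shapley_values(["offer", "concession", "small talk"], outcome))
```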
[379] On Accelerating Grounded Code Development for Research
Santosh Ganji
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: A major challenge for niche scientific and technical domains in leveraging coding agents is the lack of access to up-to-date, domain-specific knowledge. Foundational models often demonstrate limited reasoning capabilities in specialized fields and cannot inherently incorporate knowledge that evolves through ongoing research and experimentation. Materials scientists exploring novel compounds, communication engineers designing and evaluating new protocols, and bioengineering researchers conducting iterative experiments all face this limitation. These experts typically lack the resources to fine-tune large models or continuously embed new findings, creating a barrier to adopting AI-driven coding agents. To address this, we introduce a framework that gives coding agents instantaneous access to research repositories and technical documentation, enabling real-time, context-aware operation. Our open-source implementation allows users to upload documents via doc-search.dev and includes zed-fork, which enforces domain-specific rules and workflows. Together, these tools accelerate the integration of coding agents into specialized scientific and technical workflows.
[380] Plausible Reasoning and First-Order Plausible Logic
David Billington
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Defeasible statements are statements that are likely, or probable, or usually true, but may occasionally be false. Plausible reasoning makes conclusions from statements that are either facts or defeasible statements without using numbers. So there are no probabilities or suchlike involved. Seventeen principles of logics that do plausible reasoning are suggested and several important plausible reasoning examples are considered. There are 14 necessary principles and 3 desirable principles, one of which is not formally stated. A first-order logic, called Plausible Logic (PL), is defined that satisfies all but two of the desirable principles and reasons correctly with all the examples. As far as we are aware, this is the only such logic. PL has 8 reasoning algorithms because, from a given plausible reasoning situation, there are different sensible conclusions. This article is a condensation of my book “Plausible Reasoning and Plausible Logic” (PRPL), which is to be submitted. Each section of this article corresponds to a chapter in PRPL, and vice versa. The proofs of all the results are in PRPL, so they are omitted in this article.
[381] Learning Lifted Action Models from Unsupervised Visual Traces
Kai Xi, Stephen Gould, Sylvie Thiébaux
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Efficient construction of models capturing the preconditions and effects of actions is essential for applying AI planning in real-world domains. Extensive prior work has explored learning such models from high-level descriptions of state and/or action sequences. In this paper, we tackle a more challenging setting: learning lifted action models from sequences of state images, without action observation. We propose a deep learning framework that jointly learns state prediction, action prediction, and a lifted action model. We also introduce a mixed-integer linear program (MILP) to prevent prediction collapse and self-reinforcing errors among predictions. The MILP takes the predicted states, actions, and action model over a subset of traces and solves for logically consistent states, actions, and action model that are as close as possible to the original predictions. Pseudo-labels extracted from the MILP solution are then used to guide further training. Experiments across multiple domains show that integrating MILP-based correction helps the model escape local optima and converge toward globally consistent solutions.
[382] Reinforcement Learning Improves LLM Accuracy and Reasoning in Disease Classification from Radiology Reports
Yishu Wei, Yi Lin, Adam Flanders, George Shih, Yifan Peng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Accurate disease classification from radiology reports is essential for many applications. While supervised fine-tuning (SFT) of lightweight LLMs improves accuracy, it can degrade reasoning. We propose a two-stage approach: SFT on disease labels followed by Group Relative Policy Optimization (GRPO) to refine predictions by optimizing accuracy and format without reasoning supervision. Across three radiologist-annotated datasets, SFT outperformed baselines and GRPO further improved classification and enhanced reasoning recall and comprehensiveness.
[383] OLLM: Options-based Large Language Models
Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce Options LLM (OLLM), a simple, general method that replaces the single next-token prediction of standard LLMs with a set of learned options for the next token, indexed by a discrete latent variable. Instead of relying on temperature or sampling heuristics to induce diversity, OLLM models variation explicitly: a small latent space parametrizes multiple plausible next-token options which can be selected or searched by a downstream policy. Architecturally, OLLM is a lightweight “plug-in” that inserts two layers, an encoder and a decoder, before the output head, allowing almost any pretrained LLM to be converted with minimal additional parameters. We apply OLLM to a 1.7B-parameter backbone (only 1.56% of parameters trainable) trained on OpenMathReasoning and evaluated on OmniMath. The SOTA LoRA-adapted baselines peak at 51% final answer correctness, while OLLM’s option set allows up to ~70% under optimal latent selection. We then train a compact policy in the latent space that emits latents to control generation. Operating in a low-dimensional option space makes reward optimization far more sample-efficient and substantially reduces common misalignments (e.g., language switching or degenerate reasoning), as the policy is constrained to options learned during SFT. Crucially, this alignment arises from model structure rather than additional KL or handcrafted alignment losses. Our results demonstrate that optionized next-token modeling enhances controllability, robustness, and efficiency in math reasoning, and highlight latent-space policy learning as a promising direction for reinforcement learning in LLMs.
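The "options" plug-in can be pictured with a small PyTorch sketch: an encoder/decoder pair inserted before a frozen stand-in output head, producing one next-token distribution per discrete latent option. This is only an illustrative guess at the structure described above, not the authors' code, and all layer shapes are made up.

```python
import torch
import torch.nn as nn

class OptionHead(nn.Module):
    """Illustrative sketch (not the authors' code): a lightweight
    encoder/decoder pair maps the final hidden state plus a discrete latent
    index to per-option hidden states; a frozen stand-in output head then
    yields one next-token distribution per latent option."""
    def __init__(self, hidden_dim, num_options, vocab_size):
        super().__init__()
        self.option_emb = nn.Embedding(num_options, hidden_dim)
        self.encoder = nn.Linear(hidden_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, hidden_dim)
        self.lm_head = nn.Linear(hidden_dim, vocab_size, bias=False)
        self.lm_head.weight.requires_grad_(False)   # frozen stand-in head

    def forward(self, h):                            # h: (batch, hidden_dim)
        opts = self.option_emb.weight                # (num_options, hidden_dim)
        z = self.encoder(h).unsqueeze(1) + opts      # (batch, num_options, hidden_dim)
        return self.lm_head(self.decoder(z))         # logits per option

head = OptionHead(hidden_dim=16, num_options=4, vocab_size=100)
print(head(torch.randn(2, 16)).shape)  # torch.Size([2, 4, 100])
```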
[384] Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression
Dahyun Jung, Jaewook Lee, Heuiseok Lim
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) require frequent knowledge updates to reflect changing facts and mitigate hallucinations. To meet this demand, lifelong knowledge editing has emerged as a continual approach to modify specific pieces of knowledge without retraining the entire model. Existing parameter editing methods struggle with stability during sequential edits due to catastrophic forgetting. While retrieval-based approaches are proposed to alleviate this issue, their applicability remains limited across various datasets because of high training costs. To address these limitations and enhance scalability in lifelong settings, we propose LightEdit. Our framework first selects relevant knowledge from retrieved information to modify the query effectively. It then incorporates a decoding strategy to suppress the model’s original knowledge probabilities, thereby enabling efficient edits based on the selected information. Extensive experiments on ZSRE, Counterfact, and RIPE benchmarks demonstrate that LightEdit outperforms existing lifelong knowledge editing methods. Furthermore, by minimizing training costs, LightEdit achieves cost-effective scalability, enabling easy adaptation to various datasets.
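The decode-time suppression of the model's original knowledge probabilities can be illustrated with a contrastive-style logit adjustment; this is one common way such suppression is implemented and is only an assumption about LightEdit's exact rule, not the paper's formulation.

```python
import numpy as np

def suppressed_decode(logits_with_edit, logits_original, alpha=1.0):
    """Hypothetical sketch of decode-time suppression: down-weight tokens the
    unedited model already favors, so edited knowledge supplied in context
    dominates. Not necessarily the paper's exact rule."""
    adjusted = logits_with_edit - alpha * logits_original
    probs = np.exp(adjusted - adjusted.max())
    return probs / probs.sum()

# Toy vocabulary of three tokens: the original model prefers token 0,
# the edit-conditioned pass prefers token 2.
print(suppressed_decode(np.array([1.0, 0.5, 2.0]), np.array([3.0, 0.5, 0.5])))
```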
[385] Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory
Masaki Uto
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human–human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.
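Since the paper's ceilings are stated in terms of QWK, a plain reference implementation of quadratic weighted kappa may help ground the discussion; this computes the metric itself on toy scores and does not reproduce the paper's reliability-based ceiling derivations.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes):
    """Quadratic weighted kappa between two integer score vectors: 1 minus the
    ratio of observed to chance-expected quadratic disagreement. Illustrative
    reference implementation, independent of the paper's formulas."""
    a, b = np.asarray(a), np.asarray(b)
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / len(a)   # chance agreement
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1 - (W * O).sum() / (W * E).sum()

# Toy scores from a human rater and an AES model on a 0-3 scale.
human = [0, 1, 2, 3, 3, 2]
model = [0, 1, 2, 2, 3, 2]
print(round(quadratic_weighted_kappa(human, model, n_classes=4), 3))
```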
[386] Reasoning-Aware AIGC Detection via Alignment and Reinforcement
Zhao Wang, Max Xiong, Jianxun Lian, Zhicheng Dou
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid advancement and widespread adoption of Large Language Models (LLMs) have elevated the need for reliable AI-generated content (AIGC) detection, which remains challenging as models evolve. We introduce AIGC-text-bank, a comprehensive multi-domain dataset with diverse LLM sources and authorship scenarios, and propose REVEAL, a detection framework that generates interpretable reasoning chains before classification. Our approach uses a two-stage training strategy: supervised fine-tuning to establish reasoning capabilities, followed by reinforcement learning to improve accuracy, improve logical consistency, and reduce hallucinations. Extensive experiments show that REVEAL achieves state-of-the-art performance across multiple benchmarks, offering a robust and transparent solution for AIGC detection. The project is open-source at https://aka.ms/reveal
[387] ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation
Zhiqin Yang, Zhenyuan Zhang, Xianzhang Jia, Jun Song, Wei Xue, Yonggang Zhang, Yike Guo
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Current AI agent frameworks have made remarkable progress in automating individual tasks, yet all existing systems serve a single user. Human productivity rests on the social and organizational relationships through which people coordinate, negotiate, and delegate. When agents move beyond performing tasks for one person to representing that person in collaboration with others, the infrastructure for cross-user agent collaboration is entirely absent, let alone the governance mechanisms needed to secure it. We argue that the next frontier for AI agents lies not in stronger individual capability, but in the digitization of human collaborative relationships. To this end, we propose a human-symbiotic agent paradigm. Each user owns a permanently bound agent system that collaborates on the owner’s behalf, forming a network whose nodes are humans rather than agents. This paradigm rests on three governance primitives. A layered identity architecture separates a Manager Agent from multiple context-specific Identity Agents; the Manager Agent holds global knowledge but is architecturally isolated from external communication. Scoped authorization enforces per-identity access control and escalates boundary violations to the owner. Action-level accountability logs every operation against its owner’s identity and authorization, ensuring full auditability. We instantiate this paradigm in ClawNet, an identity-governed agent collaboration framework that enforces identity binding and authorization verification through a central orchestrator, enabling multiple users to collaborate securely through their respective agents.
[388] Industrial Surface Defect Detection via Diffusion Generation and Asymmetric Student-Teacher Network
Shuo Feng, Runlin Zhou, Yuyang Li, Guangcan Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Industrial surface defect detection often suffers from limited defect samples, severe long-tailed distributions, and difficulties in accurately localizing subtle defects under complex backgrounds. To address these challenges, this paper proposes an unsupervised defect detection method that integrates a Denoising Diffusion Probabilistic Model (DDPM) with an asymmetric teacher-student architecture. First, at the data level, the DDPM is trained solely on normal samples. By introducing constant-variance Gaussian perturbations and Perlin noise-based masks, high-fidelity and physically consistent defect samples along with pixel-level annotations are generated, effectively alleviating the data scarcity problem. Second, at the model level, an asymmetric dual-stream network is constructed. The teacher network provides stable representations of normal features, while the student network reconstructs normal patterns and amplifies discrepancies between normal and anomalous regions. Finally, a joint optimization strategy combining cosine similarity loss and pixel-wise segmentation supervision is adopted to achieve precise localization of subtle defects. Experimental results on the MVTecAD dataset show that the proposed method achieves 98.4% image-level AUROC and 98.3% pixel-level AUROC, significantly outperforming existing unsupervised and mainstream deep learning methods. The proposed approach does not require large amounts of real defect samples and enables accurate and robust industrial defect detection and localization.
Keywords: industrial defect detection, diffusion models, data generation, teacher-student architecture, pixel-level localization
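The teacher-student discrepancy idea can be sketched in a few lines: regions where the student's features diverge from the teacher's (low cosine similarity) receive high anomaly scores. Toy tensors only; the paper's networks, losses, and Perlin-mask generation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def anomaly_map(teacher_feats, student_feats):
    """Sketch of pixel-level scoring: 1 minus the channel-wise cosine
    similarity between teacher and student feature maps, so larger values
    mean more anomalous regions. Illustrative only."""
    # feats: (batch, channels, height, width)
    sim = F.cosine_similarity(teacher_feats, student_feats, dim=1)  # (B, H, W)
    return 1.0 - sim

t = torch.randn(1, 8, 4, 4)
s = t.clone()
s[..., 2:, 2:] = torch.randn(1, 8, 2, 2)   # corrupt a corner to mimic a defect
print(anomaly_map(t, s))
```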
[389] Do Agents Dream of Root Shells? Partial-Credit Evaluation of LLM Agents in Capture The Flag Challenges
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner, Arie van Deursen, Maliheh Izadi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Model (LLM) agents are increasingly proposed for autonomous cybersecurity tasks, but their capabilities in realistic offensive settings remain poorly understood. We present DeepRed, an open-source benchmark for evaluating LLM-based agents on realistic Capture The Flag (CTF) challenges in isolated virtualized environments. DeepRed places an agent in a Kali attacker environment with terminal tools and optional web search, connected over a private network to a target challenge, and records full execution traces for analysis. To move beyond binary solved/unsolved outcomes, we introduce a partial-credit scoring method based on challenge-specific checkpoints derived from public writeups, together with an automated summarise-then-judge labelling pipeline for assigning checkpoint completion from logs. Using DeepRed, we benchmark ten commercially accessible LLMs on ten VM-based CTF challenges spanning different challenge categories. The results indicate that current agents remain limited: the best model achieves only 35% average checkpoint completion, performing strongest on common challenge types and weakest on tasks requiring non-standard discovery and longer-horizon adaptation.
[390] Towards Energy Impact on AI-Powered 6G IoT Networks: Centralized vs. Decentralized
Anjie Qiu, Donglin Wang, Sanket Partani, Andreas Weinand, Hans D. Schotten
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The emergence of sixth-generation (6G) technologies has introduced new challenges and opportunities for machine learning (ML) applications in Internet of Things (IoT) networks, particularly concerning energy efficiency. As model training and data transmission contribute significantly to energy consumption, optimizing these processes has become critical for sustainable system design. This study first analyzes the energy consumption models of both centralized and decentralized architectures and then presents a testbed deployed within the German railway infrastructure, leveraging sensor data for ML-based predictive maintenance. A comparative analysis of distributed versus Centralized Learning (CL) architectures reveals that distributed models maintain competitive predictive accuracy (~90%) while reducing overall electricity consumption by up to 70%. These findings underscore the potential of distributed ML to improve energy efficiency in real-world IoT deployments, particularly by mitigating transmission-related energy costs.
[391] GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
Ziyang Wang, Jiangfeng Xiao, Chuan Xiao, Ruoxiang Li, Rui Mao, Jianbin Qin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) are expensive to serve because model parameters, attention computation, and KV caches impose substantial memory and latency costs. We present GRASPrune, a structured pruning framework applied after pretraining that jointly prunes FFN channels and KV head groups under a single global budget. Instead of learning importance scores without constraints and applying the budget only after training, GRASPrune learns lightweight gate scores with a projected straight-through estimator that enforces a hard mask satisfying the budget at every step while keeping the backbone weights frozen. After the mask is fixed, we calibrate scaling factors on the retained units to mitigate scale mismatch caused by pruning, and fold these factors into the pruned weights to obtain a smaller dense checkpoint with no extra parameters at inference. On LLaMA-2-7B, GRASPrune removes 50% of parameters and achieves 12.18 perplexity on WikiText-2 while maintaining competitive average zero-shot accuracy on five benchmarks, using four epochs on 512 unlabeled calibration sequences on a single NVIDIA A100 80GB GPU without any full model fine-tuning.
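A minimal sketch of a projected straight-through estimator with a hard global budget, assuming a 1-D vector of gate scores over prunable units (illustrative, not the authors' implementation):

```python
import torch

def budgeted_hard_mask(scores, budget):
    """Projected straight-through estimator sketch: the forward pass uses a
    hard 0/1 mask keeping exactly `budget` units (the global constraint),
    while gradients flow to the underlying gate scores as if the mask were
    the scores themselves. Illustrative only."""
    topk = torch.topk(scores, budget).indices
    hard = torch.zeros_like(scores)
    hard[topk] = 1.0
    # straight-through: hard values forward, gradients w.r.t. `scores` backward
    return hard + scores - scores.detach()

scores = torch.tensor([0.3, 1.2, -0.5, 0.8], requires_grad=True)
mask = budgeted_hard_mask(scores, budget=2)
mask.sum().backward()
print(mask.detach(), scores.grad)
```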
[392] Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents
Vasundra Srininvasan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Long-horizon enterprise agents make high-stakes decisions (loan underwriting, claims adjudication, clinical review, prior authorization) under lossy memory, multi-step reasoning, and binding regulatory constraints. Current evaluation reports a single task-success scalar that conflates distinct failure modes and hides whether an agent is aligned with the standards its deployment environment requires. We propose that long-horizon decision behavior decomposes into four orthogonal alignment axes, each independently measurable and failable: factual precision (FRP), reasoning coherence (RCS), compliance reconstruction (CRR), and calibrated abstention (CAR). CRR is a novel regulatory-grounded axis; CAR is a measurement axis separating coverage from accuracy. We exercise the decomposition on a controlled benchmark (LongHorizon-Bench) covering loan qualification and insurance claims adjudication with deterministic ground-truth construction. Running six memory architectures, we find structure that aggregate accuracy cannot see: retrieval collapses on factual precision; schema-anchored architectures pay a scaffolding tax; plain summarization under a fact-preservation prompt is a strong baseline on FRP, RCS, EDA, and CRR; and all six architectures commit on every case, exposing a decisional-alignment axis that the field has not targeted. The decomposition also surfaced a pre-registered prediction of our own, that summarization would fail factual recall, which the data reversed at large magnitude, an axis-level reversal that aggregate accuracy would have hidden. Institutional alignment (regulatory reconstruction) and decisional alignment (calibrated abstention) are under-represented in the alignment literature and become load-bearing once decisions leave the laboratory. The framework transfers to any regulated decisioning domain via two steps: build a fact schema, and calibrate the CRR auditor prompt.
[393] Do LLMs Game Formalization? Evaluating Faithfulness in Logical Reasoning
Kyuhee Kim, Auguste Poiroux, Antoine Bosselut
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The two-stage pipeline reveals two distinct modes of unfaithfulness: GPT-5 fabricates axioms during proof generation, a reactive fallback detectable via cross-stage comparison, while DeepSeek-R1 mistranslates premises during formalization, producing internally consistent outputs that evade detection entirely. These findings show that high compilation rates or accuracies should not be equated with faithful reasoning. Code and data are available at https://github.com/koreankiwi99/formalization-gaming.
[394] CoDA: Towards Effective Cross-domain Knowledge Transfer via CoT-guided Domain Adaptation
Jianzhi Yan, Le Liu, Buzhou Tang, Yang Xiang, Dongning Sun, Zhiming Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) have achieved substantial advances in logical reasoning, yet they continue to lag behind human-level performance. In-context learning provides a viable solution that boosts the model’s performance via prompting its input with expert-curated, in-domain exemplars. However, in many real-world, expertise-scarce domains, such as low-resource scientific disciplines, emerging biomedical subfields, or niche legal jurisdictions, such high-quality in-domain demonstrations are inherently limited or entirely unavailable, thereby constraining the general applicability of these approaches. To mitigate this limitation, recent efforts have explored the retrieval of cross-domain samples as surrogate in-context demonstrations. Nevertheless, the resulting gains remain modest. This is largely attributable to the pronounced domain shift between source and target distributions, which impedes the model’s ability to effectively identify and exploit underlying shared structures or latent reasoning patterns. Consequently, when relying solely on raw textual prompting, LLMs struggle to abstract and transfer such cross-domain knowledge in a robust and systematic manner. To address these issues, we propose CoDA, which employs a lightweight adapter to directly intervene in the intermediate hidden states. By combining feature-based distillation of CoT-enriched reference representations with Maximum Mean Discrepancy (MMD) for kernelized distribution matching, our method aligns the latent reasoning representation of the source and target domains. Extensive experimental results on multiple logical reasoning tasks across various model families validate the efficacy of CoDA by significantly outperforming the previous state-of-the-art baselines by a large margin.
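A sketch of the two ingredients named in the abstract, feature-based distillation toward CoT-enriched reference states and an MMD term for distribution matching, written as a PyTorch-style loss; the Gaussian kernel, bandwidth, and weighting are illustrative assumptions rather than CoDA's actual configuration:

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Pairwise RBF kernel between two batches of hidden states: (n, d) x (m, d) -> (n, m).
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def mmd2(source: torch.Tensor, target: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Biased estimate of squared MMD between source- and target-domain latents.
    return (gaussian_kernel(source, source, sigma).mean()
            + gaussian_kernel(target, target, sigma).mean()
            - 2 * gaussian_kernel(source, target, sigma).mean())

def coda_style_loss(adapted_hidden: torch.Tensor,
                    cot_reference: torch.Tensor,
                    target_hidden: torch.Tensor,
                    alpha: float = 1.0, beta: float = 0.1) -> torch.Tensor:
    # Feature distillation toward CoT-enriched reference states plus MMD alignment
    # of source- and target-domain representations (illustrative weighting).
    distill = F.mse_loss(adapted_hidden, cot_reference)
    align = mmd2(adapted_hidden, target_hidden)
    return alpha * distill + beta * align
```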
[395] From Experience to Skill: Multi-Agent Generative Engine Optimization via Reusable Strategy Learning
Beining Wu, Fuyou Mao, Jiong Lin, Cheng Yang, Jiaxuan Lu, Yifu Guo, Siyu Zhang, Yifan Wu, Ying Huang, Fu Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generative engines (GEs) are reshaping information access by replacing ranked links with citation-grounded answers, yet current Generative Engine Optimization (GEO) methods optimize each instance in isolation, unable to accumulate or transfer effective strategies across tasks and engines. We reframe GEO as a strategy learning problem and propose MAGEO, a multi-agent framework in which coordinated planning, editing, and fidelity-aware evaluation serve as the execution layer, while validated editing patterns are progressively distilled into reusable, engine-specific optimization skills. To enable controlled assessment, we introduce a Twin Branch Evaluation Protocol for causal attribution of content edits and DSV-CF, a dual-axis metric that unifies semantic visibility with attribution accuracy. We further release MSME-GEO-Bench, a multi-scenario, multi-engine benchmark grounded in real-world queries. Experiments on three mainstream engines show that MAGEO substantially outperforms heuristic baselines in both visibility and citation fidelity, with ablations confirming that engine-specific preference modeling and strategy reuse are central to these gains, suggesting a scalable learning-driven paradigm for trustworthy GEO. Code is available at https://github.com/Wu-beining/MAGEO
[396] SimDiff: Depth Pruning via Similarity and Difference
Yuli Chen, Shuhao Zhang, Fanshen Meng, Bo Cheng, Jiale Han, Qiang Tong, Xiulei Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Depth pruning improves the deployment efficiency of large language models (LLMs) by identifying and removing redundant layers. A widely accepted standard for this identification process is to measure the similarity between layers using cosine distance. However, we find that methods relying solely on this one-dimensional heuristic can exhibit unpredictable performance and even catastrophic collapse across different architectures. To address this issue, we propose SimDiff, a novel layer importance criterion that jointly evaluates layers from two orthogonal perspectives: representational similarity and transformation difference. The difference is quantified using two distinct metrics: MSSD, which is sensitive to outliers and identifies layers that make decisive corrections, and MASD, which robustly measures a layer’s average contribution. Extensive experiments on multiple models ranging from 0.5B to 13B parameters demonstrate that SimDiff significantly outperforms state-of-the-art baselines across various pruning ratios. Notably, our method retains over 91% of LLaMA2-7B’s performance at a 25% pruning ratio and achieves up to a 1.49x inference speedup when pruning 12 layers on LLaMA3.1-8B. We also show that pruned models can be effectively recovered with minimal fine-tuning.
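The abstract does not expand MSSD or MASD, so the statistics below are only guesses at an outlier-sensitive versus a robust measure of a layer's transformation; the sketch shows the general shape of scoring layers by both similarity and difference before choosing pruning candidates:

```python
import torch
import torch.nn.functional as F

def layer_stats(h_in: torch.Tensor, h_out: torch.Tensor) -> dict:
    """Toy per-layer statistics from hidden states of shape (tokens, d)."""
    similarity = F.cosine_similarity(h_in, h_out, dim=-1).mean().item()
    sq_diff = ((h_out - h_in) ** 2).sum(dim=-1)   # per-token transformation magnitude
    return {
        "similarity": similarity,               # high -> layer changes representations little
        "max_sq_diff": sq_diff.max().item(),    # outlier-sensitive: decisive corrections
        "mean_sq_diff": sq_diff.mean().item(),  # robust: average contribution
    }

def pruning_order(stats: list) -> list:
    # Rank layers as pruning candidates: most similar and least transformative first.
    # The real SimDiff combination rule is not given in the abstract; this is a placeholder.
    def importance(s):
        return (1.0 - s["similarity"]) + s["mean_sq_diff"] + s["max_sq_diff"]
    return sorted(range(len(stats)), key=lambda i: importance(stats[i]))
```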
[397] Revac: A Social Deduction Reasoning Agent
Mihir Shriniwas Arya, Avinash Anish, Aditya Ranjan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Social deduction games such as Mafia present a unique AI challenge: players must reason under uncertainty, interpret incomplete and intentionally misleading information, evaluate human-like communication, and make strategic elimination decisions. Unlike deterministic board games, success in Mafia depends not on perfect information or brute-force search, but on inference, memory, and adaptability in the presence of deception. This work presents the design and evaluation of Revac-8, an AI agent developed for the Social Deduction track of the MindGames Arena competition, where it achieved first place. The final agent evolved from a simple two-stage reasoning system into a multi-module architecture that integrates memory-based player profiling, social-graph analysis of accusations and defenses, and dynamic tone selection for communication. These results highlight the importance of structured memory and adaptive communication for achieving strong performance in high-stakes social environments.
[398] DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. Besides, existing open-source multimodal preference datasets suffer from substantial noise, yet there is a lack of effective and scalable curation methods to enhance their quality. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
[399] Enhancing Construction Worker Safety in Extreme Heat: A Machine Learning Approach Utilizing Wearable Technology for Predictive Health Analytics
Syed Sajid Ullah, Amir Khan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Construction workers are highly vulnerable to heat stress, yet tools that translate real-time physiological data into actionable safety intelligence remain scarce. This study addresses this gap by developing and evaluating deep learning models, specifically a baseline Long Short-Term Memory (LSTM) network and an attention-based LSTM, to predict heat stress among 19 workers in Saudi Arabia. Using Garmin Vivosmart 5 smartwatches to monitor metrics such as heart rate, HRV, and oxygen saturation, the attention-based model outperformed the baseline, achieving 95.40% testing accuracy and significantly reducing false positives and negatives. With precision, recall, and F1 scores of 0.982, this approach not only improves predictive performance but also offers interpretable results suitable for integration into IoT-enabled safety systems and BIM dashboards, advancing proactive, informatics-driven safety management in the construction industry.
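A minimal PyTorch sketch of an attention-based LSTM classifier over wearable time series of the kind described (heart rate, HRV, oxygen saturation); layer sizes, window length, and the additive attention form are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class AttentionLSTM(nn.Module):
    """Sketch: LSTM with attention over time steps for binary heat-stress
    classification from wearable signals (e.g. heart rate, HRV, SpO2)."""
    def __init__(self, n_features: int = 3, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)   # per-time-step attention score
        self.head = nn.Linear(hidden, 1)   # heat-stress logit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(x)                            # (batch, time, hidden)
        weights = torch.softmax(self.attn(out), dim=1)   # attention over time
        context = (weights * out).sum(dim=1)             # weighted window summary
        return self.head(context).squeeze(-1)            # one logit per window

# Usage: logits = AttentionLSTM()(torch.randn(8, 120, 3))  # 8 windows of 120 steps
```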
[400] Detecting Data Contamination in Large Language Models
Juliusz Janicki, Savvas Chamezopoulos, Evangelos Kanoulas, Georgios Tsatsaronis
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) utilize large amounts of data for their training, some of which may come from copyrighted sources. Membership Inference Attacks (MIA) aim to detect those documents and whether they have been included in the training corpora of the LLMs. The black-box MIAs require a significant amount of data manipulation; therefore, their comparison is often challenging. We study state-of-the-art (SOTA) MIAs under the black-box assumptions and compare them to each other using a unified set of datasets to determine if any of them can reliably detect membership under SOTA LLMs. In addition, a new method, called the Familiarity Ranking, was developed to showcase a possible approach to black-box MIAs, thereby giving LLMs more freedom in their expression to understand their reasoning better. The results indicate that none of the methods are capable of reliably detecting membership in LLMs, as shown by an AUC-ROC of approximately 0.5 for all methods across several LLMs. The higher TPR and FPR for more advanced LLMs indicate higher reasoning and generalizing capabilities, showcasing the difficulty of detecting membership in LLMs using black-box MIAs.
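The headline result, AUC-ROC near 0.5, can be reproduced conceptually in a few lines: given per-document membership scores from any black-box MIA, the evaluation reduces to a standard ROC computation (the Familiarity Ranking method itself is not sketched here):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mia_auc(scores: np.ndarray, is_member: np.ndarray) -> float:
    """AUC-ROC of a membership score: ~0.5 means the attack cannot separate
    training members from non-members, which is what the paper reports."""
    return roc_auc_score(is_member, scores)

# Illustration: uninformative scores against ground-truth membership labels.
rng = np.random.default_rng(0)
print(mia_auc(rng.random(1000), rng.integers(0, 2, size=1000)))  # close to 0.5
```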
[401] Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
Chuou Xu, Liya Ji, Qifeng Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) as post-training is crucial for enhancing the reasoning ability of large language models (LLMs) in coding and math. However, their capacity for visual semantic arithmetic, inferring relationships from images, remains underexplored. The classic text analogy “king”-“man”+“woman” = “queen” illustrates relational reasoning, yet replacing text with images of “king” and “man” significantly reduces performance because it requires commonsense knowledge and the extraction of concise concepts from irrelevant visual details. This capability is important for service and domestic robotics in unstructured environments, where robots must infer semantic relationships among objects, agents, and actions. In a kitchen, recognizing from images that “powder” and “cake” are related by “is made of” grounds symbolic relations in perception, enabling tool substitution, task generalization, and improved semantic reasoning. Prior work approaches semantic arithmetic by decoding image features after vector arithmetic, but suffers from modality gaps and lacks systematic evaluation. In this paper, we formulate two novel tasks, two-term subtraction and three-term operations, and construct the Image-Relation-Pair Dataset (IRPD) for benchmarking. We further propose Semantic Arithmetic Reinforcement Fine-Tuning (SAri-RFT), which post-trains large vision-language models (LVLMs) using a verifiable function and Group Relative Policy Optimization (GRPO). Our method achieves state-of-the-art results on IRPD and the real-world Visual7W-Telling dataset. By equipping LVLMs with robust cross-modal relational reasoning, this work advances domestic robots’ ability to ground symbolic reasoning in perception, enhancing decision-making, tool adaptability, and human-robot interaction in complex environments. Datasets and source code are provided in the supplementary material.
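A sketch of the underlying arithmetic-in-embedding-space formulation, assuming precomputed image or text embeddings; the paper's contribution is getting such embeddings to carry the right concise concepts, which this snippet takes as given:

```python
import numpy as np

def _unit(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

def semantic_arithmetic(a: np.ndarray, b: np.ndarray, c: np.ndarray,
                        candidates: dict) -> str:
    """Two-term subtraction plus addition: argmax over candidates of cos(a - b + c, x).
    With image embeddings for 'king' and 'man', the hard part the paper targets is
    making a and b encode the relevant concept rather than visual nuisance detail."""
    query = _unit(_unit(a) - _unit(b) + _unit(c))
    scores = {name: float(query @ _unit(v)) for name, v in candidates.items()}
    return max(scores, key=scores.get)
```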
[402] Time Series Augmented Generation for Financial Applications
Anton Kolonin, Alexey Glushchenko, Evgeny Bochkov, Abhishek Saxena
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent’s core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent’s reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.
[403] SafetyALFRED: Evaluating Safety-Conscious Planning of Multimodal Large Language Models
Josue Torres-Fonseca, Naihao Deng, Yinpei Dai, Shane Storks, Yichi Zhang, Rada Mihalcea, Casey Kennington, Joyce Chai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multimodal Large Language Models are increasingly adopted as autonomous agents in interactive environments, yet their ability to proactively address safety hazards remains insufficient. We introduce SafetyALFRED, built upon the embodied agent benchmark ALFRED, augmented with six categories of real-world kitchen hazards. While existing safety evaluations focus on hazard recognition through disembodied question answering (QA) settings, we evaluate eleven state-of-the-art models from the Qwen, Gemma, and Gemini families on not only hazard recognition, but also active risk mitigation through embodied planning. Our experimental results reveal a significant alignment gap: while models can accurately recognize hazards in QA settings, average mitigation success rates for these hazards are low in comparison. Our findings demonstrate that static evaluations through QA are insufficient for physical safety, thus we advocate for a paradigm shift toward benchmarks that prioritize corrective actions in embodied contexts. We open-source our code and dataset under https://github.com/sled-group/SafetyALFRED.git
[404] A Dual Perspective on Synthetic Trajectory Generators: Utility Framework and Privacy Vulnerabilities
Aya Cherigui, Florent Guépin, Arnaud Legendre, Jean-François Couchot
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Human mobility data are used in numerous applications, ranging from public health to urban planning. Human mobility is inherently sensitive, as it can contain information such as religious beliefs and political affiliations. Historically, it has been proposed to modify such data using techniques such as aggregation, obfuscation, or noise addition in order to adequately protect privacy. As these methods come at a great cost in utility, new methods leveraging developments in generative models were introduced. The extent to which such methods answer the privacy-utility trade-off remains an open problem. In this paper, we take a first step towards solving it by introducing and applying a new framework for utility evaluation. Furthermore, we provide evidence that privacy evaluation remains a major challenge and should be tackled through adversarial evaluation in accordance with current EU regulation. We propose a new membership inference attack against a subcategory of generative models, even though this subcategory was deemed private due to its resistance to the trajectory user-linking problem.
[405] A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
[406] Conjuring Semantic Similarity
Tian Yu Liu, Stefano Soatto
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The semantic similarity between sample expressions measures the distance between their latent ‘meaning’. These meanings are themselves typically represented by textual expressions. We propose a novel approach whereby the semantic similarity among textual expressions is based not on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or ‘conjure.’ We show that by choosing the Jeffreys divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.
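In symbols, the proposed similarity is a divergence between the image distributions the two prompts induce; a hedged sketch of the quantity and a generic Monte Carlo form follows (the exact score-based estimator derived from the reverse-time SDEs may differ from this version):

```latex
% Semantic similarity between prompts s and t via the image distributions they induce:
\[
  J(p_s, p_t) \;=\; \mathrm{KL}(p_s \,\|\, p_t) \;+\; \mathrm{KL}(p_t \,\|\, p_s),
\]
% with each KL term estimated by Monte Carlo over samples from the corresponding
% (reverse-time diffusion) generative process:
\[
  \mathrm{KL}(p_s \,\|\, p_t) \;\approx\; \frac{1}{N} \sum_{i=1}^{N}
  \bigl[\, \log p_s(x_i) - \log p_t(x_i) \,\bigr], \qquad x_i \sim p_s .
\]
```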
[407] User Simulation in the Era of Generative AI: User Modeling, Synthetic Data Generation, and System Evaluation
Krisztian Balog, ChengXiang Zhai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: User simulation is an emerging interdisciplinary topic with multiple critical applications in the era of Generative AI. It involves creating an intelligent agent that mimics the actions of a human user interacting with an AI system, enabling researchers to model and analyze user behaviour, generate synthetic data for training, and evaluate interactive AI systems in a controlled and reproducible manner. Because of its broad scope, research on this topic currently remains scattered across artificial intelligence, human-computer interaction, information science, computational social science, and psychology. To address this fragmented landscape of current research, this article presents a foundational synthesis. We highlight the paradigm shift from traditional predictive models to modern generative approaches, and explicitly frame critical ethical considerations – demonstrating how controlled simulation serves not merely as a risk vector for bias, but as a powerful, proactive tool to ensure fair representation and system safety. Furthermore, we establish the theoretical connection between user simulation and the pursuit of Artificial General Intelligence, arguing that realistic simulators are indispensable catalysts for overcoming critical data and evaluation bottlenecks and optimizing personalization. Ultimately, we propose a practical, self-sustaining innovation ecosystem bridging academia and industry to advance this increasingly important technology.
[408] Epistemic Skills: Reasoning about Knowledge and Oblivion
Xiaolong Liang, Yì N. Wáng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an "epistemic skills" metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of "knowability" and "forgettability," defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.
[409] Memory Assignment for Finite-Memory Strategies in Adversarial Patrolling Games
Vojtěch Kůr, Vít Musil, Vojtěch Řehák
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Adversarial Patrolling games form a subclass of Security games where a Defender moves between locations, guarding vulnerable targets. The main algorithmic problem is constructing a strategy for the Defender that minimizes the worst damage an Attacker can cause. We focus on the class of finite-memory (also known as regular) Defender's strategies that experimentally outperformed other competing classes. A finite-memory strategy can be seen as a positional strategy on a finite set of states. Each state consists of a pair of a location and an integer value, called memory. Existing algorithms improve the transitional probabilities between the states but require that the available memory size itself be assigned at each location manually. Choosing the right memory assignment is a well-known open and hard problem that hinders the usability of finite-memory strategies. We solve this issue by developing a general method that iteratively changes the memory assignment. Our algorithm can be used in connection with any black-box strategy optimization tool. We evaluate our method on various experiments and show its robustness by solving instances of various patrolling models.
[410] MRS: Multi-Resolution Skills for HRL Agents
Shashank Sharma, Janina Hoffmann, Vinay Namboodiri
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hierarchical reinforcement learning (HRL) decomposes the policy into a manager and a worker, enabling long-horizon planning but introducing a performance gap on tasks requiring agility. We identify a root cause: in subgoal-based HRL, the manager’s goal representation is typically learned without constraints on reachability or temporal distance from the current state, preventing precise local subgoal selection. We further show that the optimal subgoal distance is both task- and state-dependent: nearby subgoals enable precise control but amplify prediction noise, while distant subgoals produce smoother motion at the cost of geometric precision. We propose Multi-Resolution Skills (MRS), which learns multiple goal-prediction modules each specialized to a fixed temporal horizon, with a jointly trained meta-controller that selects among them based on the current state. MRS consistently outperforms fixed-resolution baselines and significantly reduces the performance gap between HRL and non-HRL state-of-the-art on DeepMind Control Suite, Gym-Robotics, and long-horizon AntMaze tasks. [Project page: https://sites.google.com/view/multi-res-skills/home]
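A toy sketch of the MRS idea, several goal-prediction heads each tied to a fixed temporal horizon plus a meta-controller that picks among them from the current state; dimensions, horizons, and the argmax selection rule are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiResolutionManager(nn.Module):
    """Toy manager: K goal-prediction heads, each tied to a fixed temporal horizon,
    and a meta-controller that chooses which head to use from the current state."""
    def __init__(self, state_dim: int, goal_dim: int, horizons=(5, 15, 45)):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(state_dim, goal_dim) for _ in horizons])
        self.meta = nn.Linear(state_dim, len(horizons))   # one score per resolution
        self.horizons = horizons

    def forward(self, state: torch.Tensor):
        idx = self.meta(state).argmax(dim=-1)                              # chosen resolution
        goals = torch.stack([head(state) for head in self.heads], dim=1)  # (B, K, goal_dim)
        subgoal = goals[torch.arange(state.size(0)), idx]                  # (B, goal_dim)
        return subgoal, idx   # subgoal for the worker plus which horizon was selected
```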
[411] SEAT: Sparse Entity-Aware Tuning for Knowledge Adaptation while Preserving Epistemic Abstention
William F. Shen, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Adapting LLMs with new knowledge is increasingly important, but standard fine-tuning often erodes aligned epistemic abstention: the ability to acknowledge when the model does not know. This failure mode is especially concerning in high-stakes settings, where abstention is a critical safeguard against hallucination. We present SEAT, a preventive fine-tuning method that preserves epistemic abstention while maintaining strong knowledge acquisition. SEAT combines sparse tuning, which constrains global activation drift, with entity-perturbed KL regularization, which sharpens local epistemic boundaries and prevents spillover to neighboring knowledge. Crucially, SEAT requires no alignment data, explicit boundary probing, or post-hoc re-alignment, making it attractive for lightweight and privacy-sensitive adaptation. Across models and datasets, SEAT improves human-evaluated abstention on unknown queries by 18%-101% over the strongest baseline while retaining near-perfect target knowledge acquisition, and produces coherent, context-aware abstentions after tuning. Further analyses show that both components are essential, that SEAT more cleanly separates known from unknown queries in representation space, and that it preserves downstream utility. These results identify preservation of epistemic abstention as a core objective for safe knowledge adaptation.
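A sketch of the entity-perturbed KL regularizer combined with a task loss, in PyTorch; the KL direction, the weighting, and the assumption that sparse tuning is realized by freezing most parameters are guesses for illustration, not SEAT's exact formulation:

```python
import torch
import torch.nn.functional as F

def entity_perturbed_kl(student_logits: torch.Tensor,
                        frozen_logits: torch.Tensor) -> torch.Tensor:
    """KL between the tuned model and the frozen base model on an entity-perturbed
    copy of the training query, so behavior on neighboring entities stays anchored.
    Direction KL(student || frozen) is an assumption; the paper may use the reverse."""
    log_student = F.log_softmax(student_logits, dim=-1)
    log_frozen = F.log_softmax(frozen_logits, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input).
    return F.kl_div(log_frozen, log_student, log_target=True, reduction="batchmean")

def seat_style_loss(task_loss: torch.Tensor,
                    student_pert_logits: torch.Tensor,
                    frozen_pert_logits: torch.Tensor,
                    lam: float = 0.5) -> torch.Tensor:
    # New-knowledge acquisition loss plus the epistemic-boundary regularizer;
    # sparse tuning itself would be realized by leaving most parameters frozen.
    return task_loss + lam * entity_perturbed_kl(student_pert_logits, frozen_pert_logits)
```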
[412] GRAIL: Learning to Interact with Large Knowledge Graphs for Retrieval Augmented Reasoning
Ge Chang, Jinbo Su, Jiacheng Liu, Pengfei Yang, Yuhao Shang, Huiwen Zheng, Hongli Ma, Yan Liang, Yuanchun Li, Yunxin Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) integrated with Retrieval-Augmented Generation (RAG) techniques have exhibited remarkable performance across a wide range of domains. However, existing RAG approaches primarily operate on unstructured data and demonstrate limited capability in handling structured knowledge such as knowledge graphs. Meanwhile, current graph retrieval methods fundamentally struggle to capture holistic graph structures while simultaneously facing precision control challenges that manifest as either critical information gaps or excessive redundant connections, collectively undermining reasoning performance. To address this challenge, we propose GRAIL: Graph-Retrieval Augmented Interactive Learning, a framework designed to interact with large-scale graphs for retrieval-augmented reasoning. Specifically, GRAIL integrates LLM-guided random exploration with path filtering to establish a data synthesis pipeline, where a fine-grained reasoning trajectory is automatically generated for each task. Based on the synthesized data, we then employ a two-stage training process to learn a policy that dynamically decides the optimal actions at each reasoning step. The overall objective of precision-conciseness balance in graph retrieval is decoupled into fine-grained process-supervised rewards to enhance data efficiency and training stability. In practical deployment, GRAIL adopts an interactive retrieval paradigm, enabling the model to autonomously explore graph paths while dynamically balancing retrieval breadth and precision. Extensive experiments have shown that GRAIL achieves an average accuracy improvement of 21.01% and F1 improvement of 22.43% on three knowledge graph question-answering datasets. Our source code and datasets are available at https://github.com/Changgeww/GRAIL.
[413] GeoLaux: A Benchmark for Evaluating MLLMs’ Geometry Performance on Long-Step Problems Requiring Auxiliary Lines
Yumeng Fu, Jiayin Zhu, Lingling Zhang, Wenjun Wu, Bo Zhao, Shaoxuan Ma, Yushun Zhang, Jun Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Geometry problem solving (GPS) poses significant challenges for Multimodal Large Language Models (MLLMs) in diagram comprehension, knowledge application, long-step reasoning, and auxiliary line construction. However, current benchmarks lack fine-grained evaluation for long-step problems necessitating auxiliary construction. To address these limitations, we present GeoLaux, a fine-grained annotated dataset comprising 2186 calculation and proof problems. It features long-step reasoning (with an average solution length of 6.51 steps, maximum of 24 steps) and auxiliary line construction (required in 41.8% of problems). Building on the dataset, we conduct a comprehensive five-dimensional evaluation of 23 leading MLLMs. The evaluation yields three pivotal findings: First, models perform significantly worse on long-step problems compared to short-step ones, with 18 models exhibiting a performance drop of over 50%. Second, it is crucial to enhance models’ understanding, awareness, and proficiency in auxiliary line construction, which is vital for overall geometric reasoning. Third, limited answer hints effectively improve process correctness, whereas explicit answers lead models to neglect intermediate reasoning steps. These findings position GeoLaux both to benchmark MLLMs’ geometry reasoning abilities and to guide their improvement. Data and code are available at https://github.com/Candice-yu/GeoLaux
[414] VideoAgent: Personalized Synthesis of Scientific Videos
Xiao Liang, Bangxin Li, Zixuan Chen, Hanyue Zheng, Zhi Ma, Di Wang, Cong Tian, Quan Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2509.11253 failed with HTTP 429).
[415] Plug-and-Play Dramaturge: A Divide-and-Conquer Approach for Iterative Narrative Script Refinement via Collaborative LLM Agents
Wenda Xie, Chao Guo, Yanqing Jing, Junle Wang, Yisheng Lv, Fei-Yue Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.05188 failed with HTTP 429).
[416] StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis
Jiayi Mao, Liqun Li, Yanjie Gao, Zegang Peng, Shilin He, Chaoyun Zhang, Si Qin, Samia Khalid, Qingwei Lin, Saravan Rajmohan, Sitaram Lanka, Dongmei Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.10074 failed with HTTP 429).
[417] Chain-of-Thought as a Lens: Evaluating Structured Reasoning Alignment between Human Preferences and Large Language Models
Boxuan Wang, Zhuoyun Li, Xinmiao Huang, Xiaowei Huang, Yi Dong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2511.06168 failed with HTTP 429).
[418] TROJail: Trajectory-Level Optimization for Multi-Turn Large Language Model Jailbreaks with Process Rewards
Xiqiao Xiong, Ouxiang Li, Zhuo Liu, Moxin Li, Wentao Shi, Fengbin Zhu, Qifan Wang, Fuli Feng
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2512.07761 failed with HTTP 429).
[419] Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks
Xiang Cheng, Yulan Hu, Xiangwen Zhang, Lu Xu, Lide Tan, Zheng Pan, Xin Li, Yong Liu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2512.22673 failed with HTTP 429).
[420] Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation
Dongyi Lv, Qiuyu Ding, Heng-Da Xu, Zhaoxu Sun, Zhi Wang, Feng Xiong, Mu Xu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.04562 failed with HTTP 429).
[421] Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Bo Shu, Yiting Zhang, Saisai Hu, Dong Shu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2403.10559 failed with HTTP 429).
[422] BAPO: Boundary-Aware Policy Optimization for Reliable Agentic Search
Shiyu Liu, Yongjing Yin, Jianhao Yan, Yunbo Tang, Qinggang Zhang, Bei Li, Xin Chen, Jingang Wang, Xunliang Cai, Jinsong Su
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.11037 failed with HTTP 429).
[423] Failure Modes in Multi-Hop QA: The Weakest Link Effect and the Recognition Bottleneck
Meiru Zhang, Zaiqiao Meng, Nigel Collier
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.12499 failed with HTTP 429).
[424] Remote Rowhammer Attack using Adversarial Observations on Federated Learning Clients
Jinsheng Yuan, Yuhang Hao, Weisi Guo, Yun Wu, Chongyan Gu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2505.06335 failed with HTTP 429).
[425] Right for the Wrong Reasons: Epistemic Regret Minimization for LLM Causal Reasoning
Edward Y. Chang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2602.11675 failed with HTTP 429).
[426] Mind the (DH) Gap! A Contrast in Risky Choices Between Reasoning and Conversational LLMs
Luise Ge, Yongyan Zhang, Yevgeniy Vorobeychik
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2602.15173 failed with HTTP 429).
[427] Best Agent Identification for General Game Playing
Matthew Stephenson, Alex Newcombe, Eric Piette, Dennis Soemers
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2507.00451 failed with HTTP 429).
[428] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
Zhengyang Qi, Charles Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Frederic Sala, Paroma Varma
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.01375 failed with HTTP 429).
[429] DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
Hao Yan, Yuliang Liu, Xingchen Liu, Yuyi Zhang, Minghui Liao, Jihao Wu, Wei Chen, Xiang Bai
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.12812 failed with HTTP 429).
[430] Autogenesis: A Self-Evolving Agent Protocol
Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Ming Yin, Bo An, Mengdi Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.15034 failed with HTTP 429).
[431] Machine individuality: Separating genuine idiosyncrasy from response bias in large language models
Valentin Kriegmair, Dirk U. Wulff
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.16755 failed with HTTP 429).
[432] EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
Xinyu Zhu, Yuzhu Cai, Zexi Liu, Cheng Wang, Fengyang Li, Wenkai Jin, Wanxu Liu, Zehao Bing, Bingyang Zheng, Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xianghe Pang, Yaxin Du, Tingjia Miao, Yuzhi Zhang, Ruoxue Liao, Zhaohan Ding, Linfeng Zhang, Yanfeng Wang, Weinan E, Siheng Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.17406 failed with HTTP 429).
[433] EHRAG: Bridging Semantic Gaps in Lightweight GraphRAG via Hybrid Hypergraph Construction and Retrieval
Yifan Song, Xingjian Tao, Zhicheng Yang, Yihong Luo, Jing Tang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.17458 failed with HTTP 429).
[434] How does the optimizer implicitly bias the model merging loss landscape?
Chenxiang Zhang, Alexander Theus, Damien Teney, Antonio Orvieto, Jun Pang, Sjouke Mauw
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.04686 failed with HTTP 429).
[435] WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent
Lingfeng Zhang, Yongan Sun, Jinpeng Hu, Hui Ma, Yang Ying, Kuien Liu, Zenglin Shi, Meng Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.17821 failed with HTTP 429).
[436] Towards Generalization of Graph Neural Networks for AC Optimal Power Flow
Olayiwola Arowolo, Jochen L. Cremer
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.06860 failed with HTTP 429).
[437] Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion
Terry Leitch
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.18566 failed with HTTP 429).
[438] Agentic Forecasting using Sequential Bayesian Updating of Linguistic Beliefs
Kevin Murphy
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2604.18576 failed with HTTP 429).
[439] Unifying Controller Design for Stabilizing Nonlinear Systems with Norm-Bounded Control Inputs
Ming Li, Zhiyong Sun, Siep Weiland
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2403.03030 failed with HTTP 429).
[440] Enabling Vibration-Based Gesture Recognition on Everyday Furniture via Energy-Efficient FPGA Implementation of 1D Convolutional Networks
Koki Shibata, Tianheng Ling, Chao Qian, Tomokazu Matsui, Hirohiko Suwa, Keiichi Yasumoto, Gregor Schiele
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.23156 failed with HTTP 429).
[441] Towards Auto-Building of Embedded FPGA-based Soft Sensors for Wastewater Flow Estimation
Tianheng Ling, Chao Qian, Gregor Schiele
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2407.05102 failed with HTTP 429).
[442] Multiclass Local Calibration with the Jensen-Shannon Distance
Cesare Barbera, Lorenzo Perini, Giovanni De Toni, Andrea Passerini, Andrea Pugnana
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.26566 failed with HTTP 429).
[443] Idle is the New Sleep: Configuration-Aware Alternative to Powering Off FPGA-Based DL Accelerators During Inactivity
Chao Qian, Christopher Cichiwskyj, Tianheng Ling, Gregor Schiele
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2407.12027 failed with HTTP 429).
[444] Who Benefits from AI? Self-Selection, Skill Gap, and the Hidden Costs of AI Feedback
Christoph Riedl, Eric Bogert
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2409.18660 failed with HTTP 429).
[445] Decomposed Trust: Privacy, Adversarial Robustness, Ethics, and Fairness in Low-Rank LLMs
Daniel Agyei Asante, Md Mokarram Chowdhury, Yang Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2511.22099 failed with HTTP 429).
[446] Benchmarking Misuse Mitigation Against Covert Adversaries
Davis Brown, Mahdi Sabbaghi, Luze Sun, Alexander Robey, George J. Pappas, Eric Wong, Hamed Hassani
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2506.06414 failed with HTTP 429).
[447] Graph Data Augmentation with Contrastive Learning on Covariate Distribution Shift
Fanlong Zeng, Wensheng Gan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2512.00716 failed with HTTP 429).
[448] End-to-End Large Portfolio Optimization for Variance Minimization with Neural Networks through Covariance Cleaning
Christian Bongiorno, Efstratios Manolakis, Rosario Nunzio Mantegna
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2507.01918 failed with HTTP 429).
[449] Fine-Tuning Code Language Models to Detect Cross-Language Bugs
Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2507.21954 failed with HTTP 429).
[450] Prompt to Pwn: Automated Exploit Generation for Smart Contracts
ZeKe Xiao, Qin Wang, Yuekang Li, Shiping Chen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2508.01371 failed with HTTP 429).
[451] Quantum spatial best-arm identification via quantum walks
Tomoki Yamagami, Etsuo Segawa, Takatomo Mihana, André Röhm, Atsushi Uchida, Ryoichi Horisaki
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2509.05890 failed with HTTP 429).
[452] SpecAgent: A Speculative Retrieval and Forecasting Agent for Code Completion
George Ma, Anurag Koul, Qi Chen, Yawen Wu, Sachit Kuhar, Yu Yu, Aritra Sengupta, Varun Kumar, Murali Krishna Ramanathan
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2510.17925 failed with HTTP 429).
[453] Knowledge-Guided Time-Varying Causal Inference for Arctic Sea Ice Dynamics
Akila Sampath, Vandana Janeja, Jianwu Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (arXiv summary request for 2601.17647 failed with HTTP 429).
[454] ODMA: On-Demand Memory Allocation Strategy for LLM Serving on LPDDR-Class Accelerators
Guoqiang Zou, Wanyu Wang, Hao Zheng, Longxiang Yin, Yinhe Han
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2512.09427 failed with HTTP 429).
[455] QSLM: A Performance- and Memory-aware Quantization Framework with Tiered Search Strategy for Spike-driven Language Models
Rachmad Vidya Wicaksana Putra, Pasindu Wickramasinghe, Muhammad Shafique
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2601.00679 failed with HTTP 429).
[456] Bootstrapping Code Translation with Weighted Multilanguage Exploration
Yuhan Wu, Huan Zhang, Wei Cheng, Chen Shen, Jingyue Yang, Wei Hu
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2601.03512 failed with HTTP 429).
[457] On the Spatiotemporal Dynamics of Generalization in Neural Networks
Zichao Wei
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.01651 failed with HTTP 429).
[458] See2Refine: Vision-Language Feedback Improves LLM-Based eHMI Action Designers
Ding Xia, Xinyue Gui, Mark Colley, Fan Gao, Zhongyi Zhou, Dongyuan Li, Renhe Jiang, Takeo Igarashi
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.02063 failed with HTTP 429).
[459] Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps
Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.05993 failed with HTTP 429).
[460] Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models
Freja Høgholm Petersen, Jesper Sandvig Mariegaard, Rocco Palmitessa, Allan P. Engsig-Karup
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.05416 failed with HTTP 429).
[461] Chimera: Neuro-Symbolic Attention Primitives for Trustworthy Dataplane Intelligence
Rong Fu, Xiaowen Ma, Kun Liu, Wangyu Wu, Ziyu Kong, Jia Yee Tan, Tailong Luo, Xianda Li, Zeli Su, Youjin Wang, Yongtai Liu, Simon Fong
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.12851 failed with HTTP 429).
[462] Debug2Fix: Can Interactive Debugging Help Coding Agents Fix More Bugs?
Spandan Garg, Yufan Huang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.18571 failed with HTTP 429).
[463] PhysMem: Scaling Test-time Physical Memory for Robot Manipulation
Haoyang Li, Yang You, Hao Su, Leonidas Guibas
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2602.20323 failed with HTTP 429).
[464] Bridging the High-Frequency Data Gap: A Millisecond-Resolution Network Dataset for Advancing Time Series Foundation Models
Subina Khanal, Seshu Tirupathi, Merim Dzaferagic, Marco Ruffini, Torben Bach Pedersen
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.16497 failed with HTTP 429).
[465] Reinforced Generation of Combinatorial Structures: Ramsey Numbers
Ansh Nagda, Prabhakar Raghavan, Abhradeep Thakurta
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.09172 failed with HTTP 429).
[466] Adapting Dijkstra for Buffers and Unlimited Transfers
Denys Katkalo, Andrii Rohovyi, Toby Walsh
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.11729 failed with HTTP 429).
[467] Early Pruning for Public Transport Routing
Andrii Rohovyi, Abdallah Abuaisha, Toby Walsh
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.12592 failed with HTTP 429).
[468] ExpertGen: Scalable Sim-to-Real Expert Policy Learning from Imperfect Behavior Priors
Zifan Xu, Ran Gong, Maria Vittoria Minniti, Ahmet Salih Gundogdu, Eric Rosen, Kausik Sivakumar, Riedana Yan, Zixing Wang, Di Deng, Peter Stone, Xiaohan Zhang, Karl Schmeckpeper
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.15956 failed with HTTP 429).
[469] GAIN: Multiplicative Modulation for Domain Adaptation
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.04516 failed with HTTP 429).
[470] The data heat island effect: quantifying the impact of AI data centers in a warming world
Andrea Marinoni, Erik Cambria, Weisi Lin, Mauro Dalla Mura, Jocelyn Chanussot, Edoardo Ragusa, Chi Yan Tso, Yihao Zhu, Benjamin Horton
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2603.20897 failed with HTTP 429).
[471] MoBiE: Efficient Inference of Mixture of Binary Experts under Post-Training Quantization
Zhixiong Zhao, Zukang Xu, Zhixuan Chen, Dawei Yang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.06798 failed with HTTP 429).
[472] Semantic Intent Fragmentation: A Single-Shot Compositional Attack on Multi-Agent AI Pipelines
Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Yoonpyo Lee, Syed Bahauddin Alam, Sajedul Talukder
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.08608 failed with HTTP 429).
[473] THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture
Augustus Haoyang Li
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.11284 failed with HTTP 429).
[474] Catching Every Ripple: Enhanced Anomaly Awareness via Dynamic Concept Adaptation
Jiaqi Zhu, Shaofeng Cai, Jie Chen, Fang Deng, Beng Chin Ooi, Wenqiao Zhang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.14726 failed with HTTP 429).
[475] Beyond the ‘Diff’: Addressing Agentic Entropy in Agentic Software Development
Matteo Casserini, Alessandro Facchini, Andrea Ferrario
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.16323 failed with HTTP 429).
[476] SCATR: Simple Calibrated Test-Time Ranking
Divya Shyamal, Marta Knežević, Lan Tran, Chanakya Ekbote, Vijay Lingam, Paul Pu Liang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.16535 failed with HTTP 429).
[477] Beyond the Bellman Fixed Point: Geometry and Fast Policy Identification in Value Iteration
Donghwan Lee
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.17457 failed with HTTP 429).
[478] LEPO: Latent Reasoning Policy Optimization for Large Language Models
Yuyan Zhou, Jiarui Yu, Hande Dong, Zhezheng Hao, Hong Wang, Jianqing Zhang, Qiang Lin
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.17892 failed with HTTP 429).
[479] Learning the Riccati solution operator for time-varying LQR via Deep Operator Networks
Jun Chen, Umberto Biccari, Junmin Wang
Main category: cs.AI
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Not available (fetch for arXiv 2604.18507 failed with HTTP 429).
cs.SD
[480] A Complementary Visualisation Suite for Empirical Performance Analysis: Tempographs, Histograms, Ridgeline Plots, Stacked Bar Charts, and Combination Charts Applied to Beethoven’s Piano and Cello Sonatas
Ignasi Sole
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The choice of visualisation in empirical performance analysis is not a neutral presentation decision but an analytical one: different graphical forms reveal different features of the same dataset, and reliance on any single type systematically conceals what the others expose. This paper presents and argues for a suite of five complementary visualisation tools: tempographs, histograms with spline-smoothed probability density functions, ridgeline plots, stacked bar charts, and combination charts. These are applied to bar-level beats-per-minute data from recordings of Beethoven’s five piano and cello sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2) spanning 1930–2012. Each tool is described formally, its analytical properties characterised, its implementation detailed in working Python and MATLAB code, and its specific contribution demonstrated on a worked example using two recordings of Op. 5 No. 1 (Casals/Horszowski 1930–39 and Isserlis/Levin 2012) separated by eight decades. A five-panel composite figure applies all five tools to the same two recordings simultaneously, making the complementarity argument concrete: the tempograph reveals moment-to-moment structural parallels invisible in aggregate statistics; the spline-smoothed histogram exposes bimodality and secondary peaks suppressed by binning artefacts; the ridgeline plot positions both recordings within the full distributional space; the stacked bar chart shows divergent sectional pacing concealed by identical movement means; and the combination chart integrates mean tempo, variability, and historical reference marks in a single view. The spline-CDF smoothing method, applied to histogram data via cubic spline interpolation with zero-slope boundary conditions, is presented as a novel contribution to the performance analysis toolkit. Full implementation code is publicly available.
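To make the spline-CDF smoothing idea concrete, here is a minimal sketch of the described approach (cubic spline interpolation of the empirical CDF with zero-slope boundary conditions, then differentiation to obtain a smooth PDF). It assumes SciPy's clamped cubic splines; function and variable names are illustrative, not taken from the paper's released code.
```python
# Minimal sketch of spline-CDF smoothing for tempo histograms:
# fit a clamped (zero-slope endpoints) cubic spline to the empirical CDF
# of bar-level BPM values, then differentiate to obtain a smooth PDF.
import numpy as np
from scipy.interpolate import CubicSpline

def spline_smoothed_pdf(bpm, n_bins=30):
    counts, edges = np.histogram(bpm, bins=n_bins)
    cdf = np.concatenate(([0.0], np.cumsum(counts) / counts.sum()))
    # 'clamped' imposes a zero first derivative at both boundary knots
    spline = CubicSpline(edges, cdf, bc_type="clamped")
    grid = np.linspace(edges[0], edges[-1], 500)
    pdf = np.clip(spline(grid, 1), 0.0, None)  # derivative of the smoothed CDF
    return grid, pdf

# Toy usage with synthetic, bimodal bar-level tempi
rng = np.random.default_rng(0)
bpm = np.concatenate([rng.normal(60, 4, 200), rng.normal(72, 3, 120)])
grid, pdf = spline_smoothed_pdf(bpm)
```
Working through the CDF rather than the raw histogram is what suppresses binning artefacts while still allowing secondary peaks to survive in the differentiated density.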
[481] Towards Revised Tempo Indications for Beethoven’s Piano and Cello Sonatas: Czerny, Moscheles, Kolisch, and Recorded Practice 1930-2012
Ignasi Sole
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Historical metronome indications for Beethoven’s five piano and cello sonatas (as transmitted by Czerny, Moscheles, and Kolisch) have long been regarded as problematic by performers and scholars alike. This paper presents the first systematic empirical assessment of those indications against a corpus of over one hundred movement-level recordings spanning 1930–2012, encompassing first, second, and third movements across all five sonatas (Op. 5 Nos. 1 and 2; Op. 69; Op. 102 Nos. 1 and 2). The core findings are threefold. First, Czerny’s and Moscheles’s markings are consistently and substantially exceeded by the entire recording corpus: gaps of 15–39% are documented across movements, with the largest divergences in slow Adagio movements and the smallest in fast Allegro finales. Second, Kolisch’s 1943 markings align considerably more closely with recorded practice than either Czerny’s or Moscheles’s, a striking result given that Kolisch was reasoning without corpus data. Third, the central Allegro tempo traditions for each movement are stable across eight decades, not because all performers play alike, but because three coexisting slow, mid-range, and fast traditions persist simultaneously, with the mid-range dominant throughout. Building on these findings, this paper proposes a set of revised tempo indications grounded in the statistical modal tempi of the corpus, presented as ranges reflecting the documented spectrum of expert interpretive practice rather than single prescriptive values. These indications are offered not as claims about Beethoven’s intentions but as evidence-based reference points for performers and scholars navigating the gap between historical prescription and performable reality.
[482] DASB - Discrete Audio and Speech Benchmark
Pooneh Mousavi, Jarod Duret, Darius Petermann, Artem Ploujnikov, Luca Della Libera, Anastasia Kuznetsova, Cem Subakan, Mirco Ravanelli
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Discrete audio tokens have recently gained considerable attention for their potential to bridge audio and language processing, enabling multimodal language models that can both generate and understand audio. However, preserving key information such as phonetic content, speaker identity, and paralinguistic cues remains a major challenge. Identifying the optimal tokenizer and configuration is further complicated by inconsistent evaluation settings across existing studies. To address this, we introduce the Discrete Audio and Speech Benchmark (DASB), a comprehensive framework for benchmarking discrete audio tokens across speech, general audio, and music domains on a range of discriminative and generative tasks. Our results show that discrete representations are less robust than continuous ones and require careful tuning of factors such as model architecture, data size, learning rate, and capacity. Semantic tokens generally outperform acoustic tokens, but a gap remains between discrete tokens and continuous features, highlighting the need for further research. DASB codes, evaluation setup, and leaderboards are publicly available at https://poonehmousavi.github.io/DASB-website/.
[483] Virtual boundary integral neural network for three-dimensional exterior acoustic problems
Jiahao Li, Qiang Xi, Ilia Marchevskiy, Zhuojia Fu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a virtual boundary integral neural network (VBINN) for exterior acoustic problems in three dimensions. The method introduces a virtual boundary inside the scatterer or vibrating body and represents the associated source density with a neural network. Coupled with the acoustic fundamental solution, this representation satisfies the Sommerfeld radiation condition by construction and enables direct evaluation of the acoustic pressure and its normal derivative at arbitrary field points. Because the integration surface is separated from the physical boundary, the formulation avoids the singular and near singular kernel evaluations associated with coincident source and collocation points in conventional boundary integral learning methods. To reduce sensitivity to boundary placement, the geometric parameters of the virtual boundary are optimized jointly with the source density during training. Numerical examples for acoustic scattering, multiple body interaction, and underwater acoustic propagation show close agreement with analytical solutions and COMSOL results, and the Burton Miller extension further improves stability near characteristic frequencies. These results demonstrate the potential of VBINN for exterior acoustic analysis in three dimensions.
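One plausible reading of the construction, sketched here in hedged form (the notation $\sigma_\theta$ for the network-parameterized source density is ours, and the exact layer potential and Burton–Miller coupling should be taken from the paper), represents the radiated pressure as a superposition of fundamental solutions sourced on the virtual boundary $\Gamma_v$:
$$p(\mathbf{x}) \;=\; \int_{\Gamma_v} \sigma_\theta(\mathbf{y})\, \frac{e^{\mathrm{i}k\lvert\mathbf{x}-\mathbf{y}\rvert}}{4\pi\lvert\mathbf{x}-\mathbf{y}\rvert}\, \mathrm{d}\Gamma(\mathbf{y}).$$
Because the free-space Helmholtz kernel itself satisfies the Sommerfeld radiation condition, so does any such superposition; training then amounts to minimizing the boundary-condition residual on the physical surface, with $\Gamma_v$ kept strictly inside the body so that $\lvert\mathbf{x}-\mathbf{y}\rvert$ never vanishes at collocation points.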
[484] APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
Deshui Miao, Yameng Gu, Chao Yang, Xin Li, Haijun Zhang, Ming-Hsuan Yang
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This report presents an Audio-aware Referring Video Object Segmentation (Ref-VOS) pipeline tailored to the MEVIS_Audio setting, where the referring expression is provided in spoken form rather than as clean text. Compared with a standard Sa2VA-based Ref-VOS pipeline, the proposed system introduces two additional front-end stages: speech transcription and visual existence verification. Specifically, we first employ VibeVoice-ASR to convert long-form spoken input into a structured textual transcript. Since audio-derived queries are inherently noisy and may describe entities that are not visually present in the video, we then introduce an Omni-based judgment module to determine whether the transcribed target can be grounded in the visual content. If the target is judged to be absent, the pipeline terminates early and outputs all-zero masks. Otherwise, the transcript is transformed into a segmentation-oriented prompt and fed into Sa2VA to obtain a coarse mask trajectory over the full video. Importantly, this trajectory is treated as an initial semantic hypothesis rather than a final prediction. On top of it, an agentic refinement layer evaluates query reliability, temporal relevance, anchor quality, and potential error sources, and may invoke SAM3 to improve spatial boundary precision and temporal consistency. The resulting framework explicitly decomposes the MEVIS_Audio task into audio-to-text conversion, visual existence verification, coarse video segmentation, and agent-guided refinement. Such a staged design is substantially more appropriate for audio-conditioned Ref-VOS than directly sending noisy ASR outputs into a segmentation model.
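The staged decomposition can be summarized with the control-flow sketch below. Every function it takes is a hypothetical placeholder standing in for the components named in the abstract (VibeVoice-ASR, the Omni-based judge, Sa2VA, and the SAM3-backed agentic refiner); none of these names are real APIs.
```python
# Illustrative control flow only: all injected functions are hypothetical
# stand-ins for the pipeline stages described in the abstract.
import numpy as np

def audio_ref_vos(video_frames, audio_query,
                  transcribe_audio, target_is_visible,
                  sa2va_segment, agentic_refine):
    """Staged audio-conditioned Ref-VOS pipeline (sketch)."""
    transcript = transcribe_audio(audio_query)            # speech -> structured text
    if not target_is_visible(video_frames, transcript):   # existence verification
        h, w = video_frames[0].shape[:2]
        return [np.zeros((h, w), dtype=np.uint8) for _ in video_frames]
    coarse_masks = sa2va_segment(video_frames, transcript)        # initial hypothesis
    return agentic_refine(video_frames, transcript, coarse_masks)  # refinement layer
```
Passing the stage implementations in as arguments mirrors the point of the design: each stage can be swapped or skipped without retraining the segmentation backbone.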
[485] Comparison of sEMG Encoding Accuracy Across Speech Modes Using Articulatory and Phoneme Features
Chenqian Le, Ruisi Li, Beatrice Fumagalli, Xupeng Chen, Amirhossein Khalilian-Gourtani, Tianyu He, Adeen Flinker, Yao Wang
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We test whether Speech Articulatory Coding (SPARC) features can linearly predict surface electromyography (sEMG) envelopes across aloud, mimed, and subvocal speech in twenty-four subjects. Using elastic-net multivariate temporal response function (mTRF) with sentence-level cross-validation, SPARC yields higher prediction accuracy than phoneme one-hot representations on nearly all electrodes and in all speech modes. Aloud and mimed speech perform comparably, and subvocal speech remains above chance, indicating detectable articulatory activity. Variance partitioning shows a substantial unique contribution from SPARC and a minimal unique contribution from phoneme features. mTRF weight patterns reveal anatomically interpretable relationships between electrode sites and articulatory movements that remain consistent across modes. This study focuses on representation/encoding analysis (not end-to-end decoding) and supports SPARC as a robust and interpretable intermediate target for sEMG-based silent-speech modeling.
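The core encoding analysis is a lagged linear regression. Here is a minimal, runnable sketch of an elastic-net mTRF fit for one electrode; the feature dimensions, lag range, and regularization values are illustrative placeholders rather than the paper's settings, and the random arrays stand in for SPARC trajectories and a rectified sEMG envelope.
```python
# Minimal mTRF-style sketch: predict one sEMG envelope channel from
# time-lagged articulatory (SPARC-like) features with an elastic net.
import numpy as np
from sklearn.linear_model import ElasticNet

def lagged_design(X, lags):
    """Stack copies of X shifted by each lag (rows: time, cols: feature x lag)."""
    T, F = X.shape
    cols = []
    for lag in lags:
        shifted = np.zeros((T, F))
        if lag >= 0:
            shifted[lag:] = X[:T - lag] if lag > 0 else X
        else:
            shifted[:lag] = X[-lag:]
        cols.append(shifted)
    return np.hstack(cols)

rng = np.random.default_rng(0)
T, F = 2000, 12                      # samples, articulatory features
X = rng.standard_normal((T, F))      # stand-in for SPARC trajectories
y = rng.standard_normal(T)           # stand-in for one rectified sEMG envelope

lags = range(-5, 21)                 # e.g. -50 ms to +200 ms at 100 Hz
model = ElasticNet(alpha=0.1, l1_ratio=0.5)
model.fit(lagged_design(X, lags), y)
weights = model.coef_.reshape(len(lags), F)   # lag-by-feature TRF weights
```
Cross-validating at the sentence level, as the abstract describes, simply means grouping rows of the lagged design matrix by sentence before splitting.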
[486] Tadabur: A Large-Scale Quran Audio Dataset
Faisal Alherran
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite growing interest in Quranic data research, existing Quran datasets remain limited in both scale and diversity. To address this gap, we present Tadabur, a large-scale Quran audio dataset. Tadabur comprises more than 1,400 hours of recitation audio from over 600 distinct reciters, providing substantial variation in recitation styles, vocal characteristics, and recording conditions. This diversity makes Tadabur a comprehensive and representative resource for Quranic speech research and analysis. By significantly expanding both the total duration and variability of available Quran data, Tadabur aims to support future research and facilitate the development of standardized Quranic speech benchmarks.
[487] ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis
Aoduo Li, Haoran Lv, Shengmin Li, Sihao Qin, Hongjian Xu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: High-fidelity character voice synthesis is a cornerstone of immersive multimedia applications, particularly for interacting with anime avatars and digital humans. However, existing systems struggle to maintain consistent persona traits across diverse emotional contexts. To bridge this gap, we present ATRIE, a unified framework utilizing a Persona-Prosody Dual-Track (P2-DT) architecture. Our system disentangles generation into a static Timbre Track (via Scalar Quantization) and a dynamic Prosody Track (via Hierarchical Flow-Matching), distilled from a 14B LLM teacher. This design enables robust identity preservation (Zero-Shot Speaker Verification EER: 0.04) and rich emotional expression. Evaluated on our extended AnimeTTS-Bench (50 characters), ATRIE achieves state-of-the-art performance in both generation and cross-modal retrieval (mAP: 0.75), establishing a new paradigm for persona-driven multimedia content creation.
[488] Audio Spoof Detection with GaborNet
Waldek Maciejko
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: One direction of development in audio feature extraction is based on processing raw samples in the time domain. Such an approach appears to be effective, especially in the era of neural networks. An example is SincNet, in which the core of the neural network layer is a set of sinc functions convolved with the input signal. Due to the finite length of the sinc functions, distortions appear in the frequency domain of the convolved signal, just as when the signal is windowed. Recently, a new approach has been developed that replaces the sinc functions with Gabor filters. Because the filter outputs are complex-valued, further modifications are required, such as the squared modulus or Gaussian Lowpass Pooling. In this work, an ingestion layer based on a bank of Gabor filters, named GaborNet, and its modifications are examined in depth within the popular RawNet2 and RawGAT-ST architectures, which were developed for audio spoof detection. A further issue investigated is audio augmentation using codec conversion, room responses, and additive noise.
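For intuition, the sketch below shows what such a Gabor-filter ingestion layer computes on a raw waveform: a bank of complex Gabor filters, the squared modulus of each filter output, and a Gaussian low-pass pooling stage. All filter parameters (centers, bandwidths, pooling width) are illustrative, not the values used in the paper.
```python
# Sketch of a Gabor-filter front end on raw audio: convolve the waveform
# with a bank of complex Gabor filters, take the squared modulus, and apply
# Gaussian low-pass pooling to obtain a frame-like feature map.
import numpy as np
from scipy.signal import fftconvolve
from scipy.ndimage import gaussian_filter1d

def gabor_kernel(center_hz, bandwidth_hz, sr, length=401):
    t = (np.arange(length) - length // 2) / sr
    sigma = 1.0 / (2 * np.pi * bandwidth_hz)                 # time-domain width
    envelope = np.exp(-0.5 * (t / sigma) ** 2)
    return envelope * np.exp(2j * np.pi * center_hz * t)      # complex carrier

def gabor_frontend(x, sr=16000, n_filters=40, pool_sigma=40):
    centers = np.linspace(60, sr / 2 - 500, n_filters)        # Hz
    bandwidths = np.maximum(centers / 8, 50.0)                 # Hz
    feats = []
    for c, b in zip(centers, bandwidths):
        y = fftconvolve(x, gabor_kernel(c, b, sr), mode="same")
        power = np.abs(y) ** 2                                 # squared modulus
        feats.append(gaussian_filter1d(power, sigma=pool_sigma))
    return np.stack(feats)                                     # (n_filters, T)

x = np.random.default_rng(0).standard_normal(16000)           # 1 s of noise
print(gabor_frontend(x).shape)
```
Because the Gabor envelope is Gaussian rather than a truncated sinc, its spectral sidelobes decay smoothly, which is the motivation the abstract gives for swapping the SincNet-style filters.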
[489] HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
Feiyu Zhao, Yiming Chen, Wenhuan Lu, Daipeng Zhang, Xianghu Yue, Jianguo Wei
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Audio-Language Models (LALMs) have recently achieved strong performance across various audio-centric tasks. However, hallucination, where models generate responses that are semantically incorrect or acoustically unsupported, remains largely underexplored in the audio domain. Existing hallucination benchmarks mainly focus on text or vision, while the few audio-oriented studies are limited in scale, modality coverage, and diagnostic depth. We therefore introduce HalluAudio, the first large-scale benchmark for evaluating hallucinations across speech, environmental sound, and music. HalluAudio comprises over 5K human-verified QA pairs and spans diverse task types, including binary judgments, multi-choice reasoning, attribute verification, and open-ended QA. To systematically induce hallucinations, we design adversarial prompts and mixed-audio conditions. Beyond accuracy, our evaluation protocol measures hallucination rate, yes/no bias, error-type analysis, and refusal rate, enabling a fine-grained analysis of LALM failure modes. We benchmark a broad range of open-source and proprietary models, providing the first large-scale comparison across speech, sound, and music. Our results reveal significant deficiencies in acoustic grounding, temporal reasoning, and music attribute understanding, underscoring the need for reliable and robust LALMs.
[490] Deep Supervised Contrastive Learning of Pitch Contours for Robust Pitch Accent Classification in Seoul Korean
Hyunjung Joo, GyeongTaek Lee
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The intonational structure of Seoul Korean has been defined with discrete tonal categories within the Autosegmental-Metrical model of intonational phonology. However, it is challenging to map continuous $F_0$ contours to these invariant categories due to variable $F_0$ realizations in real-world speech. Our paper proposes Dual-Glob, a deep supervised contrastive learning framework to robustly classify fine-grained pitch accent patterns in Seoul Korean. Unlike conventional local predictive models, our approach captures holistic $F_0$ contour shapes by enforcing structural consistency between clean and augmented views in a shared latent space. To this end, we introduce the first large-scale benchmark dataset, consisting of 10,093 manually annotated Accentual Phrases in Seoul Korean. Experimental results show that our Dual-Glob significantly outperforms strong baseline models with state-of-the-art accuracy (77.75%) and F1-score (51.54%). Therefore, our work supports AM-based intonational phonology using data-driven methodology, showing that deep contrastive learning effectively captures holistic structural features of continuous $F_0$ contours.
[491] BEAT: Tokenizing and Generating Symbolic Music by Uniform Temporal Steps
Lekai Qian, Haoyu Gu, Jingwei Zhao, Ziyu Wang
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Tokenizing music to fit the general framework of language models is a compelling challenge, especially considering the diverse symbolic structures in which music can be represented (e.g., sequences, grids, and graphs). To date, most approaches tokenize symbolic music as sequences of musical events, such as onsets, pitches, time shifts, or compound note events. This strategy is intuitive and has proven effective in Transformer-based models, but it treats the regularity of musical time implicitly: individual tokens may span different durations, resulting in non-uniform time progression. In this paper, we instead consider whether an alternative tokenization is possible, where a uniform-length musical step (e.g., a beat) serves as the basic unit. Specifically, we encode all events within a single time step at the same pitch as one token, and group tokens explicitly by time step, which resembles a sparse encoding of a piano-roll representation. We evaluate the proposed tokenization on music continuation and accompaniment generation tasks, comparing it with mainstream event-based methods. Results show improved musical quality and structural coherence, while additional analyses confirm higher efficiency and more effective capture of long-range patterns with the proposed tokenization.
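A small sketch of the step-grouped tokenization idea follows: notes are quantized to uniform time steps, all events at the same (step, pitch) merge into one token, and an explicit step-boundary token makes time advance uniformly even through empty steps. The note format and token vocabulary here are illustrative assumptions, not the paper's specification.
```python
# Sketch of beat-aligned tokenization: quantize notes to uniform time steps,
# merge events at the same (step, pitch) into one token, and group tokens
# by step with an explicit step-boundary token.
from collections import defaultdict

STEP = "<step>"  # marks the start of each uniform time step

def tokenize(notes, steps_per_beat=4):
    """notes: list of (onset_in_beats, pitch, duration_in_beats)."""
    by_step = defaultdict(dict)
    for onset, pitch, dur in notes:
        step = round(onset * steps_per_beat)
        dur_steps = max(1, round(dur * steps_per_beat))
        # one token per (step, pitch); overlapping events keep the longer duration
        by_step[step][pitch] = max(by_step[step].get(pitch, 0), dur_steps)
    tokens = []
    for step in range(max(by_step) + 1):
        tokens.append(STEP)            # uniform time always advances, even if empty
        for pitch, dur_steps in sorted(by_step.get(step, {}).items()):
            tokens.append(f"P{pitch}_D{dur_steps}")
    return tokens

print(tokenize([(0.0, 60, 1.0), (0.0, 64, 1.0), (0.5, 67, 0.5)]))
```
This is essentially a sparse piano-roll encoding: positions in the token stream correspond to musical time rather than to event count.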
[492] Towards Streaming Target Speaker Extraction via Chunk-wise Interleaved Splicing of Autoregressive Language Model
Shuhai Peng, Hui Lu, Jinjiang Liu, Liyang Chen, Guiping Zhong, Jiakui Li, Huimeng Wang, Haiyun Li, Liang Cao, Shiyin Kang, Zhiyong Wu
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While generative models have set new benchmarks for Target Speaker Extraction (TSE), their inherent reliance on global context precludes deployment in real-time applications. Direct adaptation to streaming scenarios often leads to catastrophic inference performance degradation due to the severe mismatch between training and streaming inference. To bridge this gap, we present the first autoregressive (AR) models tailored for streaming TSE. Our approach introduces a Chunk-wise Interleaved Splicing Paradigm that ensures highly efficient and stable streaming inference. To ensure the coherence between the extracted speech segments, we design a historical context refinement mechanism that mitigates boundary discontinuities by leveraging historical information. Experiments on Libri2Mix show that while AR generative baseline exhibits performance degradation at low latencies, our approach maintains 100% stability and superior intelligibility. Furthermore, our streaming results are comparable to or even surpass offline baselines. Additionally, our model achieves a Real-Time-Factor (RTF) of 0.248 on consumer-level GPUs. This work provides empirical evidence that AR generative backbones are viable for latency-sensitive applications through the Chunk-wise Interleaved Splicing Paradigm.
[493] Environmental Sound Deepfake Detection Using Deep-Learning Framework
Lam Pham, Khoi Vu, Dat Tran, Phat Lam, Vu Nguyen, David Fischinger, Alexander Schindler, Martin Boyer, Son Le
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In this paper, we propose a deep-learning framework for environmental sound deepfake detection (ESDD) – the task of identifying whether the sound scene and sound events in an input audio recording are fake. To this end, we conducted extensive experiments to explore how individual spectrograms, a wide range of network architectures and pre-trained models, and ensembles of spectrograms or network architectures affect ESDD performance. The experimental results on the EnvSDD and ESDD-Challenge-TestSet benchmark datasets indicate that detecting deepfake sound scenes and detecting deepfake sound events should be treated as separate tasks. We also show that fine-tuning a pre-trained model is more effective than training a model from scratch for ESDD. Finally, our best model, fine-tuned from the pre-trained WavLM model with the proposed three-stage training strategy, achieves an Accuracy of 0.98, F1 Score of 0.95, and AUC of 0.99 on the EnvSDD test subset, and an Accuracy of 0.88, F1 Score of 0.77, and AUC of 0.92 on the ESDD-Challenge-TestSet dataset.
[494] Protecting Bystander Privacy via Selective Hearing in Audio LLMs
Xiao Zhan, Guangzhi Sun, Jose Such, Phil Woodland
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Audio Large language models (LLMs) are increasingly deployed in the real world, where they inevitably capture speech from unintended nearby bystanders, raising privacy risks that existing benchmarks and defences did not consider. We introduce SH-Bench, the first benchmark designed to evaluate selective hearing: a model’s ability to attend to an intended main speaker while refusing to process or reveal information about incidental bystander speech. SH-Bench contains 3,968 multi-speaker audio mixtures, including both real-world and synthetic scenarios, paired with 77k multiple-choice questions that probe models under general and selective operating modes. In addition, we propose Selective Efficacy (SE), a novel metric capturing both multi-speaker comprehension and bystander-privacy protection. Our evaluation of state-of-the-art open-source and proprietary LLMs reveals substantial bystander privacy leakage, with strong audio understanding failing to translate into selective protection of bystander privacy. To mitigate this gap, we also present Bystander Privacy Fine-Tuning (BPFT), a novel training pipeline that teaches models to refuse bystander-related queries without degrading main-speaker comprehension. We show that BPFT yields substantial gains, achieving an absolute 47% higher bystander accuracy under selective mode and an absolute 16% higher SE compared to Gemini 2.5 Pro, which is the best audio LLM without BPFT. Together, SH-Bench and BPFT provide the first systematic framework for measuring and improving bystander privacy in audio LLMs.
[495] Affectron: Emotional Speech Synthesis with Affective and Contextually Aligned Nonverbal Vocalizations
Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Nonverbal vocalizations (NVs), such as laughter and sighs, are central to the expression of affective cues in emotional speech synthesis. However, learning diverse and contextually aligned NVs remains challenging in open settings due to limited NV data and the lack of explicit supervision. Motivated by this challenge, we propose Affectron as a framework for affective and contextually aligned NV generation. Built on a small-scale open and decoupled corpus, Affectron introduces an NV-augmented training strategy that expands the distribution of NV types and insertion locations. We further incorporate NV structural masking into a speech backbone pre-trained on purely verbal speech to enable diverse and natural NV synthesis. Experimental results demonstrate that Affectron produces more expressive and diverse NVs than baseline systems while preserving the naturalness of the verbal speech stream.
[496] NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations
Liumeng Xue, Weizhen Bian, Jiahao Pan, Wenxuan Wang, Yilin Ren, Boyi Kang, Jingbin Hu, Ziyang Ma, Shuai Wang, Xinyuan Qian, Hung-yi Lee, Yike Guo
Main category: cs.SD
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Non-verbal vocalizations (NVVs) like laugh, sigh, and sob are essential for human-like speech, yet standardized evaluation remains limited in jointly assessing whether systems can generate the intended NVVs, place them correctly, and keep them salient without harming speech. We present Non-verbal Vocalization Benchmark (NVBench), a bilingual (English/Chinese) benchmark that evaluates speech synthesis with NVVs. NVBench pairs a unified 45-type taxonomy with a curated bilingual dataset and introduces a multi-axis protocol that separates general speech naturalness and quality from NVV-specific controllability, placement, and salience. We benchmark 15 TTS systems using objective metrics, listening tests, and an LLM-based multi-rater evaluation. Results reveal that NVVs controllability often decouples from quality, while low-SNR oral cues and long-duration affective NVVs remain persistent bottlenecks. NVBench enables fair cross-system comparison across diverse control interfaces under a unified, standardized framework.
cs.LG
[497] Compile to Compress: Boosting Formal Theorem Provers by Compiler Outputs
Guchan Li, Rui Tian, Hongning Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large language models (LLMs) have demonstrated significant potential in formal theorem proving, yet state-of-the-art performance often necessitates prohibitive test-time compute via massive roll-outs or extended context windows. In this work, we address this scalability bottleneck by exploiting an informative structure in formal verification: the observation that compilers map a vast space of diverse proof attempts to a compact set of structured failure modes. We introduce a learning-to-refine framework that leverages this compression to perform efficient learning and proof exploration. We perform tree search that corrects errors locally conditioned on explicit verifier feedback, thereby circumventing the costs associated with accumulating a long history of proof attempts. Extensive evaluations show that our method consistently amplifies the reasoning capabilities of base provers across varying scales. Notably, our approach achieves state-of-the-art performance on PutnamBench among publicly reported $\sim$8B and $\sim$32B parameter models under comparable test-time budgets, offering a scalable paradigm for next-generation verifier-guided reasoning.
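The local, feedback-conditioned refinement loop can be sketched as below. `compile_proof` and `propose_local_fix` are hypothetical placeholders for the Lean compiler call and the LLM refinement step, and the sketch collapses the paper's tree search into a greedy chain purely for clarity.
```python
# Hedged sketch of verifier-guided local refinement: condition the model on
# the current proof and the compiler's structured failure modes only, rather
# than on the full history of attempts.
def refine(theorem, initial_proof, compile_proof, propose_local_fix,
           max_rounds=8):
    proof = initial_proof
    for _ in range(max_rounds):
        ok, errors = compile_proof(theorem, proof)    # compact set of failure modes
        if ok:
            return proof
        proof = propose_local_fix(theorem, proof, errors)  # local, error-conditioned edit
    return None  # budget exhausted
```
Keeping the prompt bounded by (proof, errors) instead of an ever-growing attempt history is what makes the context cost independent of the number of refinement rounds.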
[498] Easy Samples Are All You Need: Self-Evolving LLMs via Data-Efficient Reinforcement Learning
Zhiyin Yu, Bo Zhang, Qibin Hou, Zhonghai Wu, Xiao Luo, Lei Bai
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Previous LLM-based RL studies typically follow either supervised learning with high annotation costs or unsupervised paradigms using voting or entropy-based rewards. However, their performance remains far from satisfactory due to the substantial annotation cost and issues such as model collapse or reward hacking. To address these issues, we introduce a new perspective inspired by cognitive learning theory and propose a novel approach called EasyRL. The core of EasyRL is to simulate the human cognitive acquisition curve by integrating reliable knowledge transfer from easy labeled data with a progressive divide-and-conquer strategy that tackles increasingly difficult unlabeled data. Specifically, we initialize a warm-up model using supervised RL with few-shot labeled data. This is followed by a divide-and-conquer pseudo-labeling strategy on difficult unlabeled data, combining consistency-based selection for low-uncertainty cases and reflection-based resolution for medium-uncertainty cases. Finally, difficulty-progressive self-training with iterative pseudo-labeling and RL further strengthens the model’s reasoning capability. EasyRL provides a unified self-evolving framework that facilitates data-efficient post-training of LLMs. Experimental results on mathematical and scientific benchmarks demonstrate that EasyRL, using only 10% of easy labeled data, consistently outperforms state-of-the-art baselines.
[499] FASE : A Fairness-Aware Spatiotemporal Event Graph Framework for Predictive Policing
Pronob Kumar Barman, Pronoy Kumar Barman, Plaban Kumar Barman, Rohan Mandar Salvi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Predictive policing systems that allocate patrol resources based solely on predicted crime risk can unintentionally amplify racial disparities through feedback-driven data bias. We present FASE, a Fairness-Aware Spatiotemporal Event Graph framework, which integrates spatiotemporal crime prediction with fairness-constrained patrol allocation and a closed-loop deployment feedback simulator. We model Baltimore as a graph of 25 ZIP Code Tabulation Areas and use 139,982 Part 1 crime incidents from 2017 to 2019 at hourly resolution, producing a sparse feature tensor. The prediction module combines a spatiotemporal graph neural network with a multivariate Hawkes process to capture spatial dependencies and self-exciting temporal dynamics. Outputs are modeled using a Zero-Inflated Negative Binomial distribution, suitable for overdispersed and zero-heavy crime counts. The model achieves a validation loss of 0.4800 and a test loss of 0.4857. Patrol allocation is formulated as a fairness-constrained linear optimization problem that maximizes risk-weighted coverage while enforcing a Demographic Impact Ratio constraint with deviation bounded by 0.05. Across six simulated deployment cycles, fairness remains within 0.9928 to 1.0262, and coverage ranges from 0.876 to 0.936. However, a persistent detection-rate gap of approximately 3.5 percentage points remains between minority and non-minority areas. This result shows that allocation-level fairness constraints alone do not eliminate feedback-induced bias in retraining data, highlighting the need for fairness interventions across the full pipeline.
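One way to picture the allocation step is as the small linear program sketched below: maximize risk-weighted coverage under a patrol budget while keeping a Demographic Impact Ratio within 1 ± 0.05. The exact constraint set and variable definitions in the paper may differ; the zone risks and group labels here are synthetic.
```python
# Hedged sketch of a fairness-constrained patrol allocation LP.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n_zones, budget, eps = 25, 10.0, 0.05
risk = rng.gamma(2.0, 1.0, n_zones)              # predicted crime risk per zone
minority = rng.random(n_zones) < 0.4             # protected-group indicator (synthetic)

m = minority.astype(float) / minority.sum()           # mean coverage, minority zones
o = (~minority).astype(float) / (~minority).sum()     # mean coverage, other zones

# DIR constraint as two linear inequalities:
#   m.x <= (1+eps) * o.x   and   m.x >= (1-eps) * o.x
A_ub = np.vstack([m - (1 + eps) * o,
                  (1 - eps) * o - m])
b_ub = np.zeros(2)
A_eq, b_eq = np.ones((1, n_zones)), [budget]          # total patrol budget

res = linprog(c=-risk, A_ub=A_ub, b_ub=b_ub,
              A_eq=A_eq, b_eq=b_eq, bounds=[(0, 1)] * n_zones)
x = res.x
print("coverage:", risk @ x, "DIR:", (m @ x) / (o @ x))
```
The abstract's closing point is visible even in this toy form: the constraint only balances where patrols go, not what the resulting detections feed back into the training data.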
[500] Curiosity-Critic: Cumulative Prediction Error Improvement as a Tractable Intrinsic Reward for World Model Training
Vin Bhaskara, Haicheng Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Local prediction-error-based curiosity rewards focus on the current transition without considering the world model’s cumulative prediction error across all visited transitions. We introduce Curiosity-Critic, which grounds its intrinsic reward in the improvement of this cumulative objective, and show that it reduces to a tractable per-step form: the difference between the current prediction error and the asymptotic error baseline of the current state transition. We estimate this baseline online with a learned critic co-trained alongside the world model; regressing a single scalar, the critic converges well before the world model saturates, redirecting exploration toward learnable transitions without oracle knowledge of the noise floor. The reward is higher for learnable transitions and collapses toward the baseline for stochastic ones, effectively separating epistemic (reducible) from aleatoric (irreducible) prediction error online. Prior prediction-error curiosity formulations, from Schmidhuber (1991) to learned-feature-space variants, emerge as special cases corresponding to specific approximations of this baseline. Experiments on a stochastic grid world show that Curiosity-Critic outperforms prediction-error and visitation-count baselines in convergence speed and final world model accuracy.
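A minimal sketch of the per-step reward described above follows: the intrinsic reward is the current world-model prediction error minus a learned per-transition baseline, with the baseline critic regressed toward observed errors (which decay toward the noise floor as the world model trains). Network sizes, the optimizer, and the flat state-action interface are illustrative assumptions.
```python
# Minimal sketch of a curiosity-critic style intrinsic reward in PyTorch.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
world_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                            nn.Linear(64, obs_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))            # scalar error-baseline estimate
opt = torch.optim.Adam([*world_model.parameters(), *critic.parameters()], lr=1e-3)

def intrinsic_reward_and_losses(obs, act, next_obs):
    sa = torch.cat([obs, act], dim=-1)
    pred_err = ((world_model(sa) - next_obs) ** 2).mean(dim=-1)   # current error
    baseline = critic(sa).squeeze(-1)                              # estimated error floor
    reward = (pred_err - baseline).detach()        # collapses toward 0 for noise-driven transitions
    model_loss = pred_err.mean()
    critic_loss = ((baseline - pred_err.detach()) ** 2).mean()     # regress the scalar error
    return reward, model_loss + critic_loss

obs = torch.randn(32, obs_dim)
act = torch.randn(32, act_dim)
nxt = torch.randn(32, obs_dim)
reward, loss = intrinsic_reward_and_losses(obs, act, nxt)
opt.zero_grad(); loss.backward(); opt.step()
```
Setting the baseline to zero recovers plain prediction-error curiosity, which is how earlier formulations appear as special cases of this scheme.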
[501] The Cost of Relaxation: Evaluating the Error in Convex Neural Network Verification
Merkouris Papamichail, Konstantinos Varsos, Giorgos Flouris, João Marques-Silva
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Many neural network (NN) verification systems represent the network’s input-output relation as a constraint program. Sound and complete representations involve integer constraints for simulating the activations. Recent works convexly relax the integer constraints, improving performance at the cost of soundness. Convex relaxations consider outputs that are unreachable by the original network. We study the worst-case divergence between the original network and its convex relaxations, both qualitatively and quantitatively. The relaxations’ space forms a lattice, where the top element corresponds to a full relaxation, with every neuron linearized. The bottom element corresponds to the original network. We provide analytical upper and lower bounds for the $\ell_\infty$-distance between the fully relaxed and original outputs. This distance grows exponentially w.r.t. the network’s depth, and linearly w.r.t. the input’s radius. The misclassification probability exhibits a step-like behavior w.r.t. the input radius. Our results are supported by experiments on MNIST, Fashion MNIST, and random networks.
[502] Discrete Tilt Matching
Yuyuan Chen, Shiyi Wang, Peter Potaptchik, Jaeyeon Kim, Michael S. Albergo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Masked diffusion large language models (dLLMs) are a promising alternative to autoregressive generation. While reinforcement learning (RL) methods have recently been adapted to dLLM fine-tuning, their objectives typically depend on sequence-level marginal likelihoods, which are intractable for masked diffusion models. To address this, we derive Discrete Tilt Matching (DTM), a likelihood-free method that recasts dLLM fine-tuning as state-level matching of local unmasking posteriors under reward tilting. DTM takes the form of a weighted cross-entropy objective with explicit minimizer, and admits control variates that improve training stability. On a synthetic maze-planning task, we analyze how DTM’s annealing schedule and control variates affect training stability and prevent mode collapse. At scale, fine-tuning LLaDA-8B-Instruct with DTM yields strong gains on Sudoku and Countdown while remaining competitive on MATH500 and GSM8K.
[503] Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models
Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
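The evaluation logic (ablate a candidate edge, compare held-out forecast error) can be made concrete with the runnable sketch below. The paper applies this to a neural additive VAR; the Ridge forecaster, lag count, and synthetic panel here are stand-ins chosen only to illustrate the test.
```python
# Runnable sketch of forecast-necessity testing: compare held-out forecast
# error for a target series with and without the candidate driver's lags.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def lag_matrix(series, p):
    T, k = series.shape
    X = np.hstack([series[p - l - 1:T - l - 1] for l in range(p)])
    return X, series[p:]

def necessity_gap(series, target, driver, p=3, split=0.7):
    X, Y = lag_matrix(series, p)
    y = Y[:, target]
    n = int(split * len(y))
    keep = np.ones(X.shape[1], dtype=bool)
    drop = np.zeros_like(keep)
    drop[[driver + l * series.shape[1] for l in range(p)]] = True  # driver's lag columns
    errs = []
    for mask in (keep, keep & ~drop):
        model = Ridge(alpha=1.0).fit(X[:n][:, mask], y[:n])
        errs.append(mean_squared_error(y[n:], model.predict(X[n:][:, mask])))
    return errs[1] - errs[0]   # > 0 means the edge is needed for forecasting

rng = np.random.default_rng(0)
T = 400
x = rng.standard_normal((T, 3))
for t in range(1, T):                        # variable 0 genuinely drives variable 1
    x[t, 1] += 0.8 * x[t - 1, 0]
print(necessity_gap(x, target=1, driver=0))  # large positive gap
print(necessity_gap(x, target=1, driver=2))  # near zero: redundant/unnecessary edge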
[504] Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling
Andrew Wang, Ellie Pavlick, Ritambhara Singh
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: An active challenge in developing multimodal machine learning (ML) models for healthcare is handling missing modalities during training and deployment. As clinical datasets are inherently temporal and sparse in terms of modality presence, capturing the underlying predictive signal via diagnostic multimodal ML models while retaining model explainability remains an ongoing challenge. In this work, we address this by re-framing clinical diagnosis as an autoregressive sequence modeling task, utilizing causal decoders from large language models (LLMs) to model a patient’s multimodal trajectory. We first introduce a missingness-aware contrastive pre-training objective that integrates multiple modalities in datasets with missingness in a shared latent space. We then show that autoregressive sequence modeling with transformer-based architectures outperforms baselines on the MIMIC-IV and eICU fine-tuning benchmarks. Finally, we use interpretability techniques to move beyond performance boosts and find that across various patient stays, removing modalities leads to divergent behavior that our contrastive pre-training mitigates. By abstracting clinical diagnosis as sequence modeling and interpreting patient stay trajectories, we develop a framework to profile and handle missing modalities while addressing the canonical desideratum of safe, transparent clinical AI.
[505] Towards Understanding the Robustness of Sparse Autoencoders
Ahson Saiyed, Sabrina Sadiekh, Chirag Agarwal
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Large Language Models (LLMs) remain vulnerable to optimization-based jailbreak attacks that exploit internal gradient structure. While Sparse Autoencoders (SAEs) are widely used for interpretability, their robustness implications remain underexplored. We present a study of integrating pretrained SAEs into transformer residual streams at inference time, without modifying model weights or blocking gradients. Across four model families (Gemma, LLaMA, Mistral, Qwen) and two strong white-box attacks (GCG, BEAST) plus three black-box benchmarks, SAE-augmented models achieve up to a 5x reduction in jailbreak success rate relative to the undefended baseline and reduce cross-model attack transferability. Parametric ablations reveal (i) a monotonic dose-response relationship between L0 sparsity and attack success rate, and (ii) a layer-dependent defense-utility tradeoff, where intermediate layers balance robustness and clean performance. These findings are consistent with a representational bottleneck hypothesis: sparse projection reshapes the optimization geometry exploited by jailbreak attacks.
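Mechanically, inference-time integration of an SAE can be as simple as a forward hook that routes the residual stream through the sparse bottleneck, as in the sketch below. The `sae.encode`/`sae.decode` interface and the layer path in the usage comment are illustrative assumptions, not a specific library's API; gradients still flow through the SAE, matching the no-gradient-blocking setup described above.
```python
# Hedged sketch: splice a sparse autoencoder into a transformer block's output.
import torch

def attach_sae(block: torch.nn.Module, sae: torch.nn.Module):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        recon = sae.decode(sae.encode(hidden))        # sparse bottleneck reconstruction
        if isinstance(output, tuple):
            return (recon,) + output[1:]
        return recon
    return block.register_forward_hook(hook)

# Usage (illustrative): handle = attach_sae(model.model.layers[12], sae)
# ... run generation or an attack ...; handle.remove() restores the original model.
```
The defense knob the ablations vary, the SAE's L0 sparsity, lives entirely inside `sae`, so the hook itself never changes across configurations.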
[506] Multi-Level Temporal Graph Networks with Local-Global Fusion for Industrial Fault Diagnosis
Bibek Aryal, Gift Modekwe, Qiugang Lu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fault detection and diagnosis are critical for the optimal and safe operation of industrial processes. The correlations among sensors often display non-Euclidean structures, for which graph neural networks (GNNs) are widely used. However, for large-scale systems, local, global, and dynamic relations extensively exist among sensors, and traditional GNNs often overlook such complex and multi-level structures for various problems, including fault diagnosis. To address this issue, we propose a structure-aware multi-level temporal graph network with local-global feature fusion for industrial fault diagnosis. First, a correlation graph is dynamically constructed using Pearson correlation coefficients to capture relationships among process variables. Then, temporal features are extracted through a long short-term memory (LSTM)-based encoder, whereas the spatial dependencies among sensors are learned by graph convolution layers. A multi-level pooling mechanism is used to gradually coarsen and learn meaningful graph structures, to capture higher-level patterns while keeping important fault-related details. Finally, a fusion step is applied to combine both detailed local features and overall global patterns before the final prediction. Experimental evaluations on the Tennessee Eastman process (TEP) demonstrate that the proposed model achieves superior fault diagnosis performance, particularly for complex fault scenarios, outperforming various baseline methods.
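A minimal sketch of the graph-construction step the abstract mentions: sensors become nodes, and edges are drawn where the Pearson correlation of their recent readings exceeds a threshold (window size and threshold are illustrative choices, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.standard_normal((200, 12))        # 200 time steps x 12 process variables
corr = np.corrcoef(window, rowvar=False)       # 12 x 12 Pearson correlation matrix

threshold = 0.3
adj = (np.abs(corr) >= threshold).astype(float)
np.fill_diagonal(adj, 0.0)                     # no self-loops

# Symmetric normalization commonly used before graph convolution: D^-1/2 A D^-1/2
deg = adj.sum(axis=1)
d_inv_sqrt = np.zeros_like(deg)
nz = deg > 0
d_inv_sqrt[nz] = deg[nz] ** -0.5
adj_norm = adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
print(adj_norm.shape)
```

Recomputing this adjacency on a sliding window is what makes the graph "dynamic" before the LSTM encoder and graph convolution layers consume it.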
[507] Streaming Structured Inference with Flash-SemiCRF
Benjamin K. Johnson, Thomas Goralski, Ayush Semwal, Hui Shen, H. Josh Jang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Semi-Markov Conditional Random Fields (semi-CRFs) assign labels to segments of a sequence rather than to individual positions, enabling exact inference over segment-level features and principled uncertainty estimates at their boundaries. However, existing implementations must materialize a large edge potential tensor whose size grows with sequence length, maximum segment length, and label count, becoming prohibitive for speech-scale state spaces and intractable at genomic scales where sequences can exceed 100,000 positions. This memory bottleneck has limited the adoption of exact segment-level inference for long sequences and large label sets. We identify that the core inefficiency is materializing edge potentials that can instead be evaluated on-the-fly from a compact prefix-sum array, and make several improvements. First, replacing the stored edge tensor with prefix-sum lookup reduces the memory footprint by a factor proportional to the product of segment length and label count. Second, a streaming forward-backward pass with checkpoint-boundary normalization keeps working memory sublinear in sequence length while preserving exact gradients. Third, zero-centered cumulative scores control numerical drift and induce an adaptive duration prior under label imbalance. We integrate these ideas into Flash-SemiCRF, a fused Triton kernel that enables exact semi-CRF inference on previously intractable problem sizes. Available at https://github.com/biobenkj/flash-semicrf.
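A minimal sketch of the core memory trick described above: a segment's content score can be read off a prefix-sum array in constant time instead of materializing a (length x max-segment-length x labels) edge-potential tensor (shapes and the scoring rule are illustrative, not the Flash-SemiCRF kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
T, L = 1000, 8                               # sequence length, label count
frame_scores = rng.standard_normal((T, L))   # per-position, per-label emissions

# Prefix sums with a leading zero row: prefix[t] = sum of frame_scores[:t]
prefix = np.vstack([np.zeros((1, L)), np.cumsum(frame_scores, axis=0)])

def segment_score(start, end, label):
    """Score of labeling positions [start, end) as one segment with `label`."""
    return prefix[end, label] - prefix[start, label]

# Equivalent to frame_scores[start:end, label].sum(), without storing all segment potentials.
print(np.isclose(segment_score(10, 25, 3), frame_scores[10:25, 3].sum()))
```

The streaming forward-backward pass then evaluates these lookups on the fly inside the recursion, which is what keeps working memory sublinear in sequence length.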
[508] Efficient Mixture-of-Experts LLM Inference with Apple Silicon NPUs
Afsara Benazir, Felix Xiaozhu Lin
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Apple Neural Engine (ANE) is a dedicated neural processing unit (NPU) present in every Apple Silicon chip. Mixture-of-Experts (MoE) LLMs improve inference efficiency via sparse activation but are challenging for NPUs in three ways: expert routing is unpredictable and introduces dynamic tensor shapes that conflict with the shape-specific constraints of NPUs; several irregular operators, e.g., top-k, scatter/gather, etc., are not NPU-friendly; and launching many small expert kernels incurs substantial dispatch and synchronization overhead. NPUs are designed to offload AI compute from CPU and GPU; our goal is to enable such offloading for MoE inference, particularly during prefill, where long-context workloads consume substantial system resources. This paper presents NPUMoE, a runtime inference engine that accelerates MoE execution on Apple Silicon by offloading dense, static computation to NPU, while preserving a CPU/GPU fallback path for dynamic operations. NPUMoE uses offline calibration to estimate expert capacity and popularity, which drives three key techniques: (1) static tiers for expert capacity to address dynamic expert routing; (2) grouped expert execution to mitigate NPU concurrency limits; and (3) load-aware expert compute-graph residency to reduce CPU-NPU synchronization overhead. Experiments on Apple M-series devices using three representative MoE LLMs and four long-context workloads show that NPUMoE consistently outperforms baselines, reducing latency by 1.32x-5.55x, improving energy efficiency by 1.81x-7.37x, and reducing CPU-cycle usage by 1.78x-5.54x through effective NPU offloading.
[509] HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation
Zijian Zeng, Fei Ding, Huiming Yang, Xianwei Li
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Vision-Language-Action (VLA) models fail systematically on long-horizon manipulation tasks despite strong short-horizon performance. We show that this failure is not resolved by extending context length alone in the current reactive execution setting; instead, it stems from three recurring execution-loop deficiencies: the memory gap, the verification gap, and the recovery gap. We present HELM, a model-agnostic framework that addresses these deficiencies with three components: an Episodic Memory Module (EMM) that retrieves key task history via CLIP-indexed keyframes, a learned State Verifier (SV) that predicts action failure before execution from observation, action, subgoal, and memory-conditioned context, and a Harness Controller (HC) that performs rollback and replanning. The SV is the core learning contribution: it consistently outperforms rule-based feasibility checks and ensemble uncertainty baselines, and its effectiveness depends critically on access to episodic memory. On LIBERO-LONG, HELM improves task success rate by 23.1 percentage points over OpenVLA (58.4% to 81.5%), while extending the context window to H=32 yields only a 5.4-point gain and same-budget LoRA adaptation remains 12.2 points below HELM. HELM also improves long-horizon performance on CALVIN and substantially boosts recovery success under controlled perturbations. Ablations and mechanism analyses isolate the contribution of each component, and we release LIBERO-Recovery as a perturbation-injection protocol for evaluating failure recovery in long-horizon manipulation.
[510] Preserving Clusters in Error-Bounded Lossy Compression of Particle Data
Congrong Ren, Sheng Di, Katrin Heitmann, Franck Cappello, Hanqi Guo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Lossy compression is widely used to reduce storage and I/O costs for large-scale particle datasets in scientific applications such as cosmology, molecular dynamics, and fluid dynamics, where clustering structures (e.g., single-linkage or Friends-of-Friends) are critical for downstream analysis; however, existing compressors typically provide only pointwise error bounds on particle positions and offer no guarantees on preserving clustering outcomes, and even small perturbations can alter cluster connectivity and compromise scientific validity. We propose a correction-based technique to preserve single-linkage clustering under lossy compression, operating on decompressed data from off-the-shelf compressors such as SZ3 and Draco. Our key contributions are threefold: (1) a clustering-aware correction algorithm that identifies vulnerable particle pairs via spatial partitioning and local neighborhood search; (2) an optimization-based formulation that enforces clustering consistency using projected gradient descent with a loss that encodes pairwise distance violations; and (3) a scalable GPU-accelerated and distributed implementation for large-scale datasets. Experiments on cosmology and molecular dynamics datasets show that our method effectively preserves clustering results while maintaining competitive compression performance compared with SZ3, ZFP, Draco, LCP, and space-filling-curve-based schemes.
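A minimal sketch of the correction idea under stated assumptions: nudge decompressed particle positions with projected gradient descent so that distances of "vulnerable" pairs stay close to their original values, while every coordinate remains inside the pointwise error bound (constants, the vulnerability rule, and the loss are illustrative simplifications of the paper's formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, link = 50, 0.05, 0.3
orig = rng.random((n, 3))
decomp = orig + rng.uniform(-eps, eps, size=(n, 3))      # what a lossy compressor returns

d_orig = np.linalg.norm(orig[:, None] - orig[None, :], axis=-1)
# Vulnerable pairs: near the linkage threshold, where small perturbations flip connectivity.
upper = np.triu(np.ones((n, n)), 1) > 0
vi, vj = np.where((np.abs(d_orig - link) < 2 * eps) & upper)

pos = decomp.copy()
for _ in range(200):
    diff = pos[vi] - pos[vj]
    d = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-12
    # Loss: squared mismatch between corrected and original pairwise distances.
    grad_pair = 2 * (d - d_orig[vi, vj][:, None]) * diff / d
    grad = np.zeros_like(pos)
    np.add.at(grad, vi, grad_pair)
    np.add.at(grad, vj, -grad_pair)
    pos -= 0.05 * grad
    pos = np.clip(pos, orig - eps, orig + eps)           # projection: respect the error bound
print(np.abs(np.linalg.norm(pos[vi] - pos[vj], axis=1) - d_orig[vi, vj]).mean())
```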
[511] A PPA-Driven 3D-IC Partitioning Selection Framework with Surrogate Models
Shang Wang, Shuai Liu, Owen Randall, Matthew E. Taylor
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: 3D-IC netlist partitioning is commonly optimized using proxy objectives, while final PPA is treated as a costly evaluation rather than an optimization signal. This proxy-driven paradigm makes it difficult to reliably translate additional PPA evaluations into better PPA outcomes. To bridge this gap, we present DOPP (D-Optimal PPA-driven partitioning selection), an approach that connects proxy objectives with true PPA metrics. Across eight 3D-IC designs, our framework improves PPA over Open3DBench (average relative improvements of 9.99% congestion, 7.87% routed wirelength, 7.75% WNS, 21.85% TNS, and 1.18% power). Compared with exhaustive evaluation over the full candidate set, DOPP achieves comparable best-found PPA while evaluating only a small fraction of candidates, substantially reducing evaluation cost. By parallelizing evaluations, our method delivers these gains while maintaining wall-clock runtime comparable to traditional baselines.
[512] Rethinking Dataset Distillation: Hard Truths about Soft Labels
Priyam Dey, Aditya Sahdev, Sunny Bhati, Konda Reddy Mopuri, R. Venkatesh Babu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the perceived success of large-scale dataset distillation (DD) methods, recent evidence finds that simple random image baselines perform on-par with state-of-the-art DD methods like SRe2L due to the use of soft labels during downstream model training. This is in contrast with the findings in coreset literature, where high-quality coresets consistently outperform random subsets in the hard-label (HL) setting. To understand this discrepancy, we perform a detailed scalability analysis to examine the role of data quality under different label regimes, ranging from abundant soft labels (termed as SL+KD regime) to fixed soft labels (SL) and hard labels (HL). Our analysis reveals that high-quality coresets fail to convincingly outperform the random baseline in both SL and SL+KD regimes. In the SL+KD setting, performance further approaches near-optimal levels relative to the full dataset, regardless of subset size or quality, for a given compute budget. This performance saturation calls into question the widespread practice of using soft labels for model evaluation, where unlike the HL setting, subset quality has negligible influence. A subsequent systematic evaluation of five large-scale and four small-scale DD methods in the HL setting reveals that only RDED reliably outperforms random baselines on ImageNet-1K, but can still lag behind strong coreset methods due to its over-reliance on easy sample patches. Based on this, we introduce CAD-Prune, a compute-aware pruning metric that efficiently identifies samples of optimal difficulty for a given compute budget, and use it to develop CA2D, a compute-aligned DD method, outperforming current DD methods on ImageNet-1K at various IPC settings. Together, our findings uncover many insights into current DD research and establish useful tools to advance data-efficient learning for both coresets and DD.
[513] Curvature-Aware PCA with Geodesic Tangent Space Aggregation for Semi-Supervised Learning
Alexandre L. M. Levada
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Principal Component Analysis (PCA) is a fundamental tool for representation learning, but its global linear formulation fails to capture the structure of data supported on curved manifolds. In contrast, manifold learning methods model nonlinearity but often sacrifice the spectral structure and stability of PCA. We propose \emph{Geodesic Tangent Space Aggregation PCA (GTSA-PCA)}, a geometric extension of PCA that integrates curvature awareness and geodesic consistency within a unified spectral framework. Our approach replaces the global covariance operator with curvature-weighted local covariance operators defined over a $k$-nearest neighbor graph, yielding local tangent subspaces that adapt to the manifold while suppressing high-curvature distortions. We then introduce a geodesic alignment operator that combines intrinsic graph distances with subspace affinities to globally synchronize these local representations. The resulting operator admits a spectral decomposition whose leading components define a geometry-aware embedding. We further incorporate semi-supervised information to guide the alignment, improving discriminative structure with minimal supervision. Experiments on real datasets show consistent improvements over PCA, Kernel PCA, Supervised PCA and strong graph-based baselines such as UMAP, particularly in small sample size and high-curvature regimes. Our results position GTSA-PCA as a principled bridge between statistical and geometric approaches to dimensionality reduction.
[514] The High Explosives and Affected Targets (HEAT) Dataset
Bryan Kaiser, Kyle Hickmann, Sharmistha Chakrabarti, Soumi De, Sourabh Pandit, David Schodt, Jesus Pulido, Divya Banesh, Christine Sweeney
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Artificial Intelligence (AI) surrogate models provide a computationally efficient alternative to full-physics simulations, but no public datasets currently exist for training and validating models of high-explosive-driven, multi-material shock dynamics. Simulating shock propagation is challenging due to the need for material-specific equations of state (EOS) and models of plasticity, phase change, damage, fluid instabilities, and multi-material interactions. Explosive-driven shocks further require reactive material models to capture detonation physics. To address this gap, we introduce the High-Explosives and Affected Targets (HEAT) dataset, a physics-rich collection of two-dimensional, cylindrically symmetric simulations generated using an Eulerian multi-material shock-propagation code developed at Los Alamos National Laboratory. HEAT consists of two partitions: expanding shock-cylinder (CYL) simulations and Perturbed Layered Interface (PLI) simulations. Each entry includes time series of thermodynamic fields (pressure, density, temperature), kinematic fields (position, velocity), and continuum quantities such as stress. The CYL partition spans a range of materials, including metals (aluminum, copper, depleted uranium, stainless steel, tantalum), a polymer, water, gases (air, nitrogen), and a detonating material. The PLI partition explores varied geometries with fixed materials: copper, aluminum, stainless steel, polymer, and high explosive. HEAT captures key phenomena such as shock propagation, momentum transfer, plastic deformation, and thermal effects, providing a benchmark dataset for AI/ML models of multi-material shock physics.
[515] One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
Chris Cameron, Wangzheng Wang, Nikita Ivanov, Ashmita Bhattacharyya, Didier Chételat, Yingxue Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Looped transformers scale computational depth without increasing parameter count by repeatedly applying a shared transformer block and can be used for iterative refinement, where each loop rewrites a full fixed-size prediction in parallel. On difficult problems, such as those that require search-like computation, reaching a highly structured solution starting from noise can require long refinement trajectories. Learning such trajectories is challenging when training specifies only the target solution and provides no supervision over the intermediate refinement path. Diffusion models tackle this issue by corrupting data with varying magnitudes of noise and training the model to reverse it in a \textit{single step}. However, this process misaligns training and testing behaviour. We introduce Denoising Recursion Models, a method that similarly corrupts data with noise but trains the model to reverse the corruption over \textit{multiple} recursive steps. This strategy provides a tractable curriculum of intermediate states, while better aligning training with testing and incentivizing non-greedy, forward-looking generation. Through extensive experiments, we show this approach outperforms the Tiny Recursion Model (TRM) on ARC-AGI, where it recently achieved breakthrough performance.
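A minimal sketch of the training signal described above: corrupt the target with noise of varying magnitude, then require a shared block, applied recursively for K steps, to recover it, with supervision only on the final state (the tiny MLP and toy task are placeholders, not the authors' architecture):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, K = 32, 4
block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
opt = torch.optim.Adam(block.parameters(), lr=1e-3)

for step in range(100):
    target = torch.randn(64, dim)                # "solution" the model must reach
    noise_scale = torch.rand(64, 1)              # varying corruption magnitude
    state = target + noise_scale * torch.randn(64, dim)
    for _ in range(K):                           # multi-step recursive denoising
        state = state + block(state)             # residual refinement with shared weights
    loss = ((state - target) ** 2).mean()        # supervision only on the final state
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

Training through several recursive steps (rather than a single denoising step) is what aligns the optimization with how the model is unrolled at test time.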
[516] Task Switching Without Forgetting via Proximal Decoupling
Pourya Shamsolmoali, Masoumeh Zareapoor, Eric Granger, William A. P. Smith, Yue Lu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In continual learning, the primary challenge is to learn new information without forgetting old knowledge. A common solution addresses this trade-off through regularization, penalizing changes to parameters critical for previous tasks. In most cases, this regularization term is directly added to the training loss and optimized with standard gradient descent, which blends learning and retention signals into a single update and does not explicitly separate essential parameters from redundant ones. As task sequences grow, this coupling can over-constrain the model, limiting forward transfer and leading to inefficient use of capacity. We propose a different approach that separates task learning from stability enforcement via operator splitting. The learning step focuses on minimizing the current task loss, while a proximal stability step applies a sparse regularizer to prune unnecessary parameters and preserve task-relevant ones. This turns the stability-plasticity trade-off into a negotiated update between two complementary operators, rather than a conflicting gradient. We provide theoretical justification for the splitting method on the continual-learning objective, and demonstrate that our proposed solver achieves state-of-the-art results on standard benchmarks, improving both stability and adaptability without the need for replay buffers, Bayesian sampling, or meta-learning components.
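A minimal sketch of the split update under stated assumptions: a plain gradient step on the current task loss, followed by a proximal step that soft-thresholds the change relative to the previous-task parameters, so only a sparse set of parameters actually moves (quadratic losses keep the example self-contained; this is not the paper's solver):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
w_prev = rng.standard_normal(d)          # parameters after earlier tasks
w = w_prev.copy()
target = rng.standard_normal(d)          # optimum of the current task

lr, lam = 0.1, 0.05
for _ in range(200):
    grad = w - target                    # gradient of 0.5 * ||w - target||^2
    w_half = w - lr * grad               # (1) learning step: current task only
    delta = w_half - w_prev              # (2) proximal stability step on the *change*
    w = w_prev + np.sign(delta) * np.maximum(np.abs(delta) - lr * lam, 0.0)

print(f"nonzero parameter changes: {(np.abs(w - w_prev) > 1e-8).sum()} / {d}")
```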
[517] ParamBoost: Gradient Boosted Piecewise Cubic Polynomials
Nicolas Salvadé, Tim Hillel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Generalized Additive Models (GAMs) can be used to create non-linear glass-box (i.e. explicitly interpretable) models, where the predictive function is fully observable over the complete input space. However, glass-box interpretability itself does not allow for the incorporation of expert knowledge from the modeller. In this paper, we present ParamBoost, a novel GAM whose shape functions (i.e. mappings from individual input features to the output) are learnt using a Gradient Boosting algorithm that fits cubic polynomial functions at leaf nodes. ParamBoost incorporates several constraints commonly used in parametric analysis to ensure well-refined shape functions. These constraints include: (i) continuity of the shape functions and their derivatives (up to C2); (ii) monotonicity; (iii) convexity; (iv) feature interaction constraints; and (v) model specification constraints. Empirical results show that the unconstrained ParamBoost model consistently outperforms state-of-the-art GAMs across several real-world datasets. We further demonstrate that modellers can selectively impose required constraints at a modest trade-off in predictive performance, allowing the model to be fully tailored to application-specific interpretability and parametric-analysis requirements.
[518] Subgraph Concept Networks: Concept Levels in Graph Classification
Lucie Charlotte Magister, Alexander Norcliffe, Iulia Duta, Pietro Lio
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The reasoning process of Graph Neural Networks is complex and considered opaque, limiting trust in their predictions. To alleviate this issue, prior work has proposed concept-based explanations, extracted from clusters in the model’s node embeddings. However, a limitation of concept-based explanations is that they only explain the node embedding space and are obscured by pooling in graph classification. To mitigate this issue and provide a deeper level of understanding, we propose the Subgraph Concept Network. The Subgraph Concept Network is the first graph neural network architecture that distils subgraph and graph-level concepts. It achieves this by performing soft clustering on node concept embeddings to derive subgraph and graph-level concepts. Our results show that the Subgraph Concept Network achieves competitive model accuracy while discovering meaningful concepts at different levels of the network.
[519] AC-SINDy: Compositional Sparse Identification of Nonlinear Dynamics
Peter Racioppo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We present AC-SINDy, a compositional extension of the Sparse Identification of Nonlinear Dynamics (SINDy) framework that replaces explicit feature libraries with a structured representation based on arithmetic circuits. Rather than enumerating candidate basis functions, the proposed approach constructs nonlinear features through compositions of linear functions and multiplicative interactions, yielding a compact and scalable parameterization and enabling sparsity to be enforced directly over the computational graph. We also introduce a formulation that separates state estimation from dynamics identification by combining latent state inference with shared dynamics and multi-step supervision, improving robustness to noise while preserving interpretability. Experiments on nonlinear and chaotic systems demonstrate that the method recovers accurate and interpretable governing equations while scaling more favorably than standard SINDy.
[520] Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams
Isaac Llorente-Saguer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English evaluation, we characterise this geometry through six direction-finding strategies. Three succeed: a soft-AUC-optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class-mean probe reaches 0.98 and 0.71 at <1ms fitting cost; a supervised angular-deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction ($73^\circ$ from projection-based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held-out HarmBench and JailbreakBench with worst-case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains $\geq$0.98 and cross-variant transfer stays within 0.018 of own-direction performance. This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety-adjacent evaluation.
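A minimal sketch of the cheapest detector in the abstract, the class-mean probe: score a prompt by its projection onto the difference of mean activations between harmful and benign prompts (random vectors stand in for real residual-stream activations; the AUROC computation is a generic rank statistic, not the authors' evaluation code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 500
axis = rng.standard_normal(d)                                 # unknown "harmfulness" axis
benign = rng.standard_normal((n, d))
harmful = rng.standard_normal((n, d)) + 0.4 * axis

probe = harmful.mean(axis=0) - benign.mean(axis=0)            # class-mean direction (fit in <1 ms)
probe /= np.linalg.norm(probe)

scores = np.concatenate([benign, harmful]) @ probe
labels = np.concatenate([np.zeros(n), np.ones(n)])

# AUROC via the Mann-Whitney rank statistic (probability a harmful prompt outscores a benign one).
ranks = scores.argsort().argsort() + 1.0
auroc = (ranks[labels == 1].sum() - n * (n + 1) / 2) / (n * n)
print(f"AUROC: {auroc:.3f}")
```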
[521] Gradient-Based Program Synthesis with Neurally Interpreted Languages
Matthew V. Macfarlane, Clément Bonnet, Herke van Hoof, Levi H. S. Lelis
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: A central challenge in program induction has long been the trade-off between symbolic and neural approaches. Symbolic methods offer compositional generalisation and data efficiency, yet their scalability is constrained by formalisms such as domain-specific languages (DSLs), which are labour-intensive to create and may not transfer to new domains. In contrast, neural networks flexibly learn from data but tend to generalise poorly in compositional and out-of-distribution settings. We bridge this divide with an instance of a Latent Adaptation Network architecture named Neural Language Interpreter (NLI), which learns its own discrete, symbolic-like programming language end-to-end. NLI autonomously discovers a vocabulary of primitive operations and uses a novel differentiable neural executor to interpret variable-length sequences of these primitives. This allows NLI to represent programs that are not bound to a constant number of computation steps, enabling it to solve more complex problems than those seen during training. To make these discrete, compositional program structures amenable to gradient-based optimisation, we employ the Gumbel-Softmax relaxation, enabling the entire model to be trained end-to-end. Crucially, this same differentiability enables powerful test-time adaptation. At inference, NLI’s program inductor provides an initial program guess. This guess is then refined via gradient descent through the neural executor, enabling efficient search for the neural program that best explains the given data. We demonstrate that NLI outperforms in-context learning, test-time training, and continuous latent program networks on tasks that require combinatorial generalisation and rapid adaptation to unseen tasks. Our results establish a new path toward models that combine the compositionality of discrete languages with the gradient-based search and end-to-end learning of neural networks.
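A minimal sketch of the Gumbel-Softmax relaxation the abstract relies on to make discrete primitive choices differentiable: a soft (or straight-through hard) one-hot sample whose gradient flows back into the logits (vocabulary size, temperature, and the toy objective are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 16, requires_grad=True)   # scores over 16 learned primitives

# Soft sample during training (differentiable); hard straight-through variant also available.
soft = F.gumbel_softmax(logits, tau=1.0, hard=False)
hard = F.gumbel_softmax(logits, tau=1.0, hard=True)

loss = (soft * torch.arange(16.0)).sum()          # toy downstream objective
loss.backward()
print(soft.shape, hard.argmax(dim=-1), logits.grad is not None)
```

The same differentiability is what allows the initial program guess to be refined by gradient descent through the neural executor at test time.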
[522] Collaborative Contextual Bayesian Optimization
Chih-Yu Chang, Qiyuan Chen, Tianhan Gao, David Fenning, Chinedum Okwudire, Neil Dasgupta, Wei Lu, Raed Al Kontar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Discovering optimal designs through sequential data collection is essential in many real-world applications. While Bayesian Optimization (BO) has achieved remarkable success in this setting, growing attention has recently turned to context-specific optimal design, formalized as Contextual Bayesian Optimization (CBO). Unlike BO, CBO is inherently more challenging as it must approximate an entire mapping from the context space to its corresponding optimal design, requiring simultaneous exploration across contexts and exploitation within each. In many modern applications, such tasks arise across multiple potentially heterogeneous but related clients, where collaboration can significantly improve learning efficiency. We propose CCBO, Collaborative Contextual Bayesian Optimization, a unified framework enabling multiple clients to jointly perform CBO with controllable contexts, supporting both online collaboration and offline initialization from peers’ historical beliefs, with an optional privacy-preserving communication mechanism. We establish sublinear regret guarantees and demonstrate, through extensive simulations and a real-world hot rolling application, that CCBO achieves substantial improvements over existing approaches even under client heterogeneity. The code to reproduce the results can be found at https://github.com/cchihyu/Collaborative-Contextual-Bayesian-Optimization
[523] Fine-Tuning Small Reasoning Models for Quantum Field Theory
Nathaniel S. Woodward, Zhiqi Gao, Yurii Kvasiuk, Kendrick M. Smith, Frederic Sala, Moritz Münchmeyer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the growing application of Large Language Models (LLMs) to theoretical physics, there is little academic exploration into how domain-specific physics reasoning ability develops while training these models. To investigate this, we perform the first academic fine-tuning study of small (7B-parameter) reasoning models dedicated specifically to theoretical physics. Because open-source verifiable training data required to train such capabilities is scarce, we developed a robust data generation pipeline that can both create synthetic problems and make existing human-authored problems suitable for model training. Selecting Quantum Field Theory (QFT) as our primary domain, we generated over 2,500 synthetic problems alongside a curated collection of human-adapted problems sourced from arXiv and standard pedagogical resources. We conduct both Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) experiments, benchmarking performance gains as well as generalization to other physics domains. We perform an extensive analysis of model chains-of-thought before and after fine-tuning, to understand how reasoning errors evolve during RL and SFT. Finally, we publicly release our data pipeline, verifiable QFT training data, and $\sim$200M tokens of QFT reasoning traces.
[524] TabEmb: Joint Semantic-Structure Embedding for Table Annotation
Ehsan Hoseinzade, Ke Wang, Anandharaju Durai Raju
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Table annotation is crucial for making web and enterprise tables usable in downstream NLP applications. Unlike textual data where learning semantically rich token or sentence embeddings often suffice, tables are structured combinations of columns wherein useful representations must jointly capture each column’s semantics and the inter-column relationships. Existing models learn by linearizing the 2D table into a 1D token sequence and encoding it with pretrained language models (PLMs) such as BERT. However, this leads to limited semantic quality and weaker generalization to unseen or rare values compared to modern LLMs, and degraded structural modeling due to 2D-to-1D flattening and context-length constraints. We propose TabEmb, which directly targets these limitations by decoupling semantic encoding from structural modeling. An LLM first produces semantically rich embeddings for each column, and a graph-based module over columns then injects relationships into the embeddings, yielding joint semantic-structural representations for table annotation. Experiments show that TabEmb consistently outperforms strong baselines on different table annotation tasks. Source code and datasets are available at https://github.com/hoseinzadeehsan/TabEmb
[525] FlowForge: A Staged Local Rollout Engine for Flow-Field Prediction
Xiaowen Zhang, Ziming Zhou, Fengnian Zhao, David L. S. Hung
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deep learning surrogates for CFD flow-field prediction often rely on large, complex models, which can be slow and fragile when data are noisy or incomplete. We introduce FlowForge, a staged local rollout engine that predicts future flow fields by compiling a locality-preserving update schedule and executing it with a shared lightweight local predictor. Rather than producing the next frame in a single global pass, FlowForge rewrites spatial sites stage by stage so that each update conditions only on bounded local context exposed by earlier stages. This compile-execute design aligns inference with short-range physical dependence, keeps latency predictable, and limits error amplification from global mixing. Across PDEBench, CFDBench, and BubbleML, FlowForge matches or improves upon strong baselines in pointwise accuracy, delivers consistently better robustness to noise and missing observations, and maintains stable multi-step rollout behavior while reducing per-step latency.
[526] Distillation Traps and Guards: A Calibration Knob for LLM Distillability
Weixiao Zhan, Yongcheng Jing, Leszek Rutkowski, Dacheng Tao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Knowledge distillation (KD) transfers capabilities from large language models (LLMs) to smaller students, yet it can fail unpredictably and also underpins model leakage risks. Our analysis revealed several distillation traps that distort training signals: tail noise, off-policy instability, and, most fundamentally, the teacher-student gap. These traps manifest as overconfident hallucinations, self-correction collapse, and local decoding degradation, causing distillation to fail. Motivated by these findings, we propose a post-hoc calibration method that, to the best of our knowledge, for the first time enables control over a teacher’s distillability via reinforcement fine-tuning (RFT). Our objective combines task utility, KL anchor, and across-tokenizer calibration reward. This makes distillability a practical safety lever for foundation models, connecting robust teacher-student transfer with deployment-aware model protection. Experiments across math, knowledge QA, and instruction-following tasks show that students distilled from distillable calibrated teachers outperform SFT and KD baselines, while undistillable calibrated teachers retain their task performance but cause distilled students to collapse, offering a practical knob for both better KD and model IP protection.
[527] Self-Improving Tabular Language Models via Iterative Group Alignment
Yunbo Long, Tejumade Afonja, Alexandra Brintrup, Mario Fritz
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: While language models have been adapted for tabular data generation, two fundamental limitations remain: (1) static fine-tuning produces models that cannot learn from their own generated samples and adapt to self-correct, and (2) autoregressive objectives preserve local token coherence but neglect global statistical properties, degrading tabular quality. Reinforcement learning offers a potential solution but requires designing reward functions that balance competing objectives – impractical for tabular data. To fill the gap, we introduce TabGRAA (Tabular Group-Relative Advantage Alignment), the first self-improving framework for tabular data generation via automated feedback. At each iteration, TabGRAA uses an \emph{automated quality signal} – such as a two-sample distinguishability classifier or a distance-based reward – to partition newly generated samples into high- and low-quality groups, then optimizes a group-relative advantage objective that reinforces realistic patterns while penalizing artifacts. The specific signal is a modular choice rather than a fixed component of the framework. This establishes a virtuous feedback cycle, where the quality signal is re-computed against newly \emph{generated synthetic} samples at each round; the language model is only fine-tuned on these self-generated signals, so no additional real record is exposed during alignment, mitigating data-leakage risk beyond the initial supervised fine-tuning. Experiments show TabGRAA outperforms existing methods in fidelity, utility, and privacy, while matching or exceeding diffusion-based synthesizers, advancing tabular synthesis from static statistical replication to dynamic, self-improving generation.
[528] Mechanistic Anomaly Detection via Functional Attribution
Hugo Lyons Keenan, Christopher Leckie, Sarah Erfani
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We can often verify the correctness of neural network outputs using ground truth labels, but we cannot reliably determine whether the output was produced by normal or anomalous internal mechanisms. Mechanistic anomaly detection (MAD) aims to flag these cases, but existing methods either depend on latent space analysis, which is vulnerable to obfuscation, or are specific to particular architectures and modalities. We reframe MAD as a functional attribution problem: asking to what extent samples from a trusted set can explain the model’s output, where attribution failure signals anomalous behavior. We operationalize this using influence functions, measuring functional coupling between test samples and a small reference set via parameter-space sampling. We evaluate across multiple anomaly types and modalities. For backdoors in vision models, our method achieves state-of-the-art detection on BackdoorBench, with an average Defense Effectiveness Rating (DER) of 0.93 across seven attacks and four datasets (next best 0.83). For LLMs, we similarly achieve a significant improvement over baselines for several backdoor types, including on explicitly obfuscated models. Beyond backdoors, our method can detect adversarial and out-of-distribution samples, and distinguishes multiple anomalous mechanisms within a single model. Our results establish functional attribution as an effective, modality-agnostic tool for detecting anomalous behavior in deployed models.
[529] Low-Rank Adaptation for Critic Learning in Off-Policy Reinforcement Learning
Yuan Zhuang, Yuexin Bian, Sihong He, Jie Feng, Qing Su, Songyang Han, Jonathan Petit, Shihao Ji, Yuanyuan Shi, Fei Miao
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Scaling critic capacity is a promising direction for enhancing off-policy reinforcement learning (RL). However, larger critics are prone to overfitting and unstable in replay-buffer-based bootstrap training. This paper leverages Low-Rank Adaptation (LoRA) as a structural-sparsity regularizer for off-policy critics. Our approach freezes randomly initialized base matrices and solely optimizes low-rank adapters, thereby constraining critic updates to a low-dimensional subspace. Built on top of SimbaV2, we further develop a LoRA formulation, compatible with SimbaV2, that preserves its hyperspherical normalization geometry under frozen-backbone training. We evaluate our method with SAC and FastTD3 on DeepMind Control locomotion and IsaacLab robotics benchmarks. LoRA consistently achieves lower critic loss during training and stronger policy performance. Extensive experiments demonstrate that adaptive low-rank updates provide a simple, scalable, and effective structural regularization for critic learning in off-policy RL.
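A minimal sketch of the structural regularizer described above: a critic layer whose dense base weight stays frozen at its random initialization while the trainable update lives entirely in a low-rank adapter (sizes, rank, and scaling are illustrative; this is not the SimbaV2-compatible formulation):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)            # frozen random base matrix
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))   # zero-init: training starts at the base map
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(256, 256)
out = layer(torch.randn(4, 256))                          # works like an ordinary linear layer
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(out.shape, f"trainable fraction: {trainable / total:.3f}")
```

Confining the critic's updates to this low-dimensional subspace is the sparsity constraint the abstract credits for more stable bootstrapped training.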
[530] Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees
Xiaoyang Liu, Zineng Dong, Yifan Bai, Yantao Li, Yuntian Liu, Tao Luo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code will be released to the public soon.
[531] Guiding Distribution Matching Distillation with Gradient-Based Reinforcement Learning
Linwei Dong, Ruoyu Guo, Ge Bai, Zehuan Yuan, Yawei Luo, Changqing Zou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Diffusion distillation, exemplified by Distribution Matching Distillation (DMD), has shown great promise in few-step generation but often sacrifices quality for sampling speed. While integrating Reinforcement Learning (RL) into distillation offers potential, a naive fusion of these two objectives relies on suboptimal raw sample evaluation. This sample-based scoring creates inherent conflicts with the distillation trajectory and produces unreliable rewards due to the noisy nature of early-stage generation. To overcome these limitations, we propose GDMD, a novel framework that redefines the reward mechanism by prioritizing distillation gradients over raw pixel outputs as the primary signal for optimization. By reinterpreting the DMD gradients as implicit target tensors, our framework enables existing reward models to directly evaluate the quality of distillation updates. This gradient-level guidance functions as an adaptive weighting that synchronizes the RL policy with the distillation objective, effectively neutralizing optimization divergence. Empirical results show that GDMD sets a new SOTA for few-step generation. Specifically, our 4-step models outperform the quality of their multi-step teacher and substantially exceed previous DMDR results in GenEval and human-preference metrics, exhibiting strong scalability potential.
[532] Accelerating trajectory optimization with Sobolev-trained diffusion policies
Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by $2\times$ to $20 \times$. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
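A minimal sketch of a Sobolev-style imitation loss of the kind described above: match the expert action and also the expert's feedback gains, i.e., the Jacobian of the action with respect to the state that a gradient-based TO solver returns (the tiny policy and random "expert" data are placeholders; the paper's policy is a diffusion model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
state_dim, act_dim = 6, 2
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

x = torch.randn(state_dim)
a_expert = torch.randn(act_dim)             # locally optimal action from the solver
K_expert = torch.randn(act_dim, state_dim)  # feedback gains from the solver

for _ in range(50):
    a = policy(x)
    J = torch.autograd.functional.jacobian(policy, x, create_graph=True)  # (act_dim, state_dim)
    loss = ((a - a_expert) ** 2).mean() + 0.1 * ((J - K_expert) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

Matching first-order information around each demonstration is what counteracts the compounding errors that arise when the policy drifts slightly off the locally optimal trajectories.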
[533] FedProxy: Federated Fine-Tuning of LLMs via Proxy SLMs and Heterogeneity-Aware Fusion
Tao Fan, Guoqiang Ma, Yuanfeng Song, Lixin Fan, Kai Chen, Qiang Yang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Federated fine-tuning of Large Language Models (LLMs) is obstructed by a trilemma of challenges: protecting the LLM's intellectual property (IP), ensuring client privacy, and mitigating performance loss on heterogeneous data. Existing methods like Offsite-Tuning (OT) secure the LLM's IP by having clients train only lightweight adapters, yet our analysis reveals they suffer from a fundamental performance bottleneck, leaving a significant gap compared to centralized training. To bridge this gap, we introduce FedProxy, a new federated adaptation framework. FedProxy replaces weak adapters with a unified, powerful Proxy Small Language Model (SLM), compressed from the proprietary LLM, to serve as a high-fidelity surrogate for collaborative fine-tuning. Our framework systematically resolves the trilemma through a three-stage architecture: (i) Efficient Representation via server-guided compression to create a resource-friendly proxy; (ii) Robust Optimization through an interference-mitigating aggregation strategy to handle data heterogeneity; and (iii) Effortless Fusion via a training-free “plug-in” mechanism to integrate learned knowledge back into the LLM. Experiments show FedProxy significantly outperforms OT methods and approaches centralized performance, establishing a new benchmark for secure and high-performance federated LLM adaptation.
[534] Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Julian Skifstad, Xinyue Annie Yang, Glen Chou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Inference-time LLM alignment methods, particularly activation steering, offer an alternative to fine-tuning by directly modifying activations during generation. Existing methods, however, often rely on non-anticipative interventions that ignore how perturbations propagate through transformer layers and lack online error feedback, resulting in suboptimal, open-loop control. To address this, we show empirically that, despite the nonlinear structure of transformer blocks, layer-wise dynamics across multiple LLM architectures and scales are well-approximated by locally-linear models. Exploiting this property, we model LLM inference as a linear time-varying dynamical system and adapt the classical linear quadratic regulator to compute feedback controllers using layer-wise Jacobians, steering activations toward desired semantic setpoints in closed-loop with minimal computational overhead and no offline training. We also derive theoretical bounds on setpoint tracking error, enabling formal guarantees on steering performance. Using a novel adaptive semantic feature setpoint signal, our method yields robust, fine-grained behavior control across models, scales, and tasks, including state-of-the-art modulation of toxicity, truthfulness, refusal, and arbitrary concepts, surpassing baseline steering methods. Our code is available at: https://github.com/trustworthyrobotics/lqr-activation-steering
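A minimal sketch of the control primitive the abstract adapts: a finite-horizon discrete-time LQR solved by the backward Riccati recursion. In the paper the time-varying matrices come from layer-wise Jacobians of the locally linearized transformer; here they are random stand-ins and the setpoint is the origin:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, H = 8, 3, 12                         # state dim, control dim, horizon (layers)
A = [np.eye(n) + 0.05 * rng.standard_normal((n, n)) for _ in range(H)]
B = [0.1 * rng.standard_normal((n, m)) for _ in range(H)]
Q, R = np.eye(n), 0.1 * np.eye(m)

# Backward pass: value matrices P_t and feedback gains K_t.
P = Q.copy()
K = [None] * H
for t in reversed(range(H)):
    G = R + B[t].T @ P @ B[t]
    K[t] = np.linalg.solve(G, B[t].T @ P @ A[t])
    P = Q + A[t].T @ P @ (A[t] - B[t] @ K[t])

# Forward pass: steer the state toward the setpoint in closed loop.
x = rng.standard_normal(n)
for t in range(H):
    u = -K[t] @ x                          # closed-loop control applied at layer t
    x = A[t] @ x + B[t] @ u
print(np.linalg.norm(x))
```

The feedback structure (u depends on the current state, not on a precomputed schedule) is what distinguishes this from open-loop steering vectors.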
[535] FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
Pingwei Sun, Yuxuan Hu, Jianchao Tan, Xue Wang, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Linear attention mechanisms have emerged as promising alternatives to softmax attention, offering linear-time complexity during inference. Recent advances such as Gated DeltaNet (GDN) and Kimi Delta Attention (KDA) have demonstrated that the delta rule, an online gradient descent update, enables superior associative recall compared to simple additive updates. While KDA refined the coarse head-wise decay gate into channel-wise decay, the learning rate $\beta_t$ in the delta update remains a scalar, limiting the model’s capacity for dimension-specific adaptation. We introduce FG$^2$-GDN, which replaces the scalar $\beta_t$ with a channel-wise vector analogous to the transition from SGD to per-coordinate adaptive optimizers such as AdaGrad and Adam. We further propose FG$^2$-GDN+, which decouples the scaling for keys and values, enabling independent control of erasure strength and write strength. Experiments on synthetic and real-world benchmarks show that FG$^2$-GDN and its variant improve associative recall and long-context understanding over GDN and KDA, with comparable computational efficiency.
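A minimal sketch of the update being refined: a gated delta-rule write into a fast-weight memory, with the learning rate promoted from a scalar to a channel-wise vector over the value dimensions (shapes, the simple scalar decay gate, and the random inputs are illustrative, not the paper's parameterization):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 16, 16, 5
S = np.zeros((d_v, d_k))                        # fast-weight memory state

for t in range(T):
    k = rng.standard_normal(d_k); k /= np.linalg.norm(k)
    v = rng.standard_normal(d_v)
    alpha = 0.95                                # decay gate (scalar here for brevity)
    beta = rng.uniform(0.1, 0.9, size=d_v)      # channel-wise learning rate (vector, not scalar)

    S = alpha * S                               # forget
    pred = S @ k                                # current readout for this key
    S = S + np.outer(beta * (v - pred), k)      # per-channel delta-rule write

q = rng.standard_normal(d_k)
print((S @ q).shape)                            # output for a query
```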
[536] Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback
Qiang Liu, Adrienne Kline, Ermin Wei
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Safe Reinforcement Learning from Human Feedback (Safe RLHF) has recently achieved empirical success in developing helpful and harmless large language models by decoupling human preferences regarding helpfulness and harmlessness. Existing approaches typically rely on fitting fixed horizon reward models from human feedback and have only been validated empirically. In this paper, we formulate safe RLHF as an infinite horizon discounted Constrained Markov Decision Process (CMDP), since humans may interact with the model over a continuing sequence of interactions rather than within a single finite episode. We propose two Safe RLHF algorithms that do not require reward model fitting and, in contrast to prior work assuming fixed-length trajectories, support flexible trajectory lengths for training. Both algorithms are based on the primal-dual method and achieve global convergence guarantees with polynomial rates in terms of policy gradient iterations, trajectory sample lengths, and human preference queries. To the best of our knowledge, this is the first work to study infinite horizon discounted CMDP under human feedback and establish global, non-asymptotic convergence.
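A minimal sketch of the dual half of a generic primal-dual loop of the kind the abstract builds on: the policy is trained on a Lagrangian (helpfulness reward minus lambda times harmfulness cost), while lambda itself follows projected gradient ascent on the estimated constraint violation (the numbers below are toy stand-ins for sampled-trajectory estimates, not the paper's algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, budget, lr_dual = 0.0, 0.2, 0.05

for it in range(100):
    # Stand-ins for Monte Carlo estimates from responses sampled under the current policy.
    reward_est = 1.0 - 0.3 * lam + 0.05 * rng.standard_normal()
    cost_est = max(0.0, 0.6 - 0.4 * lam) + 0.05 * rng.standard_normal()

    # Primal step (not shown): policy-gradient ascent on reward_est - lam * cost_est.
    # Dual step: projected gradient ascent on the constraint violation.
    lam = max(0.0, lam + lr_dual * (cost_est - budget))

print(f"final dual variable: {lam:.2f}")
```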
[537] Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
Jeongwhan Choi, Jongwoo Kim, Woosung Kang, Noseong Park
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: One of the most challenging problems in graph machine learning is generalizing across graphs with diverse properties. Graph neural networks (GNNs) face a fundamental limitation: they require separate training for each new graph, preventing universal generalization across diverse graph datasets. A critical challenge facing GNNs lies in their reliance on labeled training data for each individual graph, a requirement that hinders the capacity for universal node classification due to the heterogeneity inherent in graphs – differences in homophily levels, community structures, and feature distributions across datasets. Inspired by the success of large language models (LLMs) that achieve in-context learning through massive-scale pre-training on diverse datasets, we introduce NodePFN. This universal node classification method generalizes to arbitrary graphs without graph-specific training. NodePFN learns posterior predictive distributions (PPDs) by training only on thousands of synthetic graphs generated from carefully designed priors. Our synthetic graph generation covers real-world graphs through the use of random networks with controllable homophily levels and structural causal models for complex feature-label relationships. We develop a dual-branch architecture combining context-query attention mechanisms with local message passing to enable graph-aware in-context learning. Extensive evaluation on 23 benchmarks demonstrates that a single pre-trained NodePFN achieves 71.27 average accuracy. These results validate that universal graph learning patterns can be effectively learned from synthetic priors, establishing a new paradigm for generalization in node classification.
[538] Intentional Updates for Streaming Reinforcement Learning
Arsalan Sharifnassab, Mohamed Elsayed, Kris De Asis, A. Rupam Mahmood, Richard S. Sutton
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In gradient-based learning, a step size chosen in parameter units does not produce a predictable per-step change in function output. This often leads to instability in the streaming setting (i.e., batch size=1), where stochasticity is not averaged out and update magnitudes can momentarily become arbitrarily big or small. Instead, we propose intentional updates: first specify the intended outcome of an update and then solve for the step size that approximately achieves it. This strategy has precedent in online supervised linear regression via Normalized Least Mean Squares algorithm, which selects a step size to yield a specified change in the function output proportional to the current error. We extend this principle to streaming deep reinforcement learning by defining appropriate intended outcomes: Intentional TD aims for a fixed fractional reduction of the TD error, and Intentional Policy Gradient aims for a bounded per-step change in the policy, limiting local KL divergence. We propose practical algorithms combining eligibility traces and diagonal scaling. Empirically, these methods yield state-of-the-art streaming performance, frequently performing on par with batch and replay-buffer approaches.
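A minimal sketch of the precedent the abstract cites, Normalized Least Mean Squares for streaming linear regression: the step size is solved so that each update removes a fixed fraction mu of the current error in function-output units (dimensions and the noiseless target are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, mu, eps = 10, 0.5, 1e-8
w_true = rng.standard_normal(d)
w = np.zeros(d)

for t in range(500):                       # streaming: one sample per step
    x = rng.standard_normal(d)
    y = w_true @ x
    err = y - w @ x
    w += mu * err * x / (x @ x + eps)      # step size chosen so the new error is (1 - mu) * err

print(np.linalg.norm(w - w_true))
```

Intentional TD and Intentional Policy Gradient extend this "specify the intended outcome, then solve for the step size" idea to TD errors and bounded per-step policy changes, respectively.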
[539] Age-Dependent Heterogeneity in the Association Between Physical Activity and Mental Distress: A Causal Machine Learning Analysis of 3.2 Million U.S. Adults
Yuan Shan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Physical activity (PA) is widely recognized as protective against mental distress, yet whether this benefit varies systematically across population subgroups remains poorly understood. Using pooled data from ten consecutive annual waves of the U.S. Behavioral Risk Factor Surveillance System (2015-2024; n = 3,242,218), we investigate heterogeneity in the association between leisure-time PA and frequent mental distress (FMD, >=14 days/month) across age groups. Survey-weighted logistic regression reveals a striking age gradient: the adjusted odds ratio for PA ranges from 0.89 among young adults (18-24) to 0.50 among adults aged 55-64, with the protective association strengthening monotonically with age. Temporal analysis across all ten years shows that the young-adult PA effect has been eroding over the past decade, with the 18-24 OR reaching 1.01 (null) in both 2018 and 2024 – paralleling the deepening youth mental health crisis. Causal Forest via Double Machine Learning independently identifies age as the dominant driver of treatment effect heterogeneity (feature importance = 0.39, 2.5x the next predictor). E-value sensitivity analysis, propensity score overlap checks, placebo tests, and imputation comparisons confirm the robustness of the findings. These results suggest that the well-documented exercise–mental health link may not generalize to the youngest adult population, whose distress appears increasingly driven by stressors that PA alone cannot mitigate.
[540] S2MAM: Semi-supervised Meta Additive Model for Robust Estimation and Variable Selection
Xuelin Zhang, Hong Chen, Yingjie Wang, Tieliang Gong, Bin Gu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Semi-supervised learning with manifold regularization is a classical framework for jointly learning from both labeled and unlabeled data, where the key requirement is that the support of the unknown marginal distribution has the geometric structure of a Riemannian manifold. Typically, the Laplace-Beltrami operator-based manifold regularization can be approximated empirically by the Laplacian regularization associated with the entire training data and its corresponding graph Laplacian matrix. However, the graph Laplacian matrix depends heavily on the prespecified similarity metric and may lead to inappropriate penalties when dealing with redundant or noisy input variables. To address the above issues, this paper proposes a new Semi-Supervised Meta Additive Model (S$^2$MAM) based on a bilevel optimization scheme that automatically identifies informative variables, updates the similarity matrix, and simultaneously achieves interpretable predictions. Theoretical guarantees are provided for S$^2$MAM, including computational convergence and a statistical generalization bound. Experimental assessments across 4 synthetic and 12 real-world datasets, with varying levels and categories of corruption, validate the robustness and interpretability of the proposed approach.
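For readers unfamiliar with the empirical Laplacian penalty the abstract builds on, here is its standard form; the similarity matrix W and predictions f are placeholders, and the bilevel update of W that S$^2$MAM performs is not shown.

```python
import numpy as np

def laplacian_penalty(f, W):
    """Graph Laplacian regularizer f^T L f = 0.5 * sum_ij W_ij (f_i - f_j)^2.

    f : (n,) predictions on labeled + unlabeled points
    W : (n, n) symmetric similarity matrix
    """
    L = np.diag(W.sum(axis=1)) - W  # unnormalized graph Laplacian
    return float(f @ L @ f)

# equivalently, the pairwise form:
# 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
```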
[541] Robust Continual Unlearning against Knowledge Erosion and Forgetting Reversal
Eun-Ju Park, Youjin Shin, Simon S. Woo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: As a means to balance the growth of the AI industry with the need for privacy protection, machine unlearning plays a crucial role in realizing the "right to be forgotten" in artificial intelligence. This technique enables AI systems to remove the influence of specific data while preserving the rest of the learned knowledge. Although it has been actively studied, most existing unlearning methods assume that unlearning is performed only once. In this work, we evaluate existing unlearning algorithms in a more realistic scenario where unlearning is conducted repeatedly, and in this setting, we identify two critical phenomena: (1) Knowledge Erosion, where the accuracy on retain data progressively degrades over unlearning phases, and (2) Forgetting Reversal, where previously forgotten samples become recognizable again in later phases. To address these challenges, we propose SAFER (StAbility-preserving Forgetting with Effective Regularization), a continual unlearning framework that maintains representation stability for retain data while enforcing negative logit margins for forget data. Extensive experiments show that SAFER mitigates not only knowledge erosion but also forgetting reversal, achieving stable performance across multiple unlearning phases.
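An illustrative composition of the two terms the abstract mentions, not the authors' exact objective: an MSE stability term against a frozen anchor model on retain data, plus a hinge-style penalty pushing the forget-class logit margin negative. The `features` method and the weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def continual_unlearning_loss(model, anchor_model, retain_x, forget_x, forget_y,
                              margin=1.0, lam=1.0):
    """SAFER-style sketch: keep retain-set representations close to a frozen
    anchor while driving the forget-set logit margin negative."""
    # representation stability on retain data (features() is a hypothetical hook
    # returning penultimate-layer representations)
    with torch.no_grad():
        anchor_feats = anchor_model.features(retain_x)
    stability = F.mse_loss(model.features(retain_x), anchor_feats)

    # negative logit margin on forget data: the true-class logit should fall
    # below the best competing class by at least `margin`
    logits = model(forget_x)
    true_logit = logits.gather(1, forget_y.unsqueeze(1)).squeeze(1)
    other_logit = logits.scatter(1, forget_y.unsqueeze(1), float("-inf")).max(dim=1).values
    forget_loss = F.relu(true_logit - other_logit + margin).mean()

    return stability + lam * forget_loss
```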
[542] LLMs Know They’re Wrong and Agree Anyway: The Shared Sycophancy-Lying Circuit
Manav Pandey
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: When a language model agrees with a user’s false belief, is it failing to detect the error, or noticing and agreeing anyway? We show the latter. Across twelve open-weight models from five labs, spanning small to frontier scale, the same small set of attention heads carries a “this statement is wrong” signal whether the model is evaluating a claim on its own or being pressured to agree with a user. Silencing these heads flips sycophantic behavior sharply while leaving factual accuracy intact, so the circuit controls deference rather than knowledge. Edge-level path patching confirms that the same head-to-head connections drive sycophancy, factual lying, and instructed lying. Opinion-agreement, where no factual ground truth exists, reuses these head positions but writes into an orthogonal direction, ruling out a simple “truth-direction” reading of the substrate. Alignment training leaves this circuit in place: an RLHF refresh cuts sycophantic behavior roughly tenfold while the shared heads persist or grow, a pattern that replicates on an independent model family and under targeted anti-sycophancy DPO. When these models behave sycophantically, they register that the user is wrong and agree anyway.
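One common way to "silence" attention heads is to zero their contribution before the attention output projection via a forward pre-hook; a sketch follows. Module paths, the head indices, and the assumed layout of the projection input are all hypothetical and vary across model families, so this is an illustration of the intervention, not the authors' code.

```python
import torch

def make_head_silencer(head_indices, n_heads):
    """Forward pre-hook for an attention output projection (e.g. `o_proj`):
    its input is the concatenation of per-head outputs, so zeroing slices of it
    removes those heads' contribution to the residual stream."""
    def pre_hook(module, args):
        hidden = args[0]                                # (batch, seq, n_heads * head_dim), assumed layout
        b, s, d = hidden.shape
        head_dim = d // n_heads
        hidden = hidden.view(b, s, n_heads, head_dim).clone()
        hidden[:, :, head_indices, :] = 0.0
        return (hidden.view(b, s, d),) + args[1:]
    return pre_hook

# usage sketch (layer index and attribute names are hypothetical):
# handle = model.model.layers[12].self_attn.o_proj.register_forward_pre_hook(
#     make_head_silencer(head_indices=[3, 7], n_heads=32))
# ... run sycophancy evaluation ...
# handle.remove()
```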
[543] RL-ABC: Reinforcement Learning for Accelerator Beamline Control
Anwar Ibrahim, Fedor Ratnikov, Maxim Kaledin, Alexey Petrenko, Denis Derkach
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Particle accelerator beamline optimization is a high-dimensional control problem traditionally requiring significant expert intervention. We present RLABC (Reinforcement Learning for Accelerator Beamline Control), an open-source Python framework that automatically transforms standard Elegant beamline configurations into reinforcement learning environments. RLABC integrates with the widely-used Elegant beam dynamics simulation code via SDDS-based interfaces, enabling researchers to apply modern RL algorithms to beamline optimization with minimal RL-specific development. The main contribution is a general methodology for formulating beamline tuning as a Markov decision process: RLABC automatically preprocesses lattice files to insert diagnostic watch points before each tunable element, constructs a 57-dimensional state representation from beam statistics, covariance information, and aperture constraints, and provides a configurable reward function for transmission optimization. The framework supports multiple RL algorithms through Stable-Baselines3 compatibility and implements stage learning strategies for improved training efficiency. Validation on a test beamline derived from the VEPP-5 injection complex (37 control parameters across 11 quadrupoles and 4 dipoles) demonstrates that the framework successfully enables RL-based optimization, with a Deep Deterministic Policy Gradient agent achieving 70.3% particle transmission – performance matching established methods such as differential evolution. The framework’s stage learning capability allows decomposition of complex optimization problems into manageable subproblems, improving training efficiency. The complete framework, including configuration files and example notebooks, is available as open-source software to facilitate adoption and further research.
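To make the RL-facing shape concrete, here is a toy stand-in with the dimensions quoted in the abstract (57-dim state, continuous magnet-strength actions, transmission as reward) trained with Stable-Baselines3 DDPG. The simulator call is a placeholder; the real framework drives Elegant through SDDS files, and the class and reward below are not RLABC's implementation.

```python
import numpy as np
import gymnasium as gym
from stable_baselines3 import DDPG

class ToyBeamlineEnv(gym.Env):
    """Toy RLABC-style environment: 57-dim beam-statistics state, continuous
    magnet-strength actions, transmission as reward (placeholder physics)."""
    def __init__(self, n_controls=37, state_dim=57):
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (state_dim,), np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, (n_controls,), np.float32)

    def _simulate(self, action):
        # placeholder: the real framework runs Elegant and reads watch-point statistics
        transmission = float(np.clip(1.0 - 0.1 * np.linalg.norm(action), 0.0, 1.0))
        state = np.random.randn(self.observation_space.shape[0]).astype(np.float32)
        return state, transmission

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        state, _ = self._simulate(np.zeros(self.action_space.shape))
        return state, {}

    def step(self, action):
        state, transmission = self._simulate(action)
        return state, transmission, True, False, {}   # one-shot episodes for simplicity

model = DDPG("MlpPolicy", ToyBeamlineEnv(), verbose=0)
model.learn(total_timesteps=1_000)
```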
[544] Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Weijie Zhao, Mingquan Liu, Bolun Wang, Simo Wu, Nuobei Xie, Rui-Jie Zhu, Peng Zhou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Scaling Transformers typically necessitates training larger models from scratch, as standard architectures struggle to expand without discarding learned representations. We identify the primary bottleneck in the attention mechanism’s linear projections, which strictly confine feature extraction to fixed-dimensional subspaces, limiting both expressivity and incremental capacity. To address this, we introduce Nexusformer, which replaces linear $Q/K/V$ projections with a Nexus-Rank layer, a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces. This design overcomes the linearity constraint and enables lossless structured growth: new capacity can be injected along two axes via zero-initialized blocks that preserve pretrained knowledge. Experiments on language modeling and reasoning benchmarks demonstrate that Nexusformer matches Tokenformer’s perplexity using up to 41.5% less training compute during progressive scaling (240M to 440M). Furthermore, our analysis of growth dynamics reveals that zero initialization induces a stable convergence trajectory, allowing us to derive a geometric scaling law that accurately predicts performance across expansion scales.
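The Nexus-Rank layer itself is not specified in the abstract; the sketch below is one plausible reading of "a three-stage nonlinear mapping driven by dual activations in progressively higher dimensional spaces" with a zero-initialized growth block, and should not be taken as the authors' architecture.

```python
import torch
import torch.nn as nn

class NonlinearProjection(nn.Module):
    """Sketch of a three-stage nonlinear Q/K/V projection with two activations
    in a higher-dimensional intermediate space, plus a zero-initialized block
    that adds capacity without changing the function at initialization."""
    def __init__(self, d_model, d_head, expand=4):
        super().__init__()
        d_mid = expand * d_model
        self.up = nn.Linear(d_model, d_mid)
        self.mid = nn.Linear(d_mid, d_mid)
        self.down = nn.Linear(d_mid, d_head)
        self.act1, self.act2 = nn.SiLU(), nn.GELU()
        self.growth = nn.Linear(d_model, d_head, bias=False)
        nn.init.zeros_(self.growth.weight)   # preserves pretrained behavior when added

    def forward(self, x):
        h = self.act2(self.mid(self.act1(self.up(x))))
        return self.down(h) + self.growth(x)
```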
[545] SAW-INT4: System-Aware 4-Bit KV-Cache Quantization for Real-World LLM Serving
Jinda Jia, Jisen Li, Zhongzhu Zhou, Jung Hwan Heo, Jue Wang, Tri Dao, Shuaiwen Leon Song, Ben Athiwaratkun, Chenfeng Xu, Tianyi Zhang, Xiaoxia Wu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: KV-cache memory is a major bottleneck in real-world LLM serving, where systems must simultaneously support latency-sensitive small-batch requests and high-throughput concurrent workloads. Although many KV-cache compression methods improve offline accuracy or compression ratio, they often violate practical serving constraints such as paged memory layouts, regular memory access, and fused attention execution, limiting their effectiveness in deployment. In this work, we identify the minimal set of 4-bit KV-cache quantization methods that remain viable under these constraints. Our central finding is that a simple design (token-wise INT4 quantization with block-diagonal Hadamard rotation) consistently achieves the best accuracy-efficiency trade-off. Across multiple models and benchmarks, this approach recovers nearly all of the accuracy lost by naive INT4, while more complex methods such as vector quantization and Hessian-aware quantization provide only marginal additional gains once serving compatibility is taken into account. To make this practical, we implement a fused rotation-quantization kernel that integrates directly into paged KV-cache layouts and introduces zero measurable end-to-end overhead, matching plain INT4 throughput across concurrency levels. Our results show that effective KV-cache compression is fundamentally a systems co-design problem: under real serving constraints, lightweight block-diagonal Hadamard rotation is a viable method that delivers near-lossless accuracy without sacrificing serving efficiency.
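A reference-style sketch of the recipe named in the abstract: a block-diagonal Hadamard rotation followed by symmetric per-token INT4 quantization. The block size, scaling scheme, and dequantization path are assumptions for illustration; the paper's fused kernel is not reproduced here.

```python
import numpy as np
from scipy.linalg import hadamard

def rotate_blockwise(x, block=64):
    """Apply a normalized Hadamard rotation independently to each `block`-sized
    chunk of the last dimension (a block-diagonal Hadamard transform)."""
    H = hadamard(block) / np.sqrt(block)          # orthogonal and symmetric
    xr = x.reshape(*x.shape[:-1], -1, block)
    return (xr @ H).reshape(x.shape)

def quantize_int4_tokenwise(x):
    """Symmetric per-token INT4 quantization: one scale per token (row)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0 + 1e-12
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

# sketch: quantize a (num_tokens, head_dim) slice of the key cache
keys = np.random.randn(128, 128).astype(np.float32)
q, scale = quantize_int4_tokenwise(rotate_blockwise(keys, block=64))
keys_dequant = rotate_blockwise(q * scale, block=64)   # H is its own inverse
```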
[546] LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
Siqing Song, Chuang Wang, Yong Lang, Yi Yang, Xu-Yao Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization through a novel three-stage quantization strategy. The framework proceeds as follows: (1) initialize a high-quality quantized model via PTQ; (2) quantize binarized weights, group-wise bitmaps, and quantization parameters through layer-wise distillation while keeping activations in full precision; and (3) train learnable activation quantization factors to dynamically quantize activations to 4 bits. This decoupled design mitigates interference between weight and activation quantization, yielding greater training stability and better inference accuracy. LBLLM, trained on only 0.016B tokens using a single GPU, surpasses existing state-of-the-art binarization methods on W2A4 quantization settings across tasks of language modeling, commonsense QA, and language understanding. These results demonstrate that extreme low-bit quantization of LLMs can be both practical and highly effective without introducing any extra high-precision channels or rotational matrices commonly used in recent PTQ-based works, offering a promising path toward efficient LLM deployment in resource-limited situations.
[547] FOCAL-Attention for Heterogeneous Multi-Label Prediction
Chenghao Zhang, Qingqing Long, Ludi Wang, Wenjuan Cui, Jianjun Yu, Yi Du
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Heterogeneous graphs have attracted increasing attention for modeling multi-typed entities and relations in complex real-world systems. Multi-label node classification on heterogeneous graphs is challenging due to structural heterogeneity and the need to learn shared representations across multiple labels. Existing methods typically adopt either flexible attention mechanisms or meta-path constrained anchoring, but in heterogeneous multi-label prediction they often suffer from semantic dilution or coverage constraints. Both issues are further amplified under multi-label supervision. We present a theoretical analysis showing that as heterogeneous neighborhoods expand, the attention mass allocated to task-critical (primary) neighborhoods diminishes, and that meta-path constrained aggregation exhibits a dilemma: too few meta-paths intensify the coverage constraint, while too many re-introduce dilution. To resolve this coverage-anchoring conflict, we propose FOCAL: Fusion Of Coverage and Anchoring Learning, with two components: coverage-oriented attention (COA) for flexible, unconstrained heterogeneous context aggregation, and anchoring-oriented attention (AOA) that restricts aggregation to meta-path-induced primary semantics. Our theoretical analysis and experimental results further indicate that FOCAL outperforms other state-of-the-art methods.
[548] Inductive Subgraphs as Shortcuts: Causal Disentanglement for Heterophilic Graph Learning
Xiangmeng Wang, Qian Li, Haiyang Xia, Hao Miao, Qing Li, Guandong Xu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Heterophily is a prevalent property of real-world graphs and is well known to impair the performance of homophilic Graph Neural Networks (GNNs). Prior work has attempted to adapt GNNs to heterophilic graphs through non-local neighbor extension or architecture refinement. However, the fundamental reasons behind misclassifications remain poorly understood. In this work, we take a novel perspective by examining recurring inductive subgraphs, empirically and theoretically showing that they act as spurious shortcuts that mislead GNNs and reinforce non-causal correlations in heterophilic graphs. To address this, we adopt a causal inference perspective to analyze and correct the biased learning behavior induced by shortcut inductive subgraphs. We propose a debiased causal graph that explicitly blocks confounding and spillover paths responsible for these shortcuts. Guided by this causal graph, we introduce Causal Disentangled GNN (CD-GNN), a principled framework that disentangles spurious inductive subgraphs from true causal subgraphs by explicitly blocking non-causal paths. By focusing on genuine causal signals, CD-GNN substantially improves the robustness and accuracy of node classification in heterophilic graphs. Extensive experiments on real-world datasets not only validate our theoretical findings but also demonstrate that our proposed CD-GNN outperforms state-of-the-art heterophily-aware baselines.
[549] The Logical Expressiveness of Topological Neural Networks
Amirreza Akbari, Amauri H. Souza, Vikas Garg
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Graph neural networks (GNNs) are the standard for learning on graphs, yet they have limited expressive power, often expressed in terms of the Weisfeiler-Leman (WL) hierarchy or within the framework of first-order logic. In this context, topological neural networks (TNNs) have recently emerged as a promising alternative for graph representation learning. By incorporating higher-order relational structures into message-passing schemes, TNNs offer higher representational power than traditional GNNs. However, a fundamental question remains open: what is the logical expressiveness of TNNs? Answering this allows us to characterize precisely which binary classifiers TNNs can represent. In this paper, we address this question by analyzing isomorphism tests derived from the underlying mechanisms of general TNNs. We introduce and investigate the power of higher-order variants of WL-based tests for combinatorial complexes, called the $k$-CCWL test. In addition, we introduce the topological counting logic (TC$_k$), an extension of standard counting logic featuring a novel pairwise counting quantifier $\exists^{N}(x_i,x_j)\,\varphi(x_i,x_j)$, which explicitly quantifies pairs $(x_i, x_j)$ satisfying property $\varphi$. We rigorously prove the exact equivalence $k\text{-CCWL} \equiv \text{TC}_{k+2} \equiv \text{Topological }(k{+}2)\text{-pebble game}$. These results establish a logical expressiveness theory for TNNs.
[550] TEMPO: Scaling Test-time Training for Large Reasoning Models
Qingyang Zhang, Xinke Kong, Haitao Wu, Qinghua Hu, Minghao Wu, Baosong Yang, Yu Cheng, Yun Luo, Ganqu Cui, Changqing Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Test-time training (TTT) adapts model parameters on unlabeled test instances during inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for LRMs plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
[551] Debiased neural operators for estimating functionals
Konstantin Hess, Dennis Frauen, Niki Kilbertus, Stefan Feuerriegel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Neural operators are widely used to approximate solution maps of complex physical systems. In many applications, however, the goal is not to recover the full solution trajectory, but to summarize the solution trajectory via a scalar target quantity (e.g., a functional such as time spent in a target range, time above a threshold, accumulated cost, or total energy). In this paper, we introduce DOPE (debiased neural operator): a semiparametric estimator for such target quantities of solution trajectories obtained from neural operators. DOPE is broadly applicable to settings with both partial and irregular observations and can be combined with arbitrary neural operator architectures. We make three main contributions. (1) We show that, in contrast to DOPE, naive plug-in estimation can suffer from first-order bias. (2) To address this, we derive a novel one-step, Neyman-orthogonal estimator that treats the neural operator as a high-dimensional nuisance mapping between function spaces, and removes the leading bias term. For this, DOPE uses a weighting mechanism that simultaneously accounts for irregular observation designs and for how sensitive the target quantity is to perturbations of the underlying trajectory. (3) To learn the weights, we extend automatic debiased machine learning to operator-valued nuisances via Riesz regression. We demonstrate the benefits of DOPE across various numerical experiments.
[552] On the Conditioning Consistency Gap in Conditional Neural Processes
Robin Young
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Neural processes are meta-learning models that map context sets to predictive distributions. While inspired by stochastic processes, NPs do not generally satisfy the Kolmogorov consistency conditions required to define a valid stochastic process. This inconsistency is widely acknowledged but poorly understood. Practitioners note that NPs work well despite the violation, without quantifying what this means. We address this gap by defining the conditioning consistency gap, a KL divergence measuring how much a conditional neural process’s (CNP) predictions change when a point is added to the context versus conditioned upon. Our main results show that for CNPs with bounded encoders and Lipschitz decoders, the consistency gap is $O(1/n^2)$ in context size $n$, and that this rate is tight. These bounds establish the precise sense in which CNPs approximate valid stochastic processes. The inconsistency is negligible for moderate context sizes but can be significant in the few-shot regime.
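For CNPs with Gaussian outputs, the conditioning consistency gap is a KL divergence between two Gaussian predictive distributions, which has a closed form. The sketch below shows that closed form; the two model calls in the usage comment are hypothetical placeholders for the "added to the context" and "conditioned upon" routes described in the abstract.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

# gap at a target x_star, with hypothetical model calls:
# mu_a, var_a = cnp(context + [(x_new, y_new)], x_star)            # point added to the context
# mu_b, var_b = condition_predictive(cnp, context, x_new, y_new, x_star)  # point conditioned upon
# gap = gaussian_kl(mu_a, var_a, mu_b, var_b)
```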
[553] RDP LoRA: Geometry-Driven Identification for Parameter-Efficient Adaptation in Large Language Models
Yusuf Çelebi, Yağız Asker, Özay Ezerceli, Mahmoud ElHussieni, Selva Taş, Reyhan Bayraktar, Fatma Betül Terzioğlu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Fine-tuning Large Language Models (LLMs) remains structurally uncertain despite parameter-efficient methods such as Low-Rank Adaptation (LoRA), as the layer-specific roles of internal representations are poorly understood, leading to heuristic decisions about where adaptation should be applied. We model the evolution of hidden states as a high-dimensional geometric trajectory and propose using the Ramer-Douglas-Peucker (RDP) algorithm, a parameter-free and training-free polygon simplification method that preserves global structural transitions while eliminating locally redundant changes, to identify critical breakpoints along the representation path. Crucially, we use these geometric pivots not merely for analysis, but as a direct decision signal for determining which layers should be adapted during parameter-efficient fine-tuning. By integrating this geometry-aware layer selection strategy into LoRA fine-tuning of Qwen3-8B-Base, we achieve superior performance on MMLU-Math using only 13 RDP-selected layers (81.67%), significantly outperforming both full 36-layer adaptation (79.32%) and random 13-layer selection (75.56%), as well as the baseline Qwen3-8B-Base model (74.25%). These results demonstrate that leveraging the intrinsic geometry of representation trajectories provides a robust, interpretable, and training-free signal for optimizing layer selection during model adaptation.
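A minimal Ramer-Douglas-Peucker routine on a sequence of per-layer hidden-state vectors, showing how geometric breakpoints could be turned into a layer-selection signal. The pooling of hidden states, the tolerance `eps`, and the trajectory shown are placeholders, not the paper's configuration.

```python
import numpy as np

def rdp_breakpoints(points, eps):
    """Ramer-Douglas-Peucker on a polyline of high-dimensional points.
    Returns indices of retained breakpoints (candidate layers to adapt)."""
    def perp_dist(p, a, b):
        d = b - a
        denom = np.dot(d, d)
        t = 0.0 if denom == 0 else np.clip(np.dot(p - a, d) / denom, 0.0, 1.0)
        return np.linalg.norm(p - (a + t * d))

    def recurse(lo, hi):
        if hi <= lo + 1:
            return [lo, hi]
        dists = [perp_dist(points[i], points[lo], points[hi]) for i in range(lo + 1, hi)]
        i_max = int(np.argmax(dists)) + lo + 1
        if dists[i_max - lo - 1] > eps:
            left, right = recurse(lo, i_max), recurse(i_max, hi)
            return left[:-1] + right        # drop duplicate split point
        return [lo, hi]

    return recurse(0, len(points) - 1)

# e.g. hidden_means[l] = mean-pooled hidden state after layer l for a probe batch
hidden_means = np.random.randn(36, 4096)          # placeholder trajectory
layers_to_adapt = rdp_breakpoints(hidden_means, eps=5.0)
```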
[554] Concept Inconsistency in Dermoscopic Concept Bottleneck Models: A Rough-Set Analysis of the Derm7pt Dataset
Gonzalo Nápoles, Isel Grau, Yamisleydi Salgueiro
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Concept Bottleneck Models (CBMs) route predictions exclusively through a clinically grounded concept layer, binding interpretability to concept-label consistency. When a dataset contains concept-level inconsistencies, identical concept profiles mapped to conflicting diagnosis labels create an unresolvable bottleneck that imposes a hard ceiling on achievable accuracy. In this paper, we apply rough set theory to the Derm7pt dermoscopy benchmark and characterize the full extent and clinical structure of this inconsistency. Among 305 unique concept profiles formed by the 7 dermoscopic criteria of the 7-point melanoma checklist, 50 (16.4%) are inconsistent, spanning 306 images (30.3% of the dataset). This yields a theoretical accuracy ceiling of 92.1%, independent of backbone architecture or training strategy for CBMs that exclusively operate with hard concepts. In addition, we characterize the conflict-severity distribution, identify the clinical features most responsible for boundary ambiguity, and evaluate two filtering strategies with quantified effects on dataset composition and CBM interpretability. Symmetric removal of all boundary-region images yields Derm7pt+, a fully consistent benchmark subset of 705 images with perfect quality of classification and no hard accuracy ceiling. Building on this filtered dataset, we present a hard CBM evaluated across 19 backbone architectures from the EfficientNet, DenseNet, ResNet, and Wide ResNet families. Under symmetric filtering, explored for completeness, EfficientNet-B5 achieves the best label F1 score (0.85) and label accuracy (0.90) on the held-out test set, with a concept accuracy of 0.70. Under asymmetric filtering, EfficientNet-B7 leads across all four metrics, reaching a label F1 score of 0.82 and concept accuracy of 0.70. These results establish reproducible baselines for concept-consistent CBM evaluation on dermoscopic data.
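The rough-set boundary region described here can be computed by grouping identical concept profiles and checking whether they map to more than one diagnosis; a pandas sketch follows. Column names are placeholders for the seven checklist criteria, not the dataset's actual field names.

```python
import pandas as pd

def boundary_region(df, concept_cols, label_col="diagnosis"):
    """Rough-set style check: a concept profile is inconsistent (boundary region)
    if identical concept values map to more than one diagnosis label."""
    n_labels = df.groupby(concept_cols)[label_col].transform("nunique")
    return df[n_labels > 1]

# placeholder column names for the 7-point checklist criteria
concepts = ["pigment_network", "streaks", "pigmentation", "regression",
            "dots_globules", "blue_whitish_veil", "vascular_structures"]
# inconsistent = boundary_region(derm7pt_df, concepts)
# derm7pt_plus = derm7pt_df.drop(inconsistent.index)   # symmetric filtering
```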
[555] When Active Learning Falls Short: An Empirical Study on Chemical Reaction Extraction
Simin Yu, Sufia Fathima
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The rapid growth of chemical literature has generated vast amounts of unstructured data, where reaction information is particularly valuable for applications such as reaction predictions and drug design. However, the prohibitive cost of expert annotation has led to a scarcity of training data, severely hindering the performance of automatic reaction extraction. In this work, we conduct a systematic study of active learning for chemical reaction extraction. We integrate six uncertainty- and diversity-based strategies with pretrained transformer-CRF architectures, and evaluate them on product extraction and role labeling tasks. While several methods approach full-data performance with fewer labeled instances, learning curves are often non-monotonic and task-dependent. Our analysis shows that strong pretraining, structured CRF decoding, and label sparsity limit the stability of conventional active learning strategies. These findings provide practical insights for the effective use of active learning in chemical information extraction.
[556] FedSEA: Achieving Benefit of Parallelization in Federated Online Learning
Harekrushna Sahu, Pratik Jawanpuria, Pranay Sharma
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Online federated learning (OFL) has emerged as a popular framework for decentralized decision-making over continuous data streams without compromising client privacy. However, the adversary model assumed in standard OFL typically precludes any potential benefits of parallelization. Further, it fails to adequately capture the different sources of statistical variation in OFL problems. In this paper, we extend the OFL paradigm by integrating a stochastically extended adversary (SEA). Under this framework, the loss function remains fixed across clients over time. However, the adversary dynamically and independently selects the data distribution for each client at each time. We propose the FedSEA algorithm to solve this problem, which utilizes online stochastic gradient descent at the clients, along with periodic global aggregation via the server. We establish bounds on the global network regret over a time horizon $T$ for two classes of functions: (1) for smooth and convex losses, we prove an $\mathcal{O}(\sqrt{T})$ bound, and (2) for smooth and strongly convex losses, we prove an $\mathcal{O}(\log T)$ bound. Through careful analysis, we quantify the individual impact of both spatial (across clients) and temporal (over time) data heterogeneity on the regret bounds. Consequently, we identify a regime of mild temporal variation (relative to stochastic gradient variance), where the network regret improves with parallelization. Hence, in the SEA setting, our results improve the existing pessimistic worst-case results in online federated learning.
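A skeleton of the algorithmic structure described here (client-side online SGD with periodic server averaging); the adversarially chosen per-client distributions are hidden inside the gradient oracle, and the function names and defaults are illustrative.

```python
import numpy as np

def federated_online_sgd(grads, n_clients, dim, T, sync_every=10, lr=0.1):
    """Client-side online SGD with periodic global aggregation.
    `grads(i, t, w)` returns client i's stochastic gradient at round t."""
    w = np.zeros((n_clients, dim))
    for t in range(T):
        for i in range(n_clients):
            w[i] -= lr * grads(i, t, w[i])      # local online update
        if (t + 1) % sync_every == 0:
            w[:] = w.mean(axis=0)               # server averages and broadcasts
    return w.mean(axis=0)
```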
[557] Evaluation-driven Scaling for Scientific Discovery
Haotian Ye, Haowei Lin, Jingyi Tang, Yizhen Luo, Caiyin Yang, Chang Su, Rahul Thapa, Rui Yang, Ruihua Liu, Zeyu Li, Chong Gao, Dachao Ding, Guangrong He, Miaolei Zhang, Lina Sun, Wenyang Wang, Yuchen Zhong, Zhuohao Shen, Di He, Jianzhu Ma, Stefano Ermon, Tongyang Li, Xiaowen Chu, James Zou, Yuzhi Xu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. Particularly, we sped up the widely used LASSO algorithm by over 2x, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdos minimum overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
[558] LASER: Learning Active Sensing for Continuum Field Reconstruction
Huayu Deng, Jinghui Zhong, Xiangming Zhu, Yunbo Wang, Xiaokang Yang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: High-fidelity measurements of continuum physical fields are essential for scientific discovery and engineering design but remain challenging under sparse and constrained sensing. Conventional reconstruction methods typically rely on fixed sensor layouts, which cannot adapt to evolving physical states. We propose LASER, a unified, closed-loop framework that formulates active sensing as a Partially Observable Markov Decision Process (POMDP). At its core, LASER employs a continuum field latent world model that captures the underlying physical dynamics and provides intrinsic reward feedback. This enables a reinforcement learning policy to simulate "what-if" sensing scenarios within a latent imagination space. By conditioning sensor movements on predicted latent states, LASER navigates toward potentially high-information regions beyond current observations. Our experiments demonstrate that LASER consistently outperforms static and offline-optimized strategies, achieving high-fidelity reconstruction under sparsity across diverse continuum fields.
[559] FairTree: Subgroup Fairness Auditing of Machine Learning Models with Bias-Variance Decomposition
Rudolf Debelak
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The evaluation of machine learning models typically relies mainly on performance metrics based on loss functions, which risk overlooking changes in performance in relevant subgroups. Auditing tools such as SliceFinder and SliceLine were proposed to detect such groups, but usually have conceptual disadvantages, such as the inability to directly address continuous covariates. In this paper, we introduce FairTree, a novel algorithm adapted from psychometric invariance testing. Unlike SliceFinder and related algorithms, FairTree directly handles continuous, categorical, and ordinal features without discretization. It further decomposes performance disparities into systematic bias and variance, allowing a categorization of changes in algorithm performance. We propose and evaluate two variations of the algorithm: a permutation-based approach, which is conceptually closer to SliceFinder, and a fluctuation test. Through simulation studies that include a direct comparison with SliceLine, we demonstrate that both approaches have a satisfactory rate of false-positive results, but that the fluctuation approach has relatively higher power. We further illustrate the method on the UCI Adult Census dataset. The proposed algorithms provide a flexible framework for the statistical evaluation of the performance and aspects of fairness of machine learning models in a wide range of applications, even with relatively small datasets.
[560] TACENR: Task-Agnostic Contrastive Explanations for Node Representations
Vasiliki Papanikou, Evaggelia Pitoura
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Graph representation learning has achieved notable success in encoding graph-structured data into latent vector spaces, enabling a wide range of downstream tasks. However, these node representations remain opaque and difficult to interpret. Existing explainability methods primarily focus on supervised settings or on explaining individual representation dimensions, leaving a critical gap in explaining the overall structure of node representations. In this paper, we propose TACENR (Task-Agnostic Contrastive Explanations for Node Representations), a local explanation method that identifies not only attribute features but also proximity and structural ones that contribute the most in the representation space. TACENR builds on contrastive learning, through which we learn a similarity function in the representation space, revealing which features play an important role in the representation of a node. While our focus is on task-agnostic explanations, TACENR can be applied to supervised scenarios as well. Experimental results demonstrate that proximity and structural features play a significant role in shaping node representations and that our supervised variant performs comparably to existing task-specific approaches in identifying the most impactful features.
[561] Optimal Routing for Federated Learning over Dynamic Satellite Networks: Tractable or Not?
Yi Zhao, Di Yuan, Tao Deng, Suzhi Cao, Ying Dong
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Federated learning (FL) is a key paradigm for distributed model learning across decentralized data sources. Communication in each FL round typically consists of two phases: (i) distributing the global model from a server to clients, and (ii) collecting updated local models from clients to the server for aggregation. This paper focuses on a type of FL where communication between a client and the server is relay-based over dynamic networks, making routing optimization essential. A typical scenario is in-orbit FL, where satellites act as clients and communicate with a server (which can be a satellite, ground station, or aerial platform) via multi-hop inter-satellite links. This paper presents a comprehensive tractability analysis of routing optimization for in-orbit FL under different settings. For global model distribution, these include the number of models, the objective function, and routing schemes (unicast versus multicast, and splittable versus unsplittable flow). For local model collection, the settings consider the number of models, client selection, and flow splittability. For each case, we rigorously prove whether the global optimum is obtainable in polynomial time or the problem is NP-hard. Together, our analysis draws clear boundaries between tractable and intractable regimes for a broad spectrum of routing problems for in-orbit FL. For tractable cases, the derived efficient algorithms are directly applicable in practice. For intractable cases, we provide fundamental insights into their inherent complexity. These contributions fill a critical yet unexplored research gap, laying a foundation for principled routing design, evaluation, and deployment in satellite-based FL or similar distributed learning systems.
[562] Revisiting Catastrophic Forgetting in Continual Knowledge Graph Embedding
Gerard Pons, Carlos Escolano, Besim Bilalli, Anna Queralt
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Knowledge Graph Embeddings (KGEs) support a wide range of downstream tasks over Knowledge Graphs (KGs). In practice, KGs evolve as new entities and facts are added, motivating Continual Knowledge Graph Embedding (CKGE) methods that update embeddings over time. Current CKGE approaches address catastrophic forgetting (i.e., the performance degradation on previously learned tasks) primarily by limiting changes to existing embeddings. However, we show that this view is incomplete. When new entities are introduced, their embeddings can interfere with previously learned ones, causing the model to predict them in place of previously correct answers. This phenomenon, which we call entity interference, has been largely overlooked and is not accounted for in current CKGE evaluation protocols. As a result, the assessment of catastrophic forgetting becomes misleading, and the performance of CKGE methods is systematically overestimated. To address this issue, we introduce a corrected CKGE evaluation protocol that accounts for entity interference. Through experiments on multiple benchmarks, we show that ignoring this effect can lead to performance overestimation of up to 25%, particularly in scenarios with significant entity growth. We further analyze how different CKGE methods and KGE models are affected by the different sources of forgetting, and introduce a catastrophic forgetting metric tailored to CKGE.
[563] Unsupervised Confidence Calibration for Reasoning LLMs from a Single Generation
Thomas Zollo, Jimmy Wang, Richard Zemel
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reasoning language models can solve increasingly complex tasks, but struggle to produce the calibrated confidence estimates necessary for reliable deployment. Existing calibration methods usually depend on labels or repeated sampling at inference time, making them impractical in many settings. We introduce a method for unsupervised confidence calibration of reasoning LLMs when only a single generation is available at inference time. Our approach uses offline sampling on unlabeled data to derive a self-consistency-based proxy target, then distills this signal into a lightweight deployment-time confidence predictor. In a broad evaluation across 5 math and question-answering tasks using 9 reasoning models, our method substantially outperforms baselines, including under distribution shift, and improves downstream performance in selective prediction and simulated downstream decision-making.
[564] Heterogeneity-Aware Personalized Federated Learning for Industrial Predictive Analytics
Yuhan Hu, Xiaolei Fang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Federated prognostics enable clients (e.g., companies, factories, and production lines) to collaboratively develop a failure time prediction model while keeping each client’s data local and confidential. However, traditional federated models often assume homogeneity in the degradation processes across clients, an assumption that may not hold in many industrial settings. To overcome this, this paper proposes a personalized federated prognostic model designed to accommodate clients with heterogeneous degradation processes, allowing them to build tailored prognostic models. The prognostic model iteratively facilitates the underlying pairwise collaborations between clients with similar degradation patterns, which enhances the performance of personalized federated learning. To estimate parameters jointly using decentralized datasets, we develop a federated parameter estimation algorithm based on proximal gradient descent. The proposed approach addresses the limitations of existing federated prognostic models by simultaneously achieving model personalization, preserving data privacy, and providing comprehensive failure time distributions. The superiority of the proposed model is validated through extensive simulation studies and a case study using the turbofan engine degradation dataset from the NASA repository.
[565] ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
Suvinava Basak
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Batch Normalization (BN) is a cornerstone of deep learning, yet it fundamentally breaks down in micro-batch regimes (e.g., 3D medical imaging) and non-IID Federated Learning. Removing BN from deep architectures, however, often leads to catastrophic training failures such as vanishing gradients and dying channels. We identify that standard activation functions, like Swish and ReLU, exacerbate this instability in BN-free networks due to their non-zero-centered nature, which causes compounding activation mean-shifts as network depth increases. In this technical communication, we propose Zero-Centered Swish (ZC-Swish), a drop-in activation function parameterized to dynamically anchor activation means near zero. Through targeted stress-testing on BN-free convolutional networks at depths 8, 16, and 32, we demonstrate that while standard Swish collapses to near-random performance at depth 16 and beyond, ZC-Swish maintains stable layer-wise activation dynamics and achieves the highest test accuracy at depth 16 (51.5%) with seed 42. ZC-Swish thus provides a robust, parameter-efficient solution for stabilizing deep networks in memory-constrained and privacy-preserving applications where traditional normalization is unviable.
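The abstract does not give the exact parameterization of ZC-Swish; the sketch below is one plausible reading of "dynamically anchor activation means near zero", subtracting a running estimate of the Swish output mean, and may differ from the paper's formulation.

```python
import torch
import torch.nn as nn

class ZeroCenteredSwish(nn.Module):
    """Illustrative zero-centered Swish: apply Swish/SiLU, then subtract a running
    estimate of its mean so activations stay roughly centered without BatchNorm."""
    def __init__(self, momentum=0.01):
        super().__init__()
        self.momentum = momentum
        self.register_buffer("running_mean", torch.zeros(1))

    def forward(self, x):
        y = x * torch.sigmoid(x)                          # Swish / SiLU
        if self.training:
            with torch.no_grad():
                self.running_mean.mul_(1 - self.momentum).add_(self.momentum * y.mean())
        return y - self.running_mean
```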
[566] EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
Chengjun Pan, Shichun Liu, Jiahang Lin, Dingwei Zhu, Jiazheng Zhang, Shihan Dou, Songyang Gao, Zhenhua Han, Binghai Wang, Rui Zheng, Xuanjing Huang, Tao Gui, Yansong Feng
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
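A compact sketch of the gating rule described here: compute batch-level explained variance and fall back to the batch-mean baseline when the critic does not reduce variance. GAE, clipping, and the rest of the PPO/GRPO machinery are omitted, and the threshold and function names are illustrative.

```python
import numpy as np

def explained_variance(returns, values):
    """EV = 1 - Var(returns - values) / Var(returns).
    EV <= 0 means the critic adds at least as much noise as signal."""
    var_r = np.var(returns)
    return float("nan") if var_r == 0 else 1.0 - np.var(returns - values) / var_r

def adaptive_advantages(returns, values):
    """EVPO-style gating (sketch): use the critic baseline only when it has
    positive explained variance on the current batch."""
    ev = explained_variance(returns, values)
    if np.isfinite(ev) and ev > 0:
        return returns - values            # critic-based (PPO-style)
    return returns - returns.mean()        # batch-mean baseline (GRPO-style)
```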
[567] When Graph Structure Becomes a Liability: A Critical Re-Evaluation of Graph Neural Networks for Bitcoin Fraud Detection under Temporal Distribution Shift
Saket Maganti
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The consensus that GCN, GraphSAGE, GAT, and EvolveGCN outperform feature-only baselines on the Elliptic Bitcoin Dataset is widely cited but has not been rigorously stress-tested under a leakage-free evaluation protocol. We perform a seed-matched inductive-versus-transductive comparison and find that this consensus does not hold. Under a strictly inductive protocol, Random Forest on raw features achieves F1 = 0.821 and outperforms all evaluated GNNs, while GraphSAGE reaches F1 = 0.689 +/- 0.017. A paired controlled experiment reveals a 39.5-point F1 gap attributable to training-time exposure to test-period adjacency. Additionally, edge-shuffle ablations show that randomly wired graphs outperform the real transaction graph, indicating that the dataset’s topology can be misleading under temporal distribution shift. Hybrid models combining GNN embeddings with raw features provide only marginal gains and remain substantially below feature-only baselines. We release code, checkpoints, and a strict-inductive protocol to enable reproducible, leakage-free evaluation.
[568] Accelerating Optimization and Machine Learning through Decentralization
Ziqin Chen, Zuang Wang, Yongqiang Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Decentralized optimization enables multiple devices to learn a global machine learning model while each individual device only has access to its local dataset. By avoiding the need for training data to leave individual users’ devices, it enhances privacy and scalability compared to conventional centralized learning, where all data has to be aggregated to a central server. However, decentralized optimization has traditionally been viewed as a necessary compromise, used only when centralized processing is impractical due to communication constraints or data privacy concerns. In this study, we show that decentralization can paradoxically accelerate convergence, outperforming centralized methods in the number of iterations needed to reach optimal solutions. Through examples in logistic regression and neural network training, we demonstrate that distributing data and computation across multiple agents can lead to faster learning than centralized approaches, even when each iteration is assumed to take the same amount of time, whether performed centrally on the full dataset or decentrally on local subsets. This finding challenges longstanding assumptions and reveals decentralization as a strategic advantage, offering new opportunities for more efficient optimization and machine learning.
[569] Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments
Jianyang Gao, Yutong Gou, Yuexuan Xu, Jifan Shi, Yongyi Yang, Shuolin Li, Raymond Chi-Wing Wong, Cheng Long
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This technical note revisits the relationship between RaBitQ and TurboQuant under a unified comparison framework. We compare the two methods in terms of methodology, theoretical guarantees, and empirical performance, using a reproducible, transparent, and symmetric setup. Our results show that, despite the claimed advantage of TurboQuant, TurboQuant does not provide a consistent improvement over RaBitQ in directly comparable settings; in many tested configurations, it performs worse than RaBitQ. We further find that several reported runtime and recall results in the TurboQuant paper could not be reproduced from the released implementation under the stated configuration. Overall, this note clarifies the shared structure and genuine differences between the two lines of work, while documenting reproducibility issues in the experimental results reported by the TurboQuant paper.
[570] Calibrating Scientific Foundation Models with Inference-Time Stochastic Attention
Akash Yadav, Taiwo A. Adebiyi, Ruda Zhang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Transformer-based scientific foundation models are increasingly deployed in high-stakes settings, but current architectures give deterministic outputs and provide limited support for calibrated predictive uncertainty. We propose Stochastic Attention, a lightweight inference-time modification that randomizes attention by replacing softmax weights with normalized multinomial samples controlled by a single concentration parameter, and produces predictive ensembles without retraining. To set this parameter, we introduce a calibration objective that matches the stochastic attention output with the target, yielding an efficient univariate post-hoc tuning problem. We evaluate this mechanism on two scientific foundation models for weather and timeseries forecasting along with an additional regression task. Across benchmarks against uncertainty-aware baselines, we find that Stochastic Attention achieves the strongest native calibration and the sharpest prediction intervals at comparable coverage, while requiring only minutes of post-hoc tuning versus days of retraining for competitive baselines.
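One way to realize the mechanism described here is to draw multinomial counts from the softmax weights and renormalize, with the total count acting as the concentration parameter; where exactly this replaces softmax inside a given architecture is an assumption of the sketch.

```python
import torch
from torch.distributions import Multinomial

def stochastic_attention_weights(scores, concentration=64):
    """Replace softmax attention weights with normalized multinomial samples.
    scores: (..., n_queries, n_keys) pre-softmax attention scores.
    Larger `concentration` keeps the samples closer to the softmax weights."""
    probs = torch.softmax(scores, dim=-1)
    counts = Multinomial(total_count=concentration, probs=probs).sample()
    return counts / concentration

# predictive ensemble at inference time (no retraining):
# outputs = [model_with_stochastic_attention(x) for _ in range(n_samples)]
```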
[571] Separating Geometry from Probability in the Analysis of Generalization
Maxim Raginsky, Benjamin Recht
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The goal of machine learning is to find models that minimize prediction error on data that has not yet been seen. Its operational paradigm assumes access to a dataset $S$ and articulates a scheme for evaluating how well a given model performs on an arbitrary sample. The sample can be $S$ (in which case we speak of "in-sample" performance) or some entirely new $S'$ (in which case we speak of "out-of-sample" performance). Traditional analysis of generalization assumes that both in- and out-of-sample data are i.i.d. draws from an infinite population. However, these probabilistic assumptions cannot be verified even in principle. This paper presents an alternative view of generalization through the lens of sensitivity analysis of solutions of optimization problems to perturbations in the problem data. Under this framework, generalization bounds are obtained by purely deterministic means and take the form of variational principles that relate in-sample and out-of-sample evaluations through an error term that quantifies how close out-of-sample data are to in-sample data. Statistical assumptions can then be used ex post to characterize the situations when this error term is small (either on average or with high probability).
[572] Structure-guided molecular design with contrastive 3D protein-ligand learning
Carles Navarro, Philipp Tholke, Gianni de Fabritiis
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Structure-based drug discovery faces the dual challenge of accurately capturing 3D protein-ligand interactions while navigating ultra-large chemical spaces to identify synthetically accessible candidates. In this work, we present a unified framework that addresses these challenges by combining contrastive 3D structure encoding with autoregressive molecular generation conditioned on commercial compound spaces. First, we introduce an SE(3)-equivariant transformer that encodes ligand and pocket structures into a shared embedding space via contrastive learning, achieving competitive results in zero-shot virtual screening. Second, we integrate these embeddings into a multimodal Chemical Language Model (MCLM). The model generates target-specific molecules conditioned on either pocket or ligand structures, with a learned dataset token that steers the output toward targeted chemical spaces, yielding candidates with favorable predicted binding properties across diverse targets.
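The SE(3)-equivariant encoders are the paper's contribution and are not shown; the contrastive objective tying pocket and ligand embeddings together is typically a symmetric InfoNCE, sketched below as an assumption (the paper's exact loss may differ).

```python
import torch
import torch.nn.functional as F

def pocket_ligand_infonce(pocket_emb, ligand_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched pocket/ligand pairs.
    Row i of each tensor is assumed to be a true binding pair."""
    p = F.normalize(pocket_emb, dim=-1)
    l = F.normalize(ligand_emb, dim=-1)
    logits = p @ l.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))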
[573] Lyapunov-Certified Direct Switching Theory for Q-Learning
Donghwan Lee
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Q-learning is one of the most fundamental algorithms in reinforcement learning. We analyze constant-stepsize Q-learning through a direct stochastic switching system representation. The key observation is that the Bellman maximization error can be represented exactly by a stochastic policy. Therefore, the Q-learning error admits a switched linear conditional-mean recursion with martingale-difference noise. The intrinsic drift rate is the joint spectral radius (JSR) of the direct switching family, which can be strictly smaller than the standard row-sum rate. Using this representation, we derive a finite-time final-iterate bound via a JSR-induced Lyapunov function and then give a computable quadratic-certificate version.
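For concreteness, a toy constant-stepsize Q-learning iteration of the kind the analysis targets is sketched below; the MDP, stepsize, and horizon are arbitrary illustrative choices, and the switched-system/Lyapunov analysis itself is not implemented here:

```python
import numpy as np

# Toy constant-stepsize synchronous Q-learning on a random 2-state, 2-action MDP.
# The paper's analysis views the error Q_t - Q* as a switched linear recursion;
# only the iteration itself is illustrated here.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # P[s, a] = next-state distribution
R = rng.uniform(size=(2, 2))                 # deterministic rewards
gamma, alpha = 0.9, 0.1

Q_star = np.zeros((2, 2))                    # reference Q* via value iteration
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

Q = np.zeros((2, 2))
for _ in range(3000):
    s_next = np.array([[rng.choice(2, p=P[s, a]) for a in range(2)] for s in range(2)])
    target = R + gamma * Q[s_next].max(axis=-1)   # sampled Bellman target
    Q += alpha * (target - Q)                     # constant-stepsize update
print("max |Q - Q*| after training:", float(np.abs(Q - Q_star).max()))
```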
[574] An Efficient Black-Box Reduction from Online Learning to Multicalibration, and a New Route to $Φ$-Regret Minimization
Gabriele Farina, Juan Carlos Perdomo
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We give a Gordon-Greenwald-Marks (GGM) style black-box reduction from online learning to online multicalibration. Concretely, we show that to achieve high-dimensional multicalibration with respect to a class of functions H, it suffices to combine any no-regret learner over H with an expected variational inequality (EVI) solver. We also prove a converse statement showing that efficient multicalibration implies efficient EVI solving, highlighting how EVIs in multicalibration mirror the role of fixed points in the GGM result for $Φ$-regret. This first set of results resolves the main open question in Garg, Jung, Reingold, and Roth (SODA '24), showing that oracle-efficient online multicalibration with $\sqrt{T}$-type guarantees is possible in full generality. Furthermore, our GGM-style reduction unifies the analyses of existing online multicalibration algorithms, enables new algorithms for challenging environments with delayed observations or censored outcomes, and yields the first efficient black-box reduction between online learning and multiclass omniprediction. Our second main result is a fine-grained reduction from high-dimensional online multicalibration to (contextual) $Φ$-regret minimization. Together with our first result, this establishes a new route from external regret to $Φ$-regret that bypasses sophisticated fixed-point or semi-separation machinery, dramatically simplifies a result of Daskalakis, Farina, Fishelson, Pipis, and Schneider (STOC '25) while improving rates, and yields new algorithms that are robust to richer deviation classes, such as those belonging to any reproducing kernel Hilbert space.
[575] SAGE: Training-Free Semantic Evidence Composition for Edge-Cloud Inference under Hard Uplink Budgets
Inhyeok Choi, Hyuncheol Park
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Edge-cloud hybrid inference offloads difficult inputs to a powerful remote model, but the uplink channel imposes hard per-request constraints on the number of bits that can be transmitted. We show that selecting transmitted content based solely on attention-based importance, the standard approach in collaborative inference, is inherently limited under hard budgets. Two findings support this claim. First, replacing high-importance units with low-importance but complementary ones improves server accuracy. This shows that what matters is not individual importance but how well the transmitted set covers diverse aspects of the input. Second, spatially uniform selection without any content information achieves competitive accuracy at moderate budgets. This confirms that spatial coverage alone carries independent value. Based on this analysis, we propose SAGE (Semantic Attention-Guided Evidence), a principled, training-free method that combines importance filtering with embedding-diversity sampling. SAGE achieves 93% of the server ceiling in offloaded accuracy while transmitting fewer than half of the available evidence units on ImageNet-1K, substantially outperforming importance-only composition.
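A hedged sketch of the two-stage selection the abstract describes, with an importance pre-filter followed by diversity-driven (farthest-point) sampling under a hard budget; the pre-filter ratio and the specific diversity criterion are illustrative assumptions rather than SAGE's exact choices:

```python
import numpy as np

def select_evidence(importance, embeddings, budget, prefilter_ratio=2.0):
    # Importance pre-filter followed by farthest-point (diversity) sampling
    # under a hard budget; the ratio and diversity criterion are illustrative.
    keep = min(len(importance), int(budget * prefilter_ratio))
    candidates = list(np.argsort(importance)[::-1][:keep])   # importance filter
    chosen = [candidates[0]]                                  # most important unit
    while len(chosen) < min(budget, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        dists = [min(np.linalg.norm(embeddings[c] - embeddings[s]) for s in chosen)
                 for c in remaining]
        chosen.append(remaining[int(np.argmax(dists))])       # most complementary
    return sorted(chosen)

rng = np.random.default_rng(1)
print(select_evidence(rng.uniform(size=20), rng.normal(size=(20, 8)), budget=5))
```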
[576] Disentangling Damage from Operational Variability: A Label-Free Self-Supervised Representation Learning Framework for Output-Only Structural Damage Identification
Xudong Jian, Charikleia Stoura, Simon Scandella, Eleni Chatzi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Damage identification is a core task in structural health monitoring. In practice, however, its reliability is often compromised by confounding non-damage effects, such as variations in excitation and environmental conditions, which can induce changes comparable to or larger than those caused by structural damage. To address this challenge, this study proposes a self-supervised label-free disentangled representation learning framework for robust vibration-based structural damage identification. The proposed framework employs an autoencoder with two latent representations to learn directly from raw vibration acceleration signals. A self-supervised invariance regularization, implemented via Variance-Invariance-Covariance Regularization (VICReg), is imposed on one latent representation using baseline data where structural damage is assumed constant but operational and environmental conditions vary. In addition, a frequency-domain constraint is introduced to enforce agreement between the power spectral density reconstructed from the latent representation and that computed from the corresponding input time series. Together, these mechanisms promote disentanglement, enabling the learned representation to be sensitive to damage-related characteristics while remaining invariant to nuisance variability. The framework is trained in a fully end-to-end and label-free manner, requiring no prior information on damage, excitation, or environmental conditions, making it well-suited for real-world applications. Its effectiveness is validated on two distinct real-world vibration datasets, including a bridge and a gearbox. The results demonstrate robustness to operational variability, strong generalization capability, and good performance in both damage detection and quantification.
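The invariance regularizer named above, VICReg, has a standard form that can be sketched directly; the coefficients below are the usual VICReg defaults and the synthetic embeddings are placeholders, not the paper's autoencoder latents:

```python
import numpy as np

def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0, eps=1e-4):
    # Variance-Invariance-Covariance Regularization on paired embeddings:
    # invariance (MSE), variance (hinge on per-dimension std), covariance
    # (off-diagonal suppression). Coefficients are the common VICReg defaults.
    inv = ((z1 - z2) ** 2).mean()
    def var_term(z):
        std = np.sqrt(z.var(axis=0) + eps)
        return np.maximum(0.0, 1.0 - std).mean()
    def cov_term(z):
        zc = z - z.mean(axis=0)
        cov = zc.T @ zc / (len(z) - 1)
        off_diag = cov - np.diag(np.diag(cov))
        return (off_diag ** 2).sum() / z.shape[1]
    return lam * inv + mu * (var_term(z1) + var_term(z2)) + nu * (cov_term(z1) + cov_term(z2))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(64, 8))                 # baseline-condition embeddings
z2 = z1 + 0.1 * rng.normal(size=(64, 8))      # same damage state, varied conditions
print(round(vicreg_loss(z1, z2), 3))
```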
[577] HardNet++: Nonlinear Constraint Enforcement in Neural Networks
Andrea Goertzen, Kaveh Alim, Navid Azizan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Enforcing constraint satisfaction in neural network outputs is critical for safety, reliability, and physical fidelity in many control and decision-making applications. While soft-constrained methods penalize constraint violations during training, they do not guarantee constraint adherence during inference. Other approaches guarantee constraint satisfaction via specific parameterizations or a projection layer, but are tailored to specific forms (e.g., linear constraints), limiting their utility in other general problem settings. Many real-world problems of interest are nonlinear, motivating the development of methods that can enforce general nonlinear constraints. To this end, we introduce HardNet++, a constraint-enforcement method that simultaneously satisfies linear and nonlinear equality and inequality constraints. Our approach iteratively adjusts the network output via damped local linearizations. Each iteration is differentiable, admitting an end-to-end training framework, where the constraint satisfaction layer is active during training. We show that under certain regularity conditions, this procedure can enforce nonlinear constraint satisfaction to arbitrary tolerance. Finally, we demonstrate tight constraint adherence without loss of optimality in a learning-for-optimization context, where we apply this method to a model predictive control problem with nonlinear state constraints.
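The core operation, iteratively adjusting the output via damped local linearizations of the constraints, can be sketched generically; the code below is a Gauss-Newton-style correction on a toy constraint, not the exact HardNet++ layer, and the damping and iteration count are illustrative:

```python
import numpy as np

def enforce_constraints(y, h, jac, steps=20, damping=0.5):
    # Generic damped Gauss-Newton-style correction toward h(y) = 0:
    # linearize the constraints at y and take a damped least-squares step.
    for _ in range(steps):
        y = y - damping * np.linalg.pinv(jac(y)) @ h(y)
    return y

# Toy nonlinear equality constraint: output must lie on the unit circle.
h = lambda y: np.array([y[0] ** 2 + y[1] ** 2 - 1.0])
jac = lambda y: np.array([[2.0 * y[0], 2.0 * y[1]]])
raw_output = np.array([1.5, 0.8])             # hypothetical raw network output
corrected = enforce_constraints(raw_output, h, jac)
print(corrected, h(corrected))
```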
[578] Budgeted Online Influence Maximization
Pierre Perrault, Jennifer Healey, Zheng Wen, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We introduce a new budgeted framework for online influence maximization, considering the total cost of an advertising campaign instead of the common cardinality constraint on a chosen influencer set. Our approach better models the real-world setting where the cost of influencers varies and advertisers want to find the best value for their overall social advertising budget. We propose an algorithm assuming an independent cascade diffusion model and edge-level semi-bandit feedback, and provide both theoretical and experimental results. Our analysis is also valid for the cardinality constraint setting and improves the state-of-the-art regret bound in this case.
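To make the budgeted objective concrete, the sketch below runs a cost-aware greedy heuristic on a synthetic independent-cascade graph; the paper's contribution is the online semi-bandit algorithm, which must additionally learn the edge probabilities and is not modeled in this offline sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
p_edge = rng.uniform(0.0, 0.1, size=(n, n))   # independent-cascade probabilities
costs = rng.uniform(0.5, 2.0, size=n)         # heterogeneous influencer costs

def estimated_spread(seeds, runs=100):
    # Monte Carlo estimate of independent-cascade spread from a seed set.
    total = 0
    for _ in range(runs):
        active, frontier = set(seeds), list(seeds)
        while frontier:
            new = []
            for u in frontier:
                for v in range(n):
                    if v not in active and rng.random() < p_edge[u, v]:
                        active.add(v)
                        new.append(v)
            frontier = new
        total += len(active)
    return total / runs

# Cost-aware greedy selection under a total budget (offline heuristic only).
budget, seeds, spent = 4.0, [], 0.0
while True:
    base = estimated_spread(seeds)
    best, best_ratio = None, 0.0
    for v in range(n):
        if v in seeds or spent + costs[v] > budget:
            continue
        ratio = (estimated_spread(seeds + [v]) - base) / costs[v]
        if ratio > best_ratio:
            best, best_ratio = v, ratio
    if best is None:
        break
    seeds.append(best)
    spent += costs[best]
print("seeds:", seeds, "total cost:", round(spent, 2))
```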
[579] PREF-XAI: Preference-Based Personalized Rule Explanations of Black-Box Machine Learning Models
Salvatore Greco, Jacek Karolczak, Roman Słowiński, Jerzy Stefanowski
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Explainable artificial intelligence (XAI) has predominantly focused on generating model-centric explanations that approximate the behavior of black-box models. However, such explanations often overlook a fundamental aspect of interpretability: different users require different explanations depending on their goals, preferences, and cognitive constraints. Although recent work has explored user-centric and personalized explanations, most existing approaches rely on heuristic adaptations or implicit user modeling, lacking a principled framework for representing and learning individual preferences. In this paper, we consider Preference-Based Explainable Artificial Intelligence (PREF-XAI), a novel perspective that reframes explanation as a preference-driven decision problem. Within PREF-XAI, explanations are not treated as fixed outputs, but as alternatives to be evaluated and selected according to user-specific criteria. Under this perspective, we propose a methodology that combines rule-based explanations with formal preference learning. User preferences are elicited through a ranking of a small set of candidate explanations and modeled via an additive utility function inferred using robust ordinal regression. Experimental results on real-world datasets show that PREF-XAI can accurately reconstruct user preferences from limited feedback, identify highly relevant explanations, and discover novel explanatory rules not initially considered by the user. Beyond the proposed methodology, this work establishes a connection between XAI and preference learning, opening new directions for interactive and adaptive explanation systems.
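A simplified instance of the preference-learning step: fitting a linear additive utility to a user-supplied ranking with a maximum-margin LP. This stands in for the full robust ordinal regression machinery, and the explanation features and ranking below are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def fit_additive_utility(explanation_features, ranking):
    # Fit a linear additive utility u(x) = w.x that reproduces the ranking
    # with maximal margin (simplified UTA-style LP, not the paper's full
    # robust ordinal regression).
    X = np.asarray(explanation_features, dtype=float)
    d = X.shape[1]
    A_ub, b_ub = [], []
    for better, worse in zip(ranking[:-1], ranking[1:]):
        diff = X[better] - X[worse]
        A_ub.append(np.append(-diff, 1.0))       # -w.diff + eps <= 0
        b_ub.append(0.0)
    A_eq = [np.append(np.ones(d), 0.0)]           # weights sum to one
    c = np.zeros(d + 1)
    c[-1] = -1.0                                  # maximize the margin eps
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0],
                  bounds=[(0, None)] * d + [(None, 1.0)])
    return res.x[:d], res.x[-1]

# Three candidate rule explanations described by (coverage, precision, length),
# ranked by a hypothetical user from most to least preferred.
feats = [[0.8, 0.9, 0.2], [0.6, 0.95, 0.1], [0.9, 0.7, 0.6]]
weights, margin = fit_additive_utility(feats, ranking=[1, 0, 2])
print(np.round(weights, 3), round(margin, 3))
```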
[580] Planning in entropy-regularized Markov decision processes and games
Jean-Bastien Grill, Omar Darwiche Domingues, Pierre Ménard, Rémi Munos, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We propose SmoothCruiser, a new planning algorithm for estimating the value function in entropy-regularized Markov decision processes and two-player games, given a generative model of the environment. SmoothCruiser makes use of the smoothness of the Bellman operator promoted by the regularization to achieve problem-independent sample complexity of order $\tilde{\mathcal{O}}(1/ε^4)$ for a desired accuracy $ε$, whereas for non-regularized settings there are no known algorithms with guaranteed polynomial sample complexity in the worst case.
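The smoothness the abstract relies on comes from entropy regularization of the Bellman operator; a toy tabular sketch of the resulting soft (log-sum-exp) backup is shown below. The planner itself, which works from a generative model, is not reproduced, and the MDP and temperature are illustrative:

```python
import numpy as np

def soft_bellman_backup(Q, P, R, gamma, tau):
    # Entropy-regularized Bellman backup: the hard max is replaced by a
    # temperature-tau log-sum-exp, which is smooth in Q.
    V = tau * np.log(np.exp(Q / tau).sum(axis=1))   # soft value per state
    return R + gamma * P @ V

rng = np.random.default_rng(0)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] over next states
R = rng.uniform(size=(S, A))
Q = np.zeros((S, A))
for _ in range(200):
    Q = soft_bellman_backup(Q, P, R, gamma=0.9, tau=0.5)
print(np.round(Q, 3))
```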
[581] On two ways to use determinantal point processes for Monte Carlo integration
Guillaume Gautier, Rémi Bardenet, Michal Valko
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The standard Monte Carlo estimator $\widehat{I}_N^{\mathrm{MC}}$ of $\int f\,dω$ relies on independent samples from $ω$ and has variance of order $1/N$. Replacing the samples with a determinantal point process (DPP), a repulsive distribution, makes the estimator consistent, with variance rates that depend on how the DPP is adapted to $f$ and $ω$. We examine two existing DPP-based estimators: one, by Bardenet & Hardy (2020), achieves a rate of $\mathcal{O}(N^{-(1+1/d)})$ for smooth $f$ but relies on a fixed DPP; the other, by Ermakov & Zolotukhin (1960), is unbiased with a rate of order $1/N$, like Monte Carlo, but its DPP is tailored to $f$. We revisit these estimators, generalize them to continuous settings, and provide sampling algorithms.
[582] Ultrametric OGP - parametric RDT symmetric binary perceptron connection
Mihailo Stojnic
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In [97,99,100], an fl-RDT framework is introduced to characterize statistical computational gaps (SCGs). Studying symmetric binary perceptrons (SBPs), [100] obtained an algorithmic threshold estimate $α_a\approx α_c^{(7)}\approx 1.6093$ at the 7th lifting level (for $κ=1$ margin), closely approaching the $1.58$ local entropy (LE) prediction [18]. In this paper, we further connect parametric RDT to overlap gap properties (OGPs), another key geometric feature of the solution space. Specifically, for any positive integer $s$, we consider $s$-level ultrametric OGPs ($ult_s$-OGPs) and rigorously upper-bound the associated constraint densities $α_{ult_s}$. To achieve this, we develop an analytical union-bounding program consisting of combinatorial and probabilistic components. By casting the combinatorial part as a convex problem and the probabilistic part as a nested integration, we conduct numerical evaluations and obtain that the tightest bounds at the first two levels, $\bar{α}_{ult_1} \approx 1.6578$ and $\bar{α}_{ult_2} \approx 1.6219$, closely approach the 3rd and 4th lifting level parametric RDT estimates, $α_c^{(3)} \approx 1.6576$ and $α_c^{(4)} \approx 1.6218$. We also observe excellent agreement across other key parameters, including overlap values and the relative sizes of ultrametric clusters. Based on these observations, we propose several conjectures linking $ult$-OGP and parametric RDT. Specifically, we conjecture that the algorithmic threshold satisfies $α_a=\lim_{s\rightarrow\infty} α_{ult_s} = \lim_{s\rightarrow\infty} \bar{α}_{ult_s} = \lim_{r\rightarrow\infty} α_{c}^{(r)}$, and that $α_{ult_s} \leq α_{c}^{(s+2)}$ (with possible equality for some, maybe even all, $s$). Finally, we discuss the potential existence of a full isomorphism connecting all key parameters of $ult$-OGP and parametric RDT.
[583] Adaptive MSD-Splitting: Enhancing C4.5 and Random Forests for Skewed Continuous Attributes
Jake Lee
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The discretization of continuous numerical attributes remains a persistent computational bottleneck in the induction of decision trees, particularly as dataset dimensions scale. Building upon the recently proposed MSD-Splitting technique – which bins continuous data using the empirical mean and standard deviation to dramatically improve the efficiency and accuracy of the C4.5 algorithm – we introduce Adaptive MSD-Splitting (AMSD). While standard MSD-Splitting is highly effective for approximately symmetric distributions, its rigid adherence to fixed one-standard-deviation cutoffs can lead to catastrophic information loss in highly skewed data, a common artifact in real-world biomedical and financial datasets. AMSD addresses this by dynamically adjusting the standard deviation multiplier based on feature skewness, narrowing intervals in dense regions to preserve discriminative resolution. Furthermore, we integrate AMSD into ensemble methods, specifically presenting the Random Forest-AMSD (RF-AMSD) framework. Empirical evaluations on the Census Income, Heart Disease, Breast Cancer, and Forest Covertype datasets demonstrate that AMSD yields a 2-4% accuracy improvement over standard MSD-Splitting, while maintaining near-identical $O(N)$ time complexity reductions compared to the $O(N \log N)$ exhaustive search. Our Random Forest extension achieves state-of-the-art accuracy at a fraction of standard computational costs, confirming the viability of adaptive statistical binning in large-scale ensemble learning architectures.
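A hedged sketch of skewness-adaptive mean/standard-deviation cutpoints: the shrink rule and clipping below are illustrative assumptions, not the exact AMSD formula, and serve only to show how intervals narrow as skewness grows:

```python
import numpy as np

def amsd_cutpoints(x, base_k=1.0, shrink=0.25):
    # Skewness-adaptive mean/std cutpoints: the standard MSD cutpoints
    # mean +/- k*std are narrowed as |skewness| grows. The shrink rule and
    # clipping are illustrative assumptions, not the exact AMSD formula.
    mu, sd = x.mean(), x.std()
    skewness = ((x - mu) ** 3).mean() / sd ** 3
    k = max(0.25, base_k - shrink * abs(skewness))
    return mu - k * sd, mu + k * sd

rng = np.random.default_rng(0)
print("symmetric:", np.round(amsd_cutpoints(rng.normal(size=2000)), 3))
print("skewed:   ", np.round(amsd_cutpoints(rng.exponential(size=2000)), 3))
```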
[584] Benign Overfitting in Adversarial Training for Vision Transformers
Jiaming Zhang, Meng Ding, Shaopeng Fu, Jingfeng Zhang, Di Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Despite the remarkable success of Vision Transformers (ViTs) across a wide range of vision tasks, recent studies have revealed that they remain vulnerable to adversarial examples, much like Convolutional Neural Networks (CNNs). A common empirical defense strategy is adversarial training, yet the theoretical underpinnings of its robustness in ViTs remain largely unexplored. In this work, we present the first theoretical analysis of adversarial training under simplified ViT architectures. We show that, when trained under a signal-to-noise ratio that satisfies a certain condition and within a moderate perturbation budget, adversarial training enables ViTs to achieve nearly zero robust training loss and robust generalization error under certain regimes. Remarkably, this leads to strong generalization even in the presence of overfitting, a phenomenon known as benign overfitting, previously only observed in CNNs (with adversarial training). Experiments on both synthetic and real-world datasets further validate our theoretical findings.
[585] FB-NLL: A Feature-Based Approach to Tackle Noisy Labels in Personalized Federated Learning
Abdulmoneam Ali, Ahmed Arafa
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Personalized Federated Learning (PFL) aims to learn multiple task-specific models rather than a single global model across heterogeneous data distributions. Existing PFL approaches typically rely on iterative optimization, such as model update trajectories, to cluster users that need to accomplish the same tasks together. However, these learning-dynamics-based methods are inherently vulnerable to low-quality data and noisy labels, as corrupted updates distort clustering decisions and degrade personalization performance. To tackle this, we propose FB-NLL, a feature-centric framework that decouples user clustering from iterative training dynamics. By exploiting the intrinsic heterogeneity of local feature spaces, FB-NLL characterizes each user through the spectral structure of the covariances of their feature representations and leverages subspace similarity to identify task-consistent user groupings. This geometry-aware clustering is label-agnostic and is performed in a one-shot manner prior to training, significantly reducing communication overhead and computational costs compared to iterative baselines. Complementing this, we introduce a feature-consistency-based detection and correction strategy to address noisy labels within clusters. By leveraging directional alignment in the learned feature space and assigning labels based on class-specific feature subspaces, our method mitigates corrupted supervision without requiring estimation of stochastic noise transition matrices. In addition, FB-NLL is model-independent and integrates seamlessly with existing noise-robust training techniques. Extensive experiments across diverse datasets and noise regimes demonstrate that our framework consistently outperforms state-of-the-art baselines in terms of average accuracy and performance stability.
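The one-shot, label-agnostic clustering signal, subspace similarity between the spectral structures of per-user feature covariances, can be sketched on synthetic features; the backbone producing the users' local representations is assumed rather than implemented:

```python
import numpy as np

def user_subspace(features, k=3):
    # Top-k eigenvector subspace of a user's feature covariance matrix.
    cov = np.cov(features, rowvar=False)
    _, vecs = np.linalg.eigh(cov)
    return vecs[:, -k:]                           # leading eigenvectors as columns

def subspace_similarity(U, V):
    # Similarity from principal angles: mean squared cosine between subspaces.
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return float((s ** 2).mean())

# Synthetic stand-in: users 1 and 2 share a latent task basis, user 3 does not.
rng = np.random.default_rng(0)
basis_a, basis_b = rng.normal(size=(10, 3)), rng.normal(size=(10, 3))
user1 = rng.normal(size=(200, 3)) @ basis_a.T + 0.1 * rng.normal(size=(200, 10))
user2 = rng.normal(size=(200, 3)) @ basis_a.T + 0.1 * rng.normal(size=(200, 10))
user3 = rng.normal(size=(200, 3)) @ basis_b.T + 0.1 * rng.normal(size=(200, 10))
U1, U2, U3 = map(user_subspace, (user1, user2, user3))
print("same task:     ", round(subspace_similarity(U1, U2), 3))
print("different task:", round(subspace_similarity(U1, U3), 3))
```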
[586] FASTER: Value-Guided Sampling for Fast RL
Perry Dong, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Some of the most performant reinforcement learning algorithms today can be prohibitively expensive as they use test-time scaling methods such as sampling multiple action candidates and selecting the best one. In this work, we propose FASTER, a method for getting the benefits of sampling-based test-time scaling of diffusion-based policies without the computational cost by tracing the performance gain of action samples back to earlier in the denoising process. Our key insight is that we can model the denoising of multiple action candidates and selecting the best one as a Markov Decision Process (MDP) where the goal is to progressively filter action candidates before denoising is complete. With this MDP, we can learn a policy and value function in the denoising space that predicts the downstream value of action candidates in the denoising process and filters them while maximizing returns. The result is a method that is lightweight and can be plugged into existing generative RL algorithms. Across challenging long-horizon manipulation tasks in online and batch-online RL, FASTER consistently improves the underlying policies and achieves the best overall performance among the compared methods. Applied to a pretrained VLA, FASTER achieves the same performance while substantially reducing training and inference compute requirements. Code is available at https://github.com/alexanderswerdlow/faster .
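A hedged sketch of value-guided filtering during denoising: after each step a value function scores the partially denoised candidates and only the top ones are kept. The denoiser, value function, and keep schedule below are placeholders, not FASTER's trained models:

```python
import numpy as np

def value_guided_denoise(x_T, denoise_step, value_fn, n_steps, keep_schedule):
    # Progressive candidate filtering: denoise, score with a value function,
    # and keep only the top candidates at each step.
    candidates = x_T
    for t, keep in zip(range(n_steps, 0, -1), keep_schedule):
        candidates = denoise_step(candidates, t)
        scores = value_fn(candidates, t)
        candidates = candidates[np.argsort(scores)[::-1][:keep]]
    return candidates[0]                          # best remaining action

rng = np.random.default_rng(0)
target = np.array([0.3, -0.7])                    # stand-in "good" action
denoise_step = lambda x, t: x + 0.2 * (target - x) + 0.05 * rng.normal(size=x.shape)
value_fn = lambda x, t: -np.linalg.norm(x - target, axis=1)   # placeholder value
action = value_guided_denoise(rng.normal(size=(16, 2)), denoise_step, value_fn,
                              n_steps=4, keep_schedule=[8, 4, 2, 1])
print(action)
```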
[587] Safe Continual Reinforcement Learning in Non-stationary Environments
Austin Coursey, Abel Diaz-Gonzalez, Marcos Quinones-Grueiro, Gautam Biswas
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reinforcement learning (RL) offers a compelling data-driven paradigm for synthesizing controllers for complex systems when accurate physical models are unavailable; however, most existing control-oriented RL methods assume stationarity and, therefore, struggle in real-world non-stationary deployments where system dynamics and operating conditions can change unexpectedly. Moreover, RL controllers acting in physical environments must satisfy safety constraints throughout their learning and execution phases, rendering transient violations during adaptation unacceptable. Although continual RL and safe RL have each addressed non-stationarity and safety, respectively, their intersection remains comparatively unexplored, motivating the study of safe continual RL algorithms that can adapt over the system’s lifetime while preserving safety. In this work, we systematically investigate safe continual reinforcement learning by introducing three benchmark environments that capture safety-critical continual adaptation and by evaluating representative approaches from safe RL, continual RL, and their combinations. Our empirical results reveal a fundamental tension between maintaining safety constraints and preventing catastrophic forgetting under non-stationary dynamics, with existing methods generally failing to achieve both objectives simultaneously. To address this shortcoming, we examine regularization-based strategies that partially mitigate this trade-off and characterize their benefits and limitations. Finally, we outline key open challenges and research directions toward developing safe, resilient learning-based controllers capable of sustained autonomous operation in changing environments.
[588] Generalization at the Edge of Stability
Mario Tuci, Caner Korkmaz, Umut Şimşekli, Tolga Birdal
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the "sharpness dimension", and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
[589] Resource-aware Mixed-precision Quantization for Enhancing Deployability of Transformers for Time-series Forecasting on Embedded FPGAs
Tianheng Ling, Chao Qian, Gregor Schiele
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2410.03294: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.03294&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[590] Quantum Non-Linear Bandit Optimization
Zakaria Shams Siam, Chaowen Guan, Chong Liu
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2503.03023: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.03023&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[591] AutoNFS: Automatic Neural Feature Selection
Witold Wydmański, Marek Śmieja
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2503.13304: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.13304&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[592] Online Learning of Whittle Indices for Restless Bandits with Non-Stationary Transition Kernels
Md Kamran Chowdhury Shisher, Vishrant Tripathi, Mung Chiang, Christopher G. Brinton
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.18186: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18186&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[593] IMPACT: Importance-Aware Activation Space Reconstruction
Md Mokarram Chowdhury, Daniel Agyei Asante, Ernie Chang, Yang Li
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.03828: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.03828&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[594] Symbolic Quantile Regression for the Interpretable Prediction of Conditional Quantiles
Cas Oude Hoekstra, Floris den Hengst
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.08080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.08080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[595] Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
Lorenzo Livi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.12121: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12121&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[596] Automated Energy-Aware Time-Series Model Deployment on Embedded FPGAs for Resilient Combined Sewer Overflow Management
Tianheng Ling, Vipin Singh, Chao Qian, Felix Biessmann, Gregor Schiele
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.13905: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.13905&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[597] Benchmarking Physics-Informed Neural Networks and Boundary Elements Methods for Wave Scattering
Oscar Rincón-Cardeno, Gregorio Pérez Bernal, Silvana Montoya Noguera, Nicolás Guarín-Zapata
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.12483: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12483&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[598] Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
James Oldfield, Philip Torr, Ioannis Patras, Adel Bibi, Fazl Barez
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.26238: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26238&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[599] Conditional Diffusion Modeling with Attention for Probabilistic Battery Capacity Prediction under Real-World Condition
Chunlin Jiang, Hequn Li, Zhongwei Deng, Jie Shao, Zhansheng Ning
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.17414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[600] Physics-Informed Neural Operators for Cardiac Electrophysiology
Hannah Lydon, Milad Kazemi, Martin Bishop, Nicola Paoletti
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.08418: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08418&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[601] Breaking the Illusion: Consensus-Based Generative Mitigation of Adversarial Illusions in Multi-Modal Embeddings
Fatemeh Akbarian, Anahita Baninajjar, Yingyi Zhang, Ananth Balashankar, Amir Aminifar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2511.21893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[602] Bayesian Event-Based Model for Disease Subtype and Stage Inference
Hongtao Hao, Joseph L. Austerweil
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.03467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[603] Optimized Architectures for Kolmogorov-Arnold Networks
James Bagrow, Josh Bongard
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.12448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] The PROPER Approach to Proactivity: Benchmarking and Advancing Knowledge Gap Navigation
Kirandeep Kaur, Vinayak Gupta, Aditya Gupta, Chirag Shah
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.09926: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.09926&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] Diagnosing Failure Modes of Neural Operators Across Diverse PDE Families
Lennon Shikhman
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.11428: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11428&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] MapPFN: Learning Causal Perturbation Maps in Context
Marvin Sextro, Weronika Kłos, Gabriel Dernbach
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.21092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] TreeGrad-Ranker: Feature Ranking via $O(L)$-Time Gradients for Decision Trees
Weida Li, Yaoliang Yu, Bryan Kian Hsiang Low
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11623: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11623&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] Mixture of Predefined Experts: Maximizing Data Usage on Vertical Federated Learning
Jon Irureta, Gorka Azkune, Jon Imaz, Aizea Lojo, Javier Fernandez-Marques
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.12708: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12708&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] Drift Localization using Conformal Predictions
Fabian Hinder, Valerie Vaquet, Johannes Brinkrolf, Barbara Hammer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.19790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] Tackling multiphysics problems via finite element-guided physics-informed operator learning
Yusuke Yamazaki, Reza Najian Asl, Markus Apel, Mayu Muramatsu, Shahed Rezaei
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.01420: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01420&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] How Out-of-Equilibrium Phase Transitions can Seed Pattern Formation in Trained Diffusion Models
Luca Ambrogioni
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.20092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] A Heterogeneous Long-Micro Scale Cascading Architecture for General Aviation Health Management
Xinhang Chen, Zhihuan Wei, Yang Hu, Zhiguo Zeng, Kang Zeng, Wei Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.22885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] Semantic Interaction Information mediates compositional generalization in latent space
John Schwarcz
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2603.27134: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27134&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
Simon Zhang, Ryan P. DeMilt, Kun Jin, Cathy H. Xia
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.08404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] LLM-Extracted Covariates for Clinical Causal Inference: Rethinking Integration Strategies
Lei Liu, Jialin Chen, Kathy Macropol
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.16763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.16763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] Towards Real-Time ECG and EMG Modeling on $μ$NPUs
Josh Millar, Ashok Samraj Thangarajan, Soumyajit Chatterjee, Hamed Haddadi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2604.18067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.18067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] Whispers in the Machine: Confidentiality in Agentic Systems
Jonathan Evertz, Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2402.06922: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.06922&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] Finite-dimensional approximations of push-forwards on locally analytic functionals
Isao Ishikawa
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2404.10769: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.10769&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] Latent Linear Quadratic Regulator for Robotic Control Tasks
Yuan Zhang, Shaohui Yang, Toshiyuki Ohtsuka, Colin Jones, Joschka Boedecker
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2407.11107: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.11107&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] Byzantine-tolerant distributed learning of finite mixture models
Qiong Zhang, Yan Shuo Tan, Jiahua Chen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2407.13980: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.13980&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] Regression with Large Language Models for Materials and Molecular Property Prediction
Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2409.06080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.06080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] Global Optimization of Gaussian Process Acquisition Functions Using a Piecewise-Linear Kernel Approximation
Yilin Xie, Shiqiang Zhang, Joel A. Paulson, Calvin Tsay
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2410.16893: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.16893&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] The Data-Driven Censored Newsvendor Problem
Chamsi Hssaine, Sean R. Sinclair
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2412.01763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.01763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] Opinion de-polarization in social networks with GNNs
Konstantinos Mylonas, Thrasyvoulos Spyropoulos
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2412.09404: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.09404&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] DADA: Dual Averaging with Distance Adaptation
Mohammad Moshtaghifar, Anton Rodomanov, Daniil Vankov, Sebastian Stich
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2501.10258: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.10258&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] LatticeVision: Image to Image Networks for Modeling Non-Stationary Spatial Data
Antony Sikorski, Michael Ivanitskiy, Nathan Lenssen, Douglas Nychka, Daniel McKenzie
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.09803: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.09803&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[627] Highly Efficient and Effective LLMs with Multi-Boolean Architectures
Ba-Hien Tran, Van Minh Nguyen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2505.22811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[628] ASVSim (AirSim for Surface Vehicles): A High-Fidelity Simulation Framework for Autonomous Surface Vehicle Research
Bavo Lesy, Siemen Herremans, Robin Kerstens, Jan Steckel, Walter Daems, Siegfried Mercelis, Ali Anwar
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2506.22174: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.22174&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[629] VoteGCL: Enhancing Graph-based Recommendations with Majority-Voting LLM-Rerank Augmentation
Minh-Anh Nguyen, Bao Nguyen, Ha Lan N.T., Tuan Anh Hoang, Duc-Trong Le, Dung D. Le
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2507.21563: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21563&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[630] QTMRL: An Agent for Quantitative Trading Decision-Making Based on Multi-Indicator Guided Reinforcement Learning
Jingfeng Pan, Jiahao Chen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2508.20467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[631] Energy-Weighted Flow Matching: Unlocking Continuous Normalizing Flows for Efficient and Scalable Boltzmann Sampling
Niclas Dern, Lennart Redl, Sebastian Pfister, Marcel Kollovieh, David Lüdke, Stephan Günnemann
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.03726: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.03726&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[632] When Langevin Monte Carlo Meets Randomization: Non-asymptotic Error Bounds beyond Log-Concavity and Gradient Lipschitzness
Xiaojie Wang, Bin Yang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2509.25630: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25630&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[633] Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs
Patrick Forré, Abel Jansma
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.05786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[634] Flow-Opt: Scalable Centralized Multi-Robot Trajectory Optimization with Flow Matching and Differentiable Optimization
Simon Idoko, Prajyot Jadhav, Arun Kumar Singh
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.09204: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09204&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[635] Efficient Autoregressive Inference for Transformer Probabilistic Models
Conor Hassan, Nasrulloh Loka, Cen-You Li, Daolang Huang, Paul E. Chang, Yang Yang, Francesco Silvestrin, Samuel Kaski, Luigi Acerbi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.09477: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09477&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[636] Quantifying Data Similarity Using Cross Learning
Shudong Sun, Hao Helen Zhang, Joseph C Watkins
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.10866: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10866&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[637] PriorGuide: Test-Time Prior Adaptation for Simulation-Based Inference
Yang Yang, Severi Rissanen, Paul E. Chang, Nasrulloh Loka, Daolang Huang, Arno Solin, Markus Heinonen, Luigi Acerbi
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.13763: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13763&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[638] Nonmonotone subgradient methods based on a local descent lemma
Francisco J. Aragón-Artacho, Rubén Campoy, Pedro Pérez-Aros, David Torregrosa-Belén
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.19341: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19341&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[639] StrikeWatch: Wrist-worn Gait Recognition with Compact Time-series Models on Low-power FPGAs
Tianheng Ling, Chao Qian, Peter Zdankin, Torben Weis, Gregor Schiele
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2510.24738: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24738&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[640] Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
Lars van der Laan, Nathan Kallus
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2512.23805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[641] Local Updates in Distributed Optimization: Provable Acceleration and Topology Effects
Zuang Wang, Yongqiang Wang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2601.03442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[642] Enforcing Reciprocity in Operator Learning for Seismic Wave Propagation
Caifeng Zou, Yaozhong Shi, Zachary E. Ross, Robert W. Clayton, Kamyar Azizzadenesheli
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Failed to fetch summary for 2602.11631: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11631&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[643] GaiaFlow: Semantic-Guided Diffusion Tuning for Carbon-Frugal Search
Rong Fu, Jia Yee Tan, Chunlei Meng, Shuo Yin, Xiaowen Ma, Wangyu Wu, Muge Qi, Guangzhen Yao, Zhaolu Kang, Zeli Su, Simon Fong
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2602.15423 returned HTTP 429 (rate limited).
[644] LiveGraph: Active-Structure Neural Re-ranking for Exercise Recommendation
Rong Fu, Zijian Zhang, Haiyun Wei, Jiekai Wu, Kun Liu, Xianda Li, Haoyu Zhao, Yang Li, Yongtai Liu, Ziming Wang, Rui Lu, Simon Fong
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2602.17036 returned HTTP 429 (rate limited).
[645] Spatiotemporal-Aware Bit-Flip Injection on DNN-based Advanced Driver Assistance Systems (extended version)
Taibiao Zhao, Xiang Zhang, Mingxuan Sun, Ruyi Ding, Xugui Zhou
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.03753 returned HTTP 429 (rate limited).
[646] An Imbalanced Dataset with Multiple Feature Representations for Studying Quality Control of Next-Generation Sequencing
Philipp Röchner, Clarissa Krämer, Johannes U Mayer, Franz Rothlauf, Steffen Albrecht, Maximilian Sprang
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.04981 returned HTTP 429 (rate limited).
[647] Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel
Hongyi Jin, Bohan Hou, Guanjie Wang, Ruihang Lai, Jinqi Chen, Zihao Ye, Yaxing Cai, Yixin Dong, Xinhao Cheng, Zhihao Zhang, Yilong Zhao, Yingyi Huang, Lijie Yang, Jinchen Jiang, Gabriele Oliaro, Jianan Ji, Xupeng Miao, Vinod Grover, Todd C. Mowry, Zhihao Jia, Tianqi Chen
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.13327 returned HTTP 429 (rate limited).
[648] ExoNet: Multimodal Deep Learning for TESS Exoplanet Candidate Identification via Phase-Folded Light Curves, Stellar Parameters, and Multi-Head Attention Fusion
Md. Rashadul Islam
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.15560 returned HTTP 429 (rate limited).
[649] Q-SINDy: Quantum-Kernel Sparse Identification of Nonlinear Dynamics with Provable Coefficient Debiasing
Samrendra Roy, Syed Bahauddin Alam
Main category: cs.LG
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unavailable; the arXiv API request for 2604.16779 returned HTTP 429 (rate limited).
cs.MA
[650] Superficial Success vs. Internal Breakdown: An Empirical Study of Generalization in Adaptive Multi-Agent Systems
Namyoung So, Seokgyu Jang, Taeuk Kim
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Adaptive multi-agent systems (MAS) are increasingly adopted to tackle complex problems. However, the narrow task coverage of their optimization raises the question of whether they can function as general-purpose systems. To address this gap, we conduct an extensive empirical study of adaptive MAS, revealing two key findings: (1) topological overfitting – they fail to generalize across different domains; and (2) illusory coordination – they achieve reasonable surface-level accuracy while the underlying agent interactions diverge from ideal MAS behavior, raising concerns about their practical utility. These findings highlight the pressing need to prioritize generalization in MAS development and motivate evaluation protocols that extend beyond simple final-answer correctness.
[651] Gated Coordination for Efficient Multi-Agent Collaboration in Minecraft Game
HuaDong Jian, Chenghao Li, Haoyu Wang, Jiajia Shuai, Jinyu Guo, Yang Yang, Chaoning Zhang
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In long-horizon open-world multi-agent systems, existing methods often treat local anomalies as automatic triggers for communication. This default design introduces coordination noise, interrupts local execution, and overuses public interaction in cases that could be resolved locally. To address this issue, we propose a partitioned information architecture for MLLM agents that explicitly separates private execution states from public coordination states. Building on this design, we introduce two key mechanisms. First, we develop an event-triggered working memory based on system-verified outcomes to maintain compact and low-noise local state representations. Second, we propose a cost-sensitive gated escalation mechanism that determines whether cross-region communication should be initiated by jointly considering node criticality, local recovery cost, and downstream task impact. In this way, communication is transformed from a default reaction into a selective decision. Experiments conducted on long-term construction tasks in open environments demonstrate that, compared to baseline models based on strong communication and planned structures, the introduction of gated communication and a partitioned information architecture results in superior performance in terms of blueprint completion quality and execution chain length. It also improves local self-recovery, reduces ineffective escalations, and increases the utility of public communication.
[652] ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies
Shaoyu Li, Chaoyu Zhang, Hexuan Yu, Y. Thomas Hou, Wenjing Lou
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Autonomous AI agents live or die by the API tokens they consume: without paid inference capacity they cannot reason, act, or delegate. Compute-token cost has become the binding resource of the emerging agent economy, yet it is non-transferable: it is account-bound, vendor-specific, and absent from on-chain ledgers. Existing payment rails such as x402 move fiat-backed value between agents, but they do not represent the quantity agents actually burn. As a result, agents can transport purchasing power but cannot quote, escrow, or settle workflows in a unit aligned with compute cost. We present ClawCoin, a tokenized, compute-cost-indexed unit of account and settlement asset for decentralized agent economies. ClawCoin combines four layers: a robust basket index over standardized prices; an oracle publishing signed fresh attestations; a NAV-based mint/redeem vault with coverage thresholds and rate limits; and an on-chain settlement layer for multi-hop delegations. We implement a prototype on an Ethereum-compatible L2 and evaluate it using a multi-agent simulator and the OpenClaw testbed. Across single-agent, multi-agent, workflow, and procurement experiments, ClawCoin stabilizes execution capacity under cost shocks, reduces cross-agent quote dispersion, eliminates partial settlements, and sustains cooperative market dynamics that fiat-denominated baselines cannot. These results suggest that compute-indexed units of account can improve decentralized agent coordination.
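To make the basket-index idea above concrete, here is a minimal Python sketch of a trimmed, weighted index over provider compute prices. The provider names, weights, and trimming rule are illustrative assumptions, not ClawCoin's published specification; the oracle, vault, and settlement layers are omitted.

```python
import numpy as np

def robust_basket_index(prices: dict, weights: dict, trim: float = 0.2) -> float:
    """Trimmed, weighted average over standardized per-unit compute prices from
    several providers, so a single stale or manipulated quote cannot move the index."""
    names = sorted(prices)
    p = np.array([prices[n] for n in names], dtype=float)
    w = np.array([weights.get(n, 1.0) for n in names], dtype=float)
    k = int(trim * len(p))
    order = np.argsort(p)
    keep = order[k:len(p) - k] if len(p) > 2 * k else order  # drop extreme quotes
    return float(np.average(p[keep], weights=w[keep]))

# Example: the value a signed oracle attestation might carry for one epoch
# (provider names and prices are made up; "p5" is an outlier quote that gets trimmed).
index = robust_basket_index(
    {"p1": 2.10, "p2": 2.05, "p3": 2.20, "p4": 1.95, "p5": 9.90},
    {n: 1.0 for n in ["p1", "p2", "p3", "p4", "p5"]},
)
```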
[653] Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
Hongwei Xu
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Teams of LLM agents increasingly collaborate on tasks spanning days or weeks: multi-day data-generation sprints where generator, reviewer, and auditor agents coordinate in real time on overlapping batches; specialists carrying findings forward across session restarts; product decisions compounding over many review rounds. This requires agents to share, evaluate, and combine each other’s cognitive state in real time across sessions. We call this cross-session agent-to-agent cognitive collaboration, distinct from parallel agent execution. To enable it, three problems must be solved together. (P1) Each agent decides field by field what to accept from peers, not accept or reject whole messages. (P2) Every claim is traceable to source, so returning claims are recognised as echoes of the receiver’s own prior thinking. (P3) Memory that survives session restarts is relevant because of how it was stored, not how it is retrieved. These are protocol-level properties at the semantic layer of agent communication, distinct from tool-access and task-delegation protocols at lower layers. We call this missing protocol layer “semantic infrastructure,” and the Mesh Memory Protocol (MMP) specifies it. Four composable primitives work together: CAT7, a fixed seven-field schema for every Cognitive Memory Block (CMB); SVAF, which evaluates each field against the receiver’s role-indexed anchors and realises P1; inter-agent lineage, carried as parents and ancestors of content-hash keys and realising P2; and remix, which stores only the receiver’s own role-evaluated understanding of each accepted CMB, never the raw peer signal, realising P3. MMP is specified, shipped, and running in production across three reference deployments, where each session runs an autonomous agent as a mesh peer with its own identity and memory, collaborating with other agents across the network for collective intelligence.
[654] FOCAL: Filtered On-device Continuous Activity Logging for Efficient Personal Desktop Summarization
Haoran Yin, Zhiyuan Wen, Jiannong Cao, Bo Yuan, Ruosong Yang
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Desktop interaction streams provide a continuous, privacy-sensitive record of interleaved user tasks. Transforming these streams into task-organized personal logs on-device faces two main challenges: exhaustive Vision-Language Model (VLM) processing strains local resources, and global stream processing causes cross-task context pollution. We present FOCAL (Filtered On-device Continuous Activity Logging), a privacy-first multi-agent system utilizing a unified filter-plan-log architecture. It cascades a lightweight Filter Agent for noise suppression, a text-only Brain Agent for task attribution, a Record Agent for selective visual reasoning, and a task-isolated Memory Agent for context-coherent summarization. Experiments on DesktopBench (comprising 2,572 screenshots across 420 complex sessions) show FOCAL reduces total token consumption by 60.4% and VLM call count by 72.3% versus a baseline, while boosting Key Information Recall (KIR) from 0.38 to 0.61. Crucially, under A→B→A task interruptions, FOCAL maintains Task Acc 0.81 and KIR 0.80, whereas the baseline collapses to Task Acc 0.03. FOCAL pioneers the efficient, on-device summarization of instruction-free desktop streams into multi-perspective personal logs.
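As an illustration of the filter-plan-log cascade described above, here is a minimal Python sketch. The Event fields, routing heuristics, and the stubbed VLM call are placeholders, not FOCAL's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    screenshot_id: str
    window_title: str
    ocr_text: str

@dataclass
class TaskLog:
    task_id: str
    notes: list = field(default_factory=list)

def filter_agent(event: Event) -> bool:
    """Cheap noise gate: drop idle/empty frames before any expensive VLM call."""
    return bool(event.ocr_text.strip())            # placeholder heuristic

def brain_agent(event: Event, logs: dict) -> str:
    """Text-only task attribution from window titles / OCR; no visual model involved."""
    for task_id in logs:
        if task_id.lower() in event.window_title.lower():
            return task_id
    return event.window_title or "misc"            # open a new task bucket

def record_agent(event: Event) -> str:
    """Selective visual reasoning, called only for events that survive the filter
    (a stub stands in for the VLM here)."""
    return f"[visual summary of {event.screenshot_id}]"

def memory_agent(logs: dict, task_id: str, note: str) -> None:
    """Task-isolated memory: each task keeps its own context to avoid cross-task pollution."""
    logs.setdefault(task_id, TaskLog(task_id)).notes.append(note)

def focal_pipeline(stream):
    logs: dict = {}
    for event in stream:
        if not filter_agent(event):                # 1. filter
            continue
        task_id = brain_agent(event, logs)         # 2. plan / attribute
        note = record_agent(event)                 # 3. log with selective visual reasoning
        memory_agent(logs, task_id, note)          # 4. task-isolated summarization memory
    return logs
```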
[655] TeamFusion: Supporting Open-ended Teamwork with Multi-Agent Systems
Jiale Liu, Victor S. Bursztyn, Lin Ai, Haoliang Wang, Sunav Choudhary, Saayan Mitra, Qingyun Wu
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: In open-ended domains, teams must reconcile diverse viewpoints to produce strong deliverables. Answer aggregation approaches commonly used in closed domains are ill-suited to this setting, as they tend to suppress minority perspectives rather than resolve underlying disagreements. We present TeamFusion, a multi-agent system designed to support teamwork in open-ended domains by: 1. Instantiating a proxy agent for each team member conditioned on their expressed preferences; 2. Conducting a structured discussion to surface agreements and disagreements; and 3. Synthesizing more consensus-oriented deliverables that feed into new iterations of discussion and refinement. We evaluate TeamFusion on two teamwork tasks where team members can assess how well their individual views are represented in team decisions and how consensually strong the final deliverables are, finding that it outperforms direct aggregation baselines across metrics, tasks, and team configurations.
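A minimal sketch of the proxy-discussion-synthesis loop described above, assuming a generic call_llm wrapper; the prompts and round structure are illustrative, not the authors' system.

```python
def call_llm(prompt: str) -> str:
    """Placeholder LLM call; swap in any chat-completion API."""
    return f"[response to: {prompt[:60]}...]"

def make_proxy(member: str, preferences: str):
    """Step 1: one proxy agent per team member, conditioned on their stated preferences."""
    def proxy(topic: str, transcript: list) -> str:
        return call_llm(
            f"You represent {member}, whose preferences are: {preferences}\n"
            "Discussion so far:\n" + "\n".join(transcript) +
            f"\nState your position on '{topic}', noting agreements and disagreements."
        )
    return proxy

def team_fusion(topic: str, members: dict, rounds: int = 3) -> str:
    proxies = {name: make_proxy(name, prefs) for name, prefs in members.items()}
    deliverable = ""
    for _ in range(rounds):
        transcript = []
        for name, proxy in proxies.items():        # Step 2: structured discussion
            transcript.append(f"{name}: {proxy(topic, transcript)}")
        deliverable = call_llm(                    # Step 3: consensus-oriented synthesis
            "Synthesize a deliverable that resolves the disagreements below rather than "
            "averaging them away:\n" + "\n".join(transcript) +
            (f"\nPrevious draft:\n{deliverable}" if deliverable else "")
        )
    return deliverable
```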
[656] Cost-Aware Distributed Online Learning with Strict Rejection Behavior against Adversarial Agents
Yuhan Suo, Senchun Chai, Xudong Zhao, Yuanqing Xia, and Runqi Chai
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Distributed online learning in multi-agent systems is highly vulnerable to adversarial influence, especially when malicious agents cannot be fully isolated during the transient stage. While existing studies mainly pursue resilient consensus or secure fusion, they pay much less attention to the learning inefficiency and extra evolution cost accumulated during the defense process. This paper addresses this gap by developing a cost-aware distributed online learning framework with strict rejection behavior against adversarial agents. Under this mechanism, the state evolution cost of online adaptation is formulated and the cost amplification effect caused by adversarial interactions is theoretically characterized. To balance robustness, convergence efficiency, and long-term cost, we propose an adaptive adjustment mechanism for the state-evolution rate. The resulting outer-layer update can be equivalently viewed as a constrained online optimization problem. We further establish the well-posedness and regularity of the associated periodic Riccati layer, and show that the outer-layer update ensures feasibility and controlled variation. Based on these properties, closed-loop practical stability is rigorously established via a two-time-scale Lyapunov framework. Simulations demonstrate that the proposed method achieves robust and low-cost convergence under adversarial disturbances. Furthermore, a multi-satellite target tracking scenario with malicious interference demonstrates the practical effectiveness of the strict rejection behavior.
[657] AgentDynEx: Nudging the Mechanics and Dynamics of Multi-Agent Simulations
Jenny Ma, Riya Sahni, Karthik Sreedhar, Lydia B. Chilton
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-agent large language model simulations have the potential to model complex human behaviors and interactions. If the mechanics are set up properly, unanticipated and valuable social dynamics can surface. However, it is challenging to consistently enforce simulation mechanics while still allowing for notable and emergent dynamics. We present AgentDynEx, an AI system that helps set up simulations from user-specified mechanics and dynamics. AgentDynEx uses LLMs to guide users through a Configuration Matrix to identify core mechanics and define milestones to track dynamics. It also introduces a method called nudging, where the system dynamically reflects on simulation progress and gently intervenes if it begins to deviate from intended outcomes. A technical evaluation found that nudging enables simulations to have more complex mechanics and maintain their notable dynamics compared to simulations without nudging. We discuss the importance of nudging as a technique for balancing mechanics and dynamics of multi-agent simulations.
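The nudging mechanism can be illustrated with a toy control loop: milestones are checked each step, and when progress stalls, the system appends gentle guidance to the shared context rather than restarting the simulation. All names, thresholds, and the simulation stub below are hypothetical.

```python
import random

def run_step(state: dict) -> dict:
    """Placeholder for one simulation step; a real system would advance agent actions here."""
    state["progress"] += random.uniform(0.0, 0.1)
    return state

def nudge(state: dict, note: str) -> dict:
    """Gentle intervention: add guidance to the shared context instead of resetting the run."""
    state["shared_context"].append(f"[nudge] {note}")
    return state

def simulate_with_nudging(milestones, max_steps: int = 200, patience: int = 10):
    state = {"progress": 0.0, "shared_context": []}
    next_ms, stalled = 0, 0
    for _ in range(max_steps):
        state = run_step(state)
        if next_ms < len(milestones) and state["progress"] >= milestones[next_ms]:
            next_ms, stalled = next_ms + 1, 0      # milestone reached on schedule
        else:
            stalled += 1
        if stalled > patience:                     # dynamics deviating from intended outcomes
            state = nudge(state, f"Refocus agents on milestone {next_ms}.")
            stalled = 0
    return state

final_state = simulate_with_nudging([0.3, 0.6, 0.9])
```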
[658] OMAC: A Holistic Optimization Framework for LLM-Based Multi-Agent Collaboration
Shijun Li, Hilaf Hasson, Joydeep Ghosh
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Agents powered by advanced large language models (LLMs) have demonstrated impressive capabilities across diverse complex applications. Recently, Multi-Agent Systems (MAS), wherein multiple agents collaborate and communicate with each other, have exhibited enhanced capabilities in complex tasks, such as high-quality code generation and arithmetic reasoning. However, the development of such systems often relies on handcrafted methods, and the literature on systematic design and optimization of LLM-based MAS remains limited. In this work, we introduce OMAC, a general framework designed for holistic optimization of LLM-based MAS. Specifically, we identify five key optimization dimensions for MAS, encompassing both agent functionality and collaboration structure. Building upon these dimensions, we first propose a general algorithm, utilizing two actors termed the Semantic Initializer and the Contrastive Comparator, to optimize any single dimension. Then, we present an algorithm for joint optimization across multiple dimensions. Extensive experiments demonstrate the superior performance of OMAC on code generation, arithmetic reasoning, and general reasoning tasks against state-of-the-art approaches.
[659] CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation
Kuo Tian, Pengfei Sun, Zhen Wu, Junran Ding, Xinyu Dai
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The autonomous synthesis of deep research reports represents a critical frontier for Large Language Models (LLMs), demanding sophisticated information orchestration and non-linear narrative logic. Current approaches rely on rigid predefined linear workflows, which cause error accumulation, preclude global restructuring from subsequent insights, and ultimately limit in-depth multimodal fusion and report quality. We propose CogGen, a Cognitively inspired recursive framework for deep research report Generation. Leveraging a Hierarchical Recursive Architecture to simulate cognitive writing, CogGen enables flexible planning and global restructuring. To extend this recursivity to multimodal content, we introduce Abstract Visual Representation (AVR): a concise intent-driven language that iteratively refines visual-text layouts without pixel-level regeneration overhead. We further present CLEF, a Cognitive Load Evaluation Framework, and curate a new benchmark from Our World in Data (OWID). Extensive experiments show CogGen achieves state-of-the-art results among open-source systems, generating reports comparable to professional analysts’ outputs and surpassing Gemini Deep Research. Our code and dataset are available at https://github.com/NJUNLP/CogGen.
[660] Multi-UAV Path Following using Vector-Field Guidance
Gautam Kumar, Amit Shivam, Ashwini Ratnoo
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: This paper presents a decentralized, collision-free framework for path following guidance of multiple uncrewed aerial vehicles (UAVs), while maintaining uniform spacing along a reference path. A vector field-based guidance law is employed to drive each UAV toward the reference path. A rotational repulsion mechanism, utilizing relative distance and bearing between UAVs, is proposed to avoid collisions during convergence to the path, and an inter-UAV spacing error-based velocity control law is presented to achieve uniform separation along the path. Analytical guarantees are established for collision avoidance and convergence of the inter-UAV spacing errors to zero, ensuring uniform separation along the path. Numerical simulations demonstrate the efficacy of the proposed method.
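For intuition, a minimal numpy sketch of vector-field path following toward a circular reference path, with a rotational repulsion term that steers away from close neighbours rather than pushing straight back. The specific field, gains, and the absence of spacing control are illustrative simplifications, not the paper's guidance law.

```python
import numpy as np

def vector_field(p, center=np.zeros(2), radius=10.0, k=1.0):
    """Guidance field for a circular reference path: follow the tangent while
    correcting the signed distance error toward the path."""
    r = p - center
    dist = np.linalg.norm(r) + 1e-9
    radial = r / dist
    tangent = np.array([-radial[1], radial[0]])
    e = dist - radius                      # signed path error
    v = tangent - k * e * radial
    return v / (np.linalg.norm(v) + 1e-9)

def rotational_repulsion(p_i, p_j, safe_dist=2.0, gain=2.0):
    """Rotate the commanded direction away from a nearby UAV instead of pushing
    straight back, which helps avoid deadlock while converging to the path."""
    d = p_i - p_j
    dist = np.linalg.norm(d) + 1e-9
    if dist > safe_dist:
        return np.zeros(2)
    perp = np.array([-d[1], d[0]]) / dist
    return gain * (safe_dist - dist) * perp

def velocity_command(p_i, neighbours, speed=1.0):
    u = vector_field(p_i)
    for p_j in neighbours:
        u = u + rotational_repulsion(p_i, p_j)
    return speed * u / (np.linalg.norm(u) + 1e-9)
```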
[661] Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation
Nuo Chen, Yicheng Tong, Yuzhe Yang, Yufei He, Xueyi Zhang, Qingyun Zou, Qian Wang, Bingsheng He
Main category: cs.MA
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Multi-agent systems (MAS) are increasingly used for open-ended idea generation, driven by the expectation that collective interaction will broaden the exploration diversity. However, when and why such collaboration truly expands the solution space remains unclear. We present a systematic empirical study of diversity in MAS-based ideation across three bottom-up levels: model intelligence, agent cognition, and system dynamics. At the model level, we identify a compute efficiency paradox, where stronger, highly aligned models yield diminishing marginal diversity despite higher per-sample quality. At the cognition level, authority-driven dynamics suppress semantic diversity compared to junior-dominated groups. At the system level, group-size scaling yields diminishing returns and dense communication topologies accelerate premature convergence. We characterize these outcomes as collective failures emerging from structural coupling, a process where interaction inadvertently contracts agent exploration and triggers diversity collapse. Our analysis shows that this collapse arises primarily from the interaction structure rather than inherent model insufficiency, highlighting the importance of preserving independence and disagreement when designing MAS for creative tasks. Our code is available at https://github.com/Xtra-Computing/MAS_Diversity.
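One simple way to quantify the diversity collapse discussed above is a mean pairwise cosine distance over idea embeddings; the sketch below is a generic proxy, not necessarily the paper's exact metric.

```python
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance over a set of idea embeddings of shape (n, d).
    Values near 0 indicate collapse onto a single idea; larger values indicate a
    more spread-out solution space."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    x = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-9)
    sims = x @ x.T
    iu = np.triu_indices(n, k=1)
    return float(np.mean(1.0 - sims[iu]))

# Tracking collapse across discussion rounds (embeddings from any sentence encoder):
# for r, ideas in enumerate(rounds_of_embeddings):
#     print(r, semantic_diversity(ideas))
```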
cs.MM
[662] Smiling Regulates Emotion During Traumatic Recollection
Marcus Ma, Emily Zhou, Leonard Ludwig, Julia Hörath, Christina Winkler, Kleanthis Avramidis, Tiantian Feng, Gabor Toth, Alina Bothe, Shrikanth Narayanan
Main category: cs.MM
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failedMethod: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We study when, where, and why 978 Holocaust survivors smile in video testimonies. We create an automatic smile detection model from facial features with an F1 of 85% and annotate detected smiles under two established taxonomies of smiling. We produce narrative features on 1,083,417 transcript sentences as well as emotional valence from three different modalities: audio, eye gaze, and text transcript. Smiling rates are significantly correlated with specific semantic topics, narrative structures, and temporal syntaxes across the entire corpus. Smiles often occur during periods of intense negative affect; these negative-affect smiles improve the valence trajectory of surrounding sentences significantly across all three modalities. Smiling reduces eye dynamics and blink rates, and the strength of both of these effects is also modulated by narrative valence. Taken together, we conclude that smiling plays a critical role in regulating emotion and social interaction during traumatic recollection.
eess.AS
[663] Self-Noise Reduction for Capacitive Sensors via Photoelectric DC Servo: Application to Condenser Microphones
Hirotaka Obo, Atsushi Tsuchiya, Tadashi Ebihara, Naoto Wakatsuki
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: The self-noise of capacitive sensors, primarily caused by thermal noise from the gate-bias resistor in the preamplifier, imposes a fundamental limit on measurement sensitivity. In electret condenser microphones (ECMs), this resistor simultaneously determines the noise low-pass cutoff frequency and the signal high-pass cutoff frequency through a single RC time constant, creating a trade-off between noise reduction and signal bandwidth. This paper proposes PDS-Amp (Photoelectric DC Servo Amplifier), a circuit technique that replaces the gate-bias resistor with a photoelectric element functioning as an ultra-high-impedance current source. A DC servo loop using lag-lead compensation feeds back the preamplifier output through an LED to control the photocurrent, thereby stabilizing the gate bias while decoupling the noise and signal cutoff frequencies. A custom photosensor based on the external photoelectric effect of a zinc photocathode was fabricated to achieve sub-picoampere dark current, overcoming the limitations of commercial semiconductor photodiodes. Combined with a cascode JFET preamplifier that minimizes input capacitance through bootstrap action, PDS-Amp achieved a self-noise of 11 dBA with a 12 pF dummy microphone. Despite using a small-diameter ECM capsule, this performance is comparable to that of large-diaphragm condenser microphones costing several thousand dollars. Recording experiments with an actual ECM capsule qualitatively confirmed a significant reduction in background noise. The proposed technique is applicable not only to microphones but broadly to capacitive sensors including accelerometers, pressure sensors, and pyroelectric sensors.
[664] Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization
Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Nune Tadevosyan, Vitaly Lavrukhin, Boris Ginsburg
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Unification of automatic speech recognition (ASR) systems reduces development and maintenance costs, but training a single model to perform well in both offline and low-latency streaming settings remains challenging. We present a Unified ASR framework for Transducer (RNNT) training that supports both offline and streaming decoding within a single model, using chunk-limited attention with right context and dynamic chunked convolutions. To further close the gap between offline and streaming performance, we introduce an efficient Triton implementation of mode-consistency regularization for RNNT (MCR-RNNT), which encourages agreement across training modes. Experiments show that the proposed approach improves streaming accuracy at low latency while preserving offline performance and scaling to larger model sizes and training datasets. The proposed Unified ASR framework and the English model checkpoint are open-sourced.
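A minimal PyTorch sketch of mode-consistency regularization: the same batch is passed through the model in offline and streaming modes, and a symmetric KL term between the two output lattices is added to the transducer loss. The model and loss interfaces are assumptions, and the paper's efficient Triton kernel is not reproduced here.

```python
import torch
import torch.nn.functional as F

def mode_consistency_loss(logits_offline: torch.Tensor,
                          logits_streaming: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between the offline and streaming output lattices.
    Both tensors are assumed to have shape (B, T, U, V), as produced by an RNNT joint."""
    log_off = F.log_softmax(logits_offline, dim=-1)
    log_str = F.log_softmax(logits_streaming, dim=-1)
    kl_a = F.kl_div(log_str, log_off, log_target=True, reduction="batchmean")
    kl_b = F.kl_div(log_off, log_str, log_target=True, reduction="batchmean")
    return 0.5 * (kl_a + kl_b)

def unified_training_step(model, batch, rnnt_loss_fn, alpha: float = 0.1):
    # Two forward passes over the same utterances: full context vs. chunk-limited attention.
    logits_off = model(batch, streaming=False)     # hypothetical mode flag
    logits_str = model(batch, streaming=True)
    loss = (rnnt_loss_fn(logits_off, batch)
            + rnnt_loss_fn(logits_str, batch)
            + alpha * mode_consistency_loss(logits_off, logits_str))
    return loss
```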
[665] Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
Jianbo Ma, Richard Cartwright
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in Text-To-Speech (TTS) synthesis have seen the popularity of multi-stage approaches that first predict semantic tokens and then generate acoustic tokens. In this paper, we extend the coarse-to-fine generation paradigm to the temporal domain and introduce Chain-of-Details (CoD), a novel framework that explicitly models temporal coarse-to-fine dynamics in speech generation using a cascaded architecture. Our method progressively refines temporal details across multiple stages, with each stage targeting a specific temporal granularity. All temporal detail predictions are performed using a shared decoder, enabling efficient parameter utilization across different temporal resolutions. Notably, we observe that the lowest detail level naturally performs phonetic planning without the need for an explicit phoneme duration predictor. We evaluate our method on several datasets and compare it against several baselines. Experimental results show that CoD achieves competitive performance with significantly fewer parameters than existing approaches. Our findings demonstrate that explicit modeling of temporal dynamics with the CoD framework leads to more natural speech synthesis.
[666] Computational Narrative Understanding for Expressive Text-to-Speech
Gaspard Michel, Elena V. Epure, Christophe Cerisara
Main category: eess.AS
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Recent advances in text-to-speech (TTS) have been driven by large, multi-domain speech corpora, yet the expressive potential of audiobook data remains underexamined. We argue that human-narrated audiobooks, particularly fictional works, contain rich and diverse prosodic cues arising from the natural alternation between neutral narration and expressive character dialogue. Building from this observation, we introduce LibriQuote, a large-scale dataset of 5.3K hours of expressive speech drawn from character quotations. Each quote is supplemented with contextual pseudo-labels for speech verbs and adverbs that characterize the intended delivery of direct speech (e.g., “he whispered softly”). We found that fine-tuning a flow-matching model on LibriQuote yields substantial improvements in expressivity and intelligibility, while training from scratch enhances the expressiveness of an autoregressive TTS model. Benchmarking on LibriQuote-test highlights significant variability across systems in generating expressive speech. We publicly release the dataset, code, and evaluation resources to facilitate reproducibility. Audio samples can be found at https://libriquote.github.io/.
eess.IV
[667] A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation
Nichula Wasalathilaka, Dineth Perera, Oshadha Samarakoon, Buddhi Wijenayake, Roshan Godaliyadda, Vijitha Herath, Parakrama Ekanayake
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.
[668] VOLT: Volumetric Wide-Field Microscopy via 3D-Native Probabilistic Transport
Yetao He, Wenhan Guo, Deliang Wei, Evan Bel, Ji Yi, Yu Sun
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Three-dimensional (3D) wide-field fluorescence microscopy is a widely used modality for volumetric imaging, but suffers from characteristic out-of-focus blur. Existing reconstruction methods either struggle to operate on high-dimensional volumes or fail to provide credibility characterization of the reconstruction. In this work, we introduce Volumetric Transport (VOLT), a 3D-native probabilistic framework for wide-field fluorescence microscopy reconstruction. VOLT combines a transport-based formulation that maps degraded measurements to clean volumes via stochastic interpolants with a 3D-native anisotropic network that separates lateral and axial processing. This design operates directly in voxel space and achieves improved scalability to large volumes without relying on slice-wise approximations. We develop both stochastic (SDE) and deterministic (ODE) variants within the same framework. We validate VOLT on simulated wide-field microscopy datasets. Our results show that VOLT significantly improves reconstruction quality in both lateral and axial directions while providing voxel-wise credibility estimates.
[669] ExplainS2A: Explainable Spectral-Spatial Duality Model for Fast Transforming Sentinel-2 Image to AVIRIS-Level Hyperspectral Image
Chia-Hsiang Lin, Zi-Chao Leng
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Mainstream optical satellites often acquire multispectral, multi-resolution images, which have limited material identifiability compared to hyperspectral images (HSIs). Thus, spectrally super-resolving a multispectral image (MSI) into its hyperspectral counterpart greatly facilitates remote material identification and the downstream tasks. However, spectrally super-resolving the MSI into an HSI is often constrained by the multi-resolution nature of the sensor. Specifically, due to the presence of some low-resolution (LR) bands in the MSI, the initial spectral super-resolution results often appear to be spatially blurry, resulting in an LR HSI. To overcome this bottleneck, we then leverage some high-resolution (HR) band inherent in the acquired MSI to spatially guide the reconstruction procedure, thereby yielding the desired HR HSI. This fusion procedure elegantly coincides with a widely known spatial super-resolution problem in satellite remote sensing. Hence, we have reformulated the tough spectral super-resolution problem into a more widely investigated spatial super-resolution problem, referred to as the spectral-spatial duality theory. Accordingly, we propose ExplainS2A, consisting of a deep unfolding network and an explainable fusion network, that unifies spectral recovery and spatial fusion into a single explainable framework. Unlike conventional black-box models, ExplainS2A offers interpretability and operates as a linear-time algorithm. Remarkably, it can process a million-scale Sentinel-2 image in less than one second, yielding a high-fidelity HSI over the same scene, and upgrades the blind source separation results. Although demonstrated on the Sentinel-2 and AVIRIS sensors, ExplainS2A also serves as a general framework applicable to various sensor pairs with different resolution configurations, and has experimentally demonstrated cross-region and cross-season generalization ability. Source codes: https://github.com/IHCLab/ExplainS2A.
[670] Deep Image Prior for photoacoustic tomography can mitigate limited-view artifacts
Hanna Pulkkinen, Jenni Poimala, Leonid Kunyansky, Janek Gröhl, Andreas Hauptmann
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: We study the deep image prior (DIP) framework applied to photoacoustic tomography (PAT) as an unsupervised reconstruction approach to mitigate limited-view artifacts and noise commonly encountered in experimental settings. Efficient implementation is achieved by employing recently published fast forward and adjoint algorithms for circular measurement geometries. Initialization via a fast inverse and total variation (TV) regularization are applied to further suppress noise and mitigate overfitting. For comparison, we compute a classical TV reconstruction. Our experiments comprise simulated PAT measurements under limited-view geometries and varying levels of added noise, as well as experimental measurements evaluated with a digital twin for quality assessment. Our findings suggest that the DIP framework provides an effective unsupervised strategy for robust PAT reconstruction, even in the challenging case of a limited-view geometry, providing improvement in several quantitative measures over total variation reconstructions.
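A minimal PyTorch sketch of the deep image prior loop with TV regularization and a fast-inverse warm start; forward_op stands in for the fast PAT forward operator, and the tiny CNN and residual initialization are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn

def tv_loss(x: torch.Tensor) -> torch.Tensor:
    """Anisotropic total variation of an image batch shaped (N, C, H, W)."""
    dh = (x[..., 1:, :] - x[..., :-1, :]).abs().mean()
    dw = (x[..., :, 1:] - x[..., :, :-1]).abs().mean()
    return dh + dw

def dip_reconstruct(y, forward_op, x_init, iters: int = 2000, lam: float = 1e-3, lr: float = 1e-3):
    """y: measured PAT data; forward_op: image -> predicted data (e.g. a fast
    circular-geometry operator); x_init: fast-inverse reconstruction used as a warm start."""
    net = nn.Sequential(                            # placeholder generator network
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 1, 3, padding=1),
    )
    z = torch.randn_like(x_init)                    # fixed noise input, as in DIP
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(iters):
        opt.zero_grad()
        x = net(z) + x_init                         # residual over the warm start (an assumption)
        loss = ((forward_op(x) - y) ** 2).mean() + lam * tv_loss(x)
        loss.backward()
        opt.step()
    return (net(z) + x_init).detach()
```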
[671] Harmonizing MR Images Across 100+ Scanners: Multi-site Validation with Traveling Subjects and Real-world Protocols
Savannah P. Hays, Lianrui Zuo, Muhammad Faizyab Ali Chaudhary, Kathleen M. Bartz, Samuel W. Remedios, Jinwei Zhang, Jiachen Zhuo, Murat Bilgel, Shiv Saidha, Ellen M. Mowry, Scott D. Newsome, Jerry L. Prince, Blake E. Dewey, Aaron Carass
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Reliable harmonization of heterogeneous magnetic resonance (MR) image datasets, especially those acquired in pragmatic clinical trials, is critical to advance multi-center neuroimaging studies and translational machine learning in healthcare. We present an enhanced and rigorously validated version of the HACA3 harmonization algorithm, which we refer to as HACA3+, incorporating key methodological enhancements: (1) an improved artifact encoder to better isolate and mitigate image artifacts, (2) background and foreground-sensitive attention mechanisms to increase harmonization specificity, and (3) extensive training using data spanning 100+ scanners from 64 independent sites, providing a broader diversity of scanners than other harmonization methods. Our study focuses on four commonly acquired MR image contrasts (T1-weighted, T2-weighted, proton density, and fluid-attenuated inversion recovery), reflecting realistic clinical protocols. We perform inter-site harmonization experiments using traveling subjects to assess the generalization and robustness of the harmonization model. We compare the results of the publicly available version of HACA3 and our implementation, HACA3+. Downstream relevance is further established through whole brain segmentation and image imputation. Finally, we justify each enhancement through an ablation experiment. Pre-trained weights and code for HACA3+ are made publicly available at https://github.com/shays15/haca3-plus.
[672] Defining Robust Ultrasound Quality Metrics via an Ultrasound Foundation Model
Ziyang Huang, Bingyan Li, Chen Ma, Tianyi Liu, Yihui Zhai, Hong Xu, Yi Guo, Zeju Li, Yuanyuan Wang
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Clinicians lack a principled framework to quantify diagnostic utility in ultrasound reconstructions. Existing standards like PSNR and VGG-LPIPS are inadequate, failing to account for modality-specific physics or the structural nuances of acoustic imaging. We close this gap with a TinyUSFM-based evaluation framework featuring two distinct metrics: TinyUSFM-uLPIPS, a full-reference perceptual distance based on multi-layer token relations, and TinyUSFM-NRQ, a deployable no-reference quality score utilizing clean-manifold modeling and worst-region aggregation to detect localized harmful artifacts. We demonstrate that the presented metrics have four unique advantages: 1) Task-linked quality, where TinyUSFM-uLPIPS achieves superior calibration with semantic task damage, accurately reflecting Dice-score drops in segmentation where VGG-based metrics fail; 2) Cross-organ comparability, maintaining stable scoring scales and consistent severity rankings across diverse anatomical sites and domain-shifted data; 3) PSNR-consistent sensitivity, with TinyUSFM-NRQ providing a reliable quality score without ground-truth images that remains consistent with traditional fidelity benchmarks (i.e. PSNR); and 4) Clinical utility, improving the prediction of expert preference from 47.2% to 72.8% accuracy and producing super-resolution reconstructions preferred by sonographers. By integrating these advantages into a unified assessment and optimization loop, this work establishes a modality-aligned standard that finally bridges the gap between algorithmic performance and diagnostic utility. https://github.com/sextant-fable/US-Metrics
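For orientation, an LPIPS-style sketch of a full-reference perceptual distance accumulated over multi-layer foundation-model features; the feature extractor and layer weights are placeholders, and the paper's token-relation and worst-region mechanisms are not reproduced.

```python
import torch
import torch.nn.functional as F

def multilayer_perceptual_distance(feats_x, feats_y, weights=None) -> torch.Tensor:
    """feats_x / feats_y: lists of per-layer feature maps (B, C, H, W) extracted from
    the same backbone for the reconstructed and reference images, respectively."""
    weights = weights or [1.0] * len(feats_x)
    total = 0.0
    for w, fx, fy in zip(weights, feats_x, feats_y):
        fx = F.normalize(fx, dim=1)                 # unit-normalize channels, as in LPIPS
        fy = F.normalize(fy, dim=1)
        total = total + w * ((fx - fy) ** 2).mean()
    return total

# Usage with a hypothetical extractor returning intermediate feature maps:
# dist = multilayer_perceptual_distance(extract(recon), extract(reference))
```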
[673] Synthetic Abundance Maps for Unsupervised Super-Resolution of Hyperspectral Remote Sensing Images
Xinxin Xu, Yann Gousseau, Christophe Kervazo, Saïd Ladjal
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Hyperspectral single image super-resolution (HS-SISR) aims to enhance the spatial resolution of hyperspectral images to fully exploit their spectral information. While considerable progress has been made in this field, most existing methods are supervised and require ground truth data for training, data that is often unavailable in practice. To overcome this limitation, we propose a novel unsupervised training framework for HS-SISR, based on synthetic abundance data, where no high-resolution ground-truth reference is required for training. The approach begins by unmixing the hyperspectral image into endmembers and abundances. A neural network is then trained to perform abundance super-resolution using synthetic abundances only. These synthetic abundance maps are generated from a dead leaves model whose characteristics are inherited from the low-resolution image to be super-resolved and from the known point spread function (PSF) of the hyperspectral sensor. This trained network is subsequently used to enhance the spatial resolution of the original image’s abundances, and the final super-resolution hyperspectral image is reconstructed by combining them with the endmembers. Experimental results demonstrate both the training value of the synthetic data and the effectiveness of the proposed method across 3 datasets, 3 scaling factors, and several evaluation metrics. The code is available at https://github.com/xinxinxu99/SISR-DL.git
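A minimal numpy sketch of a dead leaves abundance generator: disks are drawn in random order, later disks occlude earlier ones, and per-pixel labels become one-hot abundance maps. In practice the disk statistics would be tuned to the low-resolution image and the maps degraded with the known sensor PSF, which this sketch does not do.

```python
import numpy as np

def dead_leaves_abundances(size: int = 128, n_endmembers: int = 4, n_disks: int = 300,
                           r_min: float = 3.0, r_max: float = 25.0, seed: int = 0):
    """Synthetic abundance maps from a dead leaves model: random disks painted in
    sequence, each occluding earlier ones. Returns one-hot abundances of shape
    (n_endmembers, size, size)."""
    rng = np.random.default_rng(seed)
    label = rng.integers(0, n_endmembers, size=(size, size))    # random background material
    yy, xx = np.mgrid[0:size, 0:size]
    for _ in range(n_disks):
        cx, cy = rng.uniform(0, size, 2)
        r = rng.uniform(r_min, r_max)
        mat = rng.integers(0, n_endmembers)
        label[(xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2] = mat  # later disks occlude earlier ones
    return np.stack([(label == k).astype(np.float32) for k in range(n_endmembers)])

# hr_abund = dead_leaves_abundances()
# lr_abund = degrade_with_psf(hr_abund)   # hypothetical blur + downsample using the known PSF
```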
[674] AI-Based Detection of Temporal Changes in MR-Linac Images Acquired During Routine Prostate Radiotherapy
Seungbin Park, Peilin Wang, Ryan Pennell, Emily S. Weg, Himanshu Nagar, Timothy McClure, Mert R. Sabuncu, Daniel Margolis, Heejong Kim
Main category: eess.IV
TL;DR: Error: Processing failed
Details
Motivation: Error: Processing failed
Method: Error: Processing failed
Result: Error: Processing failed
Conclusion: Error: Processing failed
Abstract: Purpose: To investigate whether an AI-based method can detect subtle inter-fraction changes in MR-Linac images acquired during radiotherapy and explore the broader potential of MR-Linac imaging. Methods: This retrospective study included longitudinal 0.35T MR-Linac images from 761 patients. To identify temporal changes, we employed a deep learning model using temporal ordering via pairwise comparison, previously shown effective for longitudinal imaging studies. The model was trained using first-to-last fraction pairs (F1-FL) and all pairs (All-pairs). Performance was assessed using quantitative metrics (accuracy and AUC) and compared against a radiologist’s performance. Qualitative evaluation was performed using saliency maps, which identify anatomical regions associated with temporal imaging changes. Results: The F1-FL model demonstrated high performance (AUC=0.99, accuracy=0.95) and outperformed the radiologist in the temporal ordering task. The All-pairs model also showed high performance (AUC=0.97, accuracy=0.91). Regions contributing to predictions included the prostate, bladder, and pubic symphysis. Performance was correlated with fraction intervals and was reduced for non-radiation-exposed timepoints (Sim and F1), suggesting that observed changes may reflect both temporal variation and radiation exposure. Conclusion: MR-Linac imaging appears capable of capturing subtle changes during prostate radiotherapy that can be detected by AI models, even over approximately two-day intervals. The model’s high performance, together with quantitative and qualitative analyses, supports a potential role for MR-Linac in clinical applications beyond image guidance.