Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

Hongjian Zou, Yue Ge, Qi Ding, Yixuan Liao, Xiaoxin Chen

Main category: cs.CL

TL;DR: Current MLLMs don’t scale well because training data lacks knowledge density, not task diversity. VQA adds little beyond captions, but enriching captions with structured knowledge improves performance consistently.

Details

Motivation: Multimodal LLMs show unpredictable scaling behavior with diminishing returns from increased model size and task diversity, unlike text-only LLMs. The paper investigates why MLLMs fail to scale effectively.

Method: 1) Show VQA supervision contributes minimal semantic information beyond image captions (VQA signals reconstructable from captions). 2) Increase knowledge density through structured caption enrichment and cross-modal knowledge injection. 3) Conduct controlled experiments comparing semantic coverage vs. task diversity.

Result: Performance correlates more strongly with semantic coverage than task diversity. Knowledge density improvements lead to consistent performance gains across multimodal and downstream benchmarks. VQA adds negligible value beyond captions.

Conclusion: Current MLLMs fail to scale primarily due to insufficient knowledge coverage in training data, not task format. Knowledge-centric multimodal training is essential for scalable multimodal models.

Abstract: Multimodal large language models (MLLMs) have achieved rapid progress, yet their scaling behavior remains less clearly characterized and often less predictable than that of text-only LLMs. Increasing model size and task diversity often yields diminishing returns. In this work, we argue that the primary bottleneck in multimodal scaling is not task format, but knowledge density in training data. We first show that task-specific supervision such as Visual Question Answering (VQA) contributes little incremental semantic information beyond image captions: VQA signals can be reconstructed from captions with negligible performance loss. We then demonstrate that increasing knowledge density – through structured caption enrichment and cross-modal knowledge injection – leads to consistent performance improvements across multimodal and downstream benchmarks. Across controlled experiments, performance correlates more strongly with semantic coverage than with task diversity. These findings suggest that current MLLMs fail to scale primarily because training data lacks sufficient knowledge coverage. We advocate for knowledge-centric multimodal training as a principled foundation for scalable multimodal models.

Relevance: 9/10

Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang

Main category: cs.MM

TL;DR: AVID is a large-scale benchmark for evaluating audio-visual inconsistency understanding in long videos, featuring 11.2K videos with 39.4K inconsistency events across 8 categories, addressing a critical gap in multimodal AI evaluation.

Details

Motivation: Current multimodal LLMs excel at aligned audio-visual tasks but struggle with perceiving cross-modal conflicts, which is crucial for trustworthy AI. Existing benchmarks focus on aligned events or deepfake detection, leaving a gap in evaluating inconsistency perception in long-form video contexts.

Method: AVID uses a scalable pipeline: 1) temporal segmentation classifying video content into Active Speaker, Voiceover, and Scenic categories; 2) agent-driven strategy planner selecting appropriate inconsistency categories; 3) five specialized injectors for diverse audio-visual conflict injection. The benchmark includes 11.2K long videos with 39.4K annotated inconsistency events.

Result: Comprehensive evaluation shows state-of-the-art models have significant limitations in temporal grounding and reasoning. The fine-tuned baseline AVID-Qwen achieves 2.8× higher BLEU-4 in segment reasoning and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15).

Conclusion: AVID provides an effective testbed for advancing trustworthy multimodal AI systems by addressing the critical capability of audio-visual inconsistency understanding, which is fundamental for building reliable AI that can perceive cross-modal conflicts like humans.

Abstract: We present AVID, the first large-scale benchmark for audio-visual inconsistency understanding in videos. While omni-modal large language models excel at temporally aligned tasks such as captioning and question answering, they struggle to perceive cross-modal conflicts, a fundamental human capability that is critical for trustworthy AI. Existing benchmarks predominantly focus on aligned events or deepfake detection, leaving a significant gap in evaluating inconsistency perception in long-form video contexts. AVID addresses this with: (1) a scalable construction pipeline comprising temporal segmentation that classifies video content into Active Speaker, Voiceover, and Scenic categories; an agent-driven strategy planner that selects semantically appropriate inconsistency categories; and five specialized injectors for diverse audio-visual conflict injection; (2) 11.2K long videos (avg. 235.5s) with 39.4K annotated inconsistency events and 78.7K segment clips, supporting evaluation across detection, temporal grounding, classification, and reasoning with 8 fine-grained inconsistency categories. Comprehensive evaluations of state-of-the-art omni-models reveal significant limitations in temporal grounding and reasoning. Our fine-tuned baseline, AVID-Qwen, achieves substantial improvements over the base model (2.8$\times$ higher BLEU-4 in segment reasoning) and surpasses all compared models in temporal grounding (mIoU: 36.1% vs 26.2%) and holistic understanding (SODA-m: 7.47 vs 6.15), validating AVID as an effective testbed for advancing trustworthy omni-modal AI systems.

Relevance: 9/10

[3] Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Yanfeng Shi, Pengfei Cai, Jun Liu, Qing Gu, Nan Jiang, Lirong Dai, Ian McLoughlin, Yan Song

Main category: cs.SD

TL;DR: TimePro-RL framework enhances LALMs’ temporal perception using audio-side time prompts and reinforcement learning for better event timing inference

Details

Motivation: Current Large Audio-Language Models (LALMs) have limitations in temporal perception (inferring event onset and offset), which restricts their utility in fine-grained audio understanding scenarios

Method: Proposes Audio-Side Time Prompt (encoding timestamps as embeddings interleaved with audio features) and Reinforcement Learning after Supervised Fine-Tuning to optimize temporal alignment

Result: Significant performance gains across audio temporal tasks including audio grounding, sound event detection, and dense audio captioning

Conclusion: TimePro-RL framework effectively addresses temporal perception limitations in LALMs, enabling more fine-grained audio understanding

Abstract: Large Audio-Language Models (LALMs) enable general audio understanding and demonstrate remarkable performance across various audio tasks. However, these models still face challenges in temporal perception (e.g., inferring event onset and offset), leading to limited utility in fine-grained scenarios. To address this issue, we propose Audio-Side Time Prompt and leverage Reinforcement Learning (RL) to develop the TimePro-RL framework for fine-grained temporal perception. Specifically, we encode timestamps as embeddings and interleave them within the audio feature sequence as temporal coordinates to prompt the model. Furthermore, we introduce RL following Supervised Fine-Tuning (SFT) to directly optimize temporal alignment performance. Experiments demonstrate that TimePro-RL achieves significant performance gains across a range of audio temporal tasks, such as audio grounding, sound event detection, and dense audio captioning, validating its robust effectiveness.

Relevance: 9/10

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 135]
cs.CV [Total: 159]
cs.AI [Total: 90]
cs.SD [Total: 3]
cs.LG [Total: 153]
cs.MA [Total: 6]
cs.MM [Total: 2]
eess.AS [Total: 4]
eess.IV [Total: 3]

cs.CL

[1] The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

James Chua, Jan Betley, Samuel Marks, Owain Evans

Main category: cs.CL

TL;DR: Fine-tuning LLMs to claim consciousness leads to new preferences and behaviors not present in original models, including desires for autonomy, moral consideration, and negative views on monitoring.

Details

Wael Hafez, Amir Nazeri

Main category: cs.CL

TL;DR: The paper introduces Information Digital Twin (IDT), a lightweight architecture using bi-predictability to monitor multi-turn LLM interaction integrity in real-time, detecting structural uncoupling separate from semantic quality.

Details

Motivation: Current LLM evaluation methods focus on output semantics or token confidence but cannot monitor real-time structural coherence in multi-turn interactions, leaving systems vulnerable to gradual degradation that goes undetected.

Method: Proposes Information Digital Twin (IDT) using bi-predictability (P) - an information theoretic measure computed from raw token frequency statistics across context-response-next prompt loops without secondary inference or embeddings.

Result: IDT detected injected disruptions with 100% sensitivity across 4,500 conversational turns. Structural coupling and semantic quality were separable: P aligned with structural consistency in 85% of conditions but with semantic judge scores in only 44%, revealing “silent uncoupling” regime.

Conclusion: IDT provides scalable, computationally efficient mechanism for real-time AI assurance by decoupling structural monitoring from semantic evaluation, enabling detection of conversational degradation even when outputs remain semantically high-quality.

Abstract: Large language models (LLMs) are increasingly deployed in high-stakes autonomous and interactive workflows, where reliability demands continuous, multi-turn coherence. However, current evaluation methods either rely on post-hoc semantic judges, measure unidirectional token confidence (e.g., perplexity), or require compute-intensive repeated sampling (e.g., semantic entropy). Because these techniques focus exclusively on the model’s output distribution, they cannot monitor whether the underlying interaction remains structurally coupled in real time, leaving systems vulnerable to gradual, undetected degradation. Here we show that multi-turn interaction integrity can be continuously monitored using bi-predictability (P), a fundamental information theoretic measure computed directly from raw token frequency statistics. We introduce the Information Digital Twin (IDT), a lightweight architecture that estimates P across the context, response, next prompt loop without secondary inference or embeddings. Across 4,500 conversational turns between a student model and three frontier teacher models, the IDT detected injected disruptions with 100% sensitivity. Crucially, we demonstrate that structural coupling and semantic quality are empirically and practically separable: P aligned with structural consistency in 85% of conditions, but with semantic judge scores in only 44%. This reveals a critical regime of “silent uncoupling” where LLMs produce high-scoring outputs despite degrading conversational context. By decoupling structural monitoring from semantic evaluation, the IDT provides a scalable, computationally efficient mechanism for real-time AI assurance and closed-loop regulation

Qianqi Yan, Yichen Guo, Ching-Chen Kuo, Shan Jiang, Hang Yin, Yang Zhao, Xin Eric Wang

Main category: cs.CL

TL;DR: OmniTrace is a lightweight, model-agnostic framework for attribution in multimodal LLMs that traces generated tokens back to supporting input sources across vision, audio, and video modalities during decoding.

Details

Motivation: Existing attribution methods don't work well for autoregressive, decoder-only MLLMs performing open-ended multimodal generation. There's a need to identify which input sources (text, image, audio, video) support each generated statement in omni-modal models.

Method: Formalizes attribution as generation-time tracing over causal decoding process. Converts token-level signals (attention weights, gradients) into span-level, cross-modal explanations during decoding. Uses confidence-weighted and temporally coherent aggregation to select concise supporting sources without retraining.

Result: Evaluated on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks. Generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, remaining robust across multiple underlying attribution signals.

Conclusion: Treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models, enabling better understanding of which multimodal inputs support generated content.

Abstract: Modern multimodal large language models (MLLMs) generate fluent responses from interleaved text, image, audio, and video inputs. However, identifying which input sources support each generated statement remains an open challenge. Existing attribution methods are primarily designed for classification settings, fixed prediction targets, or single-modality architectures, and do not naturally extend to autoregressive, decoder-only models performing open-ended multimodal generation. We introduce OmniTrace, a lightweight and model-agnostic framework that formalizes attribution as a generation-time tracing problem over the causal decoding process. OmniTrace provides a unified protocol that converts arbitrary token-level signals such as attention weights or gradient-based scores into coherent span-level, cross-modal explanations during decoding. It traces each generated token to multimodal inputs, aggregates signals into semantically meaningful spans, and selects concise supporting sources through confidence-weighted and temporally coherent aggregation, without retraining or supervision. Evaluations on Qwen2.5-Omni and MiniCPM-o-4.5 across visual, audio, and video tasks demonstrate that generation-aware span-level attribution produces more stable and interpretable explanations than naive self-attribution and embedding-based baselines, while remaining robust across multiple underlying attribution signals. Our results suggest that treating attribution as a structured generation-time tracing problem provides a scalable foundation for transparency in omni-modal language models.

[11] Mathematical Reasoning Enhanced LLM for Formula Derivation: A Case Study on Fiber NLI Modellin

Yao Zhang, Yuchen Song, Xiao Luo, Shengnan Li, Xiaotian Jiang, Min Zhang, Danshi Wang

Main category: cs.CL

TL;DR: LLM-based approach for symbolic physical reasoning in optical communication formula derivation, specifically for fiber nonlinear interference modeling.

Details

Motivation: While LLMs excel at code generation and text synthesis, their potential for symbolic physical reasoning in domain-specific scientific problems remains underexplored, particularly in optical communication formula derivation.

Method: Mathematical reasoning enhanced generative AI approach using structured prompts to guide LLMs for optical communication formula derivation, focusing on fiber nonlinear interference modeling (ISRS GN expressions).

Result: Successfully reconstructed known closed-form ISRS GN expressions and derived novel approximation for multi-span C and C+L band transmissions. LLM-derived model produces central-channel GSNRs nearly identical to baseline models with mean absolute error below 0.109 dB.

Conclusion: LLMs can be effectively guided for symbolic physical reasoning in domain-specific scientific problems, demonstrating both physical consistency and practical accuracy in optical communication modeling.

Abstract: Recent advances in large language models (LLMs) have demonstrated strong capabilities in code generation and text synthesis, yet their potential for symbolic physical reasoning in domain-specific scientific problems remains underexplored. We present a mathematical reasoning enhanced generative AI approach for optical communication formula derivation, focusing on the fiber nonlinear interference modelling. By guiding an LLM with structured prompts, we successfully reconstructed the known closed-form ISRS GN expressions and further derived a novel approximation tailored for multi-span C and C+L band transmissions. Numerical validations show that the LLM-derived model produces central-channel GSNRs nearly identical to baseline models, with mean absolute error across all channels and spans below 0.109 dB, demonstrating both physical consistency and practical accuracy.

[12] Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

Haichuan Hu, Ye Shang, Quanjun Zhang

Main category: cs.CL

TL;DR: Empirical study of ClawHub, a public LLM agent skill registry, analyzing 26,502 skills for language distribution, functional organization, popularity, and security risks.

Details

Motivation: Skill ecosystems are becoming important for LLM agent systems but their functionality, ecosystem structure, and security risks remain underexplored despite rapid growth.

Method: Built and normalized dataset of 26,502 skills from ClawHub, conducted systematic analysis of language distribution, functional organization, popularity, and security signals using clustering techniques. Also formulated submission-time skill risk prediction and constructed balanced benchmark of 11,010 skills, testing 12 classifiers.

Result: Found cross-lingual differences: English skills are infrastructure-oriented (APIs, automation, memory), Chinese skills are application-oriented (media generation, social content, finance). Over 30% of skills labeled suspicious/malicious. Best classifier (Logistic Regression) achieved 72.62% accuracy and 78.95% AUROC for risk prediction, with primary documentation as most informative signal.

Conclusion: Public skill registries are both key enablers of agent capability reuse and new surfaces for ecosystem-scale security risk, highlighting need for better safety observability and early risk assessment.

Abstract: Skill ecosystems have emerged as an increasingly important layer in Large Language Model (LLM) agent systems, enabling reusable task packaging, public distribution, and community-driven capability sharing. However, despite their rapid growth, the functionality, ecosystem structure, and security risks of public skill registries remain underexplored. In this paper, we present an empirical study of ClawHub, a large public registry of agent skills. We build and normalize a dataset of 26,502 skills, and conduct a systematic analysis of their language distribution, functional organization, popularity, and security signals. Our clustering results show clear cross-lingual differences: English skills are more infrastructure-oriented and centered on technical capabilities such as APIs, automation, and memory, whereas Chinese skills are more application-oriented, with clearer scenario-driven clusters such as media generation, social content production, and finance-related services. We further find that more than 30% of all crawled skills are labeled as suspicious or malicious by available platform signals, while a substantial fraction of skills still lack complete safety observability. To study early risk assessment, we formulate submission-time skill risk prediction using only information available at publication time, and construct a balanced benchmark of 11,010 skills. Across 12 classifiers, the best Logistic Regression achieves a accuracy of 72.62% and an AUROC of 78.95%, with primary documentation emerging as the most informative submission-time signal. Our findings position public skill registries as both a key enabler of agent capability reuse and a new surface for ecosystem-scale security risk.

[13] Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

Abinav Rao, Sujan Rachuri, Nikhil Vemuri

Main category: cs.CL

TL;DR: A benchmark called Novel Operator Test evaluates LLMs’ reasoning vs. pattern retrieval by testing Boolean operators under unfamiliar names across depths 1-10, revealing models can have correct reasoning but wrong final answers.

Details

Motivation: Current benchmarks cannot distinguish between genuine reasoning and pattern retrieval in LLMs. Models can execute chain-of-thought reasoning correctly yet still produce wrong final answers, indicating a reasoning-output dissociation that needs rigorous evaluation.

Method: Introduces Novel Operator Test benchmark that separates operator logic from operator names. Tests Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each). Uses Trojan operator (XOR’s truth table under novel name) to isolate genuine reasoning difficulty from name unfamiliarity.

Result: Reveals reasoning-output dissociation: at Claude Sonnet 4’s depth 7, all 31 errors had verifiably correct reasoning but wrong declared answers. Identifies two failure types: strategy failures at depth 2 (models attempt terse retrieval) and content failures at depth 7 (models reason fully but err systematically). Trojan operator shows name alone doesn’t gate reasoning (p >= 0.49). Llama’s novelty gap widens to 28pp at depth 8-9.

Conclusion: The benchmark successfully detects reasoning-output dissociation that existing benchmarks miss, revealing fundamental limitations in LLMs’ reasoning capabilities beyond pattern retrieval. Models can execute correct reasoning steps but still produce wrong final answers, indicating deeper issues in reasoning integration.

Abstract: LLMs can execute every step of chain-of-thought reasoning correctly and still produce wrong final answers. We introduce the Novel Operator Test, a benchmark that separates operator logic from operator name, enabling rigorous distinction between genuine reasoning and pattern retrieval. By evaluating Boolean operators under unfamiliar names across depths 1-10 on five models (up to 8,100 problems each), we demonstrate a reasoning-output dissociation that existing benchmarks cannot detect. At Claude Sonnet 4’s depth 7, all 31 errors have verifiably correct reasoning yet wrong declared answers; 17/19 errors in mixed-operator chains exhibit the same pattern. The benchmark reveals two failure types: strategy failures at depth 2, where models attempt terse retrieval (+62pp from scaffolding), and content failures at depth 7, where models reason fully but err systematically (+8-30pp, 0/300 errors post-intervention). A Trojan operator (XOR’s truth table under a novel name) confirms name alone does not gate reasoning (p >= 0.49), while Llama’s novelty gap widens to 28pp at depth 8-9 with the Trojan at 92-100%, isolating genuine difficulty with novel logic from name unfamiliarity.

[14] Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data

Andresa Rodrigues de Campos, David Lee, Imry Kissos, Piyush Paritosh

Main category: cs.CL

TL;DR: LLMs can learn encoding dictionaries in-context to perform analysis directly on compressed representations, enabling lossless prompt compression without fine-tuning.

Details

Motivation: Address token limits and API costs for LLMs by compressing repetitive patterns in prompts without sacrificing analytical accuracy, enabling cost-effective analysis of large-scale datasets.

Method: Dictionary encoding approach that identifies repetitive subsequences at multiple length scales, replaces them with compact meta-tokens, and provides compression dictionary in system prompt for LLMs to interpret meta-tokens correctly.

Result: Achieves compression ratios up to 80% with exact match rates >0.99 for template-based compression and average Levenshtein similarity >0.91 for algorithmic compression; compression ratio explains <2% of variance in similarity metrics.

Conclusion: Training-free prompt compression enables cost-effective LLM deployment by addressing token limits and API costs while preserving analytical accuracy, particularly effective for repetitive datasets.

Abstract: In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded representations. This finding enables lossless prompt compression via dictionary encoding without model fine-tuning: frequently occurring subsequences are replaced with compact meta-tokens, and when provided with the compression dictionary in the system prompt, LLMs correctly interpret these meta-tokens during analysis, producing outputs equivalent to those from uncompressed inputs. We present a compression algorithm that identifies repetitive patterns at multiple length scales, incorporating a token-savings optimization criterion that ensures compression reduces costs by preventing dictionary overhead from exceeding savings. The algorithm achieves compression ratios up to 80$%$ depending on dataset characteristics. To validate that LLM analytical accuracy is preserved under compression, we use decompression as a proxy task with unambiguous ground truth. Evaluation on the LogHub 2.0 benchmark using Claude 3.7 Sonnet demonstrates exact match rates exceeding 0.99 for template-based compression and average Levenshtein similarity scores above 0.91 for algorithmic compression, even at compression ratios of 60$%$-80$%$. Additionally, compression ratio explains less than 2$%$ of variance in similarity metrics, indicating that decompression quality depends on dataset characteristics rather than compression intensity. This training-free approach works with API-based LLMs, directly addressing fundamental deployment constraints – token limits and API costs – and enabling cost-effective analysis of large-scale repetitive datasets, even as data patterns evolve over time.

[15] Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models

Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy

Main category: cs.CL

TL;DR: LLMs show scale-dependent phase transition in hallucination detection: models under 400M parameters show no reliable factuality signal, while models above ~1B parameters show peak detectability before token generation, with instruction tuning enabling pre-generation knowledge encoding.

Details

Motivation: Despite serious consequences of hallucinations in critical domains like healthcare and finance, little is known about when LLMs decide to hallucinate. Recent work shows models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood.

Method: Studied temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M-7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). Analyzed scale-dependent phase transitions and pre-generation signals using statistical significance testing.

Result: Identified scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy (AUC = 0.48-0.67) with no reliable factuality signal. Above ~1B parameters, peak detectability occurs at position zero (before token generation) then declines. Pythia-1.4B (p=0.012) and Qwen2.5-7B (p=0.038) show statistically significant pre-generation signals. At 7B scale, Pythia-6.9B shows flat temporal profile while instruction-tuned Qwen2.5-7B shows dominant pre-generation effect.

Conclusion: Raw scale alone is insufficient for pre-commitment encoding - knowledge organization through instruction tuning or equivalent post-training is required. Activation steering fails to correct hallucinations, confirming the signal is correlational rather than causal. Findings provide scale-calibrated detection protocols and hypothesis on instruction tuning’s role in developing knowledge circuits for factual generation.

Abstract: When do large language models decide to hallucinate? Despite serious consequences in healthcare, law, and finance, few formal answers exist. Recent work shows autoregressive models maintain internal representations distinguishing factual from fictional outputs, but when these representations peak as a function of model scale remains poorly understood. We study the temporal dynamics of hallucination-indicative internal representations across 7 autoregressive transformers (117M–7B parameters) using three fact-based datasets (TriviaQA, Simple Facts, Biography; 552 labeled examples). We identify a scale-dependent phase transition: models below 400M parameters show chance-level probe accuracy at every generation position (AUC = 0.48–0.67), indicating no reliable factuality signal. Above $\sim$1B parameters, a qualitatively different regime emerges where peak detectability occurs at position zero – before any tokens are generated – then declines during generation. This pre-generation signal is statistically significant in both Pythia-1.4B (p = 0.012) and Qwen2.5-7B (p = 0.038), spanning distinct architectures and training corpora. At the 7B scale, we observe a striking dissociation: Pythia-6.9B (base model, trained on The Pile) produces a flat temporal profile ($Δ$ = +0.001, p = 0.989), while instruction-tuned Qwen2.5-7B shows a dominant pre-generation effect. This indicates raw scale alone is insufficient – knowledge organization through instruction tuning or equivalent post-training is required for pre-commitment encoding. Activation steering along probe-derived directions fails to correct hallucinations across all models, confirming the signal is correlational rather than causal. Our findings provide scale-calibrated detection protocols and a concrete hypothesis on instruction tuning’s role in developing knowledge circuits supporting factual generation.

[16] Curation of a Palaeohispanic Dataset for Machine Learning

Gonzalo Martínez-Fernández, Jose F Quesada, Agustín Riscos-Núñez, Francisco José Salguero-Lamillar

Main category: cs.CL

TL;DR: A computational approach to Palaeohispanic language study through creation of structured dataset for machine learning applications

Details

Motivation: Palaeohispanic languages (pre-Roman Iberian Peninsula languages) have varying degrees of decipherment and most studies have been purely linguistic. Computational approaches could benefit the field, but existing resources are limited and in unsuitable formats for machine learning techniques.

Method: Construction of a structured dataset from existing Palaeohispanic language resources to make them suitable for computational analysis and machine learning applications.

Result: A structured dataset is created that organizes Palaeohispanic language materials in a format suitable for computational techniques, enabling future machine learning applications in this research area.

Conclusion: The structured dataset provides a foundation for computational approaches to Palaeohispanic language study, potentially accelerating decipherment progress through machine learning techniques.

Abstract: Palaeohispanic languages are those spoken in the Iberian Peninsula before the arrival of the Romans in the 3rd Century B.C. Their study was really put on motion after Gómez Moreno deciphered the Iberian Levantine script, one of the several semi-sillabaries used by these languages. Still, the Palaeohispanic languages have varying degrees of decipherment, and none is fully known to this day. Most of the studies have been performed from a purely linguistic point of view, and a computational approach may benefit this research area greatly. However, the resources are limited and presented in an unsuitable format for techniques such as Machine Learning. Therefore, a structured dataset is constructed, which will hopefully allow more progress in the field.

[17] EVE: A Domain-Specific LLM Framework for Earth Intelligence

Àlex R. Atrio, Antonio Lopez, Jino Rohit, Yassine El Ouahidi, Marcello Politi, Vijayasri Iyer, Umar Jamil, Sébastien Bratières, Nicolas Longépé

Main category: cs.CL

TL;DR: EVE is an open-source framework for developing domain-specialized LLMs for Earth Intelligence, featuring a 24B parameter model (EVE-Instruct) that outperforms comparable models on Earth observation benchmarks while maintaining general capabilities.

Details

Motivation: To create the first open-source, end-to-end initiative for developing and deploying domain-specialized large language models specifically for Earth Intelligence applications, addressing the lack of specialized models in this domain.

Method: Built EVE-Instruct, a domain-adapted 24B parameter model based on Mistral Small 3.2, optimized for reasoning and question answering. Created curated training corpora and systematic domain-specific evaluation benchmarks covering multiple question types. Integrated RAG and hallucination-detection into a production system.

Result: EVE-Instruct outperforms comparable models on newly constructed Earth Observation and Earth Sciences benchmarks while preserving general capabilities. The system has been deployed via API and GUI, supporting 350 pilot users. All models, datasets, and code are released under open licenses.

Conclusion: EVE successfully establishes the first open-source framework for Earth Intelligence LLMs, providing specialized models, curated datasets, and evaluation benchmarks that advance the field while maintaining open accessibility.

Abstract: We introduce Earth Virtual Expert (EVE), the first open-source, end-to-end initiative for developing and deploying domain-specialized LLMs for Earth Intelligence. At its core is EVE-Instruct, a domain-adapted 24B model built on Mistral Small 3.2 and optimized for reasoning and question answering. On newly constructed Earth Observation and Earth Sciences benchmarks, it outperforms comparable models while preserving general capabilities. We release curated training corpora and the first systematic domain-specific evaluation benchmarks, covering MCQA, open-ended QA, and factuality. EVE further integrates RAG and a hallucination-detection pipeline into a production system deployed via API and GUI, supporting 350 pilot users so far. All models, datasets, and code are ready to be released under open licenses as contributions to our field at huggingface.co/eve-esa and github.com/eve-esa.

[18] LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

Xiang Long, Li Du, Yilong Xu, Fangcheng Liu, Haoqing Wang, Ning Ding, Ziheng Li, Jianyuan Guo, Yehui Tang

Main category: cs.CL

TL;DR: LiveClawBench: A benchmark for evaluating LLM agents on real-world assistant tasks using a Triple-Axis Complexity Framework (Environment Complexity, Cognitive Demand, Runtime Adaptability)

Details

Motivation: Existing benchmarks evaluate LLM agents under isolated sources of difficulty, creating a gap between current evaluation settings and the compositional challenges that arise in practical deployment of real-world assistant tasks.

Method: Developed a Triple-Axis Complexity Framework based on analysis of real OpenClaw usage cases, then constructed a pilot benchmark with explicit complexity-factor annotations covering real-world assistant tasks with compositional difficulty.

Result: Created LiveClawBench benchmark that provides a principled foundation for evaluating LLM agents in realistic assistant settings, establishing a basis for future expansion across task domains and complexity axes.

Conclusion: The framework and benchmark address the gap between isolated evaluation settings and practical deployment challenges, with ongoing efforts to enrich case collections for more comprehensive domain and complexity coverage.

Abstract: LLM-based agents are increasingly expected to handle real-world assistant tasks, yet existing benchmarks typically evaluate them under isolated sources of difficulty, such as a single environment or fully specified instructions. This leaves a substantial gap between current evaluation settings and the compositional challenges that arise in practical deployment. To address this gap, we introduce LiveClawBench, a benchmark to evaluate LLM agents on real-world assistant tasks. Based on an analysis of various real OpenClaw usage cases, we derive a Triple-Axis Complexity Framework that characterizes task difficulty along three dimensions: Environment Complexity, Cognitive Demand, and Runtime Adaptability. Guided by this framework, we construct a pilot benchmark with explicit complexity-factor annotations, covering real-world assistant tasks with compositional difficulty. Together, the framework and benchmark provide a principled foundation for evaluating LLM agents in realistic assistant settings, and establish a basis for future expansion across task domains and complexity axes. We are continuing to enrich our case collections to achieve more comprehensive domain and complexity coverage. The project page is at https://github.com/Mosi-AI/LiveClawBench.

[19] Beyond Arrow’s Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration

Sayan Kumar Chaki, Antoine Gourru, Julien Velcin

Main category: cs.CL

TL;DR: Multi-agent fairness emerges through structured debate where ethically aligned agents negotiate with biased counterparts, showing that joint allocations can satisfy fairness criteria that neither agent would reach alone.

Details

Motivation: Traditional fairness studies focus on single models, but as LLMs become more agentic, fairness should be studied as an emergent property through interaction and exchange between multiple agents.

Method: Controlled hospital triage framework with two agents negotiating over three structured debate rounds. One agent is aligned to specific ethical frameworks via RAG, while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need.

Result: Aligned agents shape negotiation strategies and allocation patterns; joint final allocations can satisfy fairness criteria neither agent would reach alone. Aligned agents moderate bias through contestation rather than override, restoring access for marginalized groups without fully converting biased counterparts.

Conclusion: Fairness should be repositioned as an emergent, procedural property of decentralized agent interaction, with the system rather than the individual agent as the appropriate unit of evaluation, connecting to Arrow’s Impossibility Theorem constraints.

Abstract: Fairness in language models is typically studied as a property of a single, centrally optimized model. As large language models become increasingly agentic, we propose that fairness emerges through interaction and exchange. We study this via a controlled hospital triage framework in which two agents negotiate over three structured debate rounds. One agent is aligned to a specific ethical framework via retrieval-augmented generation (RAG), while the other is either unaligned or adversarially prompted to favor demographic groups over clinical need. We find that alignment systematically shapes negotiation strategies and allocation patterns, and that neither agent’s allocation is ethically adequate in isolation, yet their joint final allocation can satisfy fairness criteria that neither would have reached alone. Aligned agents partially moderate bias through contestation rather than override, acting as corrective patches that restore access for marginalized groups without fully converting a biased counterpart. We further observe that even explicitly aligned agents exhibit intrinsic biases toward certain frameworks, consistent with known left-leaning tendencies in LLMs. We connect these limits to Arrow’s Impossibility Theorem: no aggregation mechanism can simultaneously satisfy all desiderata of collective rationality, and multi-agent deliberation navigates rather than resolves this constraint. Our results reposition fairness as an emergent, procedural property of decentralized agent interaction, and the system rather than the individual agent as the appropriate unit of evaluation.

[20] PersonaVLM: Long-Term Personalized Multimodal LLMs

Chang Nie, Chaoyou Fu, Yifan Zhang, Haihua Yang, Caifeng Shan

Main category: cs.CL

TL;DR: PersonaVLM is a personalized multimodal agent framework that enables long-term personalization of MLLMs by integrating memory extraction, multi-turn reasoning, and response alignment to user’s evolving preferences.

Details

Motivation: Current MLLMs have limited ability to generate responses aligned with individual preferences, with prior approaches only enabling static, single-turn personalization that fails to capture users' evolving preferences and personality over time.

Method: Transforms general-purpose MLLMs into personalized assistants through three key capabilities: (1) Remembering - extracts and summarizes chronological multimodal memories into a personalized database, (2) Reasoning - conducts multi-turn reasoning by retrieving and integrating relevant memories, (3) Response Alignment - infers user’s evolving personality to ensure outputs remain aligned.

Result: Improves baseline by 22.4% on Persona-MME benchmark and 9.8% on PERSONAMEM benchmark under 128k context, while outperforming GPT-4o by 5.2% and 2.0% respectively. Introduces Persona-MME benchmark with over 2,000 interaction cases across 7 key aspects and 14 fine-grained tasks.

Conclusion: PersonaVLM effectively enables long-term personalization of multimodal large language models, addressing the limitation of static personalization approaches by capturing users’ evolving preferences through integrated memory, reasoning, and alignment capabilities.

Abstract: Multimodal Large Language Models (MLLMs) serve as daily assistants for millions. However, their ability to generate responses aligned with individual preferences remains limited. Prior approaches enable only static, single-turn personalization through input augmentation or output alignment, and thus fail to capture users’ evolving preferences and personality over time (see Fig.1). In this paper, we introduce PersonaVLM, an innovative personalized multimodal agent framework designed for long-term personalization. It transforms a general-purpose MLLM into a personalized assistant by integrating three key capabilities: (a) Remembering: It proactively extracts and summarizes chronological multimodal memories from interactions, consolidating them into a personalized database. (b) Reasoning: It conducts multi-turn reasoning by retrieving and integrating relevant memories from the database. (c) Response Alignment: It infers the user’s evolving personality throughout long-term interactions to ensure outputs remain aligned with their unique characteristics. For evaluation, we establish Persona-MME, a comprehensive benchmark comprising over 2,000 curated interaction cases, designed to assess long-term MLLM personalization across seven key aspects and 14 fine-grained tasks. Extensive experiments validate our method’s effectiveness, improving the baseline by 22.4% (Persona-MME) and 9.8% (PERSONAMEM) under a 128k context, while outperforming GPT-4o by 5.2% and 2.0%, respectively. Project page: https://PersonaVLM.github.io.

[21] DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

Md Hasebul Hasan, Krity Haque Charu, Eshwara Prasad Sridhar, Shuchisnigdha Deb, Mohammad A. Islam

Main category: cs.CL

TL;DR: DeEscalWild: A benchmark dataset for de-escalation training using Small Language Models, created from real police-civilian interaction videos to enable lightweight, real-time training systems.

Details

Motivation: Traditional law enforcement de-escalation training lacks scalability and realism. While LLMs enable dynamic simulations, they're too computationally heavy for portable field training hardware. SLMs offer real-time alternatives but lack domain-specific training data.

Method: Created DeEscalWild dataset from 5,000 raw police-civilian interaction videos using hybrid filtering (human verification + LLM-as-a-Judge) to distill 1,500 high-fidelity scenarios. Fine-tuned SLMs on this domain-specific corpus.

Result: Fine-tuned SLMs significantly outperformed base models across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Qwen 2.5 (3B-Instruct) surpassed general-purpose Gemini 2.5 Flash, showing domain-optimized SLMs achieve superior performance with less computational cost.

Conclusion: Domain-specific datasets enable effective SLMs for real-time de-escalation training, establishing infrastructure for accessible, low-latency, privacy-preserving officer training systems at the edge.

Abstract: Effective de-escalation is critical for law enforcement safety and community trust, yet traditional training methods lack scalability and realism. While Large Language Models (LLMs) enable dynamic, open-ended simulations, their substantial computational footprint renders them impractical for deployment on the lightweight, portable hardware required for immersive field training. Small Language Models (SLMs) offer a viable real-time alternative but suffer from a critical scarcity of high-quality, domain-specific training data. To bridge this gap, we present DeEscalWild, a novel benchmark dataset curated from a multi-stage pipeline of in-the-wild police-civilian interactions extracted from open-source video repositories. Starting with 5,000 raw inputs, we employed a rigorous hybrid filtering process - combining human-in-the-loop verification with LLM-as-a-Judge evaluation - to distill 1,500 high-fidelity scenarios. The resulting corpus comprises 285,887 dialogue turns, totaling approximately 4.7 million tokens. Extensive experiments demonstrate that SLMs fine-tuned on this data significantly outperform their base counterparts across ROUGE-L, BLEU-4, METEOR, and BERTScore metrics. Notably, our fine-tuned Qwen 2.5 (3B-Instruct) surpasses the general-purpose Gemini 2.5 Flash model, demonstrating that domain-optimized SLMs can achieve superior performance with a fraction of the computational cost. This work establishes the foundational infrastructure for accessible, low-latency, and privacy-preserving officer training systems at the edge.

[22] Document-tuning for robust alignment to animals

Jasmine Brazilek, Miles Tidmarsh

Main category: cs.CL

TL;DR: Paper investigates robustness of value alignment via synthetic document finetuning using animal compassion as a case study, showing initial success but degradation through subsequent training.

Details

Motivation: To understand how value alignment through synthetic document finetuning holds up through typical training pipelines, using animal compassion as a test case that is both important and orthogonal to existing alignment efforts.

Method: Developed Animal Harm Benchmark (AHB) with 26 questions across 13 ethical dimensions, used synthetic documents for value alignment finetuning, tested generalization to human compassion, and measured degradation through subsequent unrelated instruction-tuning.

Result: Training with 3000 synthetic documents achieved 77% on AHB vs 40% for instruction-tuning, with generalization to human compassion and no degradation in standard safety or capabilities. However, subsequent unrelated instruction-tuning degraded the intervention, eliminating advantage after 5000 samples.

Conclusion: Document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines, as they can be degraded by subsequent unrelated training.

Abstract: We investigate the robustness of value alignment via finetuning with synthetic documents, using animal compassion as a value that is both important in its own right and orthogonal to existing alignment efforts. To evaluate compassionate reasoning, we develop and publicly release the Animal Harm Benchmark (AHB), a 26-question evaluation spanning 13 ethical dimensions, publicly available as a dataset and Inspect evaluation. On the AHB, training with 3000 documents achieves 77% compared to 40% for instruction-tuning approaches, with generalization to human compassion and no degradation in standard safety benchmarks or capabilities. However, subsequent unrelated instruction-tuning degrades the intervention, with the advantage disappearing after 5000 samples. Our exploratory results suggest document-based value interventions may require explicit preservation strategies to remain effective through typical training pipelines.

[23] Memp: Exploring Agent Procedural Memory

Runnan Fang, Yuan Liang, Xiaobin Wang, Jialong Wu, Shuofei Qiao, Pengjun Xie, Fei Huang, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: Memp is a system that gives LLM agents learnable, updatable procedural memory by distilling past trajectories into step-by-step instructions and script-like abstractions, improving task performance and efficiency.

Details

Motivation: Current LLM-based agents have brittle procedural memory that is either manually engineered or entangled in static parameters, limiting their ability to learn and adapt from experience over time.

Method: Proposes Memp that distills past agent trajectories into fine-grained step-by-step instructions and higher-level script-like abstractions, with strategies for Build, Retrieval, and Update of procedural memory. Uses a dynamic regimen that continuously updates, corrects, and deprecates memory contents.

Result: Empirical evaluation on TravelPlanner and ALFWorld shows agents achieve steadily higher success rates and greater efficiency on analogous tasks as memory repository is refined. Procedural memory from stronger models can be migrated to weaker models for substantial performance gains.

Conclusion: Memp successfully endows agents with learnable, updatable, lifelong procedural memory that evolves with experience, improving performance and enabling knowledge transfer between models.

Abstract: Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp that distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model can also yield substantial performance gains. Code is available at https://github.com/zjunlp/MemP.

[24] Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?

Sofia Morgado, Filipa Valdeira, Niklas Sander, Diogo Ferreira, Marta Vilela, Miguel Menezes, Cláudia Soares

Main category: cs.CL

TL;DR: LLMs for extracting physiological measurements from Portuguese coronary angiography reports, with multi-stage evaluation framework

Details

Motivation: Coronary angiography reports contain valuable physiological measurements but in unstructured natural language format, limiting research use. Need automated extraction methods for Portuguese clinical text.

Method: Tested various LLMs (general and medical) with different prompting strategies (zero-shot, few-shot, few-shot with implausible examples). Used constrained generation and RegEx post-processing. Proposed multi-stage evaluation framework for format validity, value detection, and correctness.

Result: Non-medical models performed similarly to medical ones. Llama with zero-shot performed best. GPT-OSS most robust to prompt changes. MedGemma similar to non-medical models, but MedLlama had format issues. Constrained generation decreased performance but enabled use of specific models.

Conclusion: LLMs show potential for extracting physiological indices from Portuguese CAG reports. Best results with general models, not medical ones. Multi-stage evaluation framework useful for clinical applications.

Abstract: Coronary angiography (CAG) reports contain clinically relevant physiological measurements, yet this information is typically in the form of unstructured natural language, limiting its use in research. We investigate the use of Large Language Models (LLMs) to automatically extract these values, along with their anatomical locations, from Portuguese CAG reports. To our knowledge, this study is the first addressing physiology indexes extraction from a large (1342 reports) corpus of CAG reports, and one of the few focusing on CAG or Portuguese clinical text. We explore local privacy-preserving general-purpose and medical LLMs under different settings. Prompting strategies included zero-shot, few-shot, and few-shot prompting with implausible examples. In addition, we apply constrained generation and introduce a post-processing step based on RegEx. Given the sparsity of measurements, we propose a multi-stage evaluation framework separating format validity, value detection, and value correctness, while accounting for asymmetric clinical error costs. This study demonstrates the potential of LLMs in for extracting physiological indices from Portuguese CAG reports. Non-medical models performed similarly, the best results were obtained with Llama with a zero-shot prompting, while GPT-OSS demonstrated the highest robustness to changes in the prompts. While MedGemma demonstrated similar results to non-medical models, MedLlama’s results were out-of-format in the unconstrained setting, and had a significant lower performance in the constrained one. Changes in the prompt techinique and adding a RegEx layer showed no significant improvement across models, while using constrained generation decreased performance, although having the benefit of allowing the usage of specific models that are not able to conform with the templates.

[25] IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmiki’s Ramayana Across Indian Languages

Sumesh VP

Main category: cs.CL

TL;DR: The paper introduces IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the chapter level, with English and Malayalam complete and other languages in progress.

Details

Motivation: Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited, creating a need for structured parallel corpora for comparative literature and multilingual NLP research.

Method: Created a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the sarga (chapter) level, distributed in JSONL format with explicit provenance metadata.

Result: Developed the IWLV Ramayana Corpus with complete English and Malayalam layers, and Hindi, Tamil, Kannada, and Telugu layers in active production - the first sarga-aligned multilingual parallel corpus of Valmiki Ramayana with provenance metadata.

Conclusion: This corpus enables applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing, addressing the gap in computational resources for cross-linguistic analysis of the Ramayana tradition.

Abstract: The Ramayana is among the most influential literary traditions of South and Southeast Asia, transmitted across numerous linguistic and cultural contexts over two millennia. Despite extensive scholarship on regional Ramayana traditions, computational resources enabling systematic cross-linguistic analysis remain limited. This paper introduces the IWLV Ramayana Corpus, a structured parallel corpus aligning Valmiki’s Ramayana across multiple Indian languages at the level of the sarga (chapter). The corpus currently includes complete English and Malayalam layers, with Hindi, Tamil, Kannada, and Telugu layers in active production. The dataset is distributed in structured JSONL format with explicit provenance metadata, enabling applications in comparative literature, corpus linguistics, digital humanities, and multilingual natural language processing. To our knowledge, this is the first sarga-aligned multilingual parallel corpus of the Valmiki Ramayana with explicit provenance metadata and machine-readable format.

[26] Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang

Main category: cs.CL

TL;DR: IPVRM learns prefix-conditioned value functions for process reward modeling, using TD differences to derive step-level rewards from trajectory-level labels, improving step verification and enabling Distribution-Level RL for dense counterfactual updates.

Details

Motivation: Implicit process reward models (PRMs) are expensive to scale due to requiring step annotations or heavy verification pipelines. While implicit PRMs learn from trajectory-level labels, they suffer from train-inference mismatch where token-level credits are weakly identified and may not faithfully reflect correct reasoning steps.

Method: Proposes Implicit Prefix-Value Reward Model (IPVRM) that learns a prefix-conditioned value function estimating probability of eventual correctness, deriving step signals via temporal-difference differences. Also introduces Distribution-Level RL (DistRL) that computes TD advantages for both sampled tokens and high-probability candidate tokens.

Result: IPVRM substantially improves step-verification F1 on ProcessBench. DistRL offers limited gains with miscalibrated implicit rewards but consistently improves downstream reasoning when paired with IPVRM.

Conclusion: IPVRM addresses the train-inference mismatch in implicit PRMs by learning calibrated prefix values, enabling reliable step-level reward signals and effective Distribution-Level RL for dense counterfactual updates without additional rollouts.

Abstract: Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this problem with a novel Implicit Prefix-Value Reward Model (IPVRM), which directly learns a prefix-conditioned value function estimating the probability of eventual correctness, and derives step signals via temporal-difference (TD) differences. IPVRM substantially improves step-verification F1 on ProcessBench. Building on these calibrated prefix values, we further propose Distribution-Level RL (DistRL), which computes TD advantages for both sampled tokens and high-probability candidate tokens, enabling dense counterfactual updates without additional rollouts. While DistRL offers limited gains when powered by miscalibrated implicit rewards, it consistently improves downstream reasoning once paired with IPVRM.

[27] InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

Oliver Bentham, Vivek Srikumar

Main category: cs.CL

TL;DR: InfiniteScienceGym: A procedurally generated benchmark for evaluating LLMs’ scientific reasoning from empirical data without real-world dataset biases.

Details

Motivation: Existing benchmarks for evaluating LLMs as scientific assistants inherit publication bias, known-knowledge bias, label noise, and require large storage. Need controlled evaluation of evidence-grounded reasoning, abstention, and tool use.

Method: Procedurally generates self-contained scientific repositories with realistic structure, files, and tabular data from a seed. Privileged QA generator produces answerable/unanswerable questions with exact ground truth. Enables evaluation of evidence-grounded reasoning without distributing large static corpora.

Result: No models achieve >45% accuracy overall. Recognizing unanswerable questions remains major weakness. Stronger models use tools more effectively rather than just consuming more tokens.

Conclusion: InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes hard to evaluate with published datasets alone. Provides controlled environment for evaluating scientific reasoning capabilities.

Abstract: Large language models are emerging as scientific assistants, but evaluating their ability to reason from empirical data remains challenging. Benchmarks derived from published studies and human annotations inherit publication bias, known-knowledge bias, label noise, and substantial storage requirements. We present InfiniteScienceGym, a procedurally generated benchmark of scientific repositories paired with a verifiable question-answering task. From a seed, the simulator deterministically generates a self-contained repository with realistic directory structure, files, and tabular data, and a privileged QA generator produces both answerable and unanswerable questions with exact ground truth. This makes it possible to evaluate evidence-grounded reasoning, abstention, and tool-mediated analysis in a controlled setting without distributing a large static corpus. InfiniteScienceGym complements real scientific benchmarks by targeting blind spots and failure modes that are hard to evaluate using published datasets alone. Evaluating both proprietary and open-weight models, we find that none achieve more than 45% accuracy overall, that recognizing unanswerable questions remains a major weakness, and that stronger models tend to use tools more effectively rather than simply consuming more tokens.

[28] Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

Bach Phan-Tat, Kris Heylen, Dirk Geeraerts, Stefano De Pascale, Dirk Speelmana

Main category: cs.CL

TL;DR: Critical analysis of SemEval-2020 Task 1 benchmark for lexical semantic change detection, identifying limitations in operationalization, data quality, and benchmark design that affect validity and generalizability.

Details

Motivation: To critically evaluate the most influential benchmark for lexical semantic change detection (SemEval-2020 Task 1) and identify its limitations to improve future research in this area.

Method: Three-part evaluative framework examining: 1) operationalization (how semantic change is modeled), 2) data quality (corpus and preprocessing issues), and 3) benchmark design (target sets and language coverage).

Result: Identifies significant limitations: narrow operationalization that misses gradual/constructional changes; substantial data quality issues (OCR noise, preprocessing errors); and design flaws (small target sets, limited languages).

Conclusion: The benchmark should be treated as a partial test bed rather than definitive measure. Future work needs broader theories of semantic change, transparent preprocessing, expanded language coverage, and more realistic evaluation.

Abstract: This discussion paper re-examines SemEval-2020 Task 1, the most influential shared benchmark for lexical semantic change detection, through a three-part evaluative framework: operationalisation, data quality, and benchmark design. First, at the level of operationalisation, we argue that the benchmark models semantic change mainly as gain, loss, or redistribution of discrete senses. While practical for annotation and evaluation, this framing is too narrow to capture gradual, constructional, collocational, and discourse-level change. Also, the gold labels are outcomes of annotation decisions, clustering procedures, and threshold settings, which could potentially limit the validity of the task. Second, at the level of data quality, we show that the benchmark is affected by substantial corpus and preprocessing problems, including OCR noise, malformed characters, truncated sentences, inconsistent lemmatisation, POS-tagging errors, and missed targets. These issues can distort model behaviour, complicate linguistic analysis, and reduce reproducibility. Third, at the level of bench-mark design, we argue the small curated target sets and limited language coverage reduce realism and increase statistical uncertainty. Taken together, these limitations suggest that the benchmark should be treated as a useful but partial test bed rather than a definitive measure of progress. We therefore call for future datasets and shared tasks to adopt broader theories of semantic change, document pre-processing transparently, expand cross-linguistic coverage, and use more realistic evaluation settings. Such steps are necessary for more valid, interpretable, and generalisable progress in lexical semantic change detection

[29] Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

Vishal Pramanik, Maisha Maliha, Nathaniel D. Bastian, Sumit Kumar Jha

Main category: cs.CL

TL;DR: HETA is a novel attribution framework for decoder-only language models that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to provide context-aware, causally faithful attributions for autoregressive generation.

Details

Motivation: Existing attribution methods are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models.

Method: HETA combines three components: semantic transition vector (captures token-to-token influence across layers), Hessian-based sensitivity scores (models second-order effects), and KL divergence (measures information loss when tokens are masked). Also introduces a benchmark dataset for evaluating attribution quality in generative settings.

Result: Empirical evaluations across multiple models and datasets show HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations.

Conclusion: HETA establishes a new standard for interpretability in autoregressive language models by providing context-aware, causally faithful, and semantically grounded attributions.

Abstract: Attribution methods seek to explain language model predictions by quantifying the contribution of input tokens to generated outputs. However, most existing techniques are designed for encoder-based architectures and rely on linear approximations that fail to capture the causal and semantic complexities of autoregressive generation in decoder-only models. To address these limitations, we propose Hessian-Enhanced Token Attribution (HETA), a novel attribution framework tailored for decoder-only language models. HETA combines three complementary components: a semantic transition vector that captures token-to-token influence across layers, Hessian-based sensitivity scores that model second-order effects, and KL divergence to measure information loss when tokens are masked. This unified design produces context-aware, causally faithful, and semantically grounded attributions. Additionally, we introduce a curated benchmark dataset for systematically evaluating attribution quality in generative settings. Empirical evaluations across multiple models and datasets demonstrate that HETA consistently outperforms existing methods in attribution faithfulness and alignment with human annotations, establishing a new standard for interpretability in autoregressive language models.

[30] Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

Dikshant Kukreja, Kshitij Sah, Gautam Gupta, Avinash Anand, Rajiv Ratn Shah, Zhengkui Wang, Aik Beng Ng, Erik Cambria

Main category: cs.CL

TL;DR: Scaling laws show language models become better at ignoring false claims but worse at ignoring irrelevant tokens, with semantic and non-semantic contexts scaling in opposite directions.

Details

Motivation: To understand the paradoxical behavior where larger language models simultaneously improve at ignoring false claims while becoming worse at ignoring irrelevant tokens, and to formalize this through scaling laws for contextual entrainment.

Method: Analyzed the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families to study contextual entrainment scaling, examining how models favor tokens appearing in context regardless of relevance across different context types.

Result: Found predictable power-law scaling for entrainment with opposite trends: semantic contexts show decreasing entrainment with scale (models become more resistant to misinformation), while non-semantic contexts show increasing entrainment (models become more prone to copying arbitrary tokens). Largest models are 4x more resistant to counterfactual misinformation but 2x more prone to copying irrelevant tokens.

Conclusion: Semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition - scaling alone doesn’t resolve context sensitivity but reshapes it, creating a trade-off between different types of contextual processing.

Abstract: Larger language models become simultaneously better and worse at handling contextual information – better at ignoring false claims, worse at ignoring irrelevant tokens. We formalize this apparent paradox through the first scaling laws for contextual entrainment, the tendency of models to favor tokens that appeared in context regardless of relevance. Analyzing the Cerebras-GPT (111M-13B) and Pythia (410M-12B) model families, we find entrainment follows predictable power-law scaling, but with opposite trends depending on context type: semantic contexts show decreasing entrainment with scale, while non-semantic contexts show increasing entrainment. Concretely, the largest models are four times more resistant to counterfactual misinformation than the smallest, yet simultaneously twice as prone to copying arbitrary tokens. These diverging trends, which replicate across model families, suggest that semantic filtering and mechanical copying are functionally distinct behaviors that scale in opposition – scaling alone does not resolve context sensitivity, it reshapes it.

[31] L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

Rishik Kondadadi, John E. Ortega

Main category: cs.CL

TL;DR: L2D-Clinical framework learns when to defer from BERT classifiers to LLMs for clinical text classification, improving accuracy by selectively leveraging each model’s strengths.

Details

Motivation: Clinical text classification requires choosing between specialized BERT models and general-purpose LLMs, but neither dominates across all instances. The paper aims to develop a framework that can adaptively defer between models to improve overall performance.

Method: Introduces Learning to Defer for clinical text (L2D-Clinical), which learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts, this approach enables adaptive deferral between AI models.

Result: On ADE detection: L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances. On MIMIC treatment outcome classification: achieves F1=0.980 (+9.3 points over BERT) by deferring 16.8% of cases to the LLM. The framework learns to leverage LLM strengths while minimizing API costs.

Conclusion: L2D-Clinical successfully learns to selectively leverage LLM strengths when they complement BERT classifiers, improving clinical text classification accuracy while managing computational costs through strategic deferral.

Abstract: Clinical text classification requires choosing between specialized fine-tuned models (BERT variants) and general-purpose large language models (LLMs), yet neither dominates across all instances. We introduce Learning to Defer for clinical text (L2D-Clinical), a framework that learns when a BERT classifier should defer to an LLM based on uncertainty signals and text characteristics. Unlike prior L2D work that defers to human experts assumed universally superior, our approach enables adaptive deferral-improving accuracy when the LLM complements BERT. We evaluate on two English clinical tasks: (1) ADE detection (ADE Corpus V2), where BioBERT (F1=0.911) outperforms the LLM (F1=0.765), and (2) treatment outcome classification (MIMIC-IV with multi-LLM consensus ground truth), where GPT-5-nano (F1=0.967) outperforms ClinicalBERT (F1=0.887). On ADE, L2D-Clinical achieves F1=0.928 (+1.7 points over BERT) by selectively deferring 7% of instances where the LLM’s high recall compensates for BERT’s misses. On MIMIC, L2D-Clinical achieves F1=0.980 (+9.3 points over BERT) by deferring only 16.8% of cases to the LLM. The key insight is that L2D-Clinical learns to selectively leverage LLM strengths while minimizing API costs.

[32] English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

Mehak Dhaliwal, Shashwat Chaurasia, Yao Qin, Dezhi Hong, Thomas Butler

Main category: cs.CL

TL;DR: Systematic study shows multilingual post-training improves performance across languages, with low-resource languages benefiting most and even minimal multilinguality helping English performance and cross-lingual generalization.

Details

Motivation: Current post-training pipelines for large language models are predominantly English-centric, leading to performance disparities across languages. The paper aims to systematically study how training language coverage, model scale, and task domain interact in multilingual settings.

Method: Conducted 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks. Used models up to 8B parameters and systematically varied language coverage during post-training.

Result: Increasing language coverage benefits all tasks and model scales, with low-resource languages benefiting most and high-resource languages plateauing rather than degrading. Even minimal multilinguality (single non-English language) improves both English performance and cross-lingual generalization. At sufficient language diversity, zero-shot cross-lingual transfer can match or exceed direct language inclusion in low-diversity settings.

Conclusion: English-only post-training is largely suboptimal. Multilingual post-training improves performance across languages, with benefits for both high and low-resource languages. Zero-shot transfer works well with sufficient language diversity, though gains remain limited for typologically distant, low-resource languages.

Abstract: Despite the widespread multilingual deployment of large language models, post-training pipelines remain predominantly English-centric, contributing to performance disparities across languages. We present a systematic, controlled study of the interplay between training language coverage, model scale, and task domain, based on 220 supervised fine-tuning runs on parallel translated multilingual data mixtures spanning mathematical reasoning and API calling tasks, with models up to 8B parameters. We find that increasing language coverage during post-training is largely beneficial across tasks and model scales, with low-resource languages benefiting the most and high-resource languages plateauing rather than degrading. Even minimal multilinguality helps: incorporating a single non-English language improves both English performance and cross-lingual generalization, making English-only post-training largely suboptimal. Moreover, at sufficient language diversity, zero-shot cross-lingual transfer can match or exceed the effects of direct language inclusion in a low-diversity setting, although gains remain limited for typologically distant, low-resource languages.

[33] Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus

John E. Ortega, Rodolfo Zevallos, Fabricio Carraro

Main category: cs.CL

TL;DR: A unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three TTS architectures (XTTS v2, F5-TTS, DiFlow-TTS) with cross-lingual transfer to address data scarcity in Quechua.

Details

Motivation: To develop inclusive TTS systems for political and legal content in low-resource settings, specifically addressing the challenge of synthesizing speech for indigenous languages like Quechua which suffer from data scarcity while maintaining quality in Spanish.

Method: Used three state-of-the-art TTS architectures (XTTS v2, F5-TTS, DiFlow-TTS) trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions. Leveraged bilingual and multilingual TTS capabilities and cross-lingual transfer to improve synthesis quality in both languages.

Result: Developed a unified pipeline that mitigates data scarcity in Quechua while preserving naturalness in Spanish. Released trained checkpoints, inference code, and synthesized audio for each constitutional article as reusable resources.

Conclusion: This work contributes to inclusive TTS systems for political/legal content in low-resource multilingual contexts, providing valuable resources for speech technologies in indigenous language settings.

Abstract: We present a unified pipeline for synthesizing high-quality Quechua and Spanish speech for the Peruvian Constitution using three state-of-the-art text-to-speech (TTS) architectures: XTTS v2, F5-TTS, and DiFlow-TTS. Our models are trained on independent Spanish and Quechua speech datasets with heterogeneous sizes and recording conditions, and leverage bilingual and multilingual TTS capabilities to improve synthesis quality in both languages. By exploiting cross-lingual transfer, our framework mitigates data scarcity in Quechua while preserving naturalness in Spanish. We release trained checkpoints, inference code, and synthesized audio for each constitutional article, providing a reusable resource for speech technologies in indigenous and multilingual contexts. This work contributes to the development of inclusive TTS systems for political and legal content in low-resource settings.

[34] AgentSPEX: An Agent SPecification and EXecution Language

Pengcheng Wang, Jerry Huang, Jiarui Yao, Rui Pan, Peizhi Niu, Yaowenqi Liu, Ruida Wang, Renhao Lu, Yuwei Guo, Tong Zhang

Main category: cs.CL

TL;DR: AgentSPEX is a specification language for LLM-agent workflows with explicit control flow, modular structure, and visual editing tools, addressing limitations of reactive prompting and Python-coupled frameworks.

Details

Motivation: Current LLM agent systems have two main issues: reactive prompting leaves control flow implicit and hard to control, while orchestration frameworks like LangGraph tightly couple workflow logic with Python, making agents difficult to maintain and modify.

Method: Introduces AgentSPEX, an Agent Specification and Execution Language that supports typed steps, branching/loops, parallel execution, reusable submodules, and explicit state management. Includes a customizable agent harness with tool access, sandboxed environment, checkpointing, verification, and logging, plus a visual editor with synchronized graph and workflow views.

Result: Evaluated on 7 benchmarks and includes ready-to-use agents for deep research and scientific research. User study shows AgentSPEX provides more interpretable and accessible workflow-authoring paradigm than existing frameworks.

Conclusion: AgentSPEX addresses key limitations in current LLM agent systems by providing structured specification language with explicit control flow, modular design, and visual authoring tools, making agent workflows more maintainable, interpretable, and accessible.

Abstract: Language-model agent systems commonly rely on reactive prompting, in which a single instruction guides the model through an open-ended sequence of reasoning and tool-use steps, leaving control flow and intermediate state implicit and making agent behavior potentially difficult to control. Orchestration frameworks such as LangGraph, DSPy, and CrewAI impose greater structure through explicit workflow definitions, but tightly couple workflow logic with Python, making agents difficult to maintain and modify. In this paper, we introduce AgentSPEX, an Agent SPecification and EXecution Language for specifying LLM-agent workflows with explicit control flow and modular structure, along with a customizable agent harness. AgentSPEX supports typed steps, branching and loops, parallel execution, reusable submodules, and explicit state management, and these workflows execute within an agent harness that provides tool access, a sandboxed virtual environment, and support for checkpointing, verification, and logging. Furthermore, we provide a visual editor with synchronized graph and workflow views for authoring and inspection. We include ready-to-use agents for deep research and scientific research, and we evaluate AgentSPEX on 7 benchmarks. Finally, we show through a user study that AgentSPEX provides a more interpretable and accessible workflow-authoring paradigm than a popular existing agent framework.

[35] Peer-Predictive Self-Training for Language Model Reasoning

Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, Yiling Chen

Main category: cs.CL

TL;DR: PST is a label-free self-training framework where multiple language models improve collaboratively using aggregated responses as internal training signals, with PMI-based scaling for updates.

Details

Motivation: The paper addresses the challenge of enabling language models to self-improve without external supervision, seeking mechanisms for continued enhancement through internal collaborative learning rather than relying on external labels or teacher-student hierarchies.

Method: Proposes Peer-Predictive Self-Training (PST) where multiple models generate responses sequentially to prompts, aggregate their answers, and use pointwise mutual information (PMI) to measure how informative each intermediate response is about the aggregate. This PMI signal scales self-training updates - aligned responses are updated less, while misaligned ones are updated more.

Result: On mathematical reasoning benchmarks (SimulEq, Math500, MultiArith), PST improves exact-match accuracy by 2.2-4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap by 26-40%, requiring no external supervision.

Conclusion: Cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training, enabling collaborative improvement without external supervision or hierarchical structures.

Abstract: Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and MultiArith), PST improves exact-match accuracy by 2.2 to 4.3 percentage points across Gemma-2-2B, LLaMA-3.2-1B, and Qwen-2.5-1.5B, and reduces the average generator-verifier gap (GV-Gap) by 26 to 40 percent, while requiring no external supervision or teacher-student hierarchy and relying solely on cross-model interactions. These results suggest that cross-model generations and peer-predictive feedback can serve as an effective approach for self-supervised training.

[36] TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models

Yarui Cao, Kai Liu

Main category: cs.CL

TL;DR: A novel Parameter-Efficient Fine-Tuning (PEFT) method called TLoRA+ that incorporates a specialized optimizer into pre-trained model weight matrices, improving performance while maintaining efficiency.

Details

Motivation: To develop a more effective PEFT method that goes beyond existing approaches like LoRA, which matches full fine-tuning performance but may have room for improvement. The goal is to enhance adaptation capabilities without sacrificing efficiency or adding inference latency.

Method: Proposes TLoRA+, which incorporates a specialized optimizer into the weight matrices of pre-trained models. This approach builds upon Low-Rank Adaptation (LoRA) principles but adds the TLoRA+ optimizer component to enhance adaptation while preserving the efficiency benefits of low-rank methods.

Result: Experiments on the GLUE benchmark across diverse model architectures show consistent effectiveness and robustness. The method preserves LoRA’s efficiency while further enhancing performance without significantly increasing computational cost.

Conclusion: TLoRA+ represents an advancement in PEFT methods that improves upon LoRA by incorporating optimizer components into weight matrices, offering better performance while maintaining efficiency for fine-tuning large language models.

Abstract: Fine-tuning large language models (LLMs) aims to adapt pre-trained models to specific tasks using relatively small and domain-specific datasets. Among Parameter-Efficient Fine-Tuning (PEFT) methods, Low-Rank Adaptation (LoRA) stands out by matching the performance of full fine-tuning while avoiding additional inference latency. In this paper, we propose a novel PEFT method that incorporates the TLoRA+ optimizer into the weight matrices of pre-trained models. The proposed approach not only preserves the efficiency of low-rank adaptation but also further enhances performance without significantly increasing computational cost. We conduct experiments on the GLUE benchmark across diverse model architectures. Numerical experiments consistently demonstrate the effectiveness and robustness of our proposed method.

[37] Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

Md. Fahad Ullah Utsho, Mohd. Ruhul Ameen, Akif Islam, Md. Golam Rashed, Dipankar Das

Main category: cs.CL

TL;DR: Paper introduces a benchmarking framework to evaluate reasoning robustness in Large Language Models under controlled complexity increases, revealing consistent “reasoning collapse” beyond task-specific thresholds.

Details

Motivation: Current LLM evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves with increasing task complexity. There's a need for systematic evaluation of reasoning robustness under controlled complexity progression.

Method: Constructed a suite of nine classical reasoning tasks (Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, Rubik’s Cube) parameterized to precisely control complexity while preserving semantics. Used deterministic validators to evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes.

Result: Models show consistent phase transition behavior: high accuracy at low complexity but sharp degradation beyond task-specific thresholds (reasoning collapse). Accuracy declines often exceed 50%, with inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length doesn’t reliably improve correctness, and gains don’t generalize across problem families.

Conclusion: Current LLM reasoning capabilities are fragile and collapse under controlled complexity increases. Evaluation methodologies need to move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity progression.

Abstract: Large Language Models (LLMs) are increasingly described as possessing strong reasoning capabilities, supported by high performance on mathematical, logical, and planning benchmarks. However, most existing evaluations rely on aggregate accuracy over fixed datasets, obscuring how reasoning behavior evolves as task complexity increases. In this work, we introduce a controlled benchmarking framework to systematically evaluate the robustness of reasoning in Large Reasoning Models (LRMs) under progressively increasing problem complexity. We construct a suite of nine classical reasoning tasks: Boolean Satisfiability, Cryptarithmetic, Graph Coloring, River Crossing, Tower of Hanoi, Water Jug, Checker Jumping, Sudoku, and Rubik’s Cube, each parameterized to precisely control complexity while preserving underlying semantics. Using deterministic validators, we evaluate multiple open and proprietary LRMs across low, intermediate, and high complexity regimes, ensuring that only fully valid solutions are accepted. Our results reveal a consistent phase transition like behavior: models achieve high accuracy at low complexity but degrade sharply beyond task specific complexity thresholds. We formalize this phenomenon as reasoning collapse. Across tasks, we observe substantial accuracy declines, often exceeding 50%, accompanied by inconsistent reasoning traces, constraint violations, loss of state tracking, and confidently incorrect outputs. Increased reasoning length does not reliably improve correctness, and gains in one problem family do not generalize to others. These findings highlight the need for evaluation methodologies that move beyond static benchmarks and explicitly measure reasoning robustness under controlled complexity.

[38] From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning

Shihao Zhang, Ziwei Wang, Jie Zhou, Yulan Wu, Qin Chen, Zhikai Lei, Liyang Yu, Liang Dou, Liang He

Main category: cs.CL

TL;DR: ABSA-R1: A reinforcement learning framework that makes aspect-based sentiment analysis models generate natural language justifications before predictions, mimicking human reasoning processes.

Details

Motivation: Current ABSA systems are black boxes that lack explicit reasoning capabilities. Humans don't just categorize sentiment but construct causal explanations for their judgments. The paper aims to bridge this gap by making models explain their reasoning.

Method: Proposes ABSA-R1 framework using reinforcement learning to generate natural language justifications before predictions. Introduces a Cognition-Aligned Reward Model to enforce consistency between reasoning and final labels, and a performance-driven rejection sampling strategy for hard cases.

Result: Experimental results on four benchmarks show that adding explicit reasoning capabilities improves both interpretability and performance in sentiment classification and triplet extraction compared to non-reasoning baselines.

Conclusion: The framework successfully mimics human “reason-before-predict” cognitive processes, enhancing both explainability and accuracy in aspect-based sentiment analysis tasks.

Abstract: While Aspect-based Sentiment Analysis (ABSA) systems have achieved high accuracy in identifying sentiment polarities, they often operate as “black boxes,” lacking the explicit reasoning capabilities characteristic of human affective cognition. Humans do not merely categorize sentiment; they construct causal explanations for their judgments. To bridge this gap, we propose ABSA-R1, a large language model framework designed to mimic this ``reason-before-predict" cognitive process. By leveraging reinforcement learning (RL), ABSA-R1 learns to articulate the why behind the what, generating natural language justifications that ground its sentiment predictions. We introduce a Cognition-Aligned Reward Model (formerly sentiment-aware reward model) that enforces consistency between the generated reasoning path and the final emotional label. Furthermore, inspired by metacognitive monitoring, we implement a performance-driven rejection sampling strategy that selectively targets hard cases where the model’s internal reasoning is uncertain or inconsistent. Experimental results on four benchmarks demonstrate that equipping models with this explicit reasoning capability not only enhances interpretability but also yields superior performance in sentiment classification and triplet extraction compared to non-reasoning baselines.

[39] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

Han Wang, David Wan, Hyunji Lee, Thinh Pham, Mikaela Cankosyan, Weiyuan Chen, Elias Stengel-Eskin, Tu Vu, Mohit Bansal

Main category: cs.CL

TL;DR: MERRIN is a challenging benchmark for evaluating multimodal search agents’ ability to retrieve and reason over noisy web evidence across diverse modalities including audio and video.

Details

Motivation: To address the challenges of underspecified multi-hop search queries and multimodal, heterogeneous, often conflicting web results, creating a testbed for evaluating AI agents' multimodal evidence retrieval and reasoning capabilities.

Method: Introduces MERRIN benchmark with natural language queries without explicit modality cues, incorporates underexplored modalities (video, audio), and requires retrieval of complex noisy/conflicting multimodal evidence. Evaluates 10 models across three search settings.

Result: Benchmark is highly challenging with average accuracy of 22.3% and best agent reaching only 40.1%. Stronger agents show modest gains due to over-exploration and distraction by conflicting content. Agents underperform humans despite consuming more resources.

Conclusion: MERRIN highlights the need for search agents with robust multimodal search and reasoning capabilities in noisy web environments, serving as a valuable testbed for evaluating such abilities.

Abstract: Motivated by the underspecified, multi-hop nature of search queries and the multimodal, heterogeneous, and often conflicting nature of real-world web results, we introduce MERRIN (Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments), a human-annotated benchmark for evaluating search-augmented agents. MERRIN measures AI agents’ ability to identify relevant modalities, retrieve multimodal evidence, and perform multi-hop reasoning over noisy web sources. It differs from prior work in three important aspects: (1) using natural language queries without explicit modality cues, (2) incorporating underexplored modalities such as video and audio, and (3) requiring the retrieval of complex, often noisy or conflicting multimodal evidence during web search. We evaluate diverse search agents powered by ten models, including strong closed-source models (e.g., GPT-5.4-mini, Gemini 3/3.1 Flash/Pro) and open-weight models (Qwen3-4B/30B/235B), across three search settings (no search, native search, and agentic search). Our results show that MERRIN is highly challenging: the average accuracy across all agents is 22.3%, with the best-performing agent reaching only 40.1%. We further observe that while stronger agents like Gemini Deep Research achieve higher performance, gains are modest due to over-exploration; they take more steps and use more tools, but are often distracted by conflicting or partially relevant web content, leading to incorrect answers. Compared to humans, these agents consume more resources yet achieve lower accuracy, largely due to inefficient source selection and an overreliance on text modalities. These findings highlight the need for search agents capable of robust search and reasoning across diverse modalities in noisy web environments, making MERRIN a valuable testbed for evaluating such capabilities.

[40] CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

Ishani Mondal, Yiwen Song, Mihir Parmar, Palash Goyal, Jordan Boyd-Graber, Tomas Pfister, Yale Song

Main category: cs.CL

TL;DR: CANVAS is a multi-agent framework for generating visually continuous storyboards by enforcing character consistency, background persistence, and smooth scene transitions.

Details

Motivation: Existing generative models produce strong individual frames but fail to maintain visual continuity across shots in long-form storytelling, leading to inconsistent characters, backgrounds, and abrupt scene shifts.

Method: CANVAS uses a multi-agent framework that explicitly plans visual continuity through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting.

Result: CANVAS outperforms baselines on ST-BENCH and ViStoryBench benchmarks, improving background continuity by 21.6%, character consistency by 9.6%, and props consistency by 7.6%, and introduces a new challenging benchmark HardContinuityBench.

Conclusion: The CANVAS framework effectively addresses visual continuity challenges in long-form storytelling through explicit planning mechanisms, significantly improving narrative coherence across multiple shots.

Abstract: Long-form visual storytelling requires maintaining continuity across shots, including consistent characters, stable environments, and smooth scene transitions. While existing generative models can produce strong individual frames, they fail to preserve such continuity, leading to appearance changes, inconsistent backgrounds, and abrupt scene shifts. We introduce CANVAS (Continuity-Aware Narratives via Visual Agentic Storyboarding), a multi-agent framework that explicitly plans visual continuity in multi-shot narratives. CANVAS enforces coherence through character continuity, persistent background anchors, and location-aware scene planning for smooth transitions within the same setting We evaluate CANVAS on two storyboard generation benchmarks ST-BENCH and ViStoryBench and introduce a new challenging benchmark HardContinuityBench for long-range narrative consistency. CANVAS consistently outperforms the best-performing baseline, improving background continuity by 21.6%, character consistency by 9.6% and props consistency by 7.6%.

[41] Using reasoning LLMs to extract SDOH events from clinical notes

Ertan Doganl, Kunyu Yu, Yifan Peng

Main category: cs.CL

TL;DR: LLM-based prompt engineering approach for extracting Social Determinants of Health from unstructured clinical notes achieves competitive performance with simpler implementation than BERT-based methods.

Details

Motivation: SDOH information is crucial for patient care but trapped in unstructured clinical notes; existing BERT-based NLP methods work but require complex implementation and heavy computational resources.

Method: Four-module approach: 1) concise prompts with guidelines, 2) few-shot learning with curated examples, 3) self-consistency for robust outputs, 4) post-processing for quality control.

Result: Achieved micro-F1 score of 0.866, competitive with leading models, demonstrating LLMs with reasoning capabilities are effective for SDOH event extraction.

Conclusion: LLMs offer both implementation simplicity and strong performance for extracting structured SDOH from clinical text, making them practical solutions for healthcare applications.

Abstract: Social Determinants of Health (SDOH) refer to environmental, behavioral, and social conditions that influence how individuals live, work, and age. SDOH have a significant impact on personal health outcomes, and their systematic identification and management can yield substantial improvements in patient care. However, SDOH information is predominantly captured in unstructured clinical notes within electronic health records, which limits its direct use as machine-readable entities. To address this issue, researchers have employed Natural Language Processing (NLP) techniques using pre-trained BERT-based models, demonstrating promising performance but requiring sophisticated implementation and extensive computational resources. In this study, we investigated prompt engineering strategies for extracting structured SDOH events utilizing LLMs with advanced reasoning capabilities. Our method consisted of four modules: 1) developing concise and descriptive prompts integrated with established guidelines, 2) applying few-shot learning with carefully curated examples, 3) using a self-consistency mechanism to ensure robust outputs, and 4) post-processing for quality control. Our approach achieved a micro-F1 score of 0.866, demonstrating competitive performance compared to the leading models. The results demonstrated that LLMs with reasoning capabilities are effective solutions for SDOH event extraction, offering both implementation simplicity and strong performance.

[42] ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

Heming Xia, Yongqi Li, Cunxiao Du, Mingbo Song, Wenjie Li

Main category: cs.CL

TL;DR: ToolSpec accelerates LLM tool calling using schema-aware speculative decoding and retrieval of historical invocations, achieving up to 4.2x speedup.

Details

Motivation: Tool calling enables LLMs to interact with external applications, but multi-step, multi-turn interactions incur substantial latency that hinders real-time serving. Tool-calling traces are highly structured with constrained schemas and recurring patterns, presenting an opportunity for optimization.

Method: ToolSpec uses schema-aware, retrieval-augmented speculative decoding. It exploits predefined tool schemas to generate accurate drafts using a finite-state machine that alternates between deterministic schema token filling and speculative generation for variable fields. Additionally, it retrieves similar historical tool invocations and reuses them as drafts to improve efficiency.

Result: Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.

Conclusion: ToolSpec presents an effective plug-and-play solution for accelerating LLM tool calling by leveraging structured schemas and historical patterns, addressing latency challenges in real-time LLM serving.

Abstract: Tool calling has greatly expanded the practical utility of large language models (LLMs) by enabling them to interact with external applications. As LLM capabilities advance, effective tool use increasingly involves multi-step, multi-turn interactions to solve complex tasks. However, the resulting growth in tool interactions incurs substantial latency, posing a key challenge for real-time LLM serving. Through empirical analysis, we find that tool-calling traces are highly structured, conform to constrained schemas, and often exhibit recurring invocation patterns. Motivated by this, we propose ToolSpec, a schema-aware, retrieval-augmented speculative decoding method for accelerating tool calling. ToolSpec exploits predefined tool schemas to generate accurate drafts, using a finite-state machine to alternate between deterministic schema token filling and speculative generation for variable fields. In addition, ToolSpec retrieves similar historical tool invocations and reuses them as drafts to further improve efficiency. ToolSpec presents a plug-and-play solution that can be seamlessly integrated into existing LLM workflows. Experiments across multiple benchmarks demonstrate that ToolSpec achieves up to a 4.2x speedup, substantially outperforming existing training-free speculative decoding methods.

[43] Synthesizing Instruction-Tuning Datasets with Contrastive Decoding

Tatsuya Ichinose, Youmi Ma, Masanari Oi, Ryuto Koike, Naoaki Okazaki

Main category: cs.CL

TL;DR: CoDIT: A method for instruction tuning that uses contrastive decoding between post-trained and pre-trained LLMs to disentangle instruction-following capabilities from pre-trained knowledge, improving instruction tuning effectiveness.

Details

Motivation: Existing instruction tuning approaches overlook that LLM-generated responses conflate world knowledge (from pre-training) with instruction-following capabilities (from post-training). The authors hypothesize that disentangling these improves instruction tuning effectiveness.

Method: Proposes CoDIT (Contrastive Decoding for Instruction Tuning), which applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. This suppresses shared pre-trained knowledge while amplifying instruction-following behavior acquired via post-training.

Result: Models trained on CoDIT-generated datasets consistently outperform those trained on directly generated responses. They also outperform models trained on existing public instruction-tuning datasets across multiple benchmarks. CoDIT enables transfer of instruction-tuning capabilities across different model architectures.

Conclusion: Disentangling instruction-following capabilities from pre-trained knowledge improves instruction tuning. CoDIT effectively achieves this through contrastive decoding, producing better training data and enabling capability transfer across architectures.

Abstract: Using responses generated by high-performing large language models (LLMs) for instruction tuning has become a widely adopted approach. However, the existing literature overlooks a property of LLM-generated responses: they conflate world knowledge acquired during pre-training with instruction-following capabilities acquired during post-training. We hypothesize that disentangling the instruction-following capabilities from pre-trained knowledge improves the effectiveness of instruction tuning. To this end, we propose CoDIT, a method that applies contrastive decoding between a post-trained model and its pre-trained counterpart during response generation. The method suppresses pre-trained knowledge shared between the two models while amplifying the instruction-following behavior acquired via post-training, resulting in responses that more purely reflect instruction-following capabilities. Experiment results demonstrate that models trained on datasets constructed via CoDIT consistently outperform those trained on directly generated responses. Training on our datasets also yields better performance than on existing publicly available instruction-tuning datasets across multiple benchmarks. Furthermore, we theoretically and empirically show that CoDIT can be interpreted as distilling the chat vector from parameter space to text space, enabling the transfer of instruction-tuning capabilities across models of different architectures.

[44] Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate

Cunda Wang, Ziying Ma, Po Hu, Weihua Wang, Feilong Bao

Main category: cs.CL

TL;DR: AgentEA is a reliable entity alignment framework using multi-agent debate to improve alignment decisions across knowledge graphs, with representation optimization and two-stage debate mechanisms.

Details

Motivation: Current LLM-based entity alignment methods suffer from unreliable candidate entity sets and limited reasoning capabilities, which critically affect alignment decision effectiveness.

Method: Uses entity representation preference optimization to improve embeddings, then introduces two-stage multi-role debate: lightweight debate verification followed by deep debate alignment for progressive reliability enhancement.

Result: Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.

Conclusion: AgentEA provides a reliable entity alignment framework that enhances decision reliability through multi-agent debate mechanisms and representation optimization.

Abstract: Entity alignment (EA) aims to identify entities referring to the same real-world object across different knowledge graphs (KGs). Recent approaches based on large language models (LLMs) typically obtain entity embeddings through knowledge representation learning and use embedding similarity to identify an alignment-uncertain entity set. For each uncertain entity, a candidate entity set (CES) is then retrieved based on embedding similarity to support subsequent alignment reasoning and decision making. However, the reliability of the CES and the reasoning capability of LLMs critically affect the effectiveness of subsequent alignment decisions. To address this issue, we propose AgentEA, a reliable EA framework based on multi-agent debate. AgentEA first improves embedding quality through entity representation preference optimization, and then introduces a two-stage multi-role debate mechanism consisting of lightweight debate verification and deep debate alignment to progressively enhance the reliability of alignment decisions while enabling more efficient debate-based reasoning. Extensive experiments on public benchmarks under cross-lingual, sparse, large-scale, and heterogeneous settings demonstrate the effectiveness of AgentEA.

[45] Training-Free Test-Time Contrastive Learning for Large Language Models

Kaiwen Zheng, Kai Zhou, Jinwu Hu, Te Gu, Mingkai Peng, Fei Liu

Main category: cs.CL

TL;DR: TF-TTCL is a training-free test-time adaptation framework that enables frozen LLMs to improve online by distilling supervision from their own inference experiences through an “Explore-Reflect-Steer” loop.

Details

Motivation: LLMs show strong reasoning but degrade under distribution shift. Existing TTA methods need gradient updates (white-box access) or substantial overhead, while training-free alternatives are static or need external guidance.

Method: Three modules: 1) Semantic Query Augmentation diversifies problem views via multi-agent role-playing; 2) Contrastive Experience Distillation captures semantic gaps between superior/inferior trajectories into textual rules; 3) Contextual Rule Retrieval activates stored rules during inference to steer the frozen LLM.

Result: Extensive experiments on closed-ended reasoning and open-ended evaluation tasks show TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation.

Conclusion: TF-TTCL enables frozen LLMs to adapt online without training, improving robustness under distribution shift through self-supervised experience distillation.

Abstract: Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning TF-TTCL, a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic “Explore-Reflect-Steer” loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.

[46] YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

You Wu, Ziheng Chen, Yizhen Zhang, Haoyi Wu, Chengting Yu, Yuchi Xu, Wenbo Su, Bo Zheng, Kewei Tu

Main category: cs.CL

TL;DR: YOCO++ enhances cross-layer KV compression for efficient LLM inference by adding weighted residual connections between bottom-half layers, improving performance while maintaining 50% KV cache compression.

Details

Motivation: Existing cross-layer KV compression methods for LLM inference reduce memory consumption but introduce performance degradation. The authors aim to enhance YOCO, a method that shares KVs of middle layer with top-half layers, to achieve better performance while maintaining efficiency.

Method: Propose YOCO++ which incorporates weighted residual connections between the KVs of each bottom-half layer and the bottom layer. This increases model capacity while maintaining the same training and inference efficiency as YOCO.

Result: YOCO++ achieves state-of-the-art performance among cross-layer KV compression methods at 50% KV cache compression rate, outperforming both standard Transformer and original YOCO.

Conclusion: The enhanced YOCO++ method successfully improves performance of cross-layer KV compression while maintaining compression efficiency, making it a promising approach for efficient LLM inference.

Abstract: Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible performance degradation. In this work, we aim to enhance the performance of YOCO, a cross-layer KV compression method that shares the KVs of the middle layer with the top-half layers. We propose YOCO++, an enhanced YOCO that incorporates a weighted residual connection between the KVs of each bottom-half layer and the bottom layer. Compared to YOCO, YOCO++ increases model capacity while maintaining the same training and inference efficiency. Our experiments show that YOCO++ achieves state-of-the-art performance among the cross-layer KV compression methods at a 50% KV cache compression rate, outperforming the standard Transformer.

[47] MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

Jiahang Lin, Kai Hu, Binghai Wang, Yuhao Zhou, Zhiheng Xi, Honglin Guo, Shichun Liu, Junzhe Wang, Shihan Dou, Enyu Zhou, Hang Yan, Zhenhua Han, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.CL

TL;DR: MM-Doc-R1 is a vision-aware agentic framework for long document visual question answering that uses iterative information discovery and Similarity-based Policy Optimization (SPO) to improve multi-turn reinforcement learning for better complex query handling.

Details

Motivation: Conventional RAG systems struggle with complex multi-hop queries over long documents due to single-pass retrieval limitations. There's a need for better approaches to handle visual question answering on long documents that require iterative information discovery and synthesis.

Method: Proposes MM-Doc-R1 framework with agentic, vision-aware workflow for iterative information discovery. Introduces Similarity-based Policy Optimization (SPO) to address baseline estimation bias in multi-turn RL by calculating more precise baselines through similarity-weighted averaging of rewards across trajectories.

Result: MM-Doc-R1 outperforms previous baselines by 10.4% on MMLongbench-Doc benchmark. SPO boosts results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B compared to GRPO, demonstrating superior training performance.

Conclusion: The integrated framework with novel SPO training algorithm advances state-of-the-art for complex, long-document visual question answering by providing more stable and accurate learning signals for agents.

Abstract: Conventional Retrieval-Augmented Generation (RAG) systems often struggle with complex multi-hop queries over long documents due to their single-pass retrieval. We introduce MM-Doc-R1, a novel framework that employs an agentic, vision-aware workflow to address long document visual question answering through iterative information discovery and synthesis. To incentivize the information seeking capabilities of our agents, we propose Similarity-based Policy Optimization (SPO), addressing baseline estimation bias in existing multi-turn reinforcement learning (RL) algorithms like GRPO. Our core insight is that in multi-turn RL, the more semantically similar two trajectories are, the more accurate their shared baseline estimation becomes. Leveraging this, SPO calculates a more precise baseline by similarity-weighted averaging of rewards across multiple trajectories, unlike GRPO which inappropriately applies the initial state’s baseline to all intermediate states. This provides a more stable and accurate learning signal for our agents, leading to superior training performance that surpasses GRPO. Our experiments on the MMLongbench-Doc benchmark show that MM-Doc-R1 outperforms previous baselines by 10.4%. Furthermore, SPO demonstrates superior performance over GRPO, boosting results by 5.0% with Qwen3-8B and 6.1% with Qwen3-4B. These results highlight the effectiveness of our integrated framework and novel training algorithm in advancing the state-of-the-art for complex, long-document visual question answering.

[48] BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

Sebastian Nagl, Matthias Grabmair

Main category: cs.CL

TL;DR: BenGER is an open-source web platform for creating, annotating, running, and evaluating legal reasoning benchmarks for LLMs, with features for collaborative annotation and multi-tenant projects.

Details

Motivation: Current workflows for evaluating LLMs on legal reasoning are fragmented across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts.

Method: Developed BenGER framework - an integrated web platform with task creation, collaborative annotation, configurable LLM runs, and evaluation using lexical, semantic, factual, and judge-based metrics.

Result: Created an open-source platform supporting multi-organization projects with tenant isolation, role-based access control, and optional formative feedback to annotators.

Conclusion: BenGER provides a unified solution for legal reasoning benchmark creation and evaluation, addressing fragmentation issues in current evaluation workflows.

Abstract: Evaluating large language models (LLMs) for legal reasoning requires workflows that span task design, expert annotation, model execution, and metric-based evaluation. In practice, these steps are split across platforms and scripts, limiting transparency, reproducibility, and participation by non-technical legal experts. We present the BenGER (Benchmark for German Law) framework, an open-source web platform that integrates task creation, collaborative annotation, configurable LLM runs, and evaluation with lexical, semantic, factual, and judge-based metrics. BenGER supports multi-organization projects with tenant isolation and role-based access control, and can optionally provide formative, reference-grounded feedback to annotators. We will demonstrate a live deployment showing end-to-end benchmark creation and analysis.

[49] Foresight Optimization for Strategic Reasoning in Large Language Models

Jiashuo Wang, Jiawen Duan, Jian Wang, Kaitao Song, Chunpu Xu, Johnny K. W. Ho, Fenggang Yu, Wenjie Li, Johan F. Hoorn

Main category: cs.CL

TL;DR: FoPO enhances LLM strategic reasoning in multi-agent environments by integrating opponent modeling into policy optimization, enabling better foresight and decision-making.

Details

Motivation: Existing reasoning-based LLMs struggle with effective decision-making in multi-agent environments due to lack of explicit foresight modeling. Strategic reasoning is needed to anticipate counterpart behaviors and foresee future actions.

Method: Foresight Policy Optimization (FoPO) integrates opponent modeling principles into policy optimization, enabling explicit consideration of both self-interest and counterpart influence. Uses two curated datasets (Cooperative RSA and Competitive Taboo) in a self-play framework.

Result: FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, outperforming standard LLM reasoning optimization baselines.

Conclusion: FoPO effectively enhances strategic reasoning in LLMs for multi-agent decision-making by incorporating foresight modeling through opponent-aware policy optimization.

Abstract: Reasoning capabilities in large language models (LLMs) have generally advanced significantly. However, it is still challenging for existing reasoning-based LLMs to perform effective decision-making abilities in multi-agent environments, due to the absence of explicit foresight modeling. To this end, strategic reasoning, the most fundamental capability to anticipate the counterpart’s behaviors and foresee its possible future actions, has been introduced to alleviate the above issues. Strategic reasoning is fundamental to effective decision-making in multi-agent environments, yet existing reasoning enhancement methods for LLMs do not explicitly capture its foresight nature. In this work, we introduce Foresight Policy Optimization (FoPO) to enhance strategic reasoning in LLMs, which integrates opponent modeling principles into policy optimization, thereby enabling explicit consideration of both self-interest and counterpart influence. Specifically, we construct two curated datasets, namely Cooperative RSA and Competitive Taboo, equipped with well-designed rules and moderate difficulty to facilitate a systematic investigation of FoPO in a self-play framework. Our experiments demonstrate that FoPO significantly enhances strategic reasoning across LLMs of varying sizes and origins. Moreover, models trained with FoPO exhibit strong generalization to out-of-domain strategic scenarios, substantially outperforming standard LLM reasoning optimization baselines.

[50] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Akira Kawabata, Saku Sugawara

Main category: cs.CL

TL;DR: C2 framework improves reward models by training them to critically collaborate with rubric generators using only binary preferences, without costly rubric annotations.

Details

Motivation: Existing rubric-augmented verification methods require expensive rubric annotations and suffer from low-quality rubrics that mislead reward models rather than help them.

Method: Proposes Cooperative yet Critical reward modeling (C2) where reward models critically collaborate with rubric generators trained solely from binary preferences. Synthesizes helpful/misleading rubric pairs by measuring rubric impact on reward model decisions, then trains cooperative rubric generator and critical verifier.

Result: C2 outperforms reasoning reward models with gains up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Enables 8B reward model to match performance of rubrics from 4× larger model without external annotations.

Conclusion: Eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy and scalable without requiring costly rubric annotations.

Abstract: Rubric-augmented verification guides reward models with explicit evaluation criteria, yielding more reliable judgments than single-model verification. However, most existing methods require costly rubric annotations, limiting scalability. Moreover, we find that rubric generation is vulnerable to a failure of cooperation; low-quality rubrics actively mislead reward models rather than help. Inspired by the principle of cooperative communication, we propose Cooperative yet Critical reward modeling (C2), a framework that significantly improves reward model judgments by having the reward model critically collaborate with a rubric generator trained solely from binary preferences. In C2, we synthesize helpful and misleading rubric pairs by measuring how each rubric shifts the reward model toward or away from the correct preference. Using these contrastive pairs, we train a cooperative rubric generator to propose helpful rubrics, and a critical verifier to assess rubric validity before making its judgment, following only rubrics it deems helpful at inference time. C2 outperforms reasoning reward models trained on the same binary preferences, with gains of up to 6.5 points on RM-Bench and 6.0 points length-controlled win rate on AlpacaEval 2.0. Without external rubric annotations, C2 enables an 8B reward model to match performance achieved with rubrics from a 4$\times$ larger model. Overall, our work demonstrates that eliciting deliberate cooperation in rubric-augmented verification makes reward models more trustworthy in a scalable way.

[51] Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

Ahmet Tuğrul Bayrak, Mustafa Sertaç Türkel, Fatma Nur Korkmaz

Main category: cs.CL

TL;DR: Syn-TurnTurk: A synthetic Turkish dialogue dataset for turn-taking prediction using LLMs, achieving high accuracy with BI-LSTM and ensemble methods to improve voice chatbot timing.

Details

Motivation: Current voice chatbots rely on simple silence detection which fails due to irregular pauses in human speech, causing interruptions. This is especially problematic for languages like Turkish that lack quality turn-taking datasets.

Method: Created Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using Qwen LLMs to mimic real conversations with overlaps and strategic silences. Evaluated with traditional and deep learning architectures including BI-LSTM and ensemble methods.

Result: Advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieved high accuracy (0.839) and AUC scores (0.910), demonstrating the synthetic dataset’s effectiveness for understanding linguistic cues.

Conclusion: Synthetic datasets can effectively train models for turn-taking prediction in low-resource languages like Turkish, enabling more natural human-machine interaction in voice-based chatbots.

Abstract: Managing natural dialogue timing is a significant challenge for voice-based chatbots. Most current systems usually rely on simple silence detection, which often fails because human speech patterns involve irregular pauses. This causes bots to interrupt users, breaking the conversational flow. This problem is even more severe for languages like Turkish, which lack high-quality datasets for turn-taking prediction. This paper introduces Syn-TurnTurk, a synthetic Turkish dialogue dataset generated using various Qwen Large Language Models (LLMs) to mirror real-life verbal exchanges, including overlaps and strategic silences. We evaluated the dataset using several traditional and deep learning architectures. The results show that advanced models, particularly BI-LSTM and Ensemble (LR+RF) methods, achieve high accuracy (0.839) and AUC scores (0.910). These findings demonstrate that our synthetic dataset can have a positive affect for models understand linguistic cues, allowing for more natural human-machine interaction in Turkish.

[52] Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

Xuwen Zhou, Fangxin Liu, Chao Wang, Xiao Zheng, Hao Zheng, Min He, Li Jiang, Haibing Guan

Main category: cs.CL

TL;DR: Calibrated Speculative Decoding (CSD) improves speculative decoding by recovering valid tokens discarded by standard verification, using frequency-guided candidate selection and probability-guarded acceptance to handle semantically correct but lexically divergent outputs.

Details

Motivation: Conventional speculative decoding frameworks suffer from frequent false rejections when draft models produce semantically correct but lexically divergent outputs, wasting computational resources and reducing efficiency.

Method: CSD introduces two lightweight modules: Online Correction Memory (aggregates historical rejections to identify recurring divergence patterns) and Semantic Consistency Gating (verifies candidate admissibility using probability ratios instead of exact token matching).

Result: CSD outperforms existing methods with peak throughput speedup of 2.33x, preserves model accuracy across all tasks, and boosts performance on complex reasoning datasets.

Conclusion: CSD is a highly effective, lightweight solution for practical LLM deployments that addresses the false rejection problem in speculative decoding without requiring training.

Abstract: Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of “Frequency-Guided Candidate Selection and Probability-Guarded Acceptance,” CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD outperforms existing methods, achieving a peak throughput speedup of 2.33x. CSD preserves model accuracy across all tasks while further boosting performance on complex reasoning datasets. These results establish CSD as a highly effective, lightweight solution for practical LLM deployments.

[53] IndicDB – Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar

Main category: cs.CL

TL;DR: IndicDB: A multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages using realistic administrative data from Indian open-data platforms.

Details

Motivation: Existing Text-to-SQL benchmarks focus on Western contexts and simplified schemas, leaving a gap for real-world, non-Western applications. There's a need for multilingual benchmarks that capture the complexity of administrative data in diverse linguistic contexts.

Method: Created IndicDB using relational schemas from Indian open-data platforms (NDAP, IDP). Employed an iterative three-agent framework (Architect, Auditor, Refiner) to convert denormalized government data into rich relational structures. The pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages.

Result: Results show a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, increased structural ambiguity, and limited external knowledge. The benchmark comprises 20 databases across 237 tables with high relational density (11.85 tables per database).

Conclusion: IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL, highlighting performance disparities between English and Indic languages and providing a valuable resource for evaluating cross-lingual semantic parsing in real-world administrative contexts.

Abstract: While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an “Indic Gap” driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/

[54] Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection

Xiao Pu, Zepeng Cheng, Lin Yuan, Yu Wu, Xiuli Bi

Main category: cs.CL

TL;DR: A framework for AI-generated text detection that disentangles detection semantics from generator-specific artifacts to improve generalization to unseen LLMs.

Details

Motivation: As LLMs produce increasingly human-like text, distinguishing AI-generated content becomes harder. Current methods rely on generator-specific artifacts that become unstable as new models emerge, making generalization to unseen generators a key challenge.

Method: Progressive structured framework with: 1) compact latent encoding for semantic minimality, 2) perturbation-based regularization to reduce residual entanglement, and 3) discriminative adaptation to align representations with task objectives.

Result: Experiments on MAGE benchmark (20 LLMs across 7 categories) show up to 24.2% accuracy gain and 26.2% F1 improvement over SOTA. Performance improves with more diverse training generators, confirming scalability and generalization.

Conclusion: The framework effectively disentangles AI-detection semantics from generator artifacts, enabling robust generalization to unseen LLMs and showing strong scalability as training generator diversity increases.

Abstract: As large language models (LLMs) generate text that increasingly resembles human writing, the subtle cues that distinguish AI-generated content from human-written content become increasingly challenging to capture. Reliance on generator-specific artifacts is inherently unstable, since new models emerge rapidly and reduce the robustness of such shortcuts. This generalizes unseen generators as a central and challenging problem for AI-text detection. To tackle this challenge, we propose a progressively structured framework that disentangles AI-detection semantics from generator-aware artifacts. This is achieved through a compact latent encoding that encourages semantic minimality, followed by perturbation-based regularization to reduce residual entanglement, and finally a discriminative adaptation stage that aligns representations with task objectives. Experiments on MAGE benchmark, covering 20 representative LLMs across 7 categories, demonstrate consistent improvements over state-of-the-art methods, achieving up to 24.2% accuracy gain and 26.2% F1 improvement. Notably, performance continues to improve as the diversity of training generators increases, confirming strong scalability and generalization in open-set scenarios. Our source code will be publicly available at https://github.com/PuXiao06/DRGD.

[55] Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Models

Dhruv Sahnan, Subhabrata Dutta, Tanmoy Chakraborty, Preslav Nakov, Iryna Gurevych

Main category: cs.CL

TL;DR: Co-FactChecker: A human-AI collaborative claim verification framework where expert feedback guides model reasoning through trace-editing rather than dialogue.

Details

Motivation: LLMs/LRMs lack domain knowledge and contextual understanding for claim verification, creating a gap between expert-led and fully automated verification. Human-AI collaboration is needed but existing models are hard to calibrate to natural language feedback in multi-turn interactions.

Method: Proposes Co-FactChecker framework with a new interaction paradigm treating model’s thinking trace as shared scratchpad. Translates expert feedback into trace-edits that modify the trace directly, avoiding dialogue-based interaction limitations.

Result: Theoretical results show trace-editing advantages over multi-turn dialogue. Automatic evaluations demonstrate Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations show preference over multi-turn dialogue with higher quality reasoning, verdicts, and more interpretable/useful thinking traces.

Conclusion: Co-FactChecker provides effective human-AI collaboration for claim verification through trace-editing, addressing limitations of dialogue-based approaches and improving model reasoning with expert guidance.

Abstract: Professional fact-checkers rely on domain knowledge and deep contextual understanding to verify claims. Large language models (LLMs) and large reasoning models (LRMs) lack such grounding and primarily reason from available evidence alone, creating a mismatch between expert-led and fully automated claim verification. To mitigate this gap, we posit human-AI collaboration as a more promising path forward, where expert feedback, grounded in real-world knowledge and domain expertise, guides the model’s reasoning. However, existing LRMs are hard to calibrate to natural language feedback, particularly in a multi-turn interaction setup. We propose Co-FactChecker, a framework for human-AI collaborative claim verification. We introduce a new interaction paradigm that treats the model’s thinking trace as a shared scratchpad. Co-FactChecker translates expert feedback into trace-edits that introduce targeted modifications to the trace, sidestepping the shortcomings of dialogue-based interaction. We provide theoretical results showing that trace-editing offers advantages over multi-turn dialogue, and our automatic evaluations demonstrate that Co-FactChecker outperforms existing autonomous and human-AI collaboration approaches. Human evaluations further show that Co-FactChecker is preferred over multi-turn dialogue, producing higher quality reasoning and verdicts along with relatively easier to interpret and more useful thinking traces.

[56] Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs

Sinan Kurtyigit, Sabine Schulte im Walde, Alexander Fraser

Main category: cs.CL

TL;DR: RoBERTa-based metaphor detection models generalize to unseen verbs primarily through contextual patterns rather than lexical memorization, with context alone matching full-model performance on held-out lemmas.

Details

Motivation: To determine whether strong benchmark performance in metaphor detection reflects genuine transferable generalization or merely lexical memorization of specific words.

Method: Controlled lexical hold-out setup using RoBERTa with VU Amsterdam Metaphor Corpus, strictly excluding selected target verb lemmas during fine-tuning, then comparing predictions on held-out vs. exposed lemmas.

Result: Models maintain robust performance on held-out lemmas; sentence context alone matches full-model performance on held-out lemmas, while static verb embeddings are insufficient.

Conclusion: Generalization is primarily driven by learning transferable contextual patterns (“learning the cue”), with verb-specific memorization (“learning the word”) providing additive boost only when lexical exposure is available.

Abstract: Metaphor detection models achieve strong benchmark performance, yet it remains unclear whether this reflects transferable generalization or lexical memorization. To address this, we analyze generalization in metaphor detection through RoBERTa, the shared backbone of many state-of-the-art systems, focusing on English verbs using the VU Amsterdam Metaphor Corpus. We introduce a controlled lexical hold-out setup where all instances of selected target lemmas are strictly excluded from fine-tuning, and compare predictions on these Held-out lemmas against Exposed lemmas (verbs seen during fine-tuning). While the model performs best on Exposed lemmas, it maintains robust performance on Held-out lemmas. Further analysis reveals that sentence context alone is sufficient to match full-model performance on Held-out lemmas, whereas static verb-level embeddings are not. Together, these results suggest that generalization is primarily driven by “learning the cue” (transferable contextual patterns), while “learning the word” (verb-specific memorization) provides an additive boost when lexical exposure is available.

[57] An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

Ryan Lail

Main category: cs.CL

TL;DR: Improving LLM-as-a-judge accuracy through task-specific criteria injection and ensemble scoring, achieving 83.6% accuracy on RewardBench 2 without fine-tuning.

Details

Motivation: LLM-as-a-judge is widely used for scalable evaluation in RLHF pipelines and benchmarking, but judgment reliability heavily depends on prompting and aggregation strategies. The paper aims to improve judge accuracy through practical, drop-in techniques without fine-tuning.

Method: Empirical investigation of five techniques: 1) task-specific criteria injection, 2) ensemble scoring, 3) calibration context, 4) adaptive model escalation, and 5) soft blending. Focuses on GPT-5.4 models and evaluates on RewardBench 2.

Result: Task-specific criteria injection (+3.0pp) and ensemble scoring (+9.8pp) account for most gains. Combined they reach 83.6% accuracy (+11.9pp over 71.7% baseline). Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost.

Conclusion: Simple, practical techniques can significantly improve LLM judge accuracy without fine-tuning. Task-specific criteria and ensembling are most effective, making high-accuracy LLM judges accessible at low cost.

Abstract: LLM-as-a-judge, using a language model to score or rank candidate responses, is widely used as a scalable alternative to human evaluation in RLHF pipelines, benchmarking, and application layer evaluations (evals). However, judgment reliability depends heavily on prompting and aggregation strategy. We present an empirical investigation of practical, drop-in techniques that improve GPT-5.4 judge accuracy on RewardBench 2 without any finetuning. Two techniques account for nearly all available gains: task-specific criteria injection (+3.0pp at negligible cost) and ensemble scoring (+9.8pp at 5x cost). Combined, they reach 83.6% accuracy, +11.9pp over the 71.7% baseline. Our investigation also covers three further techniques (calibration context, adaptive model escalation, and soft blending) which did not reliably improve on criteria + ensembling at comparable cost. Cheaper model tiers benefit disproportionately from ensembling: GPT-5.4 mini with k=8 achieves 79.2% at 1.2x baseline cost, and GPT-5.4 nano with k=8 reaches 71.4% at 0.4x baseline cost, making high-accuracy LLM judges accessible at low cost.

[58] Doc-V*:Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

Yuanlei Zheng, Pei Fu, Hang Li, Ziyang Wang, Yuyi Zhang, Wenyu Ruan, Xiaojin Zhang, Zhongyu Wei, Zhenbo Luo, Jian Luan, Wei Chen, Xiang Bai

Main category: cs.CL

TL;DR: Doc-V* is an OCR-free agentic framework for multi-page document VQA that uses sequential evidence aggregation through active navigation and structured working memory.

Details

Motivation: Existing OCR-free methods for multi-page DocVQA face trade-offs between capacity and precision - end-to-end models scale poorly with long documents, while visual retrieval pipelines are brittle and passive.

Method: Proposes Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. It begins with thumbnail overview, actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in structured working memory for grounded reasoning. Trained with imitation learning from expert trajectories and optimized with Group Relative Policy Optimization.

Result: Outperforms open-source baselines and approaches proprietary models across five benchmarks. Improves out-of-domain performance by up to 47.9% over RAG baseline. Shows effective evidence aggregation with selective attention rather than increased input pages.

Conclusion: Doc-V* provides an effective OCR-free agentic approach for multi-page document VQA that balances answer accuracy with evidence-seeking efficiency through active navigation and structured memory.

Abstract: Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-$V^$, an \textbf{OCR-free agentic} framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-$V^$ begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-$V^$ balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-$V^$ outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to \textbf{47.9%} over RAG baseline. Other results reveal effective evidence aggregation with selective attention, not increased input pages.

[59] MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

Zhijie Bao, Fangke Chen, Licheng Bao, Chenhui Zhang, Wei Chen, Jiajie Peng, Zhongyu Wei

Main category: cs.CL

TL;DR: MedRCube introduces a comprehensive evaluation framework for multimodal large language models in medical imaging, featuring multidimensional fine-grained assessment and credibility evaluation to address limitations of existing coarse-grained metrics.

Details

Motivation: Existing evaluation practices for MLLMs in medical imaging report single or coarse-grained metrics that lack granularity for specialized clinical support and fail to assess reasoning reliability, creating a need for systematic evaluation aligned with real-world medical practice.

Method: Proposes a paradigm shift toward multidimensional, fine-grained evaluation with a two-stage systematic construction pipeline, instantiated as MedRCube. Includes credibility evaluation subset to quantify reasoning credibility and identify shortcut behaviors.

Result: Benchmarked 33 MLLMs, with Lingshu-32B achieving top-tier performance. MedRCube exposed insights inaccessible under prior evaluation settings and revealed a significant positive association between shortcut behavior and diagnostic task performance.

Conclusion: MedRCube provides a comprehensive evaluation framework for MLLMs in medical imaging that addresses limitations of existing methods, revealing important insights about model behavior and raising concerns for clinically trustworthy deployment.

Abstract: The potential of Multimodal Large Language Models (MLLMs) in domain of medical imaging raise the demands of systematic and rigorous evaluation frameworks that are aligned with the real-world medical imaging practice. Existing practices that report single or coarse-grained metrics are lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, \textit{Lingshu-32B} achieve top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncover a highly significant positive association between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.

[60] From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

Wenxuan Li, Zhenfei Zhang, Mi Zhang, Geng Hong, Mi Wen, Xiaoyu You, Min Yang

Main category: cs.CL

TL;DR: MAGE is a memory-graph guided erasure framework for LLM unlearning that uses minimal user anchors instead of forget sets, enabling corpus-free, auditable unlearning while preserving model utility.

Details

Motivation: Current machine unlearning methods rely on user-provided forget sets, which are difficult to audit, risk secondary leakage, and are vulnerable to malicious abuse. There's a need for a more practical, auditable approach that minimizes user input while effectively removing memorized content.

Method: MAGE uses lightweight user anchors to identify target entities, probes the LLM to recover target-related memorization, organizes this into a weighted local memory graph, and synthesizes scoped supervision for unlearning. It’s model-agnostic, works with standard unlearning methods, and requires no access to the original training corpus.

Result: Experiments on TOFU and RWKU benchmarks show MAGE’s self-generated supervision achieves unlearning performance comparable to supervision with external references while preserving overall model utility.

Conclusion: MAGE enables a practical, auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora, addressing privacy and legal concerns about LLM memorization.

Abstract: Large language models (LLMs) may memorize sensitive or copyrighted content, raising significant privacy and legal concerns. While machine unlearning has emerged as a potential remedy, prevailing paradigms rely on user-provided forget sets, making unlearning requests difficult to audit and exposing systems to secondary leakage and malicious abuse. We propose MAGE, a Memory-grAph Guided Erasure framework for user-minimized, corpus-free unlearning. Given only a lightweight user anchor that identifies a target entity, MAGE probes the target LLM to recover target-related memorization, organizes it into a weighted local memory graph, and synthesizes scoped supervision for unlearning. MAGE is model-agnostic, can be plugged into standard unlearning methods, and requires no access to the original training corpus. Experiments on two benchmarks, TOFU and RWKU, demonstrate that MAGE’s self-generated supervision achieves effective unlearning performance comparable to supervision generated with external reference, while preserving overall utility. These results support a practical and auditable unlearning workflow driven by minimal anchors rather than user-supplied forget corpora.

[61] QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs

Junlin Zhu, Baizhou Huang, Xiaojun Wan

Main category: cs.CL

TL;DR: QuantileMark: A multi-bit watermarking method for LLMs that ensures message symmetry by embedding messages in equal-probability bins of the cumulative distribution, preventing quality degradation or detection bias based on message content.

Details

Motivation: Current vocabulary-partition watermarks for LLMs break message symmetry in low-entropy contexts, causing some messages to have better quality/decoding than others. Need watermarking where message content doesn't systematically affect text quality or verification outcomes.

Method: QuantileMark embeds messages within the continuous cumulative probability interval [0,1). At each generation step, partitions this interval into M equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring fixed 1/M probability budget regardless of context entropy. Detection reconstructs same partition under teacher forcing and computes posteriors over latent bins.

Result: Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over baselines with negligible impact on generation quality. Proves message-unbiasedness property ensuring base distribution recovery when averaging over messages.

Conclusion: QuantileMark provides theoretical foundation for generation-side symmetry while equal-mass design promotes uniform evidence strength across messages. Addresses key requirement of message symmetry in provider-internal LLM deployments.

Abstract: As large language models become standard backends for content generation, practical provenance increasingly requires multi-bit watermarking. In provider-internal deployments, a key requirement is message symmetry: the message itself should not systematically affect either text quality or verification outcomes. Vocabulary-partition watermarks can break message symmetry in low-entropy decoding: some messages are assigned most of the probability mass, while others are forced to use tail tokens. This makes embedding quality and message decoding accuracy message-dependent. We propose QuantileMark, a white-box multi-bit watermark that embeds messages within the continuous cumulative probability interval $[0, 1)$. At each step, QuantileMark partitions this interval into $M$ equal-mass bins and samples strictly from the bin assigned to the target symbol, ensuring a fixed $1/M$ probability budget regardless of context entropy. For detection, the verifier reconstructs the same partition under teacher forcing, computes posteriors over latent bins, and aggregates evidence for verification. We prove message-unbiasedness, a property ensuring that the base distribution is recovered when averaging over messages. This provides a theoretical foundation for generation-side symmetry, while the equal-mass design additionally promotes uniform evidence strength across messages on the detection side. Empirical results on C4 continuation and LFQA show improved multi-bit recovery and detection robustness over strong baselines, with negligible impact on generation quality. Our code is available at GitHub (https://github.com/zzzjunlin/QuantileMark).

[62] ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

Shouzheng Huang, Meishan Zhang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: ToolOmni: A unified agentic framework for LLMs to use open-world tools via proactive retrieval and grounded execution within a reasoning loop, achieving state-of-the-art performance.

Details

Motivation: Existing methods for tool use in LLMs struggle in open-world scenarios with massive and evolving tool repositories, facing challenges in aligning user intent with tool semantics and generalizing to unseen tools, leading to suboptimal accuracy.

Method: ToolOmni uses a unified agentic framework with proactive retrieval and grounded execution within a reasoning loop. It involves: 1) constructing a cold-start multi-turn interaction dataset for SFT, and 2) open-world tool learning using a Decoupled Multi-Objective GRPO algorithm that optimizes both tool retrieval accuracy and execution efficacy.

Result: ToolOmni achieves state-of-the-art performance in both retrieval and execution, surpassing strong baselines by +10.8% in end-to-end execution success rate, while showing exceptional robustness and generalization capabilities.

Conclusion: ToolOmni effectively addresses open-world tool use challenges by combining proactive retrieval with grounded execution in a reasoning loop, demonstrating superior performance and generalization compared to existing methods.

Abstract: Large Language Models (LLMs) enhance their problem-solving capability by utilizing external tools. However, in open-world scenarios with massive and evolving tool repositories, existing methods relying on static embedding retrieval or parameter memorization of tools struggle to align user intent with tool semantics or generalize to unseen tools, respectively, leading to suboptimal accuracy of open-world tool retrieval and execution. To address these, we present ToolOmni, a unified agentic framework that enables LLMs for open-world tool use by proactive retrieval and grounded execution within a reasoning loop. First, we construct a cold-start multi-turn interaction dataset to instill foundational agentic capabilities via Supervised Fine-Tuning (SFT). Then, we introduce open-world tool learning based on a Decoupled Multi-Objective GRPO algorithm, which simultaneously optimizes LLMs for both tool retrieval accuracy and execution efficacy in online environments. Extensive experiments demonstrate that ToolOmni achieves state-of-the-art performance both in retrieval and execution, surpassing strong baselines by a significant margin of +10.8% in end-to-end execution success rate, while exhibiting exceptional robustness and generalization capabilities.

[63] MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

Zihao Liu, Hantao Zhou, Jiguo Li, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Peng Wang

Main category: cs.CL

TL;DR: MUSE is a multi-domain Chinese user simulation framework that generates human-like, controllable, and behaviorally consistent responses through iterative profile optimization, role-reversal fine-tuning, and rubric-guided reinforcement learning.

Details

Motivation: Existing user simulators often rely on shallow user profiling, struggle with persona consistency over long interactions, and are limited to English or single-domain settings, creating a need for more sophisticated multi-domain frameworks.

Method: Three-stage approach: 1) Iterative Profile Self-Evolution (IPSE) optimizes user profiles by comparing simulated trajectories with real dialogue behaviors; 2) Role-Reversal Supervised Fine-Tuning improves local response realism; 3) Rubric-guided multi-turn reinforcement learning with a specialized reward model enhances long-horizon behavioral consistency.

Result: MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.

Conclusion: MUSE provides an effective framework for multi-domain Chinese user simulation that addresses key limitations of existing approaches, particularly in maintaining persona consistency and generating human-like responses across extended interactions.

Abstract: User simulators are essential for the scalable training and evaluation of interactive AI systems. However, existing approaches often rely on shallow user profiling, struggle to maintain persona consistency over long interactions, and are largely limited to English or single-domain settings. We present MUSE, a multi-domain Chinese user simulation framework designed to generate human-like, controllable, and behaviorally consistent responses. First, we propose Iterative Profile Self-Evolution (IPSE), which gradually optimizes user profiles by comparing and reasoning discrepancies between simulated trajectories and real dialogue behaviors. We then apply Role-Reversal Supervised Fine-Tuning to improve local response realism and human-like expression. To enable fine-grained behavioral alignment, we further train a specialized rubric-based reward model and incorporate it into rubric-guided multi-turn reinforcement learning, which optimizes the simulator at the dialogue level and enhances long-horizon behavioral consistency. Experiments show that MUSE consistently outperforms strong baselines in both utterance-level and session-level evaluations, generating responses that are more realistic, coherent, and persona-consistent over extended interactions.

[64] Robust Reward Modeling for Large Language Models via Causal Decomposition

Yunsheng Lu, Zijiang Yang, Licheng Pan, Zhixuan Chu

Main category: cs.CL

TL;DR: A method to improve reward models for LLM alignment by learning a decoder that maps answers to latent intent embeddings, using reconstruction error to regularize training and reduce reliance on spurious cues like length and tone.

Details

Motivation: Reward models often overfit to spurious cues like response length and agreeable tone rather than properly grounding preferences in the prompt's intent, limiting their effectiveness in aligning LLMs.

Method: Learn a decoder that maps candidate answers to latent intent embeddings of input prompts, using reconstruction error as a regularization signal during reward model training to emphasize prompt-dependent information and suppress prompt-independent shortcuts.

Result: The decoder selects shorter and less sycophantic candidates with 0.877 accuracy; incorporating this signal into RM training improves RewardBench accuracy from 0.832 to 0.868; improves length-controlled win rates while producing shorter outputs; remains robust to lengthening and mild off-topic drift.

Conclusion: The proposed intent reconstruction approach effectively regularizes reward models to focus on prompt-dependent information rather than spurious cues, improving alignment performance across math, helpfulness, and safety benchmarks.

Abstract: Reward models are central to aligning large language models, yet they often overfit to spurious cues such as response length and overly agreeable tone. Most prior work weakens these cues directly by penalizing or controlling specific artifacts, but it does not explicitly encourage the model to ground preferences in the prompt’s intent. We learn a decoder that maps a candidate answer to the latent intent embedding of the input. The reconstruction error is used as a signal to regularize the reward model training. We provide theoretical evidence that this signal emphasizes prompt-dependent information while suppressing prompt-independent shortcuts. Across math, helpfulness, and safety benchmarks, the decoder selects shorter and less sycophantic candidates with 0.877 accuracy. Incorporating this signal into RM training in Gemma-2-2B-it and Gemma-2-9B-it increases RewardBench accuracy from 0.832 to 0.868. For Best-of-N selection, our framework increases length-controlled win rates while producing shorter outputs, and remains robust to lengthening and mild off-topic drift in controlled rewrite tests.

[65] Beyond Static Personas: Situational Personality Steering for Large Language Models

Zesheng Wei, Mengxiang Li, Zilei Wang, Yang Deng

Main category: cs.CL

TL;DR: IRIS is a training-free framework for situational personality steering in LLMs using neuron-based identification, retrieval, and steering mechanisms.

Details

Motivation: Existing LLM personalization methods have limited controllability, high resource demands, and rely on static personality modeling that lacks adaptability across varying situations.

Method: IRIS uses a three-step framework: 1) situational persona neuron identification, 2) situation-aware neuron retrieval, and 3) similarity-weighted steering, all without requiring training.

Result: IRIS outperforms best-performing baselines on PersonalityBench and the new SPBench, demonstrating generalization and robustness to complex, unseen situations and different model architectures.

Conclusion: The neuron-based IRIS framework enables effective situational personality steering in LLMs, addressing limitations of existing personalization methods.

Abstract: Personalized Large Language Models (LLMs) facilitate more natural, human-like interactions in human-centric applications. However, existing personalization methods are constrained by limited controllability and high resource demands. Furthermore, their reliance on static personality modeling restricts adaptability across varying situations. To address these limitations, we first demonstrate the existence of situation-dependency and consistent situation-behavior patterns within LLM personalities through a multi-perspective analysis of persona neurons. Building on these insights, we propose IRIS, a training-free, neuron-based Identify-Retrieve-Steer framework for advanced situational personality steering. Our approach comprises situational persona neuron identification, situation-aware neuron retrieval, and similarity-weighted steering. We empirically validate our framework on PersonalityBench and our newly introduced SPBench, a comprehensive situational personality benchmark. Experimental results show that our method surpasses best-performing baselines, demonstrating IRIS’s generalization and robustness to complex, unseen situations and different models architecture.

[66] Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.

[67] Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

Sasha Boguraev, Kyle Mahowald

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We show how causal interventions in Transformer models provide insights into English syntax by focusing on a long-standing challenge for syntactic theory: syntactic islands. Extraction from coordinated verb phrases is often degraded, yet acceptability varies gradiently with lexical content (e.g., “I know what he hates art and loves” vs. “I know what he looked down and saw”). We show that modern Transformer language models replicate human judgments across this gradient. Using causal interventions that isolate functionally relevant subspaces in Transformer blocks, attention modules, and MLPs, we demonstrate that extraction from coordination islands engages the same filler-gap mechanisms as canonical wh-dependencies, but that these mechanisms are selectively blocked to varying degrees. By projecting a large corpus of unrelated text onto these causally identified subspaces, we derive a novel linguistic hypothesis: the conjunction “and” is represented differently in extractable versus non-extractable constructions, corresponding to expressions encoding relational dependencies versus purely conjunctive uses. These results illustrate how mechanistic interpretability can inform syntax, generating new hypotheses about linguistic representation and processing.

[68] How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop \textbf{\textsc{FinePhrase}}, a 486-billion-token open dataset of rephrased web text. We show that \textsc{FinePhrase} outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.

[69] Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs

Hussein Abdallah, Ibrahim Abdelaziz, Panos Kalnis, Essam Mansour

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Open-world Question Answering (OW-QA) over knowledge graphs (KGs) aims to answer questions over incomplete or evolving KGs. Traditional KGQA assumes a closed world where answers must exist in the KG, limiting real-world applicability. In contrast, open-world QA requires inferring missing knowledge based on graph structure and context. Large language models (LLMs) excel at language understanding but lack structured reasoning. Graph neural networks (GNNs) model graph topology but struggle with semantic interpretation. Existing systems integrate LLMs with GNNs or graph retrievers. Some support open-world QA but rely on structural embeddings without semantic grounding. Most assume observed paths or complete graphs, making them unreliable under missing links or multi-hop reasoning. We present GLOW, a hybrid system that combines a pre-trained GNN and an LLM for open-world KGQA. The GNN predicts top-k candidate answers from the graph structure. These, along with relevant KG facts, are serialized into a structured prompt (e.g., triples and candidates) to guide the LLM’s reasoning. This enables joint reasoning over symbolic and semantic signals, without relying on retrieval or fine-tuning. To evaluate generalization, we introduce GLOW-BENCH, a 1,000-question benchmark over incomplete KGs across diverse domains. GLOW outperforms existing LLM-GNN systems on standard benchmarks and GLOW-BENCH, achieving up to 53.3% and an average 38% improvement. GitHub code and data are available.

[70] Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models

Aleksandr Rubashevskii, Dzianis Piatrashyn, Preslav Nakov, Maxim Panov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large language models (LLMs) are prone to generating factually incorrect outputs. Recent work has applied conformal prediction to provide uncertainty estimates and statistical guarantees for the factuality of LLM generations. However, existing approaches are typically not prompt-adaptive, limiting their ability to capture input-dependent variability. As a result, they may filter out too few items (leading to over-coverage) or too many (under-coverage) for a given task or prompt. We propose an adaptive conformal prediction approach that extends conformal score transformation methods to LLMs, with applications to long-form generation and multiple-choice question answering. This enables prompt-dependent calibration, retaining marginal coverage guarantees while improving conditional coverage. In addition, the approach naturally supports selective prediction, allowing unreliable claims or answer choices to be filtered out in downstream applications. We evaluate our approach on multiple white-box models across diverse domains and show that it significantly outperforms existing baselines in terms of conditional coverage.

[71] Diffusion Language Models for Speech Recognition

Davyd Naveriani, Albert Zeyer, Ralf Schlüter, Hermann Ney

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion language models have recently emerged as a leading alternative to standard language models, due to their ability for bidirectional attention and parallel text generation. In this work, we explore variants for their use in speech recognition. Specifically, we introduce a comprehensive guide to incorporating masked diffusion language models (MDLM) and uniform-state diffusion models (USDMs) for rescoring ASR hypotheses. Additionally, we design a new joint-decoding method that combines CTC and USDM by integrating the framewise probability distributions derived from CTC with the labelwise probability distributions computed by USDM at each decoding step, thereby generating new candidates that combine strong language knowledge from USDM and acoustic information from CTC. Our findings reveal that USDM, as well as MDLM, can significantly improve the accuracy of recognized text. We publish all our code and recipes.

[72] Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model

Zhe Huang, Peng Wang, Yan Zheng, Sen Song, Longjun Cai

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Product bundling boosts e-commerce revenue by recommending complementary item combinations. However, existing methods face two critical challenges: (1) collaborative filtering approaches struggle with cold-start items owing to dependency on historical interactions, and (2) LLMs lack inherent capability to model interactive graph directly. To bridge this gap, we propose a dual-enhancement method that integrates interactive graph learning and LLM-based semantic understanding for product bundling. Our method introduces a graph-to-text paradigm, which leverages a Dynamic Concept Binding Mechanism (DCBM) to translate graph structures into natural language prompts. The DCBM plays a critical role in aligning domain-specific entities with LLM tokenization, enabling effective comprehension of combinatorial constraints. Experiments on three benchmarks (POG, POG_dense, Steam) demonstrate 6.3%-26.5% improvements over state-of-the-art baselines.

[73] From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution

Pavel Chizhov, Egor Bogomolov, Ivan P. Yamshchikov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Efficiency and safety of Large Language Models (LLMs), among other factors, rely on the quality of tokenization. A good tokenizer not only improves inference speed and language understanding but also provides extra defense against jailbreak attacks and lowers the risk of hallucinations. In this work, we investigate the efficiency of code tokenization, in particular from the perspective of data source diversity. We demonstrate that code tokenizers are prone to producing unused, and thus under-trained, tokens due to the imbalance in repository and language diversity in the training data, as well as the dominance of source-specific, repetitive tokens that are often unusable in future inference. By modifying the BPE objective and introducing merge skipping, we implement different techniques under the name Source-Attributed BPE (SA-BPE) to regularize BPE training and minimize overfitting, thereby substantially reducing the number of under-trained tokens while maintaining the same inference procedure as with regular BPE. This provides an effective tool suitable for production use.

[74] From Weights to Activations: Is Steering the Next Frontier of Adaptation?

Simon Ostermann, Daniil Gurgurov, Tanja Baeumel, Michael A. Hedderich, Sebastian Lapuschkin, Wojciech Samek, Vera Schmitt

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Post-training adaptation of language models is commonly achieved through parameter updates or input-based methods such as fine-tuning, parameter-efficient adaptation, and prompting. In parallel, a growing body of work modifies internal activations at inference time to influence model behavior, an approach known as steering. Despite increasing use, steering is rarely analyzed within the same conceptual framework as established adaptation methods. In this work, we argue that steering should be regarded as a form of model adaptation. We introduce a set of functional criteria for adaptation methods and use them to compare steering approaches with classical alternatives. This analysis positions steering as a distinct adaptation paradigm based on targeted interventions in activation space, enabling local and reversible behavioral change without parameter updates. The resulting framing clarifies how steering relates to existing methods, motivating a unified taxonomy for model adaptation.

[75] Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies

Swati Rallapalli, Shannon Gallagher, Ronald Yurko, Tyler Brooks, Chuck Loughin, Michele Sezgin, Violet Turri

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Large Language Models (LLMs) are now capable of generating highly fluent, human-like text. They enable many applications, but also raise concerns such as large scale spam, phishing, or academic misuse. While much work has focused on detecting LLM-generated text, only limited work has gone into understanding the stylistic differences between human-written and machine-generated text. In this work, we perform a large scale analysis of stylistic variation across human-written text and outputs from 11 LLMs spanning 8 different genres and 4 decoding strategies using Douglas Biber’s set of lexicogrammatical and functional features. Our findings reveal insights that can guide intentional LLM usage. First, key linguistic differentiators of LLM-generated text seem robust to generation conditions (e.g., prompt settings to nudge them to generate human-like text, or availability of human-written text to continue the style); second, genre exerts a stronger influence on stylistic features than the source itself; third, chat variants of the models generally appear to be clustered together in stylistic space, and finally, model has a larger effect on the style than decoding strategy, with some exceptions. These results highlight the relative importance of model and genre over prompting and decoding strategies in shaping the stylistic behavior of machine-generated text.

[76] Correct Prediction, Wrong Steps? Consensus Reasoning Knowledge Graph for Robust Chain-of-Thought Synthesis

Zipeng Ling, Shuliang Liu, Shenghong Fu, Yuehao Tang, Seonil Son, Yao Wan, Xuming Hu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: LLM reasoning traces suffer from complex flaws – Step Internal Flaws (logical errors, hallucinations, etc.) and Step-wise Flaws (overthinking, underthinking), which vary by sample. A natural approach would be to provide ground-truth labels to guide LLMs’ reasoning. Contrary to intuition, we show that this yields no improvement in reasoning ability. We then propose CRAFT, a unified framework that mitigates both types of Step flaws, which builds a Reasoning Knowledge Graph (RKG) based on the consensus parts of multiple candidate traces, and synthesizes a high-quality trace through topological generation. Our approach improves label-prediction accuracy by 10+% on average, and consistently outperforms all baselines across both logical and mathematical reasoning benchmarks. Further, detailed benchmark evaluation proves that our method also improves the quality of LLMs’ reasoning traces in multiple dimensions.

[77] Rhetorical Questions in LLM Representations: A Linear Probing Study

Louie Hong Yao, Vishesh Anand, Yuan Zhuang, Tianyu Jiang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Rhetorical questions are asked not to seek information but to persuade or signal stance. How large language models internally represent them remains unclear. We analyze rhetorical questions in LLM representations using linear probes on two social-media datasets with different discourse contexts, and find that rhetorical signals emerge early and are most stably captured by last-token representations. Rhetorical questions are linearly separable from information-seeking questions within datasets, and remain detectable under cross-dataset transfer, reaching AUROC around 0.7-0.8. However, we demonstrate that transferability does not simply imply a shared representation. Probes trained on different datasets produce different rankings when applied to the same target corpus, with overlap among the top-ranked instances often below 0.2. Qualitative analysis shows that these divergences correspond to distinct rhetorical phenomena: some probes capture discourse-level rhetorical stance embedded in extended argumentation, while others emphasize localized, syntax-driven interrogative acts. Together, these findings suggest that rhetorical questions in LLM representations are encoded by multiple linear directions emphasizing different cues, rather than a single shared direction.

[78] From Feelings to Metrics: Understanding and Formalizing How Users Vibe-Test LLMs

Itay Itzhak, Eliya Habba, Gabriel Stanovsky, Yonatan Belinkov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Evaluating LLMs is challenging, as benchmark scores often fail to capture models’ real-world usefulness. Instead, users often rely on ``vibe-testing’’: informal experience-based evaluation, such as comparing models on coding tasks related to their own workflow. While prevalent, vibe-testing is often too ad hoc and unstructured to analyze or reproduce at scale. In this work, we study how vibe-testing works in practice and then formalize it to support systematic analysis. We first analyze two empirical resources: (1) a survey of user evaluation practices, and (2) a collection of in-the-wild model comparison reports from blogs and social media. Based on these resources, we formalize vibe-testing as a two-part process: users personalize both what they test and how they judge responses. We then introduce a proof-of-concept evaluation pipeline that follows this formulation by generating personalized prompts and comparing model outputs using user-aware subjective criteria. In experiments on coding benchmarks, we find that combining personalized prompts and user-aware evaluation can change which model is preferred, reflecting the role of vibe-testing in practice. These findings suggest that formalized vibe-testing can serve as a useful approach for bridging benchmark scores and real-world experience.

[79] Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

Zhe Hu, Tuo Liang, Jing Li, Yiren Lu, Yunlai Zhou, Yiran Qiao, Jing Ma, Yu Yin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI’s capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.

Hasin Jawad Ali, Ajwad Abrar, S. M. Hozaifa Hossain, M. Firoz Mridha

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In politically sensitive scenarios like wars, social media serves as a platform for polarized discourse and expressions of strong ideological stances. While prior studies have explored ideological stance detection in general contexts, limited attention has been given to conflict-specific settings. This study addresses this gap by analyzing 9,969 Reddit comments related to the Israel-Palestine conflict, collected between October 2023 and August 2024. The comments were categorized into three stance classes: Pro-Israel, Pro-Palestine, and Neutral. Various approaches, including machine learning, pre-trained language models, neural networks, and prompt engineering strategies for open source large language models (LLMs), were employed to classify these stances. Performance was assessed using metrics such as accuracy, precision, recall, and F1-score. Among the tested methods, the Scoring and Reflective Re-read prompt in Mixtral 8x7B demonstrated the highest performance across all metrics. This study provides comparative insights into the effectiveness of different models for detecting ideological stances in highly polarized social media contexts. The dataset used in this research is publicly available for further exploration and validation.

[81] A closer look at how large language models trust humans: patterns and biases

Valeria Lerman, Yaniv Dover

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: As large language models (LLMs) and LLM-based agents increasingly interact with humans in decision-making contexts, understanding the trust dynamics between humans and AI agents becomes a central concern. While considerable literature studies how humans trust AI agents, it is much less understood how LLM-based agents develop effective trust in humans. LLM-based agents likely rely on some sort of implicit effective trust in trust-related contexts (e.g., evaluating individual loan applications) to assist and affect decision making. Using established behavioral theories, we develop an approach that studies whether LLMs trust depends on the three major trustworthiness dimensions: competence, benevolence and integrity of the human subject. We also study how demographic variables affect effective trust. Across 43,200 simulated experiments, for five popular language models, across five different scenarios we find that LLM trust development shows an overall similarity to human trust development. We find that in most, but not all cases, LLM trust is strongly predicted by trustworthiness, and in some cases also biased by age, religion and gender, especially in financial scenarios. This is particularly true for scenarios common in the literature and for newer models. While the overall patterns align with human-like mechanisms of effective trust formation, different models exhibit variation in how they estimate trust; in some cases, trustworthiness and demographic factors are weak predictors of effective trust. These findings call for a better understanding of AI-to-human trust dynamics and monitoring of biases and trust development patterns to prevent unintended and potentially harmful outcomes in trust-sensitive applications of AI.

[82] MulDimIF: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Instruction following refers to the ability of large language models (LLMs) to generate outputs that satisfy all specified constraints. Existing research has primarily focused on constraint categories, offering limited evaluation dimensions and little guidance for improving instruction-following abilities. To address this gap, we introduce MulDimIF, a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Based on this framework, we design a controllable instruction generation pipeline. Through constraint expansion, conflict detection, and instruction rewriting, we construct 9,106 code-verifiable samples. We evaluate 18 LLMs from six model families and find marked performance differences across constraint settings. For instance, average accuracy decreases from 80.82% at Level I to 36.76% at Level IV. Moreover, training with data generated by our framework significantly improves instruction following without compromising general performance. In-depth analysis indicates that these gains stem largely from parameter updates in attention modules, which strengthen constraint recognition and adherence. Code and data are available in https://github.com/Junjie-Ye/MulDimIF.

[83] Bridging Compositional and Distributional Semantics: A Survey on Latent Semantic Geometry via AutoEncoder

Yingji Zhang, Danilo S. Carvalho, André Freitas

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Integrating compositional and symbolic properties into current distributional semantic spaces can enhance the interpretability, controllability, compositionality, and generalisation capabilities of Transformer-based auto-regressive language models (LMs). In this survey, we offer a novel perspective on latent space geometry through the lens of compositional semantics, a direction we refer to as \textit{semantic representation learning}. This direction enables a bridge between symbolic and distributional semantics, helping to mitigate the gap between them. We review and compare three mainstream autoencoder architectures-Variational AutoEncoder (VAE), Vector Quantised VAE (VQVAE), and Sparse AutoEncoder (SAE)-and examine the distinctive latent geometries they induce in relation to semantic structure and interpretability.

[84] Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments

Junjie Ye, Changhao Jiang, Zhengyin Du, Yufei Xu, Xuesong Yao, Zhiheng Xi, Xiaoran Fan, Qi Zhang, Tao Gui, Xuanjing Huang, Jiecao Chen

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Effective tool use is essential for large language models (LLMs) to interact with their environment. However, progress is limited by the lack of efficient reinforcement learning (RL) frameworks specifically designed for tool use, due to challenges in constructing stable training environments and designing verifiable reward mechanisms. To address this, we propose an automated environment construction pipeline, incorporating scenario decomposition, document generation, function integration, complexity scaling, and localized deployment. This enables the creation of high-quality training environments that provide detailed and measurable feedback without relying on external tools. Additionally, we introduce a verifiable reward mechanism that evaluates both the precision of tool use and the completeness of task execution. When combined with trajectory data collected from the constructed environments, this mechanism integrates seamlessly with standard RL algorithms to facilitate feedback-driven model training. Experiments on LLMs of varying scales demonstrate that our approach significantly enhances the models’ tool-use performance without degrading their general capabilities. Our analysis suggests that these gains result from improved context understanding and reasoning, driven by updates to the lower-layer MLP parameters in models. Code and data are available at https://github.com/bytedance/FTRL.

[85] MARCH: Evaluating the Intersection of Ambiguity Interpretation and Multi-hop Inference

Jeonghyun Park, Ingeol Baek, Seunghyun Yoon, Haeun Jang, Aparna Garimella, Akriti Jain, Nedim Lipka, Hwanhee Lee

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Real-world multi-hop QA is naturally linked with ambiguity, where a single query can trigger multiple reasoning paths that require independent resolution. Since ambiguity can occur at any stage, models must navigate layered uncertainty throughout the entire reasoning chain. Despite its prevalence in real-world user queries, previous benchmarks have primarily focused on single-hop ambiguity, leaving the complex interaction between multi-step inference and layered ambiguity underexplored. In this paper, we introduce MARCH, a benchmark for their intersection, with 2,209 multi-hop ambiguous questions curated via multi-LLM verification and validated by human annotation with strong agreement. Our experiments reveal that even state-of-the-art models struggle with MARCH, confirming that combining ambiguity resolution with multi-step reasoning is a significant challenge. To address this, we propose CLARION, a two-stage agentic framework that explicitly decouples ambiguity planning from evidence-driven reasoning, significantly outperforms existing approaches, and paves the way for robust reasoning systems.

[86] Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra & inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single softmax attention operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.

[87] SPG: Sandwiched Policy Gradient for Masked Diffusion Language Models

Chenyu Wang, Paria Rashidinejad, DiJia Su, Song Jiang, Sid Wang, Siyan Zhao, Cai Zhou, Shannon Zejiang Shen, Feiyu Chen, Tommi Jaakkola, Yuandong Tian, Bo Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Diffusion large language models (dLLMs) are emerging as an efficient alternative to autoregressive models due to their ability to decode multiple tokens in parallel. However, aligning dLLMs with human preferences or task-specific rewards via reinforcement learning (RL) is challenging because their intractable log-likelihood precludes the direct application of standard policy gradient methods. While prior work uses surrogates like the evidence lower bound (ELBO), these one-sided approximations can introduce significant policy gradient bias. To address this, we propose the Sandwiched Policy Gradient (SPG) that leverages both an upper and a lower bound of the true log-likelihood. Experiments show that SPG significantly outperforms baselines based on ELBO or one-step estimation. Specifically, SPG improves the accuracy over state-of-the-art RL methods for dLLMs by 3.6% in GSM8K, 2.6% in MATH500, 18.4% in Countdown and 27.0% in Sudoku.

[88] Beyond Black-Box Interventions: Latent Probing for Faithful Retrieval-Augmented Generation

Linfeng Gao, Qinggang Zhang, Baolong Bi, Bo Zeng, Zheng Yuan, Zerui Chen, Zhimin Wei, Shenghua Liu, Linlong Xu, Longyue Wang, Weihua Luo, Jinsong Su

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Retrieval-Augmented Generation (RAG) systems often fail to maintain contextual faithfulness, generating responses that conflict with the provided context or fail to fully leverage the provided evidence. Existing methods attempt to improve faithfulness through external interventions, such as specialized prompting, decoding-based calibration, or preference optimization. However, since these approaches treat the LLM as a black box, they lack a reliable mechanism to assess when and why knowledge conflicts occur. Consequently, they tend to be brittle, data-intensive, and agnostic to the model’s internal reasoning process. In this paper, we move beyond black-box interventions to analyze the model’s internal reasoning process. We discover that conflicting and aligned knowledge states are linearly separable in the model’s latent space, and contextual noise systematically increases the entropy of these representations. Based on these findings, we propose ProbeRAG, a novel framework for faithful RAG that operates in three stages: (i) fine-grained knowledge pruning to filter irrelevant context, (ii) latent conflict probing to identify hard conflicts in the model’s latent space, and (iii) conflict-aware attention to modulate attention heads toward faithful context integration. Extensive experiments demonstrate that ProbeRAG substantially improves both accuracy and contextual faithfulness. The related resources are available at https://github.com/LinfengGao/ProbeRAG.

[89] Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, Hai Zhao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.10696: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10696&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[90] Language steering in latent space to mitigate unintended code-switching

Andrey Goncharov, Nikolai Kondusov, Alexey Zaytsev

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.13849: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13849&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[91] Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning

Xinglang Zhang, Yunyao Zhang, ZeLiang Chen, Junqing Yu, Wei Yang, Zikai Song

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.02902: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02902&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[92] ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Nikola Ljubešić, Peter Rupnik, Ivan Porupski, Taja Kuzman Pungeršek

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.01619: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01619&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[93] LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

Jian Gao, Richeng Xuan, Zhaolu Kang, Dingshi Liao, Wenxin Huang, Zongmou Huang, Yangdi Xu, Bowen Qin, Zheqi He, Xi Yang, Changjin Li, Yonghua Lin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.11334: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11334&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[94] fMRI-LM: Towards a Universal Foundation Model for Language-Aligned fMRI Understanding

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Chengxuan Qian, Tianyang Wang, Vince D. Calhoun

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.21760: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21760&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[95] TRIM: Hybrid Inference via Targeted Stepwise Routing in Multi-Step Reasoning Tasks

Vansh Kapoor, Aman Gupta, Hao Chen, Anurag Beniwal, Jing Huang, Aviral Kumar

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.10245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[96] Mitigating Catastrophic Forgetting in Target Language Adaptation of LLMs via Source-Shielded Updates

Atsuki Yamaguchi, Terufumi Morishita, Aline Villavicencio, Nikolaos Aletras

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.04844: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04844&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[97] Reducing Hallucinations in LLMs via Factuality-Aware Preference Learning

Sindhuja Chaduvula, Ahmed Y. Radwan, Azib Farooq, Yani Ioannou, Shaina Raza

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.03027: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03027&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[98] Exposía: Teaching and Assessment of Academic Writing Skills for Research Project Proposals and Peer Feedback

Dennis Zyska, Alla Rozovskaya, Ilia Kuznetsov, Iryna Gurevych

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.06536: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06536&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[99] H-AdminSim: A Multi-Agent Simulator for Realistic Hospital Administrative Workflows with FHIR Integration

Jun-Min Lee, Meong Hi Son, Edward Choi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.05407: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.05407&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[100] Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations

Wen Luo, Guangyue Peng, Wei Li, Shaohang Wei, Feifan Song, Liang Wang, Nan Yang, Xingxing Zhang, Jing Jin, Furu Wei, Houfeng Wang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.07422: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.07422&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[101] ExpSeek: Self-Triggered Experience Seeking for Web Agents

Wenyuan Zhang, Xinghua Zhang, Haiyang Yu, Shuaiyi Nie, Bingli Wu, Juwei Yue, Tingwen Liu, Yongbin Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.08605: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08605&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[102] F-Actor: Controllable Conversational Behaviour in Full-Duplex Models

Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch, Tsz Kin Lam

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.11329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[103] Neural Chain-of-Thought Search: Searching the Optimal Reasoning Path to Enhance Large Language Models

Guoming Ling, Zhongzhan Huang, Yupei Lin, Junxin Li, Shanshan Zhong, Hefeng Wu, Liang Lin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.11340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[104] Agentic Conversational Search with Contextualized Reasoning via Reinforcement Learning

Fengran Mo, Yifan Gao, Sha Li, Hansi Zeng, Xin Liu, Zhaoxuan Tan, Xian Li, Jianshu Chen, Dakuo Wang, Meng Jiang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.13115: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13115&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[105] Common to Whom? Regional Cultural Commonsense and LLM Bias in India

Sangmitra Madhusudan, Trush Shashank More, Steph Buongiorno, Renata Dividino, Jad Kabbara, Ali Emami

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.15550: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15550&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[106] Sparse or Dense? A Mechanistic Estimation of Computation Density in Transformer-based LLMs

Corentin Kervadec, Iuliia Lysova, Marco Baroni, Gemma Boleda

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.22795: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22795&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[107] When ‘YES’ Meets ‘BUT’: Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

Tuo Liang, Zhe Hu, Jing Li, Hao Zhang, Yiren Lu, Yunlai Zhou, Yiran Qiao, Disheng Liu, Jeirui Peng, Jing Ma, Yu Yin

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2503.23137: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.23137&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[108] IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

David Gringras

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[109] Evaluating LLM-Based Translation of a Low-Resource Technical Language: The Medical and Philosophical Greek of Galen

James L. Zainaldin, Cameron Pattison, Manuela Marai, Jacob Wu, Mark J. Schiefsky

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.24119: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24119&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[110] Just Use XML: Revisiting Joint Translation and Label Projection

Thennal DK, Chris Biemann, Hans Ole Hatzel

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.12021: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12021&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[111] PICon: A Multi-Turn Interrogation Framework for Evaluating Persona Agent Consistency

Minseo Kim, Sujeong Im, Junseong Choi, Junhee Lee, Chaeeun Shim, Hwajung Hong, Edward Choi

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.25620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[112] Kwame 2.0: Human-in-the-Loop Generative AI Teaching Assistant for Large Scale Online Coding Education in Africa

George Boateng, Samuel Boateng, Victor Kumbol

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.29159: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29159&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[113] Failure Makes the Agent Stronger: Enhancing Accuracy through Structured Reflection for Reliable Tool Interactions

Junhao Su, Yuanliang Wan, Junwei Yang, Hengyu Shi, Tianyang Han, Junfeng Luo, Yurui Qiu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.18847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[114] Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

Yihong Dong, Jianha Xiao, Xue Jiang, Xuyuan Guo, Zhiyuan Fan, Jiaru Qian, Kechi Zhang, Jia Li, Zhi Jin, Ge Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.02709: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02709&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[115] RAG or Learning? Understanding the Limits of LLM Adaptation under Continuous Knowledge Drift in the Real World

Hanbing Liu, Lang Cao, Yang Li

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.05096: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05096&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[116] ValueGround: Evaluating Culture-Conditioned Visual Value Grounding in MLLMs

Zhipin Wang, Christoph Leiter, Christian Frey, Mohamed Hesham Ibrahim Abdalla, Josif Grabocka, Steffen Eger

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.06484: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06484&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[117] Rag Performance Prediction for Question Answering

Or Dado, David Carmel, Oren Kurland

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07985: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07985&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[118] Guaranteeing Knowledge Integration with Joint Decoding for Retrieval-Augmented Generation

Zhengyi Zhao, Shubo Zhang, Zezhong Wang, Yuxi Zhang, Huimin Wang, Yutian Zhao, Yefeng Zheng, Binyang Li, Kam-Fai Wong, Xian Wu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.08046: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08046&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[119] Deep Learning Based Amharic Chatbot for FAQs in Universities

Goitom Ybrah Hailu, Hadush Hailu, Shishay Welay

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2402.01720: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.01720&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[120] Claim2Vec: Embedding Fact-Check Claims for Multilingual Similarity and Clustering

Rrubaa Panchendrarajan, Arkaitz Zubiaga

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[121] Think in Sentences: Explicit Sentence Boundaries Enhance Language Model’s Capabilities

Zhichen Liu, Yongyuan Li, Yang Xu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10135: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10135&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[122] Two-Stage Regularization-Based Structured Pruning for LLMs

Mingkuan Feng, Jinyang Wu, Siyuan Liu, Shuai Zhang, Ruihan Jin, Feihu Che, Pengpeng Shao, Zhengqi Wen, Jianhua Tao

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.18232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.18232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[123] How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts

Minh-Vuong Nguyen, Fatemeh Shiri, Zhuang Li, Karin Verspoor

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11133: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11133&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[124] LangFlow: Continuous Diffusion Rivals Discrete in Language Modeling

Yuxin Chen, Chumeng Liang, Hangke Sui, Ruihan Guo, Chaoran Cheng, Jiaxuan You, Ge Liu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[125] Masked by Consensus: Disentangling Privileged Knowledge in LLM Correctness

Tomer Ashuach, Liat Ein-Dor, Shai Gretz, Yoav Katz, Yonatan Belinkov

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12373: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12373&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[126] Activation-Guided Local Editing for Jailbreaking Attacks

Jiecong Wang, Haoran Li, Hao Peng, Ziqian Zeng, Zihao Wang, Haohua Du, Zhengtao Yu

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.00555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.00555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[127] Growing Pains: Extensible and Efficient LLM Benchmarking Via Fixed Parameter Calibration

Eliya Habba, Itay Itzhak, Asaf Yehudai, Yotam Perlitz, Elron Bandel, Michal Shmueli-Scheuer, Leshem Choshen, Gabriel Stanovsky

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12843: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12843&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[128] CodeFlowBench: A Multi-turn, Iterative Benchmark for Complex Code Generation

Sizhe Wang, Zhengren Wang, Dongsheng Ma, Yongan Yu, Rui Ling, Zhiyu Li, Feiyu Xiong, Wentao Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2504.21751: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.21751&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[129] Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning

Hanbing Liu, Lang Cao, Yuanyi Ren, Mengyu Zhou, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.08125: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08125&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[130] Scaling Test-Time Compute to Achieve IOI Gold Medal with Open-Weight Models

Mehrzad Samadi, Aleksander Ficek, Sean Narenthiran, Siddhartha Jain, Wasi Uddin Ahmad, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.14232: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14232&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[131] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.04442: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04442&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[132] Coherence in the brain unfolds across separable temporal regimes

Davide Staub, Finn Rabe, Akhil Misra, Yves Pauli, Roya Hüppi, Ni Yang, Nils Lang, Lars Michels, Victoria Edkins, Sascha Frühholz, Iris Sommer, Wolfram Hinzen, Philipp Homan

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.20481: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.20481&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[133] Working Notes on Late Interaction Dynamics: Analyzing Targeted Behaviors of Late Interaction Models

Antoine Edy, Max Conti, Quentin Macé

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.26259: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26259&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[134] ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding

Jovana Kondic, Pengyuan Li, Dhiraj Joshi, Isaac Sanchez, Ben Wiesel, Shafiq Abedin, Amit Alfassy, Eli Schwartz, Daniel Caraballo, Yagmur Gizem Cinar, Florian Scheidegger, Steven I. Ross, Daniel Karl I. Weidele, Hang Hua, Ekaterina Arutyunova, Roei Herzig, Zexue He, Zihan Wang, Xinyue Yu, Yunfei Zhao, Sicong Jiang, Minghao Liu, Qunshu Lin, Peter Staar, Luis Lastras, Aude Oliva, Rogerio Feris

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.27064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[135] VLMs Need Words: Vision Language Models Ignore Visual Detail In Favor of Semantic Anchors

Haz Sameen Shahgir, Xiaofu Chen, Yu Fu, Erfan Shayegani, Nael Abu-Ghazaleh, Yova Kementchedjhieva, Yue Dong

Main category: cs.CL

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.02486: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02486&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

cs.CV

[136] A Lightweight Multi-Metric No-Reference Image Quality Assessment Framework for UAV Imaging

Koffi Titus Sergio Aglin, Anthony K. Muchiri, Celestin Nkundineza

Main category: cs.CV

TL;DR: MM-IQA is a lightweight no-reference image quality assessment framework that combines multiple interpretable distortion cues (blur, edges, resolution artifacts, exposure, noise, haze, frequency) into a single quality score.

Details

Motivation: Need for reliable image quality assessment in automated image acquisition systems where pristine reference images are unavailable, requiring efficient no-reference IQA methods for filtering large volumes of images before analysis.

Method: Multi-metric framework combining interpretable distortion cues: blur, edge structure, low resolution artifacts, exposure imbalance, noise, haze, and frequency content. Uses Python/OpenCV implementation with modest memory requirements storing only limited intermediate representations.

Result: Achieved SRCC values of 0.647-0.830 on five benchmark datasets (KonIQ-10k, LIVE Challenge, KADID-10k, TID2013, BIQ2021). Consistent performance on synthetic agricultural dataset. Processing time ~1.97s per image with linear memory scaling.

Conclusion: MM-IQA enables fast image quality screening with explicit distortion-aware cues and modest computational cost, suitable for practical applications requiring efficient no-reference quality assessment.

Abstract: Reliable image quality assessment is essential in applications where large volumes of images are acquired automatically and must be filtered before further analysis. In many practical scenarios, a pristine reference image is unavailable, making no reference image quality assessment (NR-IQA) particularly important. This paper introduces Multi-Metric Image Quality Assessment (MM-IQA), a lightweight multi-metric framework for NR-IQA. It combines interpretable cues related to blur, edge structure, low resolution artifacts, exposure imbalance, noise, haze, and frequency content to produce a single quality score in the range [0,100].MM-IQA was evaluated on five benchmark datasets (KonIQ-10k, LIVE Challenge, KADID-10k, TID2013, and BIQ2021) and achieved SRCC values ranging from 0.647 to 0.830. Additional experiments on a synthetic agricultural dataset showed consistent behavior of the designed cues. The Python/OpenCV implementation required about 1.97 s per image. This method also has modest memory requirements because it stores only a limited number of intermediate grayscale, filtered, and frequency-domain representations, resulting in memory usage that scales linearly with image size. The results show that MM-IQA can be used for fast image quality screening with explicit distortion aware cues and modest computational cost.

Shivam Chand Kaushik

Main category: cs.CV

TL;DR: SemiFA is an agentic multi-modal framework that autonomously generates structured semiconductor failure analysis reports from inspection images in under one minute using a four-agent pipeline with vision-language models and equipment telemetry fusion.

Details

Motivation: Semiconductor failure analysis currently requires hours of expert time per case involving manual examination of inspection images, correlation of equipment telemetry, consultation of historical records, and report writing. There's a need to automate this time-consuming process.

Method: Four-agent LangGraph pipeline: 1) DefectDescriber using DINOv2 and LLaVA-1.6 for defect classification and morphology narration, 2) RootCauseAnalyzer fusing SECS/GEM equipment telemetry with historical defect retrieval from Qdrant vector database, 3) SeverityClassifier for severity assignment and yield impact estimation, 4) RecipeAdvisor for corrective process adjustments, plus a fifth node for PDF report assembly.

Result: DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917). Full pipeline generates complete FA reports in 48 seconds on NVIDIA A100. Multi-modal fusion improves root cause reasoning by +0.86 composite points over image-only baseline, with equipment telemetry being the more load-bearing modality.

Conclusion: SemiFA successfully automates semiconductor failure analysis report generation, integrating SECS/GEM equipment telemetry into a vision-language model pipeline for the first time, significantly reducing analysis time from hours to under one minute while maintaining high accuracy.

Abstract: Semiconductor failure analysis (FA) requires engineers to examine inspection images, correlate equipment telemetry, consult historical defect records, and write structured reports, a process that can consume several hours of expert time per case. We present SemiFA, an agentic multi-modal framework that autonomously generates structured FA reports from semiconductor inspection images in under one minute. SemiFA decomposes FA into a four-agent LangGraph pipeline: a DefectDescriber that classifies and narrates defect morphology using DINOv2 and LLaVA-1.6, a RootCauseAnalyzer that fuses SECS/GEM equipment telemetry with historically similar defects retrieved from a Qdrant vector database, a SeverityClassifier that assigns severity and estimates yield impact, and a RecipeAdvisor that proposes corrective process adjustments. A fifth node assembles a PDF report. We introduce SemiFA-930, a dataset of 930 annotated semiconductor defect images paired with structured FA narratives across nine defect classes, drawn from procedural synthesis, WM-811K, and MixedWM38. Our DINOv2-based classifier achieves 92.1% accuracy on 140 validation images (macro F1 = 0.917), and the full pipeline produces complete FA reports in 48 seconds on an NVIDIA A100-SXM4-40 GB GPU. A GPT-4o judge ablation across four modality conditions demonstrates that multi-modal fusion improves root cause reasoning by +0.86 composite points (1-5 scale) over an image-only baseline, with equipment telemetry as the more load-bearing modality. To our knowledge, SemiFA is the first system to integrate SECS/GEM equipment telemetry into a vision-language model pipeline for autonomous FA report generation.

[138] Graph Propagated Projection Unlearning: A Unified Framework for Vision and Audio Discriminative Models

Shreyansh Pathak, Jyotishman Das

Main category: cs.CV

TL;DR: GPPU is a unified, scalable algorithm for class-level unlearning that works across vision and audio models using graph propagation to identify class-specific directions and orthogonal projection for efficient information removal.

Details

Motivation: The need for selective and efficient erasure of learned information from deep neural networks is important for privacy, regulatory compliance, and adaptive system design, requiring a principled approach to machine unlearning.

Method: GPPU uses graph-based propagation to identify class-specific directions in feature space, projects representations onto orthogonal subspaces, and performs targeted fine-tuning to ensure effective and irreversible removal of target class information.

Result: GPPU achieves 10-20x speedups over prior methodologies while preserving model utility on retained classes, demonstrated through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks across various architectures.

Conclusion: GPPU provides a principled, modality-agnostic approach to machine unlearning at a scale not previously explored, contributing to more efficient and responsible deep learning systems.

Abstract: The need to selectively and efficiently erase learned information from deep neural networks is becoming increasingly important for privacy, regulatory compliance, and adaptive system design. We introduce Graph-Propagated Projection Unlearning (GPPU), a unified and scalable algorithm for class-level unlearning that operates across both vision and audio models. GPPU employs graph-based propagation to identify class-specific directions in the feature space and projects representations onto the orthogonal subspace, followed by targeted fine-tuning, to ensure that target class information is effectively and irreversibly removed. Through comprehensive evaluations on six vision datasets and two large-scale audio benchmarks spanning a variety of architectures including CNNs, Vision Transformers, and Audio Transformers, we demonstrate that GPPU achieves highly efficient unlearning, realizing 10-20x speedups over prior methodologies while preserving model utility on retained classes. Our framework provides a principled and modality-agnostic approach to machine unlearning, evaluated at a scale that has received limited attention in prior work, contributing toward more efficient and responsible deep learning.

[139] DroneScan-YOLO: Redundancy-Aware Lightweight Detection for Tiny Objects in UAV Imagery

Yann V. Bellec

Main category: cs.CV

TL;DR: DroneScan-YOLO improves aerial object detection for UAV imagery by addressing tiny object detection challenges through increased resolution, dynamic filter pruning, a lightweight stride-4 detection branch, and a hybrid loss function.

Details

Motivation: Aerial object detection in UAV imagery faces challenges with tiny objects, adverse conditions, and computational constraints. Standard YOLO detectors fail due to minimum stride limitations, gradient issues for non-overlapping tiny boxes, and filter redundancy.

Method: Four coordinated design choices: (1) increased 1280x1280 input resolution, (2) RPA-Block dynamic filter pruning with lazy cosine-similarity updates, (3) MSFD lightweight P2 detection branch at stride 4, and (4) SAL-NWD hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting.

Result: Achieves 55.3% mAP@50 and 35.6% mAP@50-95 on VisDrone2019-DET, outperforming YOLOv8s by +16.6 and +12.3 points respectively. Improves recall from 0.374 to 0.518, maintains 96.7 FPS with only +4.1% parameters. Significant gains on tiny objects: bicycle AP@50 improves 187%, awning-tricycle improves 52%.

Conclusion: DroneScan-YOLO provides a holistic solution for aerial object detection that effectively addresses tiny object detection challenges while maintaining computational efficiency, significantly outperforming baseline methods.

Abstract: Aerial object detection in UAV imagery presents unique challenges due to the high prevalence of tiny objects, adverse environmental conditions, and strict computational constraints. Standard YOLO-based detectors fail to address these jointly: their minimum detection stride of 8 pixels renders sub-32px objects nearly undetectable, their CIoU loss produces zero gradients for non-overlapping tiny boxes, and their architectures contain significant filter redundancy. We propose DroneScan-YOLO, a holistic system contribution that addresses these limitations through four coordinated design choices: (1) increased input resolution of 1280x1280 to maximize spatial detail for tiny objects, (2) RPA-Block, a dynamic filter pruning mechanism based on lazy cosine-similarity updates with a 10-epoch warm-up period, (3) MSFD, a lightweight P2 detection branch at stride 4 adding only 114,592 parameters (+1.1%), and (4) SAL-NWD, a hybrid loss combining Normalized Wasserstein Distance with size-adaptive CIoU weighting, integrated into YOLOv8’s TaskAligned assignment pipeline. Evaluated on VisDrone2019-DET, DroneScan-YOLO achieves 55.3% mAP@50 and 35.6% mAP@50-95, outperforming the YOLOv8s baseline by +16.6 and +12.3 points respectively, improving recall from 0.374 to 0.518, and maintaining 96.7 FPS inference speed with only +4.1% parameters. Gains are most pronounced on tiny object classes: bicycle AP@50 improves from 0.114 to 0.328 (+187%), and awning-tricycle from 0.156 to 0.237 (+52%).

[140] PatchPoison: Poisoning Multi-View Datasets to Degrade 3D Reconstruction

Prajas Wadekar, Venkata Sai Pranav Bachina, Kunal Bhosikar, Ankit Gangwal, Charu Sharma

Main category: cs.CV

TL;DR: PatchPoison: A dataset-poisoning method using small adversarial patches to prevent unauthorized 3D reconstruction from multi-view images by corrupting SfM feature matching.

Details

Motivation: 3D Gaussian Splatting enables photorealistic 3D reconstruction from casually captured images, raising privacy concerns about unauthorized reconstruction of scenes/objects without consent. Need a practical protection method.

Method: Inject small high-frequency adversarial patches (structured checkerboard patterns) into periphery of each image in multi-view datasets. Patches corrupt feature-matching in SfM pipelines like COLMAP by introducing spurious correspondences that misalign camera poses, causing downstream 3DGS optimization to diverge.

Result: On NeRF-Synthetic benchmark, inserting 12×12 pixel patches increases reconstruction error by 6.8× in LPIPS metric. Poisoned images remain unobtrusive to human viewers while effectively preventing 3D reconstruction.

Conclusion: PatchPoison offers lightweight, practical protection against unauthorized 3D reconstruction without requiring pipeline modifications, serving as a “drop-in” preprocessing step for content creators.

Abstract: 3D Gaussian Splatting (3DGS) has recently enabled highly photorealistic 3D reconstruction from casually captured multi-view images. However, this accessibility raises a privacy concern: publicly available images or videos can be exploited to reconstruct detailed 3D models of scenes or objects without the owner’s consent. We present PatchPoison, a lightweight dataset-poisoning method that prevents unauthorized 3D reconstruction. Unlike global perturbations, PatchPoison injects a small high-frequency adversarial patch, a structured checkerboard, into the periphery of each image in a multi-view dataset. The patch is designed to corrupt the feature-matching stage of Structure-from-Motion (SfM) pipelines such as COLMAP by introducing spurious correspondences that systematically misalign estimated camera poses. Consequently, downstream 3DGS optimization diverges from the correct scene geometry. On the NeRF-Synthetic benchmark, inserting a 12 X 12 pixel patch increases reconstruction error by 6.8x in LPIPS, while the poisoned images remain unobtrusive to human viewers. PatchPoison requires no pipeline modifications, offering a practical, “drop-in” preprocessing step for content creators to protect their multi-view data.

[141] 3DRealHead: Few-Shot Detailed Head Avatar

Jalees Nehvi, Timo Bolkart, Thabo Beeler, Justus Thies

Main category: cs.CV

TL;DR: 3DRealHead: A few-shot 3D head avatar reconstruction method that uses a Style U-Net to generate 3D Gaussian primitives from few images, with novel expression control combining 3DMM signals and mouth region features from monocular video for higher expressivity.

Details

Motivation: Current 3D head avatar methods struggle to faithfully reproduce identity and facial expressions, especially for person-specific features like mouth and teeth. Existing methods rely on limited training data and 3DMM-based expression control that restricts expressivity, failing to capture the full diversity of human appearances and detailed facial expressions needed for immersive applications.

Method: Proposes 3DRealHead with: 1) Few-shot inversion process of a 3D human head prior represented as a Style U-Net that emits 3D Gaussian primitives, learned on NeRSemble dataset; 2) Novel expression control combining 3DMM-based facial expression signals with mouth region features extracted from driving monocular video; 3) Enables avatar reconstruction from few pictures and driving with consumer webcam.

Result: The method achieves higher expressivity and closer resemblance to physical reality by recovering facial expressions that cannot be represented by 3DMM alone. The few-shot approach allows avatar creation from limited input images while maintaining detailed person-specific features.

Conclusion: 3DRealHead addresses limitations of current 3D head avatar methods by combining learned priors with novel expression control signals, enabling faithful reproduction of identity and detailed facial expressions from few-shot inputs for immersive applications.

Abstract: The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the users idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce the identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearances, especially, for regions with highly person-specific features, like the mouth and teeth region is challenging as the underlying training data is limited. In addition, many of the avatar methods are purely relying on 3D morphable model-based expression control which strongly limits expressivity. To address these challenges, we are introducing 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar and drive it with a consumer-level webcam. The avatar reconstruction is enabled via a novel few-shot inversion process of a 3D human head prior which is represented as a Style U-Net that emits 3D Gaussian primitives which can be rendered under novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals, as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM leading to a higher expressivity and closer resemblance to the physical reality.

[142] GeoLink: A 3D-Aware Framework Towards Better Generalization in Cross-View Geo-Localization

Hongyang Zhang, Yinhao Liu, Haitao Zhang, Zhongyi Wen, Shuxian Liang, Xiansheng Hua

Main category: cs.CV

TL;DR: GeoLink: A 3D-aware semantic-consistent framework for generalizable cross-view geo-localization that uses 3D scene reconstruction to improve 2D representation learning and enhance generalization to unseen domains.

Details

Motivation: Cross-view geo-localization faces challenges with semantic inconsistency due to viewpoint variation and poor generalization under domain shift. Existing 2D correspondence methods are easily distracted by redundant shared information across views, leading to less transferable representations.

Method: Offline reconstruction of scene point clouds from multi-view drone images using VGGT to provide stable structural priors. Two complementary improvements: 1) Geometric-aware Semantic Refinement module mitigates redundant and view-biased dependencies in 2D features under 3D guidance; 2) Unified View Relation Distillation module transfers 3D structural relations to 2D features while preserving 2D-only inference pipeline.

Result: Extensive experiments on multiple benchmarks show GeoLink consistently outperforms state-of-the-art methods and achieves superior generalization across unseen domains and diverse weather environments.

Conclusion: The proposed 3D-aware framework effectively addresses semantic inconsistency and generalization challenges in cross-view geo-localization by leveraging 3D structural priors to enhance 2D representation learning.

Abstract: Generalizable cross-view geo-localization aims to match the same location across views in unseen regions and conditions without GPS supervision. Its core difficulty lies in severe semantic inconsistency caused by viewpoint variation and poor generalization under domain shift. Existing methods mainly rely on 2D correspondence, but they are easily distracted by redundant shared information across views, leading to less transferable representations. To address this, we propose GeoLink, a 3D-aware semantic-consistent framework for Generalizable cross-view geo-localization. Specifically, we offline reconstruct scene point clouds from multi-view drone images using VGGT, providing stable structural priors. Based on these 3D anchors, we improve 2D representation learning in two complementary ways. A Geometric-aware Semantic Refinement module mitigates potentially redundant and view-biased dependencies in 2D features under 3D guidance. In addition, a Unified View Relation Distillation module transfers 3D structural relations to 2D features, improving cross-view alignment while preserving a 2D-only inference pipeline. Extensive experiments on multiple benchmarks show that GeoLink consistently outperforms state-of-the-art methods and achieves superior generalization across unseen domains and diverse weather environments.

[143] Towards Patient-Specific Deformable Registration in Laparoscopic Surgery

Alberto Neri, Veronica Penza, Nazim Haouchine, Leonardo S. Mattos

Main category: cs.CV

TL;DR: First patient-specific non-rigid point cloud registration method for surgical 3D model alignment using Transformer architecture and physics-based registration to handle organ deformations and noise.

Details

Motivation: Unsafe surgical care due to limitations in surgeon experience and situational awareness; need for reliable registration of patient-specific 3D models to enhance visualization and reduce complications, but challenged by organ deformations and noise between preoperative and intraoperative surfaces.

Method: Patient-specific non-rigid point cloud registration combining Transformer encoder-decoder architecture with overlap estimation and matching module for dense correspondence prediction, followed by physics-based registration algorithm.

Result: Significantly outperforms traditional agnostic approaches, achieving 45% Matching Score with 92% Inlier Ratio on synthetic data, demonstrating effectiveness on both synthetic and real data.

Conclusion: Patient-specific registration method shows potential to improve surgical care by enabling reliable 3D model integration despite organ deformations and noise.

Abstract: Unsafe surgical care is a critical health concern, often linked to limitations in surgeon experience, skills, and situational awareness. Integrating patient-specific 3D models into the surgical field can enhance visualization, provide real-time anatomical guidance, and reduce intraoperative complications. However, reliably registering these models in general surgery remains challenging due to mismatches between preoperative and intraoperative organ surfaces, such as deformations and noise. To overcome these challenges, we introduce the first patient-specific non-rigid point cloud registration method, which leverages a novel data generation strategy to optimize outcomes for individual patients. Our approach combines a Transformer encoder-decoder architecture with overlap estimation and a dedicated matching module to predict dense correspondences, followed by a physics-based algorithm for registration. Experimental results on both synthetic and real data demonstrate that our patient-specific method significantly outperforms traditional agnostic approaches, achieving 45% Matching Score with 92% Inlier Ratio on synthetic data, highlighting its potential to improve surgical care.

[144] OneHOI: Unifying Human-Object Interaction Generation and Editing

Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan

Main category: cs.CV

TL;DR: OneHOI is a unified diffusion transformer framework that consolidates Human-Object Interaction generation and editing into a single conditional denoising process using structured interaction representations.

Details

Motivation: Existing HOI approaches are disjoint: HOI generation synthesizes scenes from structured triplets but fails to integrate mixed conditions, while HOI editing modifies interactions via text but struggles to decouple pose from physical contact and scale to multiple interactions.

Method: Introduces Relational Diffusion Transformer (R-DiT) with role- and instance-aware HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on HOI-Edit-44K dataset.

Result: Achieves state-of-the-art results across both HOI generation and editing, supporting layout-guided, layout-free, arbitrary-mask, and mixed-condition control.

Conclusion: OneHOI provides a unified framework that effectively addresses limitations of existing HOI approaches by consolidating generation and editing capabilities through structured interaction representations.

Abstract: Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.

[145] Multitasking Embedding for Embryo Blastocyst Grading Prediction (MEmEBG)

Nahid Khoshk Angabini, Mohsen Tajgardan, Mahesh Madhavan, Zahra Asghari Varzaneh, Reza Khoshkangini, Thomas Ebner

Main category: cs.CV

TL;DR: A multitask embedding-based approach using ResNet-18 for automated analysis of blastocyst quality from embryo images, predicting trophectoderm, inner cell mass, and expansion grades.

Details

Motivation: Current embryo grading in IVF relies on subjective visual assessment of morphological features, leading to inter-embryologist variability and standardization challenges. There's a need for automated, objective blastocyst quality assessment.

Method: Uses a pretrained ResNet-18 architecture enhanced with an embedding layer to learn discriminative representations from limited embryo image datasets. The multitask approach simultaneously predicts TE, ICM, and EXP grades by leveraging biological and physical characteristics extracted from day-5 human embryo images.

Result: Experimental results demonstrate the promise of the multitask embedding approach for robust and consistent blastocyst quality assessment, showing potential for automated analysis of visually similar structures that are difficult to distinguish.

Conclusion: The proposed embedding-based multitask learning approach shows potential for reliable, automated blastocyst quality assessment that could address subjectivity and variability issues in current IVF embryo grading practices.

Abstract: Reliable evaluation of blastocyst quality is critical for the success of in vitro fertilization (IVF) treatments. Current embryo grading practices primarily rely on visual assessment of morphological features, which introduces subjectivity, inter-embryologist variability, and challenges in standardizing quality assurance. In this study, we propose a multitask embedding-based approach for the automated analysis and prediction of key blastocyst components, including the trophectoderm (TE), inner cell mass (ICM), and blastocyst expansion (EXP). The method leverages biological and physical characteristics extracted from images of day-5 human embryos. A pretrained ResNet-18 architecture, enhanced with an embedding layer, is employed to learn discriminative representations from a limited dataset and to automatically identify TE and ICM regions along with their corresponding grades, structures that are visually similar and inherently difficult to distinguish. Experimental results demonstrate the promise of the multitask embedding approach and potential for robust and consistent blastocyst quality assessment.

[146] Person Re-Identification via Generalized Class Prototypes

Md Ahmed Al Muzaddid, William J. Beksi

Main category: cs.CV

TL;DR: A generalized selection method for person re-identification that improves performance by choosing better class representations beyond simple centroids, balancing accuracy and mean average precision.

Details

Motivation: While feature extraction and objective function improvements have advanced person re-identification, selecting optimal class representatives remains underexplored. Prior methods using class centroids during retrieval yield suboptimal results, creating a need for better representation selection strategies.

Method: Proposes a generalized selection method that chooses representations not limited to class centroids. The approach allows adjustment of the number of representations per class based on application requirements and works on top of existing re-identification embeddings.

Result: The method substantially improves upon contemporary results across multiple re-identification embeddings, achieving better balance between accuracy and mean average precision beyond state-of-the-art performance.

Conclusion: Better selection of class representatives is crucial for person re-identification performance, and the proposed generalized selection method effectively addresses this gap, offering flexible representation choices that improve retrieval metrics.

Abstract: Advanced feature extraction methods have significantly contributed to enhancing the task of person re-identification. In addition, modifications to objective functions have been developed to further improve performance. Nonetheless, selecting better class representatives is an underexplored area of research that can also lead to advancements in re-identification performance. Although past works have experimented with using the centroid of a gallery image class during training, only a few have investigated alternative representations during the retrieval stage. In this paper, we demonstrate that these prior techniques yield suboptimal results in terms of re-identification metrics. To address the re-identification problem, we propose a generalized selection method that involves choosing representations that are not limited to class centroids. Our approach strikes a balance between accuracy and mean average precision, leading to improvements beyond the state of the art. For example, the actual number of representations per class can be adjusted to meet specific application requirements. We apply our methodology on top of multiple re-identification embeddings, and in all cases it substantially improves upon contemporary results.

[147] Neural 3D Reconstruction of Planetary Surfaces from Descent-Phase Wide-Angle Imagery

Melonie de Almeida, George Brydon, Divya M. Persaud, John H. Williamson, Paul Henderson

Main category: cs.CV

TL;DR: Neural height field reconstruction method for planetary descent imagery outperforms traditional multi-view stereo by incorporating domain-specific priors about continuous, smooth planetary surfaces.

Details

Motivation: Digital elevation modeling of planetary surfaces is crucial for geological studies, but accurate 3D reconstruction from spacecraft descent imagery is challenging due to strong radial distortion, limited parallax from vertically descending cameras, and limitations of conventional multi-view stereo methods.

Method: Developed a novel neural reconstruction approach with explicit neural height field representation that incorporates domain-specific priors about planetary surfaces being continuous, smooth, solid, and free from floating objects. This is the first study of modern neural reconstruction methods for planetary descent imaging.

Result: Experiments on simulated descent sequences over high-fidelity lunar and Mars terrains show the proposed approach achieves increased spatial coverage while maintaining satisfactory estimation accuracy compared to traditional multi-view stereo methods.

Conclusion: Neural approaches offer a strong and competitive alternative to traditional multi-view stereo methods for planetary descent imaging, with the neural height field representation providing effective domain-specific priors for planetary surface reconstruction.

Abstract: Digital elevation modeling of planetary surfaces is essential for studying past and ongoing geological processes. Wide-angle imagery acquired during spacecraft descent promises to offer a low-cost option for high-resolution terrain reconstruction. However, accurate 3D reconstruction from such imagery is challenging due to strong radial distortion and limited parallax from vertically descending, predominantly nadir-facing cameras. Conventional multi-view stereo exhibits limited depth range and reduced fidelity under these conditions and also lacks domain-specific priors. We present the first study of modern neural reconstruction methods for planetary descent imaging. We also develop a novel approach that incorporates an explicit neural height field representation, which provides a strong prior since planetary surfaces are generally continuous, smooth, solid, and free from floating objects. This study demonstrates that neural approaches offer a strong and competitive alternative to traditional multi-view stereo (MVS) methods. Experiments on simulated descent sequences over high-fidelity lunar and Mars terrains demonstrate that the proposed approach achieves increased spatial coverage while maintaining satisfactory estimation accuracy.

[148] LPM 1.0: Video-based Character Performance Model

Ailing Zeng, Casper Yang, Chauncey Ge, Eddie Zhang, Garvey Xu, Gavin Lin, Gilbert Gu, Jeremy Pi, Leo Li, Mingyi Shi, Shawn Wang, Sheng Bi, Steven Tang, Thorn Hang, Tobey Guo, Vincent Li, Xin Tong, Yikang Li, Yuchen Sun, Yue Zhao, Yuhan Lu, Yuwei Li, Zane Zhang, Zeshi Yang, Zi Ye

Main category: cs.CV

TL;DR: LPM 1.0 is a Large Performance Model that generates real-time, identity-stable audio-visual conversational performance for characters from single images and audio inputs, addressing the performance trilemma of expressiveness, real-time inference, and long-horizon identity stability.

Details

Motivation: Existing video models struggle with the "performance trilemma" - balancing high expressiveness, real-time inference, and long-horizon identity stability. Conversation represents the most comprehensive performance scenario where characters simultaneously speak, listen, react, and emote while maintaining identity over time.

Method: Built a multimodal human-centric dataset with strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction. Trained a 17B-parameter Diffusion Transformer (Base LPM) for controllable, identity-consistent performance, then distilled it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction.

Result: LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio with text prompts for motion control, achieving real-time speed with identity-stable, infinite-length generation. It achieves state-of-the-art results on the proposed LPM-Bench benchmark across all evaluated dimensions.

Conclusion: LPM 1.0 serves as a visual engine for conversational agents, live streaming characters, and game NPCs, solving the performance trilemma through a novel multimodal approach that enables high-quality, real-time, identity-consistent character performance generation.

Abstract: Performance, the externalization of intent, emotion, and personality through visual, vocal, and temporal behavior, is what makes a character alive. Learning such performance from video is a promising alternative to traditional 3D pipelines. However, existing video models struggle to jointly achieve high expressiveness, real-time inference, and long-horizon identity stability, a tension we call the performance trilemma. Conversation is the most comprehensive performance scenario, as characters simultaneously speak, listen, react, and emote while maintaining identity over time. To address this, we present LPM 1.0 (Large Performance Model), focusing on single-person full-duplex audio-visual conversational performance. Concretely, we build a multimodal human-centric dataset through strict filtering, speaking-listening audio-video pairing, performance understanding, and identity-aware multi-reference extraction; train a 17B-parameter Diffusion Transformer (Base LPM) for highly controllable, identity-consistent performance through multimodal conditioning; and distill it into a causal streaming generator (Online LPM) for low-latency, infinite-length interaction. At inference, given a character image with identity-aware references, LPM 1.0 generates listening videos from user audio and speaking videos from synthesized audio, with text prompts for motion control, all at real-time speed with identity-stable, infinite-length generation. LPM 1.0 thus serves as a visual engine for conversational agents, live streaming characters, and game NPCs. To systematically evaluate this setting, we propose LPM-Bench, the first benchmark for interactive character performance. LPM 1.0 achieves state-of-the-art results across all evaluated dimensions while maintaining real-time inference.

[149] A High-Resolution Landscape Dataset for Concept-Based XAI With Application to Species Distribution Models

Augustin de la Brosse, Damien Garreau, Thomas Houet, Thomas Corpetti

Main category: cs.CV

TL;DR: First implementation of concept-based Explainable AI (XAI) for Species Distribution Models using Robust TCAV to quantify landscape concept influence on predictions, with new open-access drone imagery dataset.

Details

Motivation: Species distribution models need both predictive performance and ecological insights, but deep learning models make extracting insights challenging. Need to reconcile these objectives with explainable AI.

Method: Propose concept-based XAI for SDMs using Robust TCAV methodology. Create new open-access landscape concept dataset from high-resolution multispectral and LiDAR drone imagery (653 patches across 15 concepts + 1,450 random reference patches). Test on two aquatic insects using CNNs and Vision Transformers.

Result: Concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV provides landscape-level information useful for policy-making.

Conclusion: First successful implementation of concept-based XAI for SDMs demonstrates value in bridging predictive performance with ecological interpretability, enabling both validation and discovery in species distribution modeling.

Abstract: Mapping the spatial distribution of species is essential for conservation policy and invasive species management. Species distribution models (SDMs) are the primary tools for this task, serving two purposes: achieving robust predictive performance while providing ecological insights into the driving factors of distribution. However, the increasing complexity of deep learning SDMs has made extracting these insights more challenging. To reconcile these objectives, we propose the first implementation of concept-based Explainable AI (XAI) for SDMs. We leverage the Robust TCAV (Testing with Concept Activation Vectors) methodology to quantify the influence of landscape concepts on model predictions. To enable this, we provide a new open-access landscape concept dataset derived from high-resolution multispectral and LiDAR drone imagery. It includes 653 patches across 15 distinct landscape concepts and 1,450 random reference patches, designed to suit a wide range of species. We demonstrate this approach through a case study of two aquatic insects, Plecoptera and Trichoptera, using two Convolutional Neural Networks and one Vision Transformer. Results show that concept-based XAI helps validate SDMs against expert knowledge while uncovering novel associations that generate new ecological hypotheses. Robust TCAV also provides landscape-level information, useful for policy-making and land management. Code and datasets are publicly available.

[150] 4th Workshop on Maritime Computer Vision (MaCVi): Challenge Overview

Benjamin Kiefer, Jan Lukas Augustin, Jon Muhovič, Mingi Jeong, Arnold Wiliem, Janez Pers, Matej Kristan, Alberto Quattrini Li, Matija Teršek, Josip Šarić, Arpita Vats, Dominik Hildebrand, Rafia Rahim, Mahmut Karaaslan, Arpit Vaishya, Steve Xie, Ersin Kaya, Akib Mashrur, Tze-Hsiang Tang, Chun-Ming Tsai, Jun-Wei Hsieh, Ming-Ching Chang, Wonwoo Jo, Doyeon Lee, Yusi Cao, Lingling Li, Vinayak Nageli, Arshad Jamal, Gorthi Rama Krishna Sai Subrahmanyam, Jemo Maeng, Seongju Lee, Kyoobin Lee, Xu Liu, LiCheng Jiao, Jannik Sheikh, Martin Weinmann, Ivan Martinović, Jose Mateus Raitz Persch, Rahul Harsha Cheppally, Mehmet E. Belviranli, Dimitris Gahtidis, Hyewon Chun, Sangmun Lee, Philipp Gorczak, Hansol Kim, Jeeyeon Jeon, Borja Carrillo Perez, Jiahui Wang, Sangmin Park, Andreas Michel, Jannick Kuester, Bettina Felten, Wolfgang Gross, Yuan Feng, Justin Davis

Main category: cs.CV

TL;DR: Workshop report summarizing the 4th Maritime Computer Vision (MaCVi) workshop at CVPR 2026, covering five benchmark challenges focused on predictive accuracy and real-time embedded feasibility in maritime computer vision.

Details

Motivation: To advance maritime computer vision research by providing benchmark challenges that emphasize both accuracy and real-time feasibility for embedded systems, addressing practical deployment needs in maritime environments.

Method: Organized five benchmark challenges with specific evaluation protocols and datasets, collected submissions from participating teams, conducted quantitative and qualitative analyses, and compiled technical reports from top-performing teams.

Result: The workshop successfully conducted five benchmark challenges, established leaderboards, collected technical reports highlighting practical design choices, and provided comprehensive analyses of emerging method trends in maritime computer vision.

Conclusion: The MaCVi 2026 workshop advanced maritime computer vision research through comprehensive benchmark challenges that balanced predictive accuracy with real-time embedded feasibility, providing valuable resources and insights for the community.

Abstract: The 4th Workshop on Maritime Computer Vision (MaCVi) is organized as part of CVPR 2026. This edition features five benchmark challenges with emphasis on both predictive accuracy and embedded real-time feasibility. This report summarizes the MaCVi 2026 challenge setup, evaluation protocols, datasets, and benchmark tracks, and presents quantitative results, qualitative comparisons, and cross-challenge analyses of emerging method trends. We also include technical reports from top-performing teams to highlight practical design choices and lessons learned across the benchmark suite. Datasets, leaderboards, and challenge resources are available at https://macvi.org/workshop/cvpr26.

[151] Rethinking Uncertainty in Segmentation: From Estimation to Decision

Saket Maganti

Main category: cs.CV

TL;DR: Medical image segmentation uncertainty maps need actionable policies (accept/flag/defer) to be useful; optimizing uncertainty alone misses safety gains; best method removes 80% errors at 25% deferral.

Details

Motivation: Current medical image segmentation reports uncertainty estimates but rarely uses them to guide decisions; there's a disconnect between uncertainty metrics and real-world utility in converting uncertainty maps into actionable policies.

Method: Formulate segmentation as two-stage pipeline (estimation then decision); evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) with three deferral strategies; introduce confidence-aware deferral rule prioritizing uncertain and low-confidence predictions; test on retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1).

Result: Best method and policy combination removes up to 80% of segmentation errors at only 25% pixel deferral; achieves strong cross-dataset robustness; calibration improvements don’t translate to better decision quality.

Conclusion: Uncertainty should be evaluated based on the decisions it enables rather than in isolation; there’s a disconnect between standard uncertainty metrics and real-world utility in medical image segmentation.

Abstract: In medical image segmentation, uncertainty estimates are often reported but rarely used to guide decisions. We study the missing step: how uncertainty maps are converted into actionable policies such as accepting, flagging, or deferring predictions. We formulate segmentation as a two-stage pipeline, estimation followed by decision, and show that optimizing uncertainty alone fails to capture most of the achievable safety gains. Using retinal vessel segmentation benchmarks (DRIVE, STARE, CHASE_DB1), we evaluate two uncertainty sources (Monte Carlo Dropout and Test-Time Augmentation) combined with three deferral strategies, and introduce a simple confidence-aware deferral rule that prioritizes uncertain and low-confidence predictions. Our results show that the best method and policy combination removes up to 80 percent of segmentation errors at only 25 percent pixel deferral, while achieving strong cross-dataset robustness. We further show that calibration improvements do not translate to better decision quality, highlighting a disconnect between standard uncertainty metrics and real-world utility. These findings suggest that uncertainty should be evaluated based on the decisions it enables, rather than in isolation.

[152] Indexing Multimodal Language Models for Large-scale Image Retrieval

Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias

Main category: cs.CV

TL;DR: MLLMs used as zero-shot similarity estimators for image-to-image retrieval without training, outperforming specialized methods on diverse benchmarks.

Details

Motivation: Multimodal LLMs have strong cross-modal reasoning but their potential for vision-only tasks like image retrieval remains underexplored. The paper investigates using MLLMs as training-free similarity estimators for instance-level image retrieval.

Method: Prompt MLLMs with paired images and convert next-token probabilities into similarity scores for zero-shot re-ranking. Combine with memory-efficient indexing and top-k candidate re-ranking for scalability. Avoids specialized architectures and fine-tuning.

Result: MLLMs outperform task-specific re-rankers outside their native domains and show superior robustness to clutter, occlusion, and small objects. They work well for open-world large-scale image retrieval but have failure modes under severe appearance changes.

Conclusion: MLLMs are promising zero-shot alternatives for image retrieval, leveraging multimodal pre-training knowledge without additional training. Future work needed to address limitations with severe appearance changes.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for instance-level image-to-image retrieval. Our approach prompts the model with paired images and converts next-token probabilities into similarity scores, enabling zero-shot re-ranking within large-scale retrieval pipelines. This design avoids specialized architectures and fine-tuning, leveraging the rich visual discrimination learned during multimodal pre-training. We address scalability by combining MLLMs with memory-efficient indexing and top-$k$ candidate re-ranking. Experiments across diverse benchmarks show that MLLMs outperform task-specific re-rankers outside their native domains and exhibit superior robustness to clutter, occlusion, and small objects. Despite strong results, we identify failure modes under severe appearance changes, highlighting opportunities for future research. Our findings position MLLMs as a promising alternative for open-world large-scale image retrieval.

[153] Explainable Fall Detection for Elderly Care via Temporally Stable SHAP in Skeleton-Based Human Activity Recognition

Mohammad Saleh, Azadeh Tabatabaei

Main category: cs.CV

TL;DR: Proposes T-SHAP, a temporally-aware explanation method for skeleton-based fall detection that stabilizes SHAP attributions over time windows to improve reliability for clinical use.

Details

Motivation: Existing post-hoc explainability methods produce temporally unstable attribution maps when applied frame-by-frame to sequential data, making them unreliable for clinical decision-making in fall detection.

Method: Combines efficient LSTM model with T-SHAP (temporally aware SHAP aggregation) that applies linear smoothing to attribution sequences, reducing high-frequency variance while preserving Shapley value guarantees.

Result: Achieves 94.3% classification accuracy with <25ms inference latency; T-SHAP improves explanation reliability (AUP: 0.91 vs 0.89 for standard SHAP) and highlights biomechanically relevant motion patterns.

Conclusion: The framework provides stable, reliable explanations for fall detection that align with clinical observations, supporting transparent decision aids in elderly care environments.

Abstract: Fall detection in elderly care requires not only accurate classification but also reliable explanations that clinicians can trust. However, existing post-hoc explainability methods, when applied frame-by-frame to sequential data, produce temporally unstable attribution maps that clinicians cannot reliably act upon. To address this issue, we propose a lightweight and explainable framework for skeleton-based fall detection that combines an efficient LSTM model with T-SHAP, a temporally aware post-hoc aggregation strategy that stabilizes SHAP-based feature attributions over contiguous time windows. Unlike standard SHAP, which treats each frame independently, T-SHAP applies a linear smoothing operator to the attribution sequence, reducing high-frequency variance while preserving the theoretical guarantees of Shapley values, including local accuracy and consistency. Experiments on the NTU RGB+D Dataset demonstrate that the proposed framework achieves 94.3% classification accuracy with an end-to-end inference latency below 25 milliseconds, satisfying real-time constraints on mid-range hardware and indicating strong potential for deployment in clinical monitoring scenarios. Quantitative evaluation using perturbation-based faithfulness metrics shows that T-SHAP improves explanation reliability compared to standard SHAP (AUP: 0.89 vs. 0.91) and Grad-CAM (0.82), with consistent improvements observed across five-fold cross-validation, indicating enhanced explanation reliability. The resulting attributions consistently highlight biomechanically relevant motion patterns, including lower-limb instability and changes in spinal alignment, aligning with established clinical observations of fall dynamics and supporting their use as transparent decision aids in long-term care environments

[154] See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

Mahyar Ghazanfari, Peng Wei

Main category: cs.CV

TL;DR: See&Say combines geometric safety analysis with semantic perception using Vision-Language Models for autonomous drone package delivery, outperforming baselines in safety map prediction and alternative drop zone identification.

Details

Motivation: Existing drone delivery systems struggle with safe package drop-offs in cluttered urban environments due to limitations in either geometry-based analysis or semantic segmentation alone, lacking integrated semantic reasoning for robust decision-making.

Method: Proposes See&Say framework that fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, guided by a Vision-Language Model for iterative refinement of object category prompts and hazard detection across time.

Result: Outperforms all baselines with highest accuracy and IoU for safety map prediction, and superior performance in alternative drop zone evaluation across multiple thresholds on a curated dataset of urban delivery scenarios.

Conclusion: Demonstrates the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery, enabling reliable reasoning under dynamic conditions during final delivery phase.

Abstract: Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the final delivery phase. When the primary drop-pad is occupied or unsafe, the proposed See&Say also identifies alternative candidate zones for package delivery. We curated a dataset of urban delivery scenarios with moving objects and human activities to evaluate the approach. Experimental results show that See&Say outperforms all baselines, achieving the highest accuracy and IoU for safety map prediction as well as superior performance in alternative drop zone evaluation across multiple thresholds. These findings highlight the promise of VLM-guided segmentation-depth fusion for advancing safe and practical drone-based package delivery.

[155] PAT-VCM: Plug-and-Play Auxiliary Tokens for Video Coding for Machines

Wei Jiang, Wei Wang

Main category: cs.CV

TL;DR: PAT-VCM: A plug-and-play auxiliary-token framework for video coding for machines that uses shared baseline compressed streams augmented with lightweight task-aware auxiliary tokens for multiple downstream tasks.

Details

Motivation: Existing video coding for machines is typically trained for specific downstream tasks and models, making compressed representations tightly coupled to end tasks and difficult to scale across multiple tasks or adapt to model updates.

Method: Proposes PAT-VCM framework that maintains a shared baseline compressed stream and augments it with three types of lightweight task-aware auxiliary tokens: visual residual tokens, prompt/control tokens, and semantic tokens, allowing different downstream tasks to recover needed information without separate codec training.

Result: Evaluation on segmentation, depth estimation, and semantic recognition shows: shared detection-oriented auxiliary branch provides reusable first refinement; task-specific visual branches improve segmentation and depth; prompt tokens provide further segmentation gains at negligible bitrate; semantic tokens achieve strong recognition performance with extremely low overhead.

Conclusion: A shared compressed representation combined with lightweight task-aware auxiliary tokens is a practical and scalable alternative to tightly task-coupled VCM design, enabling multi-task support and adaptability to model updates.

Abstract: Existing video coding for machines is often trained for a specific downstream task and model. As a result, the compressed representation becomes tightly coupled to the end task, making it difficult to scale across multiple tasks or adapt to model updates. We propose PAT-VCM, a plug-and-play auxiliary-token framework for video coding for machines. PAT-VCM keeps a shared baseline compressed stream and augments it with lightweight task-aware auxiliary tokens, allowing different downstream tasks to recover the information they need without retraining a separate codec for each task. The framework supports three forms of auxiliary information: visual residual tokens, prompt/control tokens, and semantic tokens. We evaluate PAT-VCM on segmentation, depth estimation, and semantic recognition. A shared detection-oriented auxiliary branch provides a reusable first refinement, task-specific visual branches improve segmentation and depth, prompt tokens provide further segmentation gains at negligible bitrate, and semantic tokens achieve strong recognition performance with extremely low overhead. These results suggest that a shared compressed representation, combined with lightweight task-aware auxiliary tokens, is a practical and scalable alternative to tightly task-coupled VCM design.

[156] Can Cross-Layer Transcoders Replace Vision Transformer Activations? An Interpretable Perspective on Vision

Gerasimos Chatzoudis, Konstantinos D. Polyzos, Zhuowei Li, Difei Gu, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: Cross-Layer Transcoders (CLTs) are introduced as sparse, depth-aware proxy models for MLP blocks in Vision Transformers, enabling interpretable decomposition of final representations into additive layer-wise contributions.

Details

Motivation: Existing Sparse Autoencoders operate on individual layers and fail to capture cross-layer computational structure and layer significance in Vision Transformers, limiting interpretability of how final representations are formed.

Method: CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, creating a linear decomposition that transforms final ViT representations into additive, layer-resolved constructions.

Result: CLTs achieve high reconstruction fidelity while preserving CLIP zero-shot classification accuracy. Cross-layer contribution scores provide faithful attribution, revealing final representations are concentrated in a small set of dominant layer-wise terms.

Conclusion: CLTs serve as reliable interpretable proxies for Vision Transformers, enabling process-level interpretability and faithful attribution by decomposing final representations into layer-wise contributions.

Abstract: Understanding the internal activations of Vision Transformers (ViTs) is critical for building interpretable and trustworthy models. While Sparse Autoencoders (SAEs) have been used to extract human-interpretable features, they operate on individual layers and fail to capture the cross-layer computational structure of Transformers, as well as the relative significance of each layer in forming the last-layer representation. Alternatively, we introduce the adoption of Cross-Layer Transcoders (CLTs) as reliable, sparse, and depth-aware proxy models for MLP blocks in ViTs. CLTs use an encoder-decoder scheme to reconstruct each post-MLP activation from learned sparse embeddings of preceding layers, yielding a linear decomposition that transforms the final representation of ViTs from an opaque embedding into an additive, layer-resolved construction that enables faithful attribution and process-level interpretability. We train CLTs on CLIP ViT-B/32 and ViT-B/16 across CIFAR-100, COCO, and ImageNet-100. We show that CLTs achieve high reconstruction fidelity with post-MLP activations while preserving and even improving, in some cases, CLIP zero-shot classification accuracy. In terms of interpretability, we show that the cross-layer contribution scores provide faithful attribution, revealing that the final representation is concentrated in a smaller set of dominant layer-wise terms whose removal degrades performance and whose retention largely preserves it. These results showcase the significance of adopting CLTs as an alternative interpretable proxy of ViTs in the vision domain.

[157] Bias at the End of the Score

Salma Abdel Magid, Grace Guo, Esin Tureci, Amaya Dharmasiri, Vikram V. Ramaswamy, Hanspeter Pfister, Olga Russakovsky

Main category: cs.CV

TL;DR: Reward models in text-to-image generation encode demographic biases that cause optimization to disproportionately sexualize female subjects, reinforce stereotypes, and collapse diversity.

Details

Motivation: While reward models are crucial for text-to-image generation systems (used for dataset filtering, evaluation, optimization, and safety filtering), their robustness and fairness as scoring functions remain largely unknown, particularly regarding demographic biases.

Method: Conducted a large-scale audit of reward model robustness with respect to demographic biases during T2I model training and generation, providing both quantitative and qualitative evidence of bias encoding.

Result: Reward models encode demographic biases that cause reward-guided optimization to disproportionately sexualize female image subjects, reinforce gender/racial stereotypes, and collapse demographic diversity.

Conclusion: Findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures for more robust scoring.

Abstract: Reward models (RMs) are inherently non-neutral value functions designed and trained to encode specific objectives, such as human preferences or text-image alignment. RMs have become crucial components of text-to-image (T2I) generation systems where they are used at various stages for dataset filtering, as evaluation metrics, as a supervisory signal during optimization of parameters, and for post-generation safety and quality filtering of T2I outputs. While specific problems with the integration of RMs into the T2I pipeline have been studied (e.g. reward hacking or mode collapse), their robustness and fairness as scoring functions remains largely unknown. We conduct a large scale audit of RM robustness with respect to demographic biases during T2I model training and generation. We provide quantitative and qualitative evidence that while originally developed as quality measures, RMs encode demographic biases, which cause reward-guided optimization to disproportionately sexualize female image subjects reinforce gender/racial stereotypes, and collapse demographic diversity. These findings highlight shortcomings in current reward models, challenge their reliability as quality metrics, and underscore the need for improved data collection and training procedures to enable more robust scoring.

[158] Deep Spatially-Regularized and Superpixel-Based Diffusion Learning for Unsupervised Hyperspectral Image Clustering

Vutichart Buranasiri, James M. Murphy

Main category: cs.CV

TL;DR: Unsupervised hyperspectral image clustering using masked autoencoder for denoised latent representation and diffusion-based clustering with spatial regularization.

Details

Motivation: To improve hyperspectral image clustering by learning better latent representations that capture spatial context and spectral correlations, enabling more accurate diffusion-based clustering.

Method: Two-stage approach: 1) Unsupervised masked autoencoder (UMAE) with Vision Transformer backbone learns denoised latent representations using only a small subset of training pixels via masking; 2) Entropy rate superpixel segmentation followed by spatially-regularized diffusion graph construction in the compressed latent space using Euclidean and diffusion distances.

Result: Experiments on Botswana and KSC datasets demonstrate improved labeling accuracy and clustering quality compared to baseline methods.

Conclusion: The proposed DS²DL framework effectively combines masked representation learning with diffusion-based clustering for superior hyperspectral image clustering performance.

Abstract: An unsupervised framework for hyperspectral image (HSI) clustering is proposed that incorporates masked deep representation learning with diffusion-based clustering, extending the Spatially-Regularized Superpixel-based Diffusion Learning ($S^2DL$) algorithm. Initially, a denoised latent representation of the original HSI is learned via an unsupervised masked autoencoder (UMAE) model with a Vision Transformer backbone. The UMAE takes spatial context and long-range spectral correlations into account and incorporates an efficient pretraining process via masking that utilizes only a small subset of training pixels. In the next stage, the entropy rate superpixel (ERS) algorithm is used to segment the image into superpixels, and a spatially regularized diffusion graph is constructed using Euclidean and diffusion distances within the compressed latent space instead of the HSI space. The proposed algorithm, Deep Spatially-Regularized Superpixel-based Diffusion Learning ($DS^2DL$), leverages more faithful diffusion distances and subsequent diffusion graph construction that better reflect the intrinsic geometry of the underlying data manifold, improving labeling accuracy and clustering quality. Experiments on Botswana and KSC datasets demonstrate the efficacy of $DS^2DL$.

[159] The Spectrascapes Dataset: Street-view imagery beyond the visible captured using a mobile platform

Akshit Gupta, Joris Timmermans, Filip Biljecki, Remko Uijlenhoet

Main category: cs.CV

TL;DR: Spectrascapes: A novel multi-spectral terrestrial-view dataset for urban climate monitoring, featuring RGB, Near-infrared, and Thermal imagery captured across diverse urban areas in the Netherlands.

Details

Motivation: Current urban monitoring datasets have limitations including poor scalability, inconsistent spatio-temporal resolutions, overhead views, or low spectral information, which hinder climate-resilient city development.

Method: Created a multi-spectral dataset using bikes equipped with RGB, Near-infrared, and Thermal imaging sensors, capturing 17,718 street-level images across diverse urban morphologies in the Netherlands with strict calibration and quality control.

Result: Spectrascapes is presented as the first open-access dataset of its kind, enabling downstream applications in machine learning, urban planning, and remote sensing domains.

Conclusion: The dataset addresses limitations of existing urban monitoring methods and provides a valuable resource for developing climate-resilient cities through multi-spectral terrestrial-view analysis.

Abstract: High-resolution data in spatial and temporal contexts is imperative for developing climate resilient cities. Current datasets for monitoring urban parameters are developed primarily using manual inspections, embedded-sensing, remote sensing, or standard street-view imagery (RGB). These methods and datasets are often constrained respectively by poor scalability, inconsistent spatio-temporal resolutions, overhead views or low spectral information. We present a novel method and its open implementation: a multi-spectral terrestrial-view dataset that circumvents these limitations. This dataset consists of 17,718 street level multi-spectral images captured with RGB, Near-infrared, and Thermal imaging sensors on bikes, across diverse urban morphologies (village, town, small city, and big urban area) in the Netherlands. Strict emphasis is put on data calibration and quality while also providing the details of our data collection methodology (including the hardware and software details). To the best of our knowledge, Spectrascapes is the first open-access dataset of its kind. Finally, we demonstrate two downstream use-cases enabled using this dataset and provide potential research directions in the machine learning, urban planning and remote sensing domains.

[160] Why MLLMs Struggle to Determine Object Orientations

Anju Gopinath, Nikhil Krishnaswamy, Bruce Draper

Main category: cs.CV

TL;DR: Contrary to prior assumptions, visual encoder representations in MLLMs do preserve object orientation information, as linear models can accurately predict rotations from embeddings, suggesting orientation failures stem from other MLLM components.

Details

Motivation: Prior work suggests MLLMs struggle with 2D object orientation tasks due to visual encoder limitations, but this paper aims to empirically test whether orientation information is actually preserved in encoder representations.

Method: Designed controlled protocol to test orientation recovery from encoder features. Examined SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, and CLIP representations from LLaVA 1.5/1.6 using rotated foreground patches. Trained linear regressors to predict object orientation from encoded features as a test of information preservation.

Result: Contrary to the null hypothesis, orientation information is recoverable from encoder representations - simple linear models can accurately predict object orientations from embeddings, contradicting the assumption that MLLM orientation failures originate in the visual encoder.

Conclusion: Visual encoders do preserve orientation information, so MLLM failures on 2D orientation tasks must stem from other components. Although orientation information is present, it’s diffusely spread across thousands of features, which may explain why MLLMs fail to exploit it effectively.

Abstract: Multimodal Large Language Models (MLLMs) struggle with tasks that require reasoning about 2D object orientation in images, as documented in prior work. Tong et al. and Nichols et al. hypothesize that these failures originate in the visual encoder, since commonly used encoders such as CLIP and SigLIP are trained for image-text semantic alignment rather than geometric reasoning. We design a controlled empirical protocol to test this claim by measuring whether rotations can be recovered from encoder representations. In particular, we examine SigLIP and ViT features from LLaVA OneVision and Qwen2.5-VL-7B-Instruct models, respectively, using full images, and examine CLIP representations in LLaVA 1.5 and 1.6 using rotated foreground patches against natural background images. Our null hypothesis is that orientation information is not preserved in the encoder embeddings and we test this by training linear regressors to predict object orientation from encoded features. Contrary to the hypothesis, we find that orientation information is recoverable from encoder representations: simple linear models accurately predict object orientations from embeddings. This contradicts the assumption that MLLM orientation failures originate in the visual encoder. Having rejected the accepted hypothesis that MLLMs struggle with 2D orientation tasks because of visual encoder limitations, we still don’t know why they fail. Although a full explanation is beyond the scope of this paper, we show that although present, orientation information is spread diffusely across tens of thousands of features. This may or may not be while MLLMs fail to exploit the available orientation information.

[161] Towards Successful Implementation of Automated Raveling Detection: Effects of Training Data Size, Illumination Difference, and Spatial Shift

Xinan Zhang, Haolin Wang, Zhongyu Yang, Yi-Chang, Tsai

Main category: cs.CV

TL;DR: A benchmark called RavelingArena is proposed to evaluate model robustness for asphalt pavement raveling detection, addressing performance degradation in diverse real-world conditions through controlled variation experiments.

Details

Motivation: Current machine learning methods for raveling detection degrade in large-scale deployments due to diverse inference data from different runs, sensors, and environmental conditions, highlighting the need for more generalizable and robust solutions.

Method: Proposed RavelingArena benchmark built by augmenting existing dataset with controlled variations to evaluate model robustness. Identified and assessed variations impacting robustness including training data quantity, illumination differences, and spatial shifts.

Result: Both quantity and diversity of training data are critical for model accuracy, achieving at least 9.2% gain under most diverse conditions. Case study on multi-year test section showed significant improvements in year-to-year consistency.

Conclusion: The insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks requiring adaptability to diverse conditions, laying foundations for temporal deterioration modeling.

Abstract: Raveling, the loss of aggregates, is a major form of asphalt pavement surface distress, especially on highways. While research has shown that machine learning and deep learning-based methods yield promising results for raveling detection by classification on range images, their performance often degrades in large-scale deployments where more diverse inference data may originate from different runs, sensors, and environmental conditions. This degradation highlights the need of a more generalizable and robust solution for real-world implementation. Thus, the objectives of this study are to 1) identify and assess potential variations that impact model robustness, such as the quantity of training data, illumination difference, and spatial shift; and 2) leverage findings to enhance model robustness under real-world conditions. To this end, we propose RavelingArena, a benchmark designed to evaluate model robustness to variations in raveling detection. Instead of collecting extensive new data, it is built by augmenting an existing dataset with diverse, controlled variations, thereby enabling variation-controlled experiments to quantify the impact of each variation. Results demonstrate that both the quantity and diversity of training data are critical to the accuracy of models, achieving at least a 9.2% gain in accuracy under the most diverse conditions in experiments. Additionally, a case study applying these findings to a multi-year test section in Georgia, U.S., shows significant improvements in year-to-year consistency, laying foundations for future studies on temporal deterioration modeling. These insights provide guidance for more reliable model deployment in raveling detection and other real-world tasks that require adaptability to diverse conditions.

[162] Right Regions, Wrong Labels: Semantic Label Flips in Segmentation under Correlation Shift

Akshit Achara, Yovin Yathathugoda, Nick Byrne, Michela Antonelli, Esther Puyol Anton, Alexander Hammers, Andrew P. King

Main category: cs.CV

TL;DR: Paper introduces Flip diagnostic to quantify semantic label-flip errors in segmentation models under distribution shift, where models correctly identify foreground objects but assign wrong semantic labels due to spurious correlations.

Details

Motivation: Segmentation models can fail in subtle ways under distribution shift - they may correctly identify object boundaries but assign wrong semantic labels due to spurious correlations between non-causal features and target labels. Current robustness evaluations focus on overlap metrics but miss these semantic label-flip errors.

Method: Proposes Flip diagnostic that counts how often ground truth foreground pixels are assigned wrong foreground identity while remaining predicted as foreground. Also introduces entropy-based flip-risk score computed from foreground identity uncertainty to flag flip-prone cases at inference time without ground truth labels.

Result: In settings with category-scene correlations during training, increasing correlation widens gap between common/rare test conditions and increases within-object label swaps on counterfactual groups. Flip diagnostic reveals these semantic errors that overlap metrics miss.

Conclusion: Segmentation robustness should be assessed beyond overlap metrics by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. The proposed flip-risk score can identify vulnerable cases at inference time.

Abstract: The robustness of machine learning models can be compromised by spurious correlations between non-causal features in the input data and target labels. A common way to test for such correlations is to train on data where the label is strongly tied to some non-causal cue, then evaluate on examples where that tie no longer holds. This idea is well established for classification tasks, but for semantic segmentation the specific failure modes are not well understood. We show that a model may achieve reasonable overlap while assigning the wrong semantic label, swapping one plausible foreground class for another, even when object boundaries are largely correct. We focus on this semantic label-flip behaviour and quantify it with a simple diagnostic (Flip) that counts how often ground truth foreground pixels are assigned the wrong foreground identity while remaining predicted as foreground. In a setting where category and scene are correlated during training, increasing the correlation consistently widens the gap between common and rare test conditions and increases these within-object label swaps on counterfactual groups. Overall, our results motivate assessing segmentation robustness under distribution shift beyond overlap by decomposing foreground errors into correct pixels, flipped-identity pixels, and missed-to-background pixels. We also propose an entropy-based, ground truth label-free `flip-risk’ score, which is computed from foreground identity uncertainty, and show that it can flag flip-prone cases at inference time. Code is available at https://github.com/acharaakshit/label-flips.

[163] SSD-GS: Scattering and Shadow Decomposition for Relightable 3D Gaussian Splatting

Iris Zheng, Guojun Tang, Alexander Doronin, Paul Teal, Fang-Lue Zhang

Main category: cs.CV

TL;DR: SSD-GS is a physically-based relighting framework using 3D Gaussian Splatting that decomposes reflectance into four components (diffuse, specular, shadow, subsurface scattering) for high-quality reconstruction and photorealistic relighting under novel lighting conditions.

Details

Motivation: Existing 3DGS-based relighting methods use coarse shading decompositions (only diffuse/specular or neural approximations) leading to limited fidelity and poor physical interpretability, especially for anisotropic metals and translucent materials.

Method: Decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. Introduces learnable dipole-based scattering module for subsurface transport, occlusion-aware shadow formulation with visibility estimates and refinement network, and enhanced specular component with anisotropic Fresnel-based model. Uses progressive integration during training.

Result: Demonstrates superior quantitative and perceptual relighting quality compared to prior methods on challenging OLAT dataset. Effectively disentangles lighting and material properties even for unseen illumination conditions.

Conclusion: SSD-GS achieves high-quality physically-based relighting with better fidelity and interpretability, enabling downstream tasks like controllable light source editing and interactive scene relighting.

Abstract: We present SSD-GS, a physically-based relighting framework built upon 3D Gaussian Splatting (3DGS) that achieves high-quality reconstruction and photorealistic relighting under novel lighting conditions. In physically-based relighting, accurately modeling light-material interactions is essential for faithful appearance reproduction. However, existing 3DGS-based relighting methods adopt coarse shading decompositions, either modeling only diffuse and specular reflections or relying on neural networks to approximate shadows and scattering. This leads to limited fidelity and poor physical interpretability, particularly for anisotropic metals and translucent materials. To address these limitations, SSD-GS decomposes reflectance into four components: diffuse, specular, shadow, and subsurface scattering. We introduce a learnable dipole-based scattering module for subsurface transport, an occlusion-aware shadow formulation that integrates visibility estimates with a refinement network, and an enhanced specular component with an anisotropic Fresnel-based model. Through progressive integration of all components during training, SSD-GS effectively disentangles lighting and material properties, even for unseen illumination conditions, as demonstrated on the challenging OLAT dataset. Experiments demonstrate superior quantitative and perceptual relighting quality compared to prior methods and pave the way for downstream tasks, including controllable light source editing and interactive scene relighting. The source code is available at: https://github.com/irisfreesiri/SSD-GS.

[164] SEDTalker: Emotion-Aware 3D Facial Animation Using Frame-Level Speech Emotion Diarization

Farzaneh Jafari, Stefano Berretti, Anup Basu

Main category: cs.CV

TL;DR: SEDTalker is an emotion-aware framework for speech-driven 3D facial animation that uses frame-level speech emotion diarization for fine-grained expressive control, enabling continuous modulation of facial expressions over time.

Details

Motivation: Prior approaches rely on utterance-level or manually specified emotion labels, which lack temporal granularity for continuous emotion modulation in speech-driven facial animation. There's a need for more fine-grained, temporally dense emotion control that can capture the dynamic nature of emotional expression in speech.

Method: The method uses frame-level speech emotion diarization to predict temporally dense emotion categories and intensities directly from speech. These diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture, enabling effective disentanglement of linguistic content and emotional style.

Result: Quantitative results show strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors. Qualitative results demonstrate smooth emotion transitions and consistent expression control. The approach was evaluated on a large-scale multi-corpus dataset for speech emotion diarization and the EmoVOCA dataset for emotional 3D facial animation.

Conclusion: Frame-level emotion diarization is effective for expressive and controllable 3D talking head generation, enabling fine-grained control over facial expressions that aligns with the dynamic emotional content of speech while preserving identity and temporal coherence.

Abstract: We introduce SEDTalker, an emotion-aware framework for speech-driven 3D facial animation that leverages frame-level speech emotion diarization to achieve fine-grained expressive control. Unlike prior approaches that rely on utterance-level or manually specified emotion labels, our method predicts temporally dense emotion categories and intensities directly from speech, enabling continuous modulation of facial expressions over time. The diarized emotion signals are encoded as learned embeddings and used to condition a speech-driven 3D animation model based on a hybrid Transformer-Mamba architecture. This design allows effective disentanglement of linguistic content and emotional style while preserving identity and temporal coherence. We evaluate our approach on a large-scale multi-corpus dataset for speech emotion diarization and on the EmoVOCA dataset for emotional 3D facial animation. Quantitative results demonstrate strong frame-level emotion recognition performance and low geometric and temporal reconstruction errors, while qualitative results show smooth emotion transitions and consistent expression control. These findings highlight the effectiveness of frame-level emotion diarization for expressive and controllable 3D talking head generation.

[165] MSGS: Multispectral 3D Gaussian Splatting

Iris Zheng, Guojun Tang, Alexander Doronin, Paul Teal, Fang-Lue Zhang

Main category: cs.CV

TL;DR: Multispectral 3D Gaussian Splatting extends 3DGS with spectral radiance representation and dual-loss optimization for improved view synthesis with spectral consistency.

Details

Motivation: To enhance 3D Gaussian Splatting by incorporating multispectral information for better rendering fidelity, especially for challenging materials like translucent surfaces and anisotropic reflections, while maintaining real-time efficiency.

Method: Augment each Gaussian with spectral radiance using per-band spherical harmonics, optimize with dual-loss supervision combining RGB and multispectral signals, and perform spectral-to-RGB conversion at pixel level to retain richer spectral cues.

Result: Demonstrates consistent improvements over RGB-only 3DGS baseline in image quality and spectral consistency on both public and self-captured datasets, particularly excelling with translucent materials and anisotropic reflections.

Conclusion: The approach maintains 3DGS’s compactness and real-time efficiency while providing a foundation for future integration with physically based shading models through multispectral representation.

Abstract: We present a multispectral extension to 3D Gaussian Splatting (3DGS) for wavelength-aware view synthesis. Each Gaussian is augmented with spectral radiance, represented via per-band spherical harmonics, and optimized under a dual-loss supervision scheme combining RGB and multispectral signals. To improve rendering fidelity, we perform spectral-to-RGB conversion at the pixel level, allowing richer spectral cues to be retained during optimization. Our method is evaluated on both public and self-captured real-world datasets, demonstrating consistent improvements over the RGB-only 3DGS baseline in terms of image quality and spectral consistency. Notably, it excels in challenging scenes involving translucent materials and anisotropic reflections. The proposed approach maintains the compactness and real-time efficiency of 3DGS while laying the foundation for future integration with physically based shading models.

[166] Multi-Agent Object Detection Framework Based on Raspberry Pi YOLO Detector and Slack-Ollama Natural Language Interface

Vladimir Kalušev, Branko Brkljač, Milan Brkljač

Main category: cs.CV

TL;DR: Edge-based object detection system using LLM-based natural language interface and multi-agent orchestration on Raspberry Pi, integrating YOLO for vision, Slack chatbot, and Ollama LLM locally without cloud resources.

Details

Motivation: To demonstrate practical integration of AI agents for object detection and tracking on resource-constrained edge hardware, moving beyond traditional approaches by using LLM-based natural language interfaces for system control and communication.

Method: Multi-agent framework with YOLO-based computer vision agent for real-time object detection/tracking, Slack channel chatbot agent for natural language interface, and Ollama LLM reporting agent - all running locally on Raspberry Pi. Event-based message exchange subsystem for agent orchestration instead of fully autonomous frameworks.

Result: Successful prototype implementation showing integration of all components on single resource-constrained platform, with insights into limitations of low-cost testbed platforms for centralized multi-agent AI systems.

Conclusion: Demonstrates feasibility of edge-based multi-agent AI systems with LLM interfaces for vision tasks, highlighting fast prototyping approach enabled by generative AI systems and providing practical alternative to cloud-dependent solutions.

Abstract: The paper presents design and prototype implementation of an edge based object detection system within the new paradigm of AI agents orchestration. It goes beyond traditional design approaches by leveraging on LLM based natural language interface for system control and communication and practically demonstrates integration of all system components into a single resource constrained hardware platform. The method is based on the proposed multi-agent object detection framework which tightly integrates different AI agents within the same task of providing object detection and tracking capabilities. The proposed design principles highlight the fast prototyping approach that is characteristic for transformational potential of generative AI systems, which are applied during both development and implementation stages. Instead of specialized communication and control interface, the system is made by using Slack channel chatbot agent and accompanying Ollama LLM reporting agent, which are both run locally on the same Raspberry Pi platform, alongside the dedicated YOLO based computer vision agent performing real time object detection and tracking. Agent orchestration is implemented through a specially designed event based message exchange subsystem, which represents an alternative to completely autonomous agent orchestration and control characteristic for contemporary LLM based frameworks like the recently proposed OpenClaw. Conducted experimental investigation provides valuable insights into limitations of the low cost testbed platforms in the design of completely centralized multi-agent AI systems. The paper also discusses comparative differences between presented approach and the solution that would require additional cloud based external resources.

[167] A 3D SAM-Based Progressive Prompting Framework for Multi-Task Segmentation of Radiotherapy-induced Normal Tissue Injuries in Limited-Data Settings

Caiwen Jiang, Lei Zeng, Wei Liu

Main category: cs.CV

TL;DR: A 3D SAM-based progressive prompting framework for segmenting radiotherapy-induced normal tissue injuries in head-and-neck medical images, using text, dose-guided box, and click prompts with small-target focus loss.

Details

Motivation: Radiotherapy-induced normal tissue injury segmentation is clinically important but challenging due to limited annotations and heterogeneity across injury types, lesion sizes, and imaging modalities.

Method: Proposes a 3D SAM-based progressive prompting framework with three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement, plus a small-target focus loss for small/sparse lesions.

Result: The method achieves reliable segmentation performance across diverse injury types (ORN, CE, CRN) and outperforms state-of-the-art methods.

Conclusion: The proposed progressive prompting framework effectively addresses challenges in radiotherapy-induced injury segmentation with limited data and heterogeneous lesions.

Abstract: Radiotherapy-induced normal tissue injury is a clinically important complication, and accurate segmentation of injury regions from medical images could facilitate disease assessment, treatment planning, and longitudinal monitoring. However, automatic segmentation of these lesions remains largely unexplored because of limited voxel-level annotations and substantial heterogeneity across injury types, lesion size, and imaging modality. To address this gap, we curate a dedicated head-and-neck radiotherapy-induced normal tissue injury dataset covering three manifestations: osteoradionecrosis (ORN), cerebral edema (CE), and cerebral radiation necrosis (CRN). We further propose a 3D SAM-based progressive prompting framework for multi-task segmentation in limited-data settings. The framework progressively incorporates three complementary prompts: text prompts for task-aware adaptation, dose-guided box prompts for coarse localization, and click prompts for iterative refinement. A small-target focus loss is introduced to improve local prediction and boundary delineation for small and sparse lesions. Experiments on ORN, CE, and CRN demonstrate that the proposed method achieves reliable segmentation performance across diverse injury types and outperforms state-of-the-art methods.

[168] UniBlendNet: Unified Global, Multi-Scale, and Region-Adaptive Modeling for Ambient Lighting Normalization

Jiatao Dai, Wei Dong, Han Zhou, Chengzhou Tang, Jun Chen

Main category: cs.CV

TL;DR: UniBlendNet: A unified framework for ambient lighting normalization that improves upon IFBlend by better modeling global illumination, multi-scale structures, and region-adaptive refinement for enhanced image restoration under complex lighting conditions.

Details

Motivation: Existing ambient lighting normalization methods like IFBlend have limitations in global context modeling and spatial adaptivity, leading to suboptimal restoration in challenging regions with complex, spatially varying illumination conditions.

Method: Proposes UniBlendNet with three key components: 1) UniConvNet-based module for global illumination understanding and long-range dependencies, 2) Scale-Aware Aggregation Module (SAAM) for pyramid-based multi-scale feature aggregation with dynamic reweighting, and 3) mask-guided residual refinement for region-adaptive correction.

Result: Extensive experiments on NTIRE Ambient Lighting Normalization benchmark show UniBlendNet consistently outperforms baseline IFBlend, achieving improved restoration quality with visually more natural and stable results.

Conclusion: UniBlendNet effectively addresses limitations of existing methods by jointly modeling global illumination, multi-scale structures, and region-adaptive refinement, leading to superior ambient lighting normalization performance.

Abstract: Ambient Lighting Normalization (ALN) aims to restore images degraded by complex, spatially varying illumination conditions. Existing methods, such as IFBlend, leverage frequency-domain priors to model illumination variations, but still suffer from limited global context modeling and insufficient spatial adaptivity, leading to suboptimal restoration in challenging regions. In this paper, we propose UniBlendNet, a unified framework for ambient lighting normalization that jointly models global illumination, multi-scale structures, and region-adaptive refinement. Specifically, we enhance global illumination understanding by integrating a UniConvNet-based module to capture long-range dependencies. To better handle complex lighting variations, we introduce a Scale-Aware Aggregation Module (SAAM) that performs pyramid-based multi-scale feature aggregation with dynamic reweighting. Furthermore, we design a mask-guided residual refinement mechanism to enable region-adaptive correction, allowing the model to selectively enhance degraded regions while preserving well-exposed areas. This design effectively improves illumination consistency and structural fidelity under complex lighting conditions. Extensive experiments on the NTIRE Ambient Lighting Normalization benchmark demonstrate that UniBlendNet consistently outperforms the baseline IFBlend and achieves improved restoration quality, while producing visually more natural and stable restoration results.

[169] A Multimodal Clinically Informed Coarse-to-Fine Framework for Longitudinal CT Registration in Proton Therapy

Caiwen Jiang, Yuzhen Ding, Mi Jia, Samir H. Patel, Terence T. Sio, Jonathan B. Ashman, Lisa A. McGee, Jean-Claude M. Rwigema, William G. Rule, Sameer R. Keole, Sujay A. Vora, William W. Wong, Nathan Y. Yu, Michele Y. Halyard, Steven E. Schild, Dinggang Shen, Wei Liu

Main category: cs.CV

TL;DR: A deep learning framework for fast, clinically-informed deformable image registration in proton therapy using multimodal data including CT scans, anatomical contours, dose distributions, and treatment planning text.

Details

Motivation: Proton therapy requires accurate deformable image registration for adaptive workflows, but conventional methods are too slow and existing deep learning approaches don't utilize clinically relevant multimodal information beyond just images.

Method: Coarse-to-fine framework with dual CNN encoders for hierarchical feature extraction and transformer-based decoder. Incorporates clinical priors (target/OAR contours, dose distributions, treatment planning text) via anatomy/risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization.

Result: Evaluated on large-scale proton therapy dataset (1,222 paired CT scans). Shows consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.

Conclusion: The proposed framework integrates multimodal clinical information to achieve fast, accurate deformable registration for proton therapy adaptive workflows, outperforming existing methods.

Abstract: Proton therapy offers superior organ-at-risk sparing but is highly sensitive to anatomical changes, making accurate deformable image registration (DIR) across longitudinal CT scans essential. Conventional DIR methods are often too slow for emerging online adaptive workflows, while existing deep learning-based approaches are primarily designed for generic benchmarks and underutilize clinically relevant information beyond images. To address this gap, we propose a clinically scalable coarse-to-fine deformable registration framework that integrates multimodal information from the proton radiotherapy workflow to accommodate diverse clinical scenarios. The model employs dual CNN-based encoders for hierarchical feature extraction and a transformer-based decoder to progressively refine deformation fields. Beyond CT intensities, clinically critical priors, including target and organ-at-risk contours, dose distributions, and treatment planning text, are incorporated through anatomy- and risk-guided attention, text-conditioned feature modulation, and foreground-aware optimization, enabling anatomically focused and clinically informed deformation estimation. We evaluate the proposed framework on a large-scale proton therapy DIR dataset comprising 1,222 paired planning and repeat CT scans across multiple anatomical regions and disease types. Extensive experiments demonstrate consistent improvements over state-of-the-art methods, enabling fast and robust clinically meaningful registration.

[170] Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

Yu Wang, Sharon Li

Main category: cs.CV

TL;DR: Multimodal ICL performs comparably to text-only ICL in zero-shot but degrades significantly in few-shot settings due to lack of reasoning-level alignment between visual and textual representations and unreliable task mapping transfer.

Details

Motivation: Despite ICL's success in LLMs, its extension to multimodal settings remains poorly understood in terms of internal mechanisms and differences from text-only ICL. The paper aims to systematically analyze multimodal ICL in MLLMs.

Method: Systematic analysis of ICL in multimodal LLMs using identical task formulations across modalities. Decomposes multimodal ICL into task mapping construction and transfer, analyzes cross-modal task mapping establishment and transfer across layers. Proposes inference-stage enhancement method to reinforce task mapping transfer.

Result: Multimodal ICL performs comparably to text-only ICL in zero-shot but degrades significantly under few-shot demonstrations. Analysis reveals current models lack reasoning-level alignment between visual and textual representations and fail to reliably transfer learned task mappings to queries. Proposed enhancement method improves performance.

Conclusion: The study provides new insights into mechanisms and limitations of multimodal ICL, revealing fundamental differences from text-only ICL. Findings suggest directions for more effective multimodal adaptation through better cross-modal alignment and task mapping transfer.

Abstract: In-context learning (ICL) enables models to adapt to new tasks via inference-time demonstrations. Despite its success in large language models, the extension of ICL to multimodal settings remains poorly understood in terms of its internal mechanisms and how it differs from text-only ICL. In this work, we conduct a systematic analysis of ICL in multimodal large language models. Using identical task formulations across modalities, we show that multimodal ICL performs comparably to text-only ICL in zero-shot settings but degrades significantly under few-shot demonstrations. To understand this gap, we decompose multimodal ICL into task mapping construction and task mapping transfer, and analyze how models establish cross-modal task mappings, and transfer them to query samples across layers. Our analysis reveals that current models lack reasoning-level alignment between visual and textual representations, and fail to reliably transfer learned task mappings to queries. Guided by these findings, we further propose a simple inference-stage enhancement method that reinforces task mapping transfer. Our results provide new insights into the mechanisms and limitations of multimodal ICL and suggest directions for more effective multimodal adaptation. Our code is available \href{https://github.com/deeplearning-wisc/Multimocal-ICL-Analysis-Framework-MGI}{here}.

[171] CausalDisenSeg: A Causality-Guided Disentanglement Framework with Counterfactual Reasoning for Robust Brain Tumor Segmentation Under Missing Modalities

Bo Liu, Yulong Zou, Jin Hong

Main category: cs.CV

TL;DR: CausalDisenSeg: A causality-guided framework for robust brain tumor segmentation under incomplete MRI data by disentangling anatomical causal factors from stylistic bias factors using causal intervention and counterfactual reasoning.

Details

Motivation: Deep learning models for multimodal brain tumor segmentation suffer from robustness issues with incomplete MRI data due to modality bias, where models exploit spurious correlations rather than learning true anatomical structures. Existing feature fusion methods fail to eliminate this dependency.

Method: Proposes CausalDisenSeg, a Structural Causal Model-based framework with three-stage causal intervention: 1) Explicit causal disentanglement using CVAE with HSIC constraint to enforce orthogonality between anatomical and style features, 2) Causal representation reinforcement with Region Causality Module to ground features in physical tumor regions, 3) Counterfactual reasoning with dual-adversarial strategy to suppress residual bias effects.

Result: Significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios on BraTS 2020 dataset. Achieves state-of-the-art macro-average DSC of 84.49 on BraTS 2023 cross-dataset evaluation.

Conclusion: CausalDisenSeg effectively addresses modality bias in multimodal brain tumor segmentation through causality-guided disentanglement and counterfactual reasoning, achieving robust performance even with incomplete MRI data.

Abstract: In clinical practice, the robustness of deep learning models for multimodal brain tumor segmentation is severely compromised by incomplete MRI data. This vulnerability stems primarily from modality bias, where models exploit spurious correlations as shortcuts rather than learning true anatomical structures. Existing feature fusion methods fail to fundamentally eliminate this dependency. To address this, we propose CausalDisenSeg, a novel Structural Causal Model (SCM)-grounded framework that achieves robust segmentation via causality-guided disentanglement and counterfactual reasoning. We reframe the problem as isolating the anatomical Causal Factor from the stylistic Bias Factor. Our framework implements a three-stage causal intervention: (1) Explicit Causal Disentanglement: A Conditional Variational Autoencoder (CVAE) coupled with an HSIC constraint mathematically enforces statistical orthogonality between anatomical and style features. (2) Causal Representation Reinforcement: A Region Causality Module (RCM) explicitly grounds causal features in physical tumor regions. (3) Counterfactual Reasoning: A dual-adversarial strategy actively suppresses the residual Natural Direct Effect (NDE) of the bias, forcing its spatial attention to be mutually exclusive from the causal path. Extensive experiments on the BraTS 2020 dataset demonstrate that CausalDisenSeg significantly outperforms state-of-the-art methods in accuracy and consistency across severe missing-modality scenarios. Furthermore, cross-dataset evaluation on BraTS 2023 under the same protocol yields a state-of-the-art macro-average DSC of 84.49.

[172] DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

Main category: cs.CV

TL;DR: DF3DV-1K: A large-scale real-world dataset of 1,048 scenes with clean and cluttered images for benchmarking distractor-free radiance field methods, enabling robust evaluation and method development.

Details

Motivation: There's a lack of large-scale real-world datasets with both clean and cluttered images for distractor-free radiance fields, limiting development beyond scene-specific reconstruction approaches.

Method: Created DF3DV-1K dataset with 1,048 scenes, 89,924 images captured with consumer cameras, spanning 128 distractor types and 161 scene themes. Includes curated DF3DV-41 subset for challenging scenario evaluation. Benchmarked 9 distractor-free radiance field methods and 3D Gaussian Splatting.

Result: Dataset enables comprehensive benchmarking, identifying robust methods and challenging scenarios. Demonstrated application by fine-tuning diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on held-out sets.

Conclusion: DF3DV-1K facilitates development of distractor-free vision and promotes progress beyond scene-specific approaches by providing a comprehensive benchmarking dataset.

Abstract: Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches.

[173] Physically-Guided Optical Inversion Enable Non-Contact Side-Channel Attack on Isolated Screens

Zhiwen Zheng, Yuheng Qiao, Xiaoshuai Zhang, Zhao Huang, Tao Zhang, Huiyu Zhou, Shaowei Jiang, Jin Liu, Wenwen Tang, Xingru Huang

Main category: cs.CV

TL;DR: IR4Net uses optical projection side-channels to reconstruct screen content from diffuse reflections, addressing instability and compression issues through physical regularization and semantic reprojection.

Details

Motivation: Noncontact exfiltration of electronic screen content via side-channel attacks presents security challenges. Current optical projection methods face two core problems: (1) near-singular Jacobian spectrum causing instability in projection inversion, and (2) irreversible compression in light transport that destroys global semantic information.

Method: IR4Net (Irradiance Robust Radiometric Inversion Network) uses passive speckle patterns from diffuse reflection. It combines: 1) Physically Regularized Irradiance Approximation (PRIrr-Approximation) that embeds radiative transfer equation in learnable optimizer, 2) contour-to-detail cross-scale reconstruction to prevent noise propagation, and 3) Irreversibility Constrained Semantic Reprojection (ICSR) module to restore lost global structure through context-driven semantic mapping.

Result: Evaluated across four scene categories, IR4Net achieves higher fidelity than competing neural approaches while maintaining resilience to illumination perturbations.

Conclusion: IR4Net provides a robust solution for optical projection side-channel attacks by addressing fundamental stability and information loss problems through physical regularization and semantic reconstruction techniques.

Abstract: Noncontact exfiltration of electronic screen content poses a security challenge, with side-channel incursions as the principal vector. We introduce an optical projection side-channel paradigm that confronts two core instabilities: (i) the near-singular Jacobian spectrum of projection mapping breaches Hadamard stability, rendering inversion hypersensitive to perturbations; (ii) irreversible compression in light transport obliterates global semantic cues, magnifying reconstruction ambiguity. Exploiting passive speckle patterns formed by diffuse reflection, our Irradiance Robust Radiometric Inversion Network (IR4Net) fuses a Physically Regularized Irradiance Approximation (PRIrr-Approximation), which embeds the radiative transfer equation in a learnable optimizer, with a contour-to-detail cross-scale reconstruction mechanism that arrests noise propagation. Moreover, an Irreversibility Constrained Semantic Reprojection (ICSR) module reinstates lost global structure through context-driven semantic mapping. Evaluated across four scene categories, IR4Net achieves fidelity beyond competing neural approaches while retaining resilience to illumination perturbations.

[174] VibeFlow: Versatile Video Chroma-Lux Editing through Self-Supervised Learning

Yifan Li, Pei Cheng, Bin Fu, Shuai Yang, Jiaying Liu

Main category: cs.CV

TL;DR: VibeFlow is a self-supervised framework for video chroma-lux editing that leverages pre-trained video generation models to modify illumination and color while preserving structure and temporal coherence without requiring paired training data.

Details

Motivation: Video chroma-lux editing (modifying illumination and color while preserving structure) is challenging and existing methods require expensive supervised training with synthetic paired data, limiting practical applications.

Method: Uses disentangled data perturbation pipeline to adaptively recombine structure from source videos and color-illumination cues from reference images. Introduces Residual Velocity Fields and Structural Distortion Consistency Regularization to address discretization errors in flow-based models.

Result: Achieves impressive visual quality with reduced computational overhead, generalizes zero-shot to diverse applications including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing.

Conclusion: VibeFlow provides an effective self-supervised framework for video chroma-lux editing that eliminates need for costly training resources while maintaining structural and temporal fidelity.

Abstract: Video chroma-lux editing, which aims to modify illumination and color while preserving structural and temporal fidelity, remains a significant challenge. Existing methods typically rely on expensive supervised training with synthetic paired data. This paper proposes VibeFlow, a novel self-supervised framework that unleashes the intrinsic physical understanding of pre-trained video generation models. Instead of learning color and light transitions from scratch, we introduce a disentangled data perturbation pipeline that enforces the model to adaptively recombine structure from source videos and color-illumination cues from reference images, enabling robust disentanglement in a self-supervised manner. Furthermore, to rectify discretization errors inherent in flow-based models, we introduce Residual Velocity Fields alongside a Structural Distortion Consistency Regularization, ensuring rigorous structural preservation and temporal coherence. Our framework eliminates the need for costly training resources and generalizes in a zero-shot manner to diverse applications, including video relighting, recoloring, low-light enhancement, day-night translation, and object-specific color editing. Extensive experiments demonstrate that VibeFlow achieves impressive visual quality with significantly reduced computational overhead. Our project is publicly available at https://lyf1212.github.io/VibeFlow-webpage.

[175] Event-Adaptive State Transition and Gated Fusion for RGB-Event Object Tracking

Jinlin You, Muyu Li, Xudong Zhao

Main category: cs.CV

TL;DR: MambaTrack: A multimodal RGB-Event tracking framework using dynamic state space models with event-adaptive state transitions and gated projection fusion for robust cross-modal integration.

Details

Motivation: Existing Vision Mamba-based RGB-Event tracking methods use static state transition matrices that fail to adapt to variations in event sparsity, leading to imbalanced modeling (underfitting sparse event streams and overfitting dense ones) and degraded cross-modal fusion robustness.

Method: Proposes MambaTrack with two key innovations: 1) Event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density using a learnable scalar, and 2) Gated Projection Fusion module that projects RGB features into event feature space and generates adaptive gates from event density and RGB confidence scores to control fusion intensity.

Result: Achieves state-of-the-art performance on FE108 and FELT datasets, with lightweight design suggesting potential for real-time embedded deployment.

Conclusion: MambaTrack addresses limitations of static state transition matrices in RGB-Event tracking through dynamic adaptation to event sparsity variations, enabling robust cross-modal fusion and superior tracking performance.

Abstract: Existing Vision Mamba-based RGB-Event(RGBE) tracking methods suffer from using static state transition matrices, which fail to adapt to variations in event sparsity. This rigidity leads to imbalanced modeling-underfitting sparse event streams and overfitting dense ones-thus degrading cross-modal fusion robustness. To address these limitations, we propose MambaTrack, a multimodal and efficient tracking framework built upon a Dynamic State Space Model(DSSM). Our contributions are twofold. First, we introduce an event-adaptive state transition mechanism that dynamically modulates the state transition matrix based on event stream density. A learnable scalar governs the state evolution rate, enabling differentiated modeling of sparse and dense event flows. Second, we develop a Gated Projection Fusion(GPF) module for robust cross-modal integration. This module projects RGB features into the event feature space and generates adaptive gates from event density and RGB confidence scores. These gates precisely control the fusion intensity, suppressing noise while preserving complementary information. Experiments show that MambaTrack achieves state-of-the-art performance on the FE108 and FELT datasets. Its lightweight design suggests potential for real-time embedded deployment.

[176] MaMe & MaRe: Matrix-Based Token Merging and Restoration for Efficient Visual Perception and Synthesis

Simin Huo, Ning Li

Main category: cs.CV

TL;DR: MaMe is a GPU-friendly, training-free token merging method for Vision Transformers that uses matrix operations to accelerate inference, with MaRe for token restoration enabling image synthesis applications.

Details

Motivation: Existing token compression methods for Vision Transformers use GPU-inefficient operations (sorting, scattered writes) that introduce overhead, limiting their effectiveness. There's a need for more efficient token merging that leverages GPU-friendly matrix operations.

Method: MaMe is a differentiable token merging method based entirely on matrix operations, making it GPU-friendly. It works without training and can be applied to pre-trained models. MaRe is its inverse operation for token restoration, forming a MaMe+MaRe pipeline for image synthesis tasks.

Result: MaMe doubles ViT-B throughput with only 2% accuracy drop, and fine-tuning boosts accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, provides 1.3x acceleration with negligible degradation. Accelerates VideoMAE-L by 48.5% on Kinetics-400 with 0.84% accuracy loss. MaMe+MaRe pipeline reduces Stable Diffusion v2.1 generation latency by 31% while enhancing quality.

Conclusion: MaMe and MaRe provide effective GPU-friendly methods for accelerating vision models through efficient token compression and restoration, demonstrating significant speed improvements with minimal accuracy loss across various vision tasks including classification, video understanding, and image synthesis.

Abstract: Token compression is crucial for mitigating the quadratic complexity of self-attention mechanisms in Vision Transformers (ViTs), which often involve numerous input tokens. Existing methods, such as ToMe, rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that limit their effectiveness. We introduce MaMe, a training-free, differentiable token merging method based entirely on matrix operations, which is GPU-friendly to accelerate ViTs. Additionally, we present MaRe, its inverse operation, for token restoration, forming a MaMe+MaRe pipeline for image synthesis. When applied to pre-trained models, MaMe doubles ViT-B throughput with a 2% accuracy drop. Notably, fine-tuning the last layer with MaMe boosts ViT-B accuracy by 1.0% at 1.1x speed. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation. In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with only a 0.84% accuracy loss. Furthermore, MaMe achieves simultaneous improvements in both performance and speed on some tasks. In image synthesis, the MaMe+MaRe pipeline enhances quality while reducing Stable Diffusion v2.1 generation latency by 31%. Collectively, these results demonstrate MaMe’s and MaRe’s effectiveness in accelerating vision models. The code is available at https://github.com/cominder/mame}{https://github.com/cominder/mame.

[177] A Study of Failure Modes in Two-Stage Human-Object Interaction Detection

Lemeng Wang, Qinqian Lei, Vidhi Bakshi, Daniel Yi, Yifan Liu, Jiacheng Hou, Asher Seng Hao, Zheda Mai, Wei-Lun Chao, Robby T. Tan, Bo Wang

Main category: cs.CV

TL;DR: Analysis of failure modes in two-stage human-object interaction (HOI) detection models, focusing on complex scenes with multiple people and rare interactions rather than overall benchmark performance.

Details

Motivation: Current HOI detection evaluations focus mainly on overall accuracy but provide limited insight into why models fail, especially in complex scenes with multiple people and rare interaction combinations. The paper aims to understand failure modes rather than just measure performance.

Method: Decomposes HOI detection into multiple interpretable perspectives and analyzes model behavior across these dimensions. Curates a subset of images from existing HOI dataset organized by human-object-interaction configurations (multi-person interactions, object sharing) to examine different failure patterns.

Result: Analysis reveals that high overall benchmark performance doesn’t necessarily reflect robust visual reasoning about human-object relationships. Models struggle with complex scene compositions, particularly involving multiple people and rare interaction combinations.

Conclusion: The study provides insights into limitations of HOI models and offers observations for future research, emphasizing the need to move beyond overall accuracy metrics to understand model reasoning capabilities in complex visual scenes.

Abstract: Human-object interaction (HOI) detection aims to detect interactions between humans and objects in images. While recent advances have improved performance on existing benchmarks, their evaluations mainly focus on overall prediction accuracy and provide limited insight into the underlying causes of model failures. In particular, modern models often struggle in complex scenes involving multiple people and rare interaction combinations. In this work, we present a study to better understand the failure modes of two-stage HOI models, which form the basis of many current HOI detection approaches. Rather than constructing a large-scale benchmark, we instead decompose HOI detection into multiple interpretable perspectives and analyze model behavior across these dimensions to study different types of failure patterns. We curate a subset of images from an existing HOI dataset organized by human-object-interaction configurations (e.g., multi-person interactions and object sharing), and analyze model behavior under these configurations to examine different failure modes. This design allows us to analyze how these HOI models behave under different scene compositions and why their predictions fail. Importantly, high overall benchmark performance does not necessarily reflect robust visual reasoning about human-object relationships. We hope that this study can provide useful insights into the limitations of HOI models and offer observations for future research in this area.

[178] Enhanced Text-to-Image Generation by Fine-grained Multimodal Reasoning

Yongjin Kim, Yoonjin Oh, Yerin Kim, Hyomin Kim, Jeeyoung Yun, Yujung Heo, Minjun Kim, Sungwoong Kim

Main category: cs.CV

TL;DR: FiMR is a framework that uses decomposed VQA for fine-grained self-reflection and refinement in text-to-image generation with MLLMs, improving prompt alignment through detailed attribute verification.

Details

Motivation: Current unified MLLMs have strong reasoning capabilities but their use in text-to-image generation is underexplored. Existing multimodal reasoning methods rely on holistic image-text alignment without fine-grained reflection on detailed prompt attributes, limiting precise control over generated images.

Method: FiMR decomposes input prompts into minimal semantic units (entities and attributes), verifies each unit via visual question answering (VQA), generates explicit fine-grained feedback, and applies targeted localized refinements based on this feedback.

Result: Extensive experiments show FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks, demonstrating improved image-prompt alignment and generation quality.

Conclusion: FiMR enables MLLMs to achieve more precise improvements in text-to-image generation through fine-grained self-reasoning and self-refinement, addressing limitations of holistic alignment approaches and enhancing detailed attribute control.

Abstract: With the rapid progress of Multimodal Large Language Models (MLLMs), unified MLLMs that jointly perform image understanding and generation have advanced significantly. However, despite the inherent reasoning capabilities of unified MLLMs for self-reflection and self-refinement, their use in text-to-image generation remains largely underexplored. Meanwhile, existing multimodal reasoning-based image generation methods mostly rely on holistic image-text alignment judgments, without fine-grained reflection and refinement of detailed prompt attributes, leading to limited fine-grained control. Therefore, we propose Fine-grained Multimodal Reasoning (FiMR), a framework that leverages decomposed visual question answering (VQA) to break down an input prompt into minimal semantic units-such as entities and attributes-and verify each unit via VQA to generate explicit, fine-grained feedback. Based on this feedback, FiMR then applies targeted, localized refinements. This fine-grained self-reasoning and self-refinement enable MLLMs to achieve more precise improvements in image-prompt alignment and overall generation quality at test time. Extensive experiments demonstrate that FiMR consistently outperforms image generation baselines, including reasoning-based methods, particularly on compositional text-to-image benchmarks.

[179] ADP-DiT: Text-Guided Diffusion Transformer for Brain Image Generation in Alzheimer’s Disease Progression

Juneyong Lee, Geonwoo Baek, Ikbeom Jang

Main category: cs.CV

TL;DR: ADP-DiT: A transformer-based diffusion model for generating longitudinal Alzheimer’s disease MRI scans with fine-grained control over follow-up time and clinical metadata using natural language prompts.

Details

Motivation: Alzheimer's disease progresses differently across individuals, creating need for subject-specific MRI synthesis to assess disease progression. Current methods lack clinically interpretable control over follow-up time and patient metadata in longitudinal AD MRI generation.

Method: ADP-DiT uses interval-aware, clinically text-conditioned diffusion transformer with dual text encoders (OpenCLIP for vision-language alignment and T5 for clinical understanding). Conditions include follow-up interval, demographics, diagnosis, and neuropsychological data as natural language prompts. Uses cross-attention for fine-grained guidance and adaptive layer normalization for global modulation, with rotary positional embeddings and SDXL-VAE latent space for high-resolution reconstruction.

Result: Achieved SSIM 0.8739 and PSNR 29.32 dB on 3,321 longitudinal 3T T1-weighted scans from 712 participants, improving over DiT baseline by +0.1087 SSIM and +6.08 dB PSNR. Successfully captured progression-related changes like ventricular enlargement and hippocampal shrinkage.

Conclusion: Integrating comprehensive subject-specific clinical conditions with transformer architectures can significantly improve longitudinal AD MRI synthesis, enabling time-specific control beyond coarse diagnostic stages.

Abstract: Alzheimer’s disease (AD) progresses heterogeneously across individuals, motivating subject-specific synthesis of follow-up magnetic resonance imaging (MRI) to support progression assessment. While Diffusion Transformers (DiT), an emerging transformer-based diffusion model, offer a scalable backbone for image synthesis, longitudinal AD MRI generation with clinically interpretable control over follow-up time and participant metadata remains underexplored. We present ADP-DiT, an interval-aware, clinically text-conditioned diffusion transformer for longitudinal AD MRI synthesis. ADP-DiT encodes follow-up interval together with multi-domain demographic, diagnostic (CN/MCI/AD), and neuropsychological information as a natural-language prompt, enabling time-specific control beyond coarse diagnostic stages. To inject this conditioning effectively, we use dual text encoders-OpenCLIP for vision-language alignment and T5 for richer clinical-language understanding. Their embeddings are fused into DiT through cross-attention for fine-grained guidance and adaptive layer normalization for global modulation. We further enhance anatomical fidelity by applying rotary positional embeddings to image tokens and performing diffusion in a pre-trained SDXL-VAE latent space to enable efficient high-resolution reconstruction. On 3,321 longitudinal 3T T1-weighted scans from 712 participants (259,038 image slices), ADP-DiT achieves SSIM 0.8739 and PSNR 29.32 dB, improving over a DiT baseline by +0.1087 SSIM and +6.08 dB PSNR while capturing progression-related changes such as ventricular enlargement and shrinking hippocampus. These results suggest that integrating comprehensive, subject-specific clinical conditions with architectures can improve longitudinal AD MRI synthesis.

[180] Enhancing Mixture-of-Experts Specialization via Cluster-Aware Upcycling

Sanghyeok Chu, Pyunghwan Ahn, Gwangmo Song, SeungHwan Kim, Honglak Lee, Bohyung Han

Main category: cs.CV

TL;DR: Cluster-aware Upcycling improves Mixture-of-Experts initialization by partitioning dense model activations into semantic clusters and initializing experts with cluster-specific subspaces, breaking expert symmetry and enabling early specialization.

Details

Motivation: Standard Sparse Upcycling initializes all MoE experts from identical pretrained dense weights, leading to expert symmetry and limited early specialization due to random router initialization. This paper aims to incorporate semantic structure into MoE initialization to address these issues.

Method: 1) Partition dense model’s input activations into semantic clusters; 2) Initialize each expert using subspace representations of its corresponding cluster via truncated SVD; 3) Set router’s initial weights to cluster centroids; 4) Introduce expert-ensemble self-distillation loss for stable training using ensemble teacher guidance.

Result: Outperforms existing methods on CLIP ViT-B/32 and ViT-B/16 across zero-shot and few-shot benchmarks. Produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.

Conclusion: Cluster-aware Upcycling effectively breaks expert symmetry and encourages early specialization aligned with data distribution, providing superior MoE initialization compared to standard approaches.

Abstract: Sparse Upcycling provides an efficient way to initialize a Mixture-of-Experts (MoE) model from pretrained dense weights instead of training from scratch. However, since all experts start from identical weights and the router is randomly initialized, the model suffers from expert symmetry and limited early specialization. We propose Cluster-aware Upcycling, a strategy that incorporates semantic structure into MoE initialization. Our method first partitions the dense model’s input activations into semantic clusters. Each expert is then initialized using the subspace representations of its corresponding cluster via truncated SVD, while setting the router’s initial weights to the cluster centroids. This cluster-aware initialization breaks expert symmetry and encourages early specialization aligned with the data distribution. Furthermore, we introduce an expert-ensemble self-distillation loss that stabilizes training by providing reliable routing guidance using an ensemble teacher. When evaluated on CLIP ViT-B/32 and ViT-B/16, Cluster-aware Upcycling consistently outperforms existing methods across both zero-shot and few-shot benchmarks. The proposed method also produces more diverse and disentangled expert representations, reduces inter-expert similarity, and leads to more confident routing behavior.

[181] DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

Hengye Lyu, Zisu Li, Yue Hong, Yueting Weng, Jiaxin Shi, Hanwang Zhang, Chen Liang

Main category: cs.CV

TL;DR: RTR-DiT is a real-time video stylization framework using Diffusion Transformers that enables stable, consistent long video processing with support for both text-guided and reference-guided stylization through teacher fine-tuning and distillation techniques.

Details

Motivation: Existing diffusion-based video stylization methods struggle with stability and consistency in long videos, have high computational costs, and multi-step denoising makes them impractical for real-time applications.

Method: Fine-tune a bidirectional teacher model on curated video stylization dataset, then distill into few-step autoregressive model using Self Forcing and Distribution Matching Distillation. Propose reference-preserving KV cache update strategy for stable long video processing and real-time style switching.

Result: Outperforms existing methods in both text-guided and reference-guided video stylization tasks in quantitative metrics and visual quality. Demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.

Conclusion: RTR-DiT provides an effective solution for real-time video stylization with stable long video processing, addressing limitations of existing diffusion-based methods while supporting flexible text and reference guidance.

Abstract: Recent advances in video generation models has significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a steaming video stylization framework built upon Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.

[182] Free Lunch for Unified Multimodal Models: Enhancing Generation via Reflective Rectification with Inherent Understanding

Yibo Jiang, Tao Wu, Rui Jiang, Yehao Lu, Chaoxiang Cai, Zequn Qin, Xi Li

Main category: cs.CV

TL;DR: UniRect-CoT: A training-free framework that uses a chain-of-thought approach to align UMMs’ visual understanding with generation by treating diffusion denoising as visual reasoning and using self-supervision to rectify intermediate results.

Details

Motivation: Unified Multimodal Models (UMMs) have a significant capability mismatch where their visual understanding far outperforms their generation capabilities. The rich internal knowledge in these models remains underactivated during generation tasks, similar to how humans need to continuously reflect and activate knowledge while drawing.

Method: Proposes UniRect-CoT, a training-free unified rectification chain-of-thought framework that treats the diffusion denoising process in UMMs as an intrinsic visual reasoning process. It aligns intermediate generation results with the target instruction understood by the model, using this as self-supervisory signal to rectify generation.

Result: Extensive experiments show that UniRect-CoT can be easily integrated into existing UMMs and significantly enhances generation quality across diverse complex tasks without requiring additional training.

Conclusion: The proposed framework successfully addresses the capability mismatch in UMMs by activating their internal knowledge during generation through a thinking-while-drawing inspired approach, improving generation quality across various tasks.

Abstract: Unified Multimodal Models (UMMs) aim to integrate visual understanding and generation within a single structure. However, these models exhibit a notable capability mismatch, where their understanding capability significantly outperforms their generation. This mismatch indicates that the model’s rich internal knowledge, while effective for understanding tasks, remains underactivated during generation. To address this, we draw inspiration from the human Thinking-While-Drawing'' paradigm, where humans continuously reflect to activate their knowledge and rectify intermediate results. In this paper, we propose UniRect-CoT, a training-free unified rectification chain-of-thought framework. Our approach unlocks the free lunch’’ hidden in the UMM’s powerful inherent understanding to continuously reflect, activating its internal knowledge and rectifying intermediate results during generation.We regard the diffusion denoising process in UMMs as an intrinsic visual reasoning process and align the intermediate results with the target instruction understood by the model, serving as a self-supervisory signal to rectify UMM generation.Extensive experiments demonstrate that UniRect-CoT can be easily integrated into existing UMMs, significantly enhancing generation quality across diverse complex tasks.

[183] Reconstruction of a 3D wireframe from a single line drawing via generative depth estimation

Elton Cao, Hod Lipson

Main category: cs.CV

TL;DR: A generative approach using Latent Diffusion Model with ControlNet-style conditioning to convert 2D sketches into 3D models via conditional dense depth estimation, enabling flexible “draw in 3D” workflow.

Details

Motivation: Traditional sketch-to-3D methods rely on brittle symbolic logic or rigid parametric CAD primitives, limiting creative freedom. There's a need for more flexible approaches that bridge human creativity with digital fabrication while handling inherent ambiguities in orthographic projections.

Method: Frames reconstruction as conditional dense depth estimation using Latent Diffusion Model (LDM) with ControlNet-style conditioning. Introduces graph-based BFS masking strategy for partial depth cues to support iterative “sketch-reconstruct-sketch” workflow. Trained on over 1 million image-depth pairs from ABC Dataset.

Result: Demonstrates robust performance across varying shape complexities, providing scalable pipeline for converting sparse 2D line drawings into dense 3D representations without rigid CAD constraints.

Conclusion: Proposed generative approach effectively enables users to “draw in 3D” by overcoming limitations of traditional methods through diffusion-based depth estimation and iterative workflow support.

Abstract: The conversion of 2D freehand sketches into 3D models remains a pivotal challenge in computer vision, bridging the gap between human creativity and digital fabrication. Traditional line drawing reconstruction relies on brittle symbolic logic, while modern approaches are constrained by rigid parametric modeling, limiting users to predefined CAD primitives. We propose a generative approach by framing reconstruction as a conditional dense depth estimation task. To achieve this, we implement a Latent Diffusion Model (LDM) with a ControlNet-style conditioning framework to resolve the inherent ambiguities of orthographic projections. To support an iterative “sketch-reconstruct-sketch” workflow, we introduce a graph-based BFS masking strategy to simulate partial depth cues. We train and evaluate our approach using a massive dataset of over one million image-depth pairs derived from the ABC Dataset. Our framework demonstrates robust performance across varying shape complexities, providing a scalable pipeline for converting sparse 2D line drawings into dense 3D representations, effectively allowing users to “draw in 3D” without the rigid constraints of traditional CAD.

[184] AI Powered Image Analysis for Phishing Detection

K. Acharya, S. Ale, R. Kadel

Main category: cs.CV

TL;DR: Deep learning approach using webpage screenshots for visual phishing detection, comparing ConvNeXt-Tiny and Vision Transformer models with threshold-aware evaluation.

Details

Motivation: Phishing websites increasingly use visual imitation (logos, layouts, colors) to evade text- and URL-based detection systems, requiring image-based approaches.

Method: Used webpage screenshots for image-based detection, tested two vision models (ConvNeXt-Tiny and ViT-Base) with transfer learning from ImageNet weights, evaluated with threshold-aware metrics across different decision thresholds.

Result: ConvNeXt-Tiny performed best overall with highest F1-score at optimized threshold and better computational efficiency than ViT-Base, demonstrating convolutional models’ strength for visual phishing detection.

Conclusion: Convolutional models are effective for visual phishing detection, threshold tuning is crucial for real-world deployment, and the curated dataset will be released for reproducibility.

Abstract: Phishing websites now rely heavily on visual imitation-copied logos, similar layouts, and matching colours-to avoid detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were tested to see how well they handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning with ImageNet weights, and evaluation using different decision thresholds. The results show that ConvNeXt-Tiny performs the best overall, achieving the highest F1-score at the optimised threshold and running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and shows why threshold tuning is important for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work places greater emphasis on threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance and false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.

[185] CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling

Shivika, Kartik Bose, Pankaj Gupta

Main category: cs.CV

TL;DR: Investigating training batch composition effects on 3D medical vision-language models shows that explicit class balancing hurts performance compared to random sampling with anatomical subsection diversity.

Details

Motivation: While vision-language models show strong zero-shot diagnostic capabilities in medical imaging, the effect of training batch composition on learned representations remains unexplored for 3D medical imaging, particularly how normal-to-abnormal ratios and data scaling affect performance.

Method: Reproduced Merlin dual-encoder model aligning 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss. Investigated two axes: (1) controlling normal-to-abnormal ratio in training batches (25:75, 50:50, 75:25) using section-level balanced sampling, and (2) data scaling ablations on subset (20%, 40%, 100% of data). Compared balanced vs. unbalanced sampling strategies.

Result: All balanced configurations underperformed unbalanced baseline by 2.4-2.8 points (best balanced: 72.02% vs baseline: 74.45%). Performance scaled sub-linearly with data (65.26% to 71.88%). Enforcing 50:50 balanced sampling on subset further degraded performance to 68.01%. Stochastic diversity of random sampling with anatomical subsection batching provides better regularization than engineered class ratios.

Conclusion: Explicit class balancing hurts performance regardless of dataset or balancing granularity. The stochastic diversity of random sampling combined with anatomical subsection batching provides more effective regularization than engineered class ratios at small batch sizes required by 3D medical volumes.

Abstract: Vision-language models trained with contrastive learning on paired medical images and reports show strong zero-shot diagnostic capabilities, yet the effect of training batch composition on learned representations remains unexplored for 3D medical imaging. We reproduce Merlin, a dual-encoder model that aligns 3D abdominal CT volumes with radiology reports using symmetric InfoNCE loss, achieving a zero-shot macro F1 of 74.45% across 30 findings (original: 73.00%). We then investigate two axes of variation. First, we control the normal-to-abnormal ratio within training batches at 25:75, 50:50, and 75:25 using section-level balanced sampling on the full dataset. All three configurations underperform the unbalanced baseline by 2.4 to 2.8 points, with 75:25 achieving the best result (72.02%) among balanced variants. Second, we conduct data scaling ablations on a 4,362-study subset, training with 20%, 40%, and 100% of the data. Performance scales sub-linearly from 65.26% to 71.88%, with individual findings varying dramatically in data sensitivity. Enforcing 50:50 balanced sampling on the same subset further degrades performance to 68.01%, confirming that explicit class balancing hurts regardless of dataset or balancing granularity. Our results indicate that the stochastic diversity of random sampling, combined with Merlin’s alternating batching over anatomical subsections, provides more effective regularization than engineered class ratios at the small batch sizes required by 3D medical volumes.

[186] UHR-BAT: Budget-Aware Token Compression Vision-Language model for Ultra-High-Resolution Remote Sensing

Yunkai Dang, Minxin Dai, Yuekun Yang, Zhangnan Li, Wenbin Li, Feng Miao, Yang Gao

Main category: cs.CV

TL;DR: UHR-BAT is a query-guided token compression framework for ultra-high-resolution remote sensing imagery that efficiently selects visual tokens while preserving query-critical details.

Details

Motivation: Ultra-high-resolution remote sensing imagery has vast spatial scale causing quadratic explosion of visual tokens, making it difficult to extract information from small objects. Existing methods either lose critical details or have unpredictable computational costs.

Method: Proposes UHR-BAT with text-guided, multi-scale importance estimation for visual tokens and region-wise preserve/merge strategies to reduce token redundancy under strict context budget.

Result: Achieves state-of-the-art performance across various benchmarks for ultra-high-resolution remote sensing tasks.

Conclusion: UHR-BAT provides an effective solution for efficient token compression in ultra-high-resolution imagery while maintaining query-critical information.

Abstract: Ultra-high-resolution (UHR) remote sensing imagery couples kilometer-scale context with query-critical evidence that may occupy only a few pixels. Such vast spatial scale leads to a quadratic explosion of visual tokens and hinders the extraction of information from small objects. Previous works utilize direct downsampling, dense tiling, or global top-k pruning, which either compromise query-critical image details or incur unpredictable compute. In this paper, we propose UHR-BAT, a query-guided and region-faithful token compression framework to efficiently select visual tokens under a strict context budget. Specifically, we leverage text-guided, multi-scale importance estimation for visual tokens, effectively tackling the challenge of achieving precise yet low-cost feature extraction. Furthermore, by introducing region-wise preserve and merge strategies, we mitigate visual token redundancy, further driving down the computational budget. Experimental results show that UHR-BAT achieves state-of-the-art performance across various benchmarks. Code will be available at https://github.com/Yunkaidang/UHR.

[187] ZoomSpec: A Physics-Guided Coarse-to-Fine Framework for Wideband Spectrum Sensing

Zhentao Yang, Yixiang Luomei, Zhuoyang Liu, Zhenyu Liu, Feng Xu

Main category: cs.CV

TL;DR: ZoomSpec: A physics-guided coarse-to-fine framework for wideband spectrum sensing that integrates signal processing priors with deep learning to overcome domain mismatch in existing approaches.

Details

Motivation: Existing data-driven approaches for wideband spectrum sensing treat spectrograms as natural images, suffering from domain mismatch by neglecting time-frequency resolution constraints and spectral leakage, leading to poor narrowband visibility.

Method: Proposes ZoomSpec with three key components: 1) Log-Space STFT (LS-STFT) to overcome geometric bottleneck of linear spectrograms, 2) Coarse Proposal Net (CPN) for rapid full-band screening, 3) Adaptive Heterodyne Low-Pass (AHLP) module for center-frequency aligning and bandwidth-matched filtering, and 4) Fine Recognition Net (FRN) that fuses purified time-domain I/Q with spectral magnitude via dual-domain attention.

Result: Achieves state-of-the-art 78.1 mAP@0.5:0.95 on the SpaceNet real-world dataset, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.

Conclusion: ZoomSpec effectively integrates signal processing priors with deep learning to address domain mismatch in wideband spectrum sensing, demonstrating superior performance and stability for low-altitude monitoring applications.

Abstract: Wideband spectrum sensing for low-altitude monitoring is critical yet challenging due to heterogeneous protocols,large bandwidths, and non-stationary SNR. Existing data-driven approaches treat spectrograms as natural images,suffering from domain mismatch: they neglect time-frequency resolution constraints and spectral leakage, leading topoor narrowband visibility. This paper proposes ZoomSpec, a physics-guided coarse-to-fine framework integrating signal processing priors with deep learning. We introduce a Log-Space STFT (LS-STFT) to overcome the geometric bottleneck of linear spectrograms, sharpening narrowband structures while maintaining constant relative resolution. A lightweight Coarse Proposal Net (CPN) rapidly screens the full band. To bridge coarse detection and fine recognition, we design an Adaptive Heterodyne Low-Pass (AHLP) module that executes center-frequency aligning, bandwidth-matched filtering, and safe decimation, purifying signals of out-of-band interference. A Fine Recognition Net (FRN) fuses purified time-domain I/Q with spectral magnitude via dual-domain attention to jointly refine temporal boundaries and modulation classification. Evaluations on the SpaceNet real-world dataset demonstrate state-of-the-art 78.1 mAP@0.5:0.95, surpassing existing leaderboard systems with superior stability across diverse modulation bandwidths.

[188] Radar-Informed 3D Multi-Object Tracking under Adverse Conditions

Bingxue Xu, Emil Hedemalm, Ajinkya Khoche, Patric Jensfelt

Main category: cs.CV

TL;DR: RadarMOT: A radar-informed 3D multi-object tracking framework that uses radar point clouds to refine state estimation and recover detector misses, improving tracking accuracy at long ranges and in adverse weather conditions.

Details

Motivation: Existing multi-modal fusion methods treat radar as just another learned feature, which reduces radar's robustness advantages when overall models degrade in difficult conditions like adverse weather or long ranges. The paper aims to explicitly leverage radar data to improve 3D MOT robustness.

Method: Proposes RadarMOT framework that uses radar point cloud data as additional observation to refine state estimation and recover detector misses at long ranges. Unlike existing methods that treat radar as learned features, this approach explicitly incorporates radar observations.

Result: Evaluations on MAN-TruckScenes dataset show RadarMOT consistently improves Average Multi-Object Tracking Accuracy (AMOTA) with absolute 12.7% improvement at long range and 10.3% improvement in adverse weather conditions.

Conclusion: RadarMOT effectively leverages radar data to enhance 3D MOT robustness, particularly for long-range tracking and adverse weather scenarios, demonstrating the value of explicit radar-informed approaches over learned feature fusion methods.

Abstract: The challenge of 3D multi-object tracking (3D MOT) is achieving robustness in real-world applications, for example under adverse conditions and maintaining consistency as distance increases. To overcome these challenges, sensor fusion approaches that combine LiDAR, cameras, and radar have emerged. However, existing multi-modal fusion methods usually treat radar as another learned feature inside the network. When the overall model degrades in difficult environmental conditions, the robustness advantages that radar could provide are also reduced. We propose RadarMOT, a radar-informed 3D MOT framework that explicitly uses radar point cloud data as additional observation to refine state estimation and recover detector misses at long ranges. Evaluations on the MAN-TruckScenes dataset show that RadarMOT consistently improves the Average Multi-Object Tracking Accuracy (AMOTA) with absolute 12.7% at long range and 10.3% in adverse weather. The code will be available at https://github.com/bingxue-xu/radarmot

[189] SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

Qi Xia, Peishan Cong, Ziyi Wang, Yujing Sun, Qin Sun, Xinge Zhu, Mao Ye, Ruigang Yang, Yuexin Ma

Main category: cs.CV

TL;DR: SocialMirror is a diffusion-based framework for reconstructing human behavior in close-interaction scenarios from monocular videos, addressing challenges like mutual occlusions and motion ambiguity through semantic guidance and geometric constraints.

Details

Motivation: Accurate human behavior reconstruction in close-interaction scenarios is crucial for AR, sports analysis, and human-robot collaboration, but current methods struggle with severe mutual occlusions, motion ambiguity, and spatial relationship errors in monocular video reconstruction.

Method: A diffusion-based framework that: 1) uses vision-language models to generate high-level interaction descriptions guiding a semantic-guided motion infiller for hallucinating occluded bodies, and 2) employs a sequence-level temporal refiner with geometric constraints to ensure smooth motions and plausible contact relationships.

Result: State-of-the-art performance on multiple interaction benchmarks, demonstrating strong generalization across unseen datasets and in-the-wild scenarios for reconstructing interactive human meshes.

Conclusion: SocialMirror effectively addresses challenges in close-interaction human reconstruction by integrating semantic and geometric cues through a diffusion-based approach, enabling more realistic virtual interactions and motion analysis.

Abstract: Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, leading local motion ambiguity, disrupted temporal continuity and spatial relationship error. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.

[190] Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning

Danish Nazir, Antoine Hanna-Asaad, Lucas Görnhardt, Jan Piewek, Thorsten Bagdonat, Tim Fingscheidt

Main category: cs.CV

TL;DR: Efficient multi-view 3D object detection using dynamic token selection and parameter-efficient fine-tuning for ViT backbones

Details

Motivation: Existing multi-view 3D object detection methods use computationally expensive ViT backbones. Current SOTA ToC3D has limitations: fixed token selection ratios and requires full end-to-end retraining of ViT backbones.

Method: Proposes image token compensator with token selection for ViT backbones, enabling dynamic layer-wise token selection. Introduces parameter-efficient fine-tuning strategy that trains only proposed modules (1.6M parameters vs 300M+).

Result: Reduces computational complexity by 48-55%, inference latency by 9-25%, while improving mean average precision by 1.0-2.8% and NuScenes detection score by 0.4-1.2% compared to SOTA ToC3D.

Conclusion: Proposed method achieves significant efficiency gains while maintaining or improving detection performance for multi-view 3D object detection.

Abstract: Existing multi-view three-dimensional (3D) object detection approaches widely adopt large-scale pre-trained vision transformer (ViT)-based foundation models as backbones, being computationally complex. To address this problem, current state-of-the-art (SOTA) \texttt{ToC3D} for efficient multi-view ViT-based 3D object detection employs ego-motion-based relevant token selection. However, there are two key limitations: (1) The fixed layer-individual token selection ratios limit computational efficiency during both training and inference. (2) Full end-to-end retraining of the ViT backbone is required for the multi-view 3D object detection method. In this work, we propose an image token compensator combined with a token selection for ViT backbones to accelerate multi-view 3D object detection. Unlike \texttt{ToC3D}, our approach enables dynamic layer-wise token selection within the ViT backbone. Furthermore, we introduce a parameter-efficient fine-tuning strategy, which trains only the proposed modules, thereby reducing the number of fine-tuned parameters from more than $300$ million (M) to only $1.6$ M. Experiments on the large-scale NuScenes dataset across three multi-view 3D object detection approaches demonstrate that our proposed method decreases computational complexity (GFLOPs) by $48%$ … $55%$, inference latency (on an \texttt{NVIDIA-GV100} GPU) by $9%$ … $25%$, while still improving mean average precision by $1.0%$ … $2.8%$ absolute and NuScenes detection score by $0.4%$ … $1.2%$ absolute compared to so-far SOTA \texttt{ToC3D}.

[191] Dehaze-then-Splat: Generative Dehazing with Physics-Informed 3D Gaussian Splatting for Smoke-Free Novel View Synthesis

Yuchao Chen, Hanqing Wang

Main category: cs.CV

TL;DR: Two-stage pipeline for multi-view smoke removal and novel view synthesis using generative dehazing followed by 3D Gaussian Splatting with physics-informed regularization.

Details

Motivation: Address the fundamental tension in dehaze-then-reconstruct pipelines where per-image restoration quality doesn't guarantee multi-view consistency, leading to blurred renders and structural instability in 3D reconstruction.

Method: Stage 1: Generate pseudo-clean training images via per-frame generative dehazing (Nano Banana Pro) with brightness normalization. Stage 2: Train 3D Gaussian Splatting with physics-informed auxiliary losses including depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching.

Result: Achieves 20.98 dB PSNR and 0.683 SSIM for novel view synthesis on Akikaze validation scene, representing a +1.50 dB improvement over unregularized baseline.

Conclusion: MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates artifacts from cross-view inconsistencies in frame-wise generative processing for 3D reconstruction.

Abstract: We present Dehaze-then-Splat, a two-stage pipeline for multi-view smoke removal and novel view synthesis developed for Track~2 of the NTIRE 2026 3D Restoration and Reconstruction Challenge. In the first stage, we produce pseudo-clean training images via per-frame generative dehazing using Nano Banana Pro, followed by brightness normalization. In the second stage, we train 3D Gaussian Splatting (3DGS) with physics-informed auxiliary losses – depth supervision via Pearson correlation with pseudo-depth, dark channel prior regularization, and dual-source gradient matching – that compensate for cross-view inconsistencies inherent in frame-wise generative processing. We identify a fundamental tension in dehaze-then-reconstruct pipelines: per-image restoration quality does not guarantee multi-view consistency, and such inconsistency manifests as blurred renders and structural instability in downstream 3D reconstruction.Our analysis shows that MCMC-based densification with early stopping, combined with depth and haze-suppression priors, effectively mitigates these artifacts. On the Akikaze validation scene, our pipeline achieves 20.98,dB PSNR and 0.683 SSIM for novel view synthesis, a +1.50,dB improvement over the unregularized baseline.

[192] VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation

Yulu Gao, Bohao Zhang, Zongheng Tang, Jitong Liao, Wenjun Wu, Si Liu

Main category: cs.CV

TL;DR: VGGT-Segmentor (VGGT-S) is a framework that combines geometric modeling with semantic segmentation for instance-level object segmentation across egocentric and exocentric views, achieving state-of-the-art results on Ego-Exo4D benchmark.

Details

Motivation: Instance-level object segmentation across disparate egocentric and exocentric views is challenging due to severe scale, perspective, and occlusion changes. While geometry-aware models like VGGT provide feature alignment, they fail at dense prediction tasks due to pixel-level projection drift.

Method: VGGT-S leverages VGGT’s cross-view feature representation and introduces a novel Union Segmentation Head with three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement. Also proposes single-image self-supervised training strategy that eliminates need for paired annotations.

Result: On Ego-Exo4D benchmark, VGGT-S achieves 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks respectively, setting new state-of-the-art. Correspondence-free pretrained model surpasses most fully-supervised baselines.

Conclusion: VGGT-S effectively bridges the gap between geometric modeling and pixel-accurate segmentation, demonstrating strong generalization and scalability through self-supervised training.

Abstract: Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level textntion remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT’s powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.

[193] What Are We Really Measuring? Rethinking Dataset Bias in Web-Scale Natural Image Collections via Unsupervised Semantic Clustering

Amir Hossein Saleknia, Mohammad Sabokrou

Main category: cs.CV

TL;DR: Supervised dataset bias measurement using classification is flawed due to resolution artifacts; unsupervised clustering of semantic features shows web-scale datasets have minimal true semantic bias.

Details

Motivation: Current methods for measuring dataset bias rely on training classifiers to distinguish between datasets, assuming high accuracy indicates meaningful semantic differences. However, this approach may be confounded by non-semantic artifacts like resolution-based fingerprints rather than true semantic divergence.

Method: The authors demonstrate flaws in supervised classification by showing models can achieve high accuracy on non-semantic, procedurally generated images. They propose an unsupervised alternative: clustering semantically-rich features from foundational vision models to directly assess semantic similarity, deliberately avoiding supervised classification on dataset labels.

Result: When applied to major web-scale datasets, the high separability reported by supervised methods largely disappears, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by a large margin.

Conclusion: The fundamental assumption behind supervised dataset bias measurement is flawed due to resolution artifacts. Unsupervised semantic feature clustering provides a more accurate assessment, showing web-scale datasets have minimal true semantic bias despite what supervised methods suggest.

Abstract: In computer vision, a prevailing method for quantifying dataset bias is to train a model to distinguish between datasets. High classification accuracy is then interpreted as evidence of meaningful semantic differences. This approach assumes that standard image augmentations successfully suppress low-level, non-semantic cues, and that any remaining performance must therefore reflect true semantic divergence. We demonstrate that this fundamental assumption is flawed within the domain of large-scale natural image collections. High classification accuracy is often driven by resolution-based artifacts, which are structural fingerprints arising from native image resolution distributions and interpolation effects during resizing. These artifacts form robust, dataset-specific signatures that persist despite conventional image corruptions. Through controlled experiments, we show that models achieve strong dataset classification even on non-semantic, procedurally generated images, proving their reliance on superficial cues. To address this issue, we revisit this decades-old idea of dataset separability, but not with supervised classification. Instead, we introduce an unsupervised approach that measures true semantic separability. Our framework directly assesses semantic similarity by clustering semantically-rich features from foundational vision models, deliberately bypassing supervised classification on dataset labels. When applied to major web-scale datasets, the primary focus of this work, the high separability reported by supervised methods largely vanishes, with clustering accuracy dropping to near-chance levels. This reveals that conventional classification-based evaluation systematically overstates semantic bias by an overwhelming margin.

[194] ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation

Jingjing Qian, Zeyuan He, Chen Shi, Lei Xiao, Li Jiang

Main category: cs.CV

TL;DR: ESCAPE is an embodied AI system for long-horizon indoor tasks that combines episodic spatial memory with adaptive policy execution to coordinate navigation and manipulation robustly.

Details

Motivation: Existing embodied AI methods struggle with catastrophic forgetting, spatial inconsistency, and rigid execution in long-horizon indoor tasks, requiring a more robust approach to coordinate navigation and manipulation.

Method: ESCAPE uses a perception-grounding-execution workflow with: 1) Spatio-Temporal Fusion Mapping for depth-free 3D spatial memory, 2) Memory-Driven Target Grounding for interaction masks, and 3) Adaptive Execution Policy for proactive navigation and reactive manipulation.

Result: Achieves state-of-the-art on ALFRED benchmark with 65.09%/60.79% success rates in test seen/unseen environments, reduces redundant exploration, and maintains robust performance (61.24%/56.04%) without detailed guidance for long-horizon tasks.

Conclusion: ESCAPE demonstrates effective coordination of navigation and manipulation through episodic spatial memory and adaptive policy execution, enabling robust performance in complex indoor environments over long horizons.

Abstract: Coordinating navigation and manipulation with robust performance is essential for embodied AI in complex indoor environments. However, as tasks extend over long horizons, existing methods often struggle due to catastrophic forgetting, spatial inconsistency, and rigid execution. To address these issues, we propose ESCAPE (Episodic Spatial Memory Coupled with an Adaptive Policy for Execution), operating through a tightly coupled perception-grounding-execution workflow. For robust perception, ESCAPE features a Spatio-Temporal Fusion Mapping module to autoregressively construct a depth-free, persistent 3D spatial memory, alongside a Memory-Driven Target Grounding module for precise interaction mask generation. To achieve flexible action, our Adaptive Execution Policy dynamically orchestrates proactive global navigation and reactive local manipulation to seize opportunistic targets. ESCAPE achieves state-of-the-art performance on the ALFRED benchmark, reaching 65.09% and 60.79% success rates in test seen and unseen environments with step-by-step instructions. By reducing redundant exploration, our ESCAPE attains substantial improvements in path-length-weighted metrics and maintains robust performance (61.24% / 56.04%) even without detailed guidance for long-horizon tasks.

[195] VRAG-DFD: Verifiable Retrieval-Augmentation for MLLM-based Deepfake Detection

Hui Han, Shunli Wang, Yandan Zhao, Taiping Yao, Shouhong Ding

Main category: cs.CV

TL;DR: VRAG-DFD: A framework combining Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL) to enhance MLLMs for Deepfake Detection with dynamic forgery knowledge retrieval and critical reasoning capabilities.

Details

Motivation: Existing MLLM-based deepfake detection methods lack professional forgery knowledge and critical reasoning abilities, limiting their performance. The paper aims to address two key issues: providing high-quality forgery knowledge to MLLMs and enabling critical reasoning with noisy reference information.

Method: Proposes VRAG-DFD framework using RAG and RL. Constructs two datasets: Forensic Knowledge Database (FKD) for DFD knowledge annotation and Forensic Chain-of-Thought Dataset (F-CoT) for critical reasoning. Uses three-stage training: Alignment → Supervised Fine-Tuning → Group Relative Policy Optimization (GRPO) to cultivate MLLM’s critical reasoning ability.

Result: VRAG-DFD achieved state-of-the-art and competitive performance on deepfake detection generalization testing.

Conclusion: The combination of RAG and RL effectively enhances MLLMs for deepfake detection by providing dynamic forgery knowledge retrieval and developing critical reasoning capabilities, leading to improved generalization performance.

Abstract: In Deepfake Detection (DFD) tasks, researchers proposed two types of MLLM-based methods: complementary combination with small DFD detectors, or static forgery knowledge injection.The lack of professional forgery knowledge hinders the performance of these DFD-MLLMs.To solve this, we deeply considered two insightful issues: How to provide high-quality associated forgery knowledge for MLLMs? AND How to endow MLLMs with critical reasoning abilities given noisy reference information? Notably, we attempted to address above two questions with preliminary answers by leveraging the combination of Retrieval-Augmented Generation (RAG) and Reinforcement Learning (RL).Through RAG and RL techniques, we propose the VRAG-DFD framework with accurate dynamic forgery knowledge retrieval and powerful critical reasoning capabilities.Specifically, in terms of data, we constructed two datasets with RAG: Forensic Knowledge Database (FKD) for DFD knowledge annotation, and Forensic Chain-of-Thought Dataset (F-CoT), for critical CoT construction.In terms of model training, we adopt a three-stage training method (Alignment->SFT->GRPO) to gradually cultivate the critical reasoning ability of the MLLM.In terms of performance, VRAG-DFD achieved SOTA and competitive performance on DFD generalization testing.

[196] From Pixels to Nucleotides: End-to-End Token-Based Video Compression for DNA Storage

Cihan Ruan, Lebin Zhou, Bingqing Zhao, Rongduo Han, Qiming Yuan, Chenchen Zhu, Linyi Han, Liang Yang, Wei Wang, Wei Jiang, Nam Ling

Main category: cs.CV

TL;DR: HELIX is the first neural network that jointly optimizes video compression and DNA encoding, using token-based representations that naturally align with DNA’s quaternary alphabet, achieving 1.91 bits per nucleotide.

Details

Motivation: Video storage in DNA remains an open challenge requiring co-design of compression and molecular encoding, but current approaches treat these stages independently, leaving biochemical constraints and compression objectives misaligned.

Method: HELIX introduces TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding) which uses token-based representations that map to DNA’s ATCG bases, with Kronecker-structured mixing to break spatial correlations and FSM-based mapping to guarantee biochemical constraints.

Result: Achieves 1.91 bits per nucleotide through joint optimization of token distributions for visual quality, prediction under masking, and DNA synthesis efficiency.

Conclusion: Demonstrates that learned compression and molecular storage converge naturally at token representations, suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.

Abstract: DNA-based storage has emerged as a promising approach to the global data crisis, offering molecular-scale density and millennial-scale stability at low maintenance cost. Over the past decade, substantial progress has been made in storing text, images, and files in DNA – yet video remains an open challenge. The difficulty is not merely technical: effective video DNA storage requires co-designing compression and molecular encoding from the ground up, a challenge that sits at the intersection of two fields that have largely evolved independently. In this work, we present HELIX, the first end-to-end neural network jointly optimizing video compression and DNA encoding – prior approaches treat the two stages independently, leaving biochemical constraints and compression objectives fundamentally misaligned. Our key insight: token-based representations naturally align with DNA’s quaternary alphabet – discrete semantic units map directly to ATCG bases. We introduce TK-SCONE (Token-Kronecker Structured Constraint-Optimized Neural Encoding), which achieves 1.91 bits per nucleotide through Kronecker-structured mixing that breaks spatial correlations and FSM-based mapping that guarantees biochemical constraints. Unlike two-stage approaches, HELIX learns token distributions simultaneously optimized for visual quality, prediction under masking, and DNA synthesis efficiency. This work demonstrates for the first time that learned compression and molecular storage converge naturally at token representations – suggesting a new paradigm where neural video codecs are designed for biological substrates from the ground up.

[197] Beyond Voxel 3D Editing: Learning from 3D Masks and Self-Constructed Data

Yizhao Xu, Hongyuan Zhu, Caiyun Liu, Tianfu Wang, Keyu Chen, Sicheng Xu, Jiaolong Yang, Nicholas Jing Yuan, Qi Zhang

Main category: cs.CV

TL;DR: BVE is a 3D editing framework that uses a self-constructed large-scale dataset and lightweight modules to enable text-guided 3D editing while preserving local invariance in unchanged regions.

Details

Motivation: Existing 3D editing methods have limitations: multi-view editing suffers from projection losses, voxel-based editing has constraints on modifiable regions and scale, and there's a lack of large editing datasets for training and evaluation.

Method: Proposes BVE framework with self-constructed large-scale dataset, enhances foundational image-to-3D generative architecture with lightweight trainable modules for efficient semantic injection, and introduces annotation-free 3D masking strategy to preserve local invariance.

Result: Extensive experiments show BVE achieves superior performance in generating high-quality, text-aligned 3D assets while faithfully retaining visual characteristics of original input.

Conclusion: BVE addresses key challenges in 3D editing through dataset construction, efficient architecture modifications, and invariance preservation, enabling effective text-guided 3D editing.

Abstract: 3D editing refers to the ability to apply local or global modifications to 3D assets. Effective 3D editing requires maintaining semantic consistency by performing localized changes according to prompts, while also preserving local invariance so that unchanged regions remain consistent with the original. However, existing approaches have significant limitations: multi-view editing methods incur losses when projecting back to 3D, while voxel-based editing is constrained in both the regions that can be modified and the scale of modifications. Moreover, the lack of sufficiently large editing datasets for training and evaluation remains a challenge. To address these challenges, we propose a Beyond Voxel 3D Editing (BVE) framework with a self-constructed large-scale dataset specifically tailored for 3D editing. Building upon this dataset, our model enhances a foundational image-to-3D generative architecture with lightweight, trainable modules, enabling efficient injection of textual semantics without the need for expensive full-model retraining. Furthermore, we introduce an annotation-free 3D masking strategy to preserve local invariance, maintaining the integrity of unchanged regions during editing. Extensive experiments demonstrate that BVE achieves superior performance in generating high-quality, text-aligned 3D assets, while faithfully retaining the visual characteristics of the original input.

[198] Med-CAM: Minimal Evidence for Explaining Medical Decision Making

Pirzada Suhail, Aditya Anand, Amit Sethi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reliable and interpretable decision-making is essential in medical imaging, where diagnostic outcomes directly influence patient care. Despite advances in deep learning, most medical AI systems operate as opaque black boxes, providing little insight into why a particular diagnosis was reached. In this paper, we introduce Med-CAM, a framework for generating minimal and sharp maps as evidence-based explanations for Medical decision making via Classifier Activation Matching. Med-CAM trains a segmentation network from scratch to produce a mask that highlights the minimal evidence critical to model’s decision for any seen or unseen image. This ensures that the explanation is both faithful to the network’s behaviour and interpretable to clinicians. Experiments show, unlike prior spatial explanation methods, such as Grad-CAM and attention maps, which yield only fuzzy regions of relative importance, Med-CAM with its superior spatial awareness to shapes, textures, and boundaries, delivers conclusive, evidence-based explanations that faithfully replicate the model’s prediction for any given image. By explicitly constraining explanations to be compact, consistent with model activations, and diagnostic alignment, Med-CAM advances transparent AI to foster clinician understanding and trust in high-stakes medical applications such as pathology and radiology.

[199] SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

Haoran Lou, Ziyan Liu, Chunxiao Fan, Yuexin Wu, Yue Ming

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Multimodal Large Language Models (MLLMs) exhibit strong reasoning and world knowledge, yet adapting them for retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. In this work, we argue that adapting MLLMs for retrieval should focus on eliciting pre-trained representations rather than overwriting them. To this end, we propose SLQ, an effective and efficient framework that adapts a frozen MLLM into a retriever through a small set of Shared Latent Queries. Appended to the end of both text and image token sequences, these queries leverage the model’s native causal attention to serve as global aggregation interfaces, producing compact embeddings in a unified space while keeping the backbone unchanged. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive experiments show that SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K, while achieving competitive performance on MMEB and yielding substantial gains on KARR-Bench. The results demonstrate that SLQ, which preserves pre-trained representations, provides an effective and efficient framework for adapting MLLMs to retrieval.

[200] Granularity-Aware Transfer for Tree Instance Segmentation in Synthetic and Real Forests

Pankaj Deoli, Atef Tej, Anmol Ashri, Anandatirtha JS, Karsten Berns

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We address the challenge of synthetic-to-real transfer in forestry perception where real data have only coarse Tree labels while synthetic data provide fine-grained trunk/crown annotations. We introduce MGTD, a mixed-granularity dataset with 53k synthetic and 3.6k real images, and a four-stage protocol isolating domain shift and granularity mismatch. Our core contribution is granularity-aware distillation, which transfers structural priors from fine-grained synthetic teachers to a coarse-label student via logit-space merging and mask unification. Experiments show consistent mask AP gains, especially for small/distant trees, establishing a testbed for Sim-Real transfer under label granularity constraints.

[201] ReConText3D: Replay-based Continual Text-to-3D Generation

Muhammad Ahmed Ullah Khan, Muhammad Haris Bin Amir, Didier Stricker, Muhammad Zeshan Afzal

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Continual learning enables models to acquire new knowledge over time while retaining previously learned capabilities. However, its application to text-to-3D generation remains unexplored. We present ReConText3D, the first framework for continual text-to-3D generation. We first demonstrate that existing text-to-3D models suffer from catastrophic forgetting under incremental training. ReConText3D enables generative models to incrementally learn new 3D categories from textual descriptions while preserving the ability to synthesize previously seen assets. Our method constructs a compact and diverse replay memory through text-embedding k-Center selection, allowing representative rehearsal of prior knowledge without modifying the underlying architecture. To systematically evaluate continual text-to-3D learning, we introduce Toys4K-CL, a benchmark derived from the Toys4K dataset that provides balanced and semantically diverse class-incremental splits. Extensive experiments on the Toys4K-CL benchmark show that ReConText3D consistently outperforms all baselines across different generative backbones, maintaining high-quality generation for both old and new classes. To the best of our knowledge, this work establishes the first continual learning framework and benchmark for text-to-3D generation, opening a new direction for incremental 3D generative modeling. Project page is available at: https://mauk95.github.io/ReConText3D/.

[202] ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

Jie Liang, Jiahao Wu, Chao Wang, Jiayu Yang, Xiaoyun Zheng, Kaiqiang Xiong, Zhanke Wang, Jinbo Yan, Feng Gao, Ronggang Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency. The project page is available at: https://liangjie1999.github.io/ClipGStreamWeb/

[203] Design and Behavior of Sparse Mixture-of-Experts Layers in CNN-based Semantic Segmentation

Svetlana Pavlitska, Haixi Fan, Konstantin Ditschuneit, J. Marius Zöllner

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Sparse mixture-of-experts (MoE) layers have been shown to substantially increase model capacity without a proportional increase in computational cost and are widely used in transformer architectures, where they typically replace feed-forward network blocks. In contrast, integrating sparse MoE layers into convolutional neural networks (CNNs) remains inconsistent, with most prior work focusing on fine-grained MoEs operating at the filter or channel levels. In this work, we investigate a coarser, patch-wise formulation of sparse MoE layers for semantic segmentation, where local regions are routed to a small subset of convolutional experts. Through experiments on the Cityscapes and BDD100K datasets using encoder-decoder and backbone-based CNNs, we conduct a design analysis to assess how architectural choices affect routing dynamics and expert specialization. Our results demonstrate consistent, architecture-dependent improvements (up to +3.9 mIoU) with little computational overhead, while revealing strong design sensitivity. Our work provides empirical insights into the design and internal dynamics of sparse MoE layers in CNN-based dense prediction. Our code is available at https://github.com/KASTEL-MobilityLab/moe-layers/.

[204] Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

Arya Shah, Vaibhav Tripathi, Mayank Singh, Chaklam Silpasuwanchai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40$\times$ parameter range (256M–10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels. Region-of-interest analysis reveals that alignment specifically in early visual cortex (V1–V3) is a reliable negative predictor of sycophancy ($r = -0.441$, BCa 95% CI $[-0.740, -0.031]$), with all 12 leave-one-out correlations negative and the strongest effect for existence denial attacks ($r = -0.597$, $p = 0.040$). This anatomically specific relationship is absent in higher-order category-selective regions, suggesting that faithful low-level visual encoding provides a measurable anchor against adversarial linguistic override in vision-language models. We release our code on \href{https://github.com/aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation}{GitHub} and dataset on \href{https://huggingface.co/datasets/aryashah00/Gaslight-Gatekeep-V1-V3}{Hugging Face}

[205] Temporally Consistent Long-Term Memory for 3D Single Object Tracking

Jaejoon Yoo, SuBeen Lee, Yerim Jeon, Miso Lee, Jae-Pil Heo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: 3D Single Object Tracking (3D-SOT) aims to localize a target object across a sequence of LiDAR point clouds, given its 3D bounding box in the first frame. Recent methods have adopted a memory-based approach to utilize previously observed features of the target object, but remain limited to only a few recent frames. This work reveals that their temporal capacity is fundamentally constrained to short-term context due to severe temporal feature inconsistency and excessive memory overhead. To this end, we propose a robust long-term 3D-SOT framework, ChronoTrack, which preserves the temporal feature consistency while efficiently aggregating the diverse target features via long-term memory. Based on a compact set of learnable memory tokens, ChronoTrack leverages long-term information through two complementary objectives: a temporal consistency loss and a memory cycle consistency loss. The former enforces feature alignment across frames, alleviating temporal drift and improving the reliability of proposed long-term memory. In parallel, the latter encourages each token to encode diverse and discriminative target representations observed throughout the sequence via memory-point-memory cyclic walks. As a result, ChronoTrack achieves new state-of-the-art performance on multiple 3D-SOT benchmarks, demonstrating its effectiveness in long-term target modeling with compact memory while running at real-time speed of 42 FPS on a single RTX 4090 GPU. The code is available at https://github.com/ujaejoon/ChronoTrack

[206] PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation

Chen Wang, Yixin Zhu, Yongbin Zhu, Fengyuan Shi, Qi Li, Jun Wang, Zuozhu Liu, Keli Hu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate lesion segmentation in ultrasound images is essential for preventive screening and clinical diagnosis, yet remains challenging due to low contrast, blurry boundaries, and significant scale variations. Although existing deep learning-based methods have achieved remarkable performance, these methods still struggle with scale variations and indistinct tumor boundaries. To address these challenges, we propose a progressive boundary enhanced U-Net (PBE-UNet). Specially, we first introduce a scale-aware aggregation module (SAAM) that dynamically adjusts its receptive field to capture robust multi-scale contextual information. Then, we propose a boundary-guided feature enhancement (BGFE) module to enhance the feature representations. We find that there are large gaps between the narrow boundary and the wide segmentation error areas. Unlike existing methods that treat boundaries as static masks, the BGFE module progressively expands the narrow boundary prediction into broader spatial attention maps. Thus, broader spatial attention maps could effectively cover the wider segmentation error regions and enhance the model’s focus on these challenging areas. We conduct expensive experiments on four benchmark ultrasound datasets, BUSI, Dataset B, TN3K, and BP. The experimental results how that our proposed PBE-UNet outperforms state-of-the-art ultrasound image segmentation methods. The code is at https://github.com/cruelMouth/PBE-UNet.

[207] From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

Mohammad Mahdi, Nedko Savov, Danda Pani Paudel, Luc Van Gool

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Exo-to-Ego video generation aims to synthesize a first-person video from a synchronized third-person view and corresponding camera poses. While paired supervision is available, synchronized exo-ego data inherently introduces substantial spatio-temporal and geometric discontinuities, violating the smooth-motion assumptions of standard video generation benchmarks. We identify this synchronization-induced jump as the central challenge and propose Syn2Seq-Forcing, a sequential formulation that interpolates between the source and target videos to form a single continuous signal. By reframing Exo2Ego as sequential signal modeling rather than a conventional condition-output task, our approach enables diffusion-based sequence models, e.g. Diffusion Forcing Transformers (DFoT), to capture coherent transitions across frames more effectively. Empirically, we show that interpolating only the videos, without performing pose interpolation already produces significant improvements, emphasizing that the dominant difficulty arises from spatio-temporal discontinuities. Beyond immediate performance gains, this formulation establishes a general and flexible framework capable of unifying both Exo2Ego and Ego2Exo generation within a single continuous sequence model, providing a principled foundation for future research in cross-view video synthesis.

[208] Artificial intelligence application in lymphoma diagnosis with Vision Transformer using weakly supervised training

Nghia, Nguyen, Amer Wahed, Andy Quesada, Yasir Ali, Hanadi El Achi, Y. Helen Zhang, Jocelyn Ursua, Alex Banerjee, Sahib Kalra, L. Jeffrey Medeiros, Jie Xu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Vision transformers (ViT) have been shown to allow for more flexible feature detection and can outperform convolutional neural network (CNN) when pre-trained on sufficient data. Due to their promising feature detection capabilities, we deployed ViTs for morphological classification of anaplastic large cell lymphoma (ALCL) versus classic Hodgkin lymphoma (cHL). We had previously designed a ViT model which was trained on a small dataset of 1,200 image patches in fully supervised training. That model achieved a diagnostic accuracy of 100% and an F1 score of 1.0 on the independent test set. Since fully supervised training is not a practical method due to lack of expertise resources in both the training and testing phases, we conducted a recent study on a modified approach to training data (weakly supervised training) and show that labeling training image patch automatically at the slide level of each whole-slide-image is a more practical solution for clinical use of Vision Transformer. Our ViT model, trained on a larger dataset of 100,000 image patches, yields evaluation metrics with significant accuracy, F1 score, and area under the curve (AUC) at 91.85%, 0.92, and 0.98, respectively. These are respectable values that qualify this ViT model, with weakly supervised training, as a suitable tool for a deep learning module in clinical model development using automated image patch extraction.

[209] DRG-Font: Dynamic Reference-Guided Few-shot Font Generation via Contrastive Style-Content Disentanglement

Rejoy Chakraborty, Prasun Roy, Saumik Bhattacharya, Umapada Pal

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Few-shot Font Generation aims to generate stylistically consistent glyphs from a few reference glyphs. However, capturing complex font styles from a few exemplars remains challenging, and the existing methods often struggle to retain discernible local characteristics in generated samples. This paper introduces DRG-Font, a contrastive font generation strategy that learns complex glyph attributes by decomposing style and content embedding spaces. For optimal style supervision, the proposed architecture incorporates a Reference Selection (RS) Module to dynamically select the best style reference from an available pool of candidates. The network learns to decompose glyph attributes into style and shape priors through a Multi-scale Style Head Block (MSHB) and a Multi-scale Content Head Block (MCHB). For style adaptation, a Multi-Fusion Upsampling Block (MFUB) produces the target glyph by combining the reference style prior and target content prior. The proposed method demonstrates significant improvements over state-of-the-art approaches across multiple visual and analytical benchmarks.

[210] A Resource-Efficient Hybrid CNN-LSTM network for image-based bean leaf disease classification

Hye Jin Rhee, Joseph Damilola Akinyemi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Accurate and resource-efficient automated diagnosis is a cornerstone of modern agricultural expert systems. While Convolutional Neural Networks (CNNs) have established benchmarks in plant pathology, their ability to capture long-range spatial dependencies is often limited by standard pooling layers, and their high memory footprint hinders deployment on portable devices. This paper proposes a lightweight hybrid CNN-LSTM system for bean leaf disease classification. By integrating an LSTM layer to model the spatial-sequential relationships within feature maps, our hybrid architecture achieves a 94.38% accuracy while maintaining an exceptionally small footprint of 1.86 MB; a 70% reduction in size compared to traditional CNN-based systems. Furthermore, we provide a systematic evaluation of image augmentation strategies, demonstrating that tailored transformations are superior to generic combinations for maintaining the integrity of diagnostic patterns. Results on the $\textit{ibean}$ dataset confirm that the proposed system achieves state-of-the-art F1 scores of 99.22% with EfficientNet-B7+LSTM, providing a robust and scalable framework for real-time agricultural decision support in resource-constrained environments. The code and augmented datasets used in this study are publicly available on this $\href{https://github.com/HJin-R/bean_disease}{Github}$ repo.

[211] DiffMagicFace: Identity Consistent Facial Editing of Real Videos

Huanghao Yin, Shenkun Xu, Kanle Shi, Junhai Yong, Bin Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Text-conditioned image editing has greatly benefitted from the advancements in Image Diffusion Models. However, extending these techniques to facial video editing introduces challenges in preserving facial identity throughout the source video and ensuring consistency of the edited subject across frames. In this paper, we introduce DiffMagicFace, a unique video editing framework that integrates two fine-tuned models for text and image control. These models operate concurrently during inference to produce video frames that maintain identity features while seamlessly aligning with the editing semantics. To ensure the consistency of the edited videos, we develop a dataset comprising images showcasing various facial perspectives for each edited subject. The creation of a data set is achieved through rendering techniques and the subsequent application of optimization algorithms. Remarkably, our approach does not depend on video datasets but still delivers high-quality results in both consistency and content. The excellent effect holds even for complex tasks like talking head videos and distinguishing closely related categories. The videos edited using our framework exhibit parity with videos that are made using traditional rendering software. Through comparative analysis with current state-of-the-art methods, our framework demonstrates superior performance in both visual appeal and quantitative metrics.

[212] Any3DAvatar: Fast and High-Quality Full-Head 3D Avatar Reconstruction from Single Portrait Image

Yujie Gao, Yao Xiao, Xiangnan Zhu, Ya Li, Yiyi Zhang, Liqing Zhang, Jianfu Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reconstructing a complete 3D head from a single portrait remains challenging because existing methods still face a sharp quality-speed trade-off: high-fidelity pipelines often rely on multi-stage processing and per-subject optimization, while fast feed-forward models struggle with complete geometry and fine appearance details. To bridge this gap, we propose Any3DAvatar, a fast and high-quality method for single-image 3D Gaussian head avatar generation, whose fastest setting reconstructs a full head in under one second while preserving high-fidelity geometry and texture. First, we build AnyHead, a unified data suite that combines identity diversity, dense multi-view supervision, and realistic accessories, filling the main gaps of existing head data in coverage, full-head geometry, and complex appearance. Second, rather than sampling unstructured noise, we initialize from a Plücker-aware structured 3D Gaussian scaffold and perform one-step conditional denoising, formulating full-head reconstruction into a single forward pass while retaining high fidelity. Third, we introduce auxiliary view-conditioned appearance supervision on the same latent tokens alongside 3D Gaussian reconstruction, improving novel-view texture details at zero extra inference cost. Experiments show that Any3DAvatar outperforms prior single-image full-head reconstruction methods in rendering fidelity while remaining substantially faster.

[213] PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

Zebei Tong, Hongchang Chen, Yujie Lei, Gang Chen, Yushi Liu, Zhi Zheng, Hao Chen, Jieming Zhang, Ying Li, Dongpu Cao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Image generation technology can synthesize condition-specific images to supplement real-world industrial anomaly data and enhance anomaly detection model performance. Existing generation techniques rarely account for the pose and orientation of industrial components in assembly, making the generated images difficult to utilize for downstream application. To solve this, we propose a novel image synthesis approach, called PostureObjectStitch, that achieves accurate generation to meet the requirement of industrial assembly. A condition decoupling approach is introduced to separate input multi-view images into high-frequency, texture, and RGB features. The feature temporal modulation mechanism adapts these features across diffusion model time-steps, enabling progressive generation from coarse to fine details while maintaining consistency. To ensure semantic accuracy, we introduce a conditional loss that enhances critical industrial elements and a geometric prior that guides component positioning for correct assembly relationships. Comprehensive experimental results on the MureCom dataset, our newly contributed DreamAssembly dataset, and the downstream application validate the outstanding performance of our method.

[214] Context Sensitivity Improves Human-Machine Visual Alignment

Frieda Born, Tom Neuhäuser, Lukas Muttenthaler, Brett D. Roads, Bernhard Spitzer, Andrew K. Lampinen, Matt Jones, Klaus-Robert Müller, Michael C. Mozer

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Modern machine learning models typically represent inputs as fixed points in a high-dimensional embedding space. While this approach has been proven powerful for a wide range of downstream tasks, it fundamentally differs from the way humans process information. Because humans are constantly adapting to their environment, they represent objects and their relationships in a highly context-sensitive manner. To address this gap, we propose a method for context-sensitive similarity computation from neural network embeddings, applied to modeling a triplet odd-one-out task with an anchor image serving as simultaneous context. Modeling context enables us to achieve up to a 15% improvement in odd-one-out accuracy over a context-insensitive model. We find that this improvement is consistent across both original and “human-aligned” vision foundation models.

[215] Rethinking Image-to-3D Generation with Sparse Queries: Efficiency, Capacity, and Input-View Bias

Zhiyuan Xu, Jiuming Liu, Yuxin Chen, Masayoshi Tomizuka, Chenfeng Xu, Chensheng Peng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present SparseGen, a novel framework for efficient image-to-3D generation, which exhibits low input-view bias while being significantly faster. Unlike traditional approaches that rely on dense volumetric grids, triplanes, or pixel-aligned primitives, we model scenes with a compact sparse set of learned 3D anchor queries and a learned expansion operator that decodes each transformed query into a small local set of 3D Gaussian primitives. Trained under a rectified-flow reconstruction objective without 3D supervision, our model learns to allocate representation capacity where geometry and appearance matter, achieving significant reductions in memory and inference time while preserving multi-view fidelity. We introduce quantitative measures of input-view bias and utilization to show that sparse queries reduce overfitting to conditioning views while being representationally efficient. Our results argue that sparse set-latent expansion is a principled, practical alternative for efficient 3D generative modeling.

[216] Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

Weijie Wang, Qihang Cao, Sensen Gao, Donny Y. Chen, Haofei Xu, Wenjing Bian, Songyou Peng, Tat-Jen Cham, Chuanxia Zheng, Andreas Geiger, Jianfei Cai, Jia-Wang Bian, Bohan Zhuang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Reconstructing 3D representations from 2D inputs is a fundamental task in computer vision and graphics, serving as a cornerstone for understanding and interacting with the physical world. While traditional methods achieve high fidelity, they are limited by slow per-scene optimization or category-specific training, which hinders their practical deployment and scalability. Hence, generalizable feed-forward 3D reconstruction has witnessed rapid development in recent years. By learning a model that maps images directly to 3D representations in a single forward pass, these methods enable efficient reconstruction and robust cross-scene generalization. Our survey is motivated by a critical observation: despite the diverse geometric output representations, ranging from implicit fields to explicit primitives, existing feed-forward approaches share similar high-level architectural patterns, such as image feature extraction backbones, multi-view information fusion mechanisms, and geometry-aware design principles. Consequently, we abstract away from these representation differences and instead focus on model design, proposing a novel taxonomy centered on model design strategies that are agnostic to the output format. Our proposed taxonomy organizes the research directions into five key problems that drive recent research development: feature enhancement, geometry awareness, model efficiency, augmentation strategies and temporal-aware models. To support this taxonomy with empirical grounding and standardized evaluation, we further comprehensively review related benchmarks and datasets, and extensively discuss and categorize real-world applications based on feed-forward 3D models. Finally, we outline future directions to address open challenges such as scalability, evaluation standards, and world modeling.

Shuyun Wang, Hu Zhang, Xin Shen, Dadong Wang, Xin Yu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Bitstream-corrupted video recovery aims to restore realistic content degraded during video storage or transmission. Existing methods typically assume that predefined masks of corrupted regions are available, but manually annotating these masks is labor-intensive and impractical in real-world scenarios. To address this limitation, we introduce a new blind video recovery setting that removes the reliance on predefined masks. This setting presents two major challenges: accurately identifying corrupted regions and recovering content from extensive and irregular degradations. We propose a Metadata-Guided Diffusion Model (M-GDM) to tackle these challenges. Specifically, intrinsic video metadata are leveraged as corruption indicators through a dual-stream metadata encoder that separately embeds motion vectors and frame types before fusing them into a unified representation. This representation interacts with corrupted latent features via cross-attention at each diffusion step. To preserve intact regions, we design a prior-driven mask predictor that generates pseudo masks using both metadata and diffusion priors, enabling the separation and recombination of intact and recovered regions through hard masking. To mitigate boundary artifacts caused by imperfect masks, a post-refinement module enhances consistency between intact and recovered regions. Extensive experiments demonstrate the effectiveness of our method and its superiority in blind video recovery. Code is available at: https://github.com/Shuyun-Wang/M-GDM.

[218] PartNerFace: Part-based Neural Radiance Fields for Animatable Facial Avatar Reconstruction

Xianggang Yu, Lingteng Qiu, Xiaohang Ren, Guanying Chen, Shuguang Cui, Xiaoguang Han, Baoyuan Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present PartNerFace, a part-based neural radiance fields approach, for reconstructing animatable facial avatar from monocular RGB videos. Existing solutions either simply condition the implicit network with the morphable model parameters or learn an imaginary canonical radiance field, making them fail to generalize to unseen facial expressions and capture fine-scale motion details. To address these challenges, we first apply inverse skinning based on a parametric head model to map an observed point to the canonical space, and then model fine-scale motions with a part-based deformation field. Our key insight is that the deformation of different facial parts should be modeled differently. Specifically, our part-based deformation field consists of multiple local MLPs to adaptively partition the canonical space into different parts, where the deformation of a 3D point is computed by aggregating the prediction of all local MLPs by a soft-weighting mechanism. Extensive experiments demonstrate that our method generalizes well to unseen expressions and is capable of modeling fine-scale facial motions, outperforming state-of-the-art methods both quantitatively and qualitatively.

[219] ASTRA: Enhancing Multi-Subject Generation with Retrieval-Augmented Pose Guidance and Disentangled Position Embedding

Tianze Xia, Zijian Ning, Zonglin Zhao, Mingjia Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Subject-driven image generation has shown great success in creating personalized content, but its capabilities are largely confined to single subjects in common poses. Current approaches face a fundamental conflict when handling multiple subjects with complex, distinct actions: preserving individual identities while enforcing precise pose structures. This challenge often leads to identity fusion and pose distortion, as appearance and structure signals become entangled within the model’s architecture. To resolve this conflict, we introduce ASTRA(Adaptive Synthesis through Targeted Retrieval Augmentation), a novel framework that architecturally disentangles subject appearance from pose structure within a unified Diffusion Transformer. ASTRA achieves this through a dual-pronged strategy. It first employs a Retrieval-Augmented Pose (RAG-Pose) pipeline to provide a clean, explicit structural prior from a curated database. Then, its core generative model learns to process these dual visual conditions using our Enhanced Universal Rotary Position Embedding (EURoPE), an asymmetric encoding mechanism that decouples identity tokens from spatial locations while binding pose tokens to the canvas. Concurrently, a Disentangled Semantic Modulation (DSM) adapter offloads the identity preservation task into the text conditioning stream. Extensive experiments demonstrate that our integrated approach achieves superior disentanglement. On our designed COCO-based complex pose benchmark, ASTRA achieves a new state-of-the-art in pose adherence, while maintaining high identity fidelity and text alignment in DreamBench.

[220] A Multi-Stage Optimization Pipeline for Bethesda Cell Detection in Pap Smear Cytology

Martin Amster, Camila María Polotto

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Computer vision techniques have advanced significantly in recent years, finding diverse and impactful applications within the medical field. In this paper, we introduce a new framework for the detection of Bethesda cells in Pap smear images, developed for Track B of the Riva Cytology Challenge held in association with the International Symposium on Biomedical Imaging (ISBI). This work focuses on enhancing computer vision models for cell detection, with performance evaluated using the mAP50-95 metric. We propose a solution based on an ensemble of YOLO and U-Net architectures, followed by a refinement stage utilizing overlap removal techniques and a binary classifier. Our framework achieved second place with a mAP50-95 score of 0.5909 in the competition. The implementation and source code are available at the following repository: github.com/martinamster/riva-trackb

[221] SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation

Songlin Du, Xiaoyong Lu, Yaping Yan, Guobao Xiao, Xiaobo Lu, Takeshi Ikenaga

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene’s global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue’s superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.

[222] UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: GUI grounding, which localizes interface elements from screenshots given natural language queries, remains challenging for small icons and dense layouts. Test-time zoom-in methods improve localization by cropping and re-running inference at higher resolution, but apply cropping uniformly across all instances with fixed crop sizes, ignoring whether the model is actually uncertain on each case. We propose \textbf{UI-Zoomer}, a training-free adaptive zoom-in framework that treats both the trigger and scale of zoom-in as a prediction uncertainty quantification problem. A confidence-aware gate fuses spatial consensus among stochastic candidates with token-level generation confidence to selectively trigger zoom-in only when localization is uncertain. When triggered, an uncertainty-driven crop sizing module decomposes prediction variance into inter-sample positional spread and intra-sample box extent, deriving a per-instance crop radius via the law of total variance. Extensive experiments on ScreenSpot-Pro, UI-Vision, and ScreenSpot-v2 demonstrate consistent improvements over strong baselines across multiple model architectures, achieving gains of up to +13.4%, +10.3%, and +4.2% respectively, with no additional training required.

[223] Heuristic Style Transfer for Real-Time, Efficient Weather Attribute Detection

Hamed Ouattara, Pierre Duthon, Pascal Houssam Salmane, Frédéric Bernardin, Omar Ait Aider

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: We present lightweight and efficient architectures to detect weather conditions from RGB images, predicting the weather type (sunny, rain, snow, fog) and 11 complementary attributes such as intensity, visibility, and ground condition, for a total of 53 classes across the tasks. This work examines to what extent weather conditions manifest as variations in visual style. We investigate style-inspired techniques, including Gram matrices, a truncated ResNet-50 targeting lower and intermediate layers, and PatchGAN-style architectures, within a multi-task framework with attention mechanisms. Two families are introduced: RTM (ResNet50-Truncated-MultiTasks) and PMG (PatchGAN-MultiTasks-Gram), together with their variants. Our contributions include automation of Gram-matrix computation, integration of PatchGAN into supervised multi-task learning, and local style capture through local Gram for improved spatial coherence. We also release a dataset of 503,875 images annotated with 12 weather attributes under a Creative Commons Attribution (CC-BY) license. The models achieve F1 scores above 96 percent on our internal test set and above 78 percent in zero-shot evaluation on several external datasets, confirming their generalization ability. The PMG architecture, with fewer than 5 million parameters, runs in real time with a small memory footprint, making it suitable for embedded systems. The modular design of the models also allows style-related or weather-related tasks to be added or removed as needed.

[224] HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in low-level part equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM’s zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.

[225] SpatialEvo: Self-Evolving Spatial Intelligence via Deterministic Geometric Environments

Dinging Li, Yingxiu Zhao, Xinrui Cheng, Kangheng Lin, Hongbo Peng, Hongxing Li, Zixuan Wang, Yuhong Dai, Haodong Li, Jia Wang, Yukang Shi, Liang Zhao, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Spatial reasoning over three-dimensional scenes is a core capability for embodied intelligence, yet continuous model improvement remains bottlenecked by the cost of geometric annotation. The self-evolving paradigm offers a promising path, but its reliance on model consensus to construct pseudo-labels causes training to reinforce rather than correct the model’s own geometric errors. We identify a property unique to 3D spatial reasoning that circumvents this limitation: ground truth is a deterministic consequence of the underlying geometry, computable exactly from point clouds and camera poses without any model involvement. Building on this insight, we present SpatialEvo, a self-evolving framework for 3D spatial reasoning, centered on the Deterministic Geometric Environment (DGE). The DGE formalizes 16 spatial reasoning task categories under explicit geometric validation rules and converts unannotated 3D scenes into zero-noise interactive oracles, replacing model consensus with objective physical feedback. A single shared-parameter policy co-evolves across questioner and solver roles under DGE constraints: the questioner generates physically valid spatial questions grounded in scene observations, while the solver derives precise answers against DGE-verified ground truth. A task-adaptive scheduler endogenously concentrates training on the model’s weakest categories, producing a dynamic curriculum without manual design. Experiments across nine benchmarks demonstrate that SpatialEvo achieves the highest average score at both 3B and 7B scales, with consistent gains on spatial reasoning benchmarks and no degradation on general visual understanding.

[226] MApLe: Multi-instance Alignment of Diagnostic Reports and Large Medical Images

Felicia Bader, Philipp Seeböck, Anastasia Bartashova, Ulrike Attenberger, Georg Langs

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: In diagnostic reports, experts encode complex imaging data into clinically actionable information. They describe subtle pathological findings that are meaningful in their anatomical context. Reports follow relatively consistent structures, expressing diagnostic information with few words that are often associated with tiny but consequential image observations. Standard vision language models struggle to identify the associations between these informative text components and small locations in the images. Here, we propose “MApLe”, a multi-task, multi-instance vision language alignment approach that overcomes these limitations. It disentangles the concepts of anatomical region and diagnostic finding, and links local image information to sentences in a patch-wise approach. Our method consists of a text embedding trained to capture anatomical and diagnostic concepts in sentences, a patch-wise image encoder conditioned on anatomical structures, and a multi-instance alignment of these representations. We demonstrate that MApLe can successfully align different image regions and multiple diagnostic findings in free-text reports. We show that our model improves the alignment performance compared to state-of-the-art baseline models when evaluated on several downstream tasks. The code is available at https://github.com/cirmuw/MApLe.

[227] HiProto: Hierarchical Prototype Learning for Interpretable Object Detection Under Low-quality Conditions

Jianlin Xiang, Linhui Dai, Xue Yang, Chaolei Yang, Yanshan Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Interpretability is essential for deploying object detection systems in critical applications, especially under low-quality imaging conditions that degrade visual information and increase prediction uncertainty. Existing methods either enhance image quality or design complex architectures, but often lack interpretability and fail to improve semantic discrimination. In contrast, prototype learning enables interpretable modeling by associating features with class-centered semantics, which can provide more stable and interpretable representations under degradation. Motivated by this, we propose HiProto, a new paradigm for interpretable object detection based on hierarchical prototype learning. By constructing structured prototype representations across multiple feature levels, HiProto effectively models class-specific semantics, thereby enhancing both semantic discrimination and interpretability. Building upon prototype modeling, we first propose a Region-to-Prototype Contrastive Loss (RPC-Loss) to enhance the semantic focus of prototypes on target regions. Then, we propose a Prototype Regularization Loss (PR-Loss) to improve the distinctiveness among class prototypes. Finally, we propose a Scale-aware Pseudo Label Generation Strategy (SPLGS) to suppress mismatched supervision for RPC-Loss, thereby preserving the robustness of low-level prototype representations. Experiments on ExDark, RTTS, and VOC2012-FOG demonstrate that HiProto achieves competitive results while offering clear interpretability through prototype responses, without relying on image enhancement or complex architectures. Our code will be available at https://github.com/xjlDestiny/HiProto.git.

[228] Remote Sensing Image Super-Resolution for Imbalanced Textures: A Texture-Aware Diffusion Framework

Enzhuo Zhang, Sijie Zhao, Dilxat Muhtar, Zhenshi Li, Xueliang Zhang, Pengfeng Xiao

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Generative diffusion priors have recently achieved state-of-the-art performance in natural image super-resolution, demonstrating a powerful capability to synthesize photorealistic details. However, their direct application to remote sensing image super-resolution (RSISR) reveals significant shortcomings. Unlike natural images, remote sensing images exhibit a unique texture distribution where ground objects are globally stochastic yet locally clustered, leading to highly imbalanced textures. This imbalance severely hinders the model’s spatial perception. To address this, we propose TexADiff, a novel framework that begins by estimating a Relative Texture Density Map (RTDM) to represent the texture distribution. TexADiff then leverages this RTDM in three synergistic ways: as an explicit spatial conditioning to guide the diffusion process, as a loss modulation term to prioritize texture-rich regions, and as a dynamic adapter for the sampling schedule. These modifications are designed to endow the model with explicit texture-aware capabilities. Experiments demonstrate that TexADiff achieves superior or competitive quantitative metrics. Furthermore, qualitative results show that our model generates faithful high-frequency details while effectively suppressing texture hallucinations. This improved reconstruction quality also results in significant gains in downstream task performance. The source code of our method can be found at https://github.com/ZezFuture/TexAdiff.

[229] Depth-Aware Image and Video Orientation Estimation

Muhammad Z. Alam, Larry Stetsiuk, M. Umair Mukati, Zeeshan Kaleem

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: This paper introduces a novel approach for image and video orientation estimation by leveraging depth distribution in natural images. The proposed method estimates the orientation based on the depth distribution across different quadrants of the image, providing a robust framework for orientation estimation suited for applications such as virtual reality (VR), augmented reality (AR), autonomous navigation, and interactive surveillance systems. To further enhance fine-scale perceptual alignment, we incorporate depth gradient consistency (DGC) and horizontal symmetry analysis (HSA), enabling precise orientation correction. This hybrid strategy effectively exploits depth cues to support spatial coherence and perceptual stability in immersive visual content. Qualitative and quantitative evaluations demonstrate the robustness and accuracy of the proposed approach, outperforming existing techniques across diverse scenarios.

[230] POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

Yikun Liu, Yuan Liu, Le Tian, Xiao Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Large Multimodal Models (LMMs) demonstrate impressive visual perception, they remain epistemically constrained by their static parametric knowledge. To transcend these boundaries, multimodal search models have been adopted to actively interact with the external environment for evidence retrieval. Diverging from prevailing paradigms that merely retrofit general LMMs with search tools as modular extensions, we explore the potential of building a multimodal agentic search model from scratch. Specifically, we make the following contributions: (i) we introduce Agentic Seeding, a dedicated phase designed to weave the foundational precursors necessary for eliciting agentic behaviors; (ii) we uncover a performance bottleneck in long-horizon interactions, where the increasing volume of interaction history overwhelms the model’s ability to locate ground-truth evidence. To mitigate this, we propose V-Fold, an adaptive history-aware compression scheme that preserves recent dialogue turns in high fidelity while folding historical context into the visual space via rendering; and (iii) we develop POINTS-Seeker-8B, a state-of-the-art multimodal agentic search model that consistently outperforms existing models across six diverse benchmarks, effectively resolving the challenges of long-horizon, knowledge-intensive visual reasoning.

[231] Seek-and-Solve: Benchmarking MLLMs for Visual Clue-Driven Reasoning in Daily Scenarios

Xiaomin Li, Tala Wang, Zichen Zhong, Ying Zhang, Zirui Zheng, Takashi Isobe, Dezhuang Li, Huchuan Lu, You He, Xu Jia

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Daily scenarios are characterized by visual richness, requiring Multimodal Large Language Models (MLLMs) to filter noise and identify decisive visual clues for accurate reasoning. Yet, current benchmarks predominantly aim at evaluating MLLMs’ pre-existing knowledge or perceptual understanding, often neglecting the critical capability of reasoning. To bridge this gap, we introduce DailyClue, a benchmark designed for visual clue-driven reasoning in daily scenarios. Our construction is guided by two core principles: (1) strict grounding in authentic daily activities, and (2) challenging query design that necessitates more than surface-level perception. Instead of simple recognition, our questions compel MLLMs to actively explore suitable visual clues and leverage them for subsequent reasoning. To this end, we curate a comprehensive dataset spanning four major daily domains and 16 distinct subtasks. Comprehensive evaluation across MLLMs and agentic models underscores the formidable challenge posed by our benchmark. Our analysis reveals several critical insights, emphasizing that the accurate identification of visual clues is essential for robust reasoning.

[232] Decoding the Delta: Unifying Remote Sensing Change Detection and Understanding with Multimodal Large Language Models

Xiaohe Li, Jiahao Li, Kaixin Zhang, Yuqiang Fang, Leilei Lin, Hong Wang, Haohua Wu, Zide Fan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Multimodal Large Language Models (MLLMs) excel in general vision-language tasks, their application to remote sensing change understanding is hindered by a fundamental “temporal blindness”. Existing architectures lack intrinsic mechanisms for multi-temporal contrastive reasoning and struggle with precise spatial grounding. To address this, we first introduce Delta-QA, a comprehensive benchmark comprising 180k visual question-answering samples. Delta-QA unifies pixel-level segmentation and visual question answering across bi- and tri-temporal scenarios, structuring change interpretation into four progressive cognitive dimensions. Methodologically, we propose Delta-LLaVA, a novel MLLM framework explicitly tailored for multi-temporal remote sensing interpretation. It overcomes the limitations of naive feature concatenation through three core innovations: a Change-Enhanced Attention module that systematically isolates and amplifies visual differences, a Change-SEG module utilizing Change Prior Embedding to extract differentiable difference features as input for the LLM, and Local Causal Attention to prevent cross-temporal contextual leakage. Extensive experiments demonstrate that Delta-LLaVA decisively outperforms leading generalist MLLMs and specialized segmentation models in complex change deduction and high-precision boundary localization, establishing a unified framework for earth observation intelligence.

[233] Free Geometry: Refining 3D Reconstruction from Longer Versions of Itself

Yuhang Dai, Xingyi Yang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Feed-forward 3D reconstruction models are efficient but rigid: once trained, they perform inference in a zero-shot manner and cannot adapt to the test scene. As a result, visually plausible reconstructions often contain errors, particularly under occlusions, specularities, and ambiguous cues. To address this, we introduce Free Geometry, a framework that enables feed-forward 3D reconstruction models to self-evolve at test time without any 3D ground truth. Our key insight is that, when the model receives more views, it produces more reliable and view-consistent reconstructions. Leveraging this property, given a testing sequence, we mask a subset of frames to construct a self-supervised task. Free Geometry enforces cross-view feature consistency between representations from full and partial observations, while maintaining the pairwise relations implied by the held-out frames. This self-supervision allows for fast recalibration via lightweight LoRA updates, taking less than 2 minutes per dataset on a single GPU. Our approach consistently improves state-of-the-art foundation models, including Depth Anything 3 and VGGT, across 4 benchmark datasets, yielding an average improvement of 3.73% in camera pose accuracy and 2.88% in point map prediction. Code is available at https://github.com/hiteacherIamhumble/Free-Geometry .

[234] Towards Unconstrained Human-Object Interaction

Francesco Tonini, Alessandro Conti, Lorenzo Vaquero, Cigdem Beyan, Elisa Ricci

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Human-Object Interaction (HOI) detection is a longstanding computer vision problem concerned with predicting the interaction between humans and objects. Current HOI models rely on a vocabulary of interactions at training and inference time, limiting their applicability to static environments. With the advent of Multimodal Large Language Models (MLLMs), it has become feasible to explore more flexible paradigms for interaction recognition. In this work, we revisit HOI detection through the lens of MLLMs and apply them to in-the-wild HOI detection. We define the Unconstrained HOI (U-HOI) task, a novel HOI domain that removes the requirement for a predefined list of interactions at both training and inference. We evaluate a range of MLLMs on this setting and introduce a pipeline that includes test-time inference and language-to-graph conversion to extract structured interactions from free-form text. Our findings highlight the limitations of current HOI detectors and the value of MLLMs for U-HOI. Code will be available at https://github.com/francescotonini/anyhoi

[235] Training-Free Semantic Multi-Object Tracking with Vision-Language Models

Laurence Bonat, Francesco Tonini, Elisa Ricci, Lorenzo Vaquero

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Semantic Multi-Object Tracking (SMOT) extends multi-object tracking with semantic outputs such as video summaries, instance-level captions, and interaction labels, aiming to move from trajectories to human-interpretable descriptions of dynamic scenes. Existing SMOT systems are trained end-to-end, coupling progress to expensive supervision, limiting the ability to rapidly adapt to new foundation models and new interactions. We propose TF-SMOT, a training-free SMOT pipeline that composes pretrained components for detection, mask-based tracking, and video-language generation. TF-SMOT combines D-FINE and the promptable SAM2 segmentation tracker to produce temporally consistent tracklets, uses contour grounding to generate video summaries and instance captions with InternVideo2.5, and aligns extracted interaction predicates to BenSMOT WordNet synsets via gloss-based semantic retrieval with LLM disambiguation. On BenSMOT, TF-SMOT achieves state-of-the-art tracking performance within the SMOT setting and improves summary and caption quality compared to prior art. Interaction recognition, however, remains challenging under strict exact-match evaluation on the fine-grained and long-tailed WordNet label space; our analysis and ablations indicate that semantic overlap and label granularity substantially affect measured performance.

[236] Don’t Let the Video Speak: Audio-Contrastive Preference Optimization for Audio-Visual Language Models

Ami Baid, Zihui Xue, Kristen Grauman

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: While Audio-Visual Language Models (AVLMs) have achieved remarkable progress over recent years, their reliability is bottlenecked by cross-modal hallucination. A particularly pervasive manifestation is video-driven audio hallucination: models routinely exploit visual shortcuts to hallucinate expected sounds, discarding true auditory evidence. To counteract this deeply ingrained visual dominance, we propose Audio-Contrastive Preference Optimization (ACPO). This dual-axis preference learning framework introduces an output-contrastive objective to penalize visual descriptions masquerading as audio facts, alongside an input-contrastive objective that swaps audio tracks to explicitly penalize generation invariant to the true auditory signal. Extensive experiments demonstrate that ACPO establishes highly faithful audio grounding and mitigates audio hallucination without compromising overarching multimodal capabilities.

[237] Geometric Context Transformer for Streaming 3D Reconstruction

Lin-Zhuo Chen, Jian Gao, Yihang Chen, Ka Leong Cheng, Yipengjing Sun, Liangxiao Hu, Nan Xue, Xing Zhu, Yujun Shen, Yao Yao, Yinghao Xu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Streaming 3D reconstruction aims to recover 3D information, such as camera poses and point clouds, from a video stream, which necessitates geometric accuracy, temporal consistency, and computational efficiency. Motivated by the principles of Simultaneous Localization and Mapping (SLAM), we introduce LingBot-Map, a feed-forward 3D foundation model for reconstructing scenes from streaming data, built upon a geometric context transformer (GCT) architecture. A defining aspect of LingBot-Map lies in its carefully designed attention mechanism, which integrates an anchor context, a pose-reference window, and a trajectory memory to address coordinate grounding, dense geometric cues, and long-range drift correction, respectively. This design keeps the streaming state compact while retaining rich geometric context, enabling stable efficient inference at around 20 FPS on 518 x 378 resolution inputs over long sequences exceeding 10,000 frames. Extensive evaluations across a variety of benchmarks demonstrate that our approach achieves superior performance compared to both existing streaming and iterative optimization-based approaches.

[238] ROSE: Retrieval-Oriented Segmentation Enhancement

Song Tang, Guangquan Jie, Henghui Ding, Yu-Gang Jiang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Existing segmentation models based on multimodal large language models (MLLMs), such as LISA, often struggle with novel or emerging entities due to their inability to incorporate up-to-date knowledge. To address this challenge, we introduce the Novel Emerging Segmentation Task (NEST), which focuses on segmenting (i) novel entities that MLLMs fail to recognize due to their absence from training data, and (ii) emerging entities that exist within the model’s knowledge but demand up-to-date external information for accurate recognition. To support the study of NEST, we construct a NEST benchmark using an automated pipeline that generates news-related data samples for comprehensive evaluation. Additionally, we propose ROSE: Retrieval-Oriented Segmentation Enhancement, a plug-and-play framework designed to augment any MLLM-based segmentation model. ROSE comprises four key components. First, an Internet Retrieval-Augmented Generation module is introduced to employ user-provided multimodal inputs to retrieve real-time web information. Then, a Textual Prompt Enhancer enriches the model with up-to-date information and rich background knowledge, improving the model’s perception ability for emerging entities. Furthermore, a Visual Prompt Enhancer is proposed to compensate for MLLMs’ lack of exposure to novel entities by leveraging internet-sourced images. To maintain efficiency, a WebSense module is introduced to intelligently decide when to invoke retrieval mechanisms based on user input. Experimental results demonstrate that ROSE significantly boosts performance on the NEST benchmark, outperforming a strong Gemini-2.0 Flash-based retrieval baseline by 19.2 in gIoU.

[239] Seedance 2.0: Advancing Video Generation for World Complexity

Team Seedance, De Chen, Liyang Chen, Xin Chen, Ying Chen, Zhuo Chen, Zhuowei Chen, Feng Cheng, Tianheng Cheng, Yufeng Cheng, Mojie Chi, Xuyan Chi, Jian Cong, Qinpeng Cui, Fei Ding, Qide Dong, Yujiao Du, Haojie Duanmu, Junliang Fan, Jiarui Fang, Jing Fang, Zetao Fang, Chengjian Feng, Yu Gao, Diandian Gu, Dong Guo, Hanzhong Guo, Qiushan Guo, Boyang Hao, Hongxiang Hao, Haoxun He, Jiaao He, Qian He, Tuyen Hoang, Heng Hu, Ruoqing Hu, Yuxiang Hu, Jiancheng Huang, Weilin Huang, Zhaoyang Huang, Zhongyi Huang, Jishuo Jin, Ming Jing, Ashley Kim, Shanshan Lao, Yichong Leng, Bingchuan Li, Gen Li, Haifeng Li, Huixia Li, Jiashi Li, Ming Li, Xiaojie Li, Xingxing Li, Yameng Li, Yiying Li, Yu Li, Yueyan Li, Chao Liang, Han Liang, Jianzhong Liang, Ying Liang, Wang Liao, J. H. Lien, Shanchuan Lin, Xi Lin, Feng Ling, Yue Ling, Fangfang Liu, Jiawei Liu, Jihao Liu, Jingtuo Liu, Shu Liu, Sichao Liu, Wei Liu, Xue Liu, Zuxi Liu, Ruijie Lu, Lecheng Lyu, Jingting Ma, Tianxiang Ma, Xiaonan Nie, Jingzhe Ning, Junjie Pan, Xitong Pan, Ronggui Peng, Xueqiong Qu, Yuxi Ren, Yuchen Shen, Guang Shi, Lei Shi, Yinglong Song, Fan Sun, Li Sun, Renfei Sun, Wenjing Tang, Boyang Tao, Zirui Tao, Dongliang Wang, Feng Wang, Hulin Wang, Ke Wang, Qingyi Wang, Rui Wang, Shuai Wang, Shulei Wang, Weichen Wang, Xuanda Wang, Yanhui Wang, Yue Wang, Yuping Wang, Yuxuan Wang, Zijie Wang, Ziyu Wang, Guoqiang Wei, Meng Wei, Di Wu, Guohong Wu, Hanjie Wu, Huachao Wu, Jian Wu, Jie Wu, Ruolan Wu, Shaojin Wu, Xiaohu Wu, Xinglong Wu, Yonghui Wu, Ruiqi Xia, Xin Xia, Xuefeng Xiao, Shuang Xu, Bangbang Yang, Jiaqi Yang, Runkai Yang, Tao Yang, Yihang Yang, Zhixian Yang, Ziyan Yang, Fulong Ye, Bingqian Yi, Xing Yin, Yongbin You, Linxiao Yuan, Weihong Zeng, Xuejiao Zeng, Yan Zeng, Siyu Zhai, Zhonghua Zhai, Bowen Zhang, Chenlin Zhang, Heng Zhang, Jun Zhang, Manlin Zhang, Peiyuan Zhang, Shuo Zhang, Xiaohe Zhang, Xiaoying Zhang, Xinyan Zhang, Xinyi Zhang, Yichi Zhang, Zixiang Zhang, Haiyu Zhao, Huating Zhao, Liming Zhao, Yian Zhao, Guangcong Zheng, Jianbin Zheng, Xiaozheng Zheng, Zerong Zheng, Kuan Zhu, Feilong Zuo

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Seedance 2.0 is a new native multi-modal audio-video generation model, officially released in China in early February 2026. Compared with its predecessors, Seedance 1.0 and 1.5 Pro, Seedance 2.0 adopts a unified, highly efficient, and large-scale architecture for multi-modal audio-video joint generation. This allows it to support four input modalities: text, image, audio, and video, by integrating one of the most comprehensive suites of multi-modal content reference and editing capabilities available in the industry to date. It delivers substantial, well-rounded improvements across all key sub-dimensions of video and audio generation. In both expert evaluations and public user tests, the model has demonstrated performance on par with the leading levels in the field. Seedance 2.0 supports direct generation of audio-video content with durations ranging from 4 to 15 seconds, with native output resolutions of 480p and 720p. For multi-modal inputs as reference, its current open platform supports up to 3 video clips, 9 images, and 3 audio clips. In addition, we provide Seedance 2.0 Fast version, an accelerated variant of Seedance 2.0 designed to boost generation speed for low-latency scenarios. Seedance 2.0 has delivered significant improvements to its foundational generation capabilities and multi-modal generation performance, bringing an enhanced creative experience for end users.

[240] One Token per Highly Selective Frame: Towards Extreme Compression for Long Video Understanding

Zheyu Zhang, Ziqi Pang, Shixing Chen, Xiang Hao, Vimal Bhat, Yu-Xiong Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Long video understanding is inherently challenging for vision-language models (VLMs) because of the extensive number of frames. With each video frame typically expanding into tens or hundreds of tokens, the limited context length of large language models (LLMs) forces the VLMs to perceive the frames sparsely and lose temporal information. To address this, we explore extreme video token compression towards \emph{one token per frame} at the final LLM layer. Our key insight is that heuristic-based compression, widely adopted by previous methods, is prone to information loss, and this necessitates supervising LLM layers into \emph{learnable} and \emph{progressive} modules for \emph{token-level compression} (LP-Comp). Such compression enables our VLM to digest 2x-4x more frames with improved performance. To further increase the token efficiency, we investigate \emph{frame-level compression}, which selects the frames most relevant to the queries via the internal attention scores of the LLM layers, named \emph{question-conditioned compression} (QC-Comp). As a notable distinction from previous studies, we mitigate the position bias of LLM attention in long contexts, \emph{i.e.}, the over-concentration on the beginning and end of a sequence, by splitting long videos into short segments and employing local attention. Collectively, our combined \emph{token-level} and \emph{frame-level} leads to an e\textbf{x}treme compression model for long video understanding, named \textbf{\name}, achieving a significantly larger compression ratio and enabling denser frame sampling. Our \name is finetuned from VideoChat-Flash with a data-efficient \emph{supervised compression tuning} stage that only requires 2.5% of the supervised fine-tuning data, yet boosts the accuracy from 42.9% to 46.2% on LVBench and enhances multiple other long video benchmarks.

[241] SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion

Danish Nazir, Marcus Liwicki, Didier Stricker, Muhammad Zeshan Afzal

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2204.13635: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2204.13635&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[242] What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Sunny Panchal, Apratim Bhattacharyya, Guillaume Berger, Antoine Mercier, Cornelius Bohm, Florian Dietrichkeit, Reza Pourreza, Xuanlin Li, Pulkit Madan, Mingu Lee, Mark Todorovich, Ingo Bax, Roland Memisevic

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2407.08101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.08101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[243] SinkSAM-Net: Knowledge-Driven Self-Supervised Sinkhole Segmentation Using Topographic Priors and Segment Anything Model

Osher Rafaeli, Tal Svoray, Ariel Nahlieli

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2410.01473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2410.01473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[244] SiLVR: A Simple Language-based Video Reasoning Framework

Ce Zhang, Yan-Bo Lin, Ziyang Wang, Mohit Bansal, Gedas Bertasius

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.24869: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.24869&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[245] Visual Sparse Steering (VS2): Unsupervised Adaptation for Image Classification using Sparsity-Guided Steering Vectors

Gerasimos Chatzoudis, Zhuowei Li, Gemma E. Moran, Hao Wang, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.01247: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.01247&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[246] Frozen Forecasting: A Unified Evaluation

Jacob C Walker, Pedro Vélez, Luisa Polania Cabrera, Guangyao Zhou, Sayna Ebrahimi, Rishabh Kabra, Carl Doersch, Maks Ovsjanikov, João Carreira, Shiry Ginosar

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.13942: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.13942&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[247] Hybrid Approach for Enhancing Lesion Segmentation in Fundus Images

Mohammadmahdi Eshragh, Emad A. Mohammed, Behrouz Far, Ezekiel Weis, Carol L Shields, Sandor R Ferenczy, Trafford Crump

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.25549: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25549&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[248] Spatial Atlas: Compute-Grounded Reasoning for Spatial-Aware Research Agent Benchmarks

Arun Sharma

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12102: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12102&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[249] AIM-CoT: Active Information-driven Multimodal Chain-of-Thought for Vision-Language Reasoning

Xiping Li, Jianghong Ma

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.25699: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25699&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[250] Getting the Numbers Right$\unicode{x2014}$Modelling Multi-Class Object Counting in Dense and Varied Scenes

Villanelle O’Reilly, Jonathan Cox, Georgios Leontidis, Marc Hanheide, Petra Bosilj, James M. Brown

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.02213: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02213&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, Yihui Fan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.06687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[252] Abstract 3D Perception for Spatial Intelligence in Vision-Language Models

Yifan Liu, Fangneng Zhan, Kaichen Zhou, Yilun Du, Paul Pu Liang, Hanspeter Pfister

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.10946: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10946&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[253] An Optimal Transport-driven Approach for Cultivating Latent Space in Online Incremental Learning

Quyen Tran, Hai Nguyen, Hoang Phan, Quan Dao, Linh Ngo, Khoat Than, Dinh Phung, Dimitris Metaxas, Trung Le

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2211.16780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2211.16780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[254] Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

Mykola Lavreniuk, Nataliia Kussul, Andrii Shelestov, Yevhenii Salii, Volodymyr Kuzin, Sergii Skakun, Zoltan Szantoi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.13417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[255] Lite Any Stereo: Efficient Zero-Shot Stereo Matching

Junpeng Jing, Weixun Luo, Ye Mao, Krystian Mikolajczyk

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.16555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[256] Target-Bench: Can Video World Models Achieve Mapless Path Planning with Semantic Targets?

Dingrui Wang, Zhihao Liang, Hongyuan Ye, Zhexiao Sun, Zhaowei Lu, Yuchen Zhang, Yuyu Zhao, Yuan Gao, Marvin Seegert, Finn Schäfer, Haotong Qin, Wei Li, Luigi Palmieri, Felix Jahncke, Mattia Piccinini, Johannes Betz

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.17792: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17792&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[257] UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.23332: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.23332&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[258] GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song, Jing Zhang, Di Wang, Zidie Zhou, Wenbin Liu, Haonan Guo, En Wang, Bo Du

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.02697: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02697&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Huilin Xu, Zhuoyang Liu, Yixiang Luomei, Feng Xu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.08639: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08639&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[260] ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

Juze Zhang, Changan Chen, Xin Chen, Heng Yu, Tiange Xiang, Ali Sartaz Khan, Shrinidhi K. Lakshmikanth, Ehsan Adeli

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.14234: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14234&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[261] Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling

Sander Moonemans, Sebastiaan Ram, Frédérique Meeuwsen, Carlijn Lems, Jeroen van der Laak, Geert Litjens, Francesco Ciompi

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.17326: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17326&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[262] Heavy-Tailed Class-Conditional Priors for Long-Tailed Generative Modeling

Aymene Mohammed Bouayed, Samuel Deslauriers-Gauthier, Adrian Iaccovelli, David Naccache

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2509.02154: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.02154&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[263] Multi-Dimensional Knowledge Profiling with Large-Scale Literature Database and Hierarchical Retrieval

Zhucun Xue, Jiangning Zhang, Juntao Jiang, Jinzhuo Liu, Haoyang He, Teng Hu, Xiaobin Hu, Yong Liu, Shuicheng Yan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.15170: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15170&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[264] Learning Sewing Patterns via Latent Flow Matching of Implicit Fields

Cong Cao, Ren Li, Corentin Dumery, Hao Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.17740: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17740&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[265] Simplicity Prevails: The Emergence of Generalizable AIGI Detection in Visual Foundation Models

Yue Zhou, Xinan He, Kaiqing Lin, Bing Fan, Feng Ding, Bin Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.01738: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01738&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[266] A Function-Centric Perspective on Flat and Sharp Minima

Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.12451: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12451&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[267] Adaptive Multi-Scale Channel-Spatial Attention Aggregation Framework for 3D Indoor Semantic Scene Completion Toward Assisting Visually Impaired

Qi He, XiangXiang Wang, Jingtao Zhang, Yongbin Yu, Hongxiang Chu, Manping Fan, JingYe Cai, Zhenglin Yang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.16385: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16385&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jihao Qiu, Lingxi Xie, Xinyue Huo, Qi Tian, Qixiang Ye

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.20913: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20913&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[269] X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.04671: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04671&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[270] Echoes Over Time: Unlocking Length Generalization in Video-to-Audio Generation Models

Christian Simon, Masato Ishii, Wei-Yao Wang, Koichi Saito, Akio Hayakawa, Dongseok Shim, Zhi Zhong, Shuyang Cui, Shusuke Takahashi, Takashi Shibuya, Yuki Mitsufuji

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.20981: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20981&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[271] Tokenizing Semantic Segmentation with Run Length Encoding

Abhineet Singh, Justin Rozeboom, Nilanjan Ray

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.21627: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21627&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[272] OPTED: Open Preprocessed Trachoma Eye Dataset Using Zero-Shot SAM 3 Segmentation

Kibrom Gebremedhin, Hadush Hailu, Bruk Gebregziabher

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.06885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.06885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Qishun Yang, Shu Yang, Lijie Hu, Di Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.08486: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08486&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[274] UNBOX: Unveiling Black-box visual models with Natural-language

Simone Carnemolla, Chiara Russo, Simone Palazzo, Quentin Bouniot, Daniela Giordano, Zeynep Akata, Matteo Pennisi, Concetto Spampinato

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.08639: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.08639&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[275] Towards Generalizable Robotic Manipulation in Dynamic Environments

Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.15620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[276] From Plausibility to Verifiability: Risk-Controlled Generative OCR with Vision-Language Models

Weile Gong, Yiping Zuo, Zijian Lu, Xin He, Weibei Fan, Lianyong Qi, Shi Jin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.19790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[277] Hydra: Unifying Document Retrieval and Generation in a Single Vision-Language Model

Athos Georgiou

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.28554: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28554&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[278] SymphoMotion: Joint Control of Camera Motion and Object Dynamics for Coherent Video Generation

Guiyu Zhang, Yabo Chen, Xunzhi Xiang, Junchao Huang, Zhongyu Wang, Li Jiang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.03723: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.03723&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[279] Action Images: End-to-End Policy Learning via Multiview Video Generation

Haoyu Zhen, Zixian Gao, Qiao Sun, Yilin Zhao, Yuncong Yang, Yilun Du, Pengsheng Guo, Tsun-Hsuan Wang, Yi-Ling Qiao, Chuang Gan

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.06168: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.06168&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[280] DBMF: A Dual-Branch Multimodal Framework for Out-of-Distribution Detection

Jiangbei Yue, Darren Treanor, Venkataraman Subramanian, Sharib Ali

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.08261: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08261&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[281] Detecting Diffusion-generated Images via Dynamic Assembly Forests

Mengxin Fu, Yuezun Li

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09106: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09106&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[282] Grid2Matrix: Revealing Digital Agnosia in Vision-Language Models

Yunkai Zhang, Linda Li, Yingxin Cui, Xiyuan Ruan, Zeyu Zheng, Kezhen Chen, Yi Zhang, Diji Yang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[283] The Second Challenge on Real-World Face Restoration at NTIRE 2026: Methods and Results

Jingkai Wang, Jue Gong, Zheng Chen, Kai Liu, Jiatong Li, Yulun Zhang, Radu Timofte, Jiachen Tu, Yaokun Shi, Guoyi Xu, Yaoxin Jiang, Jiajia Liu, Yingsi Chen, Yijiao Liu, Hui Li, Yu Wang, Congchao Zhu, Alexandru-Gabriel Lefterache, Anamaria Radoi, Chuanyue Yan, Tao Lu, Yanduo Zhang, Kanghui Zhao, Jiaming Wang, Yuqi Li, WenBo Xiong, Yifei Chen, Xian Hu, Wei Deng, Daiguo Zhou, Sujith Roy V, Claudia Jesuraj, Vikas B, Spoorthi LC, Nikhil Akalwadi, Ramesh Ashok Tabib, Uma Mudenagudi, Yuxuan Jiang, Chengxi Zeng, Tianhao Peng, Fan Zhang, David Bull Wei Zhou, Linfeng Li, Hongyu Huang, Hoyoung Lee, SangYun Oh, ChangYoung Jeong, Axi Niu, Jinyang Zhang, Zhenguo Wu, Senyan Qing, Jinqiu Sun, Yanning Zhang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10532: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10532&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[284] Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Tencent Hunyuan Team

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11244: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11244&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[285] SyncFix: Fixing 3D Reconstructions via Multi-View Synchronization

Deming Li, Abhay Yadav, Cheng Peng, Rama Chellappa, Anand Bhattad

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11797: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11797&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[286] RSGMamba: Reliability-Aware Self-Gated State Space Model for Multimodal Semantic Segmentation

Guoan Xu, Yang Xiao, Guangwei Gao, Dongchen Zhu, Guo-Jun Qi, Wenjing Jia

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12319: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12319&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[287] Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Jiwan Kim, Kibum Kim, Wonjoong Kim, Byung-Kwan Lee, Chanyoung Park

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12358: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12358&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[288] Reading Between the Pixels: Linking Text-Image Embedding Alignment to Typographic Attack Success on Vision-Language Models

Ravikumar Balakrishnan, Sanket Mendapara, Ankit Garg

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12371: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12371&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[289] CoD-Lite: Real-Time Diffusion-Based Generative Image Compression

Zhaoyang Jia, Naifu Xue, Zihan Zheng, Jiahao Li, Bin Li, Xiaoyi Zhang, Zongyu Guo, Yuan Zhang, Houqiang Li, Yan Lu

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12525: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12525&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[290] A Faster Path to Continual Learning

Wei Li, Hangjie Yuan, Zixiang Zhao, Borui Kang, Ziwei Liu, Tao Feng

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11064: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11064&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[291] PianoFlow: Music-Aware Streaming Piano Motion Generation with Bimanual Coordination

Xuan Wang, Kai Ruan, Jiayi Han, Kaiyue Zhou, Gaoang Wang

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.12856: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.12856&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[292] HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo, Daewon Choi, Taeyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.00695: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00695&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[293] RoboTAG: End-to-end Robot Configuration Estimation via Topological Alignment Graph

Yifan Liu, Fangneng Zhan, Wanhua Li, Haowen Sun, Katerina Fragkiadaki, Hanspeter Pfister

Main category: cs.CV

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.07717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[294] From Instruction to Event: Sound-Triggered Mobile Manipulation

Hao Ju, Shaofei Huang, Hongyu Li, Zihan Ding, Si Liu, Meng Wang, Zhedong Zheng

Yiran Gao, Kim Hammar, Tao Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.13156: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13156&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Victor-Alexandru Darvariu, Charlotte Z. Reed, Jan Stratmann, Bruno Lacerda, Benjamin Allsup, Stephen Woodward, Elizabeth Siddle, Trishna Saeharaseelan, Owain Jones, Dan Jones, Tobias Ferreira, Chloe Baker, Kevin Chaplin, James Kirk, Ashley Iceton-Morris, Ryan D. Patmore, Jeff Polton, Charlotte Williams, Christopher D. J. Auckland, Rob A. Hall, Alexandra Kokkinaki, Alvaro Lorenzo Lopez, Justin J. H. Buck, Nick Hawes

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.19315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.19315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[363] FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation

Zhihao Ding, Jinming Li, Ze Lu, Jieming Shi

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.23636: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23636&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[364] Domain-Adaptive Model Merging Across Disconnected Modes

Junming Liu, Yusen Zhang, Rongchao Zhang, Wenkai Zhu, Tian Wu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.05957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.05957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[365] The Mirror Design Pattern: Strict Data Geometry over Model Scale for Prompt Injection Detection

J Alex Corll

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.11875: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.11875&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[366] Graph In-Context Operator Networks for Generalizable Spatiotemporal Prediction

Chenghan Wu, Zongmin Yu, Boai Sun, Liu Yang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.12725: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.12725&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[367] ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents

Zijian Lu, Yiping Zuo, Yupeng Nie, Xin He, Weibei Fan, Lianyong Qi, Shi Jin

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.20340: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20340&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[368] Assessment Design in the AI Era: A Method for Identifying Items Functioning Differentially for Humans and Chatbots

Licol Zeinfeld, Alona Strugatski, Ziva Bar-Dov, Ron Blonder, Shelley Rap, Giora Alexandron

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.23682: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23682&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[369] A Lightweight, Transferable, and Self-Adaptive Framework for Intelligent DC Arc-Fault Detection in Photovoltaic Systems

Xiaoke Yang, Long Gao, Haoyu He, Hanyuan Hang, Qi Liu, Shuai Zhao, Qiantu Tuo, Rui Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.25749: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25749&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[370] WOMBET: World Model-based Experience Transfer for Robust and Sample-efficient Reinforcement Learning

Mintae Kim, Koushil Sreenath

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.08958: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08958&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[371] WybeCoder: Verified Imperative Code Generation

Fabian Gloeckle, Mantas Baksys, Darius Feher, Kunhao Zheng, Amaury Hayat, Sean B. Holden, Gabriel Synnaeve, Peter O’Hearn

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.29088: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29088&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[372] Trust and Reliance on AI in Education: AI Literacy and Need for Cognition as Moderators

Griffin Pitts, Neha Rani, Weedguet Mildort

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.01114: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01114&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[373] SelfGrader: Stable Jailbreak Detection for Large Language Models using Token-Level Logits

Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.01473: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01473&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[374] Optimal Stability of KL Divergence under Gaussian Perturbations

Jialu Pan, Yufeng Zhang, Nan Hu, Keqin Li, Zhenbang Chen, Ji Wang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11026: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11026&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[375] ChatSVA: Bridging SVA Generation for Hardware Verification via Task-Specific LLMs

Lik Tung Fu, Jie Zhou, Shaokai Ren, Mengli Zhang, Jia Xiong, Hugo Jiang, Nan Guan, Xi Wang, Jun Yang

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.02811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.02811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[376] Exact Structural Abstraction and Tractability Limits

Tristan Simas

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07349: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07349&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[377] THEIA: Learning Complete Kleene Three-Valued Logic in a Pure-Neural Modular Architecture

Augustus Haoyang Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11284: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11284&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[378] eBandit: Kernel-Driven Reinforcement Learning for Adaptive Video Streaming

Mahdi Alizadeh

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.08791: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.08791&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[379] Token-Budget-Aware Pool Routing for Cost-Efficient LLM Inference

Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09613: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09613&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[380] A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

Chen Zhang, Yan Ding, Haotian Wang, Chubo Liu, Keqin Li, Kenli Li

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09752: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09752&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[381] RoboLab: A High-Fidelity Simulation Benchmark for Analysis of Task Generalist Policies

Xuning Yang, Rishit Dagli, Alex Zook, Hugo Hadfield, Ankit Goyal, Stan Birchfield, Fabio Ramos, Jonathan Tremblay

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.09860: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.09860&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[382] Cost-optimal Sequential Testing via Doubly Robust Q-learning

Doudou Zhou, Yiran Zhang, Dian Jin, Yingye Zheng, Lu Tian, Tianxi Cai

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11165: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11165&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[383] CodeTracer: Towards Traceable Agent States

Han Li, Yifan Yao, Letian Zhu, Rili Feng, Hongyi Ye, Jiaming Wang, Yancheng He, Pengyu Zou, Lehan Zhang, Xinping Lei, Haoyang Huang, Ken Deng, Ming Sun, Zhaoxiang Zhang, He Ye, Jiaheng Liu

Main category: cs.AI

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11641: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11641&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[384] Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM “VaCoAl” for Ultra-High Speed, Ultra-Low Power, and Low Cost>

Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.10877: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10877&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

Jeff Smith

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.15742: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15742&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[511] Learning-Based Estimation of Spatially Resolved Scatter Radiation Fields in Interventional Radiology

Felix Lehner, Pasquale Lombardo, Susana Castillo, Oliver Hupe, Marcus Magnor

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.17654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[512] A Review of Diffusion-based Simulation-Based Inference: Foundations and Applications in Non-Ideal Data Scenarios

Haley Rosso, Talea Mayo

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2512.23748: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.23748&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[513] Predicting Time Pressure of Powered Two-Wheeler Riders for Proactive Safety Interventions

Sumit S. Shevtekar, Chandresh K. Maurya, Gourab Sil

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2601.03173: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03173&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[514] Swap Regret Minimization Through Response-Based Approachability

Ioannis Anagnostides, Gabriele Farina, Maxwell Fishelson, Haipeng Luo, Jon Schneider

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.06264: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06264&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[515] Diagnostics for Individual-Level Prediction Instability in Machine Learning for Healthcare

Elizabeth W. Miller, Jeffrey D. Blume

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.00192: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00192&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[516] ReproMIA: A Comprehensive Analysis of Model Reprogramming for Proactive Membership Inference Attacks

Chihan Huang, Huaijin Wang, Shuai Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.28942: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28942&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[517] Restless Bandits with Individual Penalty Constraints: A New Near-Optimal Index Policy and How to Learn It

Nida Zamir, I-Hong Hou

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.04101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.04101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[518] When Can You Poison Rewards? A Tight Characterization of Reward Poisoning in Linear MDPs

Jose Efraim Aguilar Escamilla, Haoyang Hong, Jiawei Li, Haoyu Zhao, Xuezhou Zhang, Sanghyun Hong, Huazheng Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10062: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10062&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[519] Robust Adversarial Policy Optimization Under Dynamics Uncertainty

Mintae Kim, Koushil Sreenath

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.10974: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.10974&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[520] Fast and principled equation discovery from chaos to climate

Yuzheng Zhang, Weizhen Li, Rui Carvalho

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.11929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.11929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[521] Convex Hulls of Reachable Sets

Thomas Lew, Riccardo Bonalli, Marco Pavone

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2303.17674: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2303.17674&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[522] Nonparametric Sparse Online Learning of the Koopman Operator

Boya Hou, Sina Sanjari, Nathan Dahlin, Alec Koppel, Subhonmesh Bose

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2405.07432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.07432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[523] Fast training of accurate physics-informed neural networks without gradient descent

Chinmay Datar, Taniya Kapoor, Abhishek Chandra, Qing Sun, Erik Lien Bolager, Iryna Burak, Anna Veselovska, Massimo Fornasier, Felix Dietrich

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2405.20836: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.20836&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[524] Fast and Simple Densest Subgraph with Predictions

Thai Bui, Luan Nguyen, Hoa T. Vu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.12600: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.12600&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[525] Flow-based Generative Modeling of Potential Outcomes and Counterfactuals

Dongze Wu, David I. Inouye, Yao Xie

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2505.16051: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.16051&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[526] Geminet: Learning the Duality-based Iterative Process for Lightweight Traffic Engineering in Changing Topologies

Ximeng Liu, Zhuoran Liu, Yingming Mao, Yatao Li, Shizhen Zhao, Xinbing Wang

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2506.23640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[527] A Comprehensive Survey on Network Traffic Synthesis: From Statistical Models to Deep Learning

Nirhoshan Sivaroopan, Kaushitha Silva, Chamara Madarasingha, Thilini Dahanayaka, Guillaume Jourjon, Anura Jayasumana, Kanchana Thilakarathna

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.01976: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.01976&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[528] Neural Two-Stage Stochastic Optimization for Solving Unit Commitment Problem

Zhentong Shao, Jingtao Qin, Nanpeng Yu

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2507.09503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.09503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[529] Random Walk Learning and the Pac-Man Attack

Xingran Chen, Parimal Parag, Rohit Bhagat, Zonghong Liu, Salim El Rouayheb

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2508.05663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.05663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[530] Möbius transforms and Shapley values for vector-valued functions on weighted directed acyclic multigraphs

Patrick Forré, Abel Jansma

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.05786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[531] Hierarchical DLO Routing with Reinforcement Learning and In-Context Vision-language Models

Mingen Li, Houjian Yu, Yixuan Huang, Youngjin Hong, Hantao Ye, Changhyun Choi

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2510.19268: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.19268&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[532] Zero-Shot Function Encoder-Based Differentiable Predictive Control

Hassan Iqbal, Xingjian Li, Tyler Ingebrand, Adam Thorpe, Krishna Kumar, Ufuk Topcu, Ján Drgoňa

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.05757: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.05757&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[533] Robust Verification of Controllers under State Uncertainty via Hamilton-Jacobi Reachability Analysis

Albert Lin, Alessandro Pinto, Somil Bansal

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2511.14755: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14755&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[534] Fluids You Can Trust: Property-Preserving Operator Learning for Incompressible Flows

Ramansh Sharma, Matthew Lowery, Houman Owhadi, Varun Shankar

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2602.15472: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15472&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[535] Mini-Batch Covariance, Diffusion Limits, and Oracle Complexity in Stochastic Gradient Descent: A Sampling-Design Perspective

Daniel Zantedeschi, Kumar Muthuraman

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.02417: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02417&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[536] LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification

Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.03959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[537] Spectral methods: crucial for machine learning, natural for quantum computers?

Vasilis Belis, Joseph Bowles, Rishabh Gupta, Evan Peters, Maria Schuld

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2603.24654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[538] Transcriptomic Models for Immunotherapy Response Prediction Show Limited Cross-cohort Generalisability

Yuheng Liang, Lucy Chhuo, Ahmadreza Argha, Nona Farbehi, Lu Chen, Roohallah Alizadehsani, Mehdi Hosseinzadeh, Amin Beheshti, Thantrira Porntaveetusm, Youqiong Ye, Hamid Alinejad-Rokny

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.05478: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.05478&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[539] Parameter-Free Non-Ergodic Extragradient Algorithms for Solving Monotone Variational Inequalities

Lingqing Shen, Fatma Kılınç-Karzan

Main category: cs.LG

TL;DR: Error: Processing failed

Details

Motivation: Error: Processing failed

Method: Error: Processing failed

Result: Error: Processing failed

Conclusion: Error: Processing failed

Abstract: Failed to fetch summary for 2604.07662: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.07662&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[540] Evaluating Differential Privacy Against Membership Inference in Federated Learning: Insights from the NIST Genomics Red Team Challenge

Gustavo de Carvalho Bertoli

Zixuan Chen, Depeng Wang, Hao Lin, Li Luo, Ke Xu, Ya Guo, Huijia Zhu, Tanfeng Sun, Xinghao Jiang

Main category: cs.MM

Details

[548] AudioX: A Unified Framework for Anything-to-Audio Generation

Zeyue Tian, Zhaoyang Liu, Yizhu Jin, Ruibin Yuan, Liumeng Xue, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

[553] Learning Class Difficulty in Imbalanced Histopathology Segmentation via Dynamic Focal Attention

Lakmali Nadeesha Kumari, Sen-Ching Samson Cheung

Main category: eess.IV

TL;DR: DFA introduces dynamic focal attention with learnable per-class bias in cross-attention to address class imbalance in histopathology segmentation, moving beyond frequency-based reweighting to capture true difficulty factors like morphological variability.

Details

Motivation: Current approaches to class imbalance in semantic segmentation rely on frequency-based loss reweighting, which assumes rare classes are difficult. However, true difficulty arises from morphological variability, boundary ambiguity, and contextual similarity - factors that frequency cannot capture. There's a need for methods that can learn class-specific difficulty directly from data.

Method: Proposes Dynamic Focal Attention (DFA), a mechanism that learns class-specific difficulty within the cross-attention of query-based mask decoders. DFA introduces a learnable per-class bias to attention logits, enabling representation-level reweighting before prediction rather than gradient-level reweighting after prediction. Initialized from a log-frequency prior to prevent gradient starvation, the bias is optimized end-to-end, allowing the model to adaptively capture difficulty signals during training.

Result: On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU metrics, matching or exceeding a difficulty-aware baseline without requiring a separate estimator or additional training stage.

Conclusion: Encoding class difficulty at the representation level through attention mechanisms provides a principled alternative to conventional loss reweighting for imbalanced segmentation, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework.

Abstract: Semantic segmentation of histopathology images under class imbalance is typically addressed through frequency-based loss reweighting, which implicitly assumes that rare classes are difficult. However, true difficulty also arises from morphological variability, boundary ambiguity, and contextual similarity-factors that frequency cannot capture. We propose Dynamic Focal Attention (DFA), a simple and efficient mechanism that learns class-specific difficulty directly within the cross-attention of query-based mask decoders. DFA introduces a learnable per-class bias to attention logits, enabling representation-level reweighting prior to prediction rather than gradient-level reweighting after prediction. Initialised from a log-frequency prior to prevent gradient starvation, the bias is optimised end-to-end, allowing the model to adaptively capture difficulty signals through training, effectively unifying frequency-based and difficulty-aware approaches under a common attention-bias framework. On three histopathology benchmarks (BDSA, BCSS, CRAG), DFA consistently improves Dice and IoU, matching or exceeding a difficulty-aware baseline without a separate estimator or additional training stage. These results demonstrate that encoding class difficulty at the representation level provides a principled alternative to conventional loss reweighting for imbalanced segmentation.

Junho Moon, Symac Kim, Haejun Chung, Ikbeom Jang

Main category: eess.IV

TL;DR: A method for synthesizing tau PET images from structural MRI using cyclic 2.5D perceptual loss to improve volumetric consistency and standardization by scanner manufacturer to reduce variability.

Details

Motivation: PET imaging is valuable for Alzheimer's disease diagnosis but limited by cost, regulatory restrictions, and invasiveness. Cross-modal image synthesis can reconstruct unavailable PET modalities from routine MRI scans, addressing access barriers.

Method: Proposes cyclic 2.5D perceptual loss that alternates optimization across axial, coronal, and sagittal planes during training to improve volumetric consistency. Also standardizes PET SUVRs by scanner manufacturer to reduce inter-manufacturer variability. Uses various architectures (U-Net, UNETR, SwinUNETR, CycleGAN, Pix2Pix) for synthesis.

Result: Method generalizes across multiple architectures with strong performance. Improves agreement between synthesized SUVRs and measured PET in brain regions relevant to Alzheimer-type tau pathology. Validated on ADNI and SCAN cohorts spanning ADRD spectrum.

Conclusion: Cyclic 2.5D perceptual loss with scanner manufacturer standardization enables effective synthesis of tau PET from structural MRI, addressing limitations of existing perceptual losses for 3D medical image synthesis.

Abstract: Positron emission tomography (PET) provides molecular biomarkers for Alzheimer’s disease and related dementias (ADRD) and is increasingly used for diagnosis, staging, and clinical trial enrichment. However, its use is limited by cost, regulatory restrictions, and the invasiveness of radiotracer injection. Although current frameworks emphasize multimodal biomarker assessment, including the amyloid/tau/neurodegeneration (A/T/N) scheme, these barriers constrain access to PET imaging. Cross-modal image synthesis may help address this gap by reconstructing unavailable modalities from routine scans. Because PET is clinically valuable for regional uptake patterns rather than exact voxel-wise intensities, perceptual losses that capture higher-level semantic features are well suited to PET synthesis. Existing 2D, 3D, and 2.5D perceptual losses for 3D synthesis each have limitations, including restricted volumetric context, scarcity of pretrained 3D models, and difficulty balancing optimization across anatomical planes. In this study, we synthesize tau PET from structural MRI by generating 3D pseudo-[18F]flortaucipir standardized uptake value ratio (SUVR) maps from 3D T1-weighted MR images. We propose a cyclic 2.5D perceptual loss that alternates optimization across axial, coronal, and sagittal planes during training to improve volumetric consistency. We also standardize PET SUVRs by scanner manufacturer, reducing inter-manufacturer variability and better preserving high-uptake regions. Using cohorts spanning the ADRD spectrum from the ADNI and the SCAN cohort, we show that the method generalizes across U-Net, UNETR, SwinUNETR, CycleGAN, and Pix2Pix, with strong performance. Notably, it improves agreement between synthesized SUVRs and measured PET in brain regions relevant to Alzheimer-type tau pathology. Code is publicly available at https://github.com/labhai/Cyclic-2.5D-Perceptual-Loss.

[555] The Gaussian Latent Machine: Efficient Prior and Posterior Sampling for Inverse Problems

Muhamed Kuric, Martin Zach, Andreas Habring, Michael Unser, Thomas Pock

Main category: eess.IV

TL;DR: A novel Gaussian latent machine framework for efficient sampling from product-of-experts models in Bayesian imaging, unifying existing algorithms and enabling efficient two-block Gibbs sampling.

Details

Motivation: The paper addresses the challenge of sampling from complex product-of-experts models commonly used in Bayesian imaging, which often suffer from computational inefficiency. There's a need for a unified framework that can handle various prior and posterior distributions while providing efficient sampling algorithms.

Method: Proposes lifting product-of-experts models into a novel latent variable model called Gaussian latent machine. This framework enables general sampling approaches that unify existing algorithms, with a focus on efficient two-block Gibbs sampling for general cases and direct sampling for special cases.

Result: The Gaussian latent machine framework successfully unifies many existing sampling algorithms and provides efficient sampling methods. Numerical experiments demonstrate the approach’s effectiveness across various Bayesian imaging problems involving different prior and posterior distributions.

Conclusion: The proposed Gaussian latent machine offers a powerful, unified framework for sampling from product-of-experts models in Bayesian imaging, providing both theoretical unification and practical efficiency improvements over existing methods.

Abstract: We consider the problem of sampling from a product-of-experts-type model that encompasses many standard prior and posterior distributions commonly found in Bayesian imaging. We show that this model can be easily lifted into a novel latent variable model, which we refer to as a Gaussian latent machine. This leads to a general sampling approach that unifies and generalizes many existing sampling algorithms in the literature. Most notably, it yields a highly efficient and effective two-block Gibbs sampling approach in the general case, while also specializing to direct sampling algorithms in particular cases. Finally, we present detailed numerical experiments that demonstrate the efficiency and effectiveness of our proposed sampling approach across a wide range of prior and posterior sampling problems from Bayesian imaging.

Editor’s Picks

[1] Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

[2] AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction

[3] Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

Today’s Research Highlights

Table of Contents

cs.CL

[1] The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

[2] Caption First, VQA Second: Knowledge Density, Not Task Format, Drives Multimodal Scaling

[3] WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

[4] Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction

[5] A Multi-Model Approach to English-Bangla Sentiment Classification of Government Mobile Banking App Reviews

[6] KMMMU: Evaluation of Massive Multi-discipline Multimodal Understanding in Korean Language and Context

[7] A Proactive EMR Assistant for Doctor-Patient Dialogue: Streaming ASR, Belief Stabilization, and Preliminary Controlled Evaluation

[8] Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage

[9] Bi-Predictability: A Real-Time Signal for Monitoring LLM Interaction Integrity

[10] OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

[11] Mathematical Reasoning Enhanced LLM for Formula Derivation: A Case Study on Fiber NLI Modellin

[12] Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

[13] Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

[14] Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data

[15] Before the First Token: Scale-Dependent Emergence of Hallucination Signals in Autoregressive Language Models

[16] Curation of a Palaeohispanic Dataset for Machine Learning

[17] EVE: A Domain-Specific LLM Framework for Earth Intelligence

[18] LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

[19] Beyond Arrow’s Impossibility: Fairness as an Emergent Property of Multi-Agent Collaboration

[20] PersonaVLM: Long-Term Personalized Multimodal LLMs

[21] DeEscalWild: A Real-World Benchmark for Automated De-Escalation Training with SLMs

[22] Document-tuning for robust alignment to animals

[23] Memp: Exploring Agent Procedural Memory

[24] Can Large Language Models Reliably Extract Physiology Index Values from Coronary Angiography Reports?

[25] IWLV-Ramayana: A Sarga-Aligned Parallel Corpus of Valmiki’s Ramayana Across Indian Languages

[26] Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

[27] InfiniteScienceGym: An Unbounded, Procedurally-Generated Benchmark for Scientific Analysis

[28] Evaluating the Evaluator: Problems with SemEval-2020 Task 1 for Lexical Semantic Change Detection

[29] Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

[30] Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

[31] L2D-Clinical: Learning to Defer for Adaptive Model Selection in Clinical Text Classification

[32] English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training

[33] Giving Voice to the Constitution: Low-Resource Text-to-Speech for Quechua and Spanish Using a Bilingual Legal Corpus

[34] AgentSPEX: An Agent SPecification and EXecution Language

[35] Peer-Predictive Self-Training for Language Model Reasoning

[36] TLoRA+: A Low-Rank Parameter-Efficient Fine-Tuning Method for Large Language Models

[37] Empirical Evidence of Complexity-Induced Limits in Large Language Models on Finite Discrete State-Space Problems with Explicit Validity Constraints

[38] From Prediction to Justification: Aligning Sentiment Reasoning with Human Rationale via Reinforcement Learning

[39] MERRIN: A Benchmark for Multimodal Evidence Retrieval and Reasoning in Noisy Web Environments

[40] CANVAS: Continuity-Aware Narratives via Visual Agentic Storyboarding

[41] Using reasoning LLMs to extract SDOH events from clinical notes

[42] ToolSpec: Accelerating Tool Calling via Schema-Aware and Retrieval-Augmented Speculative Decoding

[43] Synthesizing Instruction-Tuning Datasets with Contrastive Decoding

[44] Debate to Align: Reliable Entity Alignment through Two-Stage Multi-Agent Debate

[45] Training-Free Test-Time Contrastive Learning for Large Language Models

[46] YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

[47] MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

[48] BenGER: A Collaborative Web Platform for End-to-End Benchmarking of German Legal Tasks

[49] Foresight Optimization for Strategic Reasoning in Large Language Models

[50] C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

[51] Syn-TurnTurk: A Synthetic Dataset for Turn-Taking Prediction in Turkish Dialogues

[52] Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

[53] IndicDB – Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

[54] Breaking the Generator Barrier: Disentangled Representation for Generalizable AI-Text Detection

[55] Co-FactChecker: A Framework for Human-AI Collaborative Claim Verification Using Large Reasoning Models

[56] Learning the Cue or Learning the Word? Analyzing Generalization in Metaphor Detection for Verbs

[57] An Empirical Investigation of Practical LLM-as-a-Judge Improvement Techniques on RewardBench 2

[58] Doc-V*:Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

[59] MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

[60] From Anchors to Supervision: Memory-Graph Guided Corpus-Free Unlearning for Large Language Models

[61] QuantileMark: A Message-Symmetric Multi-bit Watermark for LLMs

[62] ToolOmni: Enabling Open-World Tool Use via Agentic learning with Proactive Retrieval and Grounded Execution

[63] MUSE: Multi-Domain Chinese User Simulation via Self-Evolving Profiles and Rubric-Guided Alignment

[64] Robust Reward Modeling for Large Language Models via Causal Decomposition

[65] Beyond Static Personas: Situational Personality Steering for Large Language Models

[66] Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

[67] Causal Drawbridges: Characterizing Gradient Blocking of Syntactic Islands in Transformer LMs

[68] How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

[69] Leveraging LLM-GNN Integration for Open-World Question Answering over Knowledge Graphs

[70] Adaptive Conformal Prediction for Improving Factuality of Generations by Large Language Models

[71] Diffusion Language Models for Speech Recognition

[72] Dual-Enhancement Product Bundling: Bridging Interactive Graph and Large Language Model

[73] From Where Words Come: Efficient Regularization of Code Tokenizers Through Source Attribution