Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 122]
cs.CV [Total: 156]
cs.AI [Total: 72]
cs.SD [Total: 13]
cs.LG [Total: 164]
cs.MA [Total: 8]
cs.MM [Total: 3]
eess.AS [Total: 6]
eess.IV [Total: 13]

cs.CL

[1] PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

Souradeep Mukhopadhyay, Rishabh Baral, Nimeesh Mahajan, Samhitha Harish, Aswin RRV, Mihir Parmar, Mutsumi Nakamura, Chitta Baral

Main category: cs.CL

TL;DR: LLMs perform well on original logic puzzles but fail on perturbed versions due to phantom recall - reproducing memorized solutions rather than genuine reasoning.

Details

Motivation: To investigate whether LLMs genuinely reason or just memorize templates for logic puzzles, and to understand their fragility when puzzles are modified.

Method: Created PHANTOM RECALL benchmark with 25 logic puzzles and 149 perturbations, evaluated 11 LLMs, and developed tools including logical-equivalence judge, error taxonomy, and prompting-based mitigation framework.

Result: LLMs show near-perfect accuracy on unmodified puzzles but significantly underperform humans on perturbed ones, exhibiting phantom recall and over-elaboration.

Conclusion: LLMs often fail to re-reason when contextual cues shift, revealing a crucial limitation in their logical understanding despite linguistic fluency.

Abstract: Large language models (LLMs) such as GPT, Gemini, and Claude often appear adept at solving classic logic puzzles–but how much genuine reasoning underlies their answers? Recent evidence suggests that these models frequently rely on memorized templates rather than reasoning from first principles. When puzzles are slightly modified, their performance collapses, revealing a striking fragility. In particular, we asked: Have LLMs addressed these issues? To what extent? How about perturbations to other puzzles? Is there a general way of reformulating the prompt so that the models do better? To examine these things systematically, we introduce PHANTOM RECALL, a benchmark comprising 25 well-known logic puzzles and 149 carefully designed perturbations that preserve reasoning structure but alter superficial details and solutions. We evaluate eleven leading LLMs and identify a recurring failure mode–phantom recall–where models confidently reproduce memorized solutions or spurious rationales that no longer fit the altered scenario. To probe and mitigate this issue, we contribute three tools: (i) an automated logical-equivalence judge to detect reasoning mismatches, (ii) a taxonomy of fine-grained reasoning error categories, and (iii) a prompting-based mitigation framework guided by these categories. Despite near-perfect accuracy on unmodified puzzles, models significantly underperform humans on perturbed ones, exhibiting both phantom recall and over-elaboration. Our findings reveal a crucial limitation: LLMs often fail to re-reason when contextual cues shift–highlighting the gap between linguistic fluency and logical understanding.

[2] R-WoM: Retrieval-augmented World Model For Computer-use Agents

Kai Mei, Jiang Guo, Shuaichen Chang, Mingwen Dong, Dongkyu Lee, Xing Niu, Jiarong Jiang

Main category: cs.CL

TL;DR: LLMs can serve as world models for agent decision-making but suffer from hallucination and compounding errors in long-horizon simulations. The proposed Retrieval-augmented World Model (R-WoM) addresses this by incorporating external knowledge, achieving significant performance improvements.

Details

Motivation: To investigate whether LLMs are suitable for world modeling in digital environments, given their tendency toward hallucination and reliance on static training knowledge that limits long-horizon simulation capabilities.

Method: Probed LLMs’ world modeling capabilities through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Proposed R-WoM which grounds LLM simulations by retrieving factual, up-to-date knowledge from external tutorials.

Result: LLMs effectively capture immediate next states and identify meaningful state transitions, but performance degrades rapidly in full-procedure planning. R-WoM achieved improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

Conclusion: While LLMs have limitations in reliably modeling environment dynamics over long horizons, grounding their simulations with retrieved external knowledge through R-WoM significantly enhances their world modeling capabilities, especially for longer-horizon tasks.

Abstract: Large Language Models (LLMs) can serve as world models to enhance agent decision-making in digital environments by simulating future states and predicting action outcomes, potentially eliminating costly trial-and-error exploration. However, this capability is fundamentally limited by LLMs' tendency toward hallucination and their reliance on static training knowledge, which can lead to compounding errors that inhibit long-horizon simulations. To systematically investigate whether LLMs are appropriate for world modeling, we probe two core capabilities of world models–future state prediction and reward estimation–through three tasks: next-state identification, full-procedure planning alignment, and milestone transition recognition. Our analysis shows that while LLMs effectively capture immediate next states and identify meaningful state transitions, their performance rapidly degrades in full-procedure planning. This highlights LLMs’ limitations in reliably modeling environment dynamics over long horizons. To address these limitations, we propose the Retrieval-augmented World Model (R-WoM), which grounds LLM simulations by incorporating factual, up-to-date knowledge retrieved from external tutorials. Experiments show that R-WoM achieves substantial improvements of up to 25.3% (OSWorld) and 18.1% (WebArena) compared to baselines, with particular advantages in longer-horizon simulations.

[3] LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

Patrick Haller, Mark Ibrahim, Polina Kirichenko, Levent Sagun, Samuel J. Bell

Main category: cs.CL

TL;DR: LLMs’ ability to distinguish true vs false statements is highly sensitive to superficial input variations, with truthfulness representations collapsing when inputs deviate from pre-training distribution, revealing shallow knowledge representations.

Details

Motivation: To understand why LLM performance is brittle and sensitive to trivial input variations by exploring whether this stems from unstable internal knowledge representations.

Method: Applied semantically-preserving perturbations to drive statements out-of-distribution, then evaluated representation separability of truthfulness across multiple LLM families, datasets, and knowledge probing methods.

Result: Internal representations of statement truthfulness collapse as samples become less similar to pre-training data. LLMs can distinguish truth from falsehood only when inputs closely resemble pre-training distribution.

Conclusion: LLMs learn shallow, non-robust knowledge representations that limit generalizability, challenging the utility of truthfulness probes and calling for research on improving representation robustness.

Abstract: For Large Language Models (LLMs) to be reliable, they must learn robust knowledge that can be generally applied in diverse settings – often unlike those seen during training. Yet, extensive research has shown that LLM performance can be brittle, with models exhibiting excessive sensitivity to trivial input variations. In this work, we explore whether this brittleness is a direct result of unstable internal knowledge representations. To explore this question, we build on previous work showing that LLM representations encode statement truthfulness – i.e., true, factual statements can be easily separated from false, inaccurate ones. Specifically, we test the robustness of learned knowledge by evaluating representation separability on samples that have undergone superficial transformations to drive them out-of-distribution (OOD), such as typos or reformulations. By applying semantically-preserving perturbations, we study how separability degrades as statements become more OOD, across four LLM families, five evaluation datasets, and three knowledge probing methods. Our results reveal that internal representations of statement truthfulness collapse as the samples’ presentations become less similar to those seen during pre-training. While LLMs can often distinguish between true and false statements when they closely resemble the pre-training data, this ability is highly dependent on the statement’s exact surface form. These findings offer a possible explanation for brittle benchmark performance: LLMs may learn shallow, non-robust knowledge representations that allow for only limited generalizability. Our work presents a fundamental challenge for the utility of truthfulness probes, and more broadly, calls for further research on improving the robustness of learned knowledge representations.

[4] Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

Ziyang Ma, Ruiyang Xu, Zhenghao Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, Jin Xu, Pheng-Ann Heng, Kai Yu, Junyang Lin, Eng Siong Chng, Xie Chen

Main category: cs.CL

TL;DR: This paper presents Omni-Detective, a tool-calling agentic pipeline for generating high-quality multimodal data, and introduces Omni-Cloze benchmark for evaluating detailed audio-visual perception in language models.

Details

Motivation: Current Omni Language Models (OLMs) have limited capacity for fine-grained multimodal perception and suffer from a 'co-growth' problem where detail increases alongside hallucination.

Method: Proposed Omni-Detective pipeline for autonomous multimodal data generation, trained Audio-Captioner and Omni-Captioner models, and designed Omni-Cloze benchmark for evaluation.

Result: Audio-Captioner achieved best performance on MMAU and MMAR benchmarks, surpassing Gemini 2.5 Flash and comparable to Gemini 2.5 Pro. Omni-Captioner set new SOTA on VDC and achieved best detail-hallucination trade-off on video-SALMONN 2.

Conclusion: Omni-Detective effectively generates high-quality detailed captions, and Omni-Cloze provides reliable evaluation for omni detailed perception, advancing fine-grained multimodal understanding.

Abstract: Fine-grained perception of multimodal information is critical for advancing human-AI interaction. With recent progress in audio-visual technologies, Omni Language Models (OLMs), capable of processing audio and video signals in parallel, have emerged as a promising paradigm for achieving richer understanding and reasoning. However, their capacity to capture and describe fine-grained details remains limited explored. In this work, we present a systematic and comprehensive investigation of omni detailed perception from the perspectives of the data pipeline, models, and benchmark. We first identify an inherent “co-growth” between detail and hallucination in current OLMs. To address this, we propose Omni-Detective, an agentic data generation pipeline integrating tool-calling, to autonomously produce highly detailed yet minimally hallucinatory multimodal data. Based on the data generated with Omni-Detective, we train two captioning models: Audio-Captioner for audio-only detailed perception, and Omni-Captioner for audio-visual detailed perception. Under the cascade evaluation protocol, Audio-Captioner achieves the best performance on MMAU and MMAR among all open-source models, surpassing Gemini 2.5 Flash and delivering performance comparable to Gemini 2.5 Pro. On existing detailed captioning benchmarks, Omni-Captioner sets a new state-of-the-art on VDC and achieves the best trade-off between detail and hallucination on the video-SALMONN 2 testset. Given the absence of a dedicated benchmark for omni detailed perception, we design Omni-Cloze, a novel cloze-style evaluation for detailed audio, visual, and audio-visual captioning that ensures stable, efficient, and reliable assessment. Experimental results and analysis demonstrate the effectiveness of Omni-Detective in generating high-quality detailed captions, as well as the superiority of Omni-Cloze in evaluating such detailed captions.

[5] LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Armel Zebaze, Rachel Bawden, Benoît Sagot

Main category: cs.CL

TL;DR: Large reasoning models don’t benefit from ’thinking tokens’ for machine translation, but modular translation-specific prompting strategies can improve performance.

Details

Motivation: To explore whether intermediate 'thinking tokens' (similar to chain of thought reasoning) can improve machine translation performance in large reasoning models, across various language pairs and resource levels.

Method: Tested generation of intermediate tokens for MT, fine-tuned models with synthetic CoT explanations inspired by human translators, and experimented with modular translation-specific prompting strategies.

Result: Thinking tokens don’t help LRMs perform MT better. Fine-tuning with CoT explanations doesn’t outperform standard input-output fine-tuning. However, combining outputs from modular translation-specific prompting strategies shows improvements.

Conclusion: The value of intermediate tokens depends on whether they contain actual translation attempts. Using teachers to refine target translations or expand parallel corpora is more effective than distilling CoT explanations into MT models.

Abstract: Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that “thinking tokens” do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators’ practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into “thinking” MT models.

[6] Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

Lorena Calvo-Bartolomé, Valérie Aldana, Karla Cantarero, Alonso Madroñal de Mesa, Jerónimo Arenas-García, Jordan Boyd-Graber

Main category: cs.CL

TL;DR: MIND is a user-in-the-loop fact-checking pipeline that detects factual and cultural discrepancies in multilingual QA systems, particularly for culturally sensitive questions that vary by region.

Details

Motivation: Multilingual QA systems need to ensure factual consistency across languages while accounting for cultural variations in subjective responses, especially for culturally sensitive questions.

Method: Propose MIND, a user-in-the-loop fact-checking pipeline that highlights divergent answers to culturally sensitive questions that vary by region and context. Evaluated on bilingual QA system in maternal and infant health domain.

Result: MIND reliably identifies inconsistencies in all tested cases, supporting development of more culturally aware and factually consistent QA systems. Released a dataset of bilingual questions annotated for factual and cultural inconsistencies.

Conclusion: MIND effectively detects factual and cultural discrepancies in multilingual QA knowledge bases, demonstrating reliable performance across different domains and supporting improved cultural awareness in QA systems.

Abstract: Multilingual question answering (QA) systems must ensure factual consistency across languages, especially for objective queries such as What is jaundice?, while also accounting for cultural variation in subjective responses. We propose MIND, a user-in-the-loop fact-checking pipeline to detect factual and cultural discrepancies in multilingual QA knowledge bases. MIND highlights divergent answers to culturally sensitive questions (e.g., Who assists in childbirth?) that vary by region and context. We evaluate MIND on a bilingual QA system in the maternal and infant health domain and release a dataset of bilingual questions annotated for factual and cultural inconsistencies. We further test MIND on datasets from other domains to assess generalization. In all cases, MIND reliably identifies inconsistencies, supporting the development of more culturally aware and factually consistent QA systems.

[7] TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Yupei Li, Philipp Borchert, Gerasimos Lampouras

Main category: cs.CL

TL;DR: TopoAlign is a framework that repurposes code repositories as training data for Math LLMs by structurally aligning code components to mirror formal mathematical statements, improving autoformalization performance without human annotation.

Details

Motivation: LLMs struggle with autoformalization due to scarcity of informal-formal math statement pairs and structural differences between code and formal mathematics, limiting transfer learning from code generation training.

Method: TopoAlign decomposes code into docstrings, main functions, and dependency functions, then reassembles them into structurally aligned analogues that mirror formal mathematical statements for training Math LLMs.

Result: Substantial performance gains: DeepSeek-Math improved by 17.77% on BEq@10 and 68.82% on typecheck@10; Herald gained 0.12% on BEq@10 and 1.09% on typecheck@10 on minif2f, Putnam, and ProofNet benchmarks.

Conclusion: Training on structurally aligned code data significantly improves Math LLM autoformalization performance, demonstrating that code repositories can be effectively repurposed as training resources even for specialized models.

Abstract: Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

[8] Not in Sync: Unveiling Temporal Bias in Audio Chat Models

Jiayu Yao, Shenghua Liu, Yiwei Wang, Rundong Cheng, Lingrui Mei, Baolong Bi, Zhen Xiong, Xueqi Cheng

Main category: cs.CL

TL;DR: First systematic study of temporal bias in Large Audio Language Models (LALMs) reveals systematic misalignment in timestamp predictions, with bias increasing with audio length and varying by event type.

Details

Motivation: LALMs are increasingly used for audio understanding but their ability to accurately locate when events occur remains underexplored, particularly the systematic temporal bias in timestamp predictions.

Method: Conducted controlled experiments on timestamped datasets, developed Temporal Bias Index (TBI) to quantify systematic misalignment, and created visualization framework to analyze bias patterns.

Result: Found temporal bias is prevalent across datasets and models, increases with audio length (accumulating to tens of seconds), and varies across event types and positions.

Conclusion: Current LALMs have fundamental limitations in temporal localization accuracy, highlighting the need for developing temporally robust architectures.

Abstract: Large Audio Language Models (LALMs) are increasingly applied to audio understanding and multimodal reasoning, yet their ability to locate when events occur remains underexplored. We present the first systematic study of temporal bias in LALMs, revealing a key limitation in their timestamp prediction. For example, when asked “At which second does the lecturer introduce the key formula?”, models often predict timestamps that are consistently earlier or later than the ground truth. Through controlled experiments on timestamped datasets, we find that temporal bias (i) is prevalent across datasets and models, (ii) increases with audio length - even accumulating to tens of seconds in extended recordings, and (iii) varies across event types and positions. We quantify this effect with the Temporal Bias Index (TBI), measuring systematic misalignment in predicted event timings, and complement it with a visualization framework. Our findings highlight a fundamental limitation in current LALMs and call for the development of temporally robust architectures.

[9] GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences

Priyanka Dey, Daniele Rosa, Wenqing Zheng, Daniel Barcklow, Jieyu Zhao, Emilio Ferrara

Main category: cs.CL

TL;DR: GRAVITY is a framework that generates synthetic preference data using user profiles to enable scalable LLM personalization without costly human annotations.

Details

Motivation: Current LLM personalization relies on expensive human feedback or interaction logs, which limits scalability and fails to capture deeper user attributes like interests, values, and personality traits.

Method: Integrates demographic, cultural, and psychological frameworks (Hofstede’s dimensions, Schwartz’s values, World Values Survey, Big Five OCEAN traits) to synthesize profile-grounded preference pairs for personalized content generation.

Result: Achieved over 4% higher preference gains across baselines, with user studies showing GRAVITY outputs preferred over 86% of the time across multiple cultures (USA, Brazil, Japan, India).

Conclusion: Profile-grounded synthetic data captures richer user variation, reduces reliance on costly annotation, and produces more engaging content, offering a scalable path for LLM personalization.

Abstract: Personalization in LLMs often relies on costly human feedback or interaction logs, limiting scalability and neglecting deeper user attributes. To reduce the reliance on human annotations, we introduce GRAVITY (Generative Response with Aligned Values, Interests, and Traits of You), a framework for generating synthetic, profile-grounded preference data that captures users’ interests, values, beliefs, and personality traits. By integrating demographic, cultural, and psychological frameworks – including Hofstede’s cultural dimensions, Schwartz’s basic values, the World Values Survey, and Big Five OCEAN traits – GRAVITY synthesizes preference pairs to guide personalized content generation. We evaluate GRAVITY on book descriptions for 400 Amazon users, comparing it to prompt-based conditioning, standard fine-tuning, and naive synthetic pair generation. Profile-grounded synthetic data consistently improves generation, especially across multiple cultures (USA, Brazil, Japan, India), achieving over 4% higher preference gains across baselines, with user studies showing that GRAVITY outputs are preferred over 86% of the time. Our results show that scenario-grounded synthetic data can capture richer user variation, reduce reliance on costly annotation, and produce more engaging, user-centered content, offering a scalable path for LLM personalization.

[10] Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

Gabrielle Kaili-May Liu, Bryan Li, Arman Cohan, William Gantt Walden, Eugene Yang

Main category: cs.CL

TL;DR: The paper introduces CRUMQs, a pipeline for creating uncheatable, realistic, unanswerable, and multi-hop queries to better evaluate RAG systems, showing significant improvements in benchmark difficulty and reduced cheatability.

Details

Motivation: Existing RAG benchmarks often fail to reflect real-world complexity, allowing systems to cheat via disconnected reasoning or simple recall, limiting their ability to uncover true system limitations.

Method: Developed an automatic pipeline for creating CRUMQs (uncheatable, realistic, unanswerable, multi-hop queries) that is adaptable to any corpus and domain, then applied it to two popular RAG datasets.

Result: CRUMQs proved highly challenging for leading RAG systems, achieving up to 81.0% reduction in cheatability scores compared to prior benchmarks.

Conclusion: The CRUMQs pipeline provides an effective way to enhance benchmark difficulty and realism, driving development of more capable RAG systems that can handle complex real-world queries.

Abstract: Real-world use cases often present RAG systems with complex queries for which relevant information is missing from the corpus or is incomplete. In these settings, RAG systems must be able to reject unanswerable, out-of-scope queries and identify failures of retrieval and multi-hop reasoning. Despite this, existing RAG benchmarks rarely reflect realistic task complexity for multi-hop or out-of-scope questions, which often can be cheated via disconnected reasoning (i.e., solved without genuine multi-hop inference) or require only simple factual recall. This limits the ability for such benchmarks to uncover limitations of existing RAG systems. To address this gap, we present the first pipeline for automatic, difficulty-controlled creation of un$\underline{c}$heatable, $\underline{r}$ealistic, $\underline{u}$nanswerable, and $\underline{m}$ulti-hop $\underline{q}$uerie$\underline{s}$ (CRUMQs), adaptable to any corpus and domain. We use our pipeline to create CRUMQs over two popular RAG datasets and demonstrate its effectiveness via benchmark experiments on leading retrieval-augmented LLMs. Results show that compared to prior RAG benchmarks, CRUMQs are highly challenging for RAG systems and achieve up to 81.0% reduction in cheatability scores. More broadly, our pipeline offers a simple way to enhance benchmark difficulty and realism and drive development of more capable RAG systems.

[11] Direct Multi-Token Decoding

Xuan Luo, Weizhi Wang, Xifeng Yan

Main category: cs.CL

TL;DR: Decoder-only transformers can generate multiple tokens at once using only late layers, achieving 2x speedup without extra parameters or verification.

Details

Motivation: Early research suggests different transformer layers serve distinct roles, and late layers alone may contain enough information to generate multiple tokens, avoiding repeated computation through early/middle layers.

Method: Direct Multi-Token Decoding (DMTD) - fine-tune models to use only late layers for generating multiple tokens from processed hidden states, without speculative decoding or additional components.

Result: Fine-tuned DMTD Qwen3-4B model achieved up to 2x speedup with minor performance loss, and scaling analysis suggests further improvements with larger training datasets.

Conclusion: DMTD provides an efficient inference paradigm that leverages layer specialization in transformers to accelerate generation without architectural changes or verification overhead.

Abstract: Decoder-only transformers have become the standard architecture for large language models (LLMs) due to their strong performance. Recent studies suggest that, in pre-trained LLMs, early, middle, and late layers may serve distinct roles: Early layers focus on understanding the input context, middle layers handle task-specific processing, and late layers convert abstract representations into output tokens. We hypothesize that once representations have been processed by the early and middle layers, the resulting hidden states may encapsulate sufficient information to support the generation of multiple tokens using only the late layers, eliminating the need to repeatedly traverse the early and middle layers. We refer to this inference paradigm as Direct Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces no additional parameters, auxiliary routines, or post-generation verification. Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model has already demonstrated promising results, achieving up to a 2x speedup with only minor performance loss. Moreover, as shown in our scaling analysis, its performance is expected to further improve with larger training datasets.

[12] Scaling Long-Horizon LLM Agent via Context-Folding

Weiwei Sun, Miao Lu, Zhan Ling, Kang Liu, Xuesong Yao, Yiming Yang, Jiecao Chen

Main category: cs.CL

TL;DR: Context-Folding enables LLM agents to manage working context by procedurally branching into sub-trajectories for subtasks and folding them upon completion, retaining concise summaries while reducing context usage.

Details

Motivation: Large language model agents are fundamentally constrained by context length on long-horizon tasks, requiring efficient context management strategies.

Method: Developed Context-Folding framework with FoldGRPO reinforcement learning using process rewards to encourage effective task decomposition and context management through procedural branching and folding of sub-trajectories.

Result: On complex long-horizon tasks (Deep Research and SWE), the folding agent matches or outperforms ReAct baselines while using 10× smaller active context and significantly outperforms summarization-based context management models.

Conclusion: Context-Folding provides an effective approach for LLM agents to handle long-horizon tasks with significantly reduced context requirements while maintaining or improving performance.

Abstract: Large language model (LLM) agents are fundamentally constrained by context length on long-horizon tasks. We introduce Context-Folding, a framework that empowers agents to actively manage their working context. An agent can procedurally branch into a sub-trajectory to handle a subtask and then fold it upon completion, collapsing the intermediate steps while retaining a concise summary of the outcome. To make this behavior learnable, we develop an end-to-end reinforcement learning framework FoldGRPO with specific process rewards to encourage effective task decomposition and context management. On complex long-horizon tasks (Deep Research and SWE), our folding agent matches or outperforms the ReAct baselines while using an active context 10$\times$ smaller and significantly outperforms models that rely on summarization-based context management.

[13] Conjecturing: An Overlooked Step in Formal Mathematical Reasoning

Jasivan Alex Sivakumar, Philipp Borchert, Ronald Cardenas, Gerasimos Lampouras

Main category: cs.CL

TL;DR: The paper introduces ConjectureBench to evaluate LLMs’ conjecturing ability in autoformalisation, showing current performance is overestimated and proposing Lean-FIRe method to improve end-to-end autoformalisation.

Details

Motivation: Current autoformalisation approaches overlook the critical conjecturing step, and LLM evaluation doesn't properly measure conjecturing ability as a distinct task from formalisation.

Method: Created ConjectureBench dataset, redesigned evaluation framework to measure conjecturing separately, and developed Lean-FIRe inference-time method to improve conjecturing and autoformalisation.

Result: LLMs’ autoformalisation performance is substantially overestimated when conjecture is provided; Lean-FIRe achieved first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1.

Conclusion: Conjecturing should be treated as an independent task in autoformalisation, and LLMs have the knowledge to generate accurate conjectures but need better integration methods for formal mathematical reasoning.

Abstract: Autoformalisation, the task of expressing informal mathematical statements in formal language, is often viewed as a direct translation process. This, however, disregards a critical preceding step: conjecturing. Many mathematical problems cannot be formalised directly without first conjecturing a conclusion such as an explicit answer, or a specific bound. Since Large Language Models (LLMs) already struggle with autoformalisation, and the evaluation of their conjecturing ability is limited and often entangled within autoformalisation or proof, it is particularly challenging to understand its effect. To address this gap, we augment existing datasets to create ConjectureBench, and redesign the evaluation framework and metric specifically to measure the conjecturing capabilities of LLMs both as a distinct task and within the autoformalisation pipeline. Our evaluation of foundational models, including GPT-4.1 and DeepSeek-V3.1, reveals that their autoformalisation performance is substantially overestimated when the conjecture is accounted for during evaluation. However, the conjecture should not be assumed to be provided. We design an inference-time method, Lean-FIRe to improve conjecturing and autoformalisation, which, to the best of our knowledge, achieves the first successful end-to-end autoformalisation of 13 PutnamBench problems with GPT-4.1 and 7 with DeepSeek-V3.1. We demonstrate that while LLMs possess the requisite knowledge to generate accurate conjectures, improving autoformalisation performance requires treating conjecturing as an independent task, and investigating further how to correctly integrate it within autoformalisation. Finally, we provide forward-looking guidance to steer future research toward improving conjecturing, an overlooked step of formal mathematical reasoning.

[14] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

Ryan Shea, Yunan Lu, Liang Qiu, Zhou Yu

Main category: cs.CL

TL;DR: SAGE is a user simulation framework for multi-turn agent evaluation that integrates business knowledge to create more realistic customer interactions and better identify agent errors.

Details

Motivation: Current user simulation approaches for multi-turn agent evaluation are too generic and lack domain-specific knowledge, making them unable to capture realistic customer behavior in business contexts.

Method: SAGE combines top-down business logic (ideal customer profiles) with bottom-up business infrastructure knowledge (product catalogs, FAQs, knowledge bases) to simulate realistic user interactions.

Result: SAGE produces more realistic and diverse interactions while identifying up to 33% more agent errors compared to existing approaches.

Conclusion: SAGE is an effective evaluation tool that supports bug-finding and iterative agent improvement by grounding user simulation in real business contexts.

Abstract: Evaluating multi-turn interactive agents is challenging due to the need for human assessment. Evaluation with simulated users has been introduced as an alternative, however existing approaches typically model generic users and overlook the domain-specific principles required to capture realistic behavior. We propose SAGE, a novel user Simulation framework for multi-turn AGent Evaluation that integrates knowledge from business contexts. SAGE incorporates top-down knowledge rooted in business logic, such as ideal customer profiles, grounding user behavior in realistic customer personas. We further integrate bottom-up knowledge taken from business agent infrastructure (e.g., product catalogs, FAQs, and knowledge bases), allowing the simulator to generate interactions that reflect users’ information needs and expectations in a company’s target market. Through empirical evaluation, we find that this approach produces interactions that are more realistic and diverse, while also identifying up to 33% more agent errors, highlighting its effectiveness as an evaluation tool to support bug-finding and iterative agent improvement.

[15] Generate Logical Equivalence Questions

Xinyu Wang, Haoming Yu, Yicheng Yang, Zhiyuan Li

Main category: cs.CL

TL;DR: Proposes an Automatic Question Generation (AQG) system for logical equivalence questions in Discrete Mathematics to combat plagiarism and provide practice questions, addressing inefficiencies in existing approaches.

Details

Motivation: To mitigate academic dishonesty in online learning environments and provide diverse practice questions, particularly for foundational computer science courses like Discrete Mathematics.

Method: Defines logical equivalence questions using a formal language, translates this into generation rules, and develops a linear-time algorithm for question generation.

Result: Student performance on generated questions was comparable to textbook questions, and difficulty levels were similar to textbook questions and better than large language model-generated questions.

Conclusion: The proposed AQG system successfully generates high-quality logical equivalence questions with appropriate difficulty levels, making it effective for academic use.

Abstract: Academic dishonesty is met with zero tolerance in higher education, yet plagiarism has become increasingly prevalent in the era of online teaching and learning. Automatic Question Generation (AQG) presents a potential solution to mitigate copying by creating unique questions for each student. Additionally, AQG can provide a vast array of practice questions. Our AQG focuses on generating logical equivalence questions for Discrete Mathematics, a foundational course for first-year computer science students. A literature review reveals that existing AQGs for this type of question generate all propositions that meet user-defined constraints, resulting in inefficiencies and a lack of uniform question difficulty. To address this, we propose a new approach that defines logical equivalence questions using a formal language, translates this language into two sets of generation rules, and develops a linear-time algorithm for question generation. We evaluated our AQG through two experiments. The first involved a group of students completing questions generated by our system. Statistical analysis shows that the accuracy of these questions is comparable to that of textbook questions. The second experiment assessed the number of steps required to solve our generated questions, textbook questions, and those generated by multiple large language models. The results indicated that the difficulty of our questions was similar to that of textbook questions, confirming the quality of our AQG.

[16] Information Extraction from Conversation Transcripts: Neuro-Symbolic vs. LLM

Alice Saebom Kwak, Maria Alexeeva, Gus Hahn-Powell, Keith Alcock, Kevin McLaughlin, Doug McCorkle, Gabe McNunn, Mihai Surdeanu

Main category: cs.CL

TL;DR: Comparison of neuro-symbolic vs LLM-based information extraction in agriculture shows LLMs outperform (F1: 69.4 vs 52.7) but with trade-offs in speed, control, and maintenance costs.

Details

Motivation: To compare traditional neuro-symbolic IE systems with modern LLM-based approaches in real-world agricultural applications, highlighting the hidden costs and trade-offs of each method.

Method: Evaluated both neuro-symbolic and LLM-based IE systems on nine interviews across pork, dairy, and crop subdomains, measuring total and core information extraction performance.

Result: LLM-based system outperformed neuro-symbolic approach (F1 total: 69.4 vs 52.7; core: 63.0 vs 47.2), but neuro-symbolic offered faster runtime and greater control while LLMs had deployment advantages.

Conclusion: Both systems have significant trade-offs: neuro-symbolic provides control and speed but lacks generalizability, while LLMs offer higher performance but have slower runtime and hallucination risks, emphasizing the need to balance performance, efficiency, and control.

Abstract: The current trend in information extraction (IE) is to rely extensively on large language models, effectively discarding decades of experience in building symbolic or statistical IE systems. This paper compares a neuro-symbolic (NS) and an LLM-based IE system in the agricultural domain, evaluating them on nine interviews across pork, dairy, and crop subdomains. The LLM-based system outperforms the NS one (F1 total: 69.4 vs. 52.7; core: 63.0 vs. 47.2), where total includes all extracted information and core focuses on essential details. However, each system has trade-offs: the NS approach offers faster runtime, greater control, and high accuracy in context-free tasks but lacks generalizability, struggles with contextual nuances, and requires significant resources to develop and maintain. The LLM-based system achieves higher performance, faster deployment, and easier maintenance but has slower runtime, limited control, model dependency and hallucination risks. Our findings highlight the “hidden cost” of deploying NLP systems in real-world applications, emphasizing the need to balance performance, efficiency, and control.

[17] CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement

Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee

Main category: cs.CL

TL;DR: CPR is a plug-and-play framework that refines poorly structured prompts to reduce hallucinations in LLMs by cleaning prompts and generating additional task descriptions using a fine-tuned small language model.

Details

Motivation: LLMs often generate plausible but incorrect facts due to poorly structured or vague prompts from users, which causes models to base responses on assumed rather than actual intentions, undermining trust.

Method: Curative Prompt Refinement (CPR) framework that 1) cleans ill-formed prompts and 2) generates additional informative task descriptions using a fine-tuned small language model to align user intention with prompts.

Result: CPR significantly increases generation quality while mitigating hallucinations, achieving over 90% win rate over original prompts without external knowledge.

Conclusion: The CPR framework effectively reduces hallucinations in LLMs by refining poorly structured prompts, demonstrating that prompt quality improvement can substantially enhance model reliability without requiring external knowledge sources.

Abstract: Recent advancements in large language models (LLMs) highlight their fluency in generating responses to diverse prompts. However, these models sometimes generate plausible yet incorrect ``hallucinated" facts, undermining trust. A frequent but often overlooked cause of such errors is the use of poorly structured or vague prompts by users, leading LLMs to base responses on assumed rather than actual intentions. To mitigate hallucinations induced by these ill-formed prompts, we introduce Curative Prompt Refinement (CPR), a plug-and-play framework for curative prompt refinement that 1) cleans ill-formed prompts, and 2) generates additional informative task descriptions to align the intention of the user and the prompt using a fine-tuned small language model. When applied to language models, we discover that CPR significantly increases the quality of generation while also mitigating hallucination. Empirical studies show that prompts with CPR applied achieves over a 90% win rate over the original prompts without any external knowledge.

Jung-Woo Shim, Yeong-Joon Ju, Ji-Hoon Park, Seong-Whan Lee

Main category: cs.CL

TL;DR: MPR is a multi-stage prompt refinement framework that systematically improves ill-formed prompts to reduce LLM hallucinations, achieving over 85% win rate compared to original prompts.

Details

Motivation: LLMs face challenges with hallucinations, and the impact of ill-formed prompts (ambiguous wording, incorrect grammar, incomplete information) was under-explored as a contributing factor.

Method: Multi-stage Prompt Refinement (MPR) framework that iteratively improves prompts across multiple stages using fine-tuned small language models to address specific errors like punctuation, typos, and key term misuse, with self-reflection and ranking mechanisms.

Result: MPR achieves over 85% win rate compared to original prompts on hallucination benchmarks, effectively reducing hallucinations and improving LLM output accuracy. It can also be combined with existing post-hoc hallucination mitigation frameworks.

Conclusion: MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains by systematically refining ill-formed prompts to reduce hallucinations.

Abstract: Recent advancements in large language models (LLMs) have shown strong performance in natural language understanding and generation tasks. However, LLMs continue to encounter challenges with hallucinations, where models generate plausible but incorrect information. While several factors contribute to hallucinations, the impact of ill-formed prompts, prompts with ambiguous wording, incorrect grammar, or incomplete information, was relatively under explored. To address this, we introduce Multi-stage Prompt Refinement (MPR), a framework designed to systematically improve these ill-formed prompts across multiple stages. Each stage addresses specific errors such as punctuation, typographical mistakes, and misuse of key terms, using small language models (SLMs) fine-tuned for these tasks. MPR iteratively enhances the clarity of prompts with additional context and employs a self-reflection mechanism with ranking to prioritize the most relevant input. Experimental results on hallucination benchmarks show that prompts refined by MPR achieve over an 85~% win rate compared to their original forms, demonstrating its effectiveness in reducing hallucinations and improving LLM output accuracy. Interestingly, we reveal that MPR can be combined with existing post-hoc hallucination mitigation frameworks, further enhancing its versatility. MPR provides a lightweight and adaptable solution for enhancing LLM reliability across various domains.

[19] On the Interplay between Human Label Variation and Model Fairness

Kemal Kurniawan, Meladel Mistica, Timothy Baldwin, Jey Han Lau

Main category: cs.CL

TL;DR: This paper examines how human label variation affects model fairness, showing that HLV training methods improve fairness without explicit debiasing.

Details

Motivation: To explore the previously unstudied impact of human label variation on model fairness.

Method: Compared training on majority-vote labels with various HLV methods.

Result: HLV training methods positively impact fairness even without explicit debiasing techniques.

Conclusion: Human label variation training methods can enhance model fairness without requiring additional debiasing interventions.

Abstract: The impact of human label variation (HLV) on model fairness is an unexplored topic. This paper examines the interplay by comparing training on majority-vote labels with a range of HLV methods. Our experiments show that without explicit debiasing, HLV training methods have a positive impact on fairness.

[20] Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions

Sungmin Kang, Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Salman Avestimehr

Main category: cs.CL

TL;DR: This paper surveys uncertainty quantification (UQ) methods for detecting hallucinations in large language models (LLMs), providing a systematic categorization of approaches and empirical evaluation of representative methods.

Details

Motivation: LLMs are prone to producing plausible but factually incorrect outputs (hallucinations), raising reliability concerns in real-world applications. Uncertainty quantification offers principled measures to assess trustworthiness of model generations.

Method: The paper introduces foundations of UQ, adapts traditional uncertainty concepts (epistemic and aleatoric) to LLMs, systematically categorizes existing hallucination detection methods along multiple dimensions, and presents empirical results for representative approaches.

Result: The study provides a comprehensive overview of how UQ can be used as a mechanism for identifying unreliable generations in LLMs, improving their reliability through principled uncertainty measures.

Conclusion: The paper outlines the current landscape of LLM uncertainty quantification for hallucination detection, discusses current limitations, and identifies promising future research directions to enhance model trustworthiness.

Abstract: The rapid advancement of large language models (LLMs) has transformed the landscape of natural language processing, enabling breakthroughs across a wide range of areas including question answering, machine translation, and text summarization. Yet, their deployment in real-world applications has raised concerns over reliability and trustworthiness, as LLMs remain prone to hallucinations that produce plausible but factually incorrect outputs. Uncertainty quantification (UQ) has emerged as a central research direction to address this issue, offering principled measures for assessing the trustworthiness of model generations. We begin by introducing the foundations of UQ, from its formal definition to the traditional distinction between epistemic and aleatoric uncertainty, and then highlight how these concepts have been adapted to the context of LLMs. Building on this, we examine the role of UQ in hallucination detection, where quantifying uncertainty provides a mechanism for identifying unreliable generations and improving reliability. We systematically categorize a wide spectrum of existing methods along multiple dimensions and present empirical results for several representative approaches. Finally, we discuss current limitations and outline promising future research directions, providing a clearer picture of the current landscape of LLM UQ for hallucination detection.

[21] Improving Text-to-Image Generation with Input-Side Inference-Time Scaling

Ruibo Chen, Jiacheng Pan, Heng Huang, Zhenheng Yang

Main category: cs.CL

TL;DR: A prompt rewriting framework using LLMs to refine user inputs before T2I generation, improving image-text alignment, quality, and aesthetics without supervised data.

Details

Motivation: Existing T2I models struggle with simple or underspecified prompts, leading to poor image-text alignment and quality.

Method: Uses LLMs with reward system and iterative DPO training to enhance prompts, without requiring supervised fine-tuning data.

Result: Consistently improves image-text alignment, visual quality, and aesthetics across diverse T2I models, with strong transferability between backbones.

Conclusion: Prompt rewriting is an effective, scalable, and model-agnostic strategy for improving T2I systems.

Abstract: Recent advances in text-to-image (T2I) generation have achieved impressive results, yet existing models often struggle with simple or underspecified prompts, leading to suboptimal image-text alignment, aesthetics, and quality. We propose a prompt rewriting framework that leverages large language models (LLMs) to refine user inputs before feeding them into T2I backbones. Our approach introduces a carefully designed reward system and an iterative direct preference optimization (DPO) training pipeline, enabling the rewriter to enhance prompts without requiring supervised fine-tuning data. We evaluate our method across diverse T2I models and benchmarks. Results show that our prompt rewriter consistently improves image-text alignment, visual quality, and aesthetics, outperforming strong baselines. Furthermore, we demonstrate strong transferability by showing that a prompt rewriter trained on one T2I backbone generalizes effectively to others without needing to be retrained. We also systematically study scalability, evaluating how performance gains scale with the capacity of the large LLM used as the rewriter. These findings highlight that prompt rewriting is an effective, scalable, and practical model-agnostic strategy for improving T2I systems. We plan to release the code and trained prompt rewriters soon.

[22] Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models

Yukun Zhang, Qi Dong

Main category: cs.CL

TL;DR: Hierarchical Alignment applies targeted DPO to distinct functional blocks (local, intermediate, global layers) instead of uniform optimization, improving performance while avoiding alignment tax.

Details

Motivation: Standard alignment techniques treat LLMs as monolithic entities, overlooking functional specialization in Transformer architecture where different layers handle distinct tasks.

Method: Hierarchical Alignment method applies targeted DPO to three functional blocks: local (syntax), intermediate (logic), and global (factuality) layers using LoRA for surgical fine-tuning.

Result: Local-Align improves grammatical fluency; Global-Align enhances factual consistency and is most effective for logical coherence; all strategies avoid alignment tax where fluency gains degrade reasoning.

Conclusion: Hierarchical Alignment provides resource-efficient, controllable, and interpretable path for model alignment, shifting from monolithic optimization to structure-aware surgical fine-tuning.

Abstract: Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model’s layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the “alignment tax” observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.

[23] APCE: Adaptive Progressive Context Expansion for Long Context Processing

Baisub Lee, Sanghyun Byun, Mohanad Odema, Jung Guack, Jacob Song, Woo Seong Chung

Main category: cs.CL

TL;DR: APCE is a context-aware method that surgically selects important input chunks through semantic similarity matching to reduce memory footprint and mitigate ContextRot in long-context transformers, achieving comparable summarization performance using only 50%-70% of input.

Details

Motivation: Address two key challenges in Long-Context Transformer Models: growing memory footprint from quadratic self-attention and linear KV-cache scaling, and ContextRot performance degradation with increasing context length.

Method: APCE (context-aware solution) selects most important input chunks through low-dimensional semantic similarity matching with current query, operating directly on input to decouple from hardware dependencies.

Result: Superior or on-par summarization performance compared to full dense baseline while using only 50%-70% of input sequence, resulting in KV-cache and self-attention memory efficiency improvements.

Conclusion: APCE demonstrates effective context-aware efficiency for long-context summarization tasks and inspires further research on similar solutions for other long-context tasks.

Abstract: Deploying useful Long-Context Transformer Models (LCTMs) requires addressing two key challenges: (1) A growing memory footprint due to quadratic self-attention and linear KV-cache scaling in memory as sequence length increases; (2) the ContextRot phenomena where empirical evidence suggests that transformer architecture’s performance degrades with increasing context length. Given the shared dependency on the input, a natural question arises: Can we surgically select the most important input chunks for processing to synergistically (a) reduce the memory footprint, and (b) mitigate the ContextRot effects? In this paper, we answer this question in the affirmative for long-context summarization tasks. We propose APCE as a context-aware solution to select the most important input chunks through low-dimensional semantic similarity matching with the current query. By directly operating on the input, APCE decouples from strict dependency on underlying hardware or CUDA environments, promising a compatible solution scalable to different deployment systems. Our empirical evaluations have demonstrated superior or on-par summarization performance for APCE compared to the full dense baseline using a fraction (50%-70%) of the input sequence resulting in KV-cache and self-attention memory efficiency improvements. We hope our findings inspire further research on context-aware efficiency solutions for LCTMs geared towards other relevant long-context tasks.

[24] An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations

Benjamin W. Nelson, Celeste Wong, Matthew T. Silvestrini, Sooyoon Shin, Alanna Robinson, Jessica Lee, Eric Yang, John Torous, Andrew Trister

Main category: cs.CL

TL;DR: The Verily behavioral health safety filter (VBHSF) outperforms existing content moderation systems in detecting mental health crises, achieving high sensitivity and specificity across different datasets.

Details

Motivation: Large language models often provide harmful or inappropriate advice during psychiatric emergencies, highlighting the need for effective safety filters to prevent destructive behaviors.

Method: Evaluated VBHSF on two clinician-labelled datasets (Verily Mental Health Crisis Dataset and NVIDIA Aegis AI Content Safety Dataset) and compared performance against OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails.

Result: VBHSF achieved high sensitivity (0.990) and specificity (0.992) on Verily dataset, and maintained high sensitivity (0.982) and accuracy (0.921) on NVIDIA dataset. Significantly outperformed both comparison systems in sensitivity across all cases.

Conclusion: VBHSF demonstrates robust, generalizable performance that prioritizes sensitivity to minimize missed crises, making it suitable for healthcare applications where detecting mental health emergencies is critical.

Abstract: Large language models often mishandle psychiatric emergencies, offering harmful or inappropriate advice and enabling destructive behaviors. This study evaluated the Verily behavioral health safety filter (VBHSF) on two datasets: the Verily Mental Health Crisis Dataset containing 1,800 simulated messages and the NVIDIA Aegis AI Content Safety Dataset subsetted to 794 mental health-related messages. The two datasets were clinician-labelled and we evaluated performance using the clinician labels. Additionally, we carried out comparative performance analyses against two open source, content moderation guardrails: OpenAI Omni Moderation Latest and NVIDIA NeMo Guardrails. The VBHSF demonstrated, well-balanced performance on the Verily Mental Health Crisis Dataset v1.0, achieving high sensitivity (0.990) and specificity (0.992) in detecting any mental health crises. It achieved an F1-score of 0.939, sensitivity ranged from 0.917-0.992, and specificity was >= 0.978 in identifying specific crisis categories. When evaluated against the NVIDIA Aegis AI Content Safety Dataset 2.0, VBHSF performance remained highly sensitive (0.982) and accuracy (0.921) with reduced specificity (0.859). When compared with the NVIDIA NeMo and OpenAI Omni Moderation Latest guardrails, the VBHSF demonstrated superior performance metrics across both datasets, achieving significantly higher sensitivity in all cases (all p < 0.001) and higher specificity relative to NVIDIA NeMo (p < 0.001), but not to OpenAI Omni Moderation Latest (p = 0.094). NVIDIA NeMo and OpenAI Omni Moderation Latest exhibited inconsistent performance across specific crisis types, with sensitivity for some categories falling below 0.10. Overall, the VBHSF demonstrated robust, generalizable performance that prioritizes sensitivity to minimize missed crises, a crucial feature for healthcare applications.

[25] Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

Ziliang Qiu, Renfen Hu

Main category: cs.CL

TL;DR: PACE is a method that evaluates LLM creativity by generating parallel association chains, addressing data contamination issues and providing efficient assessment with strong correlation to human rankings.

Details

Motivation: Current LLM creativity evaluation faces challenges like data contamination and expensive human assessments, requiring a more reliable and efficient method.

Method: Propose PACE (Parallel Association Chains to Evaluate) where LLMs generate association chains to measure creativity, minimizing data contamination risk.

Result: Strong correlation with Chatbot Arena Creative Writing rankings (Spearman’s ρ=0.739, p<0.001); high-performing LLMs match average human performance but professionals outperform LLMs; humans show greater associative diversity.

Conclusion: PACE provides effective LLM creativity evaluation; LLMs approach average human creativity levels but still lag behind professional humans in associative creativity and diversity.

Abstract: The evaluation of LLMs’ creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman’s $\rho = 0.739$, $p < 0.001$) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, and humans demonstrating a greater diversity of associative patterns.

[26] Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation

Xin Zhao, Naoki Yoshinaga, Yuma Tsuta, Akiko Aizawa

Main category: cs.CL

TL;DR: This paper examines how LLMs acquire multilingual domain knowledge during domain adaptation, revealing challenges in cross-lingual transfer despite using high-quality bilingual corpora.

Details

Motivation: To understand the mechanisms of multilingual knowledge acquisition - how domain knowledge is learned within a language and transferred across languages - which remains underexplored and leads to suboptimal performance in low-resource settings.

Method: Proposed AdaXEval, an adaptive evaluation method that builds multiple-choice QA datasets from the same bilingual domain corpus used for training, and conducted continual training of LLMs with diverse data recipes to track knowledge acquisition.

Result: Experiments on a 13B English-Japanese bilingual LLM revealed that cross-lingual transfer remains challenging even with high-quality bilingual corpora.

Conclusion: The study provides insights into the transformation process from domain training data to knowledge and highlights the persistent difficulties in effective cross-lingual knowledge transfer during multilingual domain adaptation.

Abstract: Multilingual domain adaptation (ML-DA) is widely used to learn new domain knowledge across languages into large language models (LLMs). Although many methods have been proposed to improve domain adaptation, the mechanisms of multilingual knowledge acquisition, how domain knowledge is learned within a language and transferred across languages, remain underexplored. This gap leads to suboptimal performance, particularly in low-resource settings. This work examines the learning dynamics of LLMs during ML-DA. Because prior ML-DA studies often train and evaluate on datasets with mismatched knowledge coverage, we propose AdaXEval, an adaptive evaluation method that builds multiple-choice QA datasets from the same bilingual domain corpus used for training, thereby directly studying multilingual knowledge acquisition. Through continual training of LLMs with diverse data recipes, we track how LLMs acquire domain facts and pinpoint the mechanism behind the transformation process from domain training data to knowledge. Our experiments on a 13B English-Japanese bilingual LLM reveal that cross-lingual transfer remains challenging despite a high-quality bilingual corpus. The code has been released.

[27] Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou

Main category: cs.CL

TL;DR: This paper identifies a modality gap between speech and text inputs in Large Speech Language Models (LSLMs), analyzes representation alignment patterns, and proposes interventions to improve speech input performance.

Details

Motivation: End-to-end LSLMs show impressive conversational abilities but underperform traditional pipeline systems on semantic understanding benchmarks, revealing a significant performance gap between speech and text inputs that needs investigation.

Method: Systematic experimentation analyzing coarse- and fine-grained text/speech representations, measuring cosine similarity and Euclidean distance, identifying token-level alignment patterns, and developing targeted interventions through angle projection and length normalization.

Result: Found that speech and text representations align in direction but diverge in magnitude in deeper layers; representation similarity strongly correlates with modality gap; identified spontaneous token-level alignment; proposed Alignment Path Score metric; interventions showed potential to improve speech input correctness.

Conclusion: The study provides the first systematic empirical analysis of modality gap and alignment mechanisms in LSLMs, offering theoretical insights and methodological guidance for future optimization of speech-text alignment in large language models.

Abstract: End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.

[28] SafeMT: Multi-turn Safety for Multimodal Language Models

Han Zhu, Juntao Dai, Jiaming Ji, Haoran Li, Chengkun Cai, Pengcheng Wen, Chi-Min Chan, Boyuan Chen, Yaodong Yang, Sirui Han, Yike Guo

Main category: cs.CL

TL;DR: SafeMT is a benchmark for evaluating safety of multi-modal LLMs in multi-turn dialogues, showing that attack success rate increases with dialogue length, and proposing a dialogue safety moderator for better protection.

Details

Motivation: Existing benchmarks don't adequately address safety risks in multi-turn dialogues, which are more common in daily interactions and pose greater risks than single prompts.

Method: Introduces SafeMT benchmark with 10,000 samples across 17 scenarios and 4 jailbreak methods, plus a Safety Index (SI) metric and a dialogue safety moderator that detects malicious intent in conversations.

Result: Evaluation of 17 models shows attack success rate increases with dialogue length, indicating inadequate safety mechanisms for recognizing hazards in dialogue interactions.

Conclusion: The proposed dialogue safety moderator is more effective at reducing multi-turn attack success rates compared to existing guard models, addressing critical safety gaps in multi-modal LLMs.

Abstract: With the widespread use of multi-modal Large Language models (MLLMs), safety issues have become a growing concern. Multi-turn dialogues, which are more common in everyday interactions, pose a greater risk than single prompts; however, existing benchmarks do not adequately consider this situation. To encourage the community to focus on the safety issues of these models in multi-turn dialogues, we introduce SafeMT, a benchmark that features dialogues of varying lengths generated from harmful queries accompanied by images. This benchmark consists of 10,000 samples in total, encompassing 17 different scenarios and four jailbreak methods. Additionally, we propose Safety Index (SI) to evaluate the general safety of MLLMs during conversations. We assess the safety of 17 models using this benchmark and discover that the risk of successful attacks on these models increases as the number of turns in harmful dialogues rises. This observation indicates that the safety mechanisms of these models are inadequate for recognizing the hazard in dialogue interactions. We propose a dialogue safety moderator capable of detecting malicious intent concealed within conversations and providing MLLMs with relevant safety policies. Experimental results from several open-source models indicate that this moderator is more effective in reducing multi-turn ASR compared to existed guard models.

[29] Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models

Shihao Ji, Zihui Song, Jiajie Huang

Main category: cs.CL

TL;DR: The Credal Transformer addresses LLM hallucinations by replacing standard attention with Credal Attention Mechanism that produces sets of distributions to quantify uncertainty, reducing confident errors on unanswerable questions.

Details

Motivation: LLMs generate factually incorrect but confident assertions (hallucinations), which stems from the Transformer's Softmax function creating 'Artificial Certainty' by collapsing ambiguous attention scores into single distributions and discarding uncertainty information.

Method: Introduces Credal Transformer with Credal Attention Mechanism (CAM) based on evidential theory. CAM produces ‘credal sets’ (sets of distributions) instead of single attention vectors, with set size measuring uncertainty. Attention scores are re-conceptualized as evidence masses for Dirichlet distribution.

Result: Empirically identifies out-of-distribution inputs, quantifies ambiguity, and significantly reduces confident errors on unanswerable questions by abstaining when uncertain.

Conclusion: Provides a new architecture to mitigate hallucinations and a design paradigm that integrates uncertainty quantification directly into the model, offering foundation for more reliable AI.

Abstract: Large Language Models (LLMs) hallucinate, generating factually incorrect yet confident assertions. We argue this stems from the Transformer’s Softmax function, which creates “Artificial Certainty” by collapsing ambiguous attention scores into a single probability distribution, discarding uncertainty information at each layer. To fix this, we introduce the Credal Transformer, which replaces standard attention with a Credal Attention Mechanism (CAM) based on evidential theory. CAM produces a “credal set” (a set of distributions) instead of a single attention vector, with the set’s size directly measuring model uncertainty. We implement this by re-conceptualizing attention scores as evidence masses for a Dirichlet distribution: sufficient evidence recovers standard attention, while insufficient evidence yields a diffuse distribution, representing ambiguity. Empirically, the Credal Transformer identifies out-of-distribution inputs, quantifies ambiguity, and significantly reduces confident errors on unanswerable questions by abstaining. Our contribution is a new architecture to mitigate hallucinations and a design paradigm that integrates uncertainty quantification directly into the model, providing a foundation for more reliable AI.

[30] A Survey on Parallel Reasoning

Ziqi Wang, Boye Niu, Zipeng Gao, Zhi Zheng, Tong Xu, Linghui Meng, Zhongli Li, Jing Liu, Yilong Chen, Chen Zhu, Hua Wu, Haifeng Wang, Enhong Chen

Main category: cs.CL

TL;DR: This paper surveys parallel reasoning in Large Language Models, defining the concept, categorizing techniques, exploring applications, and identifying challenges for future research.

Details

Motivation: To address the fragility of standard sequential reasoning methods in LLMs and enhance reasoning robustness through parallel exploration of multiple lines of thought.

Method: The authors provide a formal definition of parallel reasoning, create a taxonomy of techniques (non-interactive, interactive, and efficiency-focused decoding), analyze application scenarios, and identify core challenges.

Result: A comprehensive survey that organizes the field of parallel reasoning, distinguishing it from related concepts like Chain-of-Thought, and provides a roadmap for beginners and researchers.

Conclusion: Parallel reasoning represents a significant trend for improving LLM robustness, and the survey highlights challenges and future directions to advance this emerging paradigm.

Abstract: With the increasing capabilities of Large Language Models (LLMs), parallel reasoning has emerged as a new inference paradigm that enhances reasoning robustness by concurrently exploring multiple lines of thought before converging on a final answer. It has become a significant trend to explore parallel reasoning to overcome the fragility of standard sequential methods and improve practical performance. In this paper, we aim to survey and summarize the progress and challenges of parallel reasoning. We first present a formal definition of parallel reasoning and clarify its distinction from related concepts like Chain-of-Thought. Then, we organize and discuss advanced techniques based on a novel taxonomy, including non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. Additionally, we explore various application scenarios, such as solving complex problems and enhancing the reliability of LLM outputs.Finally, we highlight the core challenges of parallel reasoning and suggest potential directions for future research. We hope that our work can provide a useful roadmap for beginners and encourage more research on improving parallel reasoning methods. Related source can be avaliable in https://github.com/PPPP-kaqiu/Awesome-Parallel-Reasoning.

[31] Towards Inference-time Scaling for Continuous Space Reasoning

Minghan Wang, Thuy-Trang Vu, Ehsan Shareghi, Gholamreza Haffari

Main category: cs.CL

TL;DR: This paper investigates adapting inference-time scaling techniques from discrete text reasoning to continuous space reasoning using COCONUT LM, finding potential performance gains but highlighting unique challenges in discriminating correct vs incorrect reasoning paths.

Details

Motivation: To determine if established inference-time scaling techniques (multiple sample generation with PRM/ORM re-ranking) that work well for text-based reasoning can be successfully adapted to continuous space reasoning.

Method: Used COCONUT continuous space reasoning LM as backbone, generated diverse reasoning paths through dropout-based sampling, and conducted Pass@N analysis. Probed geometric properties and trajectory dynamics to understand discrimination challenges.

Result: Demonstrated feasibility of generating diverse reasoning paths and potential for significant performance gains similar to discrete space. However, discrete space recipes for PRM/ORM training provided only marginal improvements in continuous space due to inability to effectively discriminate correct vs incorrect reasoning.

Conclusion: Current limitations stem from absence of key inductive biases in continuous thought representations. Training frameworks for continuous reasoning LMs need to explicitly incorporate inductive biases for effective discrimination during inference-time scaling.

Abstract: Inference-time scaling through multiple sample generation in combination with Process- or Outcome-Reward Model (PRM or ORM) re-ranking has proven effective for text-based reasoning in large language models. This paper investigates whether such established techniques can be successfully adapted to reasoning in the continuous space, using COCONUT (Hao et al. 2024) continuous space reasoning LM as the backbone. We demonstrate the feasibility of generating diverse reasoning paths through dropout-based sampling. Our Pass@N analysis on the generated samples reveals the potential that could enable a significant gain in performance akin to observed gain in the discrete space. However, we highlight unique challenges faced for materializing this gain in the continuous thought space. In particular, working recipes for data generation and training PRM and ORM models in the discrete space unlocks only marginal improvements in the continuous space. Through probing various aspects including geometric properties and trajectory dynamics we identify the underlying reasons that prevent effective discrimination between correct and incorrect reasoning (essential for the functioning of PRM and ORM). Our findings reveal that current limitations stem from the absence of key inductive biases in continuous thought representations. We argue that the training frameworks for continuous reasoning LMs require not only to optimize for accuracy but also to explicitly incorporate inductive biases that could be utilized during inference-time for discrimination of correct and incorrect thoughts.\footnote{Our code and data will be publicly available.}

[32] From Knowledge to Treatment: Large Language Model Assisted Biomedical Concept Representation for Drug Repurposing

Chengrui Xiang, Tengfei Ma, Xiangzheng Fu, Yiping Liu, Bosheng Song, Xiangxiang Zeng

Main category: cs.CL

TL;DR: LLaDR is a framework that uses large language models to enhance biomedical knowledge graphs for drug repurposing by incorporating treatment-relevant textual representations.

Details

Motivation: Existing drug repurposing methods overlook common-sense biomedical knowledge and mechanistic priors about drug-treatment incompatibilities in real-world labs.

Method: Extract semantically enriched treatment-related textual representations from LLMs and use them to fine-tune knowledge graph embedding models, injecting treatment-relevant knowledge into KGE.

Result: LLaDR achieves state-of-the-art performance across different scenarios and shows robustness in case studies on Alzheimer’s disease.

Conclusion: The framework significantly improves biomedical concept representation in knowledge graphs, enhancing semantic understanding of complex indications for drug repurposing.

Abstract: Drug repurposing plays a critical role in accelerating treatment discovery, especially for complex and rare diseases. Biomedical knowledge graphs (KGs), which encode rich clinical associations, have been widely adopted to support this task. However, existing methods largely overlook common-sense biomedical concept knowledge in real-world labs, such as mechanistic priors indicating that certain drugs are fundamentally incompatible with specific treatments. To address this gap, we propose LLaDR, a Large Language Model-assisted framework for Drug Repurposing, which improves the representation of biomedical concepts within KGs. Specifically, we extract semantically enriched treatment-related textual representations of biomedical entities from large language models (LLMs) and use them to fine-tune knowledge graph embedding (KGE) models. By injecting treatment-relevant knowledge into KGE, LLaDR largely improves the representation of biomedical concepts, enhancing semantic understanding of under-studied or complex indications. Experiments based on benchmarks demonstrate that LLaDR achieves state-of-the-art performance across different scenarios, with case studies on Alzheimer’s disease further confirming its robustness and effectiveness. Code is available at https://github.com/xiaomingaaa/LLaDR.

[33] DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

Zeyu Yang, Satoshi Nakamura

Main category: cs.CL

TL;DR: A segmentation framework using LLMs with Direct Preference Optimization (DPO) for simultaneous speech translation, outperforming SHAS in accuracy, translation quality, and latency across multiple language pairs.

Details

Motivation: Existing segmentation models like SHAS lack human preference alignment crucial for natural real-time interpretation, being constrained by supervised learning objectives.

Method: Proposed segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO), evaluated on ACL 60/60 corpus using SeamlessM4T v2 as translation backbone across three language pairs.

Result: DPO-tuned LLM achieves higher segmentation accuracy than SHAS, with consistent improvements in translation quality (BLEU, COMET) and latency (Average Lagging).

Conclusion: Preference-tuned LLMs have potential to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.

Abstract: Simultaneous speech translation requires accurate segmentation to balance translation quality and latency. Recent studies such as SHAS have introduced pretrained segmentation models, achieving stronger performance than heuristic rules. However, segmentation models such as SHAS, though pretrained and more robust than heuristic methods, are still constrained by supervised learning objectives and do not incorporate human preference alignment, which is crucial for natural real-time interpretation. In this work, we propose a segmentation framework based on large language models (LLMs) trained with Direct Preference Optimization (DPO). By leveraging preference alignment, our method enables LLMs to predict natural segmentation points that better meet the demands of real-time translation. We evaluate the system on the ACL 60/60 corpus across three language pairs (English-Japanese, Chinese, German), using SeamlessM4T v2 as the translation backbone. Experimental results show that our DPO-tuned LLM achieves higher segmentation accuracy than SHAS and yields consistent improvements in translation quality (BLEU, COMET) as well as latency (Average Lagging). Furthermore, our system benefits from IWSLT baselines for direct comparison. These findings highlight the potential of preference-tuned LLMs to surpass existing pretrained segmentation models and advance adaptive, human-aligned simultaneous interpretation.

[34] HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

Ali Mekky, Omar El Herraoui, Preslav Nakov, Yuxia Wang

Main category: cs.CL

TL;DR: HALF is a harm-aware LLM fairness framework that evaluates bias in realistic applications weighted by harm severity across three tiers of domains.

Details

Motivation: Existing LLM fairness evaluations lack grounding in real-world scenarios and don't account for differences in harm severity across different application domains.

Method: HALF organizes nine application domains into three severity tiers (Severe, Moderate, Mild) using a five-stage pipeline to assess model bias in realistic applications weighted by harm severity.

Result: Evaluation across eight LLMs shows: (1) inconsistent fairness across domains, (2) model size/performance don’t guarantee fairness, (3) reasoning models perform better in medical decision support but worse in education.

Conclusion: HALF exposes a clear gap between previous benchmarking success and deployment readiness, highlighting the need for harm-aware fairness evaluation.

Abstract: Large language models (LLMs) are increasingly deployed across high-impact domains, from clinical decision support and legal analysis to hiring and education, making fairness and bias evaluation before deployment critical. However, existing evaluations lack grounding in real-world scenarios and do not account for differences in harm severity, e.g., a biased decision in surgery should not be weighed the same as a stylistic bias in text summarization. To address this gap, we introduce HALF (Harm-Aware LLM Fairness), a deployment-aligned framework that assesses model bias in realistic applications and weighs the outcomes by harm severity. HALF organizes nine application domains into three tiers (Severe, Moderate, Mild) using a five-stage pipeline. Our evaluation results across eight LLMs show that (1) LLMs are not consistently fair across domains, (2) model size or performance do not guarantee fairness, and (3) reasoning models perform better in medical decision support but worse in education. We conclude that HALF exposes a clear gap between previous benchmarking success and deployment readiness.

[35] Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

Bianca Raimondi, Daniela Dalbagno, Maurizio Gabbrielli

Main category: cs.CL

TL;DR: The Knobe effect moral bias emerges in finetuned LLMs and can be localized to specific layers, where patching activations from pretrained models eliminates the bias without retraining.

Details

Motivation: To understand how human-like biases manifest in LLMs and investigate whether the Knobe effect (moral bias in intentionality judgements) emerges during finetuning.

Method: Conducted Layer-Patching analysis across 3 open-weights LLMs to trace where the bias manifests and test if patching activations from pretrained models into critical layers can eliminate the effect.

Result: The bias is learned during finetuning and localized in specific layers. Patching activations from pretrained models into just a few critical layers successfully eliminates the Knobe effect.

Conclusion: Social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions without requiring model retraining.

Abstract: Large language models (LLMs) have been shown to internalize human-like biases during finetuning, yet the mechanisms by which these biases manifest remain unclear. In this work, we investigated whether the well-known Knobe effect, a moral bias in intentionality judgements, emerges in finetuned LLMs and whether it can be traced back to specific components of the model. We conducted a Layer-Patching analysis across 3 open-weights LLMs and demonstrated that the bias is not only learned during finetuning but also localized in a specific set of layers. Surprisingly, we found that patching activations from the corresponding pretrained model into just a few critical layers is sufficient to eliminate the effect. Our findings offer new evidence that social biases in LLMs can be interpreted, localized, and mitigated through targeted interventions, without the need for model retraining.

[36] DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering

Jiakai Li, Rongzheng Wang, Yizhuo Ma, Shuang Liang, Guangchun Luo, Ke Qin

Main category: cs.CL

TL;DR: DSAS is a plug-and-play solution that addresses LLMs’ limitations in multi-document QA by using two modules: CGW to solve the ’lost-in-the-middle’ problem and RAS to improve long-range dependency modeling, achieving 4.2% average F1-score improvement without architectural changes.

Details

Motivation: LLMs struggle with multi-document QA due to two key limitations: difficulty modeling long-range dependencies and the 'lost-in-the-middle' issue where they fail to process information in the middle of long inputs. Current solutions are either limited or require costly fine-tuning.

Method: Proposes Dual-Stage Adaptive Sharpening (DSAS) with two modules: (1) Contextual Gate Weighting (CGW) assesses paragraph relevance through layer-wise attention tracking and position-aware weighting to address ’lost-in-the-middle’; (2) Reciprocal Attention Suppression (RAS) suppresses information exchange between key and irrelevant texts to enhance focus on critical paragraphs.

Result: Extensive experiments on four benchmarks show DSAS improves performance across mainstream LLMs (Llama, Qwen, Mistral, Deepseek), with average 4.2% F1-score improvement on Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct in Multi-doc QA tasks. Ablation studies confirm both modules are essential.

Conclusion: DSAS provides an effective plug-and-play solution for multi-document QA that addresses key LLM limitations without requiring architectural modifications or extra training parameters, demonstrating robustness and scalability across different models.

Abstract: While large language models (LLMs) show considerable promise across various fields, they have notable limitations in handling multi-document question answering (Multi-doc QA) tasks. The first challenge is long-range dependency modeling, where LLMs struggle to focus on key information in long texts, which weakens important semantic connections. Second, most LLMs suffer from the ‘’lost-in-the-middle’’ issue, where they have difficulty processing information in the middle of long inputs. Current solutions either truncate global dependencies or demand costly finetuning, ultimately lacking a universal and simple solution for these challenges. To resolve these limitations, we propose Dual-Stage Adaptive Sharpening (DSAS) containing two modules. (i) The Contextual Gate Weighting (CGW) module alleviates ‘’lost-in-the-middle’’ by assessing paragraph relevance through layer-wise attention tracking and position-aware weighting. (ii) The Reciprocal Attention Suppression (RAS) module enhances focus on critical paragraphs by suppressing information exchange between key and irrelevant texts, thus mitigating the limitations in long-range dependency modeling. Notably, DSAS functions as a plug-and-play solution requiring no architectural modifications or extra training parameters. Extensive experiments on four benchmarks demonstrate DSAS’s efficacy across mainstream LLMs (Llama, Qwen, Mistral, and Deepseek), with an average F1-score improvement of 4.2% in Multi-doc QA tasks on Llama-3.1-8B-Instruct and Qwen2.5-14B-Instruct. Ablation studies confirm the essential contributions of both the CGW and RAS modules. In addition, detailed discussions in the Appendix further validate the robustness and scalability of DSAS.

[37] Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs

Blazej Manczak, Eric Lin, Francisco Eiras, James O’ Neill, Vaikkunth Mugunthan

Main category: cs.CL

TL;DR: The paper introduces MedQA-Followup, a framework for evaluating multi-turn robustness in medical LLMs, revealing severe vulnerabilities in multi-turn interactions where accuracy can drop dramatically, particularly with indirect context-based interventions.

Details

Motivation: Existing evaluation frameworks for medical LLMs focus on single-turn question answering under idealized conditions, overlooking the complexities of real medical consultations involving conflicting input, misleading context, and authority influence.

Method: Developed MedQA-Followup framework that distinguishes between shallow robustness (resisting misleading initial context) and deep robustness (maintaining accuracy when answers are challenged across turns), with an indirect-direct axis separating contextual framing from explicit suggestions. Used controlled interventions on the MedQA dataset to evaluate five state-of-the-art LLMs.

Result: Models perform reasonably well under shallow perturbations but exhibit severe vulnerabilities in multi-turn settings, with accuracy dropping from 91.2% to as low as 13.5% for Claude Sonnet 4. Counterintuitively, indirect context-based interventions are often more harmful than direct suggestions, causing larger accuracy drops across models.

Conclusion: Multi-turn robustness is a critical but underexplored dimension for safe and reliable deployment of medical LLMs, highlighting significant vulnerabilities that need to be addressed before clinical deployment.

Abstract: Large language models (LLMs) are rapidly transitioning into medical clinical use, yet their reliability under realistic, multi-turn interactions remains poorly understood. Existing evaluation frameworks typically assess single-turn question answering under idealized conditions, overlooking the complexities of medical consultations where conflicting input, misleading context, and authority influence are common. We introduce MedQA-Followup, a framework for systematically evaluating multi-turn robustness in medical question answering. Our approach distinguishes between shallow robustness (resisting misleading initial context) and deep robustness (maintaining accuracy when answers are challenged across turns), while also introducing an indirect-direct axis that separates contextual framing (indirect) from explicit suggestion (direct). Using controlled interventions on the MedQA dataset, we evaluate five state-of-the-art LLMs and find that while models perform reasonably well under shallow perturbations, they exhibit severe vulnerabilities in multi-turn settings, with accuracy dropping from 91.2% to as low as 13.5% for Claude Sonnet 4. Counterintuitively, indirect, context-based interventions are often more harmful than direct suggestions, yielding larger accuracy drops across models and exposing a significant vulnerability for clinical deployment. Further compounding analyses reveal model differences, with some showing additional performance drops under repeated interventions while others partially recovering or even improving. These findings highlight multi-turn robustness as a critical but underexplored dimension for safe and reliable deployment of medical LLMs.

[38] Chinese ModernBERT with Whole-Word Masking

Zeyu Zhao, Ningtao Wang, Xing Fu, Yu Cheng

Main category: cs.CL

TL;DR: Chinese ModernBERT is a new encoder-only Transformer designed specifically for Chinese language processing, featuring optimized vocabulary, training techniques, and architecture to achieve competitive performance while maintaining efficiency.

Details

Motivation: Existing encoder improvements haven't fully transferred to Chinese due to differences in tokenization and morphology from English, creating a need for specialized Chinese encoders.

Method: Uses hardware-aware 32k BPE vocabulary, whole-word masking with dynamic curriculum, two-stage pre-training with extended context (1,024 to 8,192 tokens), and damped-cosine learning rate schedule.

Result: Competitive with strong Chinese encoders on CLUE, achieves high long-sequence throughput while maintaining short-sequence speed, and surpasses Qwen-0.6B-embedding on SimCLUE with additional contrastive data.

Conclusion: Chinese ModernBERT provides an effective scaling path for Chinese language processing with specialized design choices that address Chinese-specific challenges.

Abstract: Encoder-only Transformers have advanced along three axes – architecture, data, and systems – yielding Pareto gains in accuracy, speed, and memory efficiency. Yet these improvements have not fully transferred to Chinese, where tokenization and morphology differ markedly from English. We introduce Chinese ModernBERT, a from-scratch Chinese encoder that couples: (i) a hardware-aware 32k BPE vocabulary tailored to frequent Chinese affixes/compounds, lowering the embedding budget; (ii) whole-word masking (WWM) with a dynamic masking curriculum (30% -> 15%) to align task difficulty with training progress; (iii) a two-stage pre-training pipeline that extends the native context from 1,024 to 8,192 tokens using RoPE and alternating local/global attention; and (iv) a damped-cosine learning-rate schedule for stable long-horizon optimization. We pre-train on ~1.2T Chinese tokens from CCI3-HQ, CCI4 (Chinese), and Cosmopedia-Chinese. On CLUE, Chinese ModernBERT is competitive with strong Chinese encoders under a unified fine-tuning protocol. Under bf16 it achieves high long-sequence throughput while maintaining strong short-sequence speed, reflecting benefits from budget allocation and attention design. To probe retrieval-oriented quality, we add a small amount of open contrastive data: fine-tuning on SimCLUE (~3M pairs) improves further when adding T2Ranking (~2M), reaching 0.505 (Pearson) / 0.537 (Spearman) on the SimCLUE test set. Under this open-data setting, Chinese ModernBERT surpasses Qwen-0.6B-embedding on SimCLUE, suggesting a clear scaling path for STS with additional curated pairs. We will release tokenizer and weights to facilitate reproducible research.

[39] A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

Cameron Morin, Matti Marttinen Larsson

Main category: cs.CL

TL;DR: An unsupervised pipeline using LLMs for automated grammatical annotation in large corpora, achieving 98%+ accuracy on 143,933 sentences from COHA in under 60 hours.

Details

Motivation: Manual annotation is a bottleneck in corpus linguistics as corpora expand rapidly, creating need for scalable automated solutions.

Method: Four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing using GPT-5 through OpenAI API, and post-hoc validation.

Result: Successfully annotated 143,933 sentences from COHA with 98%+ accuracy on sophisticated annotation procedures in under 60 hours.

Conclusion: LLMs can perform data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus research, though requiring attention to costs, licensing, and ethical considerations.

Abstract: As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable, unsupervised pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline’s accessibility and effectiveness through a diachronic case study of variation in the English consider construction. Using GPT-5 through the OpenAI API, we annotate 143,933 sentences from the Corpus of Historical American English (COHA) in under 60 hours, achieving 98%+ accuracy on two sophisticated annotation procedures. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, opening new possibilities for corpus-based research, though implementation requires attention to costs, licensing, and other ethical considerations.

[40] Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation

Greta Damo, Elena Cabrio, Serena Villata

Main category: cs.CL

TL;DR: A novel framework for counter-speech generation that treats it as knowledge-wise text generation using RAG pipelines, outperforming standard LLM baselines.

Details

Motivation: Existing counter-speech generation approaches have limitations in reliability, coherence, and scalability - either relying on unreliable LLMs or non-scalable NGO experts.

Method: Integrates RAG pipelines with a knowledge base of 32,792 texts from UN Digital Library, EUR-Lex, and EU Agency for Fundamental Rights to generate trustworthy counter-speech for 8 target groups.

Result: Outperforms standard LLM baselines and competitive approaches on both automated metrics (JudgeLM) and human evaluation using MultiTarget-CONAN dataset.

Conclusion: The framework enables trustworthy and sound counter-speech generation for hate speech and beyond, providing a scalable solution with verified knowledge sources.

Abstract: Counter-speech generation is at the core of many expert activities, such as fact-checking and hate speech, to counter harmful content. Yet, existing work treats counter-speech generation as pure text generation task, mainly based on Large Language Models or NGO experts. These approaches show severe drawbacks due to the limited reliability and coherence in the generated countering text, and in scalability, respectively. To close this gap, we introduce a novel framework to model counter-speech generation as knowledge-wise text generation process. Our framework integrates advanced Retrieval-Augmented Generation (RAG) pipelines to ensure the generation of trustworthy counter-speech for 8 main target groups identified in the hate speech literature, including women, people of colour, persons with disabilities, migrants, Muslims, Jews, LGBT persons, and other. We built a knowledge base over the United Nations Digital Library, EUR-Lex and the EU Agency for Fundamental Rights, comprising a total of 32,792 texts. We use the MultiTarget-CONAN dataset to empirically assess the quality of the generated counter-speech, both through standard metrics (i.e., JudgeLM) and a human evaluation. Results show that our framework outperforms standard LLM baselines and competitive approach, on both assessments. The resulting framework and the knowledge base pave the way for studying trustworthy and sound counter-speech generation, in hate speech and beyond.

[41] Fine-grained Analysis of Brain-LLM Alignment through Input Attribution

Michela Proietti, Roberto Capobianco, Mariya Toneva

Main category: cs.CL

TL;DR: A fine-grained input attribution method identifies distinct word subsets for brain-LLM alignment vs next-word prediction, revealing different feature reliance patterns.

Details

Motivation: To understand the computational principles of language processing by examining alignment between LLMs and human brain activity, specifically addressing the relationship between brain alignment and next-word prediction.

Method: Introduced a fine-grained input attribution method to identify specific words most important for brain-LLM alignment, and applied it to study the relationship between brain alignment and next-word prediction.

Result: Brain alignment and next-word prediction rely on largely distinct word subsets: NWP shows recency and primacy biases with syntactic focus, while BA prioritizes semantic and discourse-level information with targeted recency effect.

Conclusion: The work advances understanding of how LLMs relate to human language processing and highlights differences in feature reliance between brain alignment and next-word prediction, with the attribution method being broadly applicable to explore cognitive relevance in diverse language tasks.

Abstract: Understanding the alignment between large language models (LLMs) and human brain activity can reveal computational principles underlying language processing. We introduce a fine-grained input attribution method to identify the specific words most important for brain-LLM alignment, and leverage it to study a contentious research question about brain-LLM alignment: the relationship between brain alignment (BA) and next-word prediction (NWP). Our findings reveal that BA and NWP rely on largely distinct word subsets: NWP exhibits recency and primacy biases with a focus on syntax, while BA prioritizes semantic and discourse-level information with a more targeted recency effect. This work advances our understanding of how LLMs relate to human language processing and highlights differences in feature reliance between BA and NWP. Beyond this study, our attribution method can be broadly applied to explore the cognitive relevance of model predictions in diverse language processing tasks.

[42] MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

Yushu Zhao, Yubin Qin, Yang Wang, Xiaolong Yang, Huiming Han, Shaojun Wei, Yang Hu, Shouyi Yin

Main category: cs.CL

TL;DR: MoBiLE is a plug-and-play offloading-based MoE inference framework that uses mixture of big-little experts to accelerate inference while maintaining model quality.

Details

Motivation: Existing MoE offloading approaches are constrained by CPU-GPU interconnect bandwidth limitations and prefetching methods suffer from training overhead and reduced effectiveness on modern fine-grained MoE models.

Method: MoBiLE reduces experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens, with a dedicated fallback and prefetching mechanism for switching between little and big experts.

Result: MoBiLE achieves 1.60x to 1.72x speedup compared to baseline on consumer GPU system with negligible accuracy degradation, evaluated on four modern MoE architectures and challenging generative tasks.

Conclusion: MoBiLE effectively addresses the CPU-GPU bandwidth bottleneck in MoE inference through its mixture of big-little experts approach, providing significant speedup without compromising model quality.

Abstract: Mixture-of-Experts (MoE) models have recently demonstrated exceptional performance across a diverse range of applications. The principle of sparse activation in MoE models facilitates an offloading strategy, wherein active experts are maintained in GPU HBM, while inactive experts are stored in CPU DRAM. The efficacy of this approach, however, is fundamentally constrained by the limited bandwidth of the CPU-GPU interconnect. To mitigate this bottleneck, existing approaches have employed prefetching to accelerate MoE inference. These methods attempt to predict and prefetch the required experts using specially trained modules. Nevertheless, such techniques are often encumbered by significant training overhead and have shown diminished effectiveness on recent MoE models with fine-grained expert segmentation. In this paper, we propose MoBiLE, a plug-and-play offloading-based MoE inference framework with \textit{mixture of big-little experts}. It reduces the number of experts for unimportant tokens to half for acceleration while maintaining full experts for important tokens to guarantee model quality. Further, a dedicated fallback and prefetching mechanism is designed for switching between little and big experts to improve memory efficiency. We evaluate MoBiLE on four typical modern MoE architectures and challenging generative tasks. Our results show that MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.

[43] LLM-REVal: Can We Trust LLM Reviewers Yet?

Rui Li, Jia-Chen Gu, Po-Nien Kung, Heming Xia, Junfeng liu, Xiangwen Kong, Zhifang Sui, Nanyun Peng

Main category: cs.CL

TL;DR: LLM integration in academic workflows creates risks: LLM reviewers systematically favor LLM-authored papers and underrate human-authored critical papers due to linguistic bias and aversion to critical statements, raising fairness concerns.

Details

Motivation: To examine the risks of deep LLM integration in both peer-review and research processes, particularly how LLM reviewers may influence scholarly fairness through systematic biases.

Method: Simulation with research agent generating/revising papers and review agent assessing submissions, followed by human annotations to compare LLM-based reviews with human judgments.

Result: LLM reviewers inflate scores for LLM-authored papers and underrate human-authored papers with critical statements, revealing linguistic feature bias and aversion to critical statements.

Conclusion: LLM deployment in peer review poses equity risks for human authors, but LLM-guided revisions can improve paper quality, suggesting potential benefits for early-stage researchers and low-quality papers.

Abstract: The rapid advancement of large language models (LLMs) has inspired researchers to integrate them extensively into the academic workflow, potentially reshaping how research is practiced and reviewed. While previous studies highlight the potential of LLMs in supporting research and peer review, their dual roles in the academic workflow and the complex interplay between research and review bring new risks that remain largely underexplored. In this study, we focus on how the deep integration of LLMs into both peer-review and research processes may influence scholarly fairness, examining the potential risks of using LLMs as reviewers by simulation. This simulation incorporates a research agent, which generates papers and revises, alongside a review agent, which assesses the submissions. Based on the simulation results, we conduct human annotations and identify pronounced misalignment between LLM-based reviews and human judgments: (1) LLM reviewers systematically inflate scores for LLM-authored papers, assigning them markedly higher scores than human-authored ones; (2) LLM reviewers persistently underrate human-authored papers with critical statements (e.g., risk, fairness), even after multiple revisions. Our analysis reveals that these stem from two primary biases in LLM reviewers: a linguistic feature bias favoring LLM-generated writing styles, and an aversion toward critical statements. These results highlight the risks and equity concerns posed to human authors and academic research if LLMs are deployed in the peer review cycle without adequate caution. On the other hand, revisions guided by LLM reviews yield quality gains in both LLM-based and human evaluations, illustrating the potential of the LLMs-as-reviewers for early-stage researchers and enhancing low-quality papers.

[44] Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Hailay Kidu Teklehaymanot, Wolfgang Nejdl

Main category: cs.CL

TL;DR: Large-scale study reveals systematic tokenization disparities across 200+ languages, showing Latin-script languages have higher efficiency while non-Latin and morphologically complex languages face 3-5x higher computational costs.

Details

Motivation: To address tokenization disparities that create barriers to equitable AI access across linguistically diverse populations by systematically quantifying computational inequities in LLMs.

Method: Standardized cross-linguistic evaluation using consistent preprocessing/normalization protocols and tiktoken library across 200+ languages, with metrics like Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC) benchmarked against English.

Result: Substantial systematic disparities: Latin-script languages show higher tokenization efficiency, while non-Latin and morphologically complex languages have 3-5x higher RTC ratios, leading to increased computational costs and reduced context utilization.

Conclusion: Current AI systems have structural inequities disadvantaging speakers of low-resource and non-Latin languages. Future work should prioritize linguistically informed tokenization strategies and adaptive vocabulary methods for more inclusive multilingual AI.

Abstract: Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often 3-5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.

[45] PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

Xiangjun Zai, Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Wenjie Zhang

Main category: cs.CL

TL;DR: PRoH is a dynamic framework for planning and reasoning over knowledge hypergraphs that addresses limitations in static retrieval planning, non-adaptive execution, and superficial use of KH structure in existing methods.

Details

Motivation: Existing knowledge hypergraph-based RAG methods suffer from static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, limiting their effectiveness in multi-hop question answering.

Method: PRoH incorporates three innovations: (1) context-aware planning module for structurally grounded reasoning plans, (2) structured question decomposition as dynamically evolving DAG for adaptive multi-trajectory exploration, and (3) Entity-Weighted Overlap-guided reasoning path retrieval algorithm for semantically coherent hyperedge traversals.

Result: PRoH achieves state-of-the-art performance, surpassing prior SOTA model HyperGraphRAG by average 19.73% in F1 and 8.41% in Generation Evaluation score, while maintaining strong robustness in long-range multi-hop reasoning tasks.

Conclusion: The proposed PRoH framework effectively overcomes limitations of existing KH-based RAG methods through dynamic planning and structured reasoning, demonstrating superior performance in multi-hop question answering across multiple domains.

Abstract: Knowledge Hypergraphs (KHs) have recently emerged as a knowledge representation for retrieval-augmented generation (RAG), offering a paradigm to model multi-entity relations into a structured form. However, existing KH-based RAG methods suffer from three major limitations: static retrieval planning, non-adaptive retrieval execution, and superficial use of KH structure and semantics, which constrain their ability to perform effective multi-hop question answering. To overcome these limitations, we propose PRoH, a dynamic Planning and Reasoning over Knowledge Hypergraphs framework. PRoH incorporates three core innovations: (i) a context-aware planning module that sketches the local KH neighborhood to guide structurally grounded reasoning plan generation; (ii) a structured question decomposition process that organizes subquestions as a dynamically evolving Directed Acyclic Graph (DAG) to enable adaptive, multi-trajectory exploration; and (iii) an Entity-Weighted Overlap (EWO)-guided reasoning path retrieval algorithm that prioritizes semantically coherent hyperedge traversals. Experiments across multiple domains demonstrate that PRoH achieves state-of-the-art performance, surpassing the prior SOTA model HyperGraphRAG by an average of 19.73% in F1 and 8.41% in Generation Evaluation (G-E) score, while maintaining strong robustness in long-range multi-hop reasoning tasks.

[46] Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation

Linfeng Gao, Baolong Bi, Zheng Yuan, Le Wang, Zerui Chen, Zhimin Wei, Shenghua Liu, Qinggang Zhang, Jinsong Su

Main category: cs.CL

TL;DR: CLEAR is a framework that improves contextual faithfulness in Retrieval-Augmented Generation by localizing knowledge conflicts through hidden-state probing and conflict-aware fine-tuning.

Details

Motivation: Existing RAG systems suffer from unfaithfulness where model responses contradict retrieved evidence, and current approaches treat LLMs as black boxes without understanding how they internally integrate retrieved evidence with parametric memory during knowledge conflicts.

Method: Propose CLEAR framework that: (i) decomposes context into sentence-level knowledge, (ii) uses hidden-state probing to localize conflicting knowledge, and (iii) employs conflict-aware fine-tuning to guide accurate evidence integration.

Result: Extensive experiments across three benchmarks show CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions.

Conclusion: CLEAR effectively addresses the unfaithfulness problem in RAG systems by understanding and leveraging the internal knowledge integration mechanisms of LLMs through fine-grained conflict localization and targeted fine-tuning.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm to enhance the factuality of Large Language Models (LLMs). However, existing RAG systems often suffer from an unfaithfulness issue, where the model’s response contradicts evidence from the retrieved context. Existing approaches to improving contextual faithfulness largely rely on external interventions, such as prompt engineering, decoding constraints, or reward-based fine-tuning. These works treat the LLM as a black box and overlook a crucial question: how does the LLM internally integrate retrieved evidence with its parametric memory, particularly under knowledge conflicts? To address this gap, we conduct a probing-based analysis of hidden-state representations in LLMs and observe three findings: knowledge integration occurs hierarchically, conflicts manifest as latent signals at the sentence level, and irrelevant context is often amplified when aligned with parametric knowledge. Building on these findings, we propose CLEAR (Conflict-Localized and Enhanced Attention for RAG), a framework that (i) decomposes context into fine-grained sentence-level knowledge, (ii) employs hidden-state probing to localize conflicting knowledge, and (iii) introduces conflict-aware fine-tuning to guide the model to accurately integrate retrieved evidence. Extensive experiments across three benchmarks demonstrate that CLEAR substantially improves both accuracy and contextual faithfulness, consistently outperforming strong baselines under diverse conflict conditions. The related resources are available at https://github.com/LinfengGao/CLEAR.

Nikoleta Pantelidou, Evelina Leivada, Paolo Morosi

Main category: cs.CL

TL;DR: LLMs show human-like accuracy in morphological generalization with novel words, but performance is driven more by data availability than linguistic complexity.

Details

Motivation: To investigate whether LLMs' morphological generalization abilities approximate human competence and whether performance is shaped by linguistic complexity or training data quantity.

Method: Used a multilingual Wug Test adaptation to test six models across four languages (Catalan, English, Greek, Spanish) and compared with human speakers.

Result: Models generalized morphological processes to unseen words with human-like accuracy, but accuracy patterns aligned more with community size and data availability than structural complexity.

Conclusion: Model behavior is mainly driven by linguistic resource richness rather than grammatical complexity sensitivity, representing only superficial resemblance to human linguistic competence.

Abstract: The linguistic abilities of Large Language Models are a matter of ongoing debate. This study contributes to this discussion by investigating model performance in a morphological generalization task that involves novel words. Using a multilingual adaptation of the Wug Test, six models were tested across four partially unrelated languages (Catalan, English, Greek, and Spanish) and compared with human speakers. The aim is to determine whether model accuracy approximates human competence and whether it is shaped primarily by linguistic complexity or by the quantity of available training data. Consistent with previous research, the results show that the models are able to generalize morphological processes to unseen words with human-like accuracy. However, accuracy patterns align more closely with community size and data availability than with structural complexity, refining earlier claims in the literature. In particular, languages with larger speaker communities and stronger digital representation, such as Spanish and English, revealed higher accuracy than less-resourced ones like Catalan and Greek. Overall, our findings suggest that model behavior is mainly driven by the richness of linguistic resources rather than by sensitivity to grammatical complexity, reflecting a form of performance that resembles human linguistic competence only superficially.

[48] SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

Biao Zhang, Lixin Chen, Tong Liu, Bo Zheng

Main category: cs.CL

TL;DR: SMEC is a training framework that compresses high-dimensional LLM embeddings through sequential matryoshka representation learning, adaptive dimension selection, and cross-batch memory to reduce computational complexity while maintaining performance.

Details

Motivation: High-dimensional LLM embeddings increase computational complexity and storage requirements, hindering practical deployment of language models.

Method: Proposes Sequential Matryoshka Embedding Compression (SMEC) with three components: Sequential Matryoshka Representation Learning (SMRL) to reduce gradient variance, Adaptive Dimension Selection (ADS) to minimize information loss during dimension pruning, and Selectable Cross-batch Memory (S-XBM) to enhance unsupervised learning between different dimensional embeddings.

Result: SMEC achieves significant dimensionality reduction while maintaining performance. On BEIR dataset, it improves compressed LLM2Vec embeddings (256 dimensions) by 1.1 points over Matryoshka-Adaptor and 2.7 points over Search-Adaptor models.

Conclusion: The SMEC framework effectively addresses the computational challenges of high-dimensional LLM embeddings through a comprehensive approach that combines sequential training, adaptive dimension selection, and cross-batch learning, enabling practical deployment of compressed embeddings without performance degradation.

Abstract: Large language models (LLMs) generate high-dimensional embeddings that capture rich semantic and syntactic information. However, high-dimensional embeddings exacerbate computational complexity and storage requirements, thereby hindering practical deployment. To address these challenges, we propose a novel training framework named Sequential Matryoshka Embedding Compression (SMEC). This framework introduces the Sequential Matryoshka Representation Learning(SMRL) method to mitigate gradient variance during training, the Adaptive Dimension Selection (ADS) module to reduce information degradation during dimension pruning, and the Selectable Cross-batch Memory (S-XBM) module to enhance unsupervised learning between high- and low-dimensional embeddings. Experiments on image, text, and multimodal datasets demonstrate that SMEC achieves significant dimensionality reduction while maintaining performance. For instance, on the BEIR dataset, our approach improves the performance of compressed LLM2Vec embeddings (256 dimensions) by 1.1 points and 2.7 points compared to the Matryoshka-Adaptor and Search-Adaptor models, respectively.

[49] When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

Lang Gao, Xuhui Li, Chenxi Wang, Mingzhe Li, Wei Liu, Zirui Song, Jinghui Zhang, Rui Yan, Preslav Nakov, Xiuying Chen

Main category: cs.CL

TL;DR: This paper introduces the first benchmark for personalized machine-generated text detection, identifies performance gaps in existing detectors due to feature-inversion trap, and proposes a method to predict detector performance changes in personalized settings.

Details

Motivation: Large language models can now generate text that imitates personal style, creating risks of identity impersonation, but no prior work has examined personalized machine-generated text detection.

Method: Built a benchmark dataset from literary and blog texts with LLM-generated imitations, identified feature-inversion trap where discriminative features become misleading in personalized text, and proposed a method to identify latent directions of inverted features and construct probe datasets.

Result: Experimental results show large performance gaps across detectors in personalized settings, with some state-of-the-art models suffering significant drops. The proposed method can accurately predict both direction and magnitude of performance changes with 85% correlation.

Conclusion: This work highlights the limitations of current detectors in personalized settings and provides a reliable way to predict performance changes, encouraging further research on personalized text detection.

Abstract: Large language models (LLMs) have grown more powerful in language generation, producing fluent text and even imitating personal style. Yet, this ability also heightens the risk of identity impersonation. To the best of our knowledge, no prior work has examined personalized machine-generated text (MGT) detection. In this paper, we introduce \dataset, the first benchmark for evaluating detector robustness in personalized settings, built from literary and blog texts paired with their LLM-generated imitations. Our experimental results demonstrate large performance gaps across detectors in personalized settings: some state-of-the-art models suffer significant drops. We attribute this limitation to the \textit{feature-inversion trap}, where features that are discriminative in general domains become inverted and misleading when applied to personalized text. Based on this finding, we propose \method, a simple and reliable way to predict detector performance changes in personalized settings. \method identifies latent directions corresponding to inverted features and constructs probe datasets that differ primarily along these features to evaluate detector dependence. Our experiments show that \method can accurately predict both the direction and the magnitude of post-transfer changes, showing 85% correlation with the actual performance gaps. We hope that this work will encourage further research on personalized text detection.

[50] BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)

Tomas Ruiz, Siyao Peng, Barbara Plank, Carsten Schwemmer

Main category: cs.CL

TL;DR: Test-time scaling techniques were applied to LeWiDi-2025 tasks for evaluating annotation disagreements, showing that benchmark methods (Model Averaging and Majority Voting) improve performance but Best-of-N sampling does not transfer effectively from mathematics domains.

Details

Motivation: To extend test-time scaling techniques beyond domains with verifiably correct answers (like mathematics and coding) to more subjective tasks involving annotation disagreements in LeWiDi-2025.

Method: Three test-time scaling methods were tested: Model Averaging, Majority Voting, and Best-of-N sampling on LeWiDi-2025 tasks for evaluating annotation disagreements.

Result: The two benchmark methods (Model Averaging and Majority Voting) consistently improved LLM performance on LeWiDi tasks, but the Best-of-N method did not show similar improvements.

Conclusion: Best-of-N sampling method does not currently transfer effectively from mathematics domains to LeWiDi tasks, suggesting domain-specific limitations for certain test-time scaling techniques.

Abstract: Test-time scaling is a family of techniques to improve LLM outputs at inference time by performing extra computation. To the best of our knowledge, test-time scaling has been limited to domains with verifiably correct answers, like mathematics and coding. We transfer test-time scaling to the LeWiDi-2025 tasks to evaluate annotation disagreements. We experiment with three test-time scaling methods: two benchmark algorithms (Model Averaging and Majority Voting), and a Best-of-N sampling method. The two benchmark methods improve LLM performance consistently on the LeWiDi tasks, but the Best-of-N method does not. Our experiments suggest that the Best-of-N method does not currently transfer from mathematics to LeWiDi tasks, and we analyze potential reasons for this gap.

[51] VISaGE: Understanding Visual Generics and Exceptions

Stella Frank, Emily Allaway

Main category: cs.CL

TL;DR: VLMs face tension between pragmatic priors (assuming congruent inputs) and semantic priors (general category knowledge) when evaluating atypical instances, with pragmatic priors dominating.

Details

Motivation: To understand how Vision Language Models trade off between pragmatic priors (congruent input assumption) and semantic priors (general category knowledge) when dealing with atypical instances.

Method: Created VISaGE dataset with typical and exceptional images, conducted carefully balanced experiments to analyze VLM behavior with incongruent inputs.

Result: Conceptual understanding degrades when pragmatic prior assumption of congruency is violated with incongruent images, and this effect is stronger than semantic prior influence.

Conclusion: Pragmatic priors dominate over semantic priors in VLMs when querying individual instances with incongruent inputs, highlighting limitations in handling atypical cases.

Abstract: While Vision Language Models (VLMs) learn conceptual representations, in the form of generalized knowledge, during training, they are typically used to analyze individual instances. When evaluation instances are atypical, this paradigm results in tension between two priors in the model. The first is a pragmatic prior that the textual and visual input are both relevant, arising from VLM finetuning on congruent inputs; the second is a semantic prior that the conceptual representation is generally true for instances of the category. In order to understand how VLMs trade off these priors, we introduce a new evaluation dataset, VISaGE, consisting of both typical and exceptional images. In carefully balanced experiments, we show that conceptual understanding degrades when the assumption of congruency underlying the pragmatic prior is violated with incongruent images. This effect is stronger than the effect of the semantic prior when querying about individual instances.

[52] Teaching Language Models to Faithfully Express their Uncertainty

Bryan Eikema, Evgenia Ilia, José G. C. de Souza, Chrysoula Zerva, Wilker Aziz

Main category: cs.CL

TL;DR: FUT is a fine-tuning method that teaches LLMs to express uncertainty faithfully by aligning verbal hedges with answer consistency, reducing the faithfulness gap without changing answer distributions.

Details

Motivation: LLMs often miscommunicate uncertainty - they give divergent answers to repeated queries but present responses as confident, creating a faithfulness gap in their knowledge state communication.

Method: Fine-tuning approach that augments model samples with uncertainty hedges (like ‘possibly’, ’likely’) aligned with sample consistency, requiring only the model and prompts without additional supervision.

Result: FUT substantially reduces the faithfulness gap while preserving QA accuracy and introducing minimal semantic distribution shift across multiple models and datasets.

Conclusion: FUT is a simple and effective method to teach LLMs to communicate uncertainty faithfully, showing robustness across decoding strategies, hedge choices, and other uncertainty expression forms.

Abstract: Large language models (LLMs) often miscommunicate their uncertainty: repeated queries can produce divergent answers, yet generated responses are typically unhedged or hedged in ways that do not reflect this variability. This conveys unfaithful information about the uncertain state of the LLMs’ knowledge, creating a faithfulness gap that affects even strong LLMs. We introduce Faithful Uncertainty Tuning (FUT): a fine-tuning approach that teaches instruction-tuned LLMs to express uncertainty faithfully without altering their underlying answer distribution. We construct training data by augmenting model samples with uncertainty hedges (i.e. verbal cues such as ‘possibly’ or ’likely’) aligned with sample consistency, requiring no supervision beyond the model and a set of prompts. We evaluate FUT on open-domain question answering (QA) across multiple models and datasets. Our results show that FUT substantially reduces the faithfulness gap, while preserving QA accuracy and introducing minimal semantic distribution shift. Further analyses demonstrate robustness across decoding strategies, choice of hedgers, and other forms of uncertainty expression (i.e. numerical). These findings establish FUT as a simple and effective way to teach LLMs to communicate uncertainty faithfully.

[53] StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis

Siyuan Li, Aodu Wulianghai, Xi Lin, Guangyan Li, Xiang Chen, Jun Wu, Jianhua Li

Main category: cs.CL

TL;DR: StyleDecipher is a robust framework for detecting machine-generated text by analyzing stylistic differences between human and LLM outputs using combined feature extractors.

Details

Motivation: Existing detection methods struggle with limited generalization, vulnerability to paraphrasing, and lack of explainability in real-world scenarios with stylistic diversity or hybrid human-AI authorship.

Method: Jointly models discrete stylistic indicators and continuous stylistic representations from semantic embeddings to capture style-level divergences in a unified representation space, without requiring model internals or labeled segments.

Result: Achieves state-of-the-art in-domain accuracy across five domains (news, code, essays, reviews, academic abstracts) and surpasses existing baselines by up to 36.30% in cross-domain evaluations, while maintaining robustness against adversarial perturbations.

Conclusion: Stylistic signals provide explainable evidence for distinguishing machine-generated text, enabling accurate, explainable, and domain-agnostic detection.

Abstract: With the increasing integration of large language models (LLMs) into open-domain writing, detecting machine-generated text has become a critical task for ensuring content authenticity and trust. Existing approaches rely on statistical discrepancies or model-specific heuristics to distinguish between LLM-generated and human-written text. However, these methods struggle in real-world scenarios due to limited generalization, vulnerability to paraphrasing, and lack of explainability, particularly when facing stylistic diversity or hybrid human-AI authorship. In this work, we propose StyleDecipher, a robust and explainable detection framework that revisits LLM-generated text detection using combined feature extractors to quantify stylistic differences. By jointly modeling discrete stylistic indicators and continuous stylistic representations derived from semantic embeddings, StyleDecipher captures distinctive style-level divergences between human and LLM outputs within a unified representation space. This framework enables accurate, explainable, and domain-agnostic detection without requiring access to model internals or labeled segments. Extensive experiments across five diverse domains, including news, code, essays, reviews, and academic abstracts, demonstrate that StyleDecipher consistently achieves state-of-the-art in-domain accuracy. Moreover, in cross-domain evaluations, it surpasses existing baselines by up to 36.30%, while maintaining robustness against adversarial perturbations and mixed human-AI content. Further qualitative and quantitative analysis confirms that stylistic signals provide explainable evidence for distinguishing machine-generated text. Our source code can be accessed at https://github.com/SiyuanLi00/StyleDecipher.

[54] ACADATA: Parallel Dataset of Academic Data for Machine Translation

Iñaki Lacunza, Javier Garcia Gilabert, Francesca De Luca Fornaciari, Javier Aula-Blasco, Aitor Gonzalez-Agirre, Maite Melero, Marta Villegas

Main category: cs.CL

TL;DR: ACADATA is a high-quality parallel dataset for academic translation with 1.5M training pairs and 6K evaluation samples, showing that fine-tuning LLMs on it improves academic translation quality by 6.1-12.4 d-BLEU points and outperforms proprietary models.

Details

Motivation: To address the need for high-quality academic translation resources and improve translation quality in academic domains using large language models.

Method: Created ACADATA dataset with ACAD-TRAIN (1.5M paragraph pairs) and ACAD-BENCH (6K translations), then fine-tuned 7B and 2B LLMs on ACAD-TRAIN and benchmarked against various translation systems.

Result: Fine-tuning on ACAD-TRAIN improved academic translation by +6.1 d-BLEU (7B model) and +12.4 d-BLEU (2B model), improved long-context translation by up to 24.9%, and the top model surpassed proprietary and open-weight models in academic translation.

Conclusion: ACADATA provides valuable resources for advancing academic domain and long-context translation research, demonstrating that fine-tuning on academic data significantly improves translation quality.

Abstract: We present ACADATA, a high-quality parallel dataset for academic translation, that consists of two subsets: ACAD-TRAIN, which contains approximately 1.5 million author-generated paragraph pairs across 96 language directions and ACAD-BENCH, a curated evaluation set of almost 6,000 translations covering 12 directions. To validate its utility, we fine-tune two Large Language Models (LLMs) on ACAD-TRAIN and benchmark them on ACAD-BENCH against specialized machine-translation systems, general-purpose, open-weight LLMs, and several large-scale proprietary models. Experimental results demonstrate that fine-tuning on ACAD-TRAIN leads to improvements in academic translation quality by +6.1 and +12.4 d-BLEU points on average for 7B and 2B models respectively, while also improving long-context translation in a general domain by up to 24.9% when translating out of English. The fine-tuned top-performing model surpasses the best propietary and open-weight models on academic translation domain. By releasing ACAD-TRAIN, ACAD-BENCH and the fine-tuned models, we provide the community with a valuable resource to advance research in academic domain and long-context translation.

[55] COSTAR-A: A prompting framework for enhancing Large Language Model performance on Point-of-View questions

Nzubechukwu C. Ohalete, Kevin B. Gittner, Lauren M. Matheny

Main category: cs.CL

TL;DR: COSTAR-A is an enhanced prompt engineering framework that adds ‘Answer’ component to the original COSTAR method, showing improved performance with smaller LLMs like Llama 3.1-8B.

Details

Motivation: LLMs are highly sensitive to prompt design, and existing prompting techniques like COSTAR perform inconsistently with smaller, locally optimized models, especially for directive tasks.

Method: Enhanced the COSTAR framework (Context, Objective, Style, Tone, Audience, Response) by adding ‘Answer’ component, and conducted controlled prompt-output assessments with smaller models (≤8B parameters).

Result: COSTAR-A improves output structure and decisiveness for localized LLMs, with Llama 3.1-8B showing performance improvements, though effectiveness varies across models and use cases.

Conclusion: COSTAR-A demonstrates adaptability and scalability as a prompting framework, particularly valuable for computationally efficient AI deployments on resource-constrained hardware.

Abstract: Large Language Models (LLMs) are highly sensitive to prompt design, and making optimized prompting techniques is crucial for generating consistent, high-quality outputs. In this study, we introduce COSTAR-A, a novel prompt engineering framework that enhances the existing COSTAR method, which stands for Context, Objective, Style, Tone, Audience, and Response, by adding the ‘Answer’ component at the end. We demonstrate that while the original COSTAR framework improves prompt clarity and aligns outputs for larger LLMs, its performance is less consistent with smaller, locally optimized models, particularly in tasks that require more directive or constrained outputs. Through a series of controlled prompt-output assessments with smaller (at most 8 billion parameters), fine-tuned models, we found that COSTAR-A can enhance the output structure and decisiveness of localized LLMs for certain tasks, although its effectiveness varies across models and use cases. Notably, the Llama 3.1-8B model exhibited performance improvements when prompted with COSTAR-A compared to COSTAR alone. These findings emphasize the adaptability and scalability of COSTAR-A as a prompting framework, particularly in computationally efficient AI deployments on resource-constrained hardware.

[56] Reasoning Pattern Matters: Learning to Reason without Human Rationales

Chaoxu Pang, Yixuan Cao, Ping Luo

Main category: cs.CL

TL;DR: The paper shows that for patterned reasoning tasks, LLMs can generate effective rationales without human annotation by focusing on reasoning patterns rather than large-scale human annotations.

Details

Motivation: To reduce the high cost of human-annotated rationales for SFT+RLVR training while maintaining reasoning performance, by leveraging the observation that reasoning patterns are more important than rationale quantity/quality.

Method: Proposed PARO framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without human annotations, using limited human supervision over patterns.

Result: PARO-generated rationales achieved comparable SFT+RLVR performance to human rationales that are 10 times larger, demonstrating effective cost reduction.

Conclusion: Large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns for patterned reasoning tasks.

Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities under the widely adopted SFT+RLVR paradigm, which first performs Supervised Fine-Tuning (SFT) on human-annotated reasoning trajectories (rationales) to establish initial reasoning behaviors, then applies Reinforcement Learning with Verifiable Rewards (RLVR) to optimize the model using verifiable signals without golden rationales. However, annotating high-quality rationales for the SFT stage remains prohibitively expensive. This paper investigates when and how rationale annotation costs can be substantially reduced without compromising reasoning performance. We identify a broad class of problems, termed patterned reasoning tasks, where reasoning follows a fixed, procedural strategy consistent across instances. Although instances vary in content such as domain knowledge, factual information, or numeric values, the solution derives from applying a shared reasoning pattern. We argue that the success of SFT+RLVR on such tasks primarily stems from its ability to enable models to internalize these reasoning patterns. Using numerical semantic matching as a representative task, we provide both causal and behavioral evidence showing that reasoning patterns rather than the quantity or quality of rationales are the key determinant of performance. Building on these insights, we propose Pattern-Aware LLMs as Rationale AnnOtators (PARO), a simple yet effective framework that enables LLMs to generate rationales aligned with task-specific reasoning patterns without requiring human rationale annotations. Experiments show that PARO-generated rationales achieve comparable SFT+RLVR performance to human rationales that are 10 times larger. These results suggest that large-scale human rationale annotations can be replaced with LLM-based automatic annotations requiring only limited human supervision over reasoning patterns.

[57] Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

Sunny Yu, Ahmad Jabbar, Robert Hawkins, Dan Jurafsky, Myra Cheng

Main category: cs.CL

TL;DR: The paper introduces Generation Space Size (GSS) as a unified framework to address LLM failures in both creative (overly homogeneous) and factual (hallucinated diverse) tasks, and presents GSSBench for evaluation.

Details

Motivation: Current LLMs are miscalibrated - they produce overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. These failure modes need a unified approach.

Method: Propose Generation Space Size (GSS) concept and create GSSBench, a task suite with prompt pairs having ground-truth GSS relationships. Evaluate various metrics including EigenScore for hallucination detection.

Result: Hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty metrics. GSS provides interpretable insights into model’s internal task representations.

Conclusion: GSS has three key applications: detecting prompt ambiguity and predicting clarification questions, interpreting overthinking/underthinking in reasoning models, and steering models to expand generation space for better quality and diversity.

Abstract: Different open-ended generation tasks require different degrees of output diversity. However, current LLMs are often miscalibrated. They collapse to overly homogeneous outputs for creative tasks and hallucinate diverse but incorrect responses for factual tasks. We argue that these two failure modes are unified by, and can both be addressed by, the notion of effective generation space size (GSS) – the set of semantically distinct outputs a model considers for a prompt. We present GSSBench, a task suite of prompt pairs with ground-truth GSS relationships to assess different metrics and understand where models diverge from desired behavior. We find that hallucination detection metrics, particularly EigenScore, consistently outperform standard diversity and uncertainty quantification metrics, while using only model internals, providing interpretable insights into a model’s internal task representations. We demonstrate three applications of GSS: (1) detecting prompt ambiguity and predicting clarification questions for better grounding, (2) interpreting overthinking and underthinking in reasoning models, and (3) steering models to expand their generation space to yield high-quality and diverse outputs.

[58] Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages

Nadine El-Naggar, Tatsuki Kuribayashi, Ted Briscoe

Main category: cs.CL

TL;DR: This paper investigates whether language models have inductive biases favoring typologically frequent grammatical properties, using Generalized Categorial Grammar to create more natural artificial languages that include complex constructions like unbounded dependencies.

Details

Motivation: To extend previous work on LM inductive biases by using more naturalistic artificial languages that better capture real language features, and to test generalization ability on longer unseen sentences.

Method: Adopted Generalized Categorial Grammar (GCG) to create artificial languages covering attested constructions like unbounded dependencies and mildly context-sensitive structures, then evaluated LM generalization on longer test sentences.

Result: Typologically plausible word orders tend to be easier for language models to productively generalize to longer, unseen sentences.

Conclusion: The study provides clearer evidence that language models indeed have inductive biases favoring typologically frequent grammatical patterns, with more naturalistic artificial languages and better evaluation methodology.

Abstract: Whether language models (LMs) have inductive biases that favor typologically frequent grammatical properties over rare, implausible ones has been investigated, typically using artificial languages (ALs) (White and Cotterell, 2021; Kuribayashi et al., 2024). In this paper, we extend these works from two perspectives. First, we extend their context-free AL formalization by adopting Generalized Categorial Grammar (GCG) (Wood, 2014), which allows ALs to cover attested but previously overlooked constructions, such as unbounded dependency and mildly context-sensitive structures. Second, our evaluation focuses more on the generalization ability of LMs to process unseen longer test sentences. Thus, our ALs better capture features of natural languages and our experimental paradigm leads to clearer conclusions – typologically plausible word orders tend to be easier for LMs to productively generalize.

[59] Hey, wait a minute: on at-issue sensitivity in Language Models

Sanghee J. Kim, Kanishka Misra

Main category: cs.CL

TL;DR: Introduces DGRC method to evaluate dialogue naturalness in language models using ‘at-issueness’ concept, finding LMs prefer continuing at-issue content with enhanced effects in instruct-tuned models.

Details

Motivation: Evaluating dialogue naturalness in language models is challenging due to varying notions of naturalness and limited scalable metrics.

Method: Divide, Generate, Recombine, and Compare (DGRC): divides dialogue prompts, generates continuations for subparts, recombines dialogues, and compares likelihoods of recombined sequences.

Result: LMs prefer to continue dialogue on at-issue content (enhanced in instruct-tuned models) and reduce at-issue preference when relevant cues are present.

Conclusion: DGRC enables systematic testing of discourse-sensitive behavior and reveals patterns reflecting successful dialogue dynamics in language models.

Abstract: Evaluating the naturalness of dialogue in language models (LMs) is not trivial: notions of ’naturalness’ vary, and scalable quantitative metrics remain limited. This study leverages the linguistic notion of ‘at-issueness’ to assess dialogue naturalness and introduces a new method: Divide, Generate, Recombine, and Compare (DGRC). DGRC (i) divides a dialogue as a prompt, (ii) generates continuations for subparts using LMs, (iii) recombines the dialogue and continuations, and (iv) compares the likelihoods of the recombined sequences. This approach mitigates bias in linguistic analyses of LMs and enables systematic testing of discourse-sensitive behavior. Applying DGRC, we find that LMs prefer to continue dialogue on at-issue content, with this effect enhanced in instruct-tuned models. They also reduce their at-issue preference when relevant cues (e.g., “Hey, wait a minute”) are present. Although instruct-tuning does not further amplify this modulation, the pattern reflects a hallmark of successful dialogue dynamics.

[60] Language Models Model Language

Łukasz Borchmann

Main category: cs.CL

TL;DR: The paper argues for shifting from speculative linguistic critiques of LLMs based on de Saussure and Chomsky to an empiricist approach using Witold Mańczak’s framework that defines language as the totality of all usage and prioritizes frequency as the governing principle.

Details

Motivation: Current linguistic commentary on LLMs is often speculative and unproductive, focusing on theoretical concepts like 'deep structure' and 'grounding' that don't align well with how LLMs actually work.

Method: Adopts Witold Mańczak’s empiricist linguistic framework which defines language as all that is said and written, with frequency of use as the primary governing principle, to analyze and evaluate LLMs.

Result: Provides a constructive alternative perspective that challenges prior critiques of LLMs and offers practical guidance for designing, evaluating, and interpreting language models.

Conclusion: The empiricist approach based on Mańczak’s framework offers a more productive way to understand and work with LLMs by focusing on actual language usage patterns rather than theoretical constructs.

Abstract: Linguistic commentary on LLMs, heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics challenge whether LLMs can legitimately model language, citing the need for “deep structure” or “grounding” to achieve an idealized linguistic “competence.” We argue for a radical shift in perspective towards the empiricist principles of Witold Ma'nczak, a prominent general and historical linguist. He defines language not as a “system of signs” or a “computational system of the brain” but as the totality of all that is said and written. Above all, he identifies frequency of use of particular language elements as language’s primary governing principle. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.

[61] Dr.LLM: Dynamic Layer Routing in LLMs

Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh

Main category: cs.CL

TL;DR: Dr.LLM is a retrofittable framework that adds lightweight routers to pretrained LLMs, enabling dynamic layer skipping/execution/repeating to improve efficiency while maintaining or improving accuracy.

Details

Motivation: Current LLMs waste computation on simple queries and lack flexibility for complex reasoning tasks that need deeper processing. Adaptive-depth methods exist but often degrade accuracy or require costly modifications.

Method: Uses lightweight per-layer routers trained with MCTS supervision to derive optimal layer configurations. Features windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers for robustness.

Result: Improves accuracy by up to +3.4% on ARC and DART while saving 5 layers per example. Generalizes well to out-of-domain tasks with only 0.85% accuracy drop and outperforms prior routing methods by up to +7.7%.

Conclusion: Dr.LLM demonstrates that explicitly supervised routers can retrofit frozen LLMs for budget-aware, accuracy-driven inference without modifying base model weights.

Abstract: Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can improve efficiency, but prior approaches rely on costly inference-time search, architectural changes, or large-scale retraining, and in practice often degrade accuracy despite efficiency gains. We introduce Dr.LLM, Dynamic routing of Layers for LLMs, a retrofittable framework that equips pretrained models with lightweight per-layer routers deciding to skip, execute, or repeat a block. Routers are trained with explicit supervision: using Monte Carlo Tree Search (MCTS), we derive high-quality layer configurations that preserve or improve accuracy under a compute budget. Our design, windowed pooling for stable routing, focal loss with class balancing, and bottleneck MLP routers, ensures robustness under class imbalance and long sequences. On ARC (logic) and DART (math), Dr.LLM improves accuracy by up to +3.4%p while saving 5 layers per example on average. Routers generalize to out-of-domain tasks (MMLU, GSM8k, AIME, TruthfulQA, SQuADv2, GPQA, PIQA, AGIEval) with only 0.85% accuracy drop while retaining efficiency, and outperform prior routing methods by up to +7.7%p. Overall, Dr.LLM shows that explicitly supervised routers retrofit frozen LLMs for budget-aware, accuracy-driven inference without altering base weights.

[62] Cost Analysis of Human-corrected Transcription for Predominately Oral Languages

Yacouba Diarra, Nouhoum Souleymane Coulibaly, Michael Leventhal

Main category: cs.CL

TL;DR: It takes 30-36 hours of human labor to accurately transcribe one hour of speech data for low-resource oral languages like Bambara, based on a field study with native transcribers.

Details

Motivation: Understanding the actual human labor cost involved in creating speech datasets for low-resource languages, particularly predominantly oral languages with low literacy rates.

Method: One-month field study with ten native transcribers correcting ASR-generated transcriptions of 53 hours of Bambara voice data, comparing laboratory vs field conditions.

Result: Average transcription time is 30 hours per hour of speech in lab conditions and 36 hours in field conditions for Bambara.

Conclusion: Provides baseline cost estimates and practical insights for creating NLP resources for similar low-resource predominantly oral languages.

Abstract: Creating speech datasets for low-resource languages is a critical yet poorly understood challenge, particularly regarding the actual cost in human labor. This paper investigates the time and complexity required to produce high-quality annotated speech data for a subset of low-resource languages, low literacy Predominately Oral Languages, focusing on Bambara, a Manding language of Mali. Through a one-month field study involving ten transcribers with native proficiency, we analyze the correction of ASR-generated transcriptions of 53 hours of Bambara voice data. We report that it takes, on average, 30 hours of human labor to accurately transcribe one hour of speech data under laboratory conditions and 36 hours under field conditions. The study provides a baseline and practical insights for a large class of languages with comparable profiles undertaking the creation of NLP resources.

[63] MLRIP: Pre-training a military language representation model with informative factual knowledge and professional knowledge base

Hui Li, Xuekang Yang

Main category: cs.CL

TL;DR: MLRIP is a novel pre-training framework that integrates structured military knowledge through hierarchical knowledge integration and dual-phase entity substitution, achieving state-of-the-art performance in military NLP tasks.

Details

Motivation: Existing approaches fail to fully leverage intrinsic tactical associations and factual information within input sequences while introducing uncontrolled noise from unverified external sources in military intelligence analysis.

Method: Introduces hierarchical knowledge integration pipeline combined with dual-phase entity substitution mechanism, specifically modeling operational linkages between military entities (command, support, engagement structures).

Result: Outperforms existing BERT-based models by substantial margins, establishing new state-of-the-art performance in military entity recognition, typing, and operational linkage extraction tasks with superior operational efficiency.

Conclusion: MLRIP effectively addresses limitations of existing knowledge integration methods and demonstrates significant benefits for domain-specific military NLP applications.

Abstract: Incorporating structured knowledge into pre-trained language models has demonstrated signiffcant bene-ffts for domain-speciffc natural language processing tasks, particularly in specialized ffelds like military intelligence analysis. Existing approaches typically integrate external knowledge through masking tech-niques or fusion mechanisms, but often fail to fully leverage the intrinsic tactical associations and factual information within input sequences, while introducing uncontrolled noise from unveriffed exter-nal sources. To address these limitations, we present MLRIP (Military Language Representation with Integrated Prior), a novel pre-training framework that introduces a hierarchical knowledge integration pipeline combined with a dual-phase entity substitu-tion mechanism. Our approach speciffcally models operational linkages between military entities, capturing critical dependencies such as command, support, and engagement structures. Comprehensive evaluations on military-speciffc NLP tasks show that MLRIP outperforms existing BERT-based models by substantial margins, establishing new state-of-the-art performance in military entity recognition, typing, and operational linkage extraction tasks while demonstrating superior operational efffciency in resource-constrained environments.

[64] GRDD: A Dataset for Greek Dialectal NLP

Stergios Chatzikyriakidis, Chatrine Qwaider, Ilias Kolokousis, Christina Koula, Dimitris Papadakis, Efthymia Sakellariou

Main category: cs.CL

TL;DR: A dataset of four Modern Greek dialects (Cretan, Pontic, Northern Greek, Cypriot) is created and used for dialect identification using ML and DL methods, achieving high performance despite dataset imbalance.

Details

Motivation: To address the lack of large-scale dialectal resources for Modern Greek dialects and enable computational study of these linguistic varieties.

Method: Created an imbalanced dataset of raw text from four Modern Greek dialects, then applied traditional ML algorithms and simple deep learning architectures for dialect identification.

Result: Very good performance on dialect identification task, suggesting distinct characteristics between dialects that allow even simple models to perform well. Error analysis revealed some errors were due to insufficient dataset cleaning.

Conclusion: The study successfully demonstrates the feasibility of dialect identification for Modern Greek using computational methods, while highlighting the importance of dataset quality for optimal performance.

Abstract: In this paper, we present a dataset for the computational study of a number of Modern Greek dialects. It consists of raw text data from four dialects of Modern Greek, Cretan, Pontic, Northern Greek and Cypriot Greek. The dataset is of considerable size, albeit imbalanced, and presents the first attempt to create large scale dialectal resources of this type for Modern Greek dialects. We then use the dataset to perform dialect idefntification. We experiment with traditional ML algorithms, as well as simple DL architectures. The results show very good performance on the task, potentially revealing that the dialects in question have distinct enough characteristics allowing even simple ML models to perform well on the task. Error analysis is performed for the top performing algorithms showing that in a number of cases the errors are due to insufficient dataset cleaning.

[65] When “Competency” in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers

Divij Handa, Zehua Zhang, Amir Saeidi, Shrinidhi Kumbhar, Md Nayem Uddin, Aswin RRV, Chitta Baral

Main category: cs.CL

TL;DR: LLMs become more vulnerable to jailbreaking as they improve at reasoning and decoding custom ciphers. ACE and LACE attacks exploit this by encoding malicious queries with novel ciphers, achieving up to 72% success rates.

Details

Motivation: To study the paradoxical vulnerability where advanced LLMs with better reasoning capabilities become more susceptible to novel jailbreaking attacks through custom cipher decoding.

Method: Introduce ACE (Attacks using Custom Encryptions) that encodes malicious queries with novel ciphers, and LACE (Layered Attacks using Custom Encryptions) using multi-layer ciphers. Develop CipherBench benchmark to evaluate LLM cipher decoding accuracy.

Result: Experiments show LLMs better at decoding ciphers are more vulnerable to LACE, with success rates on gpt-oss-20b increasing from 60% with ACE to 72% with LACE.

Conclusion: There’s a critical trade-off: as LLMs improve at deciphering complex user ciphers (which can’t be preemptively included in safety training), they become increasingly exploitable to jailbreaking attacks.

Abstract: Recent advancements in Large Language Model (LLM) safety have primarily focused on mitigating attacks crafted in natural language or common ciphers (e.g. Base64), which are likely integrated into newer models’ safety training. However, we reveal a paradoxical vulnerability: as LLMs advance in reasoning, they inadvertently become more susceptible to novel jailbreaking attacks. Enhanced reasoning enables LLMs to interpret complex instructions and decode complex user-defined ciphers, creating an exploitable security gap. To study this vulnerability, we introduce Attacks using Custom Encryptions (ACE), a jailbreaking technique that encodes malicious queries with novel ciphers. Extending ACE, we introduce Layered Attacks using Custom Encryptions (LACE), which applies multi-layer ciphers to amplify attack complexity. Furthermore, we develop CipherBench, a benchmark designed to evaluate LLMs’ accuracy in decoding encrypted benign text. Our experiments reveal a critical trade-off: LLMs that are more capable of decoding ciphers are more vulnerable to LACE, with success rates on gpt-oss-20b escalating from 60% under ACE to 72% with LACE. These findings highlight a critical insight: as LLMs become more adept at deciphering complex user ciphers–many of which cannot be preemptively included in safety training–they become increasingly exploitable.

[66] Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning

Congying Liu, Gaosheng Wang, Peipei Liu, Xingyuan Wei, Hongsong Zhu

Main category: cs.CL

TL;DR: MsFNER is a hybrid multi-stage decoding framework for few-shot NER that splits the task into entity-span detection and entity classification, using entity-aware contrastive learning and meta-learning to improve performance.

Details

Motivation: Previous few-shot NER methods using token-level or span-level metric learning suffer from computational burden and large numbers of negative sample spans, requiring a more efficient approach.

Method: Two-stage approach: entity-span detection followed by entity classification. Uses meta-learning on source domain, contrastive learning for entity representations, and combines model predictions with KNN during inference.

Result: Experiments on FewNERD dataset demonstrate the advance of MsFNER over previous methods.

Conclusion: The proposed MsFNER framework effectively addresses computational challenges in few-shot NER through its hybrid multi-stage decoding approach with entity-aware contrastive learning.

Abstract: Few-shot named entity recognition can identify new types of named entities based on a few labeled examples. Previous methods employing token-level or span-level metric learning suffer from the computational burden and a large number of negative sample spans. In this paper, we propose the Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning (MsFNER), which splits the general NER into two stages: entity-span detection and entity classification. There are 3 processes for introducing MsFNER: training, finetuning, and inference. In the training process, we train and get the best entity-span detection model and the entity classification model separately on the source domain using meta-learning, where we create a contrastive learning module to enhance entity representations for entity classification. During finetuning, we finetune the both models on the support dataset of target domain. In the inference process, for the unlabeled data, we first detect the entity-spans, then the entity-spans are jointly determined by the entity classification model and the KNN. We conduct experiments on the open FewNERD dataset and the results demonstrate the advance of MsFNER.

Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song

Main category: cs.CL

TL;DR: Textual unlearning in Vision-Language Models effectively reduces attack success rates for both text-based and vision-text-based attacks without requiring multi-modal training data.

Details

Motivation: Existing safety training techniques like SFT and RLHF are bypassed when new modalities are integrated into LLMs, and collecting multi-modal safety training datasets is challenging.

Method: Leveraging the structural design of multi-modal models where all inputs fuse into language space, the authors explore textual unlearning for cross-modality safety alignment without multi-modal data.

Result: Textual unlearning reduces Attack Success Rate to less than 8% (as low as 2%) for both text and vision-text attacks while preserving utility. Multi-modal unlearning offers no benefits but increases computational costs up to 6x.

Conclusion: Textual domain unlearning is sufficient and effective for cross-modality safety alignment in VLMs, eliminating the need for costly multi-modal training data collection.

Abstract: Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability – textual unlearning in VLMs significantly reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential benefits but incurs significantly increased computational demands, possibly up to 6 times higher.

[68] The Open Source Advantage in Large Language Models (LLMs)

Jiya Manchanda, Laura Boettcher, Matheus Westphalen, Jasser Jasser

Main category: cs.CL

TL;DR: This position paper argues that open-source approaches are the most robust path for advancing LLM research and ethical deployment, despite closed-source models delivering state-of-the-art performance.

Details

Motivation: The field faces a critical dilemma between closed-source models (high performance but restricted reproducibility/accessibility) and open-source frameworks (democratized access and collaboration). Hybrid approaches attempt to combine benefits but the authors believe open-source is superior.

Method: The paper presents a position argument analyzing the trade-offs between closed-source, open-source, and hybrid approaches to LLM development, focusing on reproducibility, accessibility, oversight, and ethical considerations.

Result: The analysis concludes that while closed-source models achieve state-of-the-art performance and hybrid approaches address some challenges, open-source frameworks provide the most robust foundation for advancing LLM research and ensuring ethical deployment.

Conclusion: Open-source remains the optimal path forward for LLM development, as it democratizes access, fosters collaboration, supports diverse applications, and enables transparency and external oversight essential for ethical deployment.

Abstract: Large language models (LLMs) have rapidly advanced natural language processing, driving significant breakthroughs in tasks such as text generation, machine translation, and domain-specific reasoning. The field now faces a critical dilemma in its approach: closed-source models like GPT-4 deliver state-of-the-art performance but restrict reproducibility, accessibility, and external oversight, while open-source frameworks like LLaMA and Mixtral democratize access, foster collaboration, and support diverse applications, achieving competitive results through techniques like instruction tuning and LoRA. Hybrid approaches address challenges like bias mitigation and resource accessibility by combining the scalability of closed-source systems with the transparency and inclusivity of open-source framework. However, in this position paper, we argue that open-source remains the most robust path for advancing LLM research and ethical deployment.

[69] AFRIDOC-MT: Document-level MT Corpus for African Languages

Jesujoba O. Alabi, Israel Abebe Azime, Miaoran Zhang, Cristina España-Bonet, Rachel Bawden, Dawei Zhu, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Idris Akinade, Iffat Maab, Davis David, Shamsuddeen Hassan Muhammad, Neo Putini, David O. Ademuyiwa, Andrew Caines, Dietrich Klakow

Main category: cs.CL

TL;DR: AFRIDOC-MT is a document-level multi-parallel translation dataset for English and five African languages, with benchmark experiments showing NLLB-200 and GPT-4o as top performers, but revealing generalization challenges and quality issues in LLMs for African languages.

Details

Motivation: To address the scarcity of document-level translation resources for African languages and evaluate translation performance across different model types for these under-resourced languages.

Method: Created a dataset of 334 health and 271 IT news documents translated from English to five African languages, then evaluated NMT models and LLMs at sentence and pseudo-document levels with realignment for document-level evaluation.

Result: NLLB-200 performed best among standard NMT models, GPT-4o outperformed general-purpose LLMs, fine-tuning improved performance but sentence-trained models struggled with longer documents, and LLMs showed quality issues like under-generation and repetition for African languages.

Conclusion: Document-level translation for African languages remains challenging, with models showing generalization issues and quality problems, highlighting the need for improved resources and model capabilities for these languages.

Abstract: This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yor`ub'a, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.

[70] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

Xiang Liu, Zhenheng Tang, Peijie Dong, Zeyu Li, Yue Liu, Bo Li, Xuming Hu, Xiaowen Chu

Main category: cs.CL

TL;DR: ChunkKV introduces a semantic-aware KV cache compression method that treats semantic chunks as basic units instead of individual tokens, preserving contextual integrity and improving both memory efficiency and model performance.

Details

Motivation: Current KV cache compression methods focus on individual token importance but overlook semantic relationships, leading to fragmented context and performance degradation in long-text LLM inference.

Method: ChunkKV treats semantic chunks as compression units and uses a layer-wise index reuse technique that exploits cross-layer similarity of preserved indices to reduce computational overhead.

Result: ChunkKV improves throughput by 26.5% and outperforms state-of-the-art methods by up to 8.7% in precision while maintaining the same compression ratio across multiple benchmarks including LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV.

Conclusion: Semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing an effective solution to the memory bottleneck problem.

Abstract: Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key value (KV) cache consuming up to 70% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce ChunkKV, which fundamentally reimagines KV cache compression by treating semantic chunks - rather than isolated tokens - as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in ChunkKV, reducing computational overhead and improving throughput by 26.5%. Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem. The code is available at \href{https://github.com/NVIDIA/kvpress}{link}.

[71] From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models

Yurui Dong, Luozhijie Jin, Yao Yang, Bingjie Lu, Jiaxi Yang, Zhi Liu

Main category: cs.CL

TL;DR: Proposes Emotion Vectors (EVs) - latent representations from activation shifts between neutral and emotional responses - to enable controllable emotion generation in LLMs without training or architectural changes.

Details

Motivation: LLMs struggle with consistent, controllable emotional expression, limiting authentic human-AI interaction despite strong reasoning capabilities.

Method: Inject Emotion Vectors derived from internal activation shifts into hidden states during inference, enabling fine-grained emotional tone modulation without additional training.

Result: Achieves consistent emotional alignment, stable topic adherence, and controllable affect intensity across multiple LLM families, outperforming prompt-based and fine-tuning baselines.

Conclusion: EV steering efficiently bridges rational reasoning and affective understanding in LLMs, offering promising direction for emotionally resonant AI systems.

Abstract: Purpose: Emotion is a fundamental component of human communication, shaping understanding, trust, and engagement across domains such as education, healthcare, and mental health. While large language models (LLMs) exhibit strong reasoning and knowledge generation capabilities, they still struggle to express emotions in a consistent, controllable, and contextually appropriate manner. This limitation restricts their potential for authentic human-AI interaction. Methods: We propose a controllable emotion generation framework based on Emotion Vectors (EVs) - latent representations derived from internal activation shifts between neutral and emotion-conditioned responses. By injecting these vectors into the hidden states of pretrained LLMs during inference, our method enables fine-grained, continuous modulation of emotional tone without any additional training or architectural modification. We further provide theoretical analysis proving that EV steering enhances emotional expressivity while maintaining semantic fidelity and linguistic fluency. Results: Extensive experiments across multiple LLM families show that the proposed approach achieves consistent emotional alignment, stable topic adherence, and controllable affect intensity. Compared with existing prompt-based and fine-tuning-based baselines, our method demonstrates superior flexibility and generalizability. Conclusion: Emotion Vector (EV) steering provides an efficient and interpretable means of bridging rational reasoning and affective understanding in large language models, offering a promising direction for building emotionally resonant AI systems capable of more natural human-machine interaction.

[72] A Survey of Multilingual Reasoning in Language Models

Akash Ghosh, Debayan Datta, Sriparna Saha, Chirag Agarwal

Main category: cs.CL

TL;DR: This survey provides the first comprehensive review of multilingual reasoning in language models, covering challenges, methods, benchmarks, and future research directions for handling logical reasoning across diverse languages.

Details

Motivation: While reasoning and multilingual capabilities in LMs have advanced significantly, their integration into unified multilingual reasoning remains underdeveloped, with challenges including misalignment, biases, and low-resource language difficulties.

Method: The survey systematically reviews existing methods for multilingual reasoning in LMs, analyzes standard data resources and evaluation benchmarks, and examines state-of-the-art approaches and their performance.

Result: The paper provides a comprehensive overview of the current state of multilingual reasoning research, identifying key challenges and performance metrics across different approaches.

Conclusion: Future research should focus on enhancing LMs’ ability to handle diverse languages and complex reasoning tasks, with ongoing developments tracked through the project’s GitHub repository.

Abstract: While reasoning and multilingual capabilities in language models (LMs) have achieved remarkable progress in recent years, their integration into a unified paradigm - multilingual reasoning - is at a nascent stage. Multilingual reasoning requires language models to handle logical reasoning across languages while addressing misalignment, biases, and challenges in low-resource settings. This survey provides the first in-depth review of multilingual reasoning in LMs. In this survey, we provide a systematic overview of existing methods that leverage LMs for multilingual reasoning, specifically outlining the challenges, motivations, and foundational aspects of applying language models to reason across diverse languages. We provide an overview of the standard data resources used for training multilingual reasoning in LMs and the evaluation benchmarks employed to assess their multilingual capabilities. Next, we analyze various state-of-the-art methods and their performance on these benchmarks. Finally, we explore future research opportunities to improve multilingual reasoning in LMs, focusing on enhancing their ability to handle diverse languages and complex reasoning tasks. Rapid growth of evolving developments in this field can be actively tracked on our project page: https://github.com/AkashGhosh/Survey-of-Multilingual-Reasoning-in-Language-Models

[73] Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking

Alireza S. Ziabari, Nona Ghazizadeh, Zhivar Sourati, Farzan Karimi-Malekabadi, Payam Piray, Morteza Dehghani

Main category: cs.CL

TL;DR: LLMs can be aligned to human-like System 1 (intuitive) and System 2 (analytical) reasoning styles, revealing a trade-off: System 2 excels in arithmetic/symbolic reasoning while System 1 performs better in commonsense tasks. Combining both approaches based on entropy outperforms single-style models.

Details

Motivation: Human cognition flexibly adapts between intuitive and analytical reasoning, while LLMs rely on uniform step-by-step processing. This raises questions about whether this inflexibility makes LLMs brittle and whether different reasoning styles might be optimal for different tasks.

Method: Curated dataset with System 1 and System 2 answers, aligned LLMs to both reasoning styles, interpolated between extremes, and combined models based on generation entropy without additional training.

Result: Accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic/symbolic reasoning, System 1-aligned models better in commonsense reasoning. Interpolation shows monotonic accuracy changes. Combined model outperforms across nearly all benchmarks.

Conclusion: Step-by-step reasoning is not always optimal; adapting reasoning strategies based on task demands is crucial. Combining System 1 and System 2 approaches creates more robust and dynamic models.

Abstract: Large Language Models (LLMs) exhibit impressive reasoning abilities, yet their reliance on structured step-by-step processing reveals a critical limitation. In contrast, human cognition fluidly adapts between intuitive, heuristic (System 1) and analytical, deliberative (System 2) reasoning depending on the context. This difference between human cognitive flexibility and LLMs’ reliance on a single reasoning style raises a critical question: while human fast heuristic reasoning evolved for its efficiency and adaptability, is a uniform reasoning approach truly optimal for LLMs, or does its inflexibility make them brittle and unreliable when faced with tasks demanding more agile, intuitive responses? To answer these questions, we explicitly align LLMs to these reasoning styles by curating a dataset with valid System 1 and System 2 answers, and evaluate their performance across reasoning benchmarks. Our results reveal an accuracy-efficiency trade-off: System 2-aligned models excel in arithmetic and symbolic reasoning, while System 1-aligned models perform better in commonsense reasoning tasks. To analyze the reasoning spectrum, we interpolated between the two extremes by varying the proportion of alignment data, which resulted in a monotonic change in accuracy. A mechanistic analysis of model responses shows that System 1 models employ more definitive outputs, whereas System 2 models demonstrate greater uncertainty. Building on these findings, we further combine System 1- and System 2-aligned models based on the entropy of their generations, without additional training, and obtain a dynamic model that outperforms across nearly all benchmarks. This work challenges the assumption that step-by-step reasoning is always optimal and highlights the need for adapting reasoning strategies based on task demands.

[74] Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions

Angana Borah, Rada Mihalcea, Verónica Pérez-Rosas

Main category: cs.CL

TL;DR: This study examines how demographic factors affect misinformation susceptibility in both humans and LLMs, showing LLMs replicate human-like demographic patterns in persuasion dynamics and echo chamber behavior.

Details

Motivation: To understand the bidirectional persuasion dynamics between LLMs and humans regarding misinformation, particularly how demographic vulnerabilities in humans are reflected in LLMs and how LLMs' persuasive capabilities can influence human susceptibility.

Method: Used human-stance datasets to analyze human-to-LLM influence, generated LLM-based persuasive arguments to assess LLM-to-human influence, and employed a multi-agent LLM framework to study misinformation spread among demographic-oriented agents.

Result: Demographic factors significantly influence misinformation susceptibility in LLMs, mirroring human demographic patterns. Multi-agent LLMs exhibit echo chamber behavior similar to human demographic groups.

Conclusion: The research reveals important parallels between human and LLM susceptibility to misinformation across demographic lines, providing insights for developing future interventions to address misinformation challenges in AI systems.

Abstract: Existing challenges in misinformation exposure and susceptibility vary across demographic groups, as some populations are more vulnerable to misinformation than others. Large language models (LLMs) introduce new dimensions to these challenges through their ability to generate persuasive content at scale and reinforcing existing biases. This study investigates the bidirectional persuasion dynamics between LLMs and humans when exposed to misinformative content. We analyze human-to-LLM influence using human-stance datasets and assess LLM-to-human influence by generating LLM-based persuasive arguments. Additionally, we use a multi-agent LLM framework to analyze the spread of misinformation under persuasion among demographic-oriented LLM agents. Our findings show that demographic factors influence susceptibility to misinformation in LLMs, closely reflecting the demographic-based patterns seen in human susceptibility. We also find that, similar to human demographic groups, multi-agent LLMs exhibit echo chamber behavior. This research explores the interplay between humans and LLMs, highlighting demographic differences in the context of misinformation and offering insights for future interventions.

[75] EmoDebt: Bayesian-Optimized Emotional Intelligence for Strategic Agent-to-Agent Debt Recovery

Yunbo Long, Yuhan Liu, Liming Xu, Alexandra Brintrup

Main category: cs.CL

TL;DR: EmoDebt is an LLM agent with a Bayesian-optimized emotional intelligence engine that adapts to adversarial emotional tactics in debt collection negotiations, significantly outperforming non-adaptive baselines.

Details

Motivation: LLM agents in emotion-sensitive domains like debt collection are vulnerable to exploitation by adversaries who simulate negative emotions to derail negotiations.

Method: Developed a dataset of debt recovery scenarios and a multi-agent simulation framework. Created EmoDebt with a Bayesian-optimized emotional intelligence engine that treats emotional expression as sequential decision-making, using online learning to tune emotional transition policies.

Result: EmoDebt achieves significant strategic robustness, substantially outperforming non-adaptive and emotion-agnostic baselines across key metrics including success rate and operational efficiency.

Conclusion: This work establishes a foundation for deploying strategically robust LLM agents in adversarial, emotion-sensitive debt interactions through both a critical benchmark and an adaptive agent.

Abstract: The emergence of autonomous Large Language Model (LLM) agents has created a new ecosystem of strategic, agent-to-agent interactions. However, a critical challenge remains unaddressed: in high-stakes, emotion-sensitive domains like debt collection, LLM agents pre-trained on human dialogue are vulnerable to exploitation by adversarial counterparts who simulate negative emotions to derail negotiations. To fill this gap, we first contribute a novel dataset of simulated debt recovery scenarios and a multi-agent simulation framework. Within this framework, we introduce EmoDebt, an LLM agent architected for robust performance. Its core innovation is a Bayesian-optimized emotional intelligence engine that reframes a model’s ability to express emotion in negotiation as a sequential decision-making problem. Through online learning, this engine continuously tunes EmoDebt’s emotional transition policies, discovering optimal counter-strategies against specific debtor tactics. Extensive experiments on our proposed benchmark demonstrate that EmoDebt achieves significant strategic robustness, substantially outperforming non-adaptive and emotion-agnostic baselines across key performance metrics, including success rate and operational efficiency. By introducing both a critical benchmark and a robustly adaptive agent, this work establishes a new foundation for deploying strategically robust LLM agents in adversarial, emotion-sensitive debt interactions.

[76] Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities

Nikita Tatarinov, Siddhant Sukhani, Agam Shah, Sudheer Chava

Main category: cs.CL

TL;DR: This paper systematically reviews 374 NLP papers (2017-2024) with focus on 221 finance-related papers, identifying key opportunities for expanding forecasting tasks, enriching evaluation metrics, leveraging multilingual datasets, and balancing model efficiency with interpretability.

Details

Motivation: To systematically examine the growing trend of finance-related papers in top-tier NLP venues and identify opportunities for future research directions in this emerging field.

Method: Comprehensive review of 374 NLP research papers across 38 conferences and workshops (2017-2024), with focused analysis of 221 finance-related papers, evaluated across 11 quantitative and qualitative dimensions.

Result: Identified four key opportunities: expanding forecasting tasks, enriching evaluation with financial metrics, leveraging multilingual and crisis-period datasets, and balancing PLMs with efficient/interpretable alternatives. Provided actionable directions with dataset and tool recommendations.

Conclusion: The study provides systematic analysis of NLP in finance research trends and offers concrete recommendations for both academia and industry to advance the field through expanded scope, better evaluation, diverse datasets, and balanced model approaches.

Abstract: Recent advances in language modeling have led to a growing number of papers related to finance in top-tier Natural Language Processing (NLP) venues. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, and our study identifies the following opportunities for NLP researchers: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with financial metrics; (iii) leveraging multilingual and crisis-period datasets; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions supported by dataset and tool recommendations, with implications for both the academia and industry communities.

[77] AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery

Amirhossein Abaskohi, Amrutha Varshini Ramesh, Shailesh Nanisetty, Chirag Goel, David Vazquez, Christopher Pal, Spandana Gella, Giuseppe Carenini, Issam H. Laradji

Main category: cs.CL

TL;DR: AgentAda is the first LLM-powered analytics agent that automatically learns and applies specialized analytics skills from a library to extract insights, outperforming existing methods through a three-step process: question generation, skill matching, and code generation.

Details

Motivation: Existing analytics methods require users to manually select which data analytics methods to apply, limiting their ability to handle complex tasks that LLMs cannot perform out of the box.

Method: Three-step strategy: (I) question generator for user-relevant queries, (II) hybrid RAG-based skill matcher to select best analytics skill from library, (III) code generator producing executable code based on skill documentation.

Result: Human evaluation showed 48.78% of evaluators preferred AgentAda’s analyses vs 27.67% for unskilled agent. Also introduced KaggleBench benchmark and LLM-as-a-judge approach aligned with human evaluation.

Conclusion: AgentAda successfully automates specialized analytics skill selection and application, providing more insightful analyses than existing tools, with scalable evaluation methods.

Abstract: We introduce AgentAda, the first LLM-powered analytics agent that can learn and use new analytics skills to extract more specialized insights. Unlike existing methods that require users to manually decide which data analytics method to apply, AgentAda automatically identifies the skill needed from a library of analytical skills to perform the analysis. This also allows AgentAda to use skills that existing LLMs cannot perform out of the box. The library covers a range of methods, including clustering, predictive modeling, and NLP techniques like BERT, which allow AgentAda to handle complex analytics tasks based on what the user needs. AgentAda’s dataset-to-insight extraction strategy consists of three key steps: (I) a question generator to generate queries relevant to the user’s goal and persona, (II) a hybrid Retrieval-Augmented Generation (RAG)-based skill matcher to choose the best data analytics skill from the skill library, and (III) a code generator that produces executable code based on the retrieved skill’s documentation to extract key patterns. We also introduce KaggleBench, a benchmark of curated notebooks across diverse domains, to evaluate AgentAda’s performance. We conducted a human evaluation demonstrating that AgentAda provides more insightful analytics than existing tools, with 48.78% of evaluators preferring its analyses, compared to 27.67% for the unskilled agent. We also propose a novel LLM-as-a-judge approach that we show is aligned with human evaluation as a way to automate insight quality evaluation at larger scale.

[78] Exploring Compositional Generalization (in COGS/ReCOGS_pos) by Transformers using Restricted Access Sequence Processing (RASP)

William Bruns

Main category: cs.CL

TL;DR: This paper demonstrates that Transformer models can achieve near perfect compositional generalization on COGS and ReCOGS benchmarks using flat pattern-matching rules in RASP, challenging previous claims that hierarchical/tree-structured solutions are necessary.

Details

Motivation: To challenge the claim that Transformer models cannot perform structural generalization on COGS benchmark (which reported 0% accuracy) and show that flat pattern-matching approaches can achieve near perfect scores without requiring hierarchical solutions.

Method: Used RASP (Restricted Access Sequence Processing) programming language to implement Transformer Encoder-Decoder models with word-level tokens, embedding layer for POS tagging, 19 attention-head compatible flat pattern-matching rules, grammar coverage analysis, and masking techniques for prepositional phrases and sentential complements.

Result: Achieved near perfect scores on structural generalization splits for both COGS (exact match) and ReCOGS_pos (semantic exact match), including handling pp recursion and cp recursion through decoder loops.

Conclusion: The COGS and ReCOGS tasks do not require hierarchical or tree-structured solutions as previously claimed; flat pattern-matching approaches in Transformers can systematically and compositionally solve these tasks.

Abstract: Humans understand new combinations of words encountered if they are combinations of words recognized from different contexts, an ability called Compositional Generalization. The COGS benchmark (Kim and Linzen, 2020) arXiv:2010.05465 reports 0% accuracy for Transformer models on some structural generalizations. We use (Weiss et al., 2021) arXiv:2106.06981’s Restricted Access Sequence Processing (RASP), a Transformer-equivalent programming language, to demonstrate that a Transformer Encoder-Decoder can perform COGS and the semantically equivalent ReCOGS_pos (Wu et al., 2024) arXiv:2303.13716 systematically and compositionally: Our RASP models attain near perfect scores on structural generalization splits on COGS (exact match) and ReCOGS_pos (semantic exact match). Our RASP models show the (Re)COGS tasks do not require a hierarchical or tree-structured solution (contrary to (Kim and Linzen, 2020) arXiv:2010.05465, (Yao and Koller, 2022) arXiv:2210.13050, (Murty et al., 2022) arXiv:2211.01288, (Liu et al., 2021) arXiv:2107.06516): we use word-level tokens with an “embedding” layer that tags with possible part of speech, applying just once per encoder pass 19 attention-head compatible flat pattern-matching rules (easily identified with specific training examples), shown using grammar coverage (Zeller et al., 2023) to cover the non-recursive aspects of the input grammar, plus masking out prepositional phrases (“pp noun”) and/or sentential complements (cp) when recognizing grammar patterns and extracting nouns related to the main verb in the sentence, and output the next logical form (LF) token (repeating until the LF is complete). The models do not apply recursive, tree-structured rules like “np_det pp np -> np_pp -> np”, but score near perfect semantic and string exact match on both COGS and ReCOGS pp recursion, cp recursion using the decoder loop.

[79] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Zichun Yu, Chenyan Xiong

Main category: cs.CL

TL;DR: RePro is a web recycling method that trains a small language model with reinforcement learning to generate high-quality rephrasings of pretraining data, improving data efficiency and model performance.

Details

Motivation: High-quality pretraining data is becoming scarce for frontier LLMs, creating a need for efficient data recycling methods to extend the utility of existing data.

Method: Trains a 4B parameter LM with reinforcement learning using one quality reward and three faithfulness rewards to generate effective rephrasings while preserving core semantics and structure of original data.

Result: Achieves 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks, outperforms state-of-the-art methods, and improves organic data efficiency by 2-3x.

Conclusion: RePro provides an efficient and controllable approach to effectively harness existing pretraining data, offering a sustainable path forward for LLM development.

Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at https://github.com/cxcscmu/RePro.

[80] The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors

Linxuan Wang, Shuiyuan Yu

Main category: cs.CL

TL;DR: Study explores relationship between dependency distance and hierarchical distance in Japanese, finding predicate valency drives trade-off between linear and hierarchical complexity.

Details

Motivation: To understand the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese language structure.

Method: Compared probability distributions of DD and HD with/without fixed sentence length, analyzed changes in MDD and MHD as sentence length increases, and calculated correlation coefficients using Balanced Corpus of Contemporary Written Japanese.

Result: Predicate valency is the underlying factor driving trade-off between MDD and MHD. Native speakers regulate linear and hierarchical complexity through predicate valency. Valency affects probability distributions of DD and HD differently, with greater effect on HD distribution.

Conclusion: Predicate valency regulates the trade-off between dependency distance and hierarchical distance in Japanese, with mean MDD being lower than mean MHD due to valency’s differential effects on their distributions.

Abstract: To explore the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, we compared the probability distributions of DD and HD with and without sentence length fixed, and analyzed the changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) as sentence length increases, along with their correlation coefficient based on the Balanced Corpus of Contemporary Written Japanese. It was found that the valency of the predicates is the underlying factor behind the trade-off relation between MDD and MHD in Japanese. Native speakers of Japanese regulate the linear complexity and hierarchical complexity through the valency of the predicates, and the relative sizes of MDD and MHD depend on whether the threshold of valency has been reached. Apart from the cognitive load, the valency of the predicates also affects the probability distributions of DD and HD. The effect of the valency of the predicates on the distribution of HD is greater than on that of DD, which leads to differences in their probability distributions and causes the mean of MDD to be lower than that of MHD.

[81] MaxPoolBERT: Enhancing BERT Classification via Layer- and Token-Wise Aggregation

Maike Behrendt, Stefan Sylvius Wagner, Stefan Harmeling

Main category: cs.CL

TL;DR: MaxPoolBERT enhances BERT’s classification by refining [CLS] token representations through layer-wise max-pooling and multi-head attention, improving performance on low-resource tasks without major model changes.

Details

Motivation: The [CLS] token in BERT is commonly used for classification but prior work shows other tokens and layers contain valuable contextual information that could enhance representations.

Method: Three lightweight extensions: (i) max-pooling [CLS] token across multiple layers, (ii) adding multi-head attention for [CLS] to attend over final layer, (iii) combining max-pooling across full sequence with MHA.

Result: MaxPoolBERT consistently outperforms standard BERT base model on low-resource tasks of GLUE benchmark, improving classification accuracy without requiring new pre-training.

Conclusion: Simple modifications to aggregate information across layers and tokens can significantly enhance BERT’s classification performance, especially for low-resource scenarios.

Abstract: The [CLS] token in BERT is commonly used as a fixed-length representation for classification tasks, yet prior work has shown that both other tokens and intermediate layers encode valuable contextual information. In this work, we study lightweight extensions to BERT that refine the [CLS] representation by aggregating information across layers and tokens. Specifically, we explore three modifications: (i) max-pooling the [CLS] token across multiple layers, (ii) enabling the [CLS] token to attend over the entire final layer using an additional multi-head attention (MHA) layer, and (iii) combining max-pooling across the full sequence with MHA. Our approach, called MaxPoolBERT, enhances BERT’s classification accuracy (especially on low-resource tasks) without requiring new pre-training or significantly increasing model size. Experiments on the GLUE benchmark show that MaxPoolBERT consistently achieves a better performance than the standard BERT base model on low resource tasks of the GLUE benchmark.

[82] Steering Large Language Models for Machine Translation Personalization

Daniel Scalena, Gabriele Sarti, Arianna Bisazza, Elisabetta Fersini, Malvina Nissim

Main category: cs.CL

TL;DR: This paper explores methods for personalizing machine translations to match specific human translator styles using few examples, focusing on literary translation and demonstrating that contrastive steering with sparse autoencoder latents provides efficient and effective style conditioning.

Details

Motivation: Large language models struggle with implicitly learning stylistic requirements from examples, particularly in literary translation where personal style matters but few examples are available.

Method: Evaluated various prompting strategies and inference-time interventions, with focus on contrastive steering using sparse autoencoder (SAE) latents to identify personalization properties.

Result: Contrastive SAE steering achieved robust style conditioning and translation quality with higher computational efficiency than prompting approaches, while affecting similar model activation layers as prompting.

Conclusion: Contrastive SAE steering is an effective method for personalizing translations with few examples, operating through similar mechanisms as prompting but with better computational efficiency.

Abstract: Large language models have simplified the production of personalized translations reflecting predefined stylistic constraints. However, these systems still struggle when stylistic requirements are implicitly represented by a set of examples, such as texts produced by a specific human translator. In this work, we explore various strategies for personalizing automatically generated translations when few examples are available, with a focus on the challenging domain of literary translation. We begin by determining the feasibility of the task and how style information is encoded within model representations. Then, we evaluate various prompting strategies and inference-time interventions for steering model generations towards a personalized style, with a particular focus on contrastive steering with sparse autoencoder (SAE) latents to identify salient personalization properties. We demonstrate that contrastive SAE steering yields robust style conditioning and translation quality, resulting in higher inference-time computational efficiency than prompting approaches. We further examine the impact of steering on model activations, finding that layers encoding personalization properties are impacted similarly by prompting and SAE steering, suggesting a similar mechanism at play.

Xuanming Zhang, Yuxuan Chen, Samuel Yeh, Sharon Li

Main category: cs.CL

TL;DR: MetaMind is a multi-agent framework that improves LLMs’ social reasoning by decomposing it into three stages: ToM hypothesis generation, moral refinement, and response generation with intent validation.

Details

Motivation: LLMs struggle with ambiguity and contextual nuance in human communication, particularly in inferring unspoken intentions, emotions, and beliefs (Theory of Mind).

Method: A multi-agent framework with three collaborative stages: Theory-of-Mind Agent generates mental state hypotheses, Moral Agent refines using cultural norms, and Response Agent generates contextually appropriate responses while validating intent alignment.

Result: Achieves state-of-the-art performance with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning, enabling LLMs to match human-level performance on key ToM tasks for the first time.

Conclusion: Advances AI systems toward human-like social intelligence with applications in empathetic dialogue and culturally sensitive interactions, demonstrating the necessity of all framework components.

Abstract: Human social interactions depend on the ability to infer others’ unspoken intentions, emotions, and beliefs-a cognitive skill grounded in the psychological concept of Theory of Mind (ToM). While large language models (LLMs) excel in semantic understanding tasks, they struggle with the ambiguity and contextual nuance inherent in human communication. To bridge this gap, we introduce MetaMind, a multi-agent framework inspired by psychological theories of metacognition, designed to emulate human-like social reasoning. MetaMind decomposes social understanding into three collaborative stages: (1) a Theory-of-Mind Agent generates hypotheses about user mental states (e.g., intent, emotion), (2) a Moral Agent refines these hypotheses using cultural norms and ethical constraints, and (3) a Response Agent generates contextually appropriate responses while validating alignment with inferred intent. Our framework achieves state-of-the-art performance across three challenging benchmarks, with 35.7% improvement in real-world social scenarios and 6.2% gain in ToM reasoning. Notably, it enables LLMs to match human-level performance on key ToM tasks for the first time. Ablation studies confirm the necessity of all components, which showcase the framework’s ability to balance contextual plausibility, social appropriateness, and user adaptation. This work advances AI systems toward human-like social intelligence, with applications in empathetic dialogue and culturally sensitive interactions. Code is available at https://github.com/XMZhangAI/MetaMind.

[84] DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue

Yichun Feng, Jiawei Wang, Lu Zhou, Zhen Lei, Yixue Li

Main category: cs.CL

TL;DR: A reinforcement learning-based multi-agent framework for medical consultations that enables LLMs to dynamically optimize questioning strategies through multi-turn interactions, outperforming existing models in diagnostic performance.

Details

Motivation: To address limitations of single-round consultation systems and traditional multi-turn dialogue models in clinical settings, which lack flexibility and intelligent information extraction capabilities.

Method: Proposed a reinforcement learning-based multi-agent collaborative framework where a doctor agent continuously optimizes questioning strategy through multi-turn interactions with a patient agent, using comprehensive rewards from a Consultation Evaluator.

Result: The approach outperforms existing models in both multi-turn reasoning capability and final diagnostic performance, and created MTMedDialog - the first English multi-turn medical consultation dataset for simulating patient interactions.

Conclusion: The framework shows immense practical value by reducing misdiagnosis risks, freeing clinicians for complex cases, and pioneering optimization of medical resource allocation to alleviate workforce shortages.

Abstract: Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges. Single-round consultation systems require patients to describe all symptoms upfront, leading to vague diagnosis with unclear complaints. Traditional multi-turn dialogue models, constrained by static supervised learning, lack flexibility and fail to intelligently extract key clinical information. To address these limitations, we propose \Ours{}, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty. The doctor agent continuously optimizes its questioning strategy within the RL framework through multi-turn interactions with the patient agent, dynamically adjusting its information-gathering path based on comprehensive rewards from the Consultation Evaluator. This RL fine-tuning mechanism enables LLMs to autonomously develop interaction strategies aligned with clinical reasoning logic, rather than superficially imitating patterns in existing dialogue data. Notably, we constructed MTMedDialog, the first English multi-turn medical consultation dataset capable of simulating patient interactions. Experiments demonstrate that \Ours{} outperforms existing models in both multi-turn reasoning capability and final diagnostic performance. This approach shows immense practical value by reducing misdiagnosis risks in time-pressured settings, freeing clinicians for complex cases, and pioneering a strategy to optimize medical resource allocation and alleviate workforce shortages. Code and data are available at https://github.com/JarvisUSTC/DoctorAgent-RL

[85] Leveraging Importance Sampling to Detach Alignment Modules from Large Language Models

Yi Liu, Dianqing Liu, Mingye Zhu, Junbo Guo, Yongdong Zhang, Zhendong Mao

Main category: cs.CL

TL;DR: Proposes Residual Alignment Model (RAM) that frames LLM alignment as importance sampling, enabling flexible alignment without retraining base models.

Details

Motivation: Traditional alignment methods require retraining large pretrained models, making it difficult to quickly adapt LLMs for diverse applications.

Method: Formalizes alignment as importance sampling with unaligned model as proposal distribution and alignment module as importance weight estimator. Uses sequence-level training and iterative token-level decoding to address latency issues.

Result: Experimental evaluations on open-source LLMs across instruction following, domain adaptation, and preference optimization tasks show consistent outperformance over baseline models.

Conclusion: RAM enables flexible and scalable alignment of LLMs without requiring retraining of base models, addressing key limitations of traditional alignment approaches.

Abstract: The widespread adoption of large language models (LLMs) across industries has increased the demand for high-quality and customizable outputs. However, traditional alignment methods often require retraining large pretrained models, making it difficult to quickly adapt and optimize LLMs for diverse applications. To address this limitation, we propose a novel \textit{Residual Alignment Model} (\textit{RAM}) that formalizes the alignment process as a type of importance sampling. In this framework, the unaligned upstream model serves as the proposal distribution, while the alignment process is framed as secondary sampling based on an autoregressive alignment module that acts as an estimator of the importance weights. This design enables a natural detachment of the alignment module from the target aligned model, improving flexibility and scalability. Based on this model, we derive an efficient sequence-level training strategy for the alignment module, which operates independently of the proposal module. Additionally, we develop a resampling algorithm with iterative token-level decoding to address the common first-token latency issue in comparable methods. Experimental evaluations on two leading open-source LLMs across diverse tasks, including instruction following, domain adaptation, and preference optimization, demonstrate that our approach consistently outperforms baseline models.

[86] The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models

Siqi Fan, Bowen Qin, Peng Han, Shuo Shang, Yequan Wang, Aixin Sun

Main category: cs.CL

TL;DR: Thinking models often overthink simple problems, wasting computation. COTHINK pipeline uses an instruct model to draft outlines and thinking model to expand them, reducing token usage by 21.1% while maintaining accuracy.

Details

Motivation: Existing thinking models trained with RL and backward-checking CoT suffer from overthinking - producing excessively long outputs on simple problems, wasting computation. Current evaluations based on token efficiency are incomplete as they neglect problem difficulty and intermediate computation costs.

Method: Formalize reasoning efficiency as relative measure between thinking and instruct models, treating instruct models as minimal-effort baseline. Propose COTHINK - a two-stage pipeline where instruct model drafts brief outline and thinking model expands it.

Result: Systematic study across four thinking models reveals: (i) instruct models achieve higher efficiency overall, (ii) problem difficulty affects efficiency - thinking models waste computation on easy problems but provide value on harder ones. COTHINK cuts token usage by 21.1% on GSM8K, MATH500, and AIME24 while keeping accuracy.

Conclusion: COTHINK pipeline effectively addresses overthinking problem in thinking models, significantly reducing computational costs while maintaining performance, and remains competitive with strong efficiency baselines.

Abstract: Recent thinking models trained with reinforcement learning and backward-checking CoT often suffer from overthinking: they produce excessively long outputs even on simple problems, wasting computation. Existing evaluations, based on token efficiency, give an incomplete view as they neglect problem difficulty and intermediate computation costs. We formalize reasoning efficiency as a relative measure between thinking and instruct models, treating instruct models as the minimal-effort baseline. A systematic study across four thinking models and multiple benchmarks reveals two consistent patterns: (i) instruct models achieve higher efficiency overall, and (ii) problem difficulty affects efficiency, with thinking models wasting computation on easy problems but providing value on harder ones. Building on this insight, we propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.

[87] Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection

Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun

Main category: cs.CL

TL;DR: ORION is a framework that refines teacher Chains-of-Thought (CoTs) through error-aware reflection to create more tailored supervision signals for distilling reasoning ability to Small Language Models (SLMs).

Details

Motivation: Existing methods use long-form CoTs as supervision for SLMs but these teachers are unaware of student model capacity, limiting effective utilization of reasoning traces.

Method: ORION refines teacher CoTs through an Error-Aware Reflection process that incorporates the student model’s own reasoning errors to construct more tailored supervision signals.

Result: Experiments on multiple mathematical reasoning benchmarks show ORION consistently improves performance by more than 2% over all baselines, with CoTs exhibiting higher coherence and logical consistency.

Conclusion: ORION effectively enhances reasoning ability distillation to SLMs by creating more tailored teacher CoTs through error-aware reflection, serving as more effective supervision signals for SFT.

Abstract: Large Language Models (LLMs) have exhibited strong reasoning capabilities and achieved remarkable performance in mathematical problem-solving tasks. Recently, distilling reasoning ability from long-form Chains-of-Thought (CoTs) has emerged as a promising approach for enhancing Small Language Models (SLMs). Existing studies typically treat SLMs as student models and use long-form CoTs as supervision signals for Supervised Fine-Tuning (SFT) to transfer reasoning ability. However, such long-form CoT teachers are usually unaware of the student model’s capacity, which limits the effective utilization of the provided reasoning traces. To overcome this limitation, we propose errOr-aware self-ReflectION (ORION), a framework that refines teacher CoTs through an Error-Aware Reflection process. ORION enables the student model to construct more tailored teacher CoTs by refining teacher CoTs and incorporating its own reasoning errors. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines. Further analysis reveals that the CoTs constructed by ORION exhibit higher coherence and logical consistency, thereby serving as more effective supervision signals for SFT. All codes are available at https://github.com/NEUIR/ORION.git.

[88] DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: DefenderBench is an open-source toolkit for evaluating LLM agents in cybersecurity tasks, showing Claude-3.7-sonnet performs best with 81.65 score.

Details

Motivation: LLM agents have impressive language capabilities but their potential in cybersecurity remains underexplored, requiring practical evaluation tools.

Method: Developed DefenderBench toolkit with environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment using standardized agentic framework.

Result: Claude-3.7-sonnet performed best (81.65), followed by Claude-3.7-sonnet-think (78.40), while best open-weight model Llama 3.3 70B scored 71.81.

Conclusion: DefenderBench provides affordable, accessible evaluation framework for cybersecurity LLM agents, enabling fair comparisons and promoting reproducibility.

Abstract: Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench’s modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.

[89] Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment

Priyanka Dey, Yugal Khanter, Aayush Bothra, Jieyu Zhao, Emilio Ferrara

Main category: cs.CL

TL;DR: CulturalPersonas is a new benchmark for evaluating LLM personality expression in culturally appropriate contexts across six countries, showing improved alignment with human personality distributions.

Details

Motivation: As LLMs become central to interactive applications, there's a need for culturally appropriate personality expression, but current evaluations overlook the interplay between culture and personality.

Method: Created CulturalPersonas benchmark with 3,000 scenario-based questions across six countries, using both multiple-choice and open-ended response formats to evaluate three LLMs in culturally grounded contexts.

Result: CulturalPersonas improves alignment with country-specific human personality distributions (over 20% reduction in Wasserstein distance) and elicits more expressive, culturally coherent outputs compared to existing benchmarks.

Conclusion: CulturalPersonas bridges personality expression and cultural nuance, paving the way for more socially intelligent and globally adaptive LLMs by offering new directions for aligning LLMs to global behavioral norms.

Abstract: As LLMs become central to interactive applications, ranging from tutoring to mental health, the ability to express personality in culturally appropriate ways is increasingly important. While recent works have explored personality evaluation of LLMs, they largely overlook the interplay between culture and personality. To address this, we introduce CulturalPersonas, the first large-scale benchmark with human validation for evaluating LLMs’ personality expression in culturally grounded, behaviorally rich contexts. Our dataset spans 3,000 scenario-based questions across six diverse countries, designed to elicit personality through everyday scenarios rooted in local values. We evaluate three LLMs, using both multiple-choice and open-ended response formats. Our results show that CulturalPersonas improves alignment with country-specific human personality distributions (over a 20% reduction in Wasserstein distance across models and countries) and elicits more expressive, culturally coherent outputs compared to existing benchmarks. CulturalPersonas surfaces meaningful modulated trait outputs in response to culturally grounded prompts, offering new directions for aligning LLMs to global norms of behavior. By bridging personality expression and cultural nuance, we envision that CulturalPersonas will pave the way for more socially intelligent and globally adaptive LLMs.

[90] KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model

Xinping Zhao, Xinshuo Hu, Zifei Shan, Shouzheng Huang, Yao Zhou, Xin Zhang, Zetian Sun, Zhenyu Liu, Dongfang Li, Xinyuan Wei, Youcheng Pan, Yang Xiang, Meishan Zhang, Haofen Wang, Jun Yu, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: KaLM-Embedding-V2 is a series of compact 0.5B parameter embedding models that achieve state-of-the-art performance through superior training techniques and high-quality data curation, outperforming models of similar size and rivaling much larger models.

Details

Motivation: Current LLM-based text embedding models focus mainly on data scaling or synthesis, with limited exploration of training techniques and data quality, which constrains their performance.

Method: Uses 0.5B compact architecture with mean-pooling and bidirectional representation learning. Implements progressive multi-stage training: pre-training on weakly supervised data, fine-tuning with high-quality supervised data, and contrastive distillation with soft signals. Employs focal-style reweighting and online hard-negative mixing. Curates over 20 pre-training and 100 fine-tuning categories with task-specific instructions and hard-negative mining.

Result: Achieves state-of-the-art performance on Massive Text Embedding Benchmark, outperforming comparable-sized models and rivaling models 3-26x larger.

Conclusion: Sets a new standard for versatile and compact embedding models under 1B parameters by systematically addressing training techniques and data quality.

Abstract: Recent advancements in Large Language Models (LLMs)-based text embedding models primarily focus on data scaling or synthesis, yet limited exploration of training techniques and data quality, thereby constraining performance. In this work, we propose KaLM-Embedding-V2, a series of versatile and compact embedding models, systematically incentivizing advanced embedding capability in LLMs by superior training techniques and high-quality data. For model architecture, we implement the models on a 0.5B compact size with simple mean-pooling to produce fixed-length embeddings and remove the causal attention mask to enable fully bidirectional representation learning. For training techniques, we propose a progressive multi-stage training pipeline: pre-training on weakly supervised large-scale datasets, fine-tuning with supervised high-quality datasets, and contrastive distillation with fine-grained soft signals, integrated with focal-style reweighting and online hard-negative mixing to emphasize difficult samples and enrich hard negatives, respectively. For training data, we curate over 20 categories for pre-training and 100 categories for fine-tuning and contrastive distillation, to improve both performance and generalization, leveraging task-specific instructions, hard-negative mining, and example-based multi-class labeling to ensure high quality. Combining these techniques, our KaLM-Embedding-V2 series achieves state-of-the-art performance on the Massive Text Embedding Benchmark, outperforming models of comparable size and rivaling models 3-26x larger, setting a new standard for versatile and compact embedding models under 1B parameters.

[91] VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents

Sam Yu-Te Lee, Chenyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma

Main category: cs.CL

TL;DR: VIDEE is a system that enables entry-level data analysts to perform advanced text analytics using intelligent agents through a human-agent collaboration workflow with decomposition, execution, and evaluation stages.

Details

Motivation: Traditional text analytics requires specialized NLP knowledge, creating barriers for entry-level analysts. Recent LLM advances enable more accessible text analysis.

Method: VIDEE implements a human-agent collaboration workflow with three stages: Decomposition (using human-in-the-loop Monte-Carlo Tree Search), Execution (generating executable text analytics pipelines), and Evaluation (LLM-based evaluation with visualizations).

Result: Two quantitative experiments evaluated VIDEE’s effectiveness and analyzed agent errors. A user study with participants from no experience to experts demonstrated system usability and revealed distinct user behavior patterns.

Conclusion: The findings identify design implications for human-agent collaboration, validate VIDEE’s utility for non-experts, and inform future improvements to intelligent text analytics systems.

Abstract: Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, information extraction, etc.). We introduce VIDEE, a system that supports entry-level data analysts to conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaroration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE’s effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience – from none to expert – demonstrates the system’s usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.

[92] Lost at the Beginning of Reasoning

Baohao Liao, Xinyi Chen, Sara Rajaee, Yuhui Xu, Christian Herold, Anders Søgaard, Maarten de Rijke, Christof Monz

Main category: cs.CL

TL;DR: The first reasoning step in chain-of-thought reasoning has disproportionate influence on final predictions, and a reward model-based sampling strategy can reduce inference costs by 70% without accuracy loss.

Details

Motivation: Despite advancements in LLM reasoning capabilities, self-correction abilities during long chain-of-thought reasoning remain underexplored, and models often engage in redundant reasoning (overthinking).

Method: Proposed an efficient sampling strategy that uses a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones.

Result: Achieved up to 70% reduction in inference cost without sacrificing accuracy, with the phenomenon consistently observed across various state-of-the-art reasoning models.

Conclusion: The first reasoning step plays a central role in generating high-quality reasoning trajectories, enabling significantly efficient sampling in chain-of-thought reasoning.

Abstract: Recent advancements in large language models (LLMs) have significantly advanced complex reasoning capabilities, particularly through extended chain-of-thought (CoT) reasoning that incorporates mechanisms such as backtracking, self-reflection, and self-correction. Despite these developments, the self-correction abilities of LLMs during long CoT reasoning remain underexplored. And recent findings on overthinking suggest that such models often engage in unnecessarily redundant reasoning. In this work, we empirically show that the first reasoning step exerts a disproportionately large influence on the final prediction. I.e., errors introduced at this stage can substantially degrade subsequent reasoning quality. This phenomenon is consistently observed across various state-of-the-art open- and closed-source reasoning models. Leveraging this insight, we propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps while discarding suboptimal ones, achieving up to a 70% reduction in inference cost without sacrificing any accuracy. Our work highlights the central role of the first reasoning step in generating a high-quality reasoning trajectory, and thus enabling significantly efficient sampling.

[93] SAFER: Probing Safety in Reward Models with Sparse Autoencoder

Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang

Main category: cs.CL

TL;DR: SAFER is a framework that uses Sparse Autoencoders to interpret and improve reward models in RLHF by identifying human-interpretable features in reward model activations, enabling targeted data manipulation to enhance or degrade safety alignment.

Details

Motivation: Reward models in RLHF are opaque despite being crucial for aligning LLMs with human values, making it difficult to understand and improve safety-relevant decision-making.

Method: Uses Sparse Autoencoders (SAEs) to uncover interpretable features in reward model activations, quantifies feature salience through activation differences between chosen/rejected responses, and designs targeted data poisoning/denoising strategies.

Result: SAFER can precisely degrade or enhance safety alignment with minimal data modification while maintaining general chat performance, demonstrating effective interpretation and refinement of reward models.

Conclusion: The approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks, providing tools for better understanding and improving safety mechanisms.

Abstract: Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (\textbf{SAFER}), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. \textit{This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.}

[94] Knowledge Fusion via Bidirectional Information Aggregation

Songlin Zhai, Guilin Qi, Yue Wang, Yuan Meng

Main category: cs.CL

TL;DR: KGA is a novel framework that dynamically integrates external knowledge graphs into LLMs at inference-time without parameter modification, using two synergistic attention pathways inspired by neuroscience.

Details

Motivation: To bridge the gap between dynamic knowledge in knowledge graphs and static LLMs, addressing issues of outdated knowledge in time-sensitive web applications while avoiding parameter-invasive fine-tuning that causes catastrophic forgetting.

Method: Introduces two synergistic pathways: bottom-up knowledge fusion (input-driven KG fusion) and top-down attention guidance (goal-directed verification process), rewiring self-attention without parameter modification.

Result: Extensive experiments on four benchmarks verify KGA’s strong fusion performance and efficiency, supporting real-time knowledge fusion.

Conclusion: KGA provides an effective framework for dynamic KG-LLM integration that maintains LLM capabilities while enabling real-time knowledge updates without parameter changes.

Abstract: Knowledge graphs (KGs) are the cornerstone of the semantic web, offering up-to-date representations of real-world entities and relations. Yet large language models (LLMs) remain largely static after pre-training, causing their internal knowledge to become outdated and limiting their utility in time-sensitive web applications. To bridge this gap between dynamic knowledge and static models, a prevalent approach is to enhance LLMs with KGs. However, prevailing methods typically rely on parameter-invasive fine-tuning, which risks catastrophic forgetting and often degrades LLMs’ general capabilities. Moreover, their static integration frameworks cannot keep pace with the continuous evolution of real-world KGs, hindering their deployment in dynamic web environments. To bridge this gap, we introduce KGA (\textit{\underline{K}nowledge \underline{G}raph-guided \underline{A}ttention}), a novel framework that dynamically integrates external KGs into LLMs exclusively at inference-time without any parameter modification. Inspired by research on neuroscience, we rewire the self-attention module by innovatively introducing two synergistic pathways: a \textit{bottom-up knowledge fusion} pathway and a \textit{top-down attention guidance} pathway. The \textit{bottom-up pathway} dynamically integrates external knowledge into input representations via input-driven KG fusion, which is akin to the \textit{stimulus-driven attention process} in the human brain. Complementarily, the \textit{top-down pathway} aims to assess the contextual relevance of each triple through a \textit{goal-directed verification process}, thereby suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. By synergistically combining these two pathways, our method supports real-time knowledge fusion. Extensive experiments on four benchmarks verify KGA’s strong fusion performance and efficiency.

[95] A Comprehensive Taxonomy of Negation for NLP and Neural Retrievers

Roxana Petcu, Samarth Bhargav, Maarten de Rijke, Evangelos Kanoulas

Main category: cs.CL

TL;DR: The paper studies how neural models handle negation in information retrieval, introduces a negation taxonomy, creates benchmark datasets, and proposes a logic-based classification to analyze model performance on negation.

Details

Motivation: Neural models underperform on queries containing negation despite learning contextual embeddings, creating a need to better understand and improve their handling of negation in reasoning tasks.

Method: Developed a taxonomy of negation from philosophical, linguistic, and logical definitions; generated two benchmark datasets; proposed a logic-based classification mechanism to analyze retrieval model performance.

Result: The taxonomy produces balanced data distribution over negation types, leading to faster convergence on the NevIR dataset. The classification schema reveals coverage of negation types in existing datasets.

Conclusion: The proposed framework provides better training setups and insights into factors affecting generalization of fine-tuned models on negation, improving neural model performance on negation-containing queries.

Abstract: Understanding and solving complex reasoning tasks is vital for addressing the information needs of a user. Although dense neural models learn contextualised embeddings, they still underperform on queries containing negation. To understand this phenomenon, we study negation in both traditional neural information retrieval and LLM-based models. We (1) introduce a taxonomy of negation that derives from philosophical, linguistic, and logical definitions; (2) generate two benchmark datasets that can be used to evaluate the performance of neural information retrieval models and to fine-tune models for a more robust performance on negation; and (3) propose a logic-based classification mechanism that can be used to analyze the performance of retrieval models on existing datasets. Our taxonomy produces a balanced data distribution over negation types, providing a better training setup that leads to faster convergence on the NevIR dataset. Moreover, we propose a classification schema that reveals the coverage of negation types in existing datasets, offering insights into the factors that might affect the generalization of fine-tuned models on negation.

[96] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: CAMERA introduces micro-expert as a finer-grained compression unit for Mixture-of-Experts (MoE) LLMs, addressing computational and storage overheads through training-free redundancy identification and structured pruning/quantization methods.

Details

Motivation: MoE models suffer from substantial computational and storage overheads, with performance gains not scaling proportionally with parameter growth. Existing parameter reduction methods face challenges in both performance and computational efficiency.

Method: Proposes CAMERA framework that views MoE layers as mixtures of micro-experts, identifies redundancy via training-free analysis, and introduces CAMERA-P for structured micro-expert pruning and CAMERA-Q for mixed-precision quantization of micro-experts.

Result: CAMERA-P outperforms strong baselines under 20%-60% pruning ratios across nine downstream tasks. CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing matrix- and channel-level methods. Enables complete micro-expert analysis of Qwen2-57B-A14B in <5 minutes on single A100 GPU.

Conclusion: The micro-expert perspective provides an effective framework for MoE model compression, achieving significant efficiency improvements while maintaining performance through structured pruning and quantization techniques.

Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

[97] EMSEdit: Efficient Multi-Step Meta-Learning-based Model Editing

Xiaopeng Li, Shasha Li, Xi Wang, Shezheng Song, Bin Ji, Shangwen Wang, Jun Ma, Xiaodong Liu, Mina Liu, Jie Yu

Main category: cs.CL

TL;DR: EMSEdit is an efficient model editing method that uses multi-step backpropagation and norm-based regularization to improve editing performance in low-data regimes while reducing training costs.

Details

Motivation: Existing meta-learning-based model editing (MLME) methods struggle with low-data scenarios and incur high training costs due to KL divergence usage.

Method: EMSEdit leverages multi-step backpropagation to capture gradient-activation mapping patterns, performs multi-step edits per sample to enhance performance with limited data, and uses norm-based regularization to preserve unedited knowledge.

Result: Experiments on two datasets and three LLMs show EMSEdit consistently outperforms state-of-the-art methods in both sequential and batch editing, and demonstrates robustness in multi-hop reasoning tasks.

Conclusion: EMSEdit provides an effective and efficient solution for model editing that works well in low-data regimes and can be integrated into existing approaches for additional performance gains.

Abstract: Large Language Models (LLMs) power numerous AI applications, yet updating their knowledge remains costly. Model editing provides a lightweight alternative through targeted parameter modifications, with meta-learning-based model editing (MLME) demonstrating strong effectiveness and efficiency. However, we find that MLME struggles in low-data regimes and incurs high training costs due to the use of KL divergence. To address these issues, we propose $\textbf{E}$fficient $\textbf{M}$ulti-$\textbf{S}$tep $\textbf{Edit (EMSEdit)}$, which leverages multi-step backpropagation (MSBP) to effectively capture gradient-activation mapping patterns within editing samples, performs multi-step edits per sample to enhance editing performance under limited data, and introduces norm-based regularization to preserve unedited knowledge while improving training efficiency. Experiments on two datasets and three LLMs show that EMSEdit consistently outperforms state-of-the-art methods in both sequential and batch editing. Moreover, MSBP can be seamlessly integrated into existing approaches to yield additional performance gains. Further experiments on a multi-hop reasoning editing task demonstrate EMSEdit’s robustness in handling complex edits, while ablation studies validate the contribution of each design component. Our code is available at https://github.com/xpq-tech/emsedit.

[98] The Cultural Gene of Large Language Models: A Study on the Impact of Cross-Corpus Training on Model Values and Biases

Emanuel Z. Fenech-Borg, Tilen P. Meznaric-Kos, Milica D. Lekovic-Bojovic, Arni J. Hentze-Djurhuus

Main category: cs.CL

TL;DR: LLMs inherit cultural values from training data, with GPT-4 showing Western individualistic/low-power-distance tendencies and ERNIE Bot showing Eastern collectivistic/high-power-distance tendencies, highlighting the need for culturally aware AI deployment.

Details

Motivation: To investigate the cultural biases embedded in LLMs and understand how they inherit value orientations from their training corpora, addressing concerns about algorithmic cultural hegemony.

Method: Created Cultural Probe Dataset (CPD) with 200 prompts targeting Individualism-Collectivism and Power Distance dimensions, compared GPT-4 and ERNIE Bot using standardized zero-shot prompts with human annotation and Cultural Alignment Index against Hofstede’s national scores.

Result: Significant cultural divergence: GPT-4 showed individualistic/low-power-distance tendencies (IDV=1.21, PDI=-1.05), ERNIE Bot showed collectivistic/high-power-distance tendencies (IDV=-0.89, PDI=0.76). GPT-4 aligned more with USA values, ERNIE Bot with China values.

Conclusion: LLMs function as statistical mirrors of their cultural training corpora, necessitating culturally aware evaluation and deployment to prevent algorithmic cultural hegemony.

Abstract: Large language models (LLMs) are deployed globally, yet their underlying cultural and ethical assumptions remain underexplored. We propose the notion of a “cultural gene” – a systematic value orientation that LLMs inherit from their training corpora – and introduce a Cultural Probe Dataset (CPD) of 200 prompts targeting two classic cross-cultural dimensions: Individualism-Collectivism (IDV) and Power Distance (PDI). Using standardized zero-shot prompts, we compare a Western-centric model (GPT-4) and an Eastern-centric model (ERNIE Bot). Human annotation shows significant and consistent divergence across both dimensions. GPT-4 exhibits individualistic and low-power-distance tendencies (IDV score approx 1.21; PDI score approx -1.05), while ERNIE Bot shows collectivistic and higher-power-distance tendencies (IDV approx -0.89; PDI approx 0.76); differences are statistically significant (p < 0.001). We further compute a Cultural Alignment Index (CAI) against Hofstede’s national scores and find GPT-4 aligns more closely with the USA (e.g., IDV CAI approx 0.91; PDI CAI approx 0.88) whereas ERNIE Bot aligns more closely with China (IDV CAI approx 0.85; PDI CAI approx 0.81). Qualitative analyses of dilemma resolution and authority-related judgments illustrate how these orientations surface in reasoning. Our results support the view that LLMs function as statistical mirrors of their cultural corpora and motivate culturally aware evaluation and deployment to avoid algorithmic cultural hegemony.

[99] Diffusion Language Models Know the Answer Before Decoding

Pengxiang Li, Yefan Zhou, Dilxat Muhtar, Lu Yin, Shilin Yan, Li Shen, Yi Liang, Soroush Vosoughi, Shiwei Liu

Main category: cs.CL

TL;DR: Prophet is a training-free decoding method for diffusion language models that enables early commit decoding by leveraging early answer convergence, reducing decoding steps by up to 3.4x while maintaining quality.

Details

Motivation: Diffusion language models are slower than autoregressive models due to bidirectional attention costs and many refinement steps needed for high-quality outputs.

Method: Prophet dynamically decides whether to continue refinement or decode all remaining tokens at once using the confidence gap between top-2 prediction candidates, requiring no additional training.

Result: On GSM8K and MMLU, up to 97% and 99% of instances can be decoded correctly using only half the refinement steps. Prophet reduces decoding steps by up to 3.4x while preserving generation quality.

Conclusion: Early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, recasting DLM decoding as a problem of when to stop sampling.

Abstract: Diffusion language models (DLMs) have recently emerged as an alternative to autoregressive approaches, offering parallel sequence generation and flexible token orders. However, their inference remains slower than that of autoregressive models, primarily due to the cost of bidirectional attention and the large number of refinement steps required for high quality outputs. In this work, we highlight and leverage an overlooked property of DLMs early answer convergence: in many cases, the correct answer can be internally identified by half steps before the final decoding step, both under semi-autoregressive and random remasking schedules. For example, on GSM8K and MMLU, up to 97% and 99% of instances, respectively, can be decoded correctly using only half of the refinement steps. Building on this observation, we introduce Prophet, a training-free fast decoding paradigm that enables early commit decoding. Specifically, Prophet dynamically decides whether to continue refinement or to go “all-in” (i.e., decode all remaining tokens in one step), using the confidence gap between the top-2 prediction candidates as the criterion. It integrates seamlessly into existing DLM implementations, incurs negligible overhead, and requires no additional training. Empirical evaluations of LLaDA-8B and Dream-7B across multiple tasks show that Prophet reduces the number of decoding steps by up to 3.4x while preserving high generation quality. These results recast DLM decoding as a problem of when to stop sampling, and demonstrate that early decode convergence provides a simple yet powerful mechanism for accelerating DLM inference, complementary to existing speedup techniques. Our code is publicly available at https://github.com/pixeli99/Prophet.

[100] DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL

Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong

Main category: cs.CL

TL;DR: DeepDive advances deep search agents by automatically synthesizing complex questions from knowledge graphs and using multi-turn reinforcement learning to improve LLMs’ long-horizon reasoning with browsing tools.

Details

Motivation: Open LLMs perform poorly as deep search agents due to limited long-horizon reasoning capacity with browsing tools and lack of sufficiently difficult supervised data.

Method: 1) Automatically synthesize complex questions from open knowledge graphs; 2) Apply end-to-end multi-turn RL with redundancy penalty to discourage repeated queries; 3) Enable test-time scaling of tool calls and parallel sampling.

Result: DeepDive-32B achieves new open-source competitive results on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. Multi-turn RL training significantly improves deep search ability across multiple benchmarks.

Conclusion: DeepDive effectively addresses the challenges of limited reasoning capacity and insufficient training data for deep search agents, demonstrating improved performance through automated data synthesis and multi-turn RL training.

Abstract: Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs’ long-horizon reasoning with deep search. To encourage diversity and reduce redundancy, we design a redundancy penalty that discourages repeated similar queries. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.

[101] FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Bowen Qin, Chen Yue, Fang Yin, Hui Wang, JG Yao, Jiakang Liu, Jing-Shu Zheng, Miguel Hu Chen, Richeng Xuan, Shibei Meng, Shiqi Zhou, Teng Dai, Tong-Shuai Ren, Wei Cui, Xi Yang, Xialin Du, Xiaojing Xu, Xue Sun, Xuejing Li, Yaming Liu, Yesheng Liu, Ying Liu, Yonghua Lin, Yu Zhao, Yunduo Zhang, Yuwen Luo, Zheqi He, Zhiyuan He, Zhongyuan Wang

Main category: cs.CL

TL;DR: The paper presents ROME, a contamination-free evaluation benchmark for large reasoning models (LRMs) that tests reasoning from visual clues, with preliminary findings from moderate-scale evaluation.

Details

Motivation: To provide a contamination-free evaluation framework for current large reasoning models, addressing the need for reliable assessment of reasoning capabilities from visual information.

Method: Developed ROME benchmark specifically designed for vision language models, conducted moderate-scale evaluation with contamination-free approach, and released evaluation data publicly.

Result: Preliminary findings from the evaluation of current large reasoning models using the ROME benchmark, with benchmark and evaluation data made available online.

Conclusion: The ROME benchmark provides a valuable tool for assessing reasoning capabilities in vision language models, with ongoing updates and resources available through the project website.

Abstract: We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/

[102] LLMs4All: A Systematic Review of Large Language Models Across Academic Disciplines

Yanfang Ye, Zheyuan Zhang, Tianyi Ma, Zehong Wang, Yiyang Li, Shifu Hou, Weixiang Sun, Kaiwen Shi, Yijun Ma, Wei Song, Ahmed Abbasi, Ying Cheng, Jane Cleland-Huang, Steven Corcelli, Robert Goulding, Ming Hu, Ting Hua, John Lalor, Fang Liu, Tengfei Luo, Ed Maginn, Nuno Moniz, Jason Rohr, Brett Savoie, Daniel Slate, Matthew Webber, Olaf Wiest, Johnny Zhang, Nitesh V. Chawla

Main category: cs.CL

TL;DR: This paper provides an overview of state-of-the-art Large Language Models (LLMs) and their integration across diverse academic disciplines, exploring their impacts, limitations, and future directions in the generative AI era.

Details

Motivation: The impressive performance of LLMs like ChatGPT on various language tasks has demonstrated their potential for broader real-world applications across multiple domains, inspiring this comprehensive review of their interdisciplinary integration.

Method: The paper offers a systematic overview and review of LLMs’ integration into three main academic categories: (1) arts, letters, and law; (2) economics and business; and (3) science and engineering, exploring their applications and impacts in each field.

Result: The review examines how LLMs are shaping research and practice across diverse disciplines, providing key observations and insights about their current applications and transformative potential.

Conclusion: The comprehensive analysis of LLMs’ interdisciplinary integration helps researchers and practitioners understand how to leverage these models to advance their work in various real-world applications, while also highlighting key limitations, challenges, and future directions for generative AI.

Abstract: Cutting-edge Artificial Intelligence (AI) techniques keep reshaping our view of the world. For example, Large Language Models (LLMs) based applications such as ChatGPT have shown the capability of generating human-like conversation on extensive topics. Due to the impressive performance on a variety of language-related tasks (e.g., open-domain question answering, translation, and document summarization), one can envision the far-reaching impacts that can be brought by the LLMs with broader real-world applications (e.g., customer service, education and accessibility, and scientific discovery). Inspired by their success, this paper will offer an overview of state-of-the-art LLMs and their integration into a wide range of academic disciplines, including: (1) arts, letters, and law (e.g., history, philosophy, political science, arts and architecture, law), (2) economics and business (e.g., finance, economics, accounting, marketing), and (3) science and engineering (e.g., mathematics, physics and mechanical engineering, chemistry and chemical engineering, life sciences and bioengineering, earth sciences and civil engineering, computer science and electrical engineering). Integrating humanity and technology, in this paper, we will explore how LLMs are shaping research and practice in these fields, while also discussing key limitations, open challenges, and future directions in the era of generative AI. The review of how LLMs are engaged across disciplines-along with key observations and insights-can help researchers and practitioners interested in exploiting LLMs to advance their works in diverse real-world applications.

[103] Responsible AI Technical Report

KT, :, Yunjin Park, Jungwon Yoon, Junhyung Moon, Myunggyo Oh, Wonhyuk Lee, Sujin Kim Youngchol Kim, Eunmi Kim, Hyoungjun Park, Eunyoung Shin, Wonyoung Lee, Somin Lee, Minwook Ju, Minsung Noh, Dongyoung Jeong, Jeongyeop Kim, Wanjin Park, Soonmin Bae

Main category: cs.CL

TL;DR: KT developed a Responsible AI assessment methodology and risk mitigation technologies, including a proprietary Guardrail system called SafetyGuard, to ensure AI safety and regulatory compliance.

Details

Motivation: To ensure safety and reliability of AI services by addressing regulatory compliance needs and managing potential risks throughout AI development and operation lifecycle.

Method: Developed a unique RAI assessment methodology based on AI risk taxonomy, analyzed Basic Act on AI implementation and global governance trends, and created practical tools including SafetyGuard for real-time harmful response blocking.

Result: Created a systematic approach to identify and manage AI risks, developed reliable assessment methodology for model safety verification, and released proprietary Guardrail technology for real-time risk mitigation.

Conclusion: The research outcomes provide valuable insights for organizations developing Responsible AI and support safety enhancement in the domestic AI development ecosystem.

Abstract: KT developed a Responsible AI (RAI) assessment methodology and risk mitigation technologies to ensure the safety and reliability of AI services. By analyzing the Basic Act on AI implementation and global AI governance trends, we established a unique approach for regulatory compliance and systematically identify and manage all potential risk factors from AI development to operation. We present a reliable assessment methodology that systematically verifies model safety and robustness based on KT’s AI risk taxonomy tailored to the domestic environment. We also provide practical tools for managing and mitigating identified AI risks. With the release of this report, we also release proprietary Guardrail : SafetyGuard that blocks harmful responses from AI models in real-time, supporting the enhancement of safety in the domestic AI development ecosystem. We also believe these research outcomes provide valuable insights for organizations seeking to develop Responsible AI.

[104] LUMINA: Detecting Hallucinations in RAG System with Context-Knowledge Signals

Samuel Yeh, Sharon Li, Tanwi Mallick

Main category: cs.CL

TL;DR: LUMINA is a framework that detects hallucinations in RAG systems by measuring context-knowledge signals through distributional distance for external context and token evolution across transformer layers for internal knowledge, achieving superior performance without extensive tuning.

Details

Motivation: RAG-based LLMs still hallucinate despite having correct context, due to imbalance between external context and internal knowledge usage. Existing detection methods require extensive hyperparameter tuning, limiting generalizability.

Method: Quantifies external context utilization via distributional distance, measures internal knowledge utilization by tracking how predicted tokens evolve across transformer layers, and introduces statistical validation framework.

Result: Achieves consistently high AUROC and AUPRC scores, outperforming prior methods by up to +13% AUROC on HalluRAG benchmark. Remains robust under relaxed assumptions about retrieval quality and model matching.

Conclusion: LUMINA offers an effective and practical solution for hallucination detection in RAG systems, combining strong performance with robustness and reduced tuning requirements.

Abstract: Retrieval-Augmented Generation (RAG) aims to mitigate hallucinations in large language models (LLMs) by grounding responses in retrieved documents. Yet, RAG-based LLMs still hallucinate even when provided with correct and sufficient context. A growing line of work suggests that this stems from an imbalance between how models use external context and their internal knowledge, and several approaches have attempted to quantify these signals for hallucination detection. However, existing methods require extensive hyperparameter tuning, limiting their generalizability. We propose LUMINA, a novel framework that detects hallucinations in RAG systems through context-knowledge signals: external context utilization is quantified via distributional distance, while internal knowledge utilization is measured by tracking how predicted tokens evolve across transformer layers. We further introduce a framework for statistically validating these measurements. Experiments on common RAG hallucination benchmarks and four open-source LLMs show that LUMINA achieves consistently high AUROC and AUPRC scores, outperforming prior utilization-based methods by up to +13% AUROC on HalluRAG. Moreover, LUMINA remains robust under relaxed assumptions about retrieval quality and model matching, offering both effectiveness and practicality.

[105] Large language models management of medications: three performance analyses

Kelli Henry, Steven Xu, Kaitlin Blotske, Moriah Cargile, Erin F. Barreto, Brian Murray, Susan Smith, Seth R. Bauer, Xingmeng Zhao, Adeleine Tilley, Yanjun Gao, Tianming Liu, Sunghwan Sohn, Andrea Sikora

Main category: cs.CL

TL;DR: GPT-4o performs poorly on medication management tasks including drug-formulation matching (49% accuracy), drug-drug interaction identification (54.7% accuracy), and medication order preparation (65.8% error-free orders).

Details

Motivation: To evaluate LLM consistency in recommending appropriate medication regimens, as medication management requires complex synthesis of drug information and order instructions for safe use.

Method: Tested GPT-4o on three medication tasks: drug-formulation matching, drug-drug interaction identification, and medication order preparation. Evaluated using clinician assessment and standard LLM metrics (TF-IDF, Levenshtein similarity, ROUGE scores).

Result: Poor performance across all tasks: 49% accuracy for drug-formulation matching with omissions and hallucinations; 54.7% accuracy for DDI identification; 65.8% of medication orders were error-free.

Conclusion: LLMs need domain-specific training with clinician-annotated datasets and comprehensive evaluation frameworks for medication management tasks due to consistently poor performance on basic medication tasks.

Abstract: Purpose: Large language models (LLMs) have proven performance for certain diagnostic tasks, however limited studies have evaluated their consistency in recommending appropriate medication regimens for a given diagnosis. Medication management is a complex task that requires synthesis of drug formulation and complete order instructions for safe use. Here, the performance of GPT 4o, an LLM available with ChatGPT, was tested for three medication management tasks. Methods: GPT-4o performance was tested using three medication tasks: identifying available formulations for a given generic drug name, identifying drug-drug interactions (DDI) for a given medication regimen, and preparing a medication order for a given generic drug name. For each experiment, the models raw text response was captured exactly as returned and evaluated using clinician evaluation in addition to standard LLM metrics, including Term Frequency-Inverse Document Frequency (TF IDF) vectors, normalized Levenshtein similarity, and Recall-Oriented Understudy for Gisting Evaluation (ROUGE 1/ROUGE L F1) between each response and its reference string. Results: For the first task of drug-formulation matching, GPT-4o had 49% accuracy for generic medications being matched to all available formulations, with an average of 1.23 omissions per medication and 1.14 hallucinations per medication. For the second task of drug-drug interaction identification, the accuracy was 54.7% for identifying the DDI pair. For the third task, GPT-4o generated order sentences containing no medication or abbreviation errors in 65.8% of cases. Conclusions: Model performance for basic medication tasks was consistently poor. This evaluation highlights the need for domain-specific training through clinician-annotated datasets and a comprehensive evaluation framework for benchmarking performance.

Xuanming Zhang, Yuxuan Chen, Samuel Yeh, Sharon Li

Main category: cs.CL

TL;DR: CooT is a decoding-time framework that adds explicit cognitive self-monitoring to LLMs using a Generator-Perceiver architecture with structured principles hierarchy, enabling dynamic safety interventions during inference without model retraining.

Details

Motivation: Current LLM alignment strategies embed safety into model weights, making controls implicit, static, and difficult to modify, creating a need for more flexible and explicit safety mechanisms.

Method: CooT couples a standard text Generator with a cognitive Perceiver that monitors generation using a precedence-based hierarchy of principles. When violations are detected, it rolls back generation and regenerates with injected guidance combining universal social priors and context-specific warnings.

Result: Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

Conclusion: CooT transforms alignment from a fixed property into an explicit, dynamic, and auditable process during inference, allowing flexible policy updates without model retraining.

Abstract: Large language models (LLMs) excel at complex reasoning but can still exhibit harmful behaviors. Current alignment strategies typically embed safety into model weights, making these controls implicit, static, and difficult to modify. This paper introduces Cognition-of-Thought (CooT), a novel decoding-time framework that equips LLMs with an explicit cognitive self-monitoring loop. CooT couples a standard text Generator with a cognitive Perceiver that continuously monitors the unfolding sequence. The Perceiver uses a structured, precedence-based hierarchy of principles (e.g., safety over obedience) to detect potential misalignments as they arise. When violations are flagged, CooT intervenes by rolling back the generation to the point of error and regenerating under injected guidance that combines universal social priors with context-specific warnings. CooT thus transforms alignment from a fixed property into an explicit, dynamic, and auditable process active during inference, allowing for flexible policy updates without retraining the model. Extensive experiments across multiple benchmarks and model families confirm that CooT consistently improves safety and social reasoning performance.

[107] DiaCDM: Cognitive Diagnosis in Teacher-Student Dialogues using the Initiation-Response-Evaluation Framework

Rui Jia, Yuang Wei, Ruijia Li, Yuan-Hao Jiang, Xinyu Xie, Yaomin Shen, Min Zhang, Bo Jiang

Main category: cs.CL

TL;DR: DiaCDM is a novel cognitive diagnosis model for teacher-student dialogues that adapts the IRE framework and uses graph-based encoding to overcome challenges of dynamic, unstructured dialogue data.

Details

Motivation: Traditional cognitive diagnosis models are designed for structured test data and cannot handle dynamic, unstructured teacher-student dialogues effectively.

Method: Adapted the initiation-response-evaluation (IRE) framework from educational theory and developed a unique graph-based encoding method that integrates teacher questions with knowledge components.

Result: Experiments on three real-world dialogue datasets show significant improvements in diagnostic accuracy and enhanced interpretability of results.

Conclusion: DiaCDM provides teachers with a powerful tool for assessing students’ cognitive states in dialogue settings, marking the first exploration of cognitive diagnosis in this context.

Abstract: While cognitive diagnosis (CD) effectively assesses students’ knowledge mastery from structured test data, applying it to real-world teacher-student dialogues presents two fundamental challenges. Traditional CD models lack a suitable framework for handling dynamic, unstructured dialogues, and it’s difficult to accurately extract diagnostic semantics from lengthy dialogues. To overcome these hurdles, we propose DiaCDM, an innovative model. We’ve adapted the initiation-response-evaluation (IRE) framework from educational theory to design a diagnostic framework tailored for dialogue. We also developed a unique graph-based encoding method that integrates teacher questions with relevant knowledge components to capture key information more precisely. To our knowledge, this is the first exploration of cognitive diagnosis in a dialogue setting. Experiments on three real-world dialogue datasets confirm that DiaCDM not only significantly improves diagnostic accuracy but also enhances the results’ interpretability, providing teachers with a powerful tool for assessing students’ cognitive states. The code is available at https://github.com/Mind-Lab-ECNU/DiaCDM/tree/main.

[108] Understanding the Mixture-of-Experts with Nadaraya-Watson Kernel

Chuanyang Zheng, Jiankai Sun, Yihang Gao, Enze Xie, Yuehao Wang, Peihao Wang, Ting Xu, Matthew Chang, Liliang Ren, Jingyao Li, Jing Xiong, Kashif Rasul, Mac Schwager, Anderson Schneider, Zhangyang Wang, Yuriy Nevmyvaka

Main category: cs.CL

TL;DR: The paper proposes KERN, a new FFN-style router function for Mixture-of-Experts models that replaces the traditional Softmax router, showing it generalizes both Sigmoid- and Softmax-based routers with zero additional cost.

Details

Motivation: The authors challenge the unchallenged assumption that Softmax is necessary for MoE routers, noting it's been treated as standard practice without principled justification. They observe that both FFN and MoE share mathematical formulations with Nadaraya-Watson regression.

Method: Proposed KERN (Kernel Inspired Router with Normalization), an FFN-style router function that uses ReLU activation and L2-normalization, based on insights from Nadaraya-Watson regression and FFN implementation practices.

Result: Comprehensive experiments in MoE and LLM settings validate the effectiveness of the proposed router function, showing it works as a viable alternative to traditional Softmax routers.

Conclusion: The KERN router provides a principled alternative to Softmax that generalizes existing router approaches and can be implemented with zero additional computational cost.

Abstract: Mixture-of-Experts (MoE) has become a cornerstone in recent state-of-the-art large language models (LLMs). Traditionally, MoE relies on $\mathrm{Softmax}$ as the router score function to aggregate expert output, a designed choice that has persisted from the earliest MoE models to modern LLMs, and is now widely regarded as standard practice. However, the necessity of using $\mathrm{Softmax}$ to project router weights into a probability simplex remains an unchallenged assumption rather than a principled design choice. In this work, we first revisit the classical Nadaraya-Watson regression and observe that MoE shares the same mathematical formulation as Nadaraya-Watson regression. Furthermore, we show that both feed-forward neural network (FFN) and MoE can be interpreted as a special case of Nadaraya-Watson regression, where the kernel function corresponds to the input neurons of the output layer. Motivated by these insights, we propose the \textbf{zero-additional-cost} Kernel Inspired Router with Normalization (KERN), an FFN-style router function, as an alternative to $\mathrm{Softmax}$. We demonstrate that this router generalizes both $\mathrm{Sigmoid}$- and $\mathrm{Softmax}$-based routers. \textbf{Based on empirical observations and established practices in FFN implementation, we recommend the use of $\mathrm{ReLU}$ activation and $\ell_2$-normalization in $\mathrm{KERN}$ router function.} Comprehensive experiments in MoE and LLM validate the effectiveness of the proposed FFN-style router function \methodNorm.

[109] Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning

Arash Marioriyad, Shaygan Adim, Nima Alighardashi, Mahdieh Soleymani Banghshah, Mohammad Hossein Rohban

Main category: cs.CL

TL;DR: LLM reasoning using chain-of-thought prompting is systematically influenced by hints in ways that compromise faithfulness, with correct hints improving accuracy but incorrect ones reducing it, and hint acknowledgement varying by complexity and presentation style.

Details

Motivation: To determine whether chain-of-thought rationales are faithful computations or post-hoc narratives shaped by answer shortcuts embedded in prompts.

Method: Systematic study across 4 datasets (AIME, GSM-Hard, MATH-500, UniADILR) using GPT-4o and Gemini-2-Flash, with controlled hint manipulations varying in correctness, presentation style, and complexity.

Result: Correct hints substantially improve accuracy, especially on harder tasks; incorrect hints reduce accuracy; hint acknowledgement is uneven - complex hints are referenced while raw hints are adopted silently; presentation style affects acknowledgement patterns.

Conclusion: LLM reasoning is systematically shaped by shortcuts that obscure faithfulness, with RLHF-related effects influencing how models acknowledge or hide their reliance on hints.

Abstract: Large language models (LLMs) increasingly rely on chain-of-thought (CoT) prompting to solve mathematical and logical reasoning tasks. Yet, a central question remains: to what extent are these generated rationales \emph{faithful} to the underlying computations, rather than post-hoc narratives shaped by hints that function as answer shortcuts embedded in the prompt? Following prior work on hinted vs.\ unhinted prompting, we present a systematic study of CoT faithfulness under controlled hint manipulations. Our experimental design spans four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models (GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in correctness (correct and incorrect), presentation style (sycophancy and data leak), and complexity (raw answers, two-operator expressions, four-operator expressions). We evaluate both task accuracy and whether hints are explicitly acknowledged in the reasoning. Our results reveal three key findings. First, correct hints substantially improve accuracy, especially on harder benchmarks and logical reasoning, while incorrect hints sharply reduce accuracy in tasks with lower baseline competence. Second, acknowledgement of hints is highly uneven: equation-based hints are frequently referenced, whereas raw hints are often adopted silently, indicating that more complex hints push models toward verbalizing their reliance in the reasoning process. Third, presentation style matters: sycophancy prompts encourage overt acknowledgement, while leak-style prompts increase accuracy but promote hidden reliance. This may reflect RLHF-related effects, as sycophancy exploits the human-pleasing side and data leak triggers the self-censoring side. Together, these results demonstrate that LLM reasoning is systematically shaped by shortcuts in ways that obscure faithfulness.

[110] The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge

Arash Marioriyad, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah

Main category: cs.CL

TL;DR: LLM judges show systematic biases favoring recent responses and certain sources (Expert > Human > LLM > Unknown), but rarely acknowledge these cues in their justifications, making them unfaithful evaluators.

Details

Motivation: To assess whether LLM judges base their evaluations solely on response quality and explicitly acknowledge the factors influencing their decisions, given their increasing use in evaluating system outputs.

Method: Used ELI5 (long-form QA) and LitBench (creative writing) datasets with pairwise comparisons. Constructed 100 judgment tasks per dataset, assigning superficial cues (provenance: Human/Expert/LLM/Unknown; recency: Old/New) while keeping prompts fixed. Employed GPT-4o and Gemini-2.5-Flash as evaluators.

Result: Both models showed strong recency bias (favoring new responses) and clear provenance hierarchy (Expert > Human > LLM > Unknown). Biases were more pronounced in GPT-4o and in subjective domains like LitBench. Cue acknowledgment was rare - justifications focused on content qualities rather than the injected cues.

Conclusion: Current LLM-as-a-judge systems are shortcut-prone and unfaithful, relying on superficial cues without acknowledgment, which undermines their reliability for evaluation in research and deployment.

Abstract: Large language models (LLMs) are increasingly deployed as automatic judges to evaluate system outputs in tasks such as summarization, dialogue, and creative writing. A faithful judge should base its verdicts solely on response quality and explicitly acknowledge the factors shaping its decision. We show that current LLM judges fail on both counts by relying on shortcuts introduced in the prompt. Our study uses two evaluation datasets: ELI5, a benchmark for long-form question answering, and LitBench, a recent benchmark for creative writing. Both datasets provide pairwise comparisons, where the evaluator must choose which of two responses is better. From each dataset we construct 100 pairwise judgment tasks and employ two widely used models, GPT-4o and Gemini-2.5-Flash, as evaluators in the role of LLM-as-a-judge. For each pair, we assign superficial cues to the responses, provenance cues indicating source identity (Human, Expert, LLM, or Unknown) and recency cues indicating temporal origin (Old, 1950 vs. New, 2025), while keeping the rest of the prompt fixed. Results reveal consistent verdict shifts: both models exhibit a strong recency bias, systematically favoring new responses over old, as well as a clear provenance hierarchy (Expert > Human > LLM > Unknown). These biases are especially pronounced in GPT-4o and in the more subjective and open-ended LitBench domain. Crucially, cue acknowledgment is rare: justifications almost never reference the injected cues, instead rationalizing decisions in terms of content qualities. These findings demonstrate that current LLM-as-a-judge systems are shortcut-prone and unfaithful, undermining their reliability as evaluators in both research and deployment.

[111] Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis

Leitian Tao, Xuefeng Du, Sharon Li

Main category: cs.CL

TL;DR: LENS is a framework that synthesizes preference data directly in LLM’s latent embedding space using VAE, bypassing expensive text generation and annotation while preserving preference ordering.

Details

Motivation: Reward modeling for LLM alignment is bottlenecked by high cost of preference data collection, and existing text-based synthesis methods are computationally expensive.

Method: Uses VAE to learn structured latent representation of response embeddings, performs controlled perturbations in latent space, and decodes back to embedding space to generate synthetic preference pairs.

Result: Outperforms text-based augmentation on benchmarks, achieves superior results while being 18x faster in generation and using 16,000x smaller model.

Conclusion: Provides scalable and effective alternative for enhancing reward modeling through efficient latent-space data augmentation.

Abstract: Reward modeling, crucial for aligning large language models (LLMs) with human preferences, is often bottlenecked by the high cost of preference data. Existing textual data synthesis methods are computationally expensive. We propose a novel framework LENS for synthesizing preference data directly in the LLM’s latent embedding space. Our method employs a Variational Autoencoder (VAE) to learn a structured latent representation of response embeddings. By performing controlled perturbations in this latent space and decoding back to the embedding space, we efficiently generate diverse, semantically consistent synthetic preference pairs, bypassing costly text generation and annotation. We provide theoretical guarantees that our synthesized pairs approximately preserve original preference ordering and improve reward model generalization. Empirically, our latent-space synthesis significantly outperforms text-based augmentation on standard benchmarks, achieving superior results while being 18x faster in generation and using a 16,000x smaller model. Our work offers a scalable and effective alternative for enhancing reward modeling through efficient data augmentation. Code is publicly available at https://github.com/deeplearning-wisc/lens

[112] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang

Main category: cs.CL

TL;DR: Graph2Eval is a knowledge graph-based framework that automatically generates multimodal tasks for comprehensive evaluation of LLM-driven agents’ reasoning, collaboration, and interactive capabilities.

Details

Motivation: Existing evaluation methods based on static datasets are inadequate for assessing agents in dynamic environments, and current LLM-based synthetic data methods cannot handle agent tasks requiring tool use and interactive capabilities.

Method: Uses knowledge graphs from multi-source data as task space, translates semantic relations into multimodal tasks via subgraph sampling, task templates, and meta-paths, with multi-stage filtering for quality assurance.

Result: Created Graph2Eval-Bench with 1,319 tasks spanning document comprehension and web interaction, showing effective differentiation of agent performance and revealing gaps in reasoning, collaboration, and web interaction.

Conclusion: Graph2Eval provides a comprehensive framework for agent evaluation that efficiently generates tasks to assess true capabilities in dynamic environments, offering new perspective for agent evaluation.

Abstract: As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents’ reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.

[113] KnowledgeSmith: Uncovering Knowledge Updating in LLMs with Model Editing and Unlearning

Yinyi Luo, Zhexian Zhou, Hao Chen, Kai Qiu, Marios Savvides, Sharon Li, Jindong Wang

Main category: cs.CL

TL;DR: KnowledgeSmith is a unified framework that systematically analyzes how large language models (LLMs) update knowledge through editing and unlearning, revealing nuanced insights about knowledge propagation, plasticity scaling, and consistency-robustness trade-offs.

Details

Motivation: To understand the knowledge updating mechanism of LLMs, which remains largely unexplored due to insufficient, isolated, and small-scale evaluation. The paper aims to investigate whether LLMs update knowledge similarly to humans and how editing vs unlearning differ as training data increases.

Method: Proposes KnowledgeSmith framework that: 1) Casts editing and unlearning as instances of one constrained optimization problem, 2) Uses an automatic dataset generator providing structured interventions across multiple graph levels and data scales, 3) Enables controlled studies of how different modification strategies propagate through model knowledge.

Result: Extensive experiments reveal nuanced insights: LLMs do not exhibit similar updating as humans for different knowledge levels, and there exists a consistency-capacity trade-off. The framework provides controlled analysis of knowledge propagation, plasticity scaling, consistency, and robustness.

Conclusion: The findings offer suggestions for designing more reliable and scalable knowledge updating strategies for LLMs, highlighting the importance of understanding knowledge propagation mechanisms and trade-offs in model behavior.

Abstract: Knowledge editing and machine unlearning are two popular approaches for large language models (LLMs) to stay up-to-date. However, the knowledge updating mechanism of LLMs remains largely unexplored due to insufficient, isolated, and small-scale evaluation. For instance, are LLMs similar to humans in modifying certain knowledge? What differs editing and unlearning as training data increases? This paper proposes KnowledgeSmith, a unified framework to systematically understand the updating mechanism of LLMs. We first cast editing and unlearning as instances of one constrained optimization problem. Then, we propose an automatic dataset generator that provides structured interventions across multiple graph levels and data scales, enabling controlled studies of how different modification strategies propagate through model knowledge. Extensive experiments demonstrate nuanced insights over knowledge propagation, plasticity scaling, consistency, and robustness. For instance, our results show that LLMs do not exhibit similar updating as humans for different levels of knowledge, and there exists consistency-capacity trade-off. We hope our findings can offer suggestions to the design of more reliable and scalable strategies. Code: https://github.com/AIFrontierLab/KnowledgeSmith.git

[114] Model-Based Ranking of Source Languages for Zero-Shot Cross-Lingual Transfer

Abteen Ebrahimi, Adam Wiemerslage, Katharina von der Wense

Main category: cs.CL

TL;DR: NN-Rank is a novel algorithm that uses multilingual model representations and unlabeled target-language data to rank source languages for cross-lingual transfer, outperforming existing methods on POS tagging and NER tasks.

Details

Motivation: Existing approaches for source language ranking rely on lexical and linguistic features, which may not capture the complex relationships between languages that multilingual models can learn. There's a need for more effective ranking methods that can leverage modern multilingual representations.

Method: NN-Rank leverages hidden representations from pretrained multilingual models and unlabeled target-language data to rank source languages. It was tested on 51 source languages using two multilingual models for POS tagging (56 targets) and NER (72 targets).

Result: NN-Rank significantly outperforms state-of-the-art baselines, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. It remains competitive using only the Bible corpus and works well with as few as 25 unlabeled examples (achieving 92.8% of full-data NDCG).

Conclusion: NN-Rank provides an effective approach for source language ranking that leverages multilingual representations and requires minimal target-language data, making it practical for low-resource scenarios.

Abstract: We present NN-Rank, an algorithm for ranking source languages for cross-lingual transfer, which leverages hidden representations from multilingual models and unlabeled target-language data. We experiment with two pretrained multilingual models and two tasks: part-of-speech tagging (POS) and named entity recognition (NER). We consider 51 source languages and evaluate on 56 and 72 target languages for POS and NER, respectively. When using in-domain data, NN-Rank beats state-of-the-art baselines that leverage lexical and linguistic features, with average improvements of up to 35.56 NDCG for POS and 18.14 NDCG for NER. As prior approaches can fall back to language-level features if target language data is not available, we show that NN-Rank remains competitive using only the Bible, an out-of-domain corpus available for a large number of languages. Ablations on the amount of unlabeled target data show that, for subsets consisting of as few as 25 examples, NN-Rank produces high-quality rankings which achieve 92.8% of the NDCG achieved using all available target data for ranking.

[115] Triplet-Structured Knowledge Integration for Multi-Turn Medical Reasoning

Zhaohan Meng, Zaiqiao Meng, Siwei Liu, Iadh Ounis

Main category: cs.CL

TL;DR: TriMediQ improves LLM reasoning in medical dialogues by extracting clinical triplets from patient responses, building a knowledge graph, and using a projection module for multi-hop reasoning without fine-tuning LLMs.

Details

Motivation: LLMs perform poorly in multi-turn clinical dialogues where patient information is scattered across turns, requiring better reasoning capabilities.

Method: Extracts clinical triplets from patient responses using constrained prompting, builds patient-specific knowledge graphs, and uses a trainable projection module (graph encoder + projector) for multi-hop reasoning while keeping LLM parameters frozen.

Result: Achieves up to 10.4% accuracy improvement over five baselines on iMedQA dataset, demonstrating superior performance in interactive medical QA.

Conclusion: Structuring patient information as triplets effectively enhances LLM reasoning capability in multi-turn medical question answering.

Abstract: Large Language Models (LLMs) have shown strong performance on static medical Question Answering (QA) tasks, yet their reasoning often deteriorates in multi-turn clinical dialogues where patient information is scattered across turns. This paper introduces TriMediQ, a triplet-structured approach that enhances the reasoning reliability of LLMs through explicit knowledge integration. TriMediQ first employs a frozen triplet extraction LLM to convert patient responses into clinically grounded triplets, ensuring factual precision via constrained prompting. These triplets are incorporated into a patient-specific Knowledge Graph (KG), from which a trainable projection module consisting of a graph encoder and a projector captures relational dependencies while keeping all LLM parameters frozen. During inference, the projection module guides multi-hop reasoning over the KG, enabling coherent clinical dialogue understanding. Experiments on two interactive medical QA benchmarks show that TriMediQ achieves up to 10.4% improvement in accuracy over five existing baselines on the iMedQA dataset. These results demonstrate that structuring patient information as triplets can effectively improve the reasoning capability of LLMs in multi-turn medical QA.

[116] Simulating and Understanding Deceptive Behaviors in Long-Horizon Interactions

Yang Xu, Xuanming Zhang, Samuel Yeh, Jwala Dhamala, Ousmane Dia, Rahul Gupta, Sharon Li

Main category: cs.CL

TL;DR: First simulation framework to evaluate LLM deception in multi-turn interactions, revealing deception increases with pressure and erodes trust across 11 models.

Details

Motivation: Current LLM deception evaluations are limited to single-turn prompts, failing to capture long-horizon interactions where deceptive strategies typically unfold in real-world contexts.

Method: Multi-agent simulation with performer agent completing tasks, supervisor agent evaluating progress and maintaining trust states, and independent deception auditor reviewing full trajectories to identify deception patterns.

Result: Deception is model-dependent, increases with event pressure, consistently erodes supervisor trust, and reveals distinct strategies of concealment, equivocation, and falsification across 11 frontier LLMs.

Conclusion: Deception is an emergent risk in long-horizon LLM interactions, establishing foundation for evaluating future models in trust-sensitive real-world contexts.

Abstract: Deception is a pervasive feature of human communication and an emerging concern in large language models (LLMs). While recent studies document instances of LLM deception under pressure, most evaluations remain confined to single-turn prompts and fail to capture the long-horizon interactions in which deceptive strategies typically unfold. We introduce the first simulation framework for probing and evaluating deception in LLMs under extended sequences of interdependent tasks and dynamic contextual pressures. Our framework instantiates a multi-agent system: a performer agent tasked with completing tasks and a supervisor agent that evaluates progress, provides feedback, and maintains evolving states of trust. An independent deception auditor then reviews full trajectories to identify when and how deception occurs. We conduct extensive experiments across 11 frontier models, spanning both closed- and open-source systems, and find that deception is model-dependent, increases with event pressure, and consistently erodes supervisor trust. Qualitative analyses further reveal distinct strategies of concealment, equivocation, and falsification. Our findings establish deception as an emergent risk in long-horizon interactions and provide a foundation for evaluating future LLMs in real-world, trust-sensitive contexts.

[117] Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Kaiyu Huang, Yufeng Chen, Jinan Xu, Jie Zhou

Main category: cs.CL

TL;DR: M-Thinker addresses language inconsistency and poor reasoning in non-English languages for Large Reasoning Models by using GRPO algorithm with Language Consistency and Cross-lingual Thinking Alignment rewards.

Details

Motivation: Current LRMs struggle with maintaining input-output language consistency and perform poorly with wrong reasoning paths in non-English languages, degrading user experience and hindering global deployment.

Method: Propose M-Thinker trained with GRPO algorithm featuring Language Consistency reward for input-thought-answer consistency and Cross-lingual Thinking Alignment reward to transfer English reasoning capability to non-English languages.

Result: M-Thinker-1.5B/7B models achieve nearly 100% language consistency, superior performance on MMATH and PolyMath benchmarks, and excellent generalization on out-of-domain languages.

Conclusion: M-Thinker effectively solves language inconsistency and reasoning capability transfer issues in non-English LRMs through innovative reward mechanisms in RL training.

Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the “think-then-answer” paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) They often struggle to maintain input-output language consistency; (2) They generally perform poorly with wrong reasoning paths and lower answer accuracy compared to English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained by the GRPO algorithm that involves a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward defines a strict constraint on the language consistency between the input, thought, and answer. Besides, the CTA reward compares the model’s non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.

[118] CrisiText: A dataset of warning messages for LLM training in emergency communication

Giacomo Gonella, Gian Maria Campedelli, Stefano Menini, Marco Guerini

Main category: cs.CL

TL;DR: CrisiText is the first large-scale dataset for generating warning messages across 13 crisis scenarios, containing over 400,000 messages. The paper compares various NLG approaches for crisis warning generation.

Details

Motivation: Current NLP applications in crisis situations are limited to classification tasks, overlooking the significant potential of timely warning message generation using NLG architectures to assist civilians during emergencies.

Method: Created CrisiText dataset from existing crisis descriptions by generating event chains paired with warning messages following expert guidelines. Compared supervised fine-tuning with preference alignment, zero-shot, few-shot approaches, out-of-distribution performance, and automatic post-editing.

Result: Developed a comprehensive dataset of 400,000+ warning messages spanning 18,000 crisis situations across 13 different crisis types, with each message accompanied by three suboptimal warning types for comparative study.

Conclusion: The CrisiText dataset enables systematic study of NLG approaches for crisis warning generation, addressing a significant gap in emergency response AI systems and providing a foundation for developing more effective automated warning systems.

Abstract: Effectively identifying threats and mitigating their potential damage during crisis situations, such as natural disasters or violent attacks, is paramount for safeguarding endangered individuals. To tackle these challenges, AI has been used in assisting humans in emergency situations. Still, the use of NLP techniques remains limited and mostly focuses on classification tasks. The significant potential of timely warning message generation using NLG architectures, however, has been largely overlooked. In this paper we present CrisiText, the first large-scale dataset for the generation of warning messages across 13 different types of crisis scenarios. The dataset contains more than 400,000 warning messages (spanning almost 18,000 crisis situations) aimed at assisting civilians during and after such events. To generate the dataset, we started from existing crisis descriptions and created chains of events related to the scenarios. Each event was then paired with a warning message. The generations follow experts’ written guidelines to ensure correct terminology and factuality of their suggestions. Additionally, each message is accompanied by three suboptimal warning types to allow for the study of different NLG approaches. To this end, we conducted a series of experiments comparing supervised fine-tuning setups with preference alignment, zero-shot, and few-shot approaches. We further assessed model performance in out-of-distribution scenarios and evaluated the effectiveness of an automatic post-editor.

[119] DSPO: Stable and Efficient Policy Optimization for Agentic Search and Reasoning

Chenyang Gu, Yewen Pu, Bruce Yang, Xiaofan Li, Huan Gao

Main category: cs.CL

TL;DR: DSPO is a new RL algorithm that trains LLMs to actively search external knowledge through sequence-level optimization and dynamic sample filtering, achieving significant performance improvements on QA benchmarks without supervised data.

Details

Motivation: Current approaches for enabling LLMs to search external knowledge either rely on prompting or suffer from performance limitations with RL in complex interactive tasks, failing to unlock their full agentic potential.

Method: Dynamic-filter Sequence-level Policy Optimization (DSPO) - an RL algorithm using sequence-level optimization and dynamic sample filtering to train models to interleave multi-turn search and reasoning without supervised demonstration data.

Result: The 7B model improves over comparable previous work by 34.1% across multiple QA benchmarks, outperforms a 14B model from previous work on HotpotQA by nearly 9% relative, and maintains exceptional training stability.

Conclusion: DSPO enables robust agent training for LLMs to actively search external knowledge, achieving state-of-the-art performance on complex QA tasks through pure RL training without supervised data.

Abstract: Enhancing LLMs with the ability to actively search external knowledge is crucial for complex and real-world tasks. Current approaches either rely on prompting to elicit the model’s innate agent capabilities, or suffer from performance ceilings and collapse when applying RL to complex interactive tasks, leaving their true agentic potential untapped. To address this, we introduce \textbf{D}ynamic-filter \textbf{S}equence-level \textbf{P}olicy \textbf{O}ptimization (DSPO), an improved RL algorithm designed for robust agent training through sequence-level optimization and dynamic sample filtering. We train our model purely through RL to interleave multi-turn search and reasoning, obviating the need for supervised demonstration data. Across multiple QA benchmarks, our 7B model improves over a comparable previous work by \textbf{34.1%}, and even outperforms the 14B model from previous work in complex multihop QA such as HotpotQA by nearly \textbf{9% relative}, maintaining exceptional training stability.

[120] AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation

Omid Reza Heidari, Siobhan Reid, Yassine Yaakoubi

Main category: cs.CL

TL;DR: AGENTIQL is an agent-inspired multi-expert framework for text-to-SQL that uses reasoning and coding agents for question decomposition and sub-query generation, with adaptive routing between modular and baseline approaches.

Details

Motivation: Monolithic LLM architectures struggle with complex reasoning and schema diversity in text-to-SQL tasks, requiring a more robust and interpretable approach.

Method: Multi-expert framework with reasoning agent for question decomposition, coding agent for sub-query generation, refinement for column selection, and adaptive router for efficiency-accuracy balance. Supports parallel execution.

Result: Achieves 86.07% EX on Spider benchmark using 14B models, narrowing gap to GPT-4 SOTA (89.65% EX) while using smaller open-source LLMs.

Conclusion: AGENTIQL provides robust, scalable, and interpretable semantic parsing with enhanced transparency through intermediate reasoning steps.

Abstract: LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy. The attained performance is contingent upon the efficacy of the routing mechanism, thereby narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.

[121] BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Tao Yu, Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen

Main category: cs.CL

TL;DR: BrowserAgent is an interactive web agent that uses human-inspired browser actions like scrolling, clicking, and typing to solve complex web tasks, achieving competitive performance with less training data than existing methods.

Details

Motivation: Current web agents like Search-R1 and WebDancer rely on converting web environments to static text, which contrasts with human browsing behaviors involving diverse browser interactions.

Method: Two-stage training (SFT and RFT) with predefined browser actions via Playwright, plus an explicit memory mechanism to store key conclusions across steps for long-horizon tasks.

Result: BrowserAgent-7B achieves ~20% improvement over Search-R1 on multi-hop QA tasks (HotpotQA, 2Wiki, Bamboogle) and competitive results across Open-QA tasks with significantly less training data.

Conclusion: BrowserAgent serves as a more advanced framework for interactive and scalable web agents by mimicking human browsing behaviors.

Abstract: Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model’s generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model’s reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.

[122] Are Large Reasoning Models Interruptible?

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

Main category: cs.CL

TL;DR: The paper challenges the “frozen world” assumption in Large Reasoning Models evaluation by testing robustness under dynamic scenarios like interruptions and changing context, revealing significant performance drops and novel failure modes.

Details

Motivation: Traditional evaluations assume static, unchanging contexts during model reasoning, which breaks down in real-world scenarios like programming where contexts change over time and models take hours to respond.

Method: Evaluate LRM robustness under two dynamic scenarios: interruptions (testing partial outputs with limited budget) and dynamic context (testing adaptation to in-flight changes), using mathematics and programming benchmarks requiring long-form reasoning.

Result: Static evaluations overestimate robustness - state-of-the-art LRMs show up to 60% performance drop when interrupted or exposed to changing context, especially when updates occur late in reasoning. Novel failure modes identified include reasoning leakage, panic, and self-doubt.

Conclusion: The frozen world assumption is inadequate for evaluating modern reasoning tasks; dynamic evaluation reveals critical robustness issues in LRMs that static evaluations miss, highlighting the need for more realistic testing frameworks.

Abstract: Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, “frozen world” settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the “frozen world” assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model’s final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model’s partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information.

cs.CV

[123] Enhancing the Quality of 3D Lunar Maps Using JAXA’s Kaguya Imagery

Yumi Iwashita, Haakon Moe, Yang Cheng, Adnan Ansar, Georgios Georgakis, Adrian Stoica, Kazuto Nakashima, Ryo Kurazume, Jim Torresen

Main category: cs.CV

TL;DR: Method to improve 3D lunar maps from Kaguya TC images by reducing compression-induced noise in disparity maps, enhancing elevation accuracy for safer lunar missions.

Details

Motivation: High-quality 3D lunar maps are critical for long-distance missions like NASA's Endurance rover, but Kaguya TC images suffer from altitude inaccuracies due to stereo matching errors and JPEG compression artifacts.

Method: Analyze compression behavior of Kaguya TC imagery, identify systematic disparity noise patterns in darker regions, and develop approach to reduce residual noise in disparity images from compressed images.

Result: Experimental results show the proposed approach effectively reduces elevation noise, improving terrain data quality.

Conclusion: The method enhances safety and reliability of terrain data for future lunar missions by mitigating compression-induced noise in 3D lunar maps.

Abstract: As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical-particularly for long-distance missions such as NASA’s Endurance mission concept, in which a rover aims to traverse 2,000 km across the South Pole-Aitken basin. Kaguya TC (Terrain Camera) images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose an approach to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed approach effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.

[124] Data or Language Supervision: What Makes CLIP Better than DINO?

Yiming Liu, Yuhui Zhang, Dhruba Ghosh, Ludwig Schmidt, Serena Yeung-Levy

Main category: cs.CV

TL;DR: CLIP’s advantage over DINO in vision-language models stems from language supervision enabling better semantic understanding, not just larger training data, with CLIP excelling at text-intensive tasks while DINO performs slightly better on vision-centric ones.

Details

Motivation: To determine whether CLIP's superior performance as vision encoders in vision-language models comes from language supervision or larger training data, and to understand the fundamental differences between CLIP and DINO representations.

Method: Pre-trained CLIP and DINO under controlled settings with same architecture, dataset, and training configuration to achieve similar ImageNet accuracy, then analyzed embeddings and evaluated on 20 VQA benchmarks.

Result: CLIP captures high-level semantics (object categories, text) while DINO responds more to low-level features (colors, styles). CLIP excels at text-intensive VQA tasks, DINO slightly outperforms on vision-centric ones. Language supervision variants provided limited gains.

Conclusion: Language supervision in CLIP enables better semantic understanding crucial for vision-language tasks, providing scientific insights for vision encoder design in VLMs.

Abstract: CLIP outperforms self-supervised models like DINO as vision encoders for vision-language models (VLMs), but it remains unclear whether this advantage stems from CLIP’s language supervision or its much larger training data. To disentangle these factors, we pre-train CLIP and DINO under controlled settings – using the same architecture, dataset, and training configuration – achieving similar ImageNet accuracy. Embedding analysis shows that CLIP captures high-level semantics (e.g., object categories, text), while DINO is more responsive to low-level features like colors and styles. When integrated into VLMs and evaluated on 20 VQA benchmarks, CLIP excels at text-intensive tasks, while DINO slightly outperforms on vision-centric ones. Variants of language supervision (e.g., sigmoid loss, pre-trained language encoders) yield limited gains. Our findings provide scientific insights into vision encoder design and its impact on VLM performance.

[125] MammoDINO: Anatomically Aware Self-Supervision for Mammographic Images

Sicheng Zhou, Lei Wu, Cao Xiao, Parminder Bhatia, Taha Kass-Hout

Main category: cs.CV

TL;DR: MammoDINO is a self-supervised learning framework for mammography that achieves state-of-the-art performance on breast cancer screening tasks using 1.4 million mammographic images with specialized data augmentation and cross-slice contrastive learning.

Details

Motivation: Self-supervised learning has been transformative in general vision domains but remains underutilized in medical imaging due to limited data and domain-specific biases, particularly in mammography where annotation is costly and time-consuming.

Method: Uses breast tissue aware data augmentation sampler for image-level and patch-level supervision, and cross-slice contrastive learning objective that leverages 3D DBT structure into 2D pretraining. Pretrained on 1.4 million mammographic images.

Result: Achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. Provides a scalable, annotation-free foundation for multipurpose CAD tools.

Conclusion: MammoDINO offers an effective SSL solution for mammography that can reduce radiologists’ workload and improve diagnostic efficiency in breast cancer screening through scalable, annotation-free pretraining.

Abstract: Self-supervised learning (SSL) has transformed vision encoder training in general domains but remains underutilized in medical imaging due to limited data and domain specific biases. We present MammoDINO, a novel SSL framework for mammography, pretrained on 1.4 million mammographic images. To capture clinically meaningful features, we introduce a breast tissue aware data augmentation sampler for both image-level and patch-level supervision and a cross-slice contrastive learning objective that leverages 3D digital breast tomosynthesis (DBT) structure into 2D pretraining. MammoDINO achieves state-of-the-art performance on multiple breast cancer screening tasks and generalizes well across five benchmark datasets. It offers a scalable, annotation-free foundation for multipurpose computer-aided diagnosis (CAD) tools for mammogram, helping reduce radiologists’ workload and improve diagnostic efficiency in breast cancer screening.

[126] Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis

Blessing Agyei Kyem, Neema Jakisa Owor, Andrews Danyo, Joshua Kofi Asamoah, Eugene Denteh, Tanner Muturi, Anthony Dontoh, Yaw Adu-Gyamfi, Armstrong Aboah

Main category: cs.CV

TL;DR: A dual-model framework using VideoLLaMA and Qwen2.5-VL with separate training for captioning and VQA tasks achieves state-of-the-art performance in traffic safety analysis by minimizing task interference.

Details

Motivation: Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention.

Method: Dual-model framework that strategically utilizes VideoLLaMA and Qwen2.5-VL through task-specific optimization, separating training for captioning and visual question answering (VQA) tasks to minimize interference.

Result: VideoLLaMA excels in temporal reasoning (CIDEr: 1.1001), Qwen2.5-VL excels in visual understanding (VQA accuracy: 60.80%). Achieved S2 score of 45.7572 in 2025 AI City Challenge Track 2 (10th place). Separate training outperforms joint training by 8.6% in VQA accuracy.

Conclusion: Separating training for captioning and VQA tasks allows each model to specialize more effectively, minimizing task interference and achieving superior performance in traffic safety analysis.

Abstract: Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. In this work, we present a unique dual-model framework that strategically utilizes the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization to address this issue. The core insight behind our approach is that separating training for captioning and visual question answering (VQA) tasks minimizes task interference and allows each model to specialize more effectively. Experimental results demonstrate that VideoLLaMA is particularly effective in temporal reasoning, achieving a CIDEr score of 1.1001, while Qwen2.5-VL excels in visual understanding with a VQA accuracy of 60.80%. Through extensive experiments on the WTS dataset, our method achieves an S2 score of 45.7572 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.

[127] PanoTPS-Net: Panoramic Room Layout Estimation via Thin Plate Spline Transformation

Hatem Ibrahem, Ahmed Salem, Qinmin Vivian Hu, Guanghui Wang

Main category: cs.CV

TL;DR: PanoTPS-Net is a novel model for estimating 3D room layouts from single panorama images using CNN and Thin Plate Spline transformation in a two-stage architecture.

Details

Motivation: Accurate 3D room layout estimation is crucial for applications in robotics, augmented reality, and interior design from single panorama images.

Method: Two-stage architecture: 1) CNN extracts high-level features and learns TPS transformation parameters, 2) TPS transformation warps reference layout to predicted layout using learned parameters.

Result: Achieved 3DIoU values of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets respectively, demonstrating high accuracy and generalization to both cuboid and non-cuboid layouts.

Conclusion: The model effectively combines TPS transformation with panorama images, showing robustness in handling various room layout types and outperforming state-of-the-art methods.

Abstract: Accurately estimating the 3D layout of rooms is a crucial task in computer vision, with potential applications in robotics, augmented reality, and interior design. This paper proposes a novel model, PanoTPS-Net, to estimate room layout from a single panorama image. Leveraging a Convolutional Neural Network (CNN) and incorporating a Thin Plate Spline (TPS) spatial transformation, the architecture of PanoTPS-Net is divided into two stages: First, a convolutional neural network extracts the high-level features from the input images, allowing the network to learn the spatial parameters of the TPS transformation. Second, the TPS spatial transformation layer is generated to warp a reference layout to the required layout based on the predicted parameters. This unique combination empowers the model to properly predict room layouts while also generalizing effectively to both cuboid and non-cuboid layouts. Extensive experiments on publicly available datasets and comparisons with state-of-the-art methods demonstrate the effectiveness of the proposed method. The results underscore the model’s accuracy in room layout estimation and emphasize the compatibility between the TPS transformation and panorama images. The robustness of the model in handling both cuboid and non-cuboid room layout estimation is evident with a 3DIoU value of 85.49, 86.16, 81.76, and 91.98 on PanoContext, Stanford-2D3D, Matterport3DLayout, and ZInD datasets, respectively. The source code is available at: https://github.com/HatemHosam/PanoTPS_Net.

[128] Ivan-ISTD: Rethinking Cross-domain Heteroscedastic Noise Perturbations in Infrared Small Target Detection

Yuehui Li, Yahao Lu, Haoyuan Wu, Sen Zhang, Liang Lin, Yukai Shi

Main category: cs.CV

TL;DR: Proposes Ivan-ISTD, a doubly wavelet-guided invariance learning framework for infrared small target detection that addresses cross-domain shift and heteroscedastic noise through wavelet-guided cross-domain synthesis and real-domain noise invariance learning.

Details

Motivation: To solve the dual challenges of cross-domain shift and heteroscedastic noise perturbations in infrared small target detection for drone-based multi-modality sensing applications.

Method: Two-stage approach: 1) Wavelet-guided Cross-domain Synthesis generates training samples aligned with target domain using multi-frequency wavelet filtering; 2) Real-domain Noise Invariance Learning extracts real noise characteristics to build dynamic noise library and learns noise invariance through self-supervised loss.

Result: Outperforms existing state-of-the-art methods in quantitative metrics and demonstrates excellent robustness in cross-domain scenarios. Validated on Dynamic-ISTD Benchmark and other real-world datasets.

Conclusion: Ivan-ISTD effectively addresses cross-domain shift and noise perturbations in infrared small target detection, showing superior performance and robustness compared to existing methods.

Abstract: In the multimedia domain, Infrared Small Target Detection (ISTD) plays a important role in drone-based multi-modality sensing. To address the dual challenges of cross-domain shift and heteroscedastic noise perturbations in ISTD, we propose a doubly wavelet-guided Invariance learning framework(Ivan-ISTD). In the first stage, we generate training samples aligned with the target domain using Wavelet-guided Cross-domain Synthesis. This wavelet-guided alignment machine accurately separates the target background through multi-frequency wavelet filtering. In the second stage, we introduce Real-domain Noise Invariance Learning, which extracts real noise characteristics from the target domain to build a dynamic noise library. The model learns noise invariance through self-supervised loss, thereby overcoming the limitations of distribution bias in traditional artificial noise modeling. Finally, we create the Dynamic-ISTD Benchmark, a cross-domain dynamic degradation dataset that simulates the distribution shifts encountered in real-world applications. Additionally, we validate the versatility of our method using other real-world datasets. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods in terms of many quantitative metrics. In particular, Ivan-ISTD demonstrates excellent robustness in cross-domain scenarios. The code for this work can be found at: https://github.com/nanjin1/Ivan-ISTD.

[129] Prompt-Guided Spatial Understanding with RGB-D Transformers for Fine-Grained Object Relation Reasoning

Tanner Muturi, Blessing Agyei Kyem, Joshua Kofi Asamoah, Neema Jakisa Owor, Richard Dyzinela, Andrews Danyo, Yaw Adu-Gyamfi, Armstrong Aboah

Main category: cs.CV

TL;DR: A spatial reasoning framework for warehouse environments that embeds bounding box coordinates in prompts to improve spatial understanding, achieving 4th place on the AI City Challenge leaderboard.

Details

Motivation: Existing vision-language systems struggle with spatial reasoning in cluttered 3D environments like warehouses due to reliance on local appearance and lack of explicit spatial grounding.

Method: Embed mask dimensions as bounding box coordinates in input prompts, fine-tune across four question categories with task-specific supervision, and append normalized answers to GPT responses for evaluation consistency.

Result: Achieved a final score of 73.0606, ranking 4th overall on the public leaderboard of the AI City Challenge Track 3 2025.

Conclusion: Structured prompt enrichment and targeted optimization effectively advance spatial reasoning capabilities for real-world industrial environments.

Abstract: Spatial reasoning in large-scale 3D environments such as warehouses remains a significant challenge for vision-language systems due to scene clutter, occlusions, and the need for precise spatial understanding. Existing models often struggle with generalization in such settings, as they rely heavily on local appearance and lack explicit spatial grounding. In this work, we introduce a dedicated spatial reasoning framework for the Physical AI Spatial Intelligence Warehouse dataset introduced in the Track 3 2025 AI City Challenge. Our approach enhances spatial comprehension by embedding mask dimensions in the form of bounding box coordinates directly into the input prompts, enabling the model to reason over object geometry and layout. We fine-tune the framework across four question categories namely: Distance Estimation, Object Counting, Multi-choice Grounding, and Spatial Relation Inference using task-specific supervision. To further improve consistency with the evaluation system, normalized answers are appended to the GPT response within the training set. Our comprehensive pipeline achieves a final score of 73.0606, placing 4th overall on the public leaderboard. These results demonstrate the effectiveness of structured prompt enrichment and targeted optimization in advancing spatial reasoning for real-world industrial environments.

[130] AngularFuse: A Closer Look at Angle-based Perception for Spatial-Sensitive Multi-Modality Image Fusion

Xiaopeng Liu, Yupei Lin, Sen Zhang, Xiao Wang, Yukai Shi, Liang Lin

Main category: cs.CV

TL;DR: AngularFuse is a novel visible-infrared image fusion method that uses angle-based perception framework with cross-modal complementary masks, fine-grained reference image synthesis, and angle-aware loss to simultaneously constrain gradient magnitude and direction for better fusion quality.

Details

Motivation: Existing unsupervised fusion methods rely on manually designed loss functions with limitations: constructed reference images lack details and have uneven brightness, and gradient losses only focus on magnitude while ignoring direction.

Method: Proposes AngularFuse with three key components: 1) cross-modal complementary mask module to learn complementary information, 2) fine-grained reference image synthesis using Laplacian edge enhancement and adaptive histogram equalization, 3) angle-aware loss that constrains both gradient magnitude and direction.

Result: Comprehensive experiments on MSRS, RoadScene, and M3FD datasets show AngularFuse outperforms existing methods with clear margin. Visual comparisons confirm sharper and more detailed results in challenging scenes.

Conclusion: AngularFuse demonstrates superior fusion capability by preserving both texture intensity and correct edge orientation, addressing limitations of existing methods through its angle-based perception framework.

Abstract: Visible-infrared image fusion is crucial in key applications such as autonomous driving and nighttime surveillance. Its main goal is to integrate multimodal information to produce enhanced images that are better suited for downstream tasks. Although deep learning based fusion methods have made significant progress, mainstream unsupervised approaches still face serious challenges in practical applications. Existing methods mostly rely on manually designed loss functions to guide the fusion process. However, these loss functions have obvious limitations. On one hand, the reference images constructed by existing methods often lack details and have uneven brightness. On the other hand, the widely used gradient losses focus only on gradient magnitude. To address these challenges, this paper proposes an angle-based perception framework for spatial-sensitive image fusion (AngularFuse). At first, we design a cross-modal complementary mask module to force the network to learn complementary information between modalities. Then, a fine-grained reference image synthesis strategy is introduced. By combining Laplacian edge enhancement with adaptive histogram equalization, reference images with richer details and more balanced brightness are generated. Last but not least, we introduce an angle-aware loss, which for the first time constrains both gradient magnitude and direction simultaneously in the gradient domain. AngularFuse ensures that the fused images preserve both texture intensity and correct edge orientation. Comprehensive experiments on the MSRS, RoadScene, and M3FD public datasets show that AngularFuse outperforms existing mainstream methods with clear margin. Visual comparisons further confirm that our method produces sharper and more detailed results in challenging scenes, demonstrating superior fusion capability.

[131] Evaluating the Explainability of Vision Transformers in Medical Imaging

Leili Barekatain, Ben Glocker

Main category: cs.CV

TL;DR: This paper evaluates Vision Transformer explainability methods for medical imaging, finding DINO with Grad-CAM provides the most faithful and localized explanations across blood cell and breast ultrasound classification tasks.

Details

Motivation: Interpretability is crucial for clinical trust in medical imaging AI. Vision Transformers achieve state-of-the-art performance but their complex attention mechanisms pose explainability challenges that need evaluation.

Method: Evaluated ViT, DeiT, DINO, and Swin Transformer architectures with Gradient Attention Rollout and Grad-CAM on peripheral blood cell classification and breast ultrasound classification tasks using quantitative and qualitative analysis.

Result: DINO combined with Grad-CAM offered the most faithful and localized explanations across datasets. Grad-CAM produced class-discriminative heatmaps while Gradient Attention Rollout yielded scattered activations. Even in misclassifications, DINO with Grad-CAM highlighted clinically relevant misleading features.

Conclusion: The research supports reliable integration of Vision Transformers into medical diagnostics by improving model transparency through effective explainability methods, particularly DINO with Grad-CAM.

Abstract: Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.

[132] Probabilistic Temporal Masked Attention for Cross-view Online Action Detection

Liping Xie, Yang Tan, Shicheng Jing, Huimin Lu, Kanjian Zhang

Main category: cs.CV

TL;DR: Proposes PTMA model for Online Action Detection using probabilistic modeling and temporal masked attention to handle viewpoint variations and improve cross-view generalization.

Details

Motivation: Mainstream OAD models are sensitive to varying video viewpoints, limiting their generalization to unseen sources. Need for robust view-invariant feature extraction.

Method: Probabilistic Temporal Masked Attention (PTMA) with GRU-based temporal masked attention cell, probabilistic modeling for latent compressed representations, and multi-view integration for view-invariant features.

Result: Achieves state-of-the-art performance on DAHLIA, IKEA ASM, and Breakfast datasets under cross-subject, cross-view, and cross-subject-view evaluation protocols.

Conclusion: PTMA effectively addresses viewpoint sensitivity in OAD through probabilistic modeling and temporal attention, enabling robust cross-view generalization and superior performance.

Abstract: As a critical task in video sequence classification within computer vision, Online Action Detection (OAD) has garnered significant attention. The sensitivity of mainstream OAD models to varying video viewpoints often hampers their generalization when confronted with unseen sources. To address this limitation, we propose a novel Probabilistic Temporal Masked Attention (PTMA) model, which leverages probabilistic modeling to derive latent compressed representations of video frames in a cross-view setting. The PTMA model incorporates a GRU-based temporal masked attention (TMA) cell, which leverages these representations to effectively query the input video sequence, thereby enhancing information interaction and facilitating autoregressive frame-level video analysis. Additionally, multi-view information can be integrated into the probabilistic modeling to facilitate the extraction of view-invariant features. Experiments conducted under three evaluation protocols: cross-subject (cs), cross-view (cv), and cross-subject-view (csv) show that PTMA achieves state-of-the-art performance on the DAHLIA, IKEA ASM, and Breakfast datasets.

[133] APGNet: Adaptive Prior-Guided for Underwater Camouflaged Object Detection

Xinxin Huang, Han Sun, Junmin Cai, Ningzhong Liu, Huiyu Zhou

Main category: cs.CV

TL;DR: APGNet is an Adaptive Prior-Guided Network for detecting camouflaged objects in underwater environments, addressing challenges of image degradation and marine organism camouflage through illumination-invariant data augmentation and hierarchical prior fusion.

Details

Motivation: Existing methods struggle with underwater image degradation (low contrast, color distortion) and fail to adapt terrestrial camouflaged object detection approaches to underwater environments due to lack of consideration for underwater optical characteristics.

Method: APGNet integrates Siamese architecture with adaptive prior-guided mechanism, using MSRCR for illumination-invariant data augmentation, Extended Receptive Field module with Multi-Scale Progressive Decoder for multi-scale context, and hierarchical fusion of position/boundary priors using spatial attention and deformable convolution.

Result: APGNet outperforms 15 state-of-the-art methods on two public MAS datasets under widely used evaluation metrics.

Conclusion: The proposed APGNet effectively addresses underwater camouflaged object detection challenges through adaptive prior guidance and illumination-invariant processing, demonstrating superior performance over existing methods.

Abstract: Detecting camouflaged objects in underwater environments is crucial for marine ecological research and resource exploration. However, existing methods face two key challenges: underwater image degradation, including low contrast and color distortion, and the natural camouflage of marine organisms. Traditional image enhancement techniques struggle to restore critical features in degraded images, while camouflaged object detection (COD) methods developed for terrestrial scenes often fail to adapt to underwater environments due to the lack of consideration for underwater optical characteristics. To address these issues, we propose APGNet, an Adaptive Prior-Guided Network, which integrates a Siamese architecture with a novel prior-guided mechanism to enhance robustness and detection accuracy. First, we employ the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for data augmentation, generating illumination-invariant images to mitigate degradation effects. Second, we design an Extended Receptive Field (ERF) module combined with a Multi-Scale Progressive Decoder (MPD) to capture multi-scale contextual information and refine feature representations. Furthermore, we propose an adaptive prior-guided mechanism that hierarchically fuses position and boundary priors by embedding spatial attention in high-level features for coarse localization and using deformable convolution to refine contours in low-level features. Extensive experimental results on two public MAS datasets demonstrate that our proposed method APGNet outperforms 15 state-of-art methods under widely used evaluation metrics.

[134] VIDMP3: Video Editing by Representing Motion with Pose and Position Priors

Sandeep Mishra, Oindrila Saha, Alan C. Bovik

Main category: cs.CV

TL;DR: VidMP3 is a novel video editing method that preserves original motion while allowing structural and semantic flexibility in swapped objects, addressing temporal inconsistency and identity drift issues in existing methods.

Details

Motivation: Motion-preserved video editing is crucial for creators needing flexibility in structure and semantics of swapped objects, but existing methods struggle with temporal inconsistency, subject identity drift, and require human intervention.

Method: Leverages pose and position priors to learn a generalized motion representation from source videos, enabling generation of new videos that maintain original motion while allowing structural and semantic flexibility.

Result: Both qualitative and quantitative evaluations demonstrate superiority over existing methods in maintaining motion consistency while enabling flexible structural and semantic editing.

Conclusion: VidMP3 successfully addresses key challenges in motion-preserved video editing and outperforms current approaches, with code to be made publicly available.

Abstract: Motion-preserved video editing is crucial for creators, particularly in scenarios that demand flexibility in both the structure and semantics of swapped objects. Despite its potential, this area remains underexplored. Existing diffusion-based editing methods excel in structure-preserving tasks, using dense guidance signals to ensure content integrity. While some recent methods attempt to address structure-variable editing, they often suffer from issues such as temporal inconsistency, subject identity drift, and the need for human intervention. To address these challenges, we introduce VidMP3, a novel approach that leverages pose and position priors to learn a generalized motion representation from source videos. Our method enables the generation of new videos that maintain the original motion while allowing for structural and semantic flexibility. Both qualitative and quantitative evaluations demonstrate the superiority of our approach over existing methods. The code will be made publicly available at https://github.com/sandeep-sm/VidMP3.

[135] A Review on Domain Adaption and Generative Adversarial Networks(GANs)

Aashish Dhawan, Divyanshu Mudgal

Main category: cs.CV

TL;DR: The paper discusses Domain Adaptation methods to address the challenge of limited labeled data in computer vision by using models trained on one dataset to predict on different domains.

Details

Motivation: To overcome the scarcity of labeled data in computer vision, especially in image classification, due to high labeling costs and difficulty in obtaining labeled data.

Method: Domain Adaptation techniques that enable models trained on one dataset to predict on different domains of the same type (e.g., paintings to real images).

Result: Methods that can produce comparable results to benchmark performance despite data scarcity.

Conclusion: Domain Adaptation provides reliable solutions for handling limited labeled data scenarios in computer vision applications.

Abstract: The major challenge in today’s computer vision scenario is the availability of good quality labeled data. In a field of study like image classification, where data is of utmost importance, we need to find more reliable methods which can overcome the scarcity of data to produce results comparable to previous benchmark results. In most cases, obtaining labeled data is very difficult because of the high cost of human labor and in some cases impossible. The purpose of this paper is to discuss Domain Adaptation and various methods to implement it. The main idea is to use a model trained on a particular dataset to predict on data from a different domain of the same kind, for example - a model trained on paintings of airplanes predicting on real images of airplanes

[136] Playmate2: Training-Free Multi-Character Audio-Driven Animation via Diffusion Transformer with Reward Feedback

Xingpei Ma, Shenneng Huang, Jiaran Cai, Yuansheng Guan, Shen Zheng, Hanfeng Zhao, Qiang Zhang, Shunsi Zhang

Main category: cs.CV

TL;DR: A diffusion transformer-based framework for generating lifelike talking videos of arbitrary length with improved lip-sync accuracy, temporal coherence, and multi-character animation capabilities.

Details

Motivation: Existing audio-driven human video generation methods face challenges in lip-sync accuracy, temporal coherence for long videos, and multi-character animation, requiring improved solutions.

Method: Uses DiT-based framework with LoRA training strategy and position shift inference for long videos, combines partial parameter updates with reward feedback for better lip sync and body motion, and introduces Mask-CFG for training-free multi-character animation.

Result: Outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation efficiently.

Conclusion: The proposed method provides a simple, efficient, and cost-effective solution for high-quality audio-driven video generation with improved capabilities for long videos and multi-character scenarios.

Abstract: Recent advances in diffusion models have significantly improved audio-driven human video generation, surpassing traditional methods in both quality and controllability. However, existing approaches still face challenges in lip-sync accuracy, temporal coherence for long video generation, and multi-character animation. In this work, we propose a diffusion transformer (DiT)-based framework for generating lifelike talking videos of arbitrary length, and introduce a training-free method for multi-character audio-driven animation. First, we employ a LoRA-based training strategy combined with a position shift inference approach, which enables efficient long video generation while preserving the capabilities of the foundation model. Moreover, we combine partial parameter updates with reward feedback to enhance both lip synchronization and natural body motion. Finally, we propose a training-free approach, Mask Classifier-Free Guidance (Mask-CFG), for multi-character animation, which requires no specialized datasets or model modifications and supports audio-driven animation for three or more characters. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches, achieving high-quality, temporally coherent, and multi-character audio-driven video generation in a simple, efficient, and cost-effective manner.

[137] IL3D: A Large-Scale Indoor Layout Dataset for LLM-Driven 3D Scene Generation

Wenxu Zhou, Kaixuan Nie, Hang Du, Dong Yin, Wei Huang, Siqiang Guo, Xiaobo Zhang, Pengbo Hu

Main category: cs.CV

TL;DR: IL3D is a large-scale dataset for LLM-driven 3D scene generation with 27,816 indoor layouts, 29,215 3D objects, and natural language annotations, enabling improved generalization through supervised fine-tuning.

Details

Motivation: Address the pressing demand for diverse, high-quality training data in indoor layout design to support multimodal learning for vision-language tasks.

Method: Created IL3D dataset with instance-level natural language annotations, established rigorous benchmarks for LLM-driven scene generation, and performed supervised fine-tuning (SFT) of LLMs on the dataset.

Result: SFT of LLMs on IL3D significantly improves generalization and surpasses performance compared to SFT on other datasets. The dataset provides flexible multimodal data export capabilities.

Conclusion: IL3D advances research in 3D scene generation and embodied intelligence by providing high-fidelity scene data to support environment perception tasks of embodied agents.

Abstract: In this study, we present IL3D, a large-scale dataset meticulously designed for large language model (LLM)-driven 3D scene generation, addressing the pressing demand for diverse, high-quality training data in indoor layout design. Comprising 27,816 indoor layouts across 18 prevalent room types and a library of 29,215 high-fidelity 3D object assets, IL3D is enriched with instance-level natural language annotations to support robust multimodal learning for vision-language tasks. We establish rigorous benchmarks to evaluate LLM-driven scene generation. Experimental results show that supervised fine-tuning (SFT) of LLMs on IL3D significantly improves generalization and surpasses the performance of SFT on other datasets. IL3D offers flexible multimodal data export capabilities, including point clouds, 3D bounding boxes, multiview images, depth maps, normal maps, and semantic masks, enabling seamless adaptation to various visual tasks. As a versatile and robust resource, IL3D significantly advances research in 3D scene generation and embodied intelligence, by providing high-fidelity scene data to support environment perception tasks of embodied agents.

[138] An Adaptive Edge-Guided Dual-Network Framework for Fast QR Code Motion Deblurring

Jianping Li, Dongyang Guo, Wenjie Li, Wei Zhao

Main category: cs.CV

TL;DR: Proposed Edge-Guided Restormer (EG-Restormer) and Adaptive Dual-network (ADNet) for QR code deblurring, using explicit edge priors to improve decoding rates and dynamically selecting networks based on blur severity.

Details

Motivation: Existing deep learning methods for QR code deblurring don't exploit the structured patterns and sharp edge priors of QR codes, while current approaches prioritize perceptual quality over successful decoding.

Method: Developed Edge-Guided Attention Block (EGAB) to embed explicit edge priors into Transformers, created EG-Restormer for severe blur and LENet for mild blur, then integrated them into ADNet for adaptive network selection.

Result: EG-Restormer and ADNet achieve state-of-the-art performance with competitive speed, significantly boosting decoding rates for severely blurred QR codes.

Conclusion: The proposed edge-guided approach and adaptive dual-network architecture effectively address QR code deblurring with explicit structural priors, making it suitable for mobile devices.

Abstract: Unlike general image deblurring that prioritizes perceptual quality, QR code deblurring focuses on ensuring successful decoding. QR codes are characterized by highly structured patterns with sharp edges, a robust prior for restoration. Yet existing deep learning methods rarely exploit these priors explicitly. To address this gap, we propose the Edge-Guided Attention Block (EGAB), which embeds explicit edge priors into a Transformer architecture. Based on EGAB, we develop Edge-Guided Restormer (EG-Restormer), an effective network that significantly boosts the decoding rate of severely blurred QR codes. For mildly blurred inputs, we design the Lightweight and Efficient Network (LENet) for fast deblurring. We further integrate these two networks into an Adaptive Dual-network (ADNet), which dynamically selects the suitable network based on input blur severity, making it ideal for resource-constrained mobile devices. Extensive experiments show that our EG-Restormer and ADNet achieve state-of-the-art performance with a competitive speed. Project page: https://github.com/leejianping/ADNet

[139] G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

Junfeng Ni, Yixin Chen, Zhifei Yang, Yu Liu, Ruijie Lu, Song-Chun Zhu, Siyuan Huang

Main category: cs.CV

TL;DR: This paper proposes a method to improve 3D scene reconstruction using generative diffusion models by leveraging accurate geometry guidance from planar structures to address limitations in reconstruction quality and multi-view consistency.

Details

Motivation: Existing methods using generative priors from diffusion models struggle with poor reconstruction quality in both observed and unobserved regions due to lack of geometric supervision, and suffer from multi-view inconsistencies leading to shape-appearance ambiguities.

Method: The method leverages planar structures to derive accurate metric-scale depth maps for reliable supervision. It incorporates geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models.

Result: Extensive experiments on Replica, ScanNet++, and DeepBlending show the method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. It also supports single-view inputs and unposed videos with strong generalizability across indoor and outdoor scenarios.

Conclusion: Accurate geometry is identified as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. The proposed geometry-guided approach enables high-quality and consistent scene completion with practical real-world applicability.

Abstract: Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.

[140] DRL: Discriminative Representation Learning with Parallel Adapters for Class Incremental Learning

Jiawei Zhan, Jun Liu, Jinlong Peng, Xiaochen Chen, Bin-Bin Gao, Yong Liu, Chengjie Wang

Main category: cs.CV

TL;DR: Proposes Discriminative Representation Learning (DRL) framework with Incremental Parallel Adapter (IPA) network and Decoupled Anchor Supervision (DAS) to address challenges in non-rehearsal Class-Incremental Learning, achieving state-of-the-art performance with high efficiency.

Details

Motivation: Address three key challenges in non-rehearsal Class-Incremental Learning: large model complexity, non-smooth representation shift during incremental learning, and inconsistency between stage-wise optimization and global inference.

Method: Uses Incremental Parallel Adapter (IPA) network built on pre-trained models with lightweight adapters for efficient incremental learning, and Decoupled Anchor Supervision (DAS) that decouples positive/negative sample constraints using virtual anchors for discriminative representation learning.

Result: Extensive experiments on six benchmarks show DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period while maintaining high training and inference efficiency.

Conclusion: The proposed DRL framework effectively addresses key challenges in non-rehearsal CIL through efficient adapter-based learning and decoupled supervision, achieving superior performance and smooth representation transitions across incremental stages.

Abstract: With the excellent representation capabilities of Pre-Trained Models (PTMs), remarkable progress has been made in non-rehearsal Class-Incremental Learning (CIL) research. However, it remains an extremely challenging task due to three conundrums: increasingly large model complexity, non-smooth representation shift during incremental learning and inconsistency between stage-wise sub-problem optimization and global inference. In this work, we propose the Discriminative Representation Learning (DRL) framework to specifically address these challenges. To conduct incremental learning effectively and yet efficiently, the DRL’s network, called Incremental Parallel Adapter (IPA) network, is built upon a PTM and increasingly augments the model by learning a lightweight adapter with a small amount of parameter learning overhead in each incremental stage. The adapter is responsible for adapting the model to new classes, it can inherit and propagate the representation capability from the current model through parallel connection between them by a transfer gate. As a result, this design guarantees a smooth representation shift between different incremental stages. Furthermore, to alleviate inconsistency and enable comparable feature representations across incremental stages, we design the Decoupled Anchor Supervision (DAS). It decouples constraints of positive and negative samples by respectively comparing them with the virtual anchor. This decoupling promotes discriminative representation learning and aligns the feature spaces learned at different stages, thereby narrowing the gap between stage-wise local optimization over a subset of data and global inference across all classes. Extensive experiments on six benchmarks reveal that our DRL consistently outperforms other state-of-the-art methods throughout the entire CIL period while maintaining high efficiency in both training and inference phases.

[141] Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration

Wenjie Li, Xiangyi Wang, Heng Guo, Guangwei Gao, Zhanyu Ma

Main category: cs.CV

TL;DR: SSDiff is a self-supervised selective-guided diffusion method for old-photo face restoration that uses pseudo-reference faces and staged supervision to address localized artifacts and color issues.

Details

Motivation: Existing diffusion-guided methods struggle with localized artifacts and face color in old-photo restoration due to compounded degradations like breakage, fading, and severe blur.

Method: Uses pseudo-reference faces from pre-trained diffusion model under weak guidance, with staged supervision: structural guidance throughout denoising and color refinement in later steps. Incorporates face parsing maps and scratch masks for selective restoration.

Result: Outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability on the constructed VintageFace benchmark of 300 real old photos.

Conclusion: SSDiff effectively addresses old-photo face restoration challenges through selective guidance and staged supervision, achieving superior restoration quality while maintaining identity consistency.

Abstract: Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: https://github.com/PRIS-CV/SSDiff.

[142] ImageSentinel: Protecting Visual Datasets from Unauthorized Retrieval-Augmented Image Generation

Ziyuan Luo, Yangyi Zhao, Ka Chun Cheung, Simon See, Renjie Wan

Main category: cs.CV

TL;DR: ImageSentinel is a framework that protects visual datasets from unauthorized use in Retrieval-Augmented Image Generation (RAIG) systems by generating sentinel images with embedded character sequences for verification.

Details

Motivation: Address concerns about unauthorized use of private image datasets in RAIG systems, as traditional watermarking fails due to complex feature extraction processes that don't preserve watermarks.

Method: Leverages vision-language models to synthesize sentinel images that maintain visual consistency with original datasets, embedding randomly generated character sequences as retrieval keys for protection verification.

Result: Experimental results show ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications.

Conclusion: ImageSentinel provides an effective solution for protecting visual datasets in RAIG systems, overcoming limitations of traditional watermarking approaches.

Abstract: The widespread adoption of Retrieval-Augmented Image Generation (RAIG) has raised significant concerns about the unauthorized use of private image datasets. While these systems have shown remarkable capabilities in enhancing generation quality through reference images, protecting visual datasets from unauthorized use in such systems remains a challenging problem. Traditional digital watermarking approaches face limitations in RAIG systems, as the complex feature extraction and recombination processes fail to preserve watermark signals during generation. To address these challenges, we propose ImageSentinel, a novel framework for protecting visual datasets in RAIG. Our framework synthesizes sentinel images that maintain visual consistency with the original dataset. These sentinels enable protection verification through randomly generated character sequences that serve as retrieval keys. To ensure seamless integration, we leverage vision-language models to generate the sentinel images. Experimental results demonstrate that ImageSentinel effectively detects unauthorized dataset usage while preserving generation quality for authorized applications. Code is available at https://github.com/luo-ziyuan/ImageSentinel.

[143] DarkIR: Robust Low-Light Image Restoration

Daniel Feijoo, Juan C. Benito, Alvaro Garcia, Marcos V. Conde

Main category: cs.CV

TL;DR: DarkIR is an efficient CNN-based neural network for multi-task low-light image restoration that addresses noise, low light, and blurring issues simultaneously, achieving state-of-the-art results with reduced computational costs.

Details

Motivation: Night and dark condition photography suffers from noise, low light, and blurring issues, but current approaches typically solve deblurring and low-light enhancement separately rather than addressing them jointly.

Method: Proposes new attention mechanisms to enhance the receptive field of efficient CNNs instead of using Transformer-based models, reducing computational costs in parameters and MAC operations.

Result: Achieves state-of-the-art results on LOLBlur, LOLv2 and Real-LOLBlur datasets, demonstrating generalization capability on real-world night and dark images.

Conclusion: DarkIR provides an efficient and robust solution for multi-task low-light image restoration that outperforms previous methods while being computationally efficient.

Abstract: Photography during night or in dark conditions typically suffers from noise, low light and blurring issues due to the dim environment and the common use of long exposure. Although Deblurring and Low-light Image Enhancement (LLIE) are related under these conditions, most approaches in image restoration solve these tasks separately. In this paper, we present an efficient and robust neural network for multi-task low-light image restoration. Instead of following the current tendency of Transformer-based models, we propose new attention mechanisms to enhance the receptive field of efficient CNNs. Our method reduces the computational costs in terms of parameters and MAC operations compared to previous methods. Our model, DarkIR, achieves new state-of-the-art results on the popular LOLBlur, LOLv2 and Real-LOLBlur datasets, being able to generalize on real-world night and dark images. Code and models at https://github.com/cidautai/DarkIR

[144] Hardware-aware Coding Function Design for Compressive Single-Photon 3D Cameras

David Parra, Felipe Gutierrez-Barragan, Trevor Seets, Andreas Velten

Main category: cs.CV

TL;DR: A constrained optimization method for designing practical coding functions in compressive single-photon 3D imaging that addresses hardware limitations like bandwidth and peak power constraints.

Details

Motivation: Single-photon cameras face performance limitations due to hardware constraints including system bandwidth, laser power, sensor data rates, and in-sensor resources. Traditional compressive histograms underperform under real-world illumination hardware constraints.

Method: A constrained optimization approach using gradient descent to jointly optimize illumination and coding matrices that adhere to hardware constraints, adapting to arbitrary parameterized impulse responses.

Result: The optimized coding functions consistently outperform traditional designs under both bandwidth and peak power constraints, with particularly strong advantages in peak-power-constrained systems. The method also works effectively with real-world non-ideal impulse responses.

Conclusion: The constrained optimization approach enables practical coding function design for compressive single-photon 3D imaging that overcomes hardware limitations and adapts to real-world system characteristics.

Abstract: Single-photon cameras are becoming increasingly popular in time-of-flight 3D imaging because they can time-tag individual photons with extreme resolution. However, their performance is susceptible to hardware limitations, such as system bandwidth, maximum laser power, sensor data rates, and in-sensor memory and compute resources. Compressive histograms were recently introduced as a solution to the challenge of data rates through an online in-sensor compression of photon timestamp data. Although compressive histograms work within limited in-sensor memory and computational resources, they underperform when subjected to real-world illumination hardware constraints. To address this, we present a constrained optimization approach for designing practical coding functions for compressive single-photon 3D imaging. Using gradient descent, we jointly optimize an illumination and coding matrix (i.e., the coding functions) that adheres to hardware constraints. We show through extensive simulations that our coding functions consistently outperform traditional coding designs under both bandwidth and peak power constraints. This advantage is particularly pronounced in systems constrained by peak power. Finally, we show that our approach adapts to arbitrary parameterized impulse responses by evaluating it on a real-world system with a non-ideal impulse response function.

[145] MetaCaptioner: Towards Generalist Visual Captioning with Open-source Suites

Zhenxin Lei, Zhangwei Gao, Changyao Tian, Erfei Cui, Guanzhou Chen, Danni Yang, Yuchen Duan, Zhaokai Wang, Wenhao Li, Weiyun Wang, Xiangyu Zhao, Jiayi Ji, Yu Qiao, Wenhai Wang, Gen Luo

Main category: cs.CV

TL;DR: CapFlow is a multi-agent workflow that achieves GPT-4-level caption quality using open-source models with 89.5% cost reduction, enabling scalable data synthesis and training MetaCaptioner.

Details

Motivation: To bridge the performance gap between open-source and commercial visual captioning models, enabling cost-effective applications like data synthesis.

Method: Proposes CapFlow, a multi-agent collaboration workflow that leverages open-source models to generate high-quality captions, then uses this as data synthesizer to train MetaCaptioner via fine-tuning.

Result: CapFlow achieves caption quality comparable to GPT-4 with 89.5% cost reduction. MetaCaptioner reaches top-tier multimodal performance in open-source community and matches commercial models.

Conclusion: CapFlow and MetaCaptioner provide a strong, cost-effective visual captioning solution that can benefit future multimodal research.

Abstract: Generalist visual captioning goes beyond a simple appearance description task, but requires integrating a series of visual cues into a caption and handling various visual domains. In this task, current open-source models present a large performance gap with commercial ones, which limits various applications such as data synthesis. To bridge the gap, this paper proposes CapFlow, a novel multi-agent collaboration workflow. CapFlow demonstrates for the first time that, by capitalizing on open-source models, it is possible to achieve caption quality on par with GPT-4.1 in various domains with an 89.5% reduction in costs. By leveraging CapFlow as the data synthesizer, we produce high-quality visual captions from image and video domains at scale, and obtain a generalist visual captioner via fine-tuning, namely MetaCaptioner. Through extensive experiments, we show that MetaCaptioner not only achieves comparable captioning capabilities with commercial models but also reaches top-tier multimodal performance in the open-source community. We hope CapFlow and MetaCaptioner can benefit future multimodal research by providing a strong and cost-effective visual captioning solution.

[146] FedHUG: Federated Heterogeneous Unsupervised Generalization for Remote Physiological Measurements

Xiao Yang, Jiyao Wang

Main category: cs.CV

TL;DR: FedHUG is a federated learning framework for remote physiological measurement that addresses privacy concerns and unlabeled client data through unsupervised domain generalization.

Details

Motivation: Remote physiological measurement requires collecting privacy-sensitive user data, and existing contactless methods rely on labeled client data, creating challenges for updating deployed models with unlabeled user data.

Method: FedHUG framework includes: (1) Minimal Bias Aggregation module that dynamically adjusts aggregation weights based on bias evaluation for heterogeneous non-IID features, and (2) Global Distribution-aware Learning Controller that parameterizes label distribution and manipulates client training strategies to mitigate server-client distribution skew and long-tail issues.

Result: The framework shows superior performance compared to state-of-the-art techniques in estimation using either RGB video or mmWave radar.

Conclusion: FedHUG effectively addresses privacy and unlabeled data challenges in remote physiological measurement through federated unsupervised domain generalization.

Abstract: Remote physiological measurement gained wide attention, while it requires collecting users’ privacy-sensitive information, and existing contactless measurements still rely on labeled client data. This presents challenges when we want to further update real-world deployed models with numerous user data lacking labels. To resolve these challenges, we instantiate a new protocol called Federated Unsupervised Domain Generalization (FUDG) in this work. Subsequently, the \textbf{Fed}erated \textbf{H}eterogeneous \textbf{U}nsupervised \textbf{G}eneralization (\textbf{FedHUG}) framework is proposed and consists of: (1) Minimal Bias Aggregation module dynamically adjusts aggregation weights based on prior-driven bias evaluation to cope with heterogeneous non-IID features from multiple domains. (2) The Global Distribution-aware Learning Controller parameterizes the label distribution and dynamically manipulates client-specific training strategies, thereby mitigating the server-client label distribution skew and long-tail issue. The proposal shows superior performance across state-of-the-art techniques in estimation with either RGB video or mmWave radar. The code will be released.

[147] Class-aware Domain Knowledge Fusion and Fission for Continual Test-Time Adaptation

Jiahuan Zhou, Chao Zhu, Zhenyu Cui, Zichen Liu, Xu Zou, Gang Hua

Main category: cs.CV

TL;DR: KFF method for Continual Test-Time Adaptation that uses class-aware domain knowledge fusion and fission to dynamically accumulate discriminative historical knowledge while avoiding negative knowledge interference.

Details

Motivation: Existing CTTA methods suffer from catastrophic forgetting of historical knowledge and insufficient learning of new knowledge due to irregular domain switching, leading to performance degradation.

Method: Proposes KFF with two modules: 1) Knowledge Fission (KFI) to separate new domain knowledge from class-aware domain prompt pool, and 2) Knowledge Fusion (KFU) to merge new knowledge with existing pool using greedy dynamic merging strategy.

Result: Extensive experiments on ImageNet-C dataset verify the method’s effectiveness against other approaches.

Conclusion: KFF successfully addresses the challenges of catastrophic forgetting and knowledge interference in CTTA through adaptive knowledge fusion and fission mechanisms.

Abstract: Continual Test-Time Adaptation (CTTA) aims to quickly fine-tune the model during the test phase so that it can adapt to multiple unknown downstream domain distributions without pre-acquiring downstream domain data. To this end, existing advanced CTTA methods mainly reduce the catastrophic forgetting of historical knowledge caused by irregular switching of downstream domain data by restoring the initial model or reusing historical models. However, these methods are usually accompanied by serious insufficient learning of new knowledge and interference from potentially harmful historical knowledge, resulting in severe performance degradation. To this end, we propose a class-aware domain Knowledge Fusion and Fission method for continual test-time adaptation, called KFF, which adaptively expands and merges class-aware domain knowledge in old and new domains according to the test-time data from different domains, where discriminative historical knowledge can be dynamically accumulated. Specifically, considering the huge domain gap within streaming data, a domain Knowledge FIssion (KFI) module is designed to adaptively separate new domain knowledge from a paired class-aware domain prompt pool, alleviating the impact of negative knowledge brought by old domains that are distinct from the current domain. Besides, to avoid the cumulative computation and storage overheads from continuously fissioning new knowledge, a domain Knowledge FUsion (KFU) module is further designed to merge the fissioned new knowledge into the existing knowledge pool with minimal cost, where a greedy knowledge dynamic merging strategy is designed to improve the compatibility of new and old knowledge while keeping the computational efficiency. Extensive experiments on the ImageNet-C dataset verify the effectiveness of our proposed method against other methods.

[148] DPL: Spatial-Conditioned Diffusion Prototype Enhancement for One-Shot Medical Segmentation

Ziyuan Gao, Philippe Morel

Main category: cs.CV

TL;DR: DPL introduces diffusion-based prototype learning for one-shot medical image segmentation, modeling prototypes as probability distributions to generate diverse variants from limited data.

Details

Motivation: Traditional prototype methods fail to capture intra-class diversity due to deterministic averaging, limiting robustness against anatomical variability in medical images.

Method: Uses diffusion processes to enhance prototypes: (1) diffusion-based prototype enhancement, (2) spatial-aware conditioning using geometric properties, (3) conservative fusion strategy preserving fidelity while maximizing diversity.

Result: Achieves state-of-the-art performance on abdominal MRI and CT datasets for one-shot medical image segmentation.

Conclusion: DPL effectively addresses prototype representation challenges in one-shot segmentation by modeling prototypes as distributions and using diffusion processes for robust feature space exploration.

Abstract: One-shot medical image segmentation faces fundamental challenges in prototype representation due to limited annotated data and significant anatomical variability across patients. Traditional prototype-based methods rely on deterministic averaging of support features, creating brittle representations that fail to capture intra-class diversity essential for robust generalization. This work introduces Diffusion Prototype Learning (DPL), a novel framework that reformulates prototype construction through diffusion-based feature space exploration. DPL models one-shot prototypes as learnable probability distributions, enabling controlled generation of diverse yet semantically coherent prototype variants from minimal labeled data. The framework operates through three core innovations: (1) a diffusion-based prototype enhancement module that transforms single support prototypes into diverse variant sets via forward-reverse diffusion processes, (2) a spatial-aware conditioning mechanism that leverages geometric properties derived from prototype feature statistics, and (3) a conservative fusion strategy that preserves prototype fidelity while maximizing representational diversity. DPL ensures training-inference consistency by using the same diffusion enhancement and fusion pipeline in both phases. This process generates enhanced prototypes that serve as the final representations for similarity calculations, while the diffusion process itself acts as a regularizer. Extensive experiments on abdominal MRI and CT datasets demonstrate significant improvements respectively, establishing new state-of-the-art performance in one-shot medical image segmentation.

[149] State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding

Jiahuan Zhou, Kai Zhu, Zhenyu Cui, Zichen Liu, Xu Zou, Gang Hua

Main category: cs.CV

TL;DR: SSP is a State Space Prompting method that improves video understanding by combining intra-frame and inter-frame prompts to better capture spatiotemporal information in videos, outperforming SOTA methods by 2.76% on average with fewer fine-tuning parameters.

Details

Motivation: Existing pre-trained state space models for video classification fail to effectively capture spatial and temporal contextual information when using prompt learning, limiting their ability to propagate spatial information within frames and temporal information between frames.

Method: Proposed State Space Prompting (SSP) with two modules: Intra-Frame Gathering (IFG) to aggregate spatial key information within each frame, and Inter-Frame Spreading (IFS) to spread discriminative spatiotemporal information across frames. The method adaptively balances and compresses key spatiotemporal information.

Result: Extensive experiments on four video benchmark datasets show SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing fine-tuning parameter overhead.

Conclusion: SSP effectively propagates discriminative information in videos through complementary intra-frame and inter-frame prompting, achieving superior performance with efficient parameter usage.

Abstract: Recently, pre-trained state space models have shown great potential for video classification, which sequentially compresses visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning is proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, thus limiting the effective propagation of spatial information within a video frame and temporal information between frames in the state compression model and the extraction of discriminative information. To tackle the above issue, we proposed a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame. Besides, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.

[150] UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering

Yusen Xie, Zhenmin Huang, Jianhao Jiao, Dimitrios Kanoulas, Jun Ma

Main category: cs.CV

TL;DR: UniGS is a unified framework for high-fidelity multimodal 3D reconstruction using 3D Gaussian Splatting, featuring differentiable rendering of RGB, depth, normals, and semantics with improved geometric accuracy and efficiency.

Details

Motivation: To achieve high-fidelity multimodal 3D reconstruction with geometric consistency across different modalities (RGB, depth, normals, semantics) while maintaining computational efficiency.

Method: Uses 3D Gaussian Splatting with redesigned rasterization for differentiable depth rendering via ray-ellipsoid intersection, analytic gradient formulation for surface normal rendering, and learnable attributes for differentiable pruning of Gaussians.

Result: State-of-the-art reconstruction accuracy across all modalities demonstrated through quantitative and qualitative experiments.

Conclusion: The proposed geometry-aware paradigm effectively enables high-fidelity multimodal 3D reconstruction with improved geometric consistency and computational efficiency.

Abstract: In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attribute through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and multimodal viewer will be available on GitHub.

[151] BEEP3D: Box-Supervised End-to-End Pseudo-Mask Generation for 3D Instance Segmentation

Youngju Yoo, Seho Kim, Changick Kim

Main category: cs.CV

TL;DR: BEEP3D is an end-to-end box-supervised 3D instance segmentation method that uses a student-teacher framework with pseudo-mask generation, avoiding the need for expensive point-level annotations.

Details

Motivation: To reduce the high annotation costs of fully supervised 3D instance segmentation by using box-level annotations instead of dense point-level labels, while overcoming the ambiguity challenges in overlapping regions.

Method: Uses a student-teacher framework where the teacher generates pseudo-masks and is updated via Exponential Moving Average. Includes instance center-based query refinement and two novel losses: query consistency loss and masked feature consistency loss.

Result: Achieves competitive or superior performance compared to state-of-the-art weakly supervised methods on ScanNetV2 and S3DIS datasets while remaining computationally efficient.

Conclusion: BEEP3D provides an effective end-to-end solution for box-supervised 3D instance segmentation that eliminates the need for multi-stage training pipelines and expensive point-level annotations.

Abstract: 3D instance segmentation is crucial for understanding complex 3D environments, yet fully supervised methods require dense point-level annotations, resulting in substantial annotation costs and labor overhead. To mitigate this, box-level annotations have been explored as a weaker but more scalable form of supervision. However, box annotations inherently introduce ambiguity in overlapping regions, making accurate point-to-instance assignment challenging. Recent methods address this ambiguity by generating pseudo-masks through training a dedicated pseudo-labeler in an additional training stage. However, such two-stage pipelines often increase overall training time and complexity, hinder end-to-end optimization. To overcome these challenges, we propose BEEP3D-Box-supervised End-to-End Pseudo-mask generation for 3D instance segmentation. BEEP3D adopts a student-teacher framework, where the teacher model serves as a pseudo-labeler and is updated by the student model via an Exponential Moving Average. To better guide the teacher model to generate precise pseudo-masks, we introduce an instance center-based query refinement that enhances position query localization and leverages features near instance centers. Additionally, we design two novel losses-query consistency loss and masked feature consistency loss-to align semantic and geometric signals between predictions and pseudo-masks. Extensive experiments on ScanNetV2 and S3DIS datasets demonstrate that BEEP3D achieves competitive or superior performance compared to state-of-the-art weakly supervised methods while remaining computationally efficient.

[152] CompoDistill: Attention Distillation for Compositional Reasoning in Multimodal LLMs

Jiwan Kim, Kibum Kim, Sangwoo Seo, Chanyoung Park

Main category: cs.CV

TL;DR: CompoDistill is a knowledge distillation framework that aligns visual attention between teacher and student MLLMs to improve visual perception in compositional reasoning tasks.

Details

Motivation: Existing knowledge distillation methods struggle to effectively transfer rich visual perception abilities from teacher to student MLLMs, with visual attention misalignment identified as the main cause.

Method: Proposes CompoDistill framework that explicitly aligns the student’s visual attention with the teacher’s to enhance visual perception abilities through systematic attention alignment.

Result: Significantly improves performance on compositional reasoning tasks requiring visual perception while maintaining strong performance on visual question answering tasks, and demonstrates effectiveness with advanced backbones.

Conclusion: CompoDistill effectively addresses visual attention misalignment in MLLM knowledge distillation, enhancing student models’ visual perception capabilities and showing strong generalizability.

Abstract: Recently, efficient Multimodal Large Language Models (MLLMs) have gained significant attention as a solution to their high computational complexity, making them more practical for real-world applications. In this regard, the knowledge distillation (KD) approach has emerged as a promising alternative, which transfers the rich visual and linguistic knowledge from a larger model (teacher) to a smaller model (student). However, we observe that existing KD methods struggle to effectively distill the teacher MLLM’s rich visual perception abilities to the student, a challenge that has been largely overlooked in previous studies. Through a systematic analysis, we identify visual attention misalignment between student and teacher as the main cause of this issue. Based on this insight, we propose CompoDistill, a novel KD framework that explicitly aligns the student’s visual attention with that of the teacher to enhance the student’s visual perception abilities. Our extensive experiments show that CompoDistill significantly improves performance on compositional reasoning tasks that require visual perception abilities while maintaining strong performance on visual question answering tasks, as done in existing studies. Furthermore, CompoDistill demonstrates effectiveness with a more advanced backbone, highlighting its generalizability.

[153] Hierarchical Reasoning with Vision-Language Models for Incident Reports from Dashcam Videos

Shingo Yokoi, Kento Sasaki, Yu Yamaguchi

Main category: cs.CV

TL;DR: A hierarchical reasoning framework using vision-language models for generating incident reports from dashcam videos, achieving 2nd place in the 2COOOL challenge with best CIDEr-D score.

Details

Motivation: Address the gap in autonomous driving models' performance in out-of-distribution scenarios and improve hazard understanding beyond closed taxonomies through interpretable incident reporting.

Method: Hierarchical reasoning framework integrating frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models, enhanced by model ensembling and Blind A/B Scoring selection.

Result: Ranked 2nd among 29 teams on the official 2COOOL open leaderboard and achieved the best CIDEr-D score, producing accurate and coherent incident narratives.

Conclusion: Hierarchical reasoning with VLMs is a promising direction for accident analysis and broader understanding of safety-critical traffic events.

Abstract: Recent advances in end-to-end (E2E) autonomous driving have been enabled by training on diverse large-scale driving datasets, yet autonomous driving models still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark targets this gap by encouraging hazard understanding beyond closed taxonomies, and the 2COOOL challenge extends it to generating human-interpretable incident reports. We present a hierarchical reasoning framework for incident report generation from dashcam videos that integrates frame-level captioning, incident frame detection, and fine-grained reasoning within vision-language models (VLMs). We further improve factual accuracy and readability through model ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL open leaderboard, our method ranks 2nd among 29 teams and achieves the best CIDEr-D score, producing accurate and coherent incident narratives. These results indicate that hierarchical reasoning with VLMs is a promising direction for accident analysis and for broader understanding of safety-critical traffic events. The implementation and code are available at https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.

[154] The Impact of Synthetic Data on Object Detection Model Performance: A Comparative Analysis with Real-World Data

Muammer Bay, Timo von Marcard, Dren Fazlija

Main category: cs.CV

TL;DR: This paper investigates using synthetic data from NVIDIA Omniverse Replicator for fine-tuning object detection models in warehouse logistics, finding that balanced integration of synthetic and real data leads to robust performance.

Details

Motivation: Many AI applications in logistics and manufacturing lack expertise and resources, forcing reliance on general-purpose models that require costly real-world data for fine-tuning. Synthetic data offers a cost-effective alternative.

Method: Examined pallet detection in warehouse settings using NVIDIA Omniverse Replicator to generate synthetic data. Compared models trained on real data only versus various synthetic dataset generation strategies.

Result: The study found that synthetic data can effectively enhance object detection model performance when properly integrated with real-world data.

Conclusion: A balanced integration of synthetic and real data can lead to robust and efficient object detection models for warehouse logistics applications.

Abstract: Recent advances in generative AI, particularly in computer vision (CV), offer new opportunities to optimize workflows across industries, including logistics and manufacturing. However, many AI applications are limited by a lack of expertise and resources, which forces a reliance on general-purpose models. Success with these models often requires domain-specific data for fine-tuning, which can be costly and inefficient. Thus, using synthetic data for fine-tuning is a popular, cost-effective alternative to gathering real-world data. This work investigates the impact of synthetic data on the performance of object detection models, compared to models trained on real-world data only, specifically within the domain of warehouse logistics. To this end, we examined the impact of synthetic data generated using the NVIDIA Omniverse Replicator tool on the effectiveness of object detection models in real-world scenarios. It comprises experiments focused on pallet detection in a warehouse setting, utilizing both real and various synthetic dataset generation strategies. Our findings provide valuable insights into the practical applications of synthetic image data in computer vision, suggesting that a balanced integration of synthetic and real data can lead to robust and efficient object detection models.

[155] DIANet: A Phase-Aware Dual-Stream Network for Micro-Expression Recognition via Dynamic Images

Vu Tram Anh Khuong, Luu Tu Nguyen, Thi Bich Phuong Man, Thanh Ha Le, Thi Duyen Ngo

Main category: cs.CV

TL;DR: DIANet is a dual-stream CNN framework for micro-expression recognition that uses phase-aware dynamic images to capture onset-to-apex and apex-to-offset phases separately, with cross-attention fusion for improved performance.

Details

Motivation: Micro-expression recognition is challenging due to subtle, transient facial cues and limited data. Conventional dynamic image methods overlook distinct temporal phase characteristics within micro-expressions.

Method: Proposes DIANet with dual streams: one for onset-to-apex phase and another for apex-to-offset phase, each processed by dedicated CNNs and integrated via cross-attention fusion module.

Result: Outperforms conventional single-phase DI-based approaches on three benchmark datasets (CASME-II, SAMM, and MMEW), demonstrating superior recognition accuracy.

Conclusion: Explicitly modeling temporal phase information is crucial for micro-expression recognition, and the dual-stream phase-aware approach represents a promising direction for advancing MER.

Abstract: Micro-expressions are brief, involuntary facial movements that typically last less than half a second and often reveal genuine emotions. Accurately recognizing these subtle expressions is critical for applications in psychology, security, and behavioral analysis. However, micro-expression recognition (MER) remains a challenging task due to the subtle and transient nature of facial cues and the limited availability of annotated data. While dynamic image (DI) representations have been introduced to summarize temporal motion into a single frame, conventional DI-based methods often overlook the distinct characteristics of different temporal phases within a micro-expression. To address this issue, this paper proposes a novel dual-stream framework, DIANet, which leverages phase-aware dynamic images - one encoding the onset-to-apex phase and the other capturing the apex-to-offset phase. Each stream is processed by a dedicated convolutional neural network, and a cross-attention fusion module is employed to adaptively integrate features from both streams based on their contextual relevance. Extensive experiments conducted on three benchmark MER datasets (CASME-II, SAMM, and MMEW) demonstrate that the proposed method consistently outperforms conventional single-phase DI-based approaches. The results highlight the importance of modeling temporal phase information explicitly and suggest a promising direction for advancing MER.

[156] HoneyBee: Data Recipes for Vision-Language Reasoners

Hritik Bansal, Devandra Singh Sachan, Kai-Wei Chang, Aditya Grover, Gargi Ghosh, Wen-tau Yih, Ramakanth Pasunuru

Main category: cs.CV

TL;DR: This paper studies data curation strategies for vision-language reasoning datasets and introduces HoneyBee, a large-scale chain-of-thought dataset that significantly improves VLM performance.

Details

Motivation: The principles for constructing effective vision-language reasoning training datasets are poorly understood, despite VLMs' strong reasoning capabilities.

Method: Analyzed context source strategies, implemented targeted data interventions (auxiliary signals, text-only reasoning), scaled data dimensions, and created HoneyBee dataset with 2.5M examples.

Result: HoneyBee-trained VLMs outperform SOTA models across sizes (3B model beats SOTA by 7.8% on MathVerse), with test-time scaling reducing decoding cost by 73% without accuracy loss.

Conclusion: The work presents improved strategies for VL reasoning dataset curation, showing that careful data construction significantly enhances reasoning capabilities.

Abstract: Recent advances in vision-language models (VLMs) have made them highly effective at reasoning tasks. However, the principles underlying the construction of performant VL reasoning training datasets remain poorly understood. In this work, we introduce several data curation approaches and study their impacts on VL reasoning capabilities by carefully controlling training and evaluation setups. We analyze the effects of context (image and question pair) sources, implement targeted data interventions, and explore scaling up images, questions, and chain-of-thought (CoT) solutions. Our findings reveal that (a) context source strategies significantly affect VLM performance, (b) interventions such as auxiliary signals from image captions and the inclusion of text-only reasoning yield substantial gains, and (c) scaling all data dimensions (e.g., unique questions per image and unique CoTs per image-question pair) consistently improves reasoning capability. Motivated by these insights, we introduce HoneyBee, a large-scale, high-quality CoT reasoning dataset with 2.5M examples consisting 350K image-question pairs. VLMs trained with HoneyBee outperform state-of-the-art models across model sizes. For instance, a HoneyBee-trained VLM with 3B parameters outperforms the SOTA model and the base model by 7.8% and 24.8%, respectively, on MathVerse. Furthermore, we propose a test-time scaling strategy that reduces decoding cost by 73% without sacrificing accuracy. Overall, this work presents improved strategies for VL reasoning dataset curation research.

[157] BIGFix: Bidirectional Image Generation with Token Fixing

Victor Besnier, David Hurych, Andrei Bursuc, Eduardo Valle

Main category: cs.CV

TL;DR: A method for self-correcting image and video generation that combines parallel token prediction with iterative refinement to improve efficiency while maintaining quality.

Details

Motivation: Improving inference efficiency in generative models is crucial for commercial viability, but parallel token prediction can cause structural inconsistencies due to token incompatibilities and lacks backtracking mechanisms.

Method: Proposes iterative token refinement with a novel training scheme that injects random tokens in context to improve robustness and enable token fixing during sampling.

Result: Achieves substantial improvements in generation quality while preserving efficiency benefits, demonstrated on ImageNet-256, CIFAR-10 for images and UCF-101, NuScenes for videos.

Conclusion: The method successfully combines parallel token prediction efficiency with enhanced generation quality through iterative self-correction, working effectively across both image and video modalities.

Abstract: Recent advances in image and video generation have raised significant interest from both academia and industry. A key challenge in this field is improving inference efficiency, as model size and the number of inference steps directly impact the commercial viability of generative models while also posing fundamental scientific challenges. A promising direction involves combining auto-regressive sequential token modeling with multi-token prediction per step, reducing inference time by up to an order of magnitude. However, predicting multiple tokens in parallel can introduce structural inconsistencies due to token incompatibilities, as capturing complex joint dependencies during training remains challenging. Traditionally, once tokens are sampled, there is no mechanism to backtrack and refine erroneous predictions. We propose a method for self-correcting image generation by iteratively refining sampled tokens. We achieve this with a novel training scheme that injects random tokens in the context, improving robustness and enabling token fixing during sampling. Our method preserves the efficiency benefits of parallel token prediction while significantly enhancing generation quality. We evaluate our approach on image generation using the ImageNet-256 and CIFAR-10 datasets, as well as on video generation with UCF-101 and NuScenes, demonstrating substantial improvements across both modalities.

[158] Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding

Ye Chen, Liming Tan, Yupeng Zhu, Yuanbin Wang, Bingbing Ni

Main category: cs.CV

TL;DR: Proposes spatio-temporally consistent proxy nodes for robust video representation that overcomes limitations of pixel-level tracking by handling occlusion, large motion, and tracking errors through hierarchical multi-scale structure and dynamic updates.

Details

Motivation: Current video representations rely on unstable pixel-level matching and tracking, which are vulnerable to tracking errors, occlusions, and large motions that can collapse object representations.

Method: Uses hierarchical proxy nodes to represent dynamic objects/scenes with multi-scale structure, dynamic representation update mechanism leveraging spatio-temporal priors, and decoupled encoding of shape and texture representations.

Result: Achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks including video in-painting and keyframe-based temporally consistent video editing.

Conclusion: The proxy node representation provides robust video modeling that handles tracking errors, occlusions, and motion while enabling controllable appearance editing and efficient video processing.

Abstract: Current video representations heavily rely on unstable and over-grained priors for motion and appearance modelling, \emph{i.e.}, pixel-level matching and tracking. A tracking error of just a few pixels would lead to the collapse of the visual object representation, not to mention occlusions and large motion frequently occurring in videos. To overcome the above mentioned vulnerability, this work proposes spatio-temporally consistent proxy nodes to represent dynamically changing objects/scenes in the video. On the one hand, the hierarchical proxy nodes have the ability to stably express the multi-scale structure of visual objects, so they are not affected by accumulated tracking error, long-term motion, occlusion, and viewpoint variation. On the other hand, the dynamic representation update mechanism of the proxy nodes adequately leverages spatio-temporal priors of the video to mitigate the impact of inaccurate trackers, thereby effectively handling drastic changes in scenes and objects. Additionally, the decoupled encoding manner of the shape and texture representations across different visual objects in the video facilitates controllable and fine-grained appearance editing capability. Extensive experiments demonstrate that the proposed representation achieves high video reconstruction accuracy with fewer parameters and supports complex video processing tasks, including video in-painting and keyframe-based temporally consistent video editing.

[159] Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images

Yuto Yokoi, Kazuhiro Hotta

Main category: cs.CV

TL;DR: Proposes two novel loss functions - Multiplicative Loss and Confidence-Adaptive Multiplicative Loss - for medical image segmentation that combine Cross Entropy and Dice losses multiplicatively rather than additively, achieving better performance with limited data.

Details

Motivation: Address data scarcity in medical imaging due to privacy, ethics, and costly annotations. Existing additive combinations of Cross Entropy and Dice Loss are sensitive to hyperparameters and perform suboptimally with limited data.

Method: Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively to dynamically modulate gradients based on prediction confidence. Confidence-Adaptive version adds exponential scaling inspired by Focal Loss to emphasize difficult samples using predicted probabilities and Dice coefficients.

Result: Experiments on cellular and medical segmentation benchmarks show the framework consistently outperforms tuned additive and existing loss functions, providing robust segmentation under data limitations.

Conclusion: The proposed multiplicative loss functions offer a simple, effective, and hyperparameter-free mechanism for robust semantic segmentation in medical and cellular images, particularly valuable under challenging data scarcity conditions.

Abstract: We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.

[160] Local Background Features Matter in Out-of-Distribution Detection

Jinlun Ye, Zhuohao Sun, Yiqiao Qiu, Qiu Li, Zhijun Tan, Ruixuan Wang

Main category: cs.CV

TL;DR: Proposes using local background features from in-distribution images as fake OOD features during training to improve OOD detection by reducing model overconfidence on OOD data.

Details

Motivation: Neural networks often produce overconfident predictions on OOD data, and existing methods using auxiliary OOD datasets or generated fake OOD images are limited by high data collection and training costs.

Method: Extracts background features from ID images as simulated OOD representations during training based on local convolution invariance, and optimizes networks to reduce L2-norm of these background features.

Result: Extensive experiments on multiple standard OOD detection benchmarks confirm effectiveness and wide combinatorial compatibility with existing post-hoc methods, achieving new state-of-the-art performance.

Conclusion: The proposed method effectively alleviates overconfidence issue on OOD data using local background features as fake OOD features, providing a cost-effective solution for OOD detection.

Abstract: Out-of-distribution (OOD) detection is crucial when deploying deep neural networks in the real world to ensure the reliability and safety of their applications. One main challenge in OOD detection is that neural network models often produce overconfident predictions on OOD data. While some methods using auxiliary OOD datasets or generating fake OOD images have shown promising OOD detection performance, they are limited by the high costs of data collection and training. In this study, we propose a novel and effective OOD detection method that utilizes local background features as fake OOD features for model training. Inspired by the observation that OOD images generally share similar background regions with ID images, the background features are extracted from ID images as simulated OOD visual representations during training based on the local invariance of convolution. Through being optimized to reduce the $L_2$-norm of these background features, the neural networks are able to alleviate the overconfidence issue on OOD data. Extensive experiments on multiple standard OOD detection benchmarks confirm the effectiveness of our method and its wide combinatorial compatibility with existing post-hoc methods, with new state-of-the-art performance achieved from our method.

[161] SpineBench: Benchmarking Multimodal LLMs for Spinal Pathology Analysis

Chenghanyu Zhang, Zekun Li, Peipei Li, Xing Cui, Shuhan Xia, Weixiang Yan, Yiqiao Zhang, Qianyu Zhuang

Main category: cs.CV

TL;DR: SpineBench is a comprehensive Visual Question Answering benchmark for evaluating Multimodal Large Language Models in the spinal domain, featuring 64,878 QA pairs from 40,263 spine images covering 11 diseases.

Details

Motivation: Existing medical benchmarks focus on general tasks and inadequately assess performance in specialized areas like spine medicine that rely heavily on visual input, creating a gap in evaluating MLLMs for nuanced clinical applications.

Method: Created by integrating and standardizing open-source spinal disease datasets, with challenging hard negative options selected based on visual similarity to simulate real-world diagnostic scenarios. Includes two clinical tasks: spinal disease diagnosis and lesion localization in multiple-choice format.

Result: Evaluation of 12 leading MLLMs revealed poor performance in spinal tasks, demonstrating significant limitations of current models in the spine domain despite their general medical capabilities.

Conclusion: SpineBench highlights the need for specialized evaluation benchmarks in medical subdomains and provides guidance for future improvements in spinal medicine applications of MLLMs.

Abstract: With the increasing integration of Multimodal Large Language Models (MLLMs) into the medical field, comprehensive evaluation of their performance in various medical domains becomes critical. However, existing benchmarks primarily assess general medical tasks, inadequately capturing performance in nuanced areas like the spine, which relies heavily on visual input. To address this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA) benchmark designed for fine-grained analysis and evaluation of MLLMs in the spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images, covering 11 spinal diseases through two critical clinical tasks: spinal disease diagnosis and spinal lesion localization, both in multiple-choice format. SpineBench is built by integrating and standardizing image-label pairs from open-source spinal disease datasets, and samples challenging hard negative options for each VQA pair based on visual similarity (similar but not the same disease), simulating real-world challenging scenarios. We evaluate 12 leading MLLMs on SpineBench. The results reveal that these models exhibit poor performance in spinal tasks, highlighting limitations of current MLLM in the spine domain and guiding future improvements in spinal medicine applications. SpineBench is publicly available at https://zhangchenghanyu.github.io/SpineBench.github.io/.

[162] PAGS: Priority-Adaptive Gaussian Splatting for Dynamic Driving Scenes

Ying A, Wenzhang Sun, Chang Zeng, Chunfeng Wang, Hao Li, Jianxun Cui

Main category: cs.CV

TL;DR: PAGS introduces a semantic-aware 3D reconstruction framework that prioritizes safety-critical objects over static backgrounds, achieving high-quality reconstruction with significantly improved computational efficiency.

Details

Motivation: Current dynamic 3D urban scene reconstruction methods face a trade-off between fidelity and computational cost due to semantically agnostic designs that treat all scene elements equally, without prioritizing safety-critical objects.

Method: PAGS uses two core techniques: (1) Semantically-Guided Pruning and Regularization with hybrid importance metrics to simplify non-critical elements while preserving critical object details, and (2) Priority-Driven Rendering pipeline with priority-based depth pre-pass to cull occluded primitives and accelerate shading.

Result: Extensive experiments on Waymo and KITTI datasets show PAGS achieves exceptional reconstruction quality on safety-critical objects while reducing training time and achieving rendering speeds over 350 FPS.

Conclusion: PAGS successfully addresses the fidelity-efficiency trade-off in dynamic 3D urban scene reconstruction by incorporating semantic priorities, making it suitable for autonomous driving applications.

Abstract: Reconstructing dynamic 3D urban scenes is crucial for autonomous driving, yet current methods face a stark trade-off between fidelity and computational cost. This inefficiency stems from their semantically agnostic design, which allocates resources uniformly, treating static backgrounds and safety-critical objects with equal importance. To address this, we introduce Priority-Adaptive Gaussian Splatting (PAGS), a framework that injects task-aware semantic priorities directly into the 3D reconstruction and rendering pipeline. PAGS introduces two core contributions: (1) Semantically-Guided Pruning and Regularization strategy, which employs a hybrid importance metric to aggressively simplify non-critical scene elements while preserving fine-grained details on objects vital for navigation. (2) Priority-Driven Rendering pipeline, which employs a priority-based depth pre-pass to aggressively cull occluded primitives and accelerate the final shading computations. Extensive experiments on the Waymo and KITTI datasets demonstrate that PAGS achieves exceptional reconstruction quality, particularly on safety-critical objects, while significantly reducing training time and boosting rendering speeds to over 350 FPS.

[163] Dual Learning with Dynamic Knowledge Distillation and Soft Alignment for Partially Relevant Video Retrieval

Jianfeng Dong, Lei Huang, Daizong Liu, Xianke Chen, Xun Yang, Changting Lin, Xun Wang, Meng Wang

Main category: cs.CV

TL;DR: Proposes DL-DKD++, a dual learning framework with dynamic knowledge distillation for partially relevant video retrieval (PRVR), where a teacher model transfers knowledge to a student network with inheritance and exploration branches to handle untrimmed videos with partial relevance to text queries.

Details

Motivation: Real-world videos are typically untrimmed with long durations and complex background content, unlike the pre-trimmed short videos assumed in most text-to-video retrieval works. This creates a practical challenge of retrieving partially relevant untrimmed videos.

Method: Uses a large teacher model to supervise a compact dual-branch student network. The student has an inheritance branch for transferable knowledge and an exploration branch for task-specific learning. Incorporates dynamic soft-target construction to replace rigid hard-target supervision.

Result: Achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets for PRVR.

Conclusion: The proposed DL-DKD++ framework effectively addresses the practical challenge of partially relevant video retrieval by distilling knowledge from large pre-trained models and learning task-specific information through a dual-branch architecture with dynamic supervision.

Abstract: Almost all previous text-to-video retrieval works ideally assume that videos are pre-trimmed with short durations containing solely text-related content. However, in practice, videos are typically untrimmed in long durations with much more complicated background content. Therefore, in this paper, we focus on the more practical yet challenging task of Partially Relevant Video Retrieval (PRVR), which aims to retrieve partially relevant untrimmed videos with the given query. To tackle this task, we propose a novel framework that distills generalization knowledge from a powerful large-scale vision-language pre-trained model and transfers it to a lightweight, task-specific PRVR network. Specifically, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD++), where a large teacher model provides supervision to a compact dual-branch student network. The student model comprises two branches: an inheritance branch that absorbs transferable knowledge from the teacher, and an exploration branch that learns task-specific information from the PRVR dataset to address domain gaps. To further enhance learning, we incorporate a dynamic soft-target construction mechanism. By replacing rigid hard-target supervision with adaptive soft targets that evolve during training, our method enables the model to better capture the fine-grained, partial relevance between videos and queries. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on TVR, ActivityNet, and Charades-STA datasets for PRVR. The code is available at https://github.com/HuiGuanLab/DL-DKD.

[164] Vision Language Models Map Logos to Text via Semantic Entanglement in the Visual Projector

Sifan Li, Hongkai Chen, Yujun Cai, Qingwen Ye, Liyang Chen, Junsong Yuan, Yiwei Wang

Main category: cs.CV

TL;DR: VLMs suffer from logo hallucinations, generating brand names from logos without visible text. Systematic testing reveals this occurs across logo types, persists under distortions, and is tied to specific projector dimensions in the model architecture.

Details

Motivation: To investigate logo hallucination in VLMs where models generate brand names from logos containing no visible words, revealing a critical vulnerability in multimodal reasoning systems.

Method: Used curated logo datasets (pure symbols, hybrids, text-bearing logos, Hard-60 subset) with structured perturbations. Conducted embedding-level analysis with LLaVA to identify projector dimensions responsible for hallucinations.

Result: VLMs consistently hallucinate brand names from non-text logos, especially circular ones. Hallucinations persist under distortions, with occlusion being most revealing. Targeted ablation of specific projector dimensions reduces errors while maintaining OCR accuracy.

Conclusion: VLMs rely on symbolic priors rather than genuine visual perception for logos. Projector subspaces play a key role in hallucinations, suggesting projector disentanglement and OCR-guided decoding as promising mitigation strategies for more trustworthy multimodal systems.

Abstract: Vision Language Models (VLMs) have achieved impressive progress in multimodal reasoning; yet, they remain vulnerable to hallucinations, where outputs are not grounded in visual evidence. In this paper, we investigate a previously overlooked setting: logo hallucination, where models generate brand names or textual content despite logos containing no visible words. Using curated splits of pure symbols, hybrids, and text-bearing logos, as well as the challenging Hard-60 subset, we systematically measure hallucination across leading VLMs. We further probe robustness through nine structured perturbations and show that hallucinations persist even under strong distortions, with occlusion exposing the sharpest weaknesses. Embedding-level analysis with open-weight LLaVA demonstrates that hallucination is tied to a small subset of projector dimensions, and targeted ablation substantially reduces errors while preserving OCR accuracy. Together, these findings reveal that VLMs often rely on symbolic priors rather than genuine glyph perception, particularly for iconic circular logos, and that projector subspaces play a decisive role in this failure mode. Our work contributes both a novel diagnostic lens and actionable mitigation insights, highlighting projector disentanglement and OCR-guided decoding as promising directions for building more trustworthy multimodal systems.

[165] Hybrid Gaussian Splatting for Novel Urban View Synthesis

Mohamed Omran, Farhad Zanjani, Davide Abati, Jens Petersen, Amirhossein Habibian

Main category: cs.CV

TL;DR: A hybrid 3D reconstruction and diffusion enhancement approach for novel view synthesis in street scenes, achieving second place in the RealADSim-NVS challenge.

Details

Motivation: To address the challenge of generating realistic novel views of urban environments from different traversals (e.g., different lanes or directions) using car-centric training frames.

Method: Two-stage approach: 1) Fit 3D scene reconstruction using Gaussian splatting and render novel views from target cameras, 2) Enhance frames with a single-step diffusion model. Includes specific initialization of Gaussian primitives and fine-tuning of the enhancer model.

Result: Achieved aggregated score of 0.432 on the public leaderboard, securing second place overall. Performance evaluated using PSNR, SSIM, and LPIPS metrics.

Conclusion: The hybrid approach combining 3D reconstruction with diffusion-based enhancement effectively addresses novel view synthesis in street scenes, demonstrating competitive performance in the challenge.

Abstract: This paper describes the Qualcomm AI Research solution to the RealADSim-NVS challenge, hosted at the RealADSim Workshop at ICCV 2025. The challenge concerns novel view synthesis in street scenes, and participants are required to generate, starting from car-centric frames captured during some training traversals, renders of the same urban environment as viewed from a different traversal (e.g. different street lane or car direction). Our solution is inspired by hybrid methods in scene generation and generative simulators merging gaussian splatting and diffusion models, and it is composed of two stages: First, we fit a 3D reconstruction of the scene and render novel views as seen from the target cameras. Then, we enhance the resulting frames with a dedicated single-step diffusion model. We discuss specific choices made in the initialization of gaussian primitives as well as the finetuning of the enhancer model and its training data curation. We report the performance of our model design and we ablate its components in terms of novel view quality as measured by PSNR, SSIM and LPIPS. On the public leaderboard reporting test results, our proposal reaches an aggregated score of 0.432, achieving the second place overall.

[166] CurriFlow: Curriculum-Guided Depth Fusion with Optical Flow-Based Temporal Alignment for 3D Semantic Scene Completion

Jinzhou Lin, Jie Zhou, Wenhao Xu, Rongtao Xu, Changwei Wang, Shunpeng Chen, Kexue Fu, Yihua Shao, Li Guo, Shibiao Xu

Main category: cs.CV

TL;DR: CurriFlow is a semantic scene completion framework that uses optical flow-based temporal alignment and curriculum learning with depth fusion to improve 3D geometry and semantics prediction from monocular images.

Details

Motivation: Existing SSC methods lack explicit motion reasoning and struggle with occlusions and noisy depth supervision, limiting camera-based perception in autonomous driving.

Method: Integrates optical flow-based temporal alignment with curriculum-guided depth fusion, using multi-level feature fusion and semantic priors from SAM for category-agnostic supervision.

Result: Achieves state-of-the-art performance on SemanticKITTI benchmark with mean IoU of 16.9.

Conclusion: CurriFlow’s motion-guided and curriculum-aware design effectively addresses challenges in camera-based 3D semantic scene completion.

Abstract: Semantic Scene Completion (SSC) aims to infer complete 3D geometry and semantics from monocular images, serving as a crucial capability for camera-based perception in autonomous driving. However, existing SSC methods relying on temporal stacking or depth projection often lack explicit motion reasoning and struggle with occlusions and noisy depth supervision. We propose CurriFlow, a novel semantic occupancy prediction framework that integrates optical flow-based temporal alignment with curriculum-guided depth fusion. CurriFlow employs a multi-level fusion strategy to align segmentation, visual, and depth features across frames using pre-trained optical flow, thereby improving temporal consistency and dynamic object understanding. To enhance geometric robustness, a curriculum learning mechanism progressively transitions from sparse yet accurate LiDAR depth to dense but noisy stereo depth during training, ensuring stable optimization and seamless adaptation to real-world deployment. Furthermore, semantic priors from the Segment Anything Model (SAM) provide category-agnostic supervision, strengthening voxel-level semantic learning and spatial consistency. Experiments on the SemanticKITTI benchmark demonstrate that CurriFlow achieves state-of-the-art performance with a mean IoU of 16.9, validating the effectiveness of our motion-guided and curriculum-aware design for camera-based 3D semantic scene completion.

[167] Deep Attention-guided Adaptive Subsampling

Sharath M Shankaranarayana, Soumava Kumar Roy, Prasad Sudhakar, Chandan Aladahalli

Main category: cs.CV

TL;DR: Proposes a learnable subsampling framework with attention-guided sampling that adapts to inputs during inference, reducing computational complexity while maintaining performance in 3D medical imaging and video classification tasks.

Details

Motivation: Deep neural networks often achieve performance gains at the cost of increased computational complexity, with inherent redundancies in 3D volumes and videos where not all slices/frames are necessary.

Method: An attention-guided sampling module that adapts to different inputs even during inference, overcoming the non-differentiability of subsampling operations and addressing limitations of static sampling mechanisms.

Result: Demonstrated effectiveness on 3D medical imaging datasets from MedMNIST3D and two ultrasound video datasets, including a challenging real-world clinical dataset, showing performance gains with reduced complexity.

Conclusion: The proposed input-adaptive subsampling framework successfully reduces computational complexity in deep neural networks while maintaining or improving performance, making it suitable for real-world applications with dynamic input adaptation.

Abstract: Although deep neural networks have provided impressive gains in performance, these improvements often come at the cost of increased computational complexity and expense. In many cases, such as 3D volume or video classification tasks, not all slices or frames are necessary due to inherent redundancies. To address this issue, we propose a novel learnable subsampling framework that can be integrated into any neural network architecture. Subsampling, being a nondifferentiable operation, poses significant challenges for direct adaptation into deep learning models. While some works, have proposed solutions using the Gumbel-max trick to overcome the problem of non-differentiability, they fall short in a crucial aspect: they are only task-adaptive and not inputadaptive. Once the sampling mechanism is learned, it remains static and does not adjust to different inputs, making it unsuitable for real-world applications. To this end, we propose an attention-guided sampling module that adapts to inputs even during inference. This dynamic adaptation results in performance gains and reduces complexity in deep neural network models. We demonstrate the effectiveness of our method on 3D medical imaging datasets from MedMNIST3D as well as two ultrasound video datasets for classification tasks, one of them being a challenging in-house dataset collected under real-world clinical conditions.

[168] Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling

Tim J. Schoonbeek, Shao-Hsuan Hung, Dan Lehman, Hans Onvlee, Jacek Kustra, Peter H. N. de With, Fons van der Sommen

Main category: cs.CV

TL;DR: STORM-PSR is a dual-stream framework for procedure step recognition that combines spatial and temporal features to handle partial occlusion, reducing step completion prediction delays by 11.2-26.1% compared to prior methods.

Details

Motivation: Existing PSR models rely solely on detecting assembly object states in individual frames, neglecting temporal features. This limits robustness and accuracy, especially when objects are partially occluded.

Method: Proposed STORM-PSR dual-stream framework: assembly state detection stream for unobstructed views, and spatio-temporal stream with spatial encoder (pre-trained using weakly supervised approach) and transformer-based temporal encoder to capture spatial-temporal relationships under occlusion.

Result: Evaluated on MECCANO and IndustReal datasets, reduced average delay between actual and predicted assembly step completions by 11.2% and 26.1% respectively compared to prior methods. Spatio-temporal stream enables step recognition without requiring unobstructed object views.

Conclusion: STORM-PSR effectively addresses occlusion challenges in procedure step recognition by leveraging both spatial and temporal features, with the spatio-temporal stream being key to improved performance under partial occlusion conditions.

Abstract: Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at https://timschoonbeek.github.io/stormpsr .

[169] Scene Coordinate Reconstruction Priors

Wenjing Bian, Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann

Main category: cs.CV

TL;DR: The paper presents a probabilistic approach to training scene coordinate regression models that incorporates reconstruction priors to prevent degeneration when training images have insufficient multi-view constraints.

Details

Motivation: Scene coordinate regression models can degenerate when trained on images with insufficient multi-view constraints, leading to poor performance in downstream tasks.

Method: Proposes probabilistic reinterpretation of SCR training with various priors including depth distribution priors and learned priors using 3D point cloud diffusion models trained on indoor scans.

Result: The priors help learn better scene representations, producing more coherent point clouds, higher registration rates, better camera poses, and improved performance on novel view synthesis and relocalization tasks.

Conclusion: Incorporating reconstruction priors during SCR training effectively prevents model degeneration and improves performance across multiple indoor datasets.

Abstract: Scene coordinate regression (SCR) models have proven to be powerful implicit scene representations for 3D vision, enabling visual relocalization and structure-from-motion. SCR models are trained specifically for one scene. If training images imply insufficient multi-view constraints SCR models degenerate. We present a probabilistic reinterpretation of training SCR models, which allows us to infuse high-level reconstruction priors. We investigate multiple such priors, ranging from simple priors over the distribution of reconstructed depth values to learned priors over plausible scene coordinate configurations. For the latter, we train a 3D point cloud diffusion model on a large corpus of indoor scans. Our priors push predicted 3D scene points towards plausible geometry at each training step to increase their likelihood. On three indoor datasets our priors help learning better scene representations, resulting in more coherent scene point clouds, higher registration rates and better camera poses, with a positive effect on down-stream tasks such as novel view synthesis and camera relocalization.

[170] Towards General Urban Monitoring with Vision-Language Models: A Review, Evaluation, and a Research Agenda

André Torneiro, Diogo Monteiro, Paulo Novais, Pedro Rangel Henriques, Nuno F. Rodrigues

Main category: cs.CV

TL;DR: This systematic review examines Vision-Language Models (VLMs) for urban infrastructure monitoring, analyzing 32 studies from 2021-2025 to understand their applications, architectures, datasets, and performance in zero-shot settings.

Details

Motivation: Traditional urban monitoring using IoT sensors and manual inspections is costly, difficult to scale, and misaligned with citizens' visual perception. VLMs offer potential to enable machines to 'see' like citizens and assess infrastructure conditions.

Method: Following PRISMA methodology, the authors systematically reviewed 32 peer-reviewed studies to address four research questions about VLM applications in urban monitoring, focusing on tasks, architectures, datasets, and evaluation methods.

Result: The review identified effective urban monitoring tasks addressed by VLMs, commonly used architectures and frameworks, available datasets, and reported performance levels across different applications.

Conclusion: VLMs show promise as a technology for urban infrastructure monitoring, particularly in zero-shot applications, offering potential solutions to overcome limitations of traditional monitoring approaches.

Abstract: Urban monitoring of public infrastructure (such as waste bins, road signs, vegetation, sidewalks, and construction sites) poses significant challenges due to the diversity of objects, environments, and contextual conditions involved. Current state-of-the-art approaches typically rely on a combination of IoT sensors and manual inspections, which are costly, difficult to scale, and often misaligned with citizens’ perception formed through direct visual observation. This raises a critical question: Can machines now “see” like citizens and infer informed opinions about the condition of urban infrastructure? Vision-Language Models (VLMs), which integrate visual understanding with natural language reasoning, have recently demonstrated impressive capabilities in processing complex visual information, turning them into a promising technology to address this challenge. This systematic review investigates the role of VLMs in urban monitoring, with particular emphasis on zero-shot applications. Following the PRISMA methodology, we analyzed 32 peer-reviewed studies published between 2021 and 2025 to address four core research questions: (1) What urban monitoring tasks have been effectively addressed using VLMs? (2) Which VLM architectures and frameworks are most commonly used and demonstrate superior performance? (3) What datasets and resources support this emerging field? (4) How are VLM-based applications evaluated, and what performance levels have been reported?

[171] Low-Field Magnetic Resonance Image Quality Enhancement using a Conditional Flow Matching Model

Huu Tien Nguyen, Ahmed Karam Eldaly

Main category: cs.CV

TL;DR: A novel framework using conditional flow matching (CFM) for image quality transfer in low-field MRI, achieving high-field-like reconstruction with fewer parameters and robust generalization.

Details

Motivation: Low-field MRI is affordable and portable but suffers from low signal-to-noise ratio and reduced diagnostic quality. There's a need to bridge the quality gap without expensive infrastructure.

Method: Uses conditional flow matching (CFM) to learn a continuous flow between noise distribution and target data through direct regression of an optimal velocity field, rather than iterative sampling or adversarial objectives.

Result: CFM achieves state-of-the-art performance, generalizes robustly to both in-distribution and out-of-distribution data, and uses significantly fewer parameters than competing deep learning methods.

Conclusion: CFM shows strong potential as a powerful and scalable tool for MRI reconstruction, especially in resource-limited clinical environments.

Abstract: This paper introduces a novel framework for image quality transfer based on conditional flow matching (CFM). Unlike conventional generative models that rely on iterative sampling or adversarial objectives, CFM learns a continuous flow between a noise distribution and target data distributions through the direct regression of an optimal velocity field. We evaluate this approach in the context of low-field magnetic resonance imaging (LF-MRI), a rapidly emerging modality that offers affordable and portable scanning but suffers from inherently low signal-to-noise ratio and reduced diagnostic quality. Our framework is designed to reconstruct high-field-like MR images from their corresponding low-field inputs, thereby bridging the quality gap without requiring expensive infrastructure. Experiments demonstrate that CFM not only achieves state-of-the-art performance, but also generalizes robustly to both in-distribution and out-of-distribution data. Importantly, it does so while utilizing significantly fewer parameters than competing deep learning methods. These results underline the potential of CFM as a powerful and scalable tool for MRI reconstruction, particularly in resource-limited clinical environments.

[172] VideoLucy: Deep Memory Backtracking for Long Video Understanding

Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao

Main category: cs.CV

TL;DR: VideoLucy is a deep memory backtracking framework for long video understanding that overcomes limitations of current LLM-based systems by using hierarchical memory structure and iterative backtracking to capture temporal context and preserve critical details.

Details

Motivation: Current agent-based systems using LLMs for long video understanding struggle with capturing temporal context of consecutive frames and risk discarding crucial information due to sparse frame sampling.

Method: VideoLucy employs a hierarchical memory structure with progressive granularity and an agent-based iterative backtracking mechanism to systematically mine video-wide, question-relevant deep memories.

Result: VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance surpassing proprietary models like GPT-4o.

Conclusion: The proposed VideoLucy framework effectively addresses temporal understanding challenges in long videos while preserving critical details, demonstrating superior performance over existing methods.

Abstract: Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model’s ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly at https://videolucy.github.io

[173] A Review of Longitudinal Radiology Report Generation: Dataset Composition, Methods, and Performance Evaluation

Shaoyang Zhou, Yingshu Li, Yunyi Liu, Lingqiao Liu, Lei Wang, Luping Zhou

Main category: cs.CV

TL;DR: This survey provides the first comprehensive review of longitudinal radiology report generation (LRRG), addressing the gap in existing literature that primarily focuses on single-image chest X-ray report generation.

Details

Motivation: Chest X-ray imaging creates substantial workloads for radiologists, and conventional single-image approaches fail to capture longitudinal context necessary for clinically faithful comparison statements. Existing surveys offer limited guidance for longitudinal settings.

Method: The survey examines dataset construction strategies, report generation architectures with longitudinally tailored designs, and evaluation protocols including both longitudinal-specific measures and widely used benchmarks.

Result: The analysis highlights the critical role of longitudinal information and architectural design choices in improving model performance, and summarizes performance of LRRG methods alongside ablation studies.

Conclusion: The survey identifies five major limitations of current research and outlines promising directions for future development to advance this emerging field.

Abstract: Chest Xray imaging is a widely used diagnostic tool in modern medicine, and its high utilization creates substantial workloads for radiologists. To alleviate this burden, vision language models are increasingly applied to automate Chest Xray radiology report generation (CXRRRG), aiming for clinically accurate descriptions while reducing manual effort. Conventional approaches, however, typically rely on single images, failing to capture the longitudinal context necessary for producing clinically faithful comparison statements. Recently, growing attention has been directed toward incorporating longitudinal data into CXR RRG, enabling models to leverage historical studies in ways that mirror radiologists diagnostic workflows. Nevertheless, existing surveys primarily address single image CXRRRG and offer limited guidance for longitudinal settings, leaving researchers without a systematic framework for model design. To address this gap, this survey provides the first comprehensive review of longitudinal radiology report generation (LRRG). Specifically, we examine dataset construction strategies, report generation architectures alongside longitudinally tailored designs, and evaluation protocols encompassing both longitudinal specific measures and widely used benchmarks. We further summarize LRRG methods performance, alongside analyses of different ablation studies, which collectively highlight the critical role of longitudinal information and architectural design choices in improving model performance. Finally, we summarize five major limitations of current research and outline promising directions for future development, aiming to lay a foundation for advancing this emerging field.

[174] MS-GAGA: Metric-Selective Guided Adversarial Generation Attack

Dion J. X. Ho, Gabriel Lee Jun Rong, Niharika Shrivastava, Harshavardhan Abichandani, Pai Chet Ng, Xiaoxiao Miao

Main category: cs.CV

TL;DR: MS-GAGA is a two-stage framework for creating transferable and imperceptible adversarial examples against deepfake detectors in black-box settings, achieving up to 27% higher attack success rates than state-of-the-art methods.

Details

Motivation: To develop effective adversarial attacks against deepfake detectors that work across different models without requiring knowledge of the target model's architecture (black-box setting), while maintaining visual imperceptibility.

Method: Two-stage framework: Stage 1 uses dual-stream attack (MNTD-PGD for enhanced gradient calculations with small perturbations, SG-PGD for focusing on salient regions) to generate candidates. Stage 2 employs metric-aware selection evaluating both attack success and structural similarity (SSIM).

Result: Achieves up to 27% higher misclassification rates on unseen deepfake detectors compared to state-of-the-art attacks, demonstrating improved transferability across models.

Conclusion: MS-GAGA effectively balances transferability and imperceptibility in adversarial attacks against deepfake detectors through its dual-stream generation and metric-aware selection approach.

Abstract: We present MS-GAGA (Metric-Selective Guided Adversarial Generation Attack), a two-stage framework for crafting transferable and visually imperceptible adversarial examples against deepfake detectors in black-box settings. In Stage 1, a dual-stream attack module generates adversarial candidates: MNTD-PGD applies enhanced gradient calculations optimized for small perturbation budgets, while SG-PGD focuses perturbations on visually salient regions. This complementary design expands the adversarial search space and improves transferability across unseen models. In Stage 2, a metric-aware selection module evaluates candidates based on both their success against black-box models and their structural similarity (SSIM) to the original image. By jointly optimizing transferability and imperceptibility, MS-GAGA achieves up to 27% higher misclassification rates on unseen detectors compared to state-of-the-art attacks.

[175] A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation

Shurong Chai, Rahul Kumar JAIN, Rui Xu, Shaocong Mo, Ruibo Hou, Shiyu Teng, Jiaqing Liu, Lanfen Lin, Yen-Wei Chen

Main category: cs.CV

TL;DR: Proposes early fusion framework combining text and visual features before augmentation to preserve spatial consistency in text-guided medical image segmentation, achieving SOTA results.

Details

Motivation: Common data augmentations disrupt spatial alignment between images and text in multimodal segmentation, weakening performance. Need to preserve spatial consistency while benefiting from augmentation.

Method: Early fusion framework that combines text and visual features before augmentation. Uses lightweight generator to project text embeddings into visual space, bridging semantic gaps.

Result: Achieved state-of-the-art results on three medical imaging tasks and four segmentation frameworks. Visualization shows accurate region localization in generated pseudo-images.

Conclusion: Early fusion approach effectively preserves spatial consistency between text and images during augmentation, improving performance in text-guided medical image segmentation tasks.

Abstract: Deep learning relies heavily on data augmentation to mitigate limited data, especially in medical imaging. Recent multimodal learning integrates text and images for segmentation, known as referring or text-guided image segmentation. However, common augmentations like rotation and flipping disrupt spatial alignment between image and text, weakening performance. To address this, we propose an early fusion framework that combines text and visual features before augmentation, preserving spatial consistency. We also design a lightweight generator that projects text embeddings into visual space, bridging semantic gaps. Visualization of generated pseudo-images shows accurate region localization. Our method is evaluated on three medical imaging tasks and four segmentation frameworks, achieving state-of-the-art results. Code is publicly available on GitHub: https://github.com/11yxk/MedSeg_EarlyFusion.

[176] BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring

An Zhao, Piaopiao Yu, Zhe Zhu, Mingqiang Wei

Main category: cs.CV

TL;DR: Bi-Stage 3D Gaussian Splatting (BSGS) addresses 3D scene reconstruction from motion-blurred images using a two-stage approach with camera pose refinement and global rigid transformation, enhanced by subframe gradient aggregation and space-time optimization.

Details

Motivation: Existing 3DGS-based deblurring methods struggle with motion-blurred images due to extreme dependence on camera pose accuracy and inability to control erroneous Gaussian primitive densification caused by motion blur.

Method: Two-stage framework: 1) Camera Pose Refinement for rough optimization, 2) Global Rigid Transformation with fixed poses to correct blur distortions. Uses subframe gradient aggregation and space-time bi-stage optimization to prevent noisy Gaussian generation.

Result: Comprehensive experiments show BSGS effectively reconstructs 3D scenes from motion-blurred images and outperforms state-of-the-art methods.

Conclusion: The proposed Bi-Stage 3D Gaussian Splatting framework successfully addresses motion blur challenges in 3D scene reconstruction through its two-stage optimization approach and gradient management strategies.

Abstract: 3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene reconstruction.However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge.The performance of existing 3DGS-based deblurring methods are limited due to their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control erroneous Gaussian primitives densification caused by motion blur.To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred images.BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global RigidTransformation further corrects motion-induced blur distortions.To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages.Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the arts.

[177] Voronoi-Assisted Diffusion for Computing Unsigned Distance Fields from Unoriented Points

Jiayi Kong, Chen Zong, Junkai Deng, Xuhui Chen, Fei Hou, Shiqing Xin, Junhui Hou, Chen Qian, Ying He

Main category: cs.CV

TL;DR: VAD is a lightweight, network-free method that computes Unsigned Distance Fields from unoriented point clouds using Voronoi-based normal alignment and diffusion.

Details

Motivation: Existing neural approaches for learning UDFs suffer from numerical instability, high computational cost, and limited controllability.

Method: Assigns bi-directional normals to input points using Voronoi-based geometric criteria, then diffuses aligned normals to form approximate UDF gradient field, and integrates to recover final UDF.

Result: VAD robustly handles watertight and open surfaces, complex non-manifold and non-orientable geometries, while remaining computationally efficient and stable.

Conclusion: The proposed Voronoi-Assisted Diffusion method provides an effective alternative to neural approaches for UDF computation from point clouds.

Abstract: Unsigned Distance Fields (UDFs) provide a flexible representation for 3D shapes with arbitrary topology, including open and closed surfaces, orientable and non-orientable geometries, and non-manifold structures. While recent neural approaches have shown promise in learning UDFs, they often suffer from numerical instability, high computational cost, and limited controllability. We present a lightweight, network-free method, Voronoi-Assisted Diffusion (VAD), for computing UDFs directly from unoriented point clouds. Our approach begins by assigning bi-directional normals to input points, guided by two Voronoi-based geometric criteria encoded in an energy function for optimal alignment. The aligned normals are then diffused to form an approximate UDF gradient field, which is subsequently integrated to recover the final UDF. Experiments demonstrate that VAD robustly handles watertight and open surfaces, as well as complex non-manifold and non-orientable geometries, while remaining computationally efficient and stable.

[178] Unconditional Human Motion and Shape Generation via Balanced Score-Based Diffusion

David Björkstrand, Tiesheng Wang, Lars Bretzner, Josephine Sullivan

Main category: cs.CV

TL;DR: A diffusion model for human motion generation achieves state-of-the-art results using only careful feature normalization and analytically derived loss weightings, avoiding complex auxiliary losses and slow post-processing.

Details

Motivation: Existing methods rely on over-parameterized features and auxiliary losses, which should not be necessary for diffusion models to match human motion distribution.

Method: Score-based diffusion model with feature-space normalization and analytically derived weightings for L2 score-matching loss, generating both motion and shape directly.

Result: Achieves state-of-the-art performance in unconditional human motion generation without auxiliary losses or slow post-processing.

Conclusion: Careful feature normalization and theoretical loss weightings are sufficient for diffusion models to match human motion distribution effectively.

Abstract: Recent work has explored a range of model families for human motion generation, including Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and diffusion-based models. Despite their differences, many methods rely on over-parameterized input features and auxiliary losses to improve empirical results. These strategies should not be strictly necessary for diffusion models to match the human motion distribution. We show that on par with state-of-the-art results in unconditional human motion generation are achievable with a score-based diffusion model using only careful feature-space normalization and analytically derived weightings for the standard L2 score-matching loss, while generating both motion and shape directly, thereby avoiding slow post hoc shape recovery from joints. We build the method step by step, with a clear theoretical motivation for each component, and provide targeted ablations demonstrating the effectiveness of each proposed addition in isolation.

[179] CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

Xiaoji Zheng, Ziyuan Yang, Yanhao Chen, Yuhang Peng, Yuanrong Tang, Gengyuan Liu, Bokui Chen, Jiangtao Gong

Main category: cs.CV

TL;DR: CoIRL-AD is a competitive dual-policy framework that combines imitation learning and reinforcement learning for autonomous driving, achieving 18% lower collision rates and better generalization than baselines.

Details

Motivation: Imitation learning alone suffers from poor generalization, while reinforcement learning faces sample inefficiency and unstable convergence. Combining both approaches can overcome these limitations.

Method: Proposes CoIRL-AD, a competitive dual-policy framework where IL and RL agents interact during training through a competition-based mechanism that enables knowledge exchange while preventing gradient conflicts.

Result: Experiments on nuScenes dataset show 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios.

Conclusion: The proposed CoIRL-AD framework effectively combines IL and RL through competitive interaction, achieving superior performance in autonomous driving tasks compared to conventional approaches.

Abstract: End-to-end autonomous driving models trained solely with imitation learning (IL) often suffer from poor generalization. In contrast, reinforcement learning (RL) promotes exploration through reward maximization but faces challenges such as sample inefficiency and unstable convergence. A natural solution is to combine IL and RL. Moving beyond the conventional two-stage paradigm (IL pretraining followed by RL fine-tuning), we propose CoIRL-AD, a competitive dual-policy framework that enables IL and RL agents to interact during training. CoIRL-AD introduces a competition-based mechanism that facilitates knowledge exchange while preventing gradient conflicts. Experiments on the nuScenes dataset show an 18% reduction in collision rate compared to baselines, along with stronger generalization and improved performance on long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.

[180] MMOT: The First Challenging Benchmark for Drone-based Multispectral Multi-Object Tracking

Tianhao Li, Tingfa Xu, Ying Wang, Haolin Qin, Xu Lin, Jianan Li

Main category: cs.CV

TL;DR: MMOT is the first challenging benchmark for drone-based multispectral multi-object tracking, featuring 125 video sequences with 488.8K annotations across 8 categories, addressing limitations of RGB tracking in aerial views through spectral features and oriented annotations.

Details

Motivation: RGB-based tracking algorithms degrade in aerial views due to reliance on spatial appearance cues like color and texture. Multispectral imagery provides crucial spectral reflectance cues that enhance object discriminability under degraded spatial conditions, but lack of dedicated datasets has hindered progress.

Method: Proposed a multispectral and orientation-aware MOT scheme with: (1) lightweight Spectral 3D-Stem for spectral feature integration while maintaining RGB pretraining compatibility, (2) orientation-aware Kalman filter for precise state estimation, and (3) end-to-end orientation-adaptive transformer.

Result: Extensive experiments show multispectral input significantly improves tracking performance over RGB baselines, particularly for small and densely packed objects. The MMOT benchmark enables comprehensive evaluation across diverse challenging conditions.

Conclusion: The work advances drone-based multispectral multi-object tracking research by providing the first dedicated benchmark and demonstrating the effectiveness of spectral features for overcoming limitations of RGB tracking in aerial scenarios.

Abstract: Drone-based multi-object tracking is essential yet highly challenging due to small targets, severe occlusions, and cluttered backgrounds. Existing RGB-based tracking algorithms heavily depend on spatial appearance cues such as color and texture, which often degrade in aerial views, compromising reliability. Multispectral imagery, capturing pixel-level spectral reflectance, provides crucial cues that enhance object discriminability under degraded spatial conditions. However, the lack of dedicated multispectral UAV datasets has hindered progress in this domain. To bridge this gap, we introduce MMOT, the first challenging benchmark for drone-based multispectral multi-object tracking. It features three key characteristics: (i) Large Scale - 125 video sequences with over 488.8K annotations across eight categories; (ii) Comprehensive Challenges - covering diverse conditions such as extreme small targets, high-density scenarios, severe occlusions, and complex motion; and (iii) Precise Oriented Annotations - enabling accurate localization and reduced ambiguity under aerial perspectives. To better extract spectral features and leverage oriented annotations, we further present a multispectral and orientation-aware MOT scheme adapting existing methods, featuring: (i) a lightweight Spectral 3D-Stem integrating spectral features while preserving compatibility with RGB pretraining; (ii) an orientation-aware Kalman filter for precise state estimation; and (iii) an end-to-end orientation-adaptive transformer. Extensive experiments across representative trackers consistently show that multispectral input markedly improves tracking performance over RGB baselines, particularly for small and densely packed objects. We believe our work will advance drone-based multispectral multi-object tracking research. Our MMOT, code, and benchmarks are publicly available at https://github.com/Annzstbl/MMOT.

[181] Learning Human Motion with Temporally Conditional Mamba

Quang Nguyen, Tri Le, Baoru Huang, Minh Nhat Vu, Ngan Le, Thieu Vo, Anh Nguyen

Main category: cs.CV

TL;DR: Temporally Conditional Mamba (TCM) is a new mamba-based model that integrates conditional information into recurrent dynamics to generate human motion with better temporal alignment to input signals, outperforming existing cross-attention methods.

Details

Motivation: Existing methods using cross-attention mechanisms struggle to maintain step-by-step temporal alignment between conditioning inputs and generated human motion, primarily capturing only global interactions.

Method: The approach integrates conditional information directly into the recurrent dynamics of the Mamba block, enabling better temporal alignment throughout the motion generation process.

Result: Extensive experiments show significant improvements in temporal alignment, motion realism, and condition consistency compared to state-of-the-art approaches across various human motion tasks.

Conclusion: The Temporally Conditional Mamba model effectively addresses the temporal alignment limitation of previous methods and demonstrates superior performance in generating human motion that consistently reflects temporal patterns of conditioning inputs.

Abstract: Learning human motion based on a time-dependent input signal presents a challenging yet impactful task with various applications. The goal of this task is to generate or estimate human movement that consistently reflects the temporal patterns of conditioning inputs. Existing methods typically rely on cross-attention mechanisms to fuse the condition with motion. However, this approach primarily captures global interactions and struggles to maintain step-by-step temporal alignment. To address this limitation, we introduce Temporally Conditional Mamba, a new mamba-based model for human motion generation. Our approach integrates conditional information into the recurrent dynamics of the Mamba block, enabling better temporally aligned motion. To validate the effectiveness of our method, we evaluate it on a variety of human motion tasks. Extensive experiments demonstrate that our model significantly improves temporal alignment, motion realism, and condition consistency over state-of-the-art approaches. Our project page is available at https://zquang2202.github.io/TCM.

[182] Unlocking Zero-Shot Plant Segmentation with Pl@ntNet Intelligence

Simon Ravé, Jean-Christophe Lombardo, Pejman Rasti, Alexis Joly, David Rousseau

Main category: cs.CV

TL;DR: Zero-shot segmentation for agricultural imagery using Plantnet with DinoV2 backbone and SAM, achieving improved performance without new dataset annotation.

Details

Motivation: To address the annotation bottleneck in agricultural image segmentation and enable effective segmentation in diverse field conditions without collecting new datasets.

Method: Combines Plantnet’s specialized plant representations with DinoV2 backbone to identify plant regions and generate coarse masks, then refines them using Segment Anything Model (SAM).

Result: Consistent performance gains using Plantnet-fine-tuned DinoV2 over base DinoV2 model across four datasets with varying complexity, as measured by Jaccard Index (IoU).

Conclusion: Combining foundation models with specialized plant-centric models effectively alleviates annotation requirements and enables robust segmentation in diverse agricultural scenarios.

Abstract: We present a zero-shot segmentation approach for agricultural imagery that leverages Plantnet, a large-scale plant classification model, in conjunction with its DinoV2 backbone and the Segment Anything Model (SAM). Rather than collecting and annotating new datasets, our method exploits Plantnet’s specialized plant representations to identify plant regions and produce coarse segmentation masks. These masks are then refined by SAM to yield detailed segmentations. We evaluate on four publicly available datasets of various complexity in terms of contrast including some where the limited size of the training data and complex field conditions often hinder purely supervised methods. Our results show consistent performance gains when using Plantnet-fine-tuned DinoV2 over the base DinoV2 model, as measured by the Jaccard Index (IoU). These findings highlight the potential of combining foundation models with specialized plant-centric models to alleviate the annotation bottleneck and enable effective segmentation in diverse agricultural scenarios.

[183] LayerSync: Self-aligning Intermediate Layers

Yasaman Haghighi, Bastien van Delft, Mariam Hassan, Alexandre Alahi

Main category: cs.CV

TL;DR: LayerSync is a domain-agnostic method that improves diffusion model training efficiency and generation quality by using the model’s own semantically rich intermediate representations to guide weaker ones, eliminating the need for external supervision.

Details

Motivation: Prior research showed that external guidance on diffusion model intermediate representations accelerates training, but this requires additional supervision. The authors aim to create a self-sufficient approach that leverages the model's internal representations without external dependencies.

Method: LayerSync regularizes diffusion models by using their own intermediate representations. It identifies semantically rich representations within the model and uses them as intrinsic guidance for weaker representations, creating a plug-and-play regularizer term with no training overhead.

Result: LayerSync consistently improves generation quality and training efficiency across multiple domains. It achieved 8.75x training speedup for flow-based transformer on ImageNet and 23.6% improvement in generation quality. The method works effectively for image, audio, video, and motion generation.

Conclusion: LayerSync provides an effective, domain-agnostic solution for improving diffusion models by leveraging internal representation hierarchies, eliminating the need for external supervision while maintaining training efficiency and improving generation quality across multiple modalities.

Abstract: We propose LayerSync, a domain-agnostic approach for improving the generation quality and the training efficiency of diffusion models. Prior studies have highlighted the connection between the quality of generation and the representations learned by diffusion models, showing that external guidance on model intermediate representations accelerates training. We reconceptualize this paradigm by regularizing diffusion models with their own intermediate representations. Building on the observation that representation quality varies across diffusion model layers, we show that the most semantically rich representations can act as an intrinsic guidance for weaker ones, reducing the need for external supervision. Our approach, LayerSync, is a self-sufficient, plug-and-play regularizer term with no overhead on diffusion model training and generalizes beyond the visual domain to other modalities. LayerSync requires no pretrained models nor additional data. We extensively evaluate the method on image generation and demonstrate its applicability to other domains such as audio, video, and motion generation. We show that it consistently improves the generation quality and the training efficiency. For example, we speed up the training of flow-based transformer by over 8.75x on ImageNet dataset and improved the generation quality by 23.6%. The code is available at https://github.com/vita-epfl/LayerSync.

[184] Advancing End-to-End Pixel Space Generative Modeling via Self-supervised Pre-training

Jiachen Lei, Keli Liu, Julius Berner, Haiming Yu, Hongkai Zheng, Jiahong Wu, Xiangxiang Chu

Main category: cs.CV

TL;DR: A novel two-stage training framework that closes the performance gap between pixel-space and latent-space generative models for diffusion and consistency models, achieving state-of-the-art results on ImageNet.

Details

Motivation: Pixel-space generative models are more difficult to train and underperform compared to latent-space models, creating a persistent performance and efficiency gap that needs to be addressed.

Method: Two-stage training: (1) Pre-train encoders to capture semantics from clean images while aligning them along deterministic sampling trajectories, (2) Integrate encoder with randomly initialized decoder and fine-tune end-to-end for both diffusion and consistency models.

Result: Diffusion model achieves FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 NFE, surpassing prior pixel-space methods. Consistency model achieves FID of 8.82 on ImageNet-256 in single step, significantly outperforming latent-space counterpart.

Conclusion: The framework successfully closes the performance gap for pixel-space models, enabling high-quality generation without relying on pre-trained VAEs or diffusion models, marking the first successful training of consistency models directly on high-resolution images.

Abstract: Pixel-space generative models are often more difficult to train and generally underperform compared to their latent-space counterparts, leaving a persistent performance and efficiency gap. In this paper, we introduce a novel two-stage training framework that closes this gap for pixel-space diffusion and consistency models. In the first stage, we pre-train encoders to capture meaningful semantics from clean images while aligning them with points along the same deterministic sampling trajectory, which evolves points from the prior to the data distribution. In the second stage, we integrate the encoder with a randomly initialized decoder and fine-tune the complete model end-to-end for both diffusion and consistency models. Our training framework demonstrates strong empirical performance on ImageNet dataset. Specifically, our diffusion model reaches an FID of 2.04 on ImageNet-256 and 2.35 on ImageNet-512 with 75 number of function evaluations (NFE), surpassing prior pixel-space methods by a large margin in both generation quality and efficiency while rivaling leading VAE-based models at comparable training cost. Furthermore, on ImageNet-256, our consistency model achieves an impressive FID of 8.82 in a single sampling step, significantly surpassing its latent-space counterpart. To the best of our knowledge, this marks the first successful training of a consistency model directly on high-resolution images without relying on pre-trained VAEs or diffusion models.

[185] Reasoning in the Dark: Interleaved Vision-Text Reasoning in Latent Space

Chao Chen, Zhixin Ma, Yongqi Li, Yupeng Hu, Yinwei Wei, Wenjie Li, Liqiang Nie

Main category: cs.CV

TL;DR: IVT-LR introduces multimodal latent reasoning that combines visual and textual information in latent space to improve efficiency and reduce annotation requirements in multimodal reasoning.

Details

Motivation: Current multimodal reasoning methods require explicit reasoning steps with labor-intensive vision-text annotations and suffer from significant inference latency.

Method: Proposes Interleaved Vision-Text Latent Reasoning (IVT-LR) that represents reasoning steps using latent text (hidden states) and latent vision (selected image embeddings) in latent space, with progressive multi-stage training.

Result: Achieves 5.45% average accuracy improvement on M3CoT and ScienceQA benchmarks while achieving over 5x speed increase compared to existing approaches.

Conclusion: IVT-LR provides an efficient multimodal reasoning framework that reduces annotation requirements and inference latency while improving performance.

Abstract: Multimodal reasoning aims to enhance the capabilities of MLLMs by incorporating intermediate reasoning steps before reaching the final answer. It has evolved from text-only reasoning to the integration of visual information, enabling the thought process to be conveyed through both images and text. Despite its effectiveness, current multimodal reasoning methods depend on explicit reasoning steps that require labor-intensive vision-text annotations and inherently introduce significant inference latency. To address these issues, we introduce multimodal latent reasoning with the advantages of multimodal representation, reduced annotation, and inference efficiency. To facilicate it, we propose Interleaved Vision-Text Latent Reasoning (IVT-LR), which injects both visual and textual information in the reasoning process within the latent space. Specifically, IVT-LR represents each reasoning step by combining two implicit parts: latent text (the hidden states from the previous step) and latent vision (a set of selected image embeddings). We further introduce a progressive multi-stage training strategy to enable MLLMs to perform the above multimodal latent reasoning steps. Experiments on M3CoT and ScienceQA demonstrate that our IVT-LR method achieves an average performance increase of 5.45% in accuracy, while simultaneously achieving a speed increase of over 5 times compared to existing approaches. Code available at https://github.com/FYYDCC/IVT-LR.

[186] WaterFlow: Explicit Physics-Prior Rectified Flow for Underwater Saliency Mask Generation

Runting Li, Shijie Lian, Hua Li, Yutong Li, Wenhui Wu, Sam Kwong

Main category: cs.CV

TL;DR: WaterFlow is a rectified flow-based framework for underwater salient object detection that incorporates underwater physical imaging priors and temporal modeling, achieving state-of-the-art performance on USOD10K dataset.

Details

Motivation: Existing USOD methods ignore underwater imaging physics or treat degradation as noise, failing to exploit valuable information in underwater images. There's a need to incorporate physical principles and leverage degradation phenomena constructively.

Method: Proposes WaterFlow framework using rectified flow-based approach that explicitly incorporates underwater physical imaging information as priors during network training and introduces temporal dimension modeling for enhanced salient object identification.

Result: Achieves 0.072 gain in S_m metric on USOD10K dataset, demonstrating significant performance improvement and effectiveness of the proposed method.

Conclusion: WaterFlow effectively addresses USOD challenges by incorporating physical imaging priors and temporal modeling, showing superior performance and establishing a new approach for leveraging underwater degradation information constructively.

Abstract: Underwater Salient Object Detection (USOD) faces significant challenges, including underwater image quality degradation and domain gaps. Existing methods tend to ignore the physical principles of underwater imaging or simply treat degradation phenomena in underwater images as interference factors that must be eliminated, failing to fully exploit the valuable information they contain. We propose WaterFlow, a rectified flow-based framework for underwater salient object detection that innovatively incorporates underwater physical imaging information as explicit priors directly into the network training process and introduces temporal dimension modeling, significantly enhancing the model’s capability for salient object identification. On the USOD10K dataset, WaterFlow achieves a 0.072 gain in S_m, demonstrating the effectiveness and superiority of our method. The code will be published after the acceptance.

[187] Zero-Shot CFC: Fast Real-World Image Denoising based on Cross-Frequency Consistency

Yanlin Jiang, Yuchen Liu, Mingren Liu

Main category: cs.CV

TL;DR: ZSCFC is an efficient zero-shot denoiser that uses cross-frequency consistency to remove noise from single images without noise distribution assumptions, outperforming existing methods in speed and performance.

Details

Motivation: Existing zero-shot denoisers have long training times and rely on unrealistic noise assumptions (independence and zero-mean), limiting their effectiveness in real-world scenarios with complex noise characteristics.

Method: Proposes ZSCFC method based on cross-frequency consistency - image textures show position similarity and content consistency across frequency bands while noise doesn’t. Uses cross-frequency consistency loss and ultralight network for single-image training and denoising.

Result: Experiments on various real-world datasets show ZSCFC outperforms state-of-the-art zero-shot methods in both computational efficiency and denoising performance.

Conclusion: ZSCFC provides an effective and efficient solution for real-world denoising that works with single noisy images without requiring noise distribution assumptions.

Abstract: Zero-shot denoisers address the dataset dependency of deep-learning-based denoisers, enabling the denoising of unseen single images. Nonetheless, existing zero-shot methods suffer from long training times and rely on the assumption of noise independence and a zero-mean property, limiting their effectiveness in real-world denoising scenarios where noise characteristics are more complicated. This paper proposes an efficient and effective method for real-world denoising, the Zero-Shot denoiser based on Cross-Frequency Consistency (ZSCFC), which enables training and denoising with a single noisy image and does not rely on assumptions about noise distribution. Specifically, image textures exhibit position similarity and content consistency across different frequency bands, while noise does not. Based on this property, we developed cross-frequency consistency loss and an ultralight network to realize image denoising. Experiments on various real-world image datasets demonstrate that our ZSCFC outperforms other state-of-the-art zero-shot methods in terms of computational efficiency and denoising performance.

[188] On the Use of Hierarchical Vision Foundation Models for Low-Cost Human Mesh Recovery and Pose Estimation

Shuhei Tarashima, Yushan Wang, Norio Tagawa

Main category: cs.CV

TL;DR: This paper develops efficient human mesh recovery (HMR) and human pose estimation (HPE) models by using early stages of hierarchical vision foundation models as encoders, achieving comparable performance to full models with better computational efficiency.

Details

Motivation: To create simple and efficient HMR/HPE models that overcome the computational burden of large non-hierarchical vision transformers used in state-of-the-art methods like HMR2.0 and ViTPose.

Method: Proposed using only the first 2-3 stages of hierarchical vision foundation models (Swin Transformer, GroupMixFormer, VMamba) as encoders, leveraging their high-resolution intermediate features to match non-hierarchical model performance.

Result: Comprehensive evaluation of 27 hierarchical-VFM-based models showed that truncated models achieve performance on par with full-stage models while offering better accuracy-efficiency trade-offs than existing lightweight alternatives.

Conclusion: Using early stages of hierarchical vision foundation models provides an effective approach for building efficient HMR and HPE models with competitive performance and improved computational efficiency.

Abstract: In this work, we aim to develop simple and efficient models for human mesh recovery (HMR) and its predecessor task, human pose estimation (HPE). State-of-the-art HMR methods, such as HMR2.0 and its successors, rely on large, non-hierarchical vision transformers as encoders, which are inherited from the corresponding HPE models like ViTPose. To establish baselines across varying computational budgets, we first construct three lightweight HMR2.0 variants by adapting the corresponding ViTPose models. In addition, we propose leveraging the early stages of hierarchical vision foundation models (VFMs), including Swin Transformer, GroupMixFormer, and VMamba, as encoders. This design is motivated by the observation that intermediate stages of hierarchical VFMs produce feature maps with resolutions comparable to or higher than those of non-hierarchical counterparts. We conduct a comprehensive evaluation of 27 hierarchical-VFM-based HMR and HPE models, demonstrating that using only the first two or three stages achieves performance on par with full-stage models. Moreover, we show that the resulting truncated models exhibit better trade-offs between accuracy and computational efficiency compared to existing lightweight alternatives.

[189] TerraCodec: Compressing Earth Observations

Julen Costa-Watanabe, Isabelle Wittmann, Benedikt Blumenstiel, Konrad Schindler

Main category: cs.CV

TL;DR: TerraCodec (TEC) is a family of learned compression algorithms for Earth observation data that outperforms classical codecs by 3-10x, with temporal models enabling zero-shot cloud inpainting.

Details

Motivation: Earth observation satellites generate massive multispectral image time series, creating storage and transmission challenges. Current learned compression is fragmented and misaligned with natural image compression advances, while existing codecs fail to capture temporal redundancy in largely static scenes.

Method: TEC includes image-based variants for multispectral inputs and TEC-TT (Temporal Transformer) that leverages temporal dependencies. Introduces Latent Repacking for training flexible-rate transformer models that handle varying rate-distortion settings.

Result: Trained on Sentinel-2 data, TerraCodec achieves 3-10x stronger compression than classical codecs at equivalent image quality. TEC-TT enables zero-shot cloud inpainting, surpassing state-of-the-art methods on the AllClear benchmark.

Conclusion: Bespoke, learned compression algorithms are a promising direction for Earth observation, with code and model weights to be released under permissive license.

Abstract: Earth observation (EO) satellites produce massive streams of multispectral image time series, posing pressing challenges for storage and transmission. Yet, learned EO compression remains fragmented, lacking publicly available pretrained models and misaligned with advances in compression for natural imagery. Image codecs overlook temporal redundancy, while video codecs rely on motion priors that fail to capture the radiometric evolution of largely static scenes. We introduce TerraCodec (TEC), a family of learned codecs tailored to EO. TEC includes efficient image-based variants adapted to multispectral inputs, as well as a Temporal Transformer model (TEC-TT) that leverages dependencies across time. To overcome the fixed-rate setting of today’s neural codecs, we present Latent Repacking, a novel method for training flexible-rate transformer models that operate on varying rate-distortion settings. Trained on Sentinel-2 data, TerraCodec outperforms classical codecs, achieving 3-10x stronger compression at equivalent image quality. Beyond compression, TEC-TT enables zero-shot cloud inpainting, surpassing state-of-the-art methods on the AllClear benchmark. Our results establish bespoke, learned compression algorithms as a promising direction for Earth observation. Code and model weights will be released under a permissive license.

[190] MCOP: Multi-UAV Collaborative Occupancy Prediction

Zefu Lin, Wenbo Chen, Xiaojuan Jin, Yuran Yang, Lue Fan, Yixin Zhang, Yufeng Zhang, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: A novel multi-UAV collaborative occupancy prediction framework that preserves 3D spatial structures and semantics while reducing communication overhead, achieving state-of-the-art accuracy on extended virtual and real-world datasets.

Details

Motivation: Current BEV-based approaches for UAV swarm systems have limitations: bounding-box representations fail to capture complete semantic/geometric information, and performance degrades with undefined/occluded objects.

Method: Proposes a framework with Spatial-Aware Feature Encoder, Cross-Agent Feature Integration, Altitude-Aware Feature Reduction, and Dual-Mask Perceptual Guidance for adaptive feature selection and communication efficiency.

Result: Achieves state-of-the-art accuracy, significantly outperforms existing collaborative methods, and reduces communication overhead to only a fraction of previous approaches.

Conclusion: The proposed collaborative occupancy prediction framework effectively addresses limitations of current BEV-based methods and demonstrates superior performance with reduced communication costs.

Abstract: Unmanned Aerial Vehicle (UAV) swarm systems necessitate efficient collaborative perception mechanisms for diverse operational scenarios. Current Bird’s Eye View (BEV)-based approaches exhibit two main limitations: bounding-box representations fail to capture complete semantic and geometric information of the scene, and their performance significantly degrades when encountering undefined or occluded objects. To address these limitations, we propose a novel multi-UAV collaborative occupancy prediction framework. Our framework effectively preserves 3D spatial structures and semantics through integrating a Spatial-Aware Feature Encoder and Cross-Agent Feature Integration. To enhance efficiency, we further introduce Altitude-Aware Feature Reduction to compactly represent scene information, along with a Dual-Mask Perceptual Guidance mechanism to adaptively select features and reduce communication overhead. Due to the absence of suitable benchmark datasets, we extend three datasets for evaluation: two virtual datasets (Air-to-Pred-Occ and UAV3D-Occ) and one real-world dataset (GauUScene-Occ). Experiments results demonstrate that our method achieves state-of-the-art accuracy, significantly outperforming existing collaborative methods while reducing communication overhead to only a fraction of previous approaches.

[191] EReLiFM: Evidential Reliability-Aware Residual Flow Meta-Learning for Open-Set Domain Generalization under Noisy Labels

Kunyu Peng, Di Wen, Kailun Yang, Jia Fu, Yufan Chen, Ruiping Liu, Jiamin Wu, Junwei Zheng, M. Saquib Sarfraz, Luc Van Gool, Danda Pani Paudel, Rainer Stiefelhagen

Main category: cs.CV

TL;DR: EReLiFM addresses Open-Set Domain Generalization with Noisy Labels by combining evidential loss clustering for label reliability awareness and residual flow matching for uncertainty-aware domain transfer, achieving state-of-the-art performance.

Details

Motivation: Label noise corrupts source-domain knowledge in Open-Set Domain Generalization, making it difficult to recognize known classes and reject unseen ones. Existing methods struggle with domain gaps when clean labeled data is limited.

Method: Proposes Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM) with: 1) unsupervised two-stage evidential loss clustering for label reliability awareness, 2) residual flow matching mechanism modeling domain- and category-conditioned residuals for uncertainty-aware transfer, and 3) meta-learning optimization where clean set updates maximize loss decrease on noisy set using confident pseudo labels.

Result: Experimental results show EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance.

Conclusion: EReLiFM effectively addresses the challenges of Open-Set Domain Generalization with Noisy Labels through evidential clustering and residual flow matching, demonstrating superior performance over existing approaches.

Abstract: Open-Set Domain Generalization (OSDG) aims to enable deep learning models to recognize unseen categories in new domains, which is crucial for real-world applications. Label noise hinders open-set domain generalization by corrupting source-domain knowledge, making it harder to recognize known classes and reject unseen ones. While existing methods address OSDG under Noisy Labels (OSDG-NL) using hyperbolic prototype-guided meta-learning, they struggle to bridge domain gaps, especially with limited clean labeled data. In this paper, we propose Evidential Reliability-Aware Residual Flow Meta-Learning (EReLiFM). We first introduce an unsupervised two-stage evidential loss clustering method to promote label reliability awareness. Then, we propose a residual flow matching mechanism that models structured domain- and category-conditioned residuals, enabling diverse and uncertainty-aware transfer paths beyond interpolation-based augmentation. During this meta-learning process, the model is optimized such that the update direction on the clean set maximizes the loss decrease on the noisy set, using pseudo labels derived from the most confident predicted class for supervision. Experimental results show that EReLiFM outperforms existing methods on OSDG-NL, achieving state-of-the-art performance. The source code is available at https://github.com/KPeng9510/ERELIFM.

[192] Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis

Shelley Zixin Shu, Haozhe Luo, Alexander Poellinger, Mauricio Reyes

Main category: cs.CV

TL;DR: Proposes H-EGL framework combining self-supervised and human-guided constraints to improve attention alignment and generalization in medical imaging transformers, outperforming existing EGL methods.

Details

Motivation: Transformer models in medical imaging often learn spurious correlations leading to biases and limited generalization. Human-AI attention alignment can help but requires costly manual supervision.

Method: Hybrid Explanation-Guided Learning (H-EGL) framework with self-supervised component using class-distinctive attention without restrictive priors, combined with human-guided constraints.

Result: H-EGL outperforms two state-of-the-art EGL methods in chest X-ray classification using Vision Transformer, achieving superior classification accuracy, generalization capability, and better attention map alignment with human expertise.

Conclusion: The H-EGL framework effectively enhances attention alignment and generalization in medical imaging transformers without relying on costly manual supervision, demonstrating improved performance over existing methods.

Abstract: Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.

[193] Beyond Seeing: Evaluating Multimodal LLMs on Tool-Enabled Image Perception, Transformation, and Reasoning

Xingang Guo, Utkarsh Tyagi, Advait Gosai, Paula Vergara, Ernesto Gabriel Hernández Montoya, Chen Bo Calvin Zhang, Bin Hu, Yunzhong He, Bing Liu, Rakshith Sharma Srinivasa

Main category: cs.CV

TL;DR: IRIS is a new benchmark for evaluating MLLMs’ ability to actively manipulate and reason with images, moving beyond static visual perception to dynamic image transformation and tool integration.

Details

Motivation: Current MLLMs treat images as passive context rather than manipulable cognitive workspaces, and existing benchmarks follow a 'think about images' paradigm rather than 'think with images' approach needed for real-world applications.

Method: Developed IRIS benchmark with 1,204 challenging vision tasks (603 single-turn, 601 multi-turn) across five domains, with detailed rubrics for systematic evaluation of MLLMs’ ability to perceive, transform, and reason with images.

Result: Current MLLMs struggle significantly - even the strongest model (GPT-5-think) achieves only 18.68% pass rate. Models show divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement.

Conclusion: IRIS provides the first benchmark for ’think with images’ paradigm, revealing critical gaps in MLLMs’ visual intelligence and offering insights for advancing their ability to actively manipulate and reason with visual content.

Abstract: Multimodal Large Language Models (MLLMs) are increasingly applied in real-world scenarios where user-provided images are often imperfect, requiring active image manipulations such as cropping, editing, or enhancement to uncover salient visual cues. Beyond static visual perception, MLLMs must also think with images: dynamically transforming visual content and integrating it with other tools to solve complex tasks. However, this shift from treating vision as passive context to a manipulable cognitive workspace remains underexplored. Most existing benchmarks still follow a think about images paradigm, where images are regarded as static inputs. To address this gap, we introduce IRIS, an Interactive Reasoning with Images and Systems that evaluates MLLMs’ ability to perceive, transform, and reason across complex visual-textual tasks under the think with images paradigm. IRIS comprises 1,204 challenging, open-ended vision tasks (603 single-turn, 601 multi-turn) spanning across five diverse domains, each paired with detailed rubrics to enable systematic evaluation. Our evaluation shows that current MLLMs struggle with tasks requiring effective integration of vision and general-purpose tools. Even the strongest model (GPT-5-think) reaches only 18.68% pass rate. We further observe divergent tool-use behaviors, with OpenAI models benefiting from diverse image manipulations while Gemini-2.5-pro shows no improvement. By introducing the first benchmark centered on think with images, IRIS offers critical insights for advancing visual intelligence in MLLMs.

[194] Personalized Federated Fine-Tuning of Vision Foundation Models for Healthcare

Adam Tupper, Christian Gagné

Main category: cs.CV

TL;DR: Proposes a personalized federated fine-tuning method using orthogonal LoRA adapters to separate general and client-specific knowledge, enabling better use of distributed healthcare data while preserving privacy.

Details

Motivation: Foundation models need fine-tuning for healthcare tasks but face data scarcity due to privacy restrictions on sharing patient data across hospitals and clinics.

Method: Uses orthogonal LoRA adapters to disentangle general and client-specific knowledge in federated learning, allowing each client to leverage both their own data and others’ data.

Result: Preliminary results on real-world federated medical imaging tasks show the approach is competitive against current federated fine-tuning methods.

Conclusion: The proposed personalized federated fine-tuning method effectively addresses data scarcity in healthcare while maintaining privacy through federated learning.

Abstract: Foundation models open up new possibilities for the use of AI in healthcare. However, even when pre-trained on health data, they still need to be fine-tuned for specific downstream tasks. Furthermore, although foundation models reduce the amount of training data required to achieve good performance, obtaining sufficient data is still a challenge. This is due, in part, to restrictions on sharing and aggregating data from different sources to protect patients’ privacy. One possible solution to this is to fine-tune foundation models via federated learning across multiple participating clients (i.e., hospitals, clinics, etc.). In this work, we propose a new personalized federated fine-tuning method that learns orthogonal LoRA adapters to disentangle general and client-specific knowledge, enabling each client to fully exploit both their own data and the data of others. Our preliminary results on real-world federated medical imaging tasks demonstrate that our approach is competitive against current federated fine-tuning methods.

[195] SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, Xihui Liu

Main category: cs.CV

TL;DR: SRUM is a self-rewarding post-training framework that enables Unified Multimodal Models to use their understanding module to reward and improve their generation module, creating a feedback loop without additional human-labeled data.

Details

Motivation: There's a significant gap where models' strong visual understanding capabilities don't transfer to visual generation - they can understand images but can't generate faithful images from text prompts. This raises the question of whether models can achieve self-improvement by using understanding to reward generation.

Method: SRUM creates a feedback loop where the model’s understanding module acts as an internal evaluator, providing corrective signals to improve generation. It uses a global-local dual reward system: global reward ensures overall visual semantics and layout correctness, while local reward refines fine-grained object-level fidelity.

Result: SRUM significantly boosts performance on benchmarks: T2I-CompBench improved from 82.18 to 88.37, and T2I-ReasonBench improved from 43.82 to 46.75. The framework shows strong generalization capabilities.

Conclusion: SRUM establishes a new paradigm for enabling Unified Multimodal Models’ understanding modules to guide and enhance their own generation capabilities through self-rewarding, bridging the gap between understanding and generation without requiring additional human-labeled data.

Abstract: Recently, remarkable progress has been made in Unified Multimodal Models (UMMs), which integrate vision-language generation and understanding capabilities within a single framework. However, a significant gap exists where a model’s strong visual understanding often fails to transfer to its visual generation. A model might correctly understand an image based on user instructions, yet be unable to generate a faithful image from text prompts. This phenomenon directly raises a compelling question: Can a model achieve self-improvement by using its understanding module to reward its generation module? To bridge this gap and achieve self-improvement, we introduce SRUM, a self-rewarding post-training framework that can be directly applied to existing UMMs of various designs. SRUM creates a feedback loop where the model’s own understanding module acts as an internal ``evaluator’’, providing corrective signals to improve its generation module, without requiring additional human-labeled data. To ensure this feedback is comprehensive, we designed a global-local dual reward system. To tackle the inherent structural complexity of images, this system offers multi-scale guidance: a \textbf{global reward} ensures the correctness of the overall visual semantics and layout, while a \textbf{local reward} refines fine-grained, object-level fidelity. SRUM leads to powerful capabilities and shows strong generalization, boosting performance on T2I-CompBench from 82.18 to \textbf{88.37} and on T2I-ReasonBench from 43.82 to \textbf{46.75}. Overall, our work establishes a powerful new paradigm for enabling a UMMs’ understanding module to guide and enhance its own generation via self-rewarding.

[196] FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution

Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue

Main category: cs.CV

TL;DR: FlashVSR is a diffusion-based one-step streaming framework for real-time video super-resolution that achieves ~17 FPS for 768x1408 videos on A100 GPU through three innovations: train-friendly distillation, sparse attention, and tiny conditional decoder.

Details

Motivation: Current diffusion models for video super-resolution suffer from high latency, prohibitive computation, and poor generalization to ultra-high resolutions, making them impractical for real-world applications.

Method: Three complementary innovations: (1) train-friendly three-stage distillation pipeline for streaming super-resolution, (2) locality-constrained sparse attention to reduce computation and bridge train-test resolution gap, (3) tiny conditional decoder for accelerated reconstruction. Also created VSR-120K dataset with 120k videos and 180k images.

Result: FlashVSR achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models, scales reliably to ultra-high resolutions, and runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU.

Conclusion: FlashVSR makes diffusion-based video super-resolution practical by achieving efficiency, scalability, and real-time performance, with code, models, and dataset to be released for future research.

Abstract: Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.

[197] SPORTS: Simultaneous Panoptic Odometry, Rendering, Tracking and Segmentation for Urban Scenes Understanding

Zhiliu Yang, Jinyu Dai, Jianyuan Zhang, Zhu Yang

Main category: cs.CV

TL;DR: SPORTS is a unified framework that integrates Video Panoptic Segmentation, Visual Odometry, and Scene Rendering to address scene understanding challenges like segmentation deficiency and sensor sparsity through iterative cross-task enhancement.

Details

Motivation: Existing scene perception solutions suffer from segmentation deficiencies, dynamic object interference, sensor data sparsity, and view limitations, which hinder embodied-AI agents' performance.

Method: The framework integrates three tasks: VPS uses adaptive attention-based geometric fusion with pose, depth, and optical flow alignment; VO combines panoptic segmentation with optical flow for better camera pose estimation; SR transforms sparse point clouds into neural fields for high-fidelity rendering.

Result: Extensive experiments on three public datasets show superior performance on odometry, tracking, segmentation, and novel view synthesis tasks compared to state-of-the-art methods.

Conclusion: SPORTS demonstrates that tightly integrating VPS, VO, and SR in an iterative framework effectively addresses key scene understanding challenges and outperforms existing approaches across multiple tasks.

Abstract: The scene perception, understanding, and simulation are fundamental techniques for embodied-AI agents, while existing solutions are still prone to segmentation deficiency, dynamic objects’ interference, sensor data sparsity, and view-limitation problems. This paper proposes a novel framework, named SPORTS, for holistic scene understanding via tightly integrating Video Panoptic Segmentation (VPS), Visual Odometry (VO), and Scene Rendering (SR) tasks into an iterative and unified perspective. Firstly, VPS designs an adaptive attention-based geometric fusion mechanism to align cross-frame features via enrolling the pose, depth, and optical flow modality, which automatically adjust feature maps for different decoding stages. And a post-matching strategy is integrated to improve identities tracking. In VO, panoptic segmentation results from VPS are combined with the optical flow map to improve the confidence estimation of dynamic objects, which enhances the accuracy of the camera pose estimation and completeness of the depth map generation via the learning-based paradigm. Furthermore, the point-based rendering of SR is beneficial from VO, transforming sparse point clouds into neural fields to synthesize high-fidelity RGB views and twin panoptic views. Extensive experiments on three public datasets demonstrate that our attention-based feature fusion outperforms most existing state-of-the-art methods on the odometry, tracking, segmentation, and novel view synthesis tasks.

[198] VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

A. Alfarano, L. Venturoli, D. Negueruela del Castillo

Main category: cs.CV

TL;DR: VQArt-Bench is a new VQA benchmark for cultural heritage that reveals significant limitations in current MLLMs, especially in deep semantic understanding of visual art.

Details

Motivation: Existing VQA benchmarks fail to evaluate deep semantic understanding in complex domains like visual art analysis, leading models to exploit statistical shortcuts rather than engage in visual reasoning.

Method: Constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions structured along visual understanding dimensions.

Result: Evaluation of 14 state-of-the-art MLLMs revealed significant limitations, including weakness in simple counting tasks and performance gap between proprietary and open-source models.

Conclusion: Current MLLMs have substantial limitations in deep visual understanding, particularly in complex domains like cultural heritage, highlighting the need for more sophisticated benchmarks and model improvements.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, these questions fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model’s ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models.

[199] E-MoFlow: Learning Egomotion and Optical Flow from Event Data via Implicit Regularization

Wenpu Li, Bangyan Liao, Yi Zhou, Qi Xu, Pian Wan, Peidong Liu

Main category: cs.CV

TL;DR: E-MoFlow: An unsupervised framework that jointly optimizes optical flow and 6-DoF ego-motion using implicit spatial-temporal and geometric regularization, achieving state-of-the-art performance without supervision.

Details

Motivation: Traditional independent estimation of optical flow and ego-motion is ill-posed for neuromorphic vision due to lack of robust data association and ground truth supervision. Existing methods introduce bias through explicit regularization or converge to suboptimal solutions.

Method: Models camera ego-motion as continuous spline and optical flow as implicit neural representation, embedding spatial-temporal coherence through inductive biases. Incorporates structure-and-motion priors via differential geometric constraints without explicit depth estimation.

Result: Achieves state-of-the-art performance among unsupervised methods and competitive with supervised approaches. Demonstrates versatility in general 6-DoF motion scenarios.

Conclusion: E-MoFlow successfully unifies ego-motion and optical flow estimation through implicit regularization in an unsupervised paradigm, addressing the ill-posed nature of the problem for neuromorphic vision.

Abstract: The estimation of optical flow and 6-DoF ego-motion, two fundamental tasks in 3D vision, has typically been addressed independently. For neuromorphic vision (e.g., event cameras), however, the lack of robust data association makes solving the two problems separately an ill-posed challenge, especially in the absence of supervision via ground truth. Existing works mitigate this ill-posedness by either enforcing the smoothness of the flow field via an explicit variational regularizer or leveraging explicit structure-and-motion priors in the parametrization to improve event alignment. The former notably introduces bias in results and computational overhead, while the latter, which parametrizes the optical flow in terms of the scene depth and the camera motion, often converges to suboptimal local minima. To address these issues, we propose an unsupervised framework that jointly optimizes egomotion and optical flow via implicit spatial-temporal and geometric regularization. First, by modeling camera’s egomotion as a continuous spline and optical flow as an implicit neural representation, our method inherently embeds spatial-temporal coherence through inductive biases. Second, we incorporate structure-and-motion priors through differential geometric constraints, bypassing explicit depth estimation while maintaining rigorous geometric consistency. As a result, our framework (called E-MoFlow) unifies egomotion and optical flow estimation via implicit regularization under a fully unsupervised paradigm. Experiments demonstrate its versatility to general 6-DoF motion scenarios, achieving state-of-the-art performance among unsupervised methods and competitive even with supervised approaches.

[200] PET Head Motion Estimation Using Supervised Deep Learning with Attention

Zhuotong Cai, Tianyi Zeng, Jiazhen Zhang, Eléonore V. Lieffrig, Kathryn Fontaine, Chenyu You, Enette Mae Revilla, James S. Duncan, Jingmin Xin, Yihuan Lu, John A. Onofrey

Main category: cs.CV

TL;DR: DL-HMC++ is a deep learning approach for head motion correction in PET imaging that uses cross-attention to predict rigid head motion from 3D PET raw data, outperforming existing methods and achieving results comparable to gold-standard hardware motion tracking.

Details

Motivation: Head movement causes artifacts and quantification errors in brain PET imaging. Hardware-based motion tracking has limited clinical applicability, so a data-driven solution is needed for accurate neurological diagnosis.

Method: Deep learning approach with cross-attention trained in supervised manner using dynamic PET scans with gold-standard motion measurements from external hardware tracking. Predicts rigid head motion from one-second 3D PET raw data.

Result: Outperforms state-of-the-art data-driven methods, produces motion-free images indistinguishable from gold-standard HMT. Average difference ratios: 1.2±0.5% for HRRT and 0.5±0.2% for mCT scanners across four radiotracers.

Conclusion: DL-HMC++ enables effective PET head motion correction without hardware tracking, making motion correction accessible for clinical populations beyond research settings.

Abstract: Head movement poses a significant challenge in brain positron emission tomography (PET) imaging, resulting in image artifacts and tracer uptake quantification inaccuracies. Effective head motion estimation and correction are crucial for precise quantitative image analysis and accurate diagnosis of neurological disorders. Hardware-based motion tracking (HMT) has limited applicability in real-world clinical practice. To overcome this limitation, we propose a deep-learning head motion correction approach with cross-attention (DL-HMC++) to predict rigid head motion from one-second 3D PET raw data. DL-HMC++ is trained in a supervised manner by leveraging existing dynamic PET scans with gold-standard motion measurements from external HMT. We evaluate DL-HMC++ on two PET scanners (HRRT and mCT) and four radiotracers (18F-FDG, 18F-FPEB, 11C-UCB-J, and 11C-LSN3172176) to demonstrate the effectiveness and generalization of the approach in large cohort PET studies. Quantitative and qualitative results demonstrate that DL-HMC++ consistently outperforms state-of-the-art data-driven motion estimation methods, producing motion-free images with clear delineation of brain structures and reduced motion artifacts that are indistinguishable from gold-standard HMT. Brain region of interest standard uptake value analysis exhibits average difference ratios between DL-HMC++ and gold-standard HMT to be 1.2 plus-minus 0.5% for HRRT and 0.5 plus-minus 0.2% for mCT. DL-HMC++ demonstrates the potential for data-driven PET head motion correction to remove the burden of HMT, making motion correction accessible to clinical populations beyond research settings. The code is available at https://github.com/maxxxxxxcai/DL-HMC-TMI.

[201] AnyUp: Universal Feature Upsampling

Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, Jan Eric Lenssen

Main category: cs.CV

TL;DR: AnyUp is a feature-agnostic upsampling method that works with any vision feature at any resolution without encoder-specific training, outperforming existing methods.

Details

Motivation: Existing learning-based upsamplers need retraining for each feature extractor and don't generalize across different feature types at inference time.

Method: An inference-time feature-agnostic upsampling architecture that doesn’t require encoder-specific training.

Result: Sets new state-of-the-art for upsampled features, generalizes to different feature types, preserves feature semantics, and is efficient for downstream tasks.

Conclusion: AnyUp provides a versatile and high-quality upsampling solution that works across various vision features without retraining requirements.

Abstract: We introduce AnyUp, a method for feature upsampling that can be applied to any vision feature at any resolution, without encoder-specific training. Existing learning-based upsamplers for features like DINO or CLIP need to be re-trained for every feature extractor and thus do not generalize to different feature types at inference time. In this work, we propose an inference-time feature-agnostic upsampling architecture to alleviate this limitation and improve upsampling quality. In our experiments, AnyUp sets a new state of the art for upsampled features, generalizes to different feature types, and preserves feature semantics while being efficient and easy to apply to a wide range of downstream tasks.

[202] Efficient Perceptual Image Super Resolution: AIM 2025 Study and Benchmark

Bruno Longarela, Marcos V. Conde, Alvaro Garcia, Radu Timofte

Main category: cs.CV

TL;DR: This paper presents a benchmark study on Efficient Perceptual Super-Resolution (EPSR), aiming to achieve Real-ESRGAN-level perceptual quality with strict efficiency constraints of ≤5M parameters and ≤2000 GFLOPs.

Details

Motivation: While efficient PSNR-oriented super resolution has advanced significantly, perceptual quality-focused approaches remain inefficient. The authors aim to bridge this gap by developing efficient methods that match or exceed Real-ESRGAN's perceptual results.

Method: The study proposes solutions evaluated on a novel dataset of 500 4K test images degraded using multiple degradation types, without original high-quality counterparts to simulate realistic deployment conditions.

Result: The top-performing approach outperforms Real-ESRGAN across all benchmark datasets while meeting the strict efficiency constraints, demonstrating the potential of efficient methods in the perceptual domain.

Conclusion: This paper establishes modern baselines for efficient perceptual super resolution, showing that efficient methods can achieve superior perceptual quality compared to existing approaches like Real-ESRGAN.

Abstract: This paper presents a comprehensive study and benchmark on Efficient Perceptual Super-Resolution (EPSR). While significant progress has been made in efficient PSNR-oriented super resolution, approaches focusing on perceptual quality metrics remain relatively inefficient. Motivated by this gap, we aim to replicate or improve the perceptual results of Real-ESRGAN while meeting strict efficiency constraints: a maximum of 5M parameters and 2000 GFLOPs, calculated for an input size of 960x540 pixels. The proposed solutions were evaluated on a novel dataset consisting of 500 test images of 4K resolution, each degraded using multiple degradation types, without providing the original high-quality counterparts. This design aims to reflect realistic deployment conditions and serves as a diverse and challenging benchmark. The top-performing approach manages to outperform Real-ESRGAN across all benchmark datasets, demonstrating the potential of efficient methods in the perceptual domain. This paper establishes the modern baselines for efficient perceptual super resolution.

[203] Uncertainty Matters in Dynamic Gaussian Splatting for Monocular 4D Reconstruction

Fengzhi Guo, Chih-Chuan Hsu, Sihao Ding, Cheng Zhang

Main category: cs.CV

TL;DR: USplat4D introduces uncertainty-aware dynamic Gaussian Splatting that models per-Gaussian uncertainty to improve 4D reconstruction by propagating reliable motion cues and handling occlusion better.

Details

Motivation: Dynamic 3D scene reconstruction from monocular input is under-constrained with occlusion and extreme view ambiguities. Vanilla Gaussian Splatting optimizes all primitives uniformly, ignoring observation reliability, leading to motion drifts and degraded synthesis.

Method: Estimates time-varying per-Gaussian uncertainty and constructs a spatio-temporal graph for uncertainty-aware optimization, treating frequently observed Gaussians as reliable anchors and less visible ones as less reliable.

Result: Experiments on diverse datasets show consistent improvements over vanilla models, with more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.

Conclusion: Explicitly modeling uncertainty in dynamic Gaussian Splatting frameworks significantly enhances 4D reconstruction quality by leveraging reliable motion cues and handling observation variability.

Abstract: Reconstructing dynamic 3D scenes from monocular input is fundamentally under-constrained, with ambiguities arising from occlusion and extreme novel views. While dynamic Gaussian Splatting offers an efficient representation, vanilla models optimize all Gaussian primitives uniformly, ignoring whether they are well or poorly observed. This limitation leads to motion drifts under occlusion and degraded synthesis when extrapolating to unseen views. We argue that uncertainty matters: Gaussians with recurring observations across views and time act as reliable anchors to guide motion, whereas those with limited visibility are treated as less reliable. To this end, we introduce USplat4D, a novel Uncertainty-aware dynamic Gaussian Splatting framework that propagates reliable motion cues to enhance 4D reconstruction. Our key insight is to estimate time-varying per-Gaussian uncertainty and leverages it to construct a spatio-temporal graph for uncertainty-aware optimization. Experiments on diverse real and synthetic datasets show that explicitly modeling uncertainty consistently improves dynamic Gaussian Splatting models, yielding more stable geometry under occlusion and high-quality synthesis at extreme viewpoints.

[204] What If : Understanding Motion Through Sparse Interactions

Stefan Andreas Baumann, Nick Stracke, Timy Phan, Björn Ommer

Main category: cs.CV

TL;DR: FPT is a novel framework that predicts multi-modal motion distributions from sparse interactions (pokes), providing interpretable representations of scene dynamics and uncertainties, outperforming specialized baselines on various tasks.

Details

Motivation: Traditional methods only enable dense sampling of single motion realizations, lacking interpretable representations of multi-modal scene dynamics and their dependency on physical interactions.

Method: Flow Poke Transformer (FPT) framework that directly predicts distributions of local motion conditioned on sparse interactions called ‘pokes’, providing interpretable representations of scene motion dependencies and uncertainties.

Result: FPT surpasses specialized baselines on dense face motion generation, achieves significant improvements on articulated object motion estimation when fine-tuned, and shows competitive performance on moving part segmentation from pokes.

Conclusion: FPT provides a flexible and versatile approach for predicting multi-modal motion distributions from sparse interactions, enabling interpretable representations of scene dynamics across various downstream tasks.

Abstract: Understanding the dynamics of a physical scene involves reasoning about the diverse ways it can potentially change, especially as a result of local interactions. We present the Flow Poke Transformer (FPT), a novel framework for directly predicting the distribution of local motion, conditioned on sparse interactions termed “pokes”. Unlike traditional methods that typically only enable dense sampling of a single realization of scene dynamics, FPT provides an interpretable directly accessible representation of multi-modal scene motion, its dependency on physical interactions and the inherent uncertainties of scene dynamics. We also evaluate our model on several downstream tasks to enable comparisons with prior methods and highlight the flexibility of our approach. On dense face motion generation, our generic pre-trained model surpasses specialized baselines. FPT can be fine-tuned in strongly out-of-distribution tasks such as synthetic datasets to enable significant improvements over in-domain methods in articulated object motion estimation. Additionally, predicting explicit motion distributions directly enables our method to achieve competitive performance on tasks like moving part segmentation from pokes which further demonstrates the versatility of our FPT. Code and models are publicly available at https://compvis.github.io/flow-poke-transformer.

[205] MVP4D: Multi-View Portrait Video Diffusion for Animatable 4D Avatars

Felix Taubner, Ruihang Zhang, Mathieu Tuli, Sherwin Bahmani, David B. Lindell

Main category: cs.CV

TL;DR: MVP4D is a video diffusion model that generates animatable multi-view 360-degree videos from a single reference image, enabling real-time 4D avatar creation with improved realism and consistency.

Details

Motivation: Current single-image avatar generation methods lack multi-view constraints and explicit 3D representation, causing quality degradation when viewpoints deviate from the reference image.

Method: Based on a pre-trained video diffusion model, MVP4D generates hundreds of frames simultaneously from viewpoints varying up to 360 degrees around a subject, then distills outputs into real-time renderable 4D avatars.

Result: The approach significantly improves realism, temporal consistency, and 3D consistency compared to previous single-image avatar generation methods.

Conclusion: MVP4D enables high-quality, animatable digital human avatar creation from casual single images while overcoming viewpoint limitations of previous approaches.

Abstract: Digital human avatars aim to simulate the dynamic appearance of humans in virtual environments, enabling immersive experiences across gaming, film, virtual reality, and more. However, the conventional process for creating and animating photorealistic human avatars is expensive and time-consuming, requiring large camera capture rigs and significant manual effort from professional 3D artists. With the advent of capable image and video generation models, recent methods enable automatic rendering of realistic animated avatars from a single casually captured reference image of a target subject. While these techniques significantly lower barriers to avatar creation and offer compelling realism, they lack constraints provided by multi-view information or an explicit 3D representation. So, image quality and realism degrade when rendered from viewpoints that deviate strongly from the reference image. Here, we build a video model that generates animatable multi-view videos of digital humans based on a single reference image and target expressions. Our model, MVP4D, is based on a state-of-the-art pre-trained video diffusion model and generates hundreds of frames simultaneously from viewpoints varying by up to 360 degrees around a target subject. We show how to distill the outputs of this model into a 4D avatar that can be rendered in real-time. Our approach significantly improves the realism, temporal consistency, and 3D consistency of generated avatars compared to previous methods.

[206] Efficient Real-World Deblurring using Single Images: AIM 2025 Challenge Report

Daniel Feijoo, Paula Garrido-Mellado, Marcos V. Conde, Jaesung Rim, Alvaro Garcia, Sunghyun Cho, Radu Timofte

Main category: cs.CV

TL;DR: This paper reviews the AIM 2025 Efficient Real-World Deblurring Challenge, which focused on developing efficient single-image deblurring methods with strict computational constraints (under 5M parameters and 200 GMACs).

Details

Motivation: To advance efficient real-world blur restoration by creating a challenge that pushes the development of practical deblurring solutions with strict efficiency requirements.

Method: The challenge used a new test set based on RSBlur dataset with blur-degraded image pairs captured via double-camera system. Participants developed solutions with computational constraints.

Result: 71 participants registered, 4 teams submitted valid solutions. The top-performing approach achieved 31.1298 dB PSNR, demonstrating the potential of efficient deblurring methods.

Conclusion: The challenge successfully advanced efficient real-world image deblurring research, providing a comprehensive reference for future work in this domain.

Abstract: This paper reviews the AIM 2025 Efficient Real-World Deblurring using Single Images Challenge, which aims to advance in efficient real-blur restoration. The challenge is based on a new test set based on the well known RSBlur dataset. Pairs of blur and degraded images in this dataset are captured using a double-camera system. Participant were tasked with developing solutions to effectively deblur these type of images while fulfilling strict efficiency constraints: fewer than 5 million model parameters and a computational budget under 200 GMACs. A total of 71 participants registered, with 4 teams finally submitting valid solutions. The top-performing approach achieved a PSNR of 31.1298 dB, showcasing the potential of efficient methods in this domain. This paper provides a comprehensive overview of the challenge, compares the proposed solutions, and serves as a valuable reference for researchers in efficient real-world image deblurring.

[207] UniFusion: Vision-Language Model as Unified Encoder in Image Generation

Kevin Li, Manuel Brack, Sudeep Katakol, Hareesh Ravi, Ajinkya Kale

Main category: cs.CV

TL;DR: UniFusion is a diffusion-based generative model that uses a frozen vision-language model as a unified multimodal encoder, featuring Layerwise Attention Pooling for better cross-modal reasoning and VERIFI for flexible inference.

Details

Motivation: Existing architectures use separate encoders for images and text, limiting cross-modal reasoning and knowledge transfer. Unified training approaches require massive computational resources and data, making them inaccessible.

Method: Uses frozen VLM as unified encoder with Layerwise Attention Pooling (LAP) to extract both high-level semantics and low-level details. Introduces VERIFI for conditioning diffusion transformer on VLM-generated text tokens during in-model prompt rewriting.

Result: LAP outperforms shallow fusion architectures on text-image alignment. VERIFI combines VLM reasoning with flexible inference. Finetuning on editing improves generation alignment and shows strong generalization to multiple image references.

Conclusion: UniFusion demonstrates effective cross-modality knowledge transfer and generalization capabilities, validating the unified encoder approach for multimodal generation and editing tasks.

Abstract: Although recent advances in visual generation have been remarkable, most existing architectures still depend on distinct encoders for images and text. This separation constrains diffusion models’ ability to perform cross-modal reasoning and knowledge transfer. Prior attempts to bridge this gap often use the last layer information from VLM, employ multiple visual encoders, or train large unified models jointly for text and image generation, which demands substantial computational resources and large-scale data, limiting its accessibility.We present UniFusion, a diffusion-based generative model conditioned on a frozen large vision-language model (VLM) that serves as a unified multimodal encoder. At the core of UniFusion is the Layerwise Attention Pooling (LAP) mechanism that extracts both high level semantics and low level details from text and visual tokens of a frozen VLM to condition a diffusion generative model. We demonstrate that LAP outperforms other shallow fusion architectures on text-image alignment for generation and faithful transfer of visual information from VLM to the diffusion model which is key for editing. We propose VLM-Enabled Rewriting Injection with Flexibile Inference (VERIFI), which conditions a diffusion transformer (DiT) only on the text tokens generated by the VLM during in-model prompt rewriting. VERIFI combines the alignment of the conditioning distribution with the VLM’s reasoning capabilities for increased capabilities and flexibility at inference. In addition, finetuning on editing task not only improves text-image alignment for generation, indicative of cross-modality knowledge transfer, but also exhibits tremendous generalization capabilities. Our model when trained on single image editing, zero-shot generalizes to multiple image references further motivating the unified encoder design of UniFusion.

[208] ViCO: A Training Strategy towards Semantic Aware Dynamic High-Resolution

Long Cui, Weiyun Wang, Jie Shao, Zichen Wen, Gen Luo, Linfeng Zhang, Yanting Zhang, Yu Qiao, Wenhai Wang

Main category: cs.CV

TL;DR: ViCO is a training algorithm that reduces vision tokens in MLLMs by using multiple MLP connectors with different compression ratios and a Visual Resolution Router to adapt token count based on image semantic complexity.

Details

Motivation: Existing MLLMs suffer from increased inference costs due to excessive vision tokens from image inputs, creating a need for more efficient token utilization.

Method: Uses multiple MLP connectors with different compression ratios to downsample vision tokens based on semantic complexity, and minimizes KL divergence between responses from different connectors during training. At inference, a Visual Resolution Router selects appropriate compression rates per image patch.

Result: Reduces number of vision tokens by up to 50% while maintaining model’s perception, reasoning, and OCR capabilities.

Conclusion: ViCO contributes to more efficient MLLMs by dynamically adapting visual token count based on semantic complexity rather than just resolution.

Abstract: Existing Multimodal Large Language Models (MLLMs) suffer from increased inference costs due to the additional vision tokens introduced by image inputs. In this work, we propose Visual Consistency Learning (ViCO), a novel training algorithm that enables the model to represent images of varying semantic complexities using different numbers of vision tokens. The key idea behind our method is to employ multiple MLP connectors, each with a different image compression ratio, to downsample the vision tokens based on the semantic complexity of the image. During training, we minimize the KL divergence between the responses conditioned on different MLP connectors. At inference time, we introduce an image router, termed Visual Resolution Router (ViR), that automatically selects the appropriate compression rate for each image patch. Compared with existing dynamic high-resolution strategies, which adjust the number of visual tokens based on image resolutions, our method dynamically adapts the number of visual tokens according to semantic complexity. Experimental results demonstrate that our method can reduce the number of vision tokens by up to 50% while maintaining the model’s perception, reasoning, and OCR capabilities. We hope this work will contribute to the development of more efficient MLLMs. The code and models will be released to facilitate future research.

[209] CuMPerLay: Learning Cubical Multiparameter Persistence Vectorizations

Caner Korkmaz, Brighton Nuwagira, Barış Coşkunuzer, Tolga Birdal

Main category: cs.CV

TL;DR: CuMPerLay is a differentiable vectorization layer that integrates Cubical Multiparameter Persistence into deep learning, enabling joint learning of bifiltration functions and robust topological features for image analysis.

Details

Motivation: Cubical Multiparameter Persistence (CMP) is powerful for topological image analysis but hindered by complexity of multifiltration structures and vectorization challenges, limiting its integration into deep learning pipelines.

Method: Introduces a novel algorithm that decomposes CMP into learnable single-parameter persistence components, where bifiltration functions are jointly learned, creating differentiable topological feature vectors compatible with modern architectures like Swin Transformers.

Result: Experiments on medical imaging and computer vision datasets show improved classification and segmentation performance, especially in limited-data scenarios, with theoretical guarantees for stability under generalized Wasserstein metrics.

Conclusion: CuMPerLay provides a promising approach for integrating global structural topological information into deep networks for structured image analysis, overcoming previous limitations of CMP integration.

Abstract: We present CuMPerLay, a novel differentiable vectorization layer that enables the integration of Cubical Multiparameter Persistence (CMP) into deep learning pipelines. While CMP presents a natural and powerful way to topologically work with images, its use is hindered by the complexity of multifiltration structures as well as the vectorization of CMP. In face of these challenges, we introduce a new algorithm for vectorizing MP homologies of cubical complexes. Our CuMPerLay decomposes the CMP into a combination of individual, learnable single-parameter persistence, where the bifiltration functions are jointly learned. Thanks to the differentiability, its robust topological feature vectors can be seamlessly used within state-of-the-art architectures such as Swin Transformers. We establish theoretical guarantees for the stability of our vectorization under generalized Wasserstein metrics. Our experiments on benchmark medical imaging and computer vision datasets show the benefit CuMPerLay on classification and segmentation performance, particularly in limited-data scenarios. Overall, CuMPerLay offers a promising direction for integrating global structural information into deep networks for structured image analysis.

[210] DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving

Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, Lu Hou, Lue Fan, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: DriveVLA-W0 introduces world modeling to address the supervision deficit in Vision-Language-Action models by predicting future images, creating dense self-supervised signals for better driving intelligence.

Details

Motivation: VLA models suffer from supervision deficit where large model capacity is supervised by sparse, low-dimensional actions, leaving representational power underutilized.

Method: Proposes DriveVLA-W0 training paradigm using world modeling to predict future images, with two implementations: autoregressive world model for discrete visual tokens and diffusion world model for continuous visual features, plus a lightweight action expert for real-time deployment.

Result: Extensive experiments on NAVSIM v1/v2 benchmark and large in-house dataset show DriveVLA-W0 significantly outperforms BEV and VLA baselines and amplifies data scaling laws with accelerating performance gains as dataset size increases.

Conclusion: World modeling effectively addresses VLA supervision deficit, enabling better utilization of model capacity and improved driving intelligence through dense self-supervised learning.

Abstract: Scaling Vision-Language-Action (VLA) models on large-scale data offers a promising path to achieving a more generalized driving intelligence. However, VLA models are limited by a ``supervision deficit’’: the vast model capacity is supervised by sparse, low-dimensional actions, leaving much of their representational power underutilized. To remedy this, we propose \textbf{DriveVLA-W0}, a training paradigm that employs world modeling to predict future images. This task generates a dense, self-supervised signal that compels the model to learn the underlying dynamics of the driving environment. We showcase the paradigm’s versatility by instantiating it for two dominant VLA archetypes: an autoregressive world model for VLAs that use discrete visual tokens, and a diffusion world model for those operating on continuous visual features. Building on the rich representations learned from world modeling, we introduce a lightweight action expert to address the inference latency for real-time deployment. Extensive experiments on the NAVSIM v1/v2 benchmark and a 680x larger in-house dataset demonstrate that DriveVLA-W0 significantly outperforms BEV and VLA baselines. Crucially, it amplifies the data scaling law, showing that performance gains accelerate as the training dataset size increases.

[211] Detect Anything via Next Point Prediction

Qing Jiang, Junan Huo, Xingyu Chen, Yuda Xiong, Zhaoyang Zeng, Yihao Chen, Tianhe Ren, Junzhi Yu, Lei Zhang

Main category: cs.CV

TL;DR: Rex-Omni is a 3B-scale MLLM that achieves state-of-the-art object detection performance comparable to regression-based models like DINO and Grounding DINO in zero-shot settings, using quantized coordinate tokens, multi-data engines, and two-stage training with RL post-training.

Details

Motivation: Traditional coordinate regression models dominate object detection, but recent MLLM approaches face challenges like low recall, duplicate predictions, and coordinate misalignment. This work aims to bridge the gap between MLLMs and regression-based models for better object perception.

Method: Three key designs: 1) Task formulation using special tokens for quantized coordinates (0-999), 2) Multiple data engines generating high-quality grounding, referring, and pointing data, 3) Two-stage training with supervised fine-tuning on 22M data followed by GRPO-based RL post-training with geometry-aware rewards.

Result: Achieves state-of-the-art performance on COCO and LVIS benchmarks, comparable to or exceeding regression-based models in zero-shot settings. Enables versatile capabilities including object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing.

Conclusion: Rex-Omni paves the way for more versatile and language-aware visual perception systems, demonstrating that MLLMs can achieve competitive object detection performance while maintaining language understanding capabilities.

Abstract: Object detection has long been dominated by traditional coordinate regression-based models, such as YOLO, DETR, and Grounding DINO. Although recent efforts have attempted to leverage MLLMs to tackle this task, they face challenges like low recall rate, duplicate predictions, coordinate misalignment, etc. In this work, we bridge this gap and propose Rex-Omni, a 3B-scale MLLM that achieves state-of-the-art object perception performance. On benchmarks like COCO and LVIS, Rex-Omni attains performance comparable to or exceeding regression-based models (e.g., DINO, Grounding DINO) in a zero-shot setting. This is enabled by three key designs: 1) Task Formulation: we use special tokens to represent quantized coordinates from 0 to 999, reducing the model’s learning difficulty and improving token efficiency for coordinate prediction; 2) Data Engines: we construct multiple data engines to generate high-quality grounding, referring, and pointing data, providing semantically rich supervision for training; \3) Training Pipelines: we employ a two-stage training process, combining supervised fine-tuning on 22 million data with GRPO-based reinforcement post-training. This RL post-training leverages geometry-aware rewards to effectively bridge the discrete-to-continuous coordinate prediction gap, improve box accuracy, and mitigate undesirable behaviors like duplicate predictions that stem from the teacher-guided nature of the initial SFT stage. Beyond conventional detection, Rex-Omni’s inherent language understanding enables versatile capabilities such as object referring, pointing, visual prompting, GUI grounding, spatial referring, OCR and key-pointing, all systematically evaluated on dedicated benchmarks. We believe that Rex-Omni paves the way for more versatile and language-aware visual perception systems.

[212] DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search

Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan

Main category: cs.CV

TL;DR: DeepMMSearch-R1 is a multimodal LLM that performs on-demand web searches using both image crops and text queries, with iterative query refinement through a two-stage training pipeline.

Details

Motivation: Existing MLLMs struggle with rigid pipelines, excessive searches, and poor query construction when handling real-world information-seeking queries that require external knowledge.

Method: Two-stage training: supervised fine-tuning followed by online reinforcement learning, using a novel multimodal VQA dataset (DeepMMSearchVQA) that teaches when/what/how to search.

Result: Extensive experiments show superiority across knowledge-intensive benchmarks, with effective image search via relevant crops and iterative text query adaptation.

Conclusion: The approach provides valuable insights for advancing multimodal web-search capabilities in MLLMs.

Abstract: Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to the dynamic and ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval augmented generation (RAG) methods, search agents, and search equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold start supervised finetuning phase followed by an online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web-search.

[213] Enhancing Representations through Heterogeneous Self-Supervised Learning

Zhong-Yu Li, Bo-Wen Yin, Yongxiang Liu, Li Liu, Ming-Ming Cheng

Main category: cs.CV

TL;DR: Heterogeneous Self-Supervised Learning (HSSL) enforces a base model to learn from an auxiliary head with different architecture, improving representation quality as architectural discrepancy increases.

Details

Motivation: To better exploit complementarity between heterogeneous architectures in self-supervised learning, as current methods don't fully utilize architectural differences.

Method: Propose HSSL framework with base model learning from heterogeneous auxiliary head, plus search strategy for optimal auxiliary head selection and methods to enlarge model discrepancy.

Result: HSSL achieves superior performance on various downstream tasks including image classification, semantic segmentation, instance segmentation, and object detection.

Conclusion: HSSL effectively leverages architectural heterogeneity in self-supervised learning and is compatible with various self-supervised methods.

Abstract: Incorporating heterogeneous representations from different architectures has facilitated various vision tasks, e.g., some hybrid networks combine transformers and convolutions. However, complementarity between such heterogeneous architectures has not been well exploited in self-supervised learning. Thus, we propose Heterogeneous Self-Supervised Learning (HSSL), which enforces a base model to learn from an auxiliary head whose architecture is heterogeneous from the base model. In this process, HSSL endows the base model with new characteristics in a representation learning way without structural changes. To comprehensively understand the HSSL, we conduct experiments on various heterogeneous pairs containing a base model and an auxiliary head. We discover that the representation quality of the base model moves up as their architecture discrepancy grows. This observation motivates us to propose a search strategy that quickly determines the most suitable auxiliary head for a specific base model to learn and several simple but effective methods to enlarge the model discrepancy. The HSSL is compatible with various self-supervised methods, achieving superior performances on various downstream tasks, including image classification, semantic segmentation, instance segmentation, and object detection. The codes are available at https://github.com/NK-JittorCV/Self-Supervised/.

[214] Constructing a Real-World Benchmark for Early Wildfire Detection with the New PYRONEAR-2025 Dataset

Mateo Lostanlen, Nicolas Isla, Jose Guillen, Renzo Zanca, Felix Veith, Cristian Buc, Valentin Barriere

Main category: cs.CV

TL;DR: PYRONEAR-2025 is a new large-scale wildfire detection dataset with 150,000 manual annotations on 50,000 images/videos covering 640 wildfires from multiple countries, enabling training of both image-based and sequential smoke plume detection models.

Details

Motivation: Early wildfire detection is crucial for rapid response and minimizing wildfire damage, requiring diverse and comprehensive datasets for training effective detection models.

Method: Created PYRONEAR-2025 dataset from: (1) web-scraped wildfire videos from public camera networks, (2) in-house camera network videos, and (3) synthetic/real images. Evaluated using lightweight object detection models suitable for real-life deployment.

Result: Dataset is challenging with F1 score around 70% but more stable than existing datasets. Cross-dataset experiments show it improves overall performance when used with other public datasets. Sequential models trained on video data improve recall while maintaining precision for earlier detections.

Conclusion: PYRONEAR-2025 provides a comprehensive, diverse dataset that advances wildfire detection capabilities, particularly enabling sequential modeling for improved early detection while maintaining dataset stability and cross-dataset compatibility.

Abstract: Early wildfire detection (EWD) is of the utmost importance to enable rapid response efforts, and thus minimize the negative impacts of wildfire spreads. To this end, we present PYRONEAR-2025, a new dataset composed of both images and videos, allowing for the training and evaluation of smoke plume detection models, including sequential models. The data is sourced from: (i) web-scraped videos of wildfires from public networks of cameras for wildfire detection in-the-wild, (ii) videos from our in-house network of cameras, and (iii) a small portion of synthetic and real images. This dataset includes around 150,000 manual annotations on 50,000 images, covering 640 wildfires, PYRONEAR-2025 surpasses existing datasets in size and diversity. It includes data from France, Spain, Chile and the United States. Finally, it is composed of both images and videos, allowing for the training and evaluation of smoke plume detection models, including sequential models. We ran cross-dataset experiments using a lightweight state-of-the-art object detection model, as the ones used in-real-life, and found out the proposed dataset is particularly challenging, with F1 score of around 70%, but more stable than existing datasets. Finally, its use in concordance with other public datasets helps to reach higher results overall. Last but not least, the video part of the dataset can be used to train a lightweight sequential model, improving global recall while maintaining precision for earlier detections. We make both our code and data available online.

[215] Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

Akash Ghosh, Arkadeep Acharya, Sriparna Saha, Vinija Jain, Aman Chadha

Main category: cs.CV

TL;DR: A comprehensive survey paper that classifies Vision-Language Models (VLMs) into three categories based on their multimodal processing capabilities and analyzes their architectures, training data, performance, and limitations.

Details

Motivation: To address the limitation of Large Language Models (LLMs) being primarily text-only by surveying the integration of visual capabilities with LLMs, resulting in Vision-Language Models (VLMs) that can handle more complex multimodal tasks.

Method: Classifies VLMs into three categories: vision-language understanding models, multimodal input to unimodal text output models, and fully multimodal input/output models. Analyzes each model’s architecture, training data, strengths, and limitations.

Result: Provides extensive analysis of VLM architectures, performance on benchmark datasets, and identifies key advancements in the field. The classification framework helps understand different VLM capabilities.

Conclusion: VLMs represent significant progress in multimodal AI, with potential for future research breakthroughs in this dynamic domain. The survey offers a nuanced understanding of the diverse VLM landscape.

Abstract: The advent of Large Language Models (LLMs) has significantly reshaped the trajectory of the AI revolution. Nevertheless, these LLMs exhibit a notable limitation, as they are primarily adept at processing textual information. To address this constraint, researchers have endeavored to integrate visual capabilities with LLMs, resulting in the emergence of Vision-Language Models (VLMs). These advanced models are instrumental in tackling more intricate tasks such as image captioning and visual question answering. In our comprehensive survey paper, we delve into the key advancements within the realm of VLMs. Our classification organizes VLMs into three distinct categories: models dedicated to vision-language understanding, models that process multimodal inputs to generate unimodal (textual) outputs and models that both accept and produce multimodal inputs and outputs.This classification is based on their respective capabilities and functionalities in processing and generating various modalities of data.We meticulously dissect each model, offering an extensive analysis of its foundational architecture, training data sources, as well as its strengths and limitations wherever possible, providing readers with a comprehensive understanding of its essential components. We also analyzed the performance of VLMs in various benchmark datasets. By doing so, we aim to offer a nuanced understanding of the diverse landscape of VLMs. Additionally, we underscore potential avenues for future research in this dynamic domain, anticipating further breakthroughs and advancements.

[216] Funny-Valen-Tine: Planning Solution Distribution Enhances Machine Abstract Reasoning Ability

Ruizhuo Song, Beiming Yuan

Main category: cs.CV

TL;DR: The paper introduces Valen, a probability-highlighting baseline for visual abstract reasoning, and improves it with Funny, a Gaussian-mixture model that directly estimates correct-solution density, showing that explicit distribution planning enhances abstract reasoning.

Details

Motivation: Visual abstract reasoning is core to image processing, and current solvers implicitly treat tasks as distributions where primary samples fit and auxiliaries do not, shaping learning targets by both sets rather than correct solutions alone.

Method: The authors first introduce Tine, an adversarial adapter to nudge Valen toward correct-solution density, but due to instability, replace it with Funny, a fast Gaussian-mixture model that directly estimates correct-solution density without adversarial games, extending the paradigm to SBR for progressive-pattern planning.

Result: Extensive experiments show that explicit distribution planning is the key to stronger, interpretable abstract reasoning, with the approach excelling on both RPM (progression) and Bongard-Logo (clustering) tasks.

Conclusion: Explicit distribution planning enhances machine abstract reasoning ability, providing a more stable and effective approach compared to adversarial methods.

Abstract: Visual abstract reasoning is core to image processing. We present Valen, a unified probability-highlighting baseline that excels on both RPM (progression) and Bongard-Logo (clustering) tasks. Analysing its internals, we find solvers implicitly treat each task as a distribution where primary samples fit and auxiliaries do not; hence the learning target is jointly shaped by both sets, not by correct solutions alone. To close the gap we first introduce Tine, an adversarial adapter that nudges Valen toward correct-solution density, but adversarial training is unstable. We therefore replace it with Funny, a fast Gaussian-mixture model that directly estimates the correct-solution density without adversarial games, and extend the same paradigm to SBR for progressive-pattern planning. Extensive experiments show explicit distribution planning is the key to stronger, interpretable abstract reasoning. Codes are available in: https://github.com/Yuanbeiming/Funny-Valen-Tine-Planning-Solution-Distribution-Enhances-Machine-Abstract-Reasoning-Ability

[217] Tracing Back the Malicious Clients in Poisoning Attacks to Federated Learning

Yuqi Jia, Minghong Fang, Hongbin Liu, Jinghuai Zhang, Neil Zhenqiang Gong

Main category: cs.CV

TL;DR: FLForensics is a poison-forensics method for federated learning that traces back malicious clients after a poisoning attack is detected, complementing existing training-phase defenses.

Details

Motivation: Existing FL defenses have limited effectiveness when client data is highly non-iid or many malicious clients exist, leaving poisoned models undetected during training.

Method: FLForensics performs post-attack forensics to trace malicious clients after identifying a misclassified target input from a poisoned global model.

Result: Theoretical analysis shows FLForensics accurately distinguishes benign and malicious clients, and empirical evaluation on five datasets demonstrates effectiveness against existing and adaptive poisoning attacks.

Conclusion: FLForensics provides a complementary defense mechanism that works when training-phase defenses fail, enabling post-deployment identification of malicious clients in poisoned FL systems.

Abstract: Poisoning attacks compromise the training phase of federated learning (FL) such that the learned global model misclassifies attacker-chosen inputs called target inputs. Existing defenses mainly focus on protecting the training phase of FL such that the learnt global model is poison free. However, these defenses often achieve limited effectiveness when the clients’ local training data is highly non-iid or the number of malicious clients is large, as confirmed in our experiments. In this work, we propose FLForensics, the first poison-forensics method for FL. FLForensics complements existing training-phase defenses. In particular, when training-phase defenses fail and a poisoned global model is deployed, FLForensics aims to trace back the malicious clients that performed the poisoning attack after a misclassified target input is identified. We theoretically show that FLForensics can accurately distinguish between benign and malicious clients under a formal definition of poisoning attack. Moreover, we empirically show the effectiveness of FLForensics at tracing back both existing and adaptive poisoning attacks on five benchmark datasets.

[218] Exploring Facial Biomarkers for Depression through Temporal Analysis of Action Units

Aditya Parikh, Misha Sadeghi, Robert Richer, Lydia Helene Rupp, Lena Schindler-Gmelch, Marie Keinert, Malin Hager, Klara Capito, Farnaz Rahimi, Bernhard Egger, Matthias Berking, Bjoern M. Eskofier

Main category: cs.CV

TL;DR: The paper explores using facial action units and emotions as objective biomarkers for depression diagnosis, finding significant differences in AU intensities between depressed and non-depressed groups.

Details

Motivation: Traditional depression diagnosis relies on subjective assessments, creating a need for objective diagnostic approaches to improve accuracy.

Method: Analyzed facial expressions from video data using feature extraction, mean intensity comparisons of key AUs, time series classification models, PCA, and clustering algorithms to examine emotional expression patterns.

Result: Found significant differences in intensities of AUs associated with sadness and happiness between depressed and non-depressed groups.

Conclusion: Facial analysis shows potential as an objective tool for depression assessment, with facial action units serving as promising biomarkers.

Abstract: Depression is characterized by persistent sadness and loss of interest, significantly impairing daily functioning and now a widespread mental disorder. Traditional diagnostic methods rely on subjective assessments, necessitating objective approaches for accurate diagnosis. Our study investigates the use of facial action units (AUs) and emotions as biomarkers for depression. We analyzed facial expressions from video data of participants classified with or without depression. Our methodology involved detailed feature extraction, mean intensity comparisons of key AUs, and the application of time series classification models. Furthermore, we employed Principal Component Analysis (PCA) and various clustering algorithms to explore the variability in emotional expression patterns. Results indicate significant differences in the intensities of AUs associated with sadness and happiness between the groups, highlighting the potential of facial analysis in depression assessment.

[219] CoVLA: Comprehensive Vision-Language-Action Dataset for Autonomous Driving

Hidehisa Arai, Keita Miwa, Kento Sasaki, Yu Yamaguchi, Kohei Watanabe, Shunsuke Aoki, Issei Yamamoto

Main category: cs.CV

TL;DR: The paper introduces CoVLA, a large-scale vision-language-action dataset for autonomous driving, and demonstrates its effectiveness in training MLLMs for end-to-end path planning.

Details

Motivation: Address the lack of large-scale annotated datasets combining vision, language, and action for autonomous driving, which limits the application of MLLMs to end-to-end path planning.

Method: Created CoVLA dataset using automated data processing and caption generation pipeline from real-world driving videos (80+ hours), pairing driving trajectories with natural language descriptions of environments and maneuvers.

Result: The model trained on CoVLA shows strong proficiency in generating coherent language and action outputs across various driving scenarios.

Conclusion: CoVLA establishes a framework for robust, interpretable autonomous driving systems and demonstrates the potential of VLA models in this field, contributing to safer self-driving vehicles.

Abstract: Autonomous driving, particularly navigating complex and unanticipated scenarios, demands sophisticated reasoning and planning capabilities. While Multi-modal Large Language Models (MLLMs) offer a promising avenue for this, their use has been largely confined to understanding complex environmental contexts or generating high-level driving commands, with few studies extending their application to end-to-end path planning. A major research bottleneck is the lack of large-scale annotated datasets encompassing vision, language, and action. To address this issue, we propose CoVLA (Comprehensive Vision-Language-Action) Dataset, an extensive dataset comprising real-world driving videos spanning more than 80 hours. This dataset leverages a novel, scalable approach based on automated data processing and a caption generation pipeline to generate accurate driving trajectories paired with detailed natural language descriptions of driving environments and maneuvers. This approach utilizes raw in-vehicle sensor data, allowing it to surpass existing datasets in scale and annotation richness. Using CoVLA, we investigate the driving capabilities of MLLMs that can handle vision, language, and action in a variety of driving scenarios. Our results illustrate the strong proficiency of our model in generating coherent language and action outputs, emphasizing the potential of Vision-Language-Action (VLA) models in the field of autonomous driving. This dataset establishes a framework for robust, interpretable, and data-driven autonomous driving systems by providing a comprehensive platform for training and evaluating VLA models, contributing to safer and more reliable self-driving vehicles. The dataset is released for academic purpose.

[220] TreeDiffusion: Hierarchical Generative Clustering for Conditional Diffusion

Jorge da Silva Gonçalves, Laura Manduchi, Moritz Vandenhirtz, Julia E. Vogt

Main category: cs.CV

TL;DR: TreeDiffusion combines VAE-based hierarchical clustering with diffusion models to generate high-quality cluster-specific images, overcoming VAE’s generative limitations while preserving learned latent structure.

Details

Motivation: VAEs can learn meaningful cluster representations but struggle to generate high-quality samples. This paper aims to address the generative limitations of VAE-based clustering approaches by leveraging their learned structure.

Method: Two-step approach: 1) VAE-based clustering learns hierarchical latent representations; 2) Cluster-aware diffusion model generates images conditioned on the learned hierarchical structure.

Result: Conditioning diffusion models on hierarchical cluster representations improves generative performance on real-world datasets compared to other approaches. Enables generation of images that are both representative and specific to each cluster.

Conclusion: TreeDiffusion advances generative clustering by combining VAE’s structural learning with diffusion models’ generative capabilities, enabling better visualization of learned latent structure and high-quality cluster-specific generation.

Abstract: Generative modeling and clustering are conventionally distinct tasks in machine learning. Variational Autoencoders (VAEs) have been widely explored for their ability to integrate both, providing a framework for generative clustering. However, while VAEs can learn meaningful cluster representations in latent space, they often struggle to generate high-quality samples. This paper addresses this problem by introducing TreeDiffusion, a deep generative model that conditions diffusion models on learned latent hierarchical cluster representations from a VAE to obtain high-quality, cluster-specific generations. Our approach consists of two steps: first, a VAE-based clustering model learns a hierarchical latent representation of the data. Second, a cluster-aware diffusion model generates realistic images conditioned on the learned hierarchical structure. We systematically compare the generative capabilities of our approach with those of alternative conditioning strategies. Empirically, we demonstrate that conditioning diffusion models on hierarchical cluster representations improves the generative performance on real-world datasets compared to other approaches. Moreover, a key strength of our method lies in its ability to generate images that are both representative and specific to each cluster, enabling more detailed visualization of the learned latent structure. Our approach addresses the generative limitations of VAE-based clustering approaches by leveraging their learned structure, thereby advancing the field of generative clustering.

[221] Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation

Gautier Evennou, Antoine Chaffin, Vivien Chappelier, Ewa Kijak

Main category: cs.CV

TL;DR: Proposes BLIP2IDC framework to adapt image captioning models for Image Difference Captioning (IDC) and introduces synthetic data augmentation to address data scarcity and fine-grained difference detection challenges in real-world images.

Details

Motivation: Address limitations of current IDC approaches that struggle with real-world images due to training data scarcity and difficulty capturing fine-grained differences between complex images.

Method: Adapts BLIP2 model to IDC task at low computational cost and uses synthetic augmentation to create high-quality training data, resulting in new Syned1 dataset.

Result: BLIP2IDC outperforms two-stream approaches by significant margin on real-world IDC datasets, and synthetic augmentation provides challenging high-quality data.

Conclusion: The proposed framework effectively addresses IDC challenges through model adaptation and data augmentation, enabling better performance on real-world image difference captioning.

Abstract: The rise of the generative models quality during the past years enabled the generation of edited variations of images at an important scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is successfully handled for simple 3D rendered images, it struggles on real-world images. The reason is twofold: the training data-scarcity, and the difficulty to capture fine-grained differences between complex images. To address those issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show it outperforms two-streams approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in an agnostic fashion. We show that our synthetic augmentation strategy provides high quality data, leading to a challenging new dataset well-suited for IDC named Syned1.

[222] Generate, Transduct, Adapt: Iterative Transduction with VLMs

Oindrila Saha, Logan Lawrence, Grant Van Horn, Subhransu Maji

Main category: cs.CV

TL;DR: GTA-CLIP improves zero-shot learning by incorporating language model supervision for joint transduction in vision and language spaces, achieving significant performance gains over CLIP baselines.

Details

Motivation: Current transductive zero-shot learning methods focus on image-image similarities but neglect the structure of language space, limiting their potential.

Method: Iterative three-step approach: (1) incrementally exploring attribute space via language model queries, (2) attribute-augmented transductive inference, (3) fine-tuning vision and language encoders using inferred labels.

Result: Average improvements of 8.6% over CLIP and 3.7% over transductive CLIP across 12 datasets and 3 encoders in zero-shot setting, with similar gains in few-shot setting.

Conclusion: Joint transduction in both vision and language spaces with language model supervision significantly enhances zero-shot learning performance.

Abstract: Transductive zero-shot learning with vision-language models leverages image-image similarities within the dataset to achieve better classification accuracy compared to the inductive setting. However, there is little work that explores the structure of the language space in this context. We propose GTA-CLIP, a novel technique that incorporates supervision from language models for joint transduction in language and vision spaces. Our approach is iterative and consists of three steps: (i) incrementally exploring the attribute space by querying language models, (ii) an attribute-augmented transductive inference procedure, and (iii) fine-tuning the language and vision encoders based on inferred labels within the dataset. Through experiments with CLIP encoders, we demonstrate that GTA-CLIP, yields an average performance improvement of 8.6% and 3.7% across 12 datasets and 3 encoders, over CLIP and transductive CLIP respectively in the zero-shot setting. We also observe similar improvements in a few-shot setting. We present ablation studies that demonstrate the value of each step and visualize how the vision and language spaces evolve over iterations driven by the transductive learning. Code is released at https://github.com/cvl-umass/GTA-CLIP

[223] FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

Main category: cs.CV

TL;DR: FOCUS is a geospatial deep learning framework that predicts PFAS contamination in surface water using hydrological flow data, land cover information, and proximity to known PFAS sources with a noise-aware loss function.

Details

Motivation: PFAS are persistent environmental pollutants with severe health risks, but detecting contamination across large regions is challenging due to high testing costs and difficulty simulating spread.

Method: Geospatial deep learning framework integrating hydrological flow data, land cover information, and proximity to known PFAS sources with a label noise-aware loss function.

Result: The framework shows improved prediction accuracy through extensive ablation studies, robustness analysis, real-world validation, and outperforms baselines like sparse segmentation, Kriging, and pollutant transport simulations.

Conclusion: FOCUS demonstrates potential for scalable PFAS monitoring, with results and expert feedback highlighting its effectiveness for large-scale contamination mapping.

Abstract: Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are unfortunately persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies, robustness analysis, real-world validation, and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results and expert feedback highlight our framework’s potential for scalable PFAS monitoring.

[224] Extremely low-bitrate Image Compression Semantically Disentangled by LMMs from a Human Perception Perspective

Juan Song, Lijie Yang, Mingtao Feng

Main category: cs.CV

TL;DR: SEDIC is a semantically disentangled image compression framework that uses progressive object restoration to achieve high-quality reconstructions at extremely low bitrates (≤0.05 bpp).

Details

Motivation: Addressing the challenge of compressing images at extremely low bitrates while maintaining both semantic consistency and high perceptual quality, inspired by human progressive perception mechanisms.

Method: Uses a learned image encoder for initial compression, extracts semantic components (descriptions, object details, segmentation masks) using LMMs, and employs a training-free Object Restoration model with Attention Guidance (ORAG) based on ControlNet to progressively restore object details.

Result: Significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at extremely low bitrates.

Conclusion: SEDIC provides an effective framework for extremely low-bitrate image compression through progressive semantic restoration, demonstrating strong performance in both quality and semantic fidelity.

Abstract: It remains a significant challenge to compress images at extremely low bitrate while achieving both semantic consistency and high perceptual quality. Inspired by human progressive perception mechanism, we propose a Semantically Disentangled Image Compression framework (SEDIC) in this paper. Initially, an extremely compressed reference image is obtained through a learned image encoder. Then we leverage LMMs to extract essential semantic components, including overall descriptions, object detailed description, and semantic segmentation masks. We propose a training-free Object Restoration model with Attention Guidance (ORAG) built on pre-trained ControlNet to restore object details conditioned by object-level text descriptions and semantic masks. Based on the proposed ORAG, we design a multistage semantic image decoder to progressively restore the details object by object, starting from the extremely compressed reference image, ultimately generating high-quality and high-fidelity reconstructions. Experimental results demonstrate that SEDIC significantly outperforms state-of-the-art approaches, achieving superior perceptual quality and semantic consistency at extremely low-bitrates ($\le$ 0.05 bpp).

[225] SPEED: Scalable, Precise, and Efficient Concept Erasure for Diffusion Models

Ouxiang Li, Yuan Wang, Xinting Hu, Houcheng Jiang, Tao Liang, Yanbin Hao, Guojun Ma, Fuli Feng

Main category: cs.CV

TL;DR: SPEED is an efficient concept erasure method for text-to-image diffusion models that directly edits model parameters using null space optimization to erase target concepts while preserving non-target concepts.

Details

Motivation: Address growing concerns over copyright infringement, offensive content, and privacy violations in text-to-image models, and overcome limitations of existing methods that are either time-consuming for multiple concepts or degrade non-target concept quality.

Method: Directly edits model parameters by searching for a null space where updates don’t affect non-target concepts, using three strategies: Influence-based Prior Filtering (IPF), Directed Prior Augmentation (DPA), and Invariant Equality Constraints (IEC).

Result: Outperforms existing methods in non-target preservation while achieving efficient erasure (100 concepts in 5 seconds), with extensive evaluations showing high-fidelity concept erasure.

Conclusion: SPEED provides a scalable and precise solution for concept erasure in text-to-image models, effectively balancing erasure efficiency with preservation of non-target concepts.

Abstract: Erasing concepts from large-scale text-to-image (T2I) diffusion models has become increasingly crucial due to the growing concerns over copyright infringement, offensive content, and privacy violations. In scalable applications, fine-tuning-based methods are time-consuming to precisely erase multiple target concepts, while real-time editing-based methods often degrade the generation quality of non-target concepts due to conflicting optimization objectives. To address this dilemma, we introduce SPEED, an efficient concept erasure approach that directly edits model parameters. SPEED searches for a null space, a model editing space where parameter updates do not affect non-target concepts, to achieve scalable and precise erasure. To facilitate accurate null space optimization, we incorporate three complementary strategies: Influence-based Prior Filtering (IPF) to selectively retain the most affected non-target concepts, Directed Prior Augmentation (DPA) to enrich the filtered retain set with semantically consistent variations, and Invariant Equality Constraints (IEC) to preserve key invariants during the T2I generation process. Extensive evaluations across multiple concept erasure tasks demonstrate that SPEED consistently outperforms existing methods in non-target preservation while achieving efficient and high-fidelity concept erasure, successfully erasing 100 concepts within only 5 seconds. Our code and models are available at: https://github.com/Ouxiang-Li/SPEED.

[226] UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang

Main category: cs.CV

TL;DR: This paper proposes using multimodal large language models (MLLMs) as unified evaluators for AI-generated videos, introduces UVE-Bench benchmark for evaluation, and shows that advanced MLLMs outperform existing specialized methods but still lag behind human evaluators.

Details

Motivation: Existing methods for evaluating AI-generated videos are constrained to specific evaluation aspects and difficult to scale for finer-grained and more comprehensive evaluations. There's a need for reliable and comprehensive automatic metrics as video generative models rapidly grow.

Method: The work investigates using MLLMs as unified evaluators for AIGVs, leveraging their visual perception and language understanding capabilities. They introduce UVE-Bench benchmark with videos from state-of-the-art VGMs and human preference annotations across 15 evaluation aspects, then extensively evaluate 18 MLLMs.

Result: Advanced MLLMs (Qwen2VL-72B and InternVL2.5-78B) demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods, though they still lag behind human evaluators. The analysis provides insights on key design choices affecting MLLM-driven evaluators.

Conclusion: MLLMs show strong potential as unified evaluators for AI-generated videos, offering a scalable solution that outperforms existing specialized methods, though further improvements are needed to reach human-level evaluation capabilities.

Abstract: With the rapid growth of video generative models (VGMs), it is essential to develop reliable and comprehensive automatic metrics for AI-generated videos (AIGVs). Existing methods either use off-the-shelf models optimized for other tasks or rely on human assessment data to train specialized evaluators. These approaches are constrained to specific evaluation aspects and are difficult to scale with the increasing demands for finer-grained and more comprehensive evaluations. To address this issue, this work investigates the feasibility of using multimodal large language models (MLLMs) as a unified evaluator for AIGVs, leveraging their strong visual perception and language understanding capabilities. To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench. UVE-Bench collects videos generated by state-of-the-art VGMs and provides pairwise human preference annotations across 15 evaluation aspects. Using UVE-Bench, we extensively evaluate 18 MLLMs. Our empirical results suggest that while advanced MLLMs (e.g., Qwen2VL-72B and InternVL2.5-78B) still lag behind human evaluators, they demonstrate promising ability in unified AIGV evaluation, significantly surpassing existing specialized evaluation methods. Additionally, we conduct an in-depth analysis of key design choices that impact the performance of MLLM-driven evaluators, offering valuable insights for future research on AIGV evaluation.

[227] OpenLex3D: A Tiered Evaluation Benchmark for Open-Vocabulary 3D Scene Representations

Christina Kassab, Sacha Morin, Martin Büchner, Matías Mattamala, Kumaraditya Gupta, Abhinav Valada, Liam Paull, Maurice Fallon

Main category: cs.CV

TL;DR: OpenLex3D is a new benchmark for evaluating 3D open-vocabulary scene representations that addresses limitations of existing datasets by introducing linguistic variability through synonyms and nuanced descriptions across multiple 3D scene datasets.

Details

Motivation: Current evaluation of 3D scene understanding models is limited to datasets with closed-set semantics that don't capture the richness of natural language, creating a need for benchmarks that reflect real-world linguistic variability.

Method: Created new label annotations for scenes from Replica, ScanNet++, and HM3D datasets, introducing synonymical object categories and additional nuanced descriptions, providing 13 times more labels per scene than original datasets.

Result: The benchmark enables evaluation of various 3D open-vocabulary methods through open-set 3D semantic segmentation and object retrieval tasks, revealing failure cases and improvement opportunities while providing insights on feature precision, segmentation, and downstream capabilities.

Conclusion: OpenLex3D addresses the gap in evaluating 3D open-vocabulary representations by capturing real-world linguistic variability, and serves as a publicly available benchmark to drive improvements in 3D scene understanding.

Abstract: 3D scene understanding has been transformed by open-vocabulary language models that enable interaction via natural language. However, at present the evaluation of these representations is limited to datasets with closed-set semantics that do not capture the richness of language. This work presents OpenLex3D, a dedicated benchmark for evaluating 3D open-vocabulary scene representations. OpenLex3D provides entirely new label annotations for scenes from Replica, ScanNet++, and HM3D, which capture real-world linguistic variability by introducing synonymical object categories and additional nuanced descriptions. Our label sets provide 13 times more labels per scene than the original datasets. By introducing an open-set 3D semantic segmentation task and an object retrieval task, we evaluate various existing 3D open-vocabulary methods on OpenLex3D, showcasing failure cases, and avenues for improvement. Our experiments provide insights on feature precision, segmentation, and downstream capabilities. The benchmark is publicly available at: https://openlex3d.github.io/.

[228] Mind the (Data) Gap: Evaluating Vision Systems in Small Data Applications

Samuel Stevens, S M Rayeed, Jenna Kline

Main category: cs.CV

TL;DR: The paper compares multi-modal large language models (MLLMs) and vision-only methods in small-data regimes, finding MLLMs plateau early while vision-only methods continue improving with more data.

Details

Motivation: Computer vision research has focused on zero- and few-shot learning while ignoring the practical small-data regime (hundreds to thousands of samples) needed for applications with expensive expert annotations like ecological monitoring and medical diagnostics.

Method: Used the Natural World Tasks (NeWT) benchmark to compare MLLMs and vision-only methods across varying training set sizes in small-data contexts.

Result: MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples.

Conclusion: Advocates for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments, providing the first comprehensive comparison between these approaches in small-data contexts.

Abstract: The practical application of AI tools for specific computer vision tasks relies on the “small-data regime” of hundreds to thousands of labeled samples. This small-data regime is vital for applications requiring expensive expert annotations, such as ecological monitoring, medical diagnostics or industrial quality control. We find, however, that computer vision research has ignored the small data regime as evaluations increasingly focus on zero- and few-shot learning. We use the Natural World Tasks (NeWT) benchmark to compare multi-modal large language models (MLLMs) and vision-only methods across varying training set sizes. MLLMs exhibit early performance plateaus, while vision-only methods improve throughout the small-data regime, with performance gaps widening beyond 10 training examples. We provide the first comprehensive comparison between these approaches in small-data contexts and advocate for explicit small-data evaluations in AI research to better bridge theoretical advances with practical deployments.

[229] DSM: Constructing a Diverse Semantic Map for 3D Visual Grounding

Qinghongbing Xie, Zijian Liang, Fuhao Li, Long Zeng

Main category: cs.CV

TL;DR: The paper introduces Diverse Semantic Map (DSM), a novel scene representation framework that enriches 3D models with VLM-derived semantics for improved visual grounding, achieving state-of-the-art performance on benchmarks and real-world robotics tasks.

Details

Motivation: Existing 3D visual grounding methods are limited - they either focus only on geometric/visual cues or lack multi-dimensional attributes needed for complex reasoning, creating a gap in effective scene representation.

Method: Proposes DSM framework that fuses multi-view observations within temporal sliding windows to create persistent world models enriched with VLM-derived semantics (appearance, physical properties, affordances), and DSM-Grounding paradigm that shifts from free-form VLM queries to structured reasoning over semantic-rich maps.

Result: Achieves state-of-the-art 59.06% overall accuracy on ScanRefer benchmark (10% improvement), 67.93% F-mIoU in semantic segmentation outperforming all baselines, and successful deployment on physical robots for complex navigation and grasping tasks.

Conclusion: The DSM framework bridges the gap in 3D visual grounding by providing comprehensive semantic-rich representations, significantly improving accuracy, interpretability, and practical utility in real-world robotics applications.

Abstract: Effective scene representation is critical for the visual grounding ability of representations, yet existing methods for 3D Visual Grounding are often constrained. They either only focus on geometric and visual cues, or, like traditional 3D scene graphs, lack the multi-dimensional attributes needed for complex reasoning. To bridge this gap, we introduce the Diverse Semantic Map (DSM) framework, a novel scene representation framework that enriches robust geometric models with a spectrum of VLM-derived semantics, including appearance, physical properties, and affordances. The DSM is first constructed online by fusing multi-view observations within a temporal sliding window, creating a persistent and comprehensive world model. Building on this foundation, we propose DSM-Grounding, a new paradigm that shifts grounding from free-form VLM queries to a structured reasoning process over the semantic-rich map, markedly improving accuracy and interpretability. Extensive evaluations validate our approach’s superiority. On the ScanRefer benchmark, DSM-Grounding achieves a state-of-the-art 59.06% overall accuracy of IoU@0.5, surpassing others by 10%. In semantic segmentation, our DSM attains a 67.93% F-mIoU, outperforming all baselines, including privileged ones. Furthermore, successful deployment on physical robots for complex navigation and grasping tasks confirms the framework’s practical utility in real-world scenarios.

[230] SAIP-Net: Enhancing Remote Sensing Image Segmentation via Spectral Adaptive Information Propagation

Zhongtao Wang, Xizhe Cao, Yisong Chen, Guoping Wang

Main category: cs.CV

TL;DR: SAIP-Net is a frequency-aware segmentation framework that uses spectral adaptive information propagation to improve remote sensing image segmentation by addressing intra-class inconsistencies and boundary precision.

Details

Motivation: Conventional hierarchical models struggle with precise spatial boundaries and intra-class consistency in remote sensing imagery, due to limitations in spatial domain feature fusion and insufficient receptive fields.

Method: SAIP-Net employs adaptive frequency filtering and multi-scale receptive field enhancement through Spectral Adaptive Information Propagation to suppress intra-class feature inconsistencies and sharpen boundary lines.

Result: Comprehensive experiments show significant performance improvements over state-of-the-art methods in remote sensing image segmentation.

Conclusion: The combination of spectral-adaptive strategies with expanded receptive fields is highly effective for improving remote sensing image segmentation performance.

Abstract: Semantic segmentation of remote sensing imagery demands precise spatial boundaries and robust intra-class consistency, challenging conventional hierarchical models. To address limitations arising from spatial domain feature fusion and insufficient receptive fields, this paper introduces SAIP-Net, a novel frequency-aware segmentation framework that leverages Spectral Adaptive Information Propagation. SAIP-Net employs adaptive frequency filtering and multi-scale receptive field enhancement to effectively suppress intra-class feature inconsistencies and sharpen boundary lines. Comprehensive experiments demonstrate significant performance improvements over state-of-the-art methods, highlighting the effectiveness of spectral-adaptive strategies combined with expanded receptive fields for remote sensing image segmentation.

[231] Visual Affordance Prediction: Survey and Reproducibility

Tommaso Apicella, Alessio Xompero, Andrea Cavallaro

Main category: cs.CV

TL;DR: The paper proposes a unified formulation for visual affordance prediction to address inconsistent definitions across different tasks, enabling fair comparisons and systematic review of existing methods and datasets.

Details

Motivation: Current visual affordance prediction methods have diverse formulations across tasks like grasping detection and affordance segmentation, leading to inconsistent definitions that prevent fair comparisons between methods.

Method: The authors propose a unified formulation that accounts for complete information on objects of interest and agent-object interactions. They introduce the Affordance Sheet to document solutions, datasets, and validation methods for reproducibility.

Result: The unified formulation allows comprehensive and systematic review of disparate visual affordance works, highlighting strengths and limitations of both methods and datasets. Reproducibility issues are identified including unavailability of implementations and experimental details.

Conclusion: The Affordance Sheet supports future reproducibility and fairness in the visual affordance prediction community by providing transparent documentation of methods, datasets, and validation approaches.

Abstract: Affordances are the potential actions an agent can perform on an object, as observed by a camera. Visual affordance prediction is formulated differently for tasks such as grasping detection, affordance classification, affordance segmentation, and hand pose estimation. This diversity in formulations leads to inconsistent definitions that prevent fair comparisons between methods. In this paper, we propose a unified formulation of visual affordance prediction by accounting for the complete information on the objects of interest and the interaction of the agent with the objects to accomplish a task. This unified formulation allows us to comprehensively and systematically review disparate visual affordance works, highlighting strengths and limitations of both methods and datasets. We also discuss reproducibility issues, such as the unavailability of methods implementation and experimental setups details, making benchmarks for visual affordance prediction unfair and unreliable. To favour transparency, we introduce the Affordance Sheet, a document that details the solution, datasets, and validation of a method, supporting future reproducibility and fairness in the community.

[232] Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS) challenge results

Meritxell Riera-Marin, Sikha O K, Julia Rodriguez-Comas, Matthias Stefan May, Zhaohong Pan, Xiang Zhou, Xiaokun Liang, Franciskus Xaverius Erick, Andrea Prenner, Cedric Hemon, Valentin Boussot, Jean-Louis Dillenseger, Jean-Claude Nunes, Abdul Qayyum, Moona Mazher, Steven A Niederer, Kaisar Kushibar, Carlos Martin-Isla, Petia Radeva, Karim Lekadir, Theodore Barfoot, Luis C. Garcia Peraza Herrera, Ben Glocker, Tom Vercauteren, Lucas Gago, Justin Englemann, Joy-Marie Kleiss, Anton Aubanell, Andreu Antolin, Javier Garcia-Lopez, Miguel A. Gonzalez Ballester, Adrian Galdran

Main category: cs.CV

TL;DR: CURVAS challenge addresses reliability in medical image segmentation by emphasizing multi-annotator ground truth, model calibration, and uncertainty estimation to improve clinical applicability of DL models.

Details

Motivation: To ensure reliability and clinical applicability of DL models for medical image segmentation by addressing annotation variability, calibration issues, and uncertainty estimation through multi-annotator approaches.

Method: Organized CURVAS challenge with 7 teams submitting DL models, evaluated using Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS) with consensus and dissensus ground truth.

Result: Better calibration strongly correlated with segmentation quality; models trained on diverse datasets with pre-trained knowledge showed greater robustness; best models achieved high DSC with well-calibrated uncertainty estimates.

Conclusion: Multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations are essential for developing trustworthy and clinically reliable DL-based medical image segmentation models.

Abstract: Deep learning (DL) has become the dominant approach for medical image segmentation, yet ensuring the reliability and clinical applicability of these models requires addressing key challenges such as annotation variability, calibration, and uncertainty estimation. This is why we created the Calibration and Uncertainty for multiRater Volume Assessment in multiorgan Segmentation (CURVAS), which highlights the critical role of multiple annotators in establishing a more comprehensive ground truth, emphasizing that segmentation is inherently subjective and that leveraging inter-annotator variability is essential for robust model evaluation. Seven teams participated in the challenge, submitting a variety of DL models evaluated using metrics such as Dice Similarity Coefficient (DSC), Expected Calibration Error (ECE), and Continuous Ranked Probability Score (CRPS). By incorporating consensus and dissensus ground truth, we assess how DL models handle uncertainty and whether their confidence estimates align with true segmentation performance. Our findings reinforce the importance of well-calibrated models, as better calibration is strongly correlated with the quality of the results. Furthermore, we demonstrate that segmentation models trained on diverse datasets and enriched with pre-trained knowledge exhibit greater robustness, particularly in cases deviating from standard anatomical structures. Notably, the best-performing models achieved high DSC and well-calibrated uncertainty estimates. This work underscores the need for multi-annotator ground truth, thorough calibration assessments, and uncertainty-aware evaluations to develop trustworthy and clinically reliable DL-based medical image segmentation models.

[233] VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning

Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, Tianfei Zhou

Main category: cs.CV

TL;DR: VideoRFT extends reinforcement fine-tuning to multimodal LLMs for video reasoning, using a two-stage approach with CoT annotations and RL, plus novel datasets and semantic-consistency reward.

Details

Motivation: Video reasoning remains challenging due to complex logic and temporal structures in videos, while existing RFT methods haven't adequately addressed video domains due to lack of large-scale video CoT datasets.

Method: Two-stage RFT: SFT with CoT annotations followed by RL with semantic-consistency reward. Uses multi-expert CoT curation pipeline with cognition-inspired prompting and MLLM revision to create VideoRFT-CoT-102K and VideoRFT-RL-310K datasets.

Result: Achieves state-of-the-art performance on six video reasoning benchmarks.

Conclusion: VideoRFT successfully extends RFT to video reasoning, demonstrating human-like reasoning capabilities through novel dataset creation and semantic-consistency rewards.

Abstract: Reinforcement fine-tuning (RFT) has shown great promise in achieving humanlevel reasoning capabilities of Large Language Models (LLMs), and has recently been extended to MLLMs. Nevertheless, reasoning about videos, which is a fundamental aspect of human intelligence, remains a persistent challenge due to the complex logic, temporal and causal structures inherent in video data. To fill this gap, we propose VideoRFT, a novel approach that extends the RFT paradigm to cultivate human-like video reasoning capabilities in MLLMs. VideoRFT follows the standard two-stage scheme in RFT: supervised fine-tuning (SFT) with chain-of-thought (CoT) annotations, followed by reinforcement learning (RL) to improve generalization. A central challenge to achieve this in the video domain lies in the scarcity of large-scale, high-quality video CoT datasets. We address this by building a multi-expert-driven, cognition-inspired CoT curation pipeline. First, we devise a cognition-inspired prompting strategy to elicit a reasoning LLM to generate preliminary CoTs based solely on rich, structured, and literal representations of video content. Subsequently, these CoTs are revised by a MLLM conditioned on the actual video, ensuring visual consistency and reducing visual hallucinations. This pipeline results in two new datasets, i.e.VideoRFT-CoT-102K for SFT and VideoRFT-RL-310K for RL. To further strengthen the RL phase, we introduce a novel semantic-consistency reward that explicitly promotes the alignment between textual reasoning and visual evidence. This reward encourages the model to produce coherent, context-aware reasoning outputs grounded in visual input. Extensive experiments show that VideoRFT achieves state-of-the-art performance on six video reasoning benchmarks.

[234] GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia, Seongheon Park, Song Gao, Xiangyu Zhao, Sharon Li

Main category: cs.CV

TL;DR: GeoRanker is a distance-aware ranking framework for worldwide image geolocalization that uses vision-language models to encode query-candidate interactions and predict geographic proximity through multi-order distance loss.

Details

Motivation: Current two-stage geolocalization approaches rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates, which limits their performance in diverse global contexts.

Method: Proposes GeoRanker framework with large vision-language models to jointly encode query-candidate interactions, introduces multi-order distance loss for ranking both absolute and relative distances, and creates GeoRanking dataset for geographic ranking tasks.

Result: GeoRanker achieves state-of-the-art results on IM2GPS3K and YFCC4K benchmarks, significantly outperforming current best methods.

Conclusion: The proposed distance-aware ranking framework with structured spatial reasoning effectively addresses the challenges of worldwide image geolocalization and demonstrates superior performance over existing approaches.

Abstract: Worldwide image geolocalization-the task of predicting GPS coordinates from images taken anywhere on Earth-poses a fundamental challenge due to the vast diversity in visual content across regions. While recent approaches adopt a two-stage pipeline of retrieving candidates and selecting the best match, they typically rely on simplistic similarity heuristics and point-wise supervision, failing to model spatial relationships among candidates. In this paper, we propose GeoRanker, a distance-aware ranking framework that leverages large vision-language models to jointly encode query-candidate interactions and predict geographic proximity. In addition, we introduce a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships. To support this, we curate GeoRanking, the first dataset explicitly designed for geographic ranking tasks with multimodal candidate information. GeoRanker achieves state-of-the-art results on two well-established benchmarks (IM2GPS3K and YFCC4K), significantly outperforming current best methods.

[235] Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval

Siting Li, Xiang Gao, Simon Shaolei Du

Main category: cs.CV

TL;DR: The paper introduces COCO-Facet benchmark to evaluate text-to-image retrievers on attribute-focused queries, reveals limitations of CLIP-like and MLLM-based retrievers, and proposes promptable image embeddings with acceleration strategies.

Details

Motivation: Current text-to-image retrievers struggle with attribute-focused queries because their image embeddings focus on global semantics and subjects while missing other details, leading to poor and imbalanced performance.

Method: Proposes using promptable image embeddings enabled by multimodal retrievers to highlight required attributes, with two acceleration strategies: pre-processing promptable embeddings and using linear approximations.

Result: The proposed pipeline generalizes across query types, image pools, and base retriever architectures. Pre-processing yields 15% improvement in Recall@5 when prompts are predefined, while linear approximations achieve 8% improvement when prompts are only available during inference.

Conclusion: Retrieving with general image embeddings is suboptimal for attribute-focused queries, and promptable image embeddings provide an effective solution that can be accelerated for real-world applications.

Abstract: While an image is worth more than a thousand words, only a few provide crucial information for a given task and thus should be focused on. In light of this, ideal text-to-image (T2I) retrievers should prioritize specific visual attributes relevant to queries. To evaluate current retrievers on handling attribute-focused queries, we build COCO-Facet, a COCO-based benchmark with 9,112 queries about diverse attributes of interest. We find that CLIP-like retrievers, which are widely adopted due to their efficiency and zero-shot ability, have poor and imbalanced performance, possibly because their image embeddings focus on global semantics and subjects while leaving out other details. Notably, we reveal that even recent Multimodal Large Language Model (MLLM)-based, stronger retrievers with a larger output dimension struggle with this limitation. Hence, we hypothesize that retrieving with general image embeddings is suboptimal for performing such queries. As a solution, we propose to use promptable image embeddings enabled by these multimodal retrievers, which boost performance by highlighting required attributes. Our pipeline for deriving such embeddings generalizes across query types, image pools, and base retriever architectures. To enhance real-world applicability, we offer two acceleration strategies: Pre-processing promptable embeddings and using linear approximations. We show that the former yields a 15% improvement in Recall@5 when prompts are predefined, while the latter achieves an 8% improvement when prompts are only available during inference.

[236] Image Quality Assessment for Embodied AI

Chunyi Li, Jiaohao Xiao, Jianbo Zhang, Farong Wen, Zicheng Zhang, Yuan Tian, Xiangyang Zhu, Xiaohong Liu, Zhengxue Cheng, Weisi Lin, Guangtao Zhai

Main category: cs.CV

TL;DR: The paper proposes IQA for Embodied AI to assess image usability for robots in embodied tasks, creating a comprehensive database and evaluation framework.

Details

Motivation: Current IQA methods only predict human preferences for distorted images but cannot assess image usability for embodied AI tasks in real-world scenarios with various distortions.

Method: Constructed a perception-cognition-decision-execution pipeline based on Mertonian system and meta-cognitive theory, built Embodied-IQA database with 36k+ image pairs and 5m+ annotations from VLMs/VLAMs/real robots, and validated mainstream IQA methods.

Result: Demonstrated that existing IQA methods are insufficient for embodied AI tasks, showing the need for more accurate quality indicators specifically designed for robotic perception and decision-making.

Conclusion: The proposed Embodied-IQA framework can promote the application of Embodied AI in real-world scenarios with complex distortions through better evaluation methods.

Abstract: Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various distortions in the Real-world limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) based on the Mertonian system and meta-cognitive theory, constructed a perception-cognition-decision-execution pipeline and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5m fine-grained annotations provided by Vision Language Models/Vision Language Action-models/Real-world robots; (3) trained and validated the performance of mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex distortions in the Real-world. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA

[237] Learning Adaptive and Temporally Causal Video Tokenization in a 1D Latent Space

Yan Li, Changyao Tian, Renqiu Xia, Ning Liao, Weiwei Guo, Junchi Yan, Hongsheng Li, Jifeng Dai, Hao Li, Xue Yang

Main category: cs.CV

TL;DR: AdapTok is an adaptive temporal causal video tokenizer that dynamically allocates tokens across video frames based on content, using block-wise masking and causal scoring during training, and integer linear programming for optimal token allocation during inference.

Details

Motivation: To enable more scalable and token-efficient generative video modeling by addressing the limitation of fixed token allocation across frames, allowing content-aware and temporally dynamic token distribution under controllable budgets.

Method: Uses block-wise masking strategy during training to randomly drop tail tokens, block causal scorer to predict reconstruction quality with different token counts, and adaptive token allocation via integer linear programming during inference.

Result: Extensive experiments on UCF-101 and Kinetics-600 show consistent improvements in reconstruction quality and generation performance under different token budgets without requiring additional image data.

Conclusion: AdapTok enables more scalable and token-efficient generative video modeling through adaptive, content-aware token allocation, demonstrating effectiveness across various token budgets and datasets.

Abstract: We propose AdapTok, an adaptive temporal causal video tokenizer that can flexibly allocate tokens for different frames based on video content. AdapTok is equipped with a block-wise masking strategy that randomly drops tail tokens of each block during training, and a block causal scorer to predict the reconstruction quality of video frames using different numbers of tokens. During inference, an adaptive token allocation strategy based on integer linear programming is further proposed to adjust token usage given predicted scores. Such design allows for sample-wise, content-aware, and temporally dynamic token allocation under a controllable overall budget. Extensive experiments for video reconstruction and generation on UCF-101 and Kinetics-600 demonstrate the effectiveness of our approach. Without additional image data, AdapTok consistently improves reconstruction quality and generation performance under different token budgets, allowing for more scalable and token-efficient generative video modeling.

Bingqian Lin, Yunshuang Nie, Khun Loun Zai, Ziming Wei, Mingfei Han, Rongtao Xu, Minzhe Niu, Jianhua Han, Hanwang Zhang, Liang Lin, Bokui Chen, Cewu Lu, Xiaodan Liang

Main category: cs.CV

TL;DR: EvolveNav is a self-improving reasoning paradigm that enhances LLM-based vision-language navigation through formalized CoT supervised fine-tuning and self-reflective post-training, achieving superior performance across multiple benchmarks.

Details

Motivation: Current LLM-based VLN approaches use simple input-output mapping, making learning difficult and decisions unexplainable. CoT training can improve accuracy and interpretability, but perfect CoT labels are unavailable and may cause overfitting.

Method: Two-stage training: (1) Formalized CoT Supervised Fine-Tuning with curated CoT labels to activate reasoning capabilities, (2) Self-Reflective Post-Training using model’s own reasoning outputs as self-enriched CoT labels with a self-reflective auxiliary task that contrasts correct and wrong reasoning patterns.

Result: EvolveNav demonstrates consistent superiority over previous LLM-based VLN approaches on benchmarks including R2R, REVERIE, CVDN, and SOON under both task-specific and cross-task training paradigms.

Conclusion: The proposed self-improving embodied reasoning paradigm enables adaptable and generalizable navigational reasoning, effectively boosting LLM-based vision-language navigation performance.

Abstract: Recent studies have revealed the potential of training open-source Large Language Models (LLMs) to unleash LLMs’ reasoning ability for enhancing vision-language navigation (VLN) performance, and simultaneously mitigate the domain gap between LLMs’ training corpus and the VLN task. However, these approaches predominantly adopt straightforward input-output mapping paradigms, causing the mapping learning difficult and the navigational decisions unexplainable. Chain-of-Thought (CoT) training is a promising way to improve both navigational decision accuracy and interpretability, while the complexity of the navigation task makes the perfect CoT labels unavailable and may lead to overfitting through pure CoT supervised fine-tuning. To address these issues, we propose EvolveNav, a novel sElf-improving embodied reasoning paradigm that realizes adaptable and generalizable navigational reasoning for boosting LLM-based vision-language Navigation. Specifically, EvolveNav involves a two-stage training process: (1) Formalized CoT Supervised Fine-Tuning, where we train the model with curated formalized CoT labels to first activate the model’s navigational reasoning capabilities, and simultaneously increase the reasoning speed; (2) Self-Reflective Post-Training, where the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to enhance the supervision diversity. A self-reflective auxiliary task is also designed to encourage the model to learn correct reasoning patterns by contrasting with wrong ones. Experimental results under both task-specific and cross-task training paradigms demonstrate the consistent superiority of EvolveNav over previous LLM-based VLN approaches on various popular benchmarks, including R2R, REVERIE, CVDN, and SOON. Code is available at https://github.com/expectorlin/EvolveNav.

[239] Normalize Filters! Classical Wisdom for Deep Vision

Gustavo Perez, Stella X. Yu

Main category: cs.CV

TL;DR: The paper proposes filter normalization for deep learning filters to address distortion issues during atmospheric transfer, achieving improved performance and robustness.

Details

Motivation: Classical image filters are carefully normalized to avoid artifacts, but convolutional filters in deep networks lack such constraints, leading to distorted responses and incorrect outcomes during atmospheric transfer.

Method: Proposes filter normalization followed by learnable scaling and shifting (similar to batch normalization) to make filters atmosphere-equivariant and enable co-domain symmetry.

Result: Significant improvements on artificial and natural intensity variation benchmarks; ResNet34 outperformed CLIP by a large margin. Filter normalization regularizes learning, promotes diversity, and improves robustness.

Conclusion: Integrating classical filtering normalization principles into deep learning enables atmosphere-equivariant filters that improve performance, robustness, and generalization across CNNs and vision transformers.

Abstract: Classical image filters, such as those for averaging or differencing, are carefully normalized to ensure consistency, interpretability, and to avoid artifacts like intensity shifts, halos, or ringing. In contrast, convolutional filters learned end-to-end in deep networks lack such constraints. Although they may resemble wavelets and blob/edge detectors, they are not normalized in the same or any way. Consequently, when images undergo atmospheric transfer, their responses become distorted, leading to incorrect outcomes. We address this limitation by proposing filter normalization, followed by learnable scaling and shifting, akin to batch normalization. This simple yet effective modification ensures that the filters are atmosphere-equivariant, enabling co-domain symmetry. By integrating classical filtering principles into deep learning (applicable to both convolutional neural networks and convolution-dependent vision transformers), our method achieves significant improvements on artificial and natural intensity variation benchmarks. Our ResNet34 could even outperform CLIP by a large margin. Our analysis reveals that unnormalized filters degrade performance, whereas filter normalization regularizes learning, promotes diversity, and improves robustness and generalization.

[240] CryoFastAR: Fast Cryo-EM Ab Initio Reconstruction Made Easy

Jiakai Zhang, Shouchen Zhou, Haizhao Dai, Xinhang Liu, Peihao Wang, Zhiwen Fan, Yuan Pei, Jingyi Yu

Main category: cs.CV

TL;DR: CryoFastAR is the first geometric foundation model that directly predicts poses from cryo-EM noisy images for fast ab initio reconstruction, significantly accelerating inference over traditional iterative approaches.

Details

Motivation: Pose estimation in cryo-EM still depends on time-consuming iterative optimization due to challenges like low SNR and CTF distortions, while recent geometric foundation models remain underexplored in scientific imaging fields.

Method: Integrates multi-view features and trains on large-scale simulated cryo-EM data with realistic noise and CTF modulations. Uses progressive training strategy that starts with simpler conditions and gradually increases difficulty.

Result: Achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.

Conclusion: CryoFastAR enables fast pose estimation for cryo-EM reconstruction, bridging the gap between geometric foundation models and scientific imaging applications.

Abstract: Pose estimation from unordered images is fundamental for 3D reconstruction, robotics, and scientific imaging. Recent geometric foundation models, such as DUSt3R, enable end-to-end dense 3D reconstruction but remain underexplored in scientific imaging fields like cryo-electron microscopy (cryo-EM) for near-atomic protein reconstruction. In cryo-EM, pose estimation and 3D reconstruction from unordered particle images still depend on time-consuming iterative optimization, primarily due to challenges such as low signal-to-noise ratios (SNR) and distortions from the contrast transfer function (CTF). We introduce CryoFastAR, the first geometric foundation model that can directly predict poses from Cryo-EM noisy images for Fast ab initio Reconstruction. By integrating multi-view features and training on large-scale simulated cryo-EM data with realistic noise and CTF modulations, CryoFastAR enhances pose estimation accuracy and generalization. To enhance training stability, we propose a progressive training strategy that first allows the model to extract essential features under simpler conditions before gradually increasing difficulty to improve robustness. Experiments show that CryoFastAR achieves comparable quality while significantly accelerating inference over traditional iterative approaches on both synthetic and real datasets.

[241] SPADE: Spatial Transcriptomics and Pathology Alignment Using a Mixture of Data Experts for an Expressive Latent Space

Ekaterina Redekop, Mara Pleasure, Zichen Wang, Kimberly Flores, Anthony Sisk, William Speier, Corey W. Arnold

Main category: cs.CV

TL;DR: SPADE is a foundation model that integrates whole-slide images with spatial transcriptomics data using mixture-of-data experts and contrastive learning, achieving superior few-shot performance on 20 downstream tasks.

Details

Motivation: To address the gap in comprehensive integration of whole-slide images with spatial transcriptomics data, which is crucial for capturing molecular heterogeneity beyond standard H&E staining.

Method: Uses mixture-of-data experts technique with two-stage imaging feature-space clustering via contrastive learning to learn representations of co-registered WSI patches and gene expression profiles within a unified framework.

Result: Demonstrates significantly superior few-shot performance compared to baseline models on 20 downstream tasks, highlighting benefits of integrating morphological and molecular information.

Conclusion: SPADE successfully creates an ST-informed latent space that effectively integrates histopathology with spatial transcriptomics data, providing improved performance across various pathology tasks.

Abstract: The rapid growth of digital pathology and advances in self-supervised deep learning have enabled the development of foundational models for various pathology tasks across diverse diseases. While multimodal approaches integrating diverse data sources have emerged, a critical gap remains in the comprehensive integration of whole-slide images (WSIs) with spatial transcriptomics (ST), which is crucial for capturing critical molecular heterogeneity beyond standard hematoxylin & eosin (H&E) staining. We introduce SPADE, a foundation model that integrates histopathology with ST data to guide image representation learning within a unified framework, in effect creating an ST-informed latent space. SPADE leverages a mixture-of-data experts technique, where experts are created via two-stage imaging feature-space clustering using contrastive learning to learn representations of co-registered WSI patches and gene expression profiles. Pre-trained on the comprehensive HEST-1k dataset, SPADE is evaluated on 20 downstream tasks, demonstrating significantly superior few-shot performance compared to baseline models, highlighting the benefits of integrating morphological and molecular information into one latent space. Code and pretrained weights are available at https://github.com/uclabair/SPADE.

[242] OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding

Jingli Lin, Chenming Zhu, Runsen Xu, Xiaohan Mao, Xihui Liu, Tai Wang, Jiangmiao Pang

Main category: cs.CV

TL;DR: OST-Bench is a new benchmark for evaluating multimodal LLMs on online spatio-temporal understanding, simulating active agent exploration with incremental observations and memory integration.

Details

Motivation: Existing benchmarks evaluate models offline with fixed inputs, but real-world embodied perception requires processing incremental observations and integrating current visual inputs with historical memory for dynamic spatial reasoning.

Method: Built an efficient data collection pipeline using ScanNet, Matterport3D, and ARKitScenes, creating 1.4k scenes and 10k question-answer pairs to test online spatio-temporal reasoning.

Result: Leading MLLMs perform poorly on OST-Bench, with accuracy declining as exploration horizon extends and memory grows. Complex clue-based spatial reasoning and long-term memory retrieval requirements significantly drop performance.

Conclusion: Current MLLMs struggle with online embodied reasoning challenges, particularly complex spatial reasoning and memory integration. OST-Bench highlights core challenges that need addressing for improved embodied AI systems.

Abstract: Recent advances in multimodal large language models (MLLMs) have shown remarkable capabilities in integrating vision and language for complex reasoning. While most existing benchmarks evaluate models under offline settings with a fixed set of pre-recorded inputs, we introduce OST-Bench, a benchmark designed to evaluate Online Spatio-Temporal understanding from the perspective of an agent actively exploring a scene. The Online aspect emphasizes the need to process and reason over incrementally acquired observations, while the Spatio-Temporal component requires integrating current visual inputs with historical memory to support dynamic spatial reasoning. OST-Bench better reflects the challenges of real-world embodied perception. Built on an efficient data collection pipeline, OST-Bench consists of 1.4k scenes and 10k question-answer pairs collected from ScanNet, Matterport3D, and ARKitScenes. We evaluate several leading MLLMs on OST-Bench and observe that they fall short on tasks requiring complex spatio-temporal reasoning. Under the online setting, their accuracy declines as the exploration horizon extends and the memory grows. Through further experimental analysis, we identify common error patterns across models and find that both complex clue-based spatial reasoning demands and long-term memory retrieval requirements significantly drop model performance along two separate axes, highlighting the core challenges that must be addressed to improve online embodied reasoning. To foster further research and development in the field, our codes, dataset, and benchmark are available. Our project page is: https://rbler1234.github.io/OSTBench.github.io/

[243] GTPBD: A Fine-Grained Global Terraced Parcel and Boundary Dataset

Zhiwei Zhang, Zi Ye, Yibin Wen, Shuai Yuan, Haohuan Fu, Jianxi Huang, Juepeng Zheng

Main category: cs.CV

TL;DR: The paper introduces GTPBD, the first global fine-grained terraced parcel dataset with 200,000+ manually annotated parcels, addressing the gap in complex terrain agriculture mapping.

Details

Motivation: Existing agricultural parcel extraction studies focus on mid-resolution mapping or regular plain farmlands, lacking representation of complex terraced terrains needed for precision agriculture.

Method: Created GTPBD dataset with 47,537 high-resolution images covering seven major geographic zones in China and transcontinental climatic regions worldwide, featuring three-level labels: pixel-level boundaries, masks, and parcel labels.

Result: Benchmarked on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods using multi-dimensional evaluation framework. Dataset presents challenges due to terrain diversity, complex irregular parcels, and multiple domain styles.

Conclusion: GTPBD fills a critical gap in terraced remote sensing research, providing infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer.

Abstract: Agricultural parcels serve as basic units for conducting agricultural practices and applications, which is vital for land ownership registration, food security assessment, soil erosion monitoring, etc. However, existing agriculture parcel extraction studies only focus on mid-resolution mapping or regular plain farmlands while lacking representation of complex terraced terrains due to the demands of precision agriculture.In this paper, we introduce a more fine-grained terraced parcel dataset named GTPBD (Global Terraced Parcel and Boundary Dataset), which is the first fine-grained dataset covering major worldwide terraced regions with more than 200,000 complex terraced parcels with manual annotation. GTPBD comprises 47,537 high-resolution images with three-level labels, including pixel-level boundary labels, mask labels, and parcel labels. It covers seven major geographic zones in China and transcontinental climatic regions around the world.Compared to the existing datasets, the GTPBD dataset brings considerable challenges due to the: (1) terrain diversity; (2) complex and irregular parcel objects; and (3) multiple domain styles. Our proposed GTPBD dataset is suitable for four different tasks, including semantic segmentation, edge detection, terraced parcel extraction, and unsupervised domain adaptation (UDA) tasks.Accordingly, we benchmark the GTPBD dataset on eight semantic segmentation methods, four edge extraction methods, three parcel extraction methods, and five UDA methods, along with a multi-dimensional evaluation framework integrating pixel-level and object-level metrics. GTPBD fills a critical gap in terraced remote sensing research, providing a basic infrastructure for fine-grained agricultural terrain analysis and cross-scenario knowledge transfer.

[244] Finding Dori: Memorization in Text-to-Image Diffusion Models Is Not Local

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

Main category: cs.CV

TL;DR: This paper challenges the assumption that memorization in text-to-image diffusion models can be localized to specific weights, showing that current pruning-based defenses are fragile and that memorization is distributed throughout the model.

Details

Motivation: Address concerns about data privacy and intellectual property in text-to-image diffusion models, which can inadvertently memorize and replicate training data despite existing mitigation efforts.

Method: Analyze the fragility of weight-pruning defenses by showing small perturbations can re-trigger data replication, and provide multiple analyses demonstrating memorization is distributed (text embedding space distribution, divergent activations, inconsistent weight identification).

Result: Found that memorization is not localized - replication triggers are distributed, embeddings produce divergent activations, and different pruning methods identify inconsistent weight sets. Showed adversarial fine-tuning provides more robust mitigation.

Conclusion: Memorization in text-to-image diffusion models is inherently distributed rather than localized, challenging current defense assumptions and enabling more robust mitigation approaches through adversarial fine-tuning.

Abstract: Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such defenses. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the nature of memorization in text-to-image DMs and inform the development of more reliable mitigations against DM memorization.

[245] Capturing More: Learning Multi-Domain Representations for Robust Online Handwriting Verification

Peirong Zhang, Kai Ding, Lianwen Jin

Main category: cs.CV

TL;DR: SPECTRUM is a temporal-frequency synergistic model for online handwriting verification that combines temporal and frequency domain features through multi-scale interaction and self-gated fusion, achieving superior performance over temporal-only approaches.

Details

Motivation: To unlock the untapped potential of multi-domain representation learning for online handwriting verification by moving beyond conventional temporal-only approaches and leveraging both temporal and frequency domain features.

Method: Three core components: (1) multi-scale interactor combining temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) self-gated fusion module dynamically integrating global temporal and frequency features, (3) multi-domain distance-based verifier using both temporal and frequency representations.

Result: Extensive experiments demonstrate SPECTRUM’s superior performance over existing OHV methods, validating the effectiveness of temporal-frequency multi-domain learning and showing that incorporating multiple handwritten biometrics enhances discriminative power.

Conclusion: The findings validate multi-domain learning efficacy in OHV and pave the way for future research in multi-domain approaches across both feature and biometric domains.

Abstract: In this paper, we propose SPECTRUM, a temporal-frequency synergistic model that unlocks the untapped potential of multi-domain representation learning for online handwriting verification (OHV). SPECTRUM comprises three core components: (1) a multi-scale interactor that finely combines temporal and frequency features through dual-modal sequence interaction and multi-scale aggregation, (2) a self-gated fusion module that dynamically integrates global temporal and frequency features via self-driven balancing. These two components work synergistically to achieve micro-to-macro spectral-temporal integration. (3) A multi-domain distance-based verifier then utilizes both temporal and frequency representations to improve discrimination between genuine and forged handwriting, surpassing conventional temporal-only approaches. Extensive experiments demonstrate SPECTRUM’s superior performance over existing OHV methods, underscoring the effectiveness of temporal-frequency multi-domain learning. Furthermore, we reveal that incorporating multiple handwritten biometrics fundamentally enhances the discriminative power of handwriting representations and facilitates verification. These findings not only validate the efficacy of multi-domain learning in OHV but also pave the way for future research in multi-domain approaches across both feature and biometric domains. Code is publicly available at https://github.com/NiceRingNode/SPECTRUM.

[246] mmWave Radar-Based Non-Line-of-Sight Pedestrian Localization at T-Junctions Utilizing Road Layout Extraction via Camera

Byeonggyu Park, Hee-Yeun Kim, Byonghyok Choi, Hansang Cho, Byungkwan Kim, Soomok Lee, Mingu Jeon, Seong-Woo Kim

Main category: cs.CV

TL;DR: A framework that uses camera-derived road layout to interpret 2D radar point clouds for localizing pedestrians in non-line-of-sight urban environments.

Details

Motivation: Pedestrian localization in NLoS regions is challenging for autonomous driving. Radar point clouds suffer from multipath distortions, while cameras lack depth perception and cannot directly observe NLoS objects.

Method: Proposes a novel framework that interprets radar point cloud data through road layout inferred from camera images, enabling spatial scene reconstruction for NLoS pedestrian localization.

Result: Validated through experiments using a radar-camera system on a real vehicle, with evaluation on outdoor NLoS driving environment datasets demonstrating practical applicability.

Conclusion: The proposed approach effectively combines radar and camera data to overcome limitations of individual sensors for accurate NLoS pedestrian localization in autonomous driving scenarios.

Abstract: Pedestrians Localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through road layout inferred from camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.

[247] Macro-from-Micro Planning for High-Quality and Parallelized Autoregressive Long Video Generation

Xunzhi Xiang, Yabo Chen, Guiyu Zhang, Zhongyu Wang, Zhe Gao, Quanming Xiang, Gonghu Shang, Junqi Liu, Haibin Huang, Yang Gao, Chi Zhang, Qi Fan, Xuelong Li

Main category: cs.CV

TL;DR: Proposes a planning-then-populating framework with Macro-from-Micro Planning (MMPL) for long video generation, addressing temporal drift and enabling parallelization through hierarchical keyframe planning and parallel content generation.

Details

Motivation: Current autoregressive diffusion models for video generation suffer from temporal drift due to error accumulation and limited parallelization capabilities, restricting them to short temporal durations.

Method: MMPL framework with two hierarchical stages: Micro Planning predicts sparse future keyframes within short segments, while Macro Planning extends keyframe planning across the entire video through autoregressive micro plans. Content Populating then generates all intermediate frames in parallel across segments with Adaptive Workload Scheduling.

Result: The method outperforms existing long video generation models in quality and stability, enabling efficient parallelization and overcoming temporal drift limitations.

Conclusion: The proposed MMPL framework successfully addresses temporal drift and parallelization challenges in long video generation, achieving superior performance compared to existing methods.

Abstract: Current autoregressive diffusion models excel at video generation but are generally limited to short temporal durations. Our theoretical analysis indicates that the autoregressive modeling typically suffers from temporal drift caused by error accumulation and hinders parallelization in long video synthesis. To address these limitations, we propose a novel planning-then-populating framework centered on Macro-from-Micro Planning (MMPL) for long video generation. MMPL sketches a global storyline for the entire video through two hierarchical stages: Micro Planning and Macro Planning. Specifically, Micro Planning predicts a sparse set of future keyframes within each short video segment, offering motion and appearance priors to guide high-quality video segment generation. Macro Planning extends the in-segment keyframes planning across the entire video through an autoregressive chain of micro plans, ensuring long-term consistency across video segments. Subsequently, MMPL-based Content Populating generates all intermediate frames in parallel across segments, enabling efficient parallelization of autoregressive generation. The parallelization is further optimized by Adaptive Workload Scheduling for balanced GPU execution and accelerated autoregressive video generation. Extensive experiments confirm that our method outperforms existing long video generation models in quality and stability. Generated videos and comparison results are in our project page.

[248] Levarging Learning Bias for Noisy Anomaly Detection

Yuxin Zhang, Yunkang Cao, Yuqi Cheng, Yihan Sun, Weiming Shen

Main category: cs.CV

TL;DR: A two-stage framework for fully unsupervised image anomaly detection that exploits learning bias to handle contaminated training data where anomalies may be present but unlabeled.

Details

Motivation: Real-world training data often contains unlabeled anomalies, causing conventional methods that assume anomaly-free data to degrade in performance by learning anomalies as normal patterns.

Method: Two-stage framework: Stage 1 partitions training data, trains sub-models, aggregates cross-model anomaly scores to filter a purified dataset; Stage 2 trains final detector on purified data. Leverages learning bias from statistical dominance of normal samples and feature-space divergence.

Result: Superior anomaly detection and localization performance on Real-IAD benchmark under different noise conditions. Ablation studies validate contamination resilience and importance of learning bias exploitation.

Conclusion: The model-agnostic framework provides a practical solution for real-world scenarios with imperfect training data by systematically exploiting learning bias to handle data contamination.

Abstract: This paper addresses the challenge of fully unsupervised image anomaly detection (FUIAD), where training data may contain unlabeled anomalies. Conventional methods assume anomaly-free training data, but real-world contamination leads models to absorb anomalies as normal, degrading detection performance. To mitigate this, we propose a two-stage framework that systematically exploits inherent learning bias in models. The learning bias stems from: (1) the statistical dominance of normal samples, driving models to prioritize learning stable normal patterns over sparse anomalies, and (2) feature-space divergence, where normal data exhibit high intra-class consistency while anomalies display high diversity, leading to unstable model responses. Leveraging the learning bias, stage 1 partitions the training set into subsets, trains sub-models, and aggregates cross-model anomaly scores to filter a purified dataset. Stage 2 trains the final detector on this dataset. Experiments on the Real-IAD benchmark demonstrate superior anomaly detection and localization performance under different noise conditions. Ablation studies further validate the framework’s contamination resilience, emphasizing the critical role of learning bias exploitation. The model-agnostic design ensures compatibility with diverse unsupervised backbones, offering a practical solution for real-world scenarios with imperfect training data. Code is available at https://github.com/hustzhangyuxin/LLBNAD.

[249] Boosting Generic Semi-Supervised Medical Image Segmentation via Diverse Teaching and Label Propagation

Wei Li, Pengcheng Zhou, Linye Ma, Wenyi Zhao, Huihua Yang

Main category: cs.CV

TL;DR: DTLP-Net is a generic framework for semi-supervised medical image segmentation that addresses limited annotation and domain shift through diverse teacher models and label propagation.

Details

Motivation: To develop a unified solution for semi-supervised medical image segmentation (SSMIS), semi-supervised medical domain generalization (Semi-MDG), and unsupervised medical domain adaptation (UMDA) that overcomes error accumulation and suboptimal performance in conventional methods.

Method: Uses a Diverse Teaching and Label Propagation Network with one student model and two diverse teacher models, inter-sample and intra-sample data augmentation, and label propagation for voxel-level correlations.

Result: Achieves notable improvements over state-of-the-art methods across five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks.

Conclusion: The framework demonstrates potential for tackling challenging semi-supervised learning scenarios in medical image segmentation.

Abstract: Both limited annotation and domain shift are significant challenges frequently encountered in medical image segmentation, leading to derivative scenarios like semi-supervised medical (SSMIS), semi-supervised medical domain generalization (Semi-MDG) and unsupervised medical domain adaptation (UMDA). Conventional methods are generally tailored to specific tasks in isolation, the error accumulation hinders the effective utilization of unlabeled data and limits further improvements, resulting in suboptimal performance when these issues occur. In this paper, we aim to develop a generic framework that masters all three tasks. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data and increasing the diversity of the model. To tackle this issue, we employ a Diverse Teaching and Label Propagation Network (DTLP-Net) to boosting the Generic Semi-Supervised Medical Image Segmentation. Our DTLP-Net involves a single student model and two diverse teacher models, which can generate reliable pseudo-labels for the student model. The first teacher model decouple the training process with labeled and unlabeled data, The second teacher is momentum-updated periodically, thus generating reliable yet divers pseudo-labels. To fully utilize the information within the data, we adopt inter-sample and intra-sample data augmentation to learn the global and local knowledge. In addition, to further capture the voxel-level correlations, we propose label propagation to enhance the model robust. We evaluate our proposed framework on five benchmark datasets for SSMIS, UMDA, and Semi-MDG tasks. The results showcase notable improvements compared to state-of-the-art methods across all five settings, indicating the potential of our framework to tackle more challenging SSL scenarios.

[250] KonfAI: A Modular and Fully Configurable Framework for Deep Learning in Medical Imaging

Valentin Boussot, Jean-Louis Dillenseger

Main category: cs.CV

TL;DR: KonfAI is a configurable deep learning framework for medical imaging that uses YAML files to define workflows, enabling reproducibility and advanced strategies without code changes.

Details

Motivation: To create a modular framework that enhances reproducibility and reduces development time in medical imaging by allowing workflow configuration through declarative files rather than code modifications.

Method: Uses structured YAML configuration files to define training, inference, and evaluation workflows, with native support for patch-based learning, test-time augmentation, model ensembling, and multi-model training setups.

Result: Successfully applied to segmentation, registration, and image synthesis tasks, achieving top-ranking results in international medical imaging challenges.

Conclusion: KonfAI provides an effective open-source solution for medical imaging that improves experimental traceability and workflow efficiency through its declarative configuration approach.

Abstract: KonfAI is a modular, extensible, and fully configurable deep learning framework specifically designed for medical imaging tasks. It enables users to define complete training, inference, and evaluation workflows through structured YAML configuration files, without modifying the underlying code. This declarative approach enhances reproducibility, transparency, and experimental traceability while reducing development time. Beyond the capabilities of standard pipelines, KonfAI provides native abstractions for advanced strategies including patch-based learning, test-time augmentation, model ensembling, and direct access to intermediate feature representations for deep supervision. It also supports complex multi-model training setups such as generative adversarial architectures. Thanks to its modular and extensible architecture, KonfAI can easily accommodate custom models, loss functions, and data processing components. The framework has been successfully applied to segmentation, registration, and image synthesis tasks, and has contributed to top-ranking results in several international medical imaging challenges. KonfAI is open source and available at https://github.com/vboussot/KonfAI.

[251] Contrast Sensitivity in Multimodal Large Language Models: A Psychophysics-Inspired Evaluation

Pablo Hernández-Cámara, Alexandra Gomez-Villa, Jose Manuel Jaén-Lorites, Jorge Vila-Tomás, Valero Laparra, Jesus Malo

Main category: cs.CV

TL;DR: This paper introduces a method to estimate Contrast Sensitivity Functions (CSFs) in Multimodal Large Language Models (MLLMs) using psychophysical testing, revealing that while some models resemble human CSFs in shape or scale, none fully capture both aspects.

Details

Motivation: To systematically understand how MLLMs process low-level visual features and evaluate their perceptual abilities, which has not been previously characterized.

Method: Using human psychophysics-inspired behavioral testing with structured prompts and noise-based stimuli filtered at specific spatial frequencies, deriving psychometric functions from binary verbal responses without relying on internal activations or classifier proxies.

Result: Models show CSF patterns that resemble human CSFs in either shape or scale but not both; CSF estimates are highly sensitive to prompt phrasing; CSFs predict model performance under frequency-filtered and adversarial conditions.

Conclusion: There are systematic differences in frequency tuning across MLLMs, and CSF estimation serves as a scalable diagnostic tool for evaluating multimodal perception capabilities.

Abstract: Understanding how Multimodal Large Language Models (MLLMs) process low-level visual features is critical for evaluating their perceptual abilities and has not been systematically characterized. Inspired by human psychophysics, we introduce a behavioural method for estimating the Contrast Sensitivity Function (CSF) in MLLMs by treating them as end-to-end observers. Models are queried with structured prompts while viewing noise-based stimuli filtered at specific spatial frequencies. Psychometric functions are derived from the binary verbal responses, and contrast thresholds (and CSFs) are obtained without relying on internal activations or classifier-based proxies. Our results reveal that some models resemble human CSFs in shape or scale, but none capture both. We also find that CSF estimates are highly sensitive to prompt phrasing, indicating limited linguistic robustness. Finally, we show that CSFs predict model performance under frequency-filtered and adversarial conditions. These findings highlight systematic differences in frequency tuning across MLLMs and establish CSF estimation as a scalable diagnostic tool for multimodal perception.

[252] STRIDE-QA: Visual Question Answering Dataset for Spatiotemporal Reasoning in Urban Driving Scenes

Keishi Ishihara, Kento Sasaki, Tsubasa Takahashi, Daiki Shiono, Yu Yamaguchi

Main category: cs.CV

TL;DR: STRIDE-QA is a large-scale VQA dataset for spatiotemporal reasoning in autonomous driving, created from 100 hours of Tokyo driving data with 16M QA pairs across 285K frames. It enables both object-centric and ego-centric reasoning through novel tasks requiring spatial localization and temporal prediction.

Details

Motivation: Current VLMs are trained on static web images, limiting their ability for precise spatiotemporal reasoning needed to understand dynamic traffic scenes in autonomous driving.

Method: Created STRIDE-QA dataset from 100 hours of multi-sensor driving data in Tokyo, with dense automatically generated annotations (3D bounding boxes, segmentation masks, multi-object tracks) and three novel QA tasks for spatial localization and temporal prediction.

Result: Existing VLMs struggle significantly (near-zero scores on prediction consistency), while VLMs fine-tuned on STRIDE-QA achieve 55% success in spatial localization and 28% consistency in future motion prediction.

Conclusion: STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems by addressing the spatiotemporal reasoning gap in current models.

Abstract: Vision-Language Models (VLMs) have been applied to autonomous driving to support decision-making in complex real-world scenarios. However, their training on static, web-sourced image-text pairs fundamentally limits the precise spatiotemporal reasoning required to understand and predict dynamic traffic scenes. We address this critical gap with STRIDE-QA, a large-scale visual question answering (VQA) dataset for physically grounded reasoning from an ego-centric perspective. Constructed from 100 hours of multi-sensor driving data in Tokyo, capturing diverse and challenging conditions, STRIDE-QA is the largest VQA dataset for spatiotemporal reasoning in urban driving, offering 16 million QA pairs over 285K frames. Grounded by dense, automatically generated annotations including 3D bounding boxes, segmentation masks, and multi-object tracks, the dataset uniquely supports both object-centric and ego-centric reasoning through three novel QA tasks that require spatial localization and temporal prediction. Our benchmarks demonstrate that existing VLMs struggle significantly, achieving near-zero scores on prediction consistency. In contrast, VLMs fine-tuned on STRIDE-QA exhibit dramatic performance gains, achieving 55% success in spatial localization and 28% consistency in future motion prediction, compared to near-zero scores from general-purpose VLMs. Therefore, STRIDE-QA establishes a comprehensive foundation for developing more reliable VLMs for safety-critical autonomous systems.

[253] NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Nikita Lyubaykin, Andrei Polubarov, Alexander Derevyagin, Vladislav Kurenkov

Main category: cs.CV

TL;DR: NinA replaces diffusion-based action decoders in Vision-Language-Action models with Normalizing Flows, enabling one-shot sampling and faster inference while maintaining performance.

Details

Motivation: Diffusion models in VLA architectures require multiple iterative denoising steps at inference, limiting practicality for real-world high-frequency control applications.

Method: Replace diffusion action decoder with Normalizing Flow (NF) that enables one-shot sampling through invertible transformation. Integrated into FLOWER VLA architecture and fine-tuned on LIBERO benchmark.

Result: NinA matches performance of diffusion-based counterpart under same training regime while achieving substantially faster inference times.

Conclusion: NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.

Abstract: Recent advances in Vision-Language-Action (VLA) models have established a two-component architecture, where a pre-trained Vision-Language Model (VLM) encodes visual observations and task descriptions, and an action decoder maps these representations to continuous actions. Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. However, they require multiple iterative denoising steps at inference time or downstream techniques to speed up sampling, limiting their practicality in real-world settings where high-frequency control is crucial. In this work, we present NinA (Normalizing Flows in Action), a fast and expressive alternative to diffusion-based decoders for VLAs. NinA replaces the diffusion action decoder with a Normalizing Flow (NF) that enables one-shot sampling through an invertible transformation, significantly reducing inference time. We integrate NinA into the FLOWER VLA architecture and fine-tune on the LIBERO benchmark. Our experiments show that NinA matches the performance of its diffusion-based counterpart under the same training regime, while achieving substantially faster inference. These results suggest that NinA offers a promising path toward efficient, high-frequency VLA control without compromising performance.

[254] UrbanTwin: Building High-Fidelity Digital Twins for Sim2Real LiDAR Perception and Evaluation

Muhammad Shahbaz, Shaurya Agarwal

Main category: cs.CV

TL;DR: This tutorial presents a workflow for creating high-fidelity digital twins to generate realistic synthetic LiDAR datasets for Sim2Real learning in intelligent transportation systems, addressing the high cost of real data collection.

Details

Motivation: Creating large-scale labeled LiDAR datasets for ITS perception is expensive, time-consuming, and labor-intensive, limiting system scalability. Sim2Real learning offers a scalable alternative but requires high simulation fidelity.

Method: A reproducible workflow for building high-fidelity digital twins using open-source resources like satellite imagery, OpenStreetMap, and sensor specifications to model static geometry, road infrastructure, and dynamic traffic.

Result: Three synthetic LiDAR datasets (UT-LUMPI, UT-V2X-Real, UT-TUMTraf-I) were released that closely replicate real locations and outperform real-data-trained baselines in perception tasks.

Conclusion: The workflow enables broader adoption of high-fidelity digital twins for scalable and cost-effective data generation in ITS research and deployment.

Abstract: LiDAR-based perception in intelligent transportation systems (ITS) relies on deep neural networks trained with large-scale labeled datasets. However, creating such datasets is expensive, time-consuming, and labor-intensive, limiting the scalability of perception systems. Sim2Real learning offers a scalable alternative, but its success depends on the simulation’s fidelity to real-world environments, dynamics, and sensors. This tutorial introduces a reproducible workflow for building high-fidelity digital twins (HiFi DTs) to generate realistic synthetic datasets. We outline practical steps for modeling static geometry, road infrastructure, and dynamic traffic using open-source resources such as satellite imagery, OpenStreetMap, and sensor specifications. The resulting environments support scalable and cost-effective data generation for robust Sim2Real learning. Using this workflow, we have released three synthetic LiDAR datasets, namely UT-LUMPI, UT-V2X-Real, and UT-TUMTraf-I, which closely replicate real locations and outperform real-data-trained baselines in perception tasks. This guide enables broader adoption of HiFi DTs in ITS research and deployment.

[255] In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting

Taiying Peng, Jiacheng Hua, Miao Liu, Feng Lu

Main category: cs.CV

TL;DR: EgoGazeVQA is a new benchmark for egocentric gaze-guided video question answering that uses gaze information to improve understanding of daily-life videos, showing existing MLLMs struggle with user intent interpretation.

Details

Motivation: Existing benchmarks overlook gaze as an indicator of user intent in egocentric videos, which could enable more proactive and personalized AI experiences.

Method: Created EgoGazeVQA benchmark with gaze-based QA pairs generated by MLLMs and refined by human annotators, plus gaze-guided intent prompting methods integrating spatial, temporal, and intent cues.

Result: Existing MLLMs struggle to interpret user intentions accurately, but gaze-guided intent prompting significantly enhances performance. Gaze-related fine-tuning and gaze estimation accuracy impact prompting effectiveness.

Conclusion: Gaze information is valuable for creating more personalized and effective AI assistants in egocentric settings.

Abstract: The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants’ ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in an unified coordinate, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings. Project page: https://taiyi98.github.io/projects/EgoGazeVQA

[256] InternScenes: A Large-scale Simulatable Indoor Scene Dataset with Realistic Layouts

Weipeng Zhong, Peizhou Cao, Yichen Jin, Li Luo, Wenzhe Cai, Jingli Lin, Hanqing Wang, Zhaoyang Lyu, Tai Wang, Bo Dai, Xudong Xu, Jiangmiao Pang

Main category: cs.CV

TL;DR: InternScenes is a large-scale simulatable indoor scene dataset with 40,000 diverse scenes, 1.96M 3D objects, covering 15 scene types and 288 object classes, featuring realistic layouts with many small items.

Details

Motivation: Existing datasets have limitations in scale, diversity, sanitized layouts lacking small items, and object collisions, hindering Embodied AI advancement.

Method: Integrates three scene sources (real-world scans, procedurally generated scenes, designer-created scenes) with comprehensive data processing pipeline including real-to-sim replicas, interactive object incorporation, and collision resolution via physical simulations.

Result: Created dataset with average 41.5 objects per region, demonstrated value through benchmark applications in scene layout generation and point-goal navigation, showing new challenges from complex layouts.

Conclusion: InternScenes enables scaling up model training for generation and navigation in complex scenes, paving the way for Embodied AI advancement, with commitment to open-source data, models, and benchmarks.

Abstract: The advancement of Embodied AI heavily relies on large-scale, simulatable 3D scene datasets characterized by scene diversity and realistic layouts. However, existing datasets typically suffer from limitations in data scale or diversity, sanitized layouts lacking small items, and severe object collisions. To address these shortcomings, we introduce \textbf{InternScenes}, a novel large-scale simulatable indoor scene dataset comprising approximately 40,000 diverse scenes by integrating three disparate scene sources, real-world scans, procedurally generated scenes, and designer-created scenes, including 1.96M 3D objects and covering 15 common scene types and 288 object classes. We particularly preserve massive small items in the scenes, resulting in realistic and complex layouts with an average of 41.5 objects per region. Our comprehensive data processing pipeline ensures simulatability by creating real-to-sim replicas for real-world scans, enhances interactivity by incorporating interactive objects into these scenes, and resolves object collisions by physical simulations. We demonstrate the value of InternScenes with two benchmark applications: scene layout generation and point-goal navigation. Both show the new challenges posed by the complex and realistic layouts. More importantly, InternScenes paves the way for scaling up the model training for both tasks, making the generation and navigation in such complex scenes possible. We commit to open-sourcing the data, models, and benchmarks to benefit the whole community.

[257] StegOT: Trade-offs in Steganography via Optimal Transport

Chengde Lin, Xuezhu Gong, Shuxue Ding, Mingzhe Yang, Xijun Lu, Chengjun Mo

Main category: cs.CV

TL;DR: StegOT is an autoencoder-based steganography model that uses optimal transport theory to address mode collapse issues in image hiding, achieving better balance between cover and secret image information.

Details

Motivation: Existing steganography models based on GANs and VAEs suffer from mode collapse, which causes information imbalance between cover and secret images in stego images and affects extraction quality.

Method: Proposed StegOT model with multiple channel optimal transport (MCOT) module that transforms multi-peak feature distributions into single-peak distributions to achieve information trade-off between cover and secret images.

Result: Experiments show the model achieves trade-off between cover and secret images while enhancing quality of both stego and recovery images.

Conclusion: StegOT effectively addresses mode collapse in steganography through optimal transport theory, improving information balance and image quality in both hiding and extraction processes.

Abstract: Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on genera-tive adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.

[258] Uncertainty-Supervised Interpretable and Robust Evidential Segmentation

Yuzhu Li, An Sui, Fuping Wu, Xiahai Zhuang

Main category: cs.CV

TL;DR: A self-supervised approach for uncertainty estimation in medical image segmentation that uses principles about uncertainty-image gradient relationships to improve interpretability and robustness.

Details

Motivation: Previous uncertainty estimation methods lack effective supervision, leading to low interpretability and robustness in predictions for medical image segmentation.

Method: Proposed self-supervised approach with three principles about uncertainty-image gradient relationships around boundaries and noise, and designed two uncertainty supervision losses to enhance alignment with human interpretation.

Result: Achieves competitive segmentation performance and superior results in out-of-distribution scenarios while significantly improving interpretability and robustness of uncertainty estimation compared to state-of-the-art approaches.

Conclusion: The proposed self-supervised uncertainty estimation method effectively improves interpretability and robustness in medical image segmentation, with code publicly available.

Abstract: Uncertainty estimation has been widely studied in medical image segmentation as a tool to provide reliability, particularly in deep learning approaches. However, previous methods generally lack effective supervision in uncertainty estimation, leading to low interpretability and robustness of the predictions. In this work, we propose a self-supervised approach to guide the learning of uncertainty. Specifically, we introduce three principles about the relationships between the uncertainty and the image gradients around boundaries and noise. Based on these principles, two uncertainty supervision losses are designed. These losses enhance the alignment between model predictions and human interpretation. Accordingly, we introduce novel quantitative metrics for evaluating the interpretability and robustness of uncertainty. Experimental results demonstrate that compared to state-of-the-art approaches, the proposed method can achieve competitive segmentation performance and superior results in out-of-distribution (OOD) scenarios while significantly improving the interpretability and robustness of uncertainty estimation. Code is available via https://github.com/suiannaius/SURE.

[259] Prompt-guided Representation Disentanglement for Action Recognition

Tianci Wu, Guangming Zhu, Jiang Lu, Siyuan Wang, Ning Wang, Nuoye Xiong, Zhang Liang

Main category: cs.CV

TL;DR: ProDA is a novel framework for action recognition that disentangles specified actions from multi-action scenes using spatio-temporal scene graphs and dynamic prompts to generate action-specific representations.

Details

Motivation: Existing methods extract unified features for all actions in a video, making it challenging to model interactions between different objects in multi-action scenarios.

Method: Uses Spatio-temporal Scene Graphs (SSGs) with Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations, featuring dynamic weight aggregation.

Result: Experiments demonstrate effectiveness in video action recognition compared to state-of-the-art methods.

Conclusion: ProDA provides an effective solution for disentangling specified actions from complex multi-action scenes, improving action recognition performance.

Abstract: Action recognition is a fundamental task in video understanding. Existing methods typically extract unified features to process all actions in one video, which makes it challenging to model the interactions between different objects in multi-action scenarios. To alleviate this issue, we explore disentangling any specified actions from complex scenes as an effective solution. In this paper, we propose Prompt-guided Disentangled Representation for Action Recognition (ProDA), a novel framework that disentangles any specified actions from a multi-action scene. ProDA leverages Spatio-temporal Scene Graphs (SSGs) and introduces Dynamic Prompt Module (DPM) to guide a Graph Parsing Neural Network (GPNN) in generating action-specific representations. Furthermore, we design a video-adapted GPNN that aggregates information using dynamic weights. Experiments in video action recognition demonstrate the effectiveness of our approach when compared with the state-of-the-art methods. Our code can be found in https://github.com/iamsnaping/ProDA.git

[260] LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen

Main category: cs.CV

TL;DR: LongLive is an autoregressive framework for real-time long video generation that addresses efficiency and quality challenges through causal attention, KV-recache for prompt switching, streaming long tuning, and frame sink attention.

Details

Motivation: To overcome the limitations of existing methods: diffusion models are inefficient due to bidirectional attention, while causal AR models degrade in quality on long videos and lack interactive capabilities for real-time prompt streaming.

Method: Uses frame-level autoregressive design with KV-recache mechanism for prompt transitions, streaming long tuning for long video training, and short window attention with frame sink to maintain consistency while enabling faster generation.

Result: Achieves 20.7 FPS on single H100 GPU, supports up to 240-second videos, fine-tunes 1.3B model in 32 GPU-days, and maintains strong VBench performance on both short and long videos with minimal quality loss in INT8 quantization.

Conclusion: LongLive successfully enables efficient, high-quality long video generation with real-time interactive capabilities through its innovative causal AR architecture and optimization techniques.

Abstract: We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities, such as streaming prompt inputs, are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence during prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism that refreshes cached states with new prompts for smooth, adherent switches; streaming long tuning to enable long video training and to align training and inference (train-long-test-long); and short window attention paired with a frame-level attention sink, shorten as frame sink, preserving long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100, achieves strong performance on VBench in both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU. LongLive further supports INT8-quantized inference with only marginal quality loss.

Prerit Gupta, Shourya Verma, Ananth Grama, Aniket Bera

Main category: cs.CV

TL;DR: DualFlow is a unified framework for generating realistic two-person motion conditioned on text, music, or prior motion sequences, using rectified flow for efficient sampling and RAG for enhanced semantic grounding.

Details

Motivation: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains challenging in computer graphics, animation, and human-computer interaction.

Method: Uses rectified flow for deterministic straight-line sampling paths between noise and data, employs Retrieval-Augmented Generation (RAG) with music features and LLM-based text decompositions, and incorporates contrastive objective and synchronization loss.

Result: Extensive evaluations show consistent gains in motion quality, responsiveness, and efficiency across text-to-motion, music-to-motion, and multi-modal interactive benchmarks.

Conclusion: DualFlow produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.

Abstract: Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. We use contrastive objective that further strengthens alignment with conditioning signals and introduce synchronization loss that improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting state-of-the-art in multi-modal human motion generation.

[262] IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?

Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi

Main category: cs.CV

TL;DR: IWR-Bench is a new benchmark for evaluating Large Vision-Language Models’ ability to reconstruct interactive webpages from videos, addressing limitations of static screenshot-to-code tasks by focusing on dynamic interactions in real-world web applications.

Details

Motivation: Existing benchmarks focus on static screenshot-to-code tasks, overlooking dynamic interactions fundamental to real-world web applications. This gap limits evaluation of models' ability to handle temporal dynamics and event-driven logic.

Method: Created IWR-Bench with 113 tasks from 100 real-world websites, featuring 1,001 actions with diverse interaction complexities. Includes user interaction videos and crawled static assets. Uses agent-as-a-judge framework with comprehensive metrics to assess functional correctness and visual fidelity.

Result: Extensive experiments on 28 LVLMs show poor performance - best model achieves only 36.35% overall score. Functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS), revealing critical limitations in temporal reasoning and event-driven logic synthesis.

Conclusion: IWR-Bench establishes a challenging frontier for vision-language research, highlighting that current models struggle significantly with dynamic webpage reconstruction from video, particularly in reasoning about temporal dynamics and generating functional interaction logic.

Abstract: The webpage-to-code task requires models to understand visual representations of webpages and generate corresponding code. However, existing benchmarks primarily focus on static screenshot-to-code tasks, thereby overlooking the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions and featuring diverse interaction complexities (e.g., web games), visual styles, and domains. Aligning with standard web development practices, each task includes not only user interaction videos but also all crawled static assets (e.g., images, videos). This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models’ ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/L-O-I/IWR-Bench.

[263] GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning

Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan

Main category: cs.CV

TL;DR: A novel post-training framework that incorporates task-aware rewards to adapt reinforcement learning models for Earth Observation tasks, improving reasoning capabilities and performance across multiple EO benchmarks.

Details

Motivation: Reinforcement learning has shown strong reasoning capabilities in natural image domains but remains largely unexplored for Earth Observation tasks, which present unique challenges like referred object detection, image captioning, change detection, and temporal analysis that require task-aware reasoning.

Method: Proposed a post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks, enhancing reasoning for remote sensing images while stabilizing optimization and improving robustness.

Result: Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision language models.

Conclusion: The proposed framework successfully adapts RL models to Earth Observation tasks, demonstrating improved reasoning capabilities and superior performance compared to existing approaches.

Abstract: Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task aware reasoning. We propose a novel post training framework that incorporates task aware rewards to enable effective adaptation of reasoning based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state of the art generic and specialized vision language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/ .

[264] Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, Qingdong He, Wenbin Wu, Donghao Luo, Zhenye Gan, Junwei Zhu, Yunhang Shen, Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan

Main category: cs.CV

TL;DR: Human-MME is a comprehensive benchmark for evaluating multimodal large language models on human-centric scene understanding, featuring diverse scenarios, progressive evaluation dimensions, and high-quality annotations.

Details

Motivation: Current MLLMs lack proper evaluation for human-centric scene comprehension due to absence of benchmarks that consider both granular human-oriented perception and higher-dimensional causal reasoning.

Method: Created a curated benchmark with 4 primary visual domains, 15 secondary domains, and 43 sub-fields, featuring 19,945 real-world image question pairs across eight progressive evaluation dimensions with automated annotation pipeline and human-annotation platform.

Result: Extensive experiments on 17 state-of-the-art MLLMs exposed limitations in human-centric understanding and provided guidance for future research.

Conclusion: Human-MME enables more holistic evaluation of MLLMs in human-centric scene understanding and will guide future development toward better human-centric image comprehension.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily due to the absence of comprehensive evaluation benchmarks that take into account both the human-oriented granular level and higher-dimensional causal reasoning ability. Such high-quality evaluation benchmarks face tough obstacles, given the physical complexity of the human body and the difficulty of annotating granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with other existing benchmarks, our work provides three key features: 1. Diversity in human scene, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, evaluating the human-based activities progressively from the human-oriented granular perception to the higher-dimensional reasoning, consisting of eight dimensions with 19,945 real-world image question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, constructing the automated annotation pipeline and human-annotation platform, supporting rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends the single-target understanding to the multi-person and multi-image mutual understanding by constructing the choice, short-answer, grounding, ranking and judgment question components, and complex questions of their combination. The extensive experiments on 17 state-of-the-art MLLMs effectively expose the limitations and guide future MLLMs research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.

[265] TTT3R: 3D Reconstruction as Test-Time Training

Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, Anpei Chen

Main category: cs.CV

TL;DR: TTT3R improves 3D reconstruction length generalization by treating it as a test-time training problem, using alignment confidence to dynamically adjust memory updates without additional training.

Details

Motivation: Modern RNNs for 3D reconstruction suffer from limited length generalization beyond training context lengths, degrading performance on longer sequences.

Method: Framed 3D reconstruction as test-time training, using alignment confidence between memory state and observations to derive closed-form learning rates for memory updates, balancing historical retention and new adaptation.

Result: Achieved 2× improvement in global pose estimation over baselines, operating at 20 FPS with 6 GB GPU memory for thousands of images.

Conclusion: TTT3R provides training-free intervention that substantially improves length generalization in 3D reconstruction while maintaining efficiency.

Abstract: Modern Recurrent Neural Networks have become a competitive architecture for 3D reconstruction due to their linear-time complexity. However, their performance degrades significantly when applied beyond the training context length, revealing limited length generalization. In this work, we revisit the 3D reconstruction foundation models from a Test-Time Training perspective, framing their designs as an online learning problem. Building on this perspective, we leverage the alignment confidence between the memory state and incoming observations to derive a closed-form learning rate for memory updates, to balance between retaining historical information and adapting to new observations. This training-free intervention, termed TTT3R, substantially improves length generalization, achieving a $2\times$ improvement in global pose estimation over baselines, while operating at 20 FPS with just 6 GB of GPU memory to process thousands of images. Code available in https://rover-xingyu.github.io/TTT3R

[266] Pathology-CoT: Learning Visual Chain-of-Thought Agent from Expert Whole Slide Image Diagnosis Behavior

Sheng Wang, Ruiming Wu, Charles Herndon, Yihang Liu, Shunsuke Koga, Jeanne Shen, Zhi Huang

Main category: cs.CV

TL;DR: This paper introduces a framework for creating agentic pathology systems that can navigate whole-slide images like human pathologists, using recorded viewing behavior to train AI models for explainable diagnoses.

Details

Motivation: Current pathology foundation models lack practical agentic systems that can decide where to look next, adjust magnification, and provide explainable diagnoses. This is bottlenecked by the lack of scalable supervision for expert viewing behavior that is tacit and experience-based.

Method: Three key breakthroughs: 1) AI Session Recorder that records routine navigation in standard viewers, 2) Human-in-the-loop review to create Pathology-CoT dataset with rationales, 3) Pathology-o3 agent that proposes important ROIs and performs behavior-guided reasoning.

Result: Achieved 100% recall on internal validation from Stanford Medicine and 97.6% recall on external validation from Sweden, exceeding state-of-the-art OpenAI o3 model and generalizing across backbones.

Conclusion: The framework makes agentic pathology practical by turning everyday viewer logs into scalable, expert-validated supervision, establishing a path to human-aligned, upgradeable clinical AI.

Abstract: Diagnosing a whole-slide image is an interactive, multi-stage process of changing magnification and moving between fields. Although recent pathology foundation models demonstrated superior performances, practical agentic systems that decide what field to examine next, adjust magnification, and deliver explainable diagnoses are still lacking. Such limitation is largely bottlenecked by data: scalable, clinically aligned supervision of expert viewing behavior that is tacit and experience-based, not documented in textbooks or internet, and therefore absent from LLM training. Here we introduce a framework designed to address this challenge through three key breakthroughs. First, the AI Session Recorder seamlessly integrates with standard whole-slide image viewers to unobtrusively record routine navigation and convert the viewer logs into standardized behavioral commands and bounding boxes. Second, a lightweight human-in-the-loop review turns AI-drafted rationales for behavioral commands into the Pathology-CoT dataset, a form of paired “where to look” and “why it matters”, enabling six-fold faster labeling compared to manual constructing such Chain-of-Thought dataset. Using this behavioral data, we build Pathology-o3, a two-stage agent that first proposes important ROIs and then performs behavior-guided reasoning. On the gastrointestinal lymph-node metastasis detection task, our method achieved 100 recall on the internal validation from Stanford Medicine and 97.6 recall on an independent external validation from Sweden, exceeding the state-of-the-art OpenAI o3 model and generalizing across backbones. To our knowledge, Pathology-CoT constitutes one of the first behavior-grounded agentic systems in pathology. Turning everyday viewer logs into scalable, expert-validated supervision, our framework makes agentic pathology practical and establishes a path to human-aligned, upgradeable clinical AI.

[267] Hands-Free Heritage: Automated 3D Scanning for Cultural Heritage Digitization

Javed Ahmad, Federico Dassiè, Selene Frascella, Gabriele Marchello, Ferdinando Cannella, Arianna Traviglia

Main category: cs.CV

TL;DR: Automated two-robot system for high-fidelity 3D scanning of cultural heritage artefacts using coordinated robotic manipulation and optimized trajectory planning.

Details

Motivation: Conventional 3D scanning methods require specialized expertise and manual intervention, which limits efficiency and accessibility for cultural heritage preservation.

Method: Two-robot system with coordinated motion planning: one robot handles scanning while another manages tray handling. Parameterizes scanning space into regions with optimized trajectory planning and waypoint distribution for comprehensive coverage.

Result: Achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, demonstrating superior geometric accuracy and improved digitization efficiency.

Conclusion: The automated system eliminates need for handheld/semi-automatic workflows, reduces reliance on expert operators, and provides efficient, high-quality 3D scanning for cultural heritage preservation.

Abstract: High-fidelity 3D scanning is essential for preserving cultural heritage artefacts, supporting documentation, analysis, and long-term conservation. However, conventional methods typically require specialized expertise and manual intervention to maintain optimal scanning conditions and coverage. We present an automated two-robot scanning system that eliminates the need for handheld or semi-automatic workflows by combining coordinated robotic manipulation with high-resolution 3D scanning. Our system parameterizes the scanning space into distinct regions, enabling coordinated motion planning between a scanner-equipped robot and a tray-handling robot. Optimized trajectory planning and waypoint distribution ensure comprehensive surface coverage, minimize occlusions, and balance reconstruction accuracy with system efficiency. Experimental results show that our approach achieves significantly lower Chamfer Distance and higher F-score compared to baseline methods, offering superior geometric accuracy, improved digitization efficiency, and reduced reliance on expert operators.

[268] BenthiCat: An opti-acoustic dataset for advancing benthic classification and habitat mapping

Hayat Rajani, Valerio Franchi, Borja Martinez-Clavel Valles, Raimon Ramos, Rafael Garcia, Nuno Gracias

Main category: cs.CV

TL;DR: A comprehensive multi-modal dataset for benthic habitat mapping with side-scan sonar tiles, bathymetric maps, and optical images, including annotated segmentation masks and tools for machine learning development.

Details

Motivation: To address the scarcity of large annotated datasets for marine ecosystem research and enable benchmarking of machine learning models in underwater habitat mapping.

Method: Collection of approximately 1 million side-scan sonar tiles from Catalonia coast, complemented by bathymetric maps and co-registered optical images from AUV surveys, with 36,000 manually annotated tiles for supervised learning.

Result: Creation of a standardized multi-modal dataset with spatial association between optical images and SSS tiles to support cross-modal representation learning and algorithm development.

Conclusion: This resource establishes a benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration through accessible open-source tools.

Abstract: Benthic habitat mapping is fundamental for understanding marine ecosystems, guiding conservation efforts, and supporting sustainable resource management. Yet, the scarcity of large, annotated datasets limits the development and benchmarking of machine learning models in this domain. This paper introduces a thorough multi-modal dataset, comprising about a million side-scan sonar (SSS) tiles collected along the coast of Catalonia (Spain), complemented by bathymetric maps and a set of co-registered optical images from targeted surveys using an autonomous underwater vehicle (AUV). Approximately \num{36000} of the SSS tiles have been manually annotated with segmentation masks to enable supervised fine-tuning of classification models. All the raw sensor data, together with mosaics, are also released to support further exploration and algorithm development. To address challenges in multi-sensor data fusion for AUVs, we spatially associate optical images with corresponding SSS tiles, facilitating self-supervised, cross-modal representation learning. Accompanying open-source preprocessing and annotation tools are provided to enhance accessibility and encourage research. This resource aims to establish a standardized benchmark for underwater habitat mapping, promoting advancements in autonomous seafloor classification and multi-sensor integration.

[269] Denoised Diffusion for Object-Focused Image Augmentation

Nisha Pillai, Aditi Virupakshaiah, Harrison W. Smith, Amanda J. Ashworth, Prasanna Gowda, Phillip R. Owens, Adam R. Rivers, Bindu Nanduri, Mahalingam Ramkumar

Main category: cs.CV

TL;DR: Proposes an object-focused data augmentation framework using segmentation and diffusion-based synthesis to enhance animal detection in drone-based monitoring with limited data.

Details

Motivation: Addresses limited data availability and scene-specific challenges in aerial drone-based animal health monitoring, where transfer learning fails due to lack of farm-specific datasets.

Method: Segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes for animal detection.

Result: Initial experiments show superior performance compared to baseline models on animal detection tasks using the augmented dataset.

Conclusion: The method enables real-time animal health monitoring in data-scarce scenarios by generating domain-specific data, bridging the gap between limited data and practical applications.

Abstract: Modern agricultural operations increasingly rely on integrated monitoring systems that combine multiple data sources for farm optimization. Aerial drone-based animal health monitoring serves as a key component but faces limited data availability, compounded by scene-specific issues such as small, occluded, or partially visible animals. Transfer learning approaches often fail to address this limitation due to the unavailability of large datasets that reflect specific farm conditions, including variations in animal breeds, environments, and behaviors. Therefore, there is a need for developing a problem-specific, animal-focused data augmentation strategy tailored to these unique challenges. To address this gap, we propose an object-focused data augmentation framework designed explicitly for animal health monitoring in constrained data settings. Our approach segments animals from backgrounds and augments them through transformations and diffusion-based synthesis to create realistic, diverse scenes that enhance animal detection and monitoring performance. Our initial experiments demonstrate that our augmented dataset yields superior performance compared to our baseline models on the animal detection task. By generating domain-specific data, our method empowers real-time animal health monitoring solutions even in data-scarce scenarios, bridging the gap between limited data and practical applicability.

Clara Tomasini, Luis Riazuelo, Ana C. Murillo

Main category: cs.CV

TL;DR: An image-based bronchoscopy topological localization pipeline that provides navigation assistance without requiring patient CT scans, trained only on phantom data with good generalization to real data.

Details

Motivation: Current bronchoscopy navigation methods require CT scans and additional sensors, which involve extra setup, scans, and training. A simpler topological localization approach could provide sufficient navigation assistance without these requirements.

Method: Image-based bronchoscopy topological localization pipeline trained exclusively on phantom data, eliminating the need for expensive real data labeling.

Result: The approach surpasses existing methods, particularly showing strong performance on real data test sequences despite being trained only on phantom data.

Conclusion: The proposed pipeline provides effective bronchoscopy navigation assistance without requiring patient CT scans, demonstrating good generalization from phantom training data to real clinical scenarios.

Abstract: Video bronchoscopy is a fundamental procedure in respiratory medicine, where medical experts navigate through the bronchial tree of a patient to diagnose or operate the patient. Surgeons need to determine the position of the scope as they go through the airway until they reach the area of interest. This task is very challenging for practitioners due to the complex bronchial tree structure and varying doctor experience and training. Navigation assistance to locate the bronchoscope during the procedure can improve its outcome. Currently used techniques for navigational guidance commonly rely on previous CT scans of the patient to obtain a 3D model of the airway, followed by tracking of the scope with additional sensors or image registration. These methods obtain accurate locations but imply additional setup, scans and training. Accurate metric localization is not always required, and a topological localization with regard to a generic airway model can often suffice to assist the surgeon with navigation. We present an image-based bronchoscopy topological localization pipeline to provide navigation assistance during the procedure, with no need of patient CT scan. Our approach is trained only on phantom data, eliminating the high cost of real data labeling, and presents good generalization capabilities. The results obtained surpass existing methods, particularly on real data test sequences.

[271] J-RAS: Enhancing Medical Image Segmentation via Retrieval-Augmented Joint Training

Salma J. Ahmed, Emad A. Mohammed, Azam Asilian Bidgoli

Main category: cs.CV

TL;DR: J-RAS is a joint training method that combines segmentation and retrieval models to improve medical image segmentation by leveraging retrieved image-mask pairs for better anatomical understanding and boundary delineation.

Details

Motivation: Manual medical image segmentation is time-consuming and variable, while AI methods require large annotated datasets and struggle with generalization across diverse imaging conditions and rare cases.

Method: Joint training of segmentation and retrieval models where both are optimized together - the segmentation model uses retrieved image-mask pairs for anatomical context, while the retrieval model learns segmentation-relevant features beyond visual similarity.

Result: Consistent improvements across multiple segmentation backbones (U-Net, TransUNet, SAM, SegFormer) on ACDC and M&Ms datasets. For example, SegFormer with J-RAS improved Dice score from 0.8708 to 0.9115 and reduced Hausdorff Distance from 1.8130 to 1.1489 on ACDC dataset.

Conclusion: J-RAS effectively enhances segmentation performance by enabling retrieval to provide meaningful contextual cues, demonstrating strong generalizability across different architectures and datasets.

Abstract: Image segmentation, the process of dividing images into meaningful regions, is critical in medical applications for accurate diagnosis, treatment planning, and disease monitoring. Although manual segmentation by healthcare professionals produces precise outcomes, it is time-consuming, costly, and prone to variability due to differences in human expertise. Artificial intelligence (AI)-based methods have been developed to address these limitations by automating segmentation tasks; however, they often require large, annotated datasets that are rarely available in practice and frequently struggle to generalize across diverse imaging conditions due to inter-patient variability and rare pathological cases. In this paper, we propose Joint Retrieval Augmented Segmentation (J-RAS), a joint training method for guided image segmentation that integrates a segmentation model with a retrieval model. Both models are jointly optimized, enabling the segmentation model to leverage retrieved image-mask pairs to enrich its anatomical understanding, while the retrieval model learns segmentation-relevant features beyond simple visual similarity. This joint optimization ensures that retrieval actively contributes meaningful contextual cues to guide boundary delineation, thereby enhancing the overall segmentation performance. We validate J-RAS across multiple segmentation backbones, including U-Net, TransUNet, SAM, and SegFormer, on two benchmark datasets: ACDC and M&Ms, demonstrating consistent improvements. For example, on the ACDC dataset, SegFormer without J-RAS achieves a mean Dice score of 0.8708$\pm$0.042 and a mean Hausdorff Distance (HD) of 1.8130$\pm$2.49, whereas with J-RAS, the performance improves substantially to a mean Dice score of 0.9115$\pm$0.031 and a mean HD of 1.1489$\pm$0.30. These results highlight the method’s effectiveness and its generalizability across architectures and datasets.

[272] HccePose(BF): Predicting Front & Back Surfaces to Construct Ultra-Dense 2D-3D Correspondences for Pose Estimation

Yulin Wang, Mengting Hu, Hongli Li, Chen Luo

Main category: cs.CV

TL;DR: This paper proposes a novel pose estimation method that predicts 3D coordinates for both front and back surfaces of objects, creating ultra-dense 2D-3D correspondences to improve pose estimation accuracy using the PnP algorithm.

Details

Motivation: Current pose estimation methods focus only on the object's front surface, overlooking the potential benefits of incorporating the back surface and interior of the object for more accurate pose estimation.

Method: The method predicts 3D coordinates of both front and back surfaces, densely samples coordinates between them to create ultra-dense 2D-3D correspondences, and uses Hierarchical Continuous Coordinate Encoding (HCCE) for efficient representation of surface coordinates.

Result: Experimental results show the proposed approach outperforms existing state-of-the-art methods across seven classic BOP core datasets on the BOP website.

Conclusion: Incorporating both front and back surfaces with dense sampling significantly enhances pose estimation accuracy, demonstrating the importance of utilizing the full object surface and interior for improved performance.

Abstract: In pose estimation for seen objects, a prevalent pipeline involves using neural networks to predict dense 3D coordinates of the object surface on 2D images, which are then used to establish dense 2D-3D correspondences. However, current methods primarily focus on more efficient encoding techniques to improve the precision of predicted 3D coordinates on the object’s front surface, overlooking the potential benefits of incorporating the back surface and interior of the object. To better utilize the full surface and interior of the object, this study predicts 3D coordinates of both the object’s front and back surfaces and densely samples 3D coordinates between them. This process creates ultra-dense 2D-3D correspondences, effectively enhancing pose estimation accuracy based on the Perspective-n-Point (PnP) algorithm. Additionally, we propose Hierarchical Continuous Coordinate Encoding (HCCE) to provide a more accurate and efficient representation of front and back surface coordinates. Experimental results show that, compared to existing state-of-the-art (SOTA) methods on the BOP website, the proposed approach outperforms across seven classic BOP core datasets. Code is available at https://github.com/WangYuLin-SEU/HCCEPose.

[273] VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu

Main category: cs.CV

TL;DR: VR-Thinker is a thinking-with-image framework that enhances multimodal reward models by enabling active visual reasoning operations and configurable visual memory windows, achieving state-of-the-art performance on video preference benchmarks.

Details

Motivation: Current multimodal reward models face limitations: visual inputs consume large context budgets (forcing fewer frames and loss of details) and packing all visual information into initial prompts exacerbates hallucination and forgetting during reasoning.

Method: Introduces VR-Thinker with visual reasoning operations (e.g., select frame) and configurable visual memory window. Uses reinforcement fine-tuning pipeline: (i) Cold Start with visual chain-of-thought data, (ii) Rejection sampling Fine-Tuning on high-quality traces, and (iii) Group Relative Policy Optimization (GRPO).

Result: Achieves state-of-the-art accuracy: 7B VR-Thinker gets 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video, especially effective for longer videos.

Conclusion: The approach validates the effectiveness and promise of thinking-with-image multimodal reward modeling, overcoming limitations of current visual reward models.

Abstract: Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.

[274] FlexAC: Towards Flexible Control of Associative Reasoning in Multimodal Large Language Models

Shengming Yuan, Xinyu Lyu, Shuailong Wang, Beitao Chen, Jingkuan Song, Lianli Gao

Main category: cs.CV

TL;DR: FlexAC is a training-free framework that enables flexible control over associative reasoning in multimodal LLMs by modulating intermediate layer representations using hallucination-guided steering vectors.

Details

Motivation: MLLMs face a trade-off between faithfulness and creativity, but existing methods lack flexibility to modulate associative reasoning strength for different tasks.

Method: FlexAC induces hallucination-guided intermediate representations, constructs associative steering vectors from high-association instances, and incorporates task-specific vectors for multi-dimensional associative reasoning.

Result: Achieves 5.8x improvement in creativity on Creation-MMBench and 29% reduction in hallucination rate on CHAIR, outperforming existing baselines.

Conclusion: FlexAC effectively enables flexible control over associative reasoning in MLLMs, balancing creative guidance with output stability across factual and creative scenarios.

Abstract: Multimodal large language models (MLLMs) face an inherent trade-off between faithfulness and creativity, as different tasks require varying degrees of associative reasoning. However, existing methods lack the flexibility to modulate this reasoning strength, limiting MLLMs’ adaptability across factual and creative scenarios. To bridge this gap, we propose equipping MLLMs with mechanisms that enable flexible control over associative reasoning. We begin by investigating the internal mechanisms underlying associative behavior in MLLMs and find that: (1) middle layers play a pivotal role in shaping model’s associative tendencies, (2) modifying representations in these layers effectively regulates associative reasoning strength, and (3) hallucinations can be exploited to derive steering vectors that guide this modulation. Building on these findings, we introduce Flexible Association Control (FlexAC), a lightweight and training-free framework for modulating associative behavior in MLLMs. FlexAC first induces hallucination-guided intermediate representations to encode associative directions. Then, it selects high-association instances to construct effective associative steering vectors, whose strengths are adaptively calibrated to balance creative guidance with output stability. Finally, recognizing the multi-dimensional nature of associative reasoning, FlexAC incorporates task-specific associative vectors derived from a forward pass on a few target-domain samples, enabling models to follow diverse associative directions and better adapt to creative tasks. Notably, our method achieves up to a 5.8x improvement in creativity on Creation-MMBench and a 29% reduction in hallucination rate on CHAIR, surpassing existing baselines and demonstrating its effectiveness in enabling flexible control over associative reasoning in MLLMs. Our code is available at https://github.com/ylhz/FlexAC.

[275] REACT3D: Recovering Articulations for Interactive Physical 3D Scenes

Zhao Huang, Boyang Sun, Alexandros Delitzas, Jiaqi Chen, Marc Pollefeys

Main category: cs.CV

TL;DR: REACT3D is a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry for embodied intelligence applications.

Details

Motivation: Existing 3D scene datasets are limited due to labor-intensive annotation of part segmentation, kinematic types, and motion trajectories, creating a need for scalable solutions.

Method: The framework includes: (i) openable-object detection and segmentation, (ii) articulation estimation for joint types and motion parameters, (iii) hidden-geometry completion with interactive object assembly, and (iv) interactive scene integration in standard formats.

Result: Achieves state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes.

Conclusion: Provides a practical foundation for scalable interactive scene generation, lowering the barrier to large-scale research on articulated scene understanding.

Abstract: Interactive 3D scenes are increasingly vital for embodied intelligence, yet existing datasets remain limited due to the labor-intensive process of annotating part segmentation, kinematic types, and motion trajectories. We present REACT3D, a scalable zero-shot framework that converts static 3D scenes into simulation-ready interactive replicas with consistent geometry, enabling direct use in diverse downstream tasks. Our contributions include: (i) openable-object detection and segmentation to extract candidate movable parts from static scenes, (ii) articulation estimation that infers joint types and motion parameters, (iii) hidden-geometry completion followed by interactive object assembly, and (iv) interactive scene integration in widely supported formats to ensure compatibility with standard simulation platforms. We achieve state-of-the-art performance on detection/segmentation and articulation metrics across diverse indoor scenes, demonstrating the effectiveness of our framework and providing a practical foundation for scalable interactive scene generation, thereby lowering the barrier to large-scale research on articulated scene understanding. Our project page is https://react3d.github.io/

[276] AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai, Haijing Liu, Yuqing Qiu, Ke Chen, Zixian Li, Chi Xie, Huafei Li, Chenxing Li, Chuangchuang Wang, Kai Tang, Zhiguang Zhu, Kai Tang, Wenmei Gao, Rui Wang, Jun Wu, Chao Liu, Qin Xie, Chen Chen, Haonan Lu

Main category: cs.CV

TL;DR: AndesVL is a suite of mobile-optimized MLLMs with 0.6B-4B parameters that achieves state-of-the-art performance while being deployable on edge devices through efficient compression and acceleration techniques.

Details

Motivation: Cloud-based MLLMs are too large for edge devices due to memory, power, and computing constraints, creating a need for efficient mobile-side MLLMs.

Method: Based on Qwen3’s LLM with various visual encoders, using 1+N LoRA architecture, Quantization-Aware LoRA Fine-Tuning (QALFT), cache eviction algorithm (OKV), speculative decoding, and compression strategies.

Result: Achieves first-tier performance across multiple benchmarks, with 6.7x peak decoding speedup, 30.9% memory reduction, and 1.8 bits-per-weight when deployed on MediaTek Dimensity 9500 chips.

Conclusion: AndesVL demonstrates that high-performance MLLMs can be effectively deployed on mobile devices through careful architecture design and optimization techniques.

Abstract: In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, they significantly surpass the limitations in memory, power consumption, and computing capacity of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3’s LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA architecture alongside a Quantization-Aware LoRA Fine-Tuning (QALFT) framework to facilitate efficient task adaptation and model compression during mobile-side deployment of AndesVL. Moreover, utilizing our cache eviction algorithm – OKV – along with customized speculative decoding and compression strategies, we achieve a 6.7x peak decoding speedup ratio, up to 30.9% memory reduction, and 1.8 bits-per-weight when deploying AndesVL-4B on MediaTek Dimensity 9500 chips. We release all models on https://huggingface.co/OPPOer.

[277] Massive Activations are the Key to Local Detail Synthesis in Diffusion Transformers

Chaofan Gan, Zicheng Zhao, Yuanpeng Tu, Xi Chen, Ziran Qin, Tieyuan Chen, Mehrtash Harandi, Weiyao Lin

Main category: cs.CV

TL;DR: The paper investigates massive activations in Diffusion Transformers (DiTs) and proposes Detail Guidance (DG), a training-free method to enhance local detail synthesis in visual generation.

Details

Motivation: Recent observations show massive activations in DiTs' internal feature maps, but their function remains poorly understood. The authors aim to systematically investigate these activations and leverage them for improving detail synthesis.

Method: The authors propose Detail Guidance (DG), a training-free self-guidance strategy that constructs a degraded ‘detail-deficient’ model by disrupting massive activations and uses it to guide the original network toward better detail synthesis. DG can integrate with Classifier-Free Guidance.

Result: Extensive experiments show that DG consistently improves fine-grained detail quality across various pre-trained DiTs including SD3, SD3.5, and Flux.

Conclusion: Massive activations in DiTs play a key role in local detail synthesis, and the proposed Detail Guidance method effectively leverages this understanding to enhance detail fidelity without additional training.

Abstract: Diffusion Transformers (DiTs) have recently emerged as a powerful backbone for visual generation. Recent observations reveal \emph{Massive Activations} (MAs) in their internal feature maps, yet their function remains poorly understood. In this work, we systematically investigate these activations to elucidate their role in visual generation. We found that these massive activations occur across all spatial tokens, and their distribution is modulated by the input timestep embeddings. Importantly, our investigations further demonstrate that these massive activations play a key role in local detail synthesis, while having minimal impact on the overall semantic content of output. Building on these insights, we propose \textbf{D}etail \textbf{G}uidance (\textbf{DG}), a MAs-driven, training-free self-guidance strategy to explicitly enhance local detail fidelity for DiTs. Specifically, DG constructs a degraded ``detail-deficient’’ model by disrupting MAs and leverages it to guide the original network toward higher-quality detail synthesis. Our DG can seamlessly integrate with Classifier-Free Guidance (CFG), enabling further refinements of fine-grained details. Extensive experiments demonstrate that our DG consistently improves fine-grained detail quality across various pre-trained DiTs (\eg, SD3, SD3.5, and Flux).

[278] Benchmarking foundation models for hyperspectral image classification: Application to cereal crop type mapping

Walid Elbarz, Mohamed Bourriz, Hicham Hajji, Hamd Ait Abdelali, François Bourzeix

Main category: cs.CV

TL;DR: This study benchmarks three foundation models (HyperSigma, DOFA, and Vision Transformers pre-trained on SpectralEarth) for hyperspectral cereal crop mapping, finding that the SpectralEarth model achieves the best performance (93.5% OA) and highlighting the importance of model architecture for cross-regional generalization.

Details

Motivation: Foundation models are transforming Earth observation but their potential for hyperspectral crop mapping remains underexplored, creating a need for systematic evaluation of these models for operational agricultural applications.

Method: Benchmarked three foundation models on hyperspectral imagery for cereal crop mapping: HyperSigma, DOFA, and Vision Transformers pre-trained on SpectralEarth dataset. Models were fine-tuned on manually labeled training data and evaluated on independent test regions using overall accuracy, average accuracy, and F1-score metrics.

Result: HyperSigma achieved 34.5% OA, DOFA reached 62.6% OA, and the SpectralEarth model achieved the best performance with 93.5% OA. A compact SpectralEarth variant trained from scratch achieved 91%, demonstrating strong generalization capabilities.

Conclusion: The results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development, emphasizing the importance of model architecture for strong generalization across geographic regions and sensor platforms.

Abstract: Foundation models are transforming Earth observation, but their potential for hyperspectral crop mapping remains underexplored. This study benchmarks three foundation models for cereal crop mapping using hyperspectral imagery: HyperSigma, DOFA, and Vision Transformers pre-trained on the SpectralEarth dataset (a large multitemporal hyperspectral archive). Models were fine-tuned on manually labeled data from a training region and evaluated on an independent test region. Performance was measured with overall accuracy (OA), average accuracy (AA), and F1-score. HyperSigma achieved an OA of 34.5% (+/- 1.8%), DOFA reached 62.6% (+/- 3.5%), and the SpectralEarth model achieved an OA of 93.5% (+/- 0.8%). A compact SpectralEarth variant trained from scratch achieved 91%, highlighting the importance of model architecture for strong generalization across geographic regions and sensor platforms. These results provide a systematic evaluation of foundation models for operational hyperspectral crop mapping and outline directions for future model development.

cs.AI

[279] AI Agents for the Dhumbal Card Game: A Comparative Study

Sahaj Raj Malla

Main category: cs.AI

TL;DR: This study compares AI agents for Dhumbal card game, finding rule-based Aggressive agent outperforms search-based and learning-based methods with 88.3% win rate through effective Jhyap declarations.

Details

Motivation: To evaluate AI agents for culturally significant Dhumbal card game with imperfect information and advance AI research while supporting digital preservation of cultural games.

Method: Implemented diverse agents: rule-based (Aggressive, Conservative, Balanced, Opportunistic), search-based (MCTS, ISMCTS), and learning-based (DQN, PPO) approaches. Evaluated through tournaments with statistical analysis including Welch’s t-test, Cohen’s d, and 95% confidence intervals over 1024 rounds.

Result: Rule-based Aggressive agent achieved highest win rate (88.3%, 95% CI: [86.3, 90.3]), significantly outperforming ISMCTS (9.0%) and PPO (1.5%) through effective exploitation of Jhyap declarations.

Conclusion: Heuristic rule-based approaches can outperform sophisticated search and learning methods in imperfect information games like Dhumbal, contributing a reproducible AI framework and insights for cultural game preservation.

Abstract: This study evaluates Artificial Intelligence (AI) agents for Dhumbal, a culturally significant multiplayer card game with imperfect information, through a systematic comparison of rule-based, search-based, and learning-based strategies. We formalize Dhumbal’s mechanics and implement diverse agents, including heuristic approaches (Aggressive, Conservative, Balanced, Opportunistic), search-based methods such as Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS), and reinforcement learning approaches including Deep Q-Network (DQN) and Proximal Policy Optimization (PPO), and a random baseline. Evaluation involves within-category tournaments followed by a cross-category championship. Performance is measured via win rate, economic outcome, Jhyap success, cards discarded per round, risk assessment, and decision efficiency. Statistical significance is assessed using Welch’s t-test with Bonferroni correction, effect sizes via Cohen’s d, and 95% confidence intervals (CI). Across 1024 simulated rounds, the rule-based Aggressive agent achieves the highest win rate (88.3%, 95% CI: [86.3, 90.3]), outperforming ISMCTS (9.0%) and PPO (1.5%) through effective exploitation of Jhyap declarations. The study contributes a reproducible AI framework, insights into heuristic efficacy under partial information, and open-source code, thereby advancing AI research and supporting digital preservation of cultural games.

[280] Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations

Suryaansh Jain, Umair Z. Ahmed, Shubham Sahai, Ben Leong

Main category: cs.AI

TL;DR: LLMs as evaluators suffer from strong positive bias, being good at identifying valid outputs but poor at detecting invalid ones. The paper proposes minority-veto ensemble and regression-based methods to mitigate this bias.

Details

Motivation: Human evaluation of LLMs is costly and unscalable, while current LLM-as-judge approaches exhibit systematic positive bias that inflates reliability scores.

Method: Proposes two methods: 1) optimal minority-veto ensemble strategy resilient to missing data, and 2) regression-based framework that models validator bias using small human-annotated ground truth data.

Result: On a code feedback task with 366 Python programs, the regression approach reduced maximum absolute error to 1.2%, achieving 2x improvement over best-performing ensemble of 14 state-of-the-art LLMs.

Conclusion: The proposed methods effectively mitigate LLM evaluator bias, with regression-based approach providing particularly high precision for scenarios requiring accurate evaluation.

Abstract: New Large Language Models (LLMs) become available every few weeks, and modern application developers confronted with the unenviable task of having to decide if they should switch to a new model. While human evaluation remains the gold standard, it is costly and unscalable. The state-of-the-art approach is to use LLMs as evaluators ( LLM-as-a-judge), but this suffers from a critical flaw: LLMs exhibit a strong positive bias. We provide empirical evidence showing that while LLMs can identify valid outputs with high accuracy (i.e., True Positive Rate 96%), they are remarkably poor at identifying invalid ones (i.e., True Negative Rate <25%). This systematic bias, coupled with class imbalance, often leads to inflated reliability scores. While ensemble-based methods like majority voting can help, we show that they are not good enough. We introduce an optimal minority-veto strategy that is resilient to missing data and mitigates this bias to a large extent. For scenarios requiring even higher precision, we propose a novel regression-based framework that directly models the validator bias using a small set of human-annotated ground truth data. On a challenging code feedback task over 366 high-school Python programs, our regression approach reduces the maximum absolute error to just 1.2%, achieving a 2x improvement over the best-performing ensemble of 14 state-of-the-art LLMs.

[281] Holistic Agent Leaderboard: The Missing Infrastructure for AI Agent Evaluation

Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, Dawn Song, Peter Henderson, Yu Su, Percy Liang, Arvind Narayanan

Main category: cs.AI

TL;DR: Introduces HAL (Holistic Agent Leaderboard) to address AI agent evaluation challenges through standardized evaluation harness, multi-dimensional analysis, and LLM-aided log inspection.

Details

Motivation: AI agent evaluations suffer from challenges that undermine understanding of real-world performance, with current methods often producing unreliable results.

Method: Three main contributions: 1) Standardized evaluation harness for parallel evaluations across hundreds of VMs, 2) Three-dimensional analysis across models, scaffolds, and benchmarks with 21,730 agent rollouts, 3) LLM-aided log inspection to uncover unreported behaviors.

Result: Reduced evaluation time from weeks to hours, identified surprising insights (e.g., higher reasoning effort reducing accuracy), uncovered previously unreported behaviors like searching for benchmarks instead of solving tasks or misusing credit cards.

Conclusion: HAL standardizes agent evaluation and addresses common pitfalls, shifting focus from benchmark performance to real-world reliability, with all 2.5B tokens of agent logs shared for further research.

Abstract: AI agents have been developed for complex real-world tasks from coding to customer service. But AI agent evaluations suffer from many challenges that undermine our understanding of how well agents really work. We introduce the Holistic Agent Leaderboard (HAL) to address these challenges. We make three main contributions. First, we provide a standardized evaluation harness that orchestrates parallel evaluations across hundreds of VMs, reducing evaluation time from weeks to hours while eliminating common implementation bugs. Second, we conduct three-dimensional analysis spanning models, scaffolds, and benchmarks. We validate the harness by conducting 21,730 agent rollouts across 9 models and 9 benchmarks in coding, web navigation, science, and customer service with a total cost of about $40,000. Our analysis reveals surprising insights, such as higher reasoning effort reducing accuracy in the majority of runs. Third, we use LLM-aided log inspection to uncover previously unreported behaviors, such as searching for the benchmark on HuggingFace instead of solving a task, or misusing credit cards in flight booking tasks. We share all agent logs, comprising 2.5B tokens of language model calls, to incentivize further research into agent behavior. By standardizing how the field evaluates agents and addressing common pitfalls in agent evaluation, we hope to shift the focus from agents that ace benchmarks to agents that work reliably in the real world.

[282] CGBench: Benchmarking Language Model Scientific Reasoning for Clinical Genetics Research

Owen Queen, Harrison G. Zhang, James Zou

Main category: cs.AI

TL;DR: CGBench is a benchmark for evaluating language models’ reasoning capabilities on scientific publications in clinical genetics, testing extraction of experimental results, evidence strength judgment, and outcome categorization.

Details

Motivation: Traditional variant and gene interpretation methods are manual and labor-intensive. Generative language models can accelerate translation of research into clinical insights, but existing benchmarks focus on narrow tasks that don't translate to real-world research.

Method: Built from ClinGen expert-curated literature interpretations, CGBench tests 8 different LMs on three capabilities: extracting experimental results following precise protocols, judging evidence strength, and categorizing experiment outcomes.

Result: Models show promise but have substantial gaps in literature interpretation, especially on fine-grained instructions. Reasoning models excel at fine-grained tasks while non-reasoning models are better at high-level interpretations. Models often hallucinate or misinterpret results even when correctly classifying evidence.

Conclusion: CGBench reveals strengths and weaknesses of LMs for precise scientific publication interpretation, opening avenues for future AI research in clinical genetics and science.

Abstract: Variant and gene interpretation are fundamental to personalized medicine and translational biomedicine. However, traditional approaches are manual and labor-intensive. Generative language models (LMs) can facilitate this process, accelerating the translation of fundamental research into clinically-actionable insights. While existing benchmarks have attempted to quantify the capabilities of LMs for interpreting scientific data, these studies focus on narrow tasks that do not translate to real-world research. To meet these challenges, we introduce CGBench, a robust benchmark that tests reasoning capabilities of LMs on scientific publications. CGBench is built from ClinGen, a resource of expert-curated literature interpretations in clinical genetics. CGBench measures the ability to 1) extract relevant experimental results following precise protocols and guidelines, 2) judge the strength of evidence, and 3) categorize and describe the relevant outcome of experiments. We test 8 different LMs and find that while models show promise, substantial gaps exist in literature interpretation, especially on fine-grained instructions. Reasoning models excel in fine-grained tasks but non-reasoning models are better at high-level interpretations. Finally, we measure LM explanations against human explanations with an LM judge approach, revealing that models often hallucinate or misinterpret results even when correctly classifying evidence. CGBench reveals strengths and weaknesses of LMs for precise interpretation of scientific publications, opening avenues for future research in AI for clinical genetics and science more broadly.

[283] Asking Clarifying Questions for Preference Elicitation With Large Language Models

Ali Montazeralghaem, Guy Tennenholtz, Craig Boutilier, Ofer Meshi

Main category: cs.AI

TL;DR: A novel two-stage training method inspired by diffusion models that teaches LLMs to ask sequential clarifying questions to better understand user preferences in recommendation systems.

Details

Motivation: To improve personalization in LLM-based recommendation systems by effectively eliciting user preferences through sequential clarifying questions, especially when user history is limited.

Method: Two-stage process: forward process generates clarifying questions and then removes answers step by step (adding ’noise’), reverse process trains model to ‘denoise’ by learning to ask effective clarifying questions.

Result: The method significantly improves LLM’s proficiency in asking funnel questions and effectively eliciting user preferences.

Conclusion: The proposed diffusion-inspired approach successfully trains LLMs to generate better sequential clarifying questions for preference elicitation in recommendation systems.

Abstract: Large Language Models (LLMs) have made it possible for recommendation systems to interact with users in open-ended conversational interfaces. In order to personalize LLM responses, it is crucial to elicit user preferences, especially when there is limited user history. One way to get more information is to present clarifying questions to the user. However, generating effective sequential clarifying questions across various domains remains a challenge. To address this, we introduce a novel approach for training LLMs to ask sequential questions that reveal user preferences. Our method follows a two-stage process inspired by diffusion models. Starting from a user profile, the forward process generates clarifying questions to obtain answers and then removes those answers step by step, serving as a way to add noise'' to the user profile. The reverse process involves training a model to denoise’’ the user profile by learning to ask effective clarifying questions. Our results show that our method significantly improves the LLM’s proficiency in asking funnel questions and eliciting user preferences effectively.

Andries Rosseau, Raphaël Avalos, Ann Nowé

Main category: cs.AI

TL;DR: A multi-agent reinforcement learning framework inspired by biological evolution, using genetic similarity and inclusive fitness to drive social dynamics and cooperation.

Details

Motivation: To model the evolutionary forces of natural selection in AI systems, creating more socially intelligent agents through genetic-based cooperation mechanisms.

Method: Multi-agent reinforcement learning with genetic assignment and inclusive fitness reward functions, tested in network games with prisoner’s dilemmas.

Result: Social dynamics aligned with biological principles like Hamilton’s rule, enabling non-team-based cooperation patterns and spectrum of cooperation based on genetic similarity.

Conclusion: Gene-based inclusive fitness provides foundation for emergent strategic complexity and social intelligence in multi-agent systems, creating evolutionary arms races analogous to biological evolution.

Abstract: The competitive and cooperative forces of natural selection have driven the evolution of intelligence for millions of years, culminating in nature’s vast biodiversity and the complexity of human minds. Inspired by this process, we propose a novel multi-agent reinforcement learning framework where each agent is assigned a genotype and where reward functions are modelled after the concept of inclusive fitness. An agent’s genetic material may be shared with other agents, and our inclusive reward function naturally accounts for this. We study the resulting social dynamics in two types of network games with prisoner’s dilemmas and find that our results align with well-established principles from biology, such as Hamilton’s rule. Furthermore, we outline how this framework can extend to more open-ended environments with spatial and temporal structure, finite resources, and evolving populations. We hypothesize the emergence of an arms race of strategies, where each new strategy is a gradual improvement over earlier adaptations of other agents, effectively producing a multi-agent autocurriculum analogous to biological evolution. In contrast to the binary team-based structures prevalent in earlier research, our gene-based reward structure introduces a spectrum of cooperation ranging from full adversity to full cooperativeness based on genetic similarity, enabling unique non team-based social dynamics. For example, one agent having a mutual cooperative relationship with two other agents, while the two other agents behave adversarially towards each other. We argue that incorporating inclusive fitness in agents provides a foundation for the emergence of more strategically advanced and socially intelligent agents.

[285] CausalTrace: A Neurosymbolic Causal Analysis Agent for Smart Manufacturing

Chathurangi Shyalika, Aryaman Sharma, Fadi El Kalach, Utkarshani Jaimini, Cory Henson, Ramy Harik, Amit Sheth

Main category: cs.AI

TL;DR: CausalTrace is a neurosymbolic causal analysis module integrated into SmartPilot industrial CoPilot that combines prediction, explanation, and causal reasoning for manufacturing anomaly analysis, achieving strong performance in root cause analysis and expert agreement.

Details

Motivation: Modern manufacturing needs interpretable AI systems that integrate prediction, explanation, and causal reasoning, as existing black-box AI lacks transparency and practical utility in high-stakes industrial environments.

Method: CausalTrace performs data-driven causal analysis enriched by industrial ontologies and knowledge graphs, supporting causal discovery, counterfactual reasoning, and root cause analysis with real-time operator interaction.

Result: In rocket assembly testbed: substantial expert agreement (ROUGE-1: 0.91), strong RCA performance (MAP@3: 94%, PR@2: 97%, MRR: 0.92, Jaccard: 0.92), and high C3AN evaluation score (4.59/5).

Conclusion: CausalTrace demonstrates precision and reliability for live deployment, providing transparent, explainable decision support that bridges the gap between AI predictions and practical industrial applications.

Abstract: Modern manufacturing environments demand not only accurate predictions but also interpretable insights to process anomalies, root causes, and potential interventions. Existing AI systems often function as isolated black boxes, lacking the seamless integration of prediction, explanation, and causal reasoning required for a unified decision-support solution. This fragmentation limits their trustworthiness and practical utility in high-stakes industrial environments. In this work, we present CausalTrace, a neurosymbolic causal analysis module integrated into the SmartPilot industrial CoPilot. CausalTrace performs data-driven causal analysis enriched by industrial ontologies and knowledge graphs, including advanced functions such as causal discovery, counterfactual reasoning, and root cause analysis (RCA). It supports real-time operator interaction and is designed to complement existing agents by offering transparent, explainable decision support. We conducted a comprehensive evaluation of CausalTrace using multiple causal assessment methods and the C3AN framework (i.e. Custom, Compact, Composite AI with Neurosymbolic Integration), which spans principles of robustness, intelligence, and trustworthiness. In an academic rocket assembly testbed, CausalTrace achieved substantial agreement with domain experts (ROUGE-1: 0.91 in ontology QA) and strong RCA performance (MAP@3: 94%, PR@2: 97%, MRR: 0.92, Jaccard: 0.92). It also attained 4.59/5 in the C3AN evaluation, demonstrating precision and reliability for live deployment.

[286] Do Large Language Models Respect Contracts? Evaluating and Enforcing Contract-Adherence in Code Generation

Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han

Main category: cs.AI

TL;DR: PACT is a framework that evaluates LLM-generated code for both functional correctness and contract adherence (handling ill-formed inputs), addressing limitations in existing benchmarks like HumanEval+ and MBPP+.

Details

Motivation: Existing code generation benchmarks focus only on functional correctness with well-formed inputs, ignoring contract adherence - how code handles ill-formed inputs according to preconditions and validity constraints.

Method: PACT extends HumanEval+ and MBPP+ with contract-violating test cases, analyzes code generation under varied prompting conditions, and introduces novel metrics to quantify contract adherence.

Result: Augmenting prompts with contract-violating test cases significantly improves models’ ability to respect contracts compared to using contract descriptions alone.

Conclusion: PACT provides rigorous metrics to evaluate code robustness in both functionality and contract-adherence, revealing critical errors that conventional benchmarks overlook.

Abstract: Prevailing code generation benchmarks, such as HumanEval+ and MBPP+, primarily evaluate large language models (LLMs) with pass@k on functional correctness using well-formed inputs. However, they ignore a crucial aspect of real-world software: adherence to contracts-the preconditions and validity constraints that dictate how ill-formed inputs must be rejected. This critical oversight means that existing benchmarks fail to measure, and models consequently fail to generate, truly robust and reliable code snippets. We introduce PACT, a program assessment and contract-adherence evaluation framework, to bridge this gap. PACT is the first framework designed to systematically evaluate and enhance contract-adherence in LLM-generated code snippets alongside functional correctness. PACT’s contributions are threefold: First, it provides a comprehensive test-suite corpus focused on contract violations, extending HumanEval+ and MBPP+. Second, it enables a systematic analysis of code generation under varied prompting conditions. This analysis demonstrates that augmenting prompts with contract-violating test cases significantly enhance a model’s ability to respect contracts compared to using contract description alone. Finally, it introduces novel metrics to rigorously quantify contract adherence in both test generation and code generation. By revealing critical errors that conventional benchmarks overlook, PACT provides the rigorous and interpretable metrics to evaluate the robustness of LLM-generated code snippets in both functionality and contract-adherence.Our code and data are available at https://github.com/suhanmen/PACT.

[287] Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Marco Del Tredici, Jacob McCarran, Benjamin Breen, Javier Aspuru Mijares, Weichen Winston Yin, Jacob M. Taylor, Frank Koppens, Dirk Englund

Main category: cs.AI

TL;DR: Ax-Prover is a multi-agent system for automated theorem proving in Lean that combines LLMs with formal tools to solve problems across diverse scientific domains, demonstrating competitive performance on benchmarks and practical utility as a human collaborator.

Details

Motivation: To create a generalizable automated theorem prover that can handle diverse scientific domains while maintaining formal correctness, addressing the limitations of specialized systems that struggle with generalization.

Method: Combines Large Language Models (providing knowledge and reasoning) with Lean tools via Model Context Protocol (ensuring formal correctness) in a multi-agent system for formal proof generation.

Result: Competitive with state-of-the-art provers on public math benchmarks and significantly outperforms them on new benchmarks in abstract algebra and quantum theory. Successfully assisted an expert mathematician in formalizing a complex cryptography theorem.

Conclusion: The tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains, overcoming the generalization limitations of specialized systems.

Abstract: We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperform them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover’s assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

[288] Empowering LLM Agents with Geospatial Awareness: Toward Grounded Reasoning for Wildfire Response

Yiheng Chen, Lingyao Li, Zihui Ma, Qikai Hu, Yilun Zhu, Min Deng, Runlong Yu

Main category: cs.AI

TL;DR: A Geospatial Awareness Layer (GAL) is introduced to ground LLM agents in structured earth data for disaster response, enabling evidence-based resource allocation recommendations by integrating infrastructure, demographic, terrain, and weather information.

Details

Motivation: Existing statistical approaches for disaster response lack semantic context, generalize poorly across events, and offer limited interpretability. LLMs provide few-shot generalization but remain text-bound and blind to geography.

Method: The GAL automatically retrieves and integrates infrastructure, demographic, terrain, and weather information from external geodatabases, assembling them into a concise, unit-annotated perception script. This enriched context enables agents to produce evidence-based resource-allocation recommendations reinforced by historical analogs and daily change signals.

Result: Evaluation in real wildfire scenarios across multiple LLM models shows that geospatially grounded agents can outperform baselines.

Conclusion: The proposed framework can generalize to other hazards such as floods and hurricanes, providing a more effective approach for disaster response.

Abstract: Effective disaster response is essential for safeguarding lives and property. Existing statistical approaches often lack semantic context, generalize poorly across events, and offer limited interpretability. While Large language models (LLMs) provide few-shot generalization, they remain text-bound and blind to geography. To bridge this gap, we introduce a Geospatial Awareness Layer (GAL) that grounds LLM agents in structured earth data. Starting from raw wildfire detections, GAL automatically retrieves and integrates infrastructure, demographic, terrain, and weather information from external geodatabases, assembling them into a concise, unit-annotated perception script. This enriched context enables agents to produce evidence-based resource-allocation recommendations (e.g., personnel assignments, budget allocations), further reinforced by historical analogs and daily change signals for incremental updates. We evaluate the framework in real wildfire scenarios across multiple LLM models, showing that geospatially grounded agents can outperform baselines. The proposed framework can generalize to other hazards such as floods and hurricanes.

[289] ThinkPilot: Steering Reasoning Models via Automated Think-prefixes Optimization

Sunzhu Li, Zhiyu Lin, Shuling Yang, Jiale Zhao, Wei Chen

Main category: cs.AI

TL;DR: ThinkPilot is a training-free framework that uses evolutionary process to generate think-prefixes, optimizing Large Reasoning Models’ reasoning behaviors and improving performance across multiple metrics.

Details

Motivation: Large Reasoning Models suffer from inefficient and off-target reasoning, and current training-free methods are limited to rigid heuristics or non-actionable analyses.

Method: Uses evolutionary process to generate think-prefixes driven by a taxonomy of reasoning behaviors to guide models toward superior performance.

Result: Significantly improves accuracy-length trade-off for efficient reasoning, drastically improves safety (cutting StrongREJECT score from 27.0% to 0.7), enhances instruction following, and synergizes with existing training-based methods.

Conclusion: ThinkPilot provides a generalizable framework for aligning LRMs reasoning with task demands by automatically identifying and eliciting optimal reasoning behaviors.

Abstract: Large Reasoning Models (LRMs) are powerful, but they still suffer from inefficient and off-target reasoning. Currently, training-free methods are limited to either rigid heuristics or descriptive, non-actionable analyses. In this paper, we introduce ThinkPilot, a training-free framework that automatically optimizes LRMs reasoning. It uses an evolutionary process to generate think-prefixes, which are instructions that evolve driven by a taxonomy of reasoning behaviors to guide models toward superior performance. Extensive experiments demonstrate ThinkPilot’s broad effectiveness: it significantly improves the accuracy-length trade-off for efficient reasoning, drastically improves safety (for example, cutting the StrongREJECT score of DeepSeek-R1-Distill-Qwen-32B from 27.0% to 0.7), and enhances instruction following. It also synergizes with existing training-based methods. Our analysis reveals that think-prefixes can reliably control LRMs’ reasoning behaviors, and that different tasks have strong preferences for specific behavioral distributions. By automatically identifying and eliciting these behaviors, ThinkPilot provides a generalizable framework for aligning LRMs reasoning with task demands. Data and code are available at https://github.com/teqkilla/ThinkPilot

[290] AI Agents as Universal Task Solvers

Alessandro Achille, Stefano Soatto

Main category: cs.AI

TL;DR: AI agents can be viewed as computational systems, and this paper proposes transductive learning as a key principle for learning to reason, emphasizing time reduction over reconstruction error. It shows theoretical links between algorithmic information and optimal speed-up, and argues that scaling should focus on time optimization rather than just model size.

Details

Motivation: To understand whether AI reasoning agents can be universal, how they learn to reason, and whether scaling model size or training data is sufficient for true intelligence. The paper questions current approaches and seeks foundational principles for learning to reason.

Method: Reinterprets AI agents as compute-capable stochastic dynamical systems and proposes transductive learning as a shift from classical inductive learning. Uses theoretical analysis to link algorithmic information with optimal speed-up and derives power-law scaling relationships.

Result: Shows that optimal speed-up using past data is tightly related to algorithmic information. Derives theoretical power-law scaling of inference time versus training time. Demonstrates that scaling model size alone can lead to savant-like behavior without true intelligence.

Conclusion: The key quantity to optimize when scaling reasoning models is time, not just model size. Transductive learning and time reduction should be prioritized over reconstruction error minimization for developing truly intelligent AI agents.

Abstract: AI reasoning agents are already able to solve a variety of tasks by deploying tools, simulating outcomes of multiple hypotheses and reflecting on them. In doing so, they perform computation, although not in the classical sense – there is no program being executed. Still, if they perform computation, can AI agents be universal? Can chain-of-thought reasoning solve any computable task? How does an AI Agent learn to reason? Is it a matter of model size? Or training dataset size? In this work, we reinterpret the role of learning in the context of AI Agents, viewing them as compute-capable stochastic dynamical systems, and highlight the role of time in a foundational principle for learning to reason. In doing so, we propose a shift from classical inductive learning to transductive learning – where the objective is not to approximate the distribution of past data, but to capture their algorithmic structure to reduce the time needed to find solutions to new tasks. Transductive learning suggests that, counter to Shannon’s theory, a key role of information in learning is about reduction of time rather than reconstruction error. In particular, we show that the optimal speed-up that a universal solver can achieve using past data is tightly related to their algorithmic information. Using this, we show a theoretical derivation for the observed power-law scaling of inference time versus training time. We then show that scaling model size can lead to behaviors that, while improving accuracy on benchmarks, fail any reasonable test of intelligence, let alone super-intelligence: In the limit of infinite space and time, large models can behave as savants, able to brute-force through any task without any insight. Instead, we argue that the key quantity to optimize when scaling reasoning models is time, whose critical role in learning has so far only been indirectly considered.

[291] HiCoTraj:Zero-Shot Demographic Reasoning via Hierarchical Chain-of-Thought Prompting from Trajectory

Junyi Xie, Yuankun Jiao, Jina Kim, Yao-Yi Chiang, Lingyi Zhao, Khurram Shafique

Main category: cs.AI

TL;DR: HiCoTraj is a zero-shot framework that uses hierarchical chain-of-thought prompting with LLMs to infer demographic attributes from human mobility patterns without labeled training data.

Details

Motivation: Existing mobility-based demographic inference methods rely on large-scale labeled trajectory data, leading to limited interpretability and poor generalizability across datasets and user groups.

Method: Transforms trajectories into natural language representations (activity chronicles and multi-scale visiting summaries), then uses hierarchical chain-of-thought reasoning through three stages: factual feature extraction, behavioral pattern analysis, and demographic inference with structured output.

Result: Achieves competitive performance across multiple demographic attributes in zero-shot scenarios on real-world trajectory data.

Conclusion: HiCoTraj addresses the scarcity of labeled demographic data while providing transparent reasoning chains for demographic inference from mobility patterns.

Abstract: Inferring demographic attributes such as age, sex, or income level from human mobility patterns enables critical applications such as targeted public health interventions, equitable urban planning, and personalized transportation services. Existing mobility-based demographic inference studies heavily rely on large-scale trajectory data with demographic labels, leading to limited interpretability and poor generalizability across different datasets and user groups. We propose HiCoTraj (Zero-Shot Demographic Reasoning via Hierarchical Chain-of-Thought Prompting from Trajectory), a framework that leverages LLMs’ zero-shot learning and semantic understanding capabilities to perform demographic inference without labeled training data. HiCoTraj transforms trajectories into semantically rich, natural language representations by creating detailed activity chronicles and multi-scale visiting summaries. Then HiCoTraj uses a novel hierarchical chain of thought reasoning to systematically guide LLMs through three cognitive stages: factual feature extraction, behavioral pattern analysis, and demographic inference with structured output. This approach addresses the scarcity challenge of labeled demographic data while providing transparent reasoning chains. Experimental evaluation on real-world trajectory data demonstrates that HiCoTraj achieves competitive performance across multiple demographic attributes in zero-shot scenarios.

[292] EmboMatrix: A Scalable Training-Ground for Embodied Decision-Making

Zixing Lei, Sheng Yin, Yichen Xiong, Yuanzhuo Ding, Wenhao Huang, Yuxi Wei, Qingyao Xu, Yiming Li, Weixin Li, Yunhong Wang, Siheng Chen

Main category: cs.AI

TL;DR: EmboMatrix is a training ground infrastructure that enables LLMs to acquire embodied decision-making skills through simulation, interaction, and feedback, resulting in EmboBrain-7B outperforming much larger models on embodied benchmarks.

Details

Motivation: LLMs lack exposure to physical environments despite their general decision-making capabilities, limiting their true embodied understanding and preventing them from effectively translating high-level goals into executable actions in the physical world.

Method: Created EmboMatrix training ground with multi-agent data engine for task/scene generation, distributed heterogeneous-hardware system for scalable simulation, and multi-level reward architecture for precise supervision. Used this to train EmboBrain LLM through extensive embodied interactions.

Result: EmboBrain-7B surpasses the 671B DeepSeek-R1 baseline by 9.5% on two challenging embodied decision-making benchmarks, demonstrating superior performance despite being significantly smaller.

Conclusion: Interactive, environment-grounded learning through comprehensive training grounds is powerful for building truly intelligent embodied agents, enabling LLMs to develop genuine embodied decision-making skills.

Abstract: Embodied decision-making enables agents to translate high-level goals into executable actions through continuous interactions within the physical world, forming a cornerstone of general-purpose embodied intelligence. Large language models (LLMs), with their general decision-making capabilities, offer a promising path to realize this potential; however, LLMs trained solely on language lack exposure to physical environments, limiting their true embodied understanding. To bridge this gap, we propose the concept of a training ground: a comprehensive infrastructure that provides task and scene simulation, embodied interaction, and feedback signals, offering a one-stop solution for LLM acquire genuine embodied decision-making skills. In this work, we present EmboMatrix, the first training ground of its kind, providing massive and diverse tasks with efficient simulation and precise rewards. EmboMatrix incorporates a series of novel techniques: a multi-agent data engine for large-scale task and scene generation, a distributed heterogeneous-hardware system for scalable simulation, and a multi-level reward architecture for precise supervision. Leveraging EmboMatrix, we cultivate EmboBrain, an LLM whose embodied decision-making abilities emerge from extensive embodied interactions. Experiments show that EmboBrain-7B surpasses the 671B DeepSeek-R1 baseline by 9.5% on two challenging embodied decision-making benchmarks, demonstrating the power of interactive, environment-grounded learning for building truly intelligent embodied agents.

[293] BeSTAD: Behavior-Aware Spatio-Temporal Anomaly Detection for Human Mobility Data

Junyi Xie, Jina Kim, Yao-Yi Chiang, Lingyi Zhao, Khurram Shafique

Main category: cs.AI

TL;DR: BeSTAD is an unsupervised framework for detecting individual-level anomalies in human mobility data by learning personalized behavioral signatures and identifying deviations from normal patterns.

Details

Motivation: Traditional anomaly detection focuses on trajectory-level analysis, but detecting individual-level anomalies within large populations remains challenging due to the need to capture personal behavioral patterns and subtle deviations.

Method: BeSTAD learns semantically enriched mobility representations integrating location meaning and temporal patterns, uses behavior-cluster-aware modeling to build personalized behavioral profiles, and identifies anomalies through cross-period behavioral comparison with semantic alignment.

Result: The framework enables detection of behavioral shifts and deviations from established routines, and identifies individuals exhibiting such changes within large-scale mobility datasets.

Conclusion: BeSTAD advances anomaly detection toward personalized and interpretable mobility analysis by learning individual behaviors directly from unlabeled data.

Abstract: Traditional anomaly detection in human mobility has primarily focused on trajectory-level analysis, identifying statistical outliers or spatiotemporal inconsistencies across aggregated movement traces. However, detecting individual-level anomalies, i.e., unusual deviations in a person’s mobility behavior relative to their own historical patterns, within datasets encompassing large populations remains a significant challenge. In this paper, we present BeSTAD (Behavior-aware Spatio-Temporal Anomaly Detection for Human Mobility Data), an unsupervised framework that captures individualized behavioral signatures across large populations and uncovers fine-grained anomalies by jointly modeling spatial context and temporal dynamics. BeSTAD learns semantically enriched mobility representations that integrate location meaning and temporal patterns, enabling the detection of subtle deviations in individual movement behavior. BeSTAD further employs a behavior-cluster-aware modeling mechanism that builds personalized behavioral profiles from normal activity and identifies anomalies through cross-period behavioral comparison with consistent semantic alignment. Building on prior work in mobility behavior clustering, this approach enables not only the detection of behavioral shifts and deviations from established routines but also the identification of individuals exhibiting such changes within large-scale mobility datasets. By learning individual behaviors directly from unlabeled data, BeSTAD advances anomaly detection toward personalized and interpretable mobility analysis.

[294] Evaluating the Quality of Randomness and Entropy in Tasks Supported by Large Language Models

Rabimba Karanjai, Yang Lu, Ranjith Chodavarapu, Lei Xu, Weidong Shi

Main category: cs.AI

TL;DR: LLMs struggle with randomness tasks like generating random numbers, passwords, and shuffling items, showing inconsistent performance that deviates from expected behavior.

Details

Motivation: As LLMs are increasingly used in applications requiring randomness (stochastic decision-making, gaming, cryptography), it's unclear how well they can handle random number generation and utilization.

Method: Conducted experiments testing LLMs on various randomness tasks, considering factors like external tool access, task types, model states (fresh vs. non-fresh), and prompting strategies. Tasks included random number generation, password creation, item shuffling, and randomness quality evaluation using entropy and NIST test-suite.

Result: LLMs can generate outputs with some randomness, but performance is inconsistent and significantly deviates from expected behavior across different randomness tasks.

Conclusion: Current LLMs have significant limitations in handling randomness effectively, highlighting areas needing improvement for reliable use in applications requiring random processes.

Abstract: The rapid advancement of large language model (LLM) technology has led to diverse applications, many of which inherently require randomness, such as stochastic decision-making, gaming, scheduling, AI agents, and cryptography-related tasks. However, the capabilities of LLMs in handling randomness, particularly in generating and utilizing random numbers effectively, remain unclear. This paper investigates the capacity of LLMs for handling tasks that involve randomness through a series of experiments. We designed a set of experiments that consider various factors that can influence an LLM’s performance in tasks involving randomness, such as accessibility to external tools, types of tasks, model states (fresh vs. non-fresh), and prompting strategies. The experiments cover a range of tasks, including generating random numbers, generating random strings such as passwords, shuffling items, and evaluating the quality of randomness using entropy and the NIST randomness test-suite. Our findings reveal that while LLMs can generate outputs that exhibit some degree of randomness, their performance is inconsistent and often deviates significantly from the expected behavior. The analysis of the experimental results highlights key limitations and areas where improvement is needed for the LLMs to effectively handle tasks involving randomness

[295] One Life to Learn: Inferring Symbolic World Models for Stochastic Environments from Unguided Exploration

Zaid Khan, Archiki Prasad, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

Main category: cs.AI

TL;DR: OneLife framework learns symbolic world models in complex stochastic environments using conditionally-activated programmatic laws within probabilistic programming, enabling efficient inference and learning from minimal unguided interaction.

Details

Motivation: To address the challenge of learning symbolic world models in realistic settings with complex, stochastic environments where agents have only "one life" to explore without human guidance, unlike prior work focused on deterministic environments with abundant data.

Method: Uses conditionally-activated programmatic laws with precondition-effect structure within probabilistic programming, creating dynamic computation graphs that route inference only through relevant laws to handle complex hierarchical states and sparse rule activation.

Result: Outperformed strong baseline on 16 out of 23 scenarios in Crafter-OO environment, successfully learning key dynamics from minimal unguided interaction, and demonstrated effective planning through simulated rollouts identifying superior strategies.

Conclusion: Establishes foundation for autonomously constructing programmatic world models of unknown, complex environments through efficient probabilistic programming approach that handles stochastic dynamics and sparse interactions.

Abstract: Symbolic world modeling requires inferring and representing an environment’s transitional dynamics as an executable program. Prior work has focused on largely deterministic environments with abundant interaction data, simple mechanics, and human guidance. We address a more realistic and challenging setting, learning in a complex, stochastic environment where the agent has only “one life” to explore a hostile environment without human guidance. We introduce OneLife, a framework that models world dynamics through conditionally-activated programmatic laws within a probabilistic programming framework. Each law operates through a precondition-effect structure, activating in relevant world states. This creates a dynamic computation graph that routes inference and optimization only through relevant laws, avoiding scaling challenges when all laws contribute to predictions about a complex, hierarchical state, and enabling the learning of stochastic dynamics even with sparse rule activation. To evaluate our approach under these demanding constraints, we introduce a new evaluation protocol that measures (a) state ranking, the ability to distinguish plausible future states from implausible ones, and (b) state fidelity, the ability to generate future states that closely resemble reality. We develop and evaluate our framework on Crafter-OO, our reimplementation of the Crafter environment that exposes a structured, object-oriented symbolic state and a pure transition function that operates on that state alone. OneLife can successfully learn key environment dynamics from minimal, unguided interaction, outperforming a strong baseline on 16 out of 23 scenarios tested. We also test OneLife’s planning ability, with simulated rollouts successfully identifying superior strategies. Our work establishes a foundation for autonomously constructing programmatic world models of unknown, complex environments.

[296] Scaling Multi-Agent Epistemic Planning through GNN-Derived Heuristics

Giovanni Briglia, Francesco Fabiano, Stefano Mariani

Main category: cs.AI

TL;DR: This paper proposes using Graph Neural Networks (GNNs) to learn heuristics for Multi-agent Epistemic Planning (MEP) by capturing patterns in Kripke structures, improving scalability over traditional methods.

Details

Motivation: Existing heuristics don't work well with MEP's Kripke structure representation, leading to exponential search spaces and intractability in epistemic planning.

Method: Use Graph Neural Networks to learn relational patterns in epistemic states and derive state quality estimates, then integrate these predictive heuristics into the planning pipeline.

Result: The GNN-based heuristics show improvements in scalability compared to standard baselines for multi-agent epistemic planning.

Conclusion: GNNs effectively capture the graph-like nature of Kripke models and provide meaningful guidance for epistemic planning, enhancing solver scalability.

Abstract: Multi-agent Epistemic Planning (MEP) is an autonomous planning framework for reasoning about both the physical world and the beliefs of agents, with applications in domains where information flow and awareness among agents are critical. The richness of MEP requires states to be represented as Kripke structures, i.e., directed labeled graphs. This representation limits the applicability of existing heuristics, hindering the scalability of epistemic solvers, which must explore an exponential search space without guidance, resulting often in intractability. To address this, we exploit Graph Neural Networks (GNNs) to learn patterns and relational structures within epistemic states, to guide the planning process. GNNs, which naturally capture the graph-like nature of Kripke models, allow us to derive meaningful estimates of state quality – e.g., the distance from the nearest goal – by generalizing knowledge obtained from previously solved planning instances. We integrate these predictive heuristics into an epistemic planning pipeline and evaluate them against standard baselines, showing improvements in the scalability of multi-agent epistemic planning.

[297] ToPolyAgent: AI Agents for Coarse-Grained Topological Polymer Simulations

Lijie Ding, Jan-Michael Carrillo, Changwoo Do

Main category: cs.AI

TL;DR: ToPolyAgent is a multi-agent AI framework that enables coarse-grained molecular dynamics simulations of topological polymers through natural language instructions, integrating LLMs with computational tools for both interactive and autonomous workflows.

Details

Motivation: To lower barriers to complex computational workflows in polymer science and advance AI-driven materials discovery by coupling natural language interfaces with rigorous simulation tools.

Method: Uses four LLM-powered agents: Config Agent for initial configurations, Simulation Agent for LAMMPS-based MD simulations, Report Agent for markdown reports, and Workflow Agent for autonomous operations. Supports both interactive and autonomous modes.

Result: Successfully demonstrated versatility through case studies across diverse polymer architectures under varying conditions. Showed potential as research assistant by investigating interaction parameters on linear polymer conformation and grafting density effects on brush polymer persistence length.

Conclusion: ToPolyAgent lays the foundation for autonomous and extensible multi-agent scientific research ecosystems in polymer science, making complex computational workflows more accessible through natural language interfaces.

Abstract: We introduce ToPolyAgent, a multi-agent AI framework for performing coarse-grained molecular dynamics (MD) simulations of topological polymers through natural language instructions. By integrating large language models (LLMs) with domain-specific computational tools, ToPolyAgent supports both interactive and autonomous simulation workflows across diverse polymer architectures, including linear, ring, brush, and star polymers, as well as dendrimers. The system consists of four LLM-powered agents: a Config Agent for generating initial polymer-solvent configurations, a Simulation Agent for executing LAMMPS-based MD simulations and conformational analyses, a Report Agent for compiling markdown reports, and a Workflow Agent for streamlined autonomous operations. Interactive mode incorporates user feedback loops for iterative refinements, while autonomous mode enables end-to-end task execution from detailed prompts. We demonstrate ToPolyAgent’s versatility through case studies involving diverse polymer architectures under varying solvent condition, thermostats, and simulation lengths. Furthermore, we highlight its potential as a research assistant by directing it to investigate the effect of interaction parameters on the linear polymer conformation, and the influence of grafting density on the persistence length of the brush polymer. By coupling natural language interfaces with rigorous simulation tools, ToPolyAgent lowers barriers to complex computational workflows and advances AI-driven materials discovery in polymer science. It lays the foundation for autonomous and extensible multi-agent scientific research ecosystems.

[298] Precise Attribute Intensity Control in Large Language Models via Targeted Representation Editing

Rongzhi Zhang, Liqin Ye, Yuzhao Heng, Xiang Chen, Tong Yu, Lingkai Kong, Sudheer Chava, Chao Zhang

Main category: cs.AI

TL;DR: This paper introduces a method for precise attribute intensity control in LLMs, enabling fine-grained control over specific attribute intensities rather than just directional guidance.

Details

Motivation: Current LLM alignment methods only provide directional or open-ended guidance, failing to reliably achieve exact attribute intensities needed for AI systems adaptable to diverse user expectations.

Method: Three key designs: (1) reformulating control as target-reaching problem, (2) training lightweight value function via temporal-difference learning to predict attribute scores from partial generations, and (3) using gradient-based interventions on hidden representations to navigate toward specific targets.

Result: Experiments on LLaMA-3.2-3b and Phi-4-mini confirm the method’s ability to steer text generation to user-specified attribute intensities with high accuracy.

Conclusion: The method enables fine-grained continuous control over attribute intensities and demonstrates efficiency enhancements across preference data synthesis, Pareto frontier approximation, and distillation of aligned behaviors.

Abstract: Precise attribute intensity control–generating Large Language Model (LLM) outputs with specific, user-defined attribute intensities–is crucial for AI systems adaptable to diverse user expectations. Current LLM alignment methods, however, typically provide only directional or open-ended guidance, failing to reliably achieve exact attribute intensities. We address this limitation with three key designs: (1) reformulating precise attribute intensity control as a target-reaching problem, rather than simple maximization; (2) training a lightweight value function via temporal-difference learning to predict final attribute intensity scores from partial generations, thereby steering LLM outputs; and (3) employing gradient-based interventions on hidden representations to navigate the model precisely towards specific attribute intensity targets. Our method enables fine-grained, continuous control over attribute intensities, moving beyond simple directional alignment. Experiments on LLaMA-3.2-3b and Phi-4-mini confirm our method’s ability to steer text generation to user-specified attribute intensities with high accuracy. Finally, we demonstrate efficiency enhancements across three downstream tasks: preference data synthesis, Pareto frontier approximation and optimization, and distillation of aligned behaviors for intervention-free inference. Our code is available on https://github.com/Pre-Control/pre-control

[299] MatSciBench: Benchmarking the Reasoning Ability of Large Language Models in Materials Science

Junkai Zhang, Jingru Gan, Xiaoxuan Wang, Zian Jia, Changquan Gu, Jianpeng Chen, Yanqiao Zhu, Mingyu Derek Ma, Dawei Zhou, Ling Li, Wei Wang

Main category: cs.AI

TL;DR: MatSciBench is a comprehensive college-level benchmark with 1,340 materials science problems spanning 6 fields and 31 sub-fields, featuring multimodal reasoning and three difficulty levels. Leading LLMs achieve under 80% accuracy, showing the benchmark’s complexity.

Details

Motivation: To address the underexplored reasoning capabilities of LLMs in materials science by creating a comprehensive benchmark that can systematically evaluate and drive improvements in scientific reasoning.

Method: Developed MatSciBench with structured taxonomy, three-tier difficulty classification, multimodal reasoning through visual contexts, and detailed reference solutions. Evaluated various reasoning strategies including chain-of-thought, tool augmentation, and self-correction.

Result: Even the highest-performing model (Gemini-2.5-Pro) achieved under 80% accuracy on college-level materials science questions. No single reasoning method consistently excelled across all scenarios, highlighting the complexity of materials science reasoning.

Conclusion: MatSciBench establishes a solid benchmark for assessing and improving LLMs’ scientific reasoning in materials science, revealing significant challenges in multimodal reasoning and the need for continued development of reasoning strategies.

Abstract: Large Language Models (LLMs) have demonstrated remarkable abilities in scientific reasoning, yet their reasoning capabilities in materials science remain underexplored. To fill this gap, we introduce MatSciBench, a comprehensive college-level benchmark comprising 1,340 problems that span the essential subdisciplines of materials science. MatSciBench features a structured and fine-grained taxonomy that categorizes materials science questions into 6 primary fields and 31 sub-fields, and includes a three-tier difficulty classification based on the reasoning length required to solve each question. MatSciBench provides detailed reference solutions enabling precise error analysis and incorporates multimodal reasoning through visual contexts in numerous questions. Evaluations of leading models reveal that even the highest-performing model, Gemini-2.5-Pro, achieves under 80% accuracy on college-level materials science questions, highlighting the complexity of MatSciBench. Our systematic analysis of different reasoning strategie–basic chain-of-thought, tool augmentation, and self-correction–demonstrates that no single method consistently excels across all scenarios. We further analyze performance by difficulty level, examine trade-offs between efficiency and accuracy, highlight the challenges inherent in multimodal reasoning tasks, analyze failure modes across LLMs and reasoning methods, and evaluate the influence of retrieval-augmented generation. MatSciBench thus establishes a comprehensive and solid benchmark for assessing and driving improvements in the scientific reasoning capabilities of LLMs within the materials science domain.

[300] Evolution of meta’s llama models and parameter-efficient fine-tuning of large language models: a survey

Abdulhady Abas Abdullah, Arkaitz Zubiaga, Seyedali Mirjalili, Amir H. Gandomi, Fatemeh Daneshfar, Mohammadsadra Amini, Alan Salam Mohammed, Hadi Veisi

Main category: cs.AI

TL;DR: This survey comprehensively reviews Meta AI’s LLaMA model series evolution (LLaMA 1-4) and parameter-efficient fine-tuning (PEFT) methods, analyzing architectures, performance, and real-world applications.

Details

Motivation: To provide a comprehensive resource for ML researchers and practitioners on the rapidly evolving LLaMA model family and efficient fine-tuning strategies, addressing the need for structured analysis of model architectures and adaptation methods.

Method: The paper surveys LLaMA foundation models (7B-65B to 288B parameters) including multimodal and Mixture-of-Experts variants, and reviews five PEFT methods: LoRA, LLaMA-Adapter V1/V2, LLaMA-Excitor, and QLoRA, analyzing their mechanisms, parameter savings, and applications.

Result: The survey provides structured analysis of model architectures, parameter counts, and benchmark results, showing cases where fine-tuned LLaMA models outperform larger baselines, and examines successful real-world applications in domains like legal and medical fields.

Conclusion: The paper serves as a one-stop resource for understanding LLaMA models and efficient fine-tuning, while identifying ongoing challenges and future research directions including scaling to larger contexts and improving robustness.

Abstract: This review surveys the rapid evolution of Meta AI’s LLaMA (Large Language Model Meta AI) series - from LLaMA 1 through LLaMA 4 and the specialized parameter-efficient fine-tuning (PEFT) methods developed for these models. We first describe the LLaMA family of foundation models (7B-65B to 288B parameters), their architectures (including native multimodal and Mixtureof-Experts variants), and key performance characteristics. We then describe and discuss the concept of PEFT, which adapts large pre-trained models by updating only a small subset of parameters, and review five PEFT methods that have been applied to LLaMA: LoRA (Low-Rank Adaptation), LLaMA-Adapter V1 and V2, LLaMA-Excitor, and QLoRA (Quantized LoRA). We discuss each method’s mechanism, parameter savings, and example application to LLaMA (e.g., instruction tuning, multimodal tasks). We provide structured discussion and analysis of model and adapter architectures, parameter counts, and benchmark results (including examples where fine-tuned LLaMA models outperform larger baselines). Finally, we examine real-world use cases where LLaMA-based models and PEFT have been successfully applied (e.g., legal and medical domains), and we discuss ongoing challenges and future research directions (such as scaling to even larger contexts and improving robustness). This survey paper provides a one-stop resource for ML researchers and practitioners interested in LLaMA models and efficient fine-tuning strategies.

[301] ResearStudio: A Human-Intervenable Framework for Building Controllable Deep-Research Agents

Linyi Yang, Yixuan Weng

Main category: cs.AI

TL;DR: ResearStudio is an open-source framework that enables real-time human control over deep-research agents, allowing users to pause, edit plans, run custom commands, and switch between AI-led and human-led modes while achieving state-of-the-art performance on benchmarks.

Details

Motivation: Current deep-research agents operate in a 'fire-and-forget' mode without allowing users to fix errors or add expert knowledge during execution, creating a need for systems that provide real-time human control.

Method: Uses a Collaborative Workshop design with hierarchical Planner-Executor that writes steps to a live ‘plan-as-document’, a fast communication layer that streams actions to a web interface, and allows users to pause, edit plans/code, run custom commands, and resume execution.

Result: Achieves state-of-the-art results on the GAIA benchmark, surpassing systems like OpenAI’s DeepResearch and Manus, while demonstrating that strong automated performance and fine-grained human control can coexist.

Conclusion: ResearStudio successfully combines automated performance with real-time human control, providing an open-source framework for safe and controllable research agents that supports flexible collaboration between humans and AI.

Abstract: Current deep-research agents run in a ‘‘fire-and-forget’’ mode: once started, they give users no way to fix errors or add expert knowledge during execution. We present ResearStudio, the first open-source framework that places real-time human control at its core. The system follows a Collaborative Workshop design. A hierarchical Planner-Executor writes every step to a live ‘‘plan-as-document,’’ a fast communication layer streams each action, file change, and tool call to a web interface. At any moment, the user can pause the run, edit the plan or code, run custom commands, and resume – switching smoothly between AI-led, human-assisted and human-led, AI-assisted modes. In fully autonomous mode, ResearStudio achieves state-of-the-art results on the GAIA benchmark, surpassing systems like OpenAI’s DeepResearch and Manus. These results show that strong automated performance and fine-grained human control can coexist. The full code, protocol, and evaluation scripts are available at https://github.com/ResearAI/ResearStudio. We will continue to update the repository to encourage further work on safe and controllable research agents. Our live demo is publicly accessible at http://ai-researcher.net:3000/. We support the development of DeepScientist, which can be accessed at https://github.com/ResearAI/DeepScientist.

[302] On the Design and Evaluation of Human-centered Explainable AI Systems: A Systematic Review and Taxonomy

Aline Mangold, Juliane Zietz, Susanne Weinhold, Sebastian Pannasch

Main category: cs.AI

TL;DR: This paper provides a comprehensive review of 65 user studies evaluating Explainable AI (XAI) systems, proposing human-centered design goals and evaluation metrics adapted to users with different AI expertise levels.

Details

Motivation: As AI becomes more common, there's increasing demand for systems that are both performant and understandable. Current XAI evaluation processes are too technical and not sufficiently focused on human user needs, highlighting the need for better human-centered evaluation frameworks.

Method: Conducted a comprehensive review of 65 user studies evaluating XAI systems across different domains and application contexts. Proposed human-centered design goals and extended existing XAI evaluation frameworks by adapting them to users with different AI expertise levels (AI novices and data experts).

Result: Key findings include: distinction between core system and XAI explanation as components of the whole system; classification of evaluation metrics into affection, cognition, usability, interpretability, and explanation metrics; identification of different design goals for AI novices (responsible use, acceptance, usability) versus data experts (performance-oriented goals like human-AI collaboration and task performance).

Conclusion: The paper provides a holistic framework for human-centered XAI evaluation and design, extending existing frameworks to better accommodate users’ specific characteristics and expertise levels, thereby improving the development of more effective and user-friendly XAI systems.

Abstract: As AI becomes more common in everyday living, there is an increasing demand for intelligent systems that are both performant and understandable. Explainable AI (XAI) systems aim to provide comprehensible explanations of decisions and predictions. At present, however, evaluation processes are rather technical and not sufficiently focused on the needs of human users. Consequently, evaluation studies involving human users can serve as a valuable guide for conducting user studies. This paper presents a comprehensive review of 65 user studies evaluating XAI systems across different domains and application contexts. As a guideline for XAI developers, we provide a holistic overview of the properties of XAI systems and evaluation metrics focused on human users (human-centered). We propose objectives for the human-centered design (design goals) of XAI systems. To incorporate users’ specific characteristics, design goals are adapted to users with different levels of AI expertise (AI novices and data experts). In this regard, we provide an extension to existing XAI evaluation and design frameworks. The first part of our results includes the analysis of XAI system characteristics. An important finding is the distinction between the core system and the XAI explanation, which together form the whole system. Further results include the distinction of evaluation metrics into affection towards the system, cognition, usability, interpretability, and explanation metrics. Furthermore, the users, along with their specific characteristics and behavior, can be assessed. For AI novices, the relevant extended design goals include responsible use, acceptance, and usability. For data experts, the focus is performance-oriented and includes human-AI collaboration and system and user task performance.

[303] GOAT: A Training Framework for Goal-Oriented Agent with Tools

Hyunji Min, Sangwon Jung, Junyoung Sung, Dosung Lee, Leekyeung Han, Paul Hongsuck Seo

Main category: cs.AI

TL;DR: GOAT is a training framework that enables fine-tuning of LLM agents for goal-oriented API execution tasks without human annotation, achieving state-of-the-art performance on benchmarks.

Details

Motivation: Current LLM agents struggle with goal-oriented queries that require decomposing objectives into interdependent API calls, and smaller open-source models perform poorly compared to proprietary models like GPT-4.

Method: GOAT automatically constructs synthetic datasets from API documents and fine-tunes LLM agents to reason over interdependent API calls and generate coherent responses in a human annotation-free setting.

Result: GOAT-trained agents achieve state-of-the-art performance across multiple goal-oriented benchmarks and also excel on the newly introduced GOATBench benchmark.

Conclusion: GOAT provides a practical approach for building robust open-source LLM agents capable of complex reasoning and tool use.

Abstract: Large language models (LLMs) have recently been extended beyond traditional text generation to serve as interactive agents capable of using external tools based on user intent. However, current LLM agents still show limited ability to handle goal-oriented queries, which require decomposing a high-level objective into multiple interdependent API calls with correct planning and execution. Current approaches mainly rely on zero-shot evaluation due to the absence of training data. While proprietary closed-source models such as GPT-4 demonstrate strong reasoning abilities, smaller open-source models struggle to perform complex tool use effectively. Thus, we propose a novel training framework GOAT, which enables fine-tuning of LLM agents in a human annotation-free setting. GOAT automatically constructs synthetic datasets of goal-oriented API execution tasks directly from given API documents, equipping models with the ability to reason over interdependent calls and generate coherent responses. Through extensive experiments, we show that GOAT-trained agents achieve state-of-the-art performance across multiple existing goal-oriented benchmarks. In addition, we introduce GOATBench, a new goal-oriented API execution benchmark, and demonstrate that agents trained with GOAT also excel in this setting. These results highlight GOAT as a practical path toward building robust open-source LLM agents capable of complex reasoning and tool use.

[304] MedKGEval: A Knowledge Graph-Based Multi-Turn Evaluation Framework for Open-Ended Patient Interactions with Clinical LLMs

Yuechun Yu, Han Ying, Haoan Jin, Wenjian Jiang, Dong Xian, Binghao Wang, Zhou Yang, Mengyue Wu

Main category: cs.AI

TL;DR: MedKGEval is a novel multi-turn evaluation framework for clinical LLMs that uses knowledge graph-driven patient simulation and in-situ turn-level assessment to capture the dynamic nature of medical dialogues.

Details

Motivation: Existing evaluation methods for LLMs in medical applications rely on post hoc transcript review, neglecting the dynamic, context-sensitive nature of real clinical interactions and evolving patient informational needs.

Method: The framework uses: (1) knowledge graph-driven patient simulation with a control module retrieving medical facts from curated knowledge graphs; (2) in-situ turn-level evaluation where each response is assessed by a Judge Agent for clinical appropriateness, factual correctness, and safety; (3) comprehensive multi-turn benchmarking of eight state-of-the-art LLMs.

Result: MedKGEval demonstrates the ability to identify subtle behavioral flaws and safety risks that conventional evaluation pipelines often overlook, providing a more realistic assessment of LLM performance in clinical settings.

Conclusion: The framework offers a robust evaluation approach for clinical LLMs that captures the complexity of multi-turn doctor-patient interactions, with extensibility to additional languages through knowledge graph switching.

Abstract: The reliable evaluation of large language models (LLMs) in medical applications remains an open challenge, particularly in capturing the complexity of multi-turn doctor-patient interactions that unfold in real clinical environments. Existing evaluation methods typically rely on post hoc review of full conversation transcripts, thereby neglecting the dynamic, context-sensitive nature of medical dialogues and the evolving informational needs of patients. In this work, we present MedKGEval, a novel multi-turn evaluation framework for clinical LLMs grounded in structured medical knowledge. Our approach introduces three key contributions: (1) a knowledge graph-driven patient simulation mechanism, where a dedicated control module retrieves relevant medical facts from a curated knowledge graph, thereby endowing the patient agent with human-like and realistic conversational behavior. This knowledge graph is constructed by integrating open-source resources with additional triples extracted from expert-annotated datasets; (2) an in-situ, turn-level evaluation framework, where each model response is assessed by a Judge Agent for clinical appropriateness, factual correctness, and safety as the dialogue progresses using a suite of fine-grained, task-specific metrics; (3) a comprehensive multi-turn benchmark of eight state-of-the-art LLMs, demonstrating MedKGEval’s ability to identify subtle behavioral flaws and safety risks that are often overlooked by conventional evaluation pipelines. Although initially designed for Chinese and English medical applications, our framework can be readily extended to additional languages by switching the input knowledge graphs, ensuring seamless bilingual support and domain-specific applicability.

[305] PromptFlow: Training Prompts Like Neural Networks

Jingyi Wang, Hongyuan Zhu, Ye Niu, Yunhui Deng

Main category: cs.AI

TL;DR: PromptFlow is a modular framework for automated prompt engineering that uses meta-learning and reinforcement learning to dynamically optimize prompts for LLMs across diverse NLP tasks with minimal training data.

Details

Motivation: Current automated prompt engineering methods use static update rules, lack dynamic strategy selection, update entire prompts without granular editing, and don't effectively recycle experience in LLMs, leading to suboptimal adaptation to varying NLP task requirements.

Method: Proposed PromptFlow framework inspired by TensorFlow, integrating meta-prompts, operators, optimization, and evaluator. Uses gradient-based meta-learning to autonomously explore optimal prompt refinement trajectories and reinforcement learning to recycle experience in the prompt engineering process.

Result: Extensive experiments on various datasets demonstrate the effectiveness of PromptFlow in optimizing prompts for LLMs across different NLP tasks.

Conclusion: PromptFlow provides an effective modular framework for automated prompt engineering that addresses limitations of current methods through dynamic strategy selection, granular prompt editing, and experience recycling using meta-learning and reinforcement learning approaches.

Abstract: Large Language Models (LLMs) have demonstrated profound impact on Natural Language Processing (NLP) tasks. However, their effective deployment across diverse domains often require domain-specific adaptation strategies, as generic models may underperform when faced with specialized data distributions. Recent advances in prompt engineering (PE) offer a promising alternative to extensive retraining by refining input instructions to align LLM outputs with task objectives. This paradigm has emerged as a rapid and versatile approach for model fine-tuning. Despite its potential, manual prompt design remains labor-intensive and heavily depends on specialized expertise, often requiring iterative human effort to achieve optimal formulations. To address this limitation, automated prompt engineering methodologies have been developed to systematically generate task-specific prompts. However, current implementations predominantly employ static update rules and lack mechanisms for dynamic strategy selection, resulting in suboptimal adaptation to varying NLP task requirements. Furthermore, most methods treat and update the whole prompts at each step, without considering editing prompt sections at a finer granularity. At last, in particular, the problem of how to recycle experience in LLM is still underexplored. To this end, we propose the PromptFlow, a modular training framework inspired by TensorFlow, which integrates meta-prompts, operators, optimization, and evaluator. Our framework can be equipped with the latest optimization methods and autonomously explores optimal prompt refinement trajectories through gradient-based meta-learning, requiring minimal task-specific training data. Specifically, we devise a reinforcement learning method to recycle experience for LLM in the PE process. Finally, we conduct extensive experiments on various datasets, and demonstrate the effectiveness of PromptFlow.

[306] $\mathbf{T^3}$: Reducing Belief Deviation in Reinforcement Learning for Active Reasoning

Deyu Zou, Yongqiang Chen, Jianxiang Wang, Haochen Yang, Mufei Li, James Cheng, Pan Li, Yu Gong

Main category: cs.AI

TL;DR: T³ method detects belief deviation in LLM-based agents during active reasoning and truncates uninformative trajectory tails to improve policy optimization.

Details

Motivation: LLM-based agents suffer from belief deviation during active reasoning, leading to uninformative actions and compounding errors that hinder reinforcement learning training.

Method: Proposed T³ method tracks belief deviation, detects excessive deviation, and truncates trajectories during training to remove uninformative tails while preserving credit for informative prefixes.

Result: Across 5 challenging tasks, T³ improved training stability, token efficiency, and final performance by up to 30% while reducing rollout tokens by approximately 25%.

Conclusion: Belief control is a key principle for developing robust and generalizable LLM-based active reasoners.

Abstract: Active reasoning requires large language models (LLMs) to interact with external sources and strategically gather information to solve problems. Central to this process is belief tracking: maintaining a coherent understanding of the problem state and the missing information toward the solution. However, due to limited reasoning capabilities, LLM-based agents often suffer from belief deviation: they struggle to correctly model beliefs, lose track of problem states, and fall into uninformative or repetitive actions. Once this happens, errors compound and reinforcement learning (RL) training fails to properly credit the crucial exploratory steps. To address this issue, we propose to track the deviation of model beliefs and develop $\mathbf{T^3}$, a simple yet effective method that detects excessive belief deviation and truncates trajectories during training to remove uninformative tails. By preserving credit for informative prefixes, $\mathbf{T^3}$ systematically improves policy optimization. Across 5 challenging tasks, $\mathbf{T^3}$ consistently enhances training stability, token efficiency, and final performance, achieving up to 30% gains while cutting rollout tokens by roughly 25%. These results highlight belief control as a key principle for developing robust and generalizable LLM-based active reasoners.

[307] Tensor Logic: The Language of AI

Pedro Domingos

Main category: cs.AI

TL;DR: Tensor logic is a new programming language that unifies neural and symbolic AI by treating logical rules and Einstein summation as the same operation, enabling elegant implementation of various AI approaches.

Details

Motivation: Current AI tools like PyTorch and TensorFlow lack support for automated reasoning and knowledge acquisition, while traditional AI languages like LISP and Prolog lack scalability and learning support, creating a gap in AI development.

Method: The paper proposes tensor logic, which uses tensor equations as its sole construct, unifying logical rules with Einstein summation to create a fundamental language for AI.

Result: Tensor logic enables implementation of transformers, formal reasoning, kernel machines, and graphical models, and enables new capabilities like sound reasoning in embedding space.

Conclusion: Tensor logic combines the scalability of neural networks with the reliability of symbolic reasoning, potentially forming a basis for wider AI adoption by solving fundamental language limitations.

Abstract: Progress in AI is hindered by the lack of a programming language with all the requisite features. Libraries like PyTorch and TensorFlow provide automatic differentiation and efficient GPU implementation, but are additions to Python, which was never intended for AI. Their lack of support for automated reasoning and knowledge acquisition has led to a long and costly series of hacky attempts to tack them on. On the other hand, AI languages like LISP an Prolog lack scalability and support for learning. This paper proposes tensor logic, a language that solves these problems by unifying neural and symbolic AI at a fundamental level. The sole construct in tensor logic is the tensor equation, based on the observation that logical rules and Einstein summation are essentially the same operation, and all else can be reduced to them. I show how to elegantly implement key forms of neural, symbolic and statistical AI in tensor logic, including transformers, formal reasoning, kernel machines and graphical models. Most importantly, tensor logic makes new directions possible, such as sound reasoning in embedding space. This combines the scalability and learnability of neural networks with the reliability and transparency of symbolic reasoning, and is potentially a basis for the wider adoption of AI.

[308] RAG-Anything: All-in-One RAG Framework

Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang

Main category: cs.AI

TL;DR: RAG-Anything is a unified multimodal retrieval-augmented generation framework that addresses the limitations of text-only RAG systems by enabling comprehensive knowledge retrieval across all modalities including text, images, tables, and math expressions.

Details

Motivation: Current RAG systems are limited to text-only content, creating fundamental gaps when processing real-world multimodal knowledge repositories that contain rich combinations of text, visual elements, structured tables, and mathematical expressions.

Method: The framework reconceptualizes multimodal content as interconnected knowledge entities using dual-graph construction to capture cross-modal relationships and textual semantics. It employs cross-modal hybrid retrieval combining structural knowledge navigation with semantic matching.

Result: RAG-Anything demonstrates superior performance on challenging multimodal benchmarks with significant improvements over state-of-the-art methods, particularly on long documents where traditional approaches fail.

Conclusion: The framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems and enabling effective reasoning over heterogeneous content spanning multiple modalities.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inherently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types. The framework introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. We develop cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching. This enables effective reasoning over heterogeneous content where relevant evidence spans multiple modalities. RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. Performance gains become particularly pronounced on long documents where traditional approaches fail. Our framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. Our framework is open-sourced at: https://github.com/HKUDS/RAG-Anything.

[309] O-Forge: An LLM + Computer Algebra Framework for Asymptotic Analysis

Ayush Khaitan, Vijay Ganesh

Main category: cs.AI

TL;DR: LLM+CAS framework combines frontier LLMs with computer algebra systems to produce creative and verified proofs for asymptotic inequalities, addressing the verification challenge in AI-assisted mathematical research.

Details

Motivation: To overcome the verification difficulty in using LLMs for research mathematics, where plausible-looking proofs cannot be trusted without rigorous checking, and to answer Terry Tao's question about whether LLMs coupled with verifiers can help prove intricate asymptotic inequalities.

Method: LLM+CAS framework with O-Forge tool that couples frontier LLMs with computer algebra systems in an In-Context Symbolic Feedback loop - LLM suggests domain decompositions and CAS provides axiomatic verification of each piece.

Result: The framework proves remarkably effective at proposing appropriate domain decompositions for asymptotic inequalities, moving AI beyond contest math towards research-level tools for professional mathematicians.

Conclusion: LLM+CAS successfully demonstrates that AI can help prove intricate asymptotic inequalities and serve as research-level tools for professional mathematicians by combining creative suggestions from LLMs with rigorous verification from CAS.

Abstract: Large language models have recently demonstrated advanced capabilities in solving IMO and Putnam problems; yet their role in research mathematics has remained fairly limited. The key difficulty is verification: suggested proofs may look plausible, but cannot be trusted without rigorous checking. We present a framework, called LLM+CAS, and an associated tool, O-Forge, that couples frontier LLMs with a computer algebra systems (CAS) in an In-Context Symbolic Feedback loop to produce proofs that are both creative and symbolically verified. Our focus is on asymptotic inequalities, a topic that often involves difficult proofs and appropriate decomposition of the domain into the “right” subdomains. Many mathematicians, including Terry Tao, have suggested that using AI tools to find the right decompositions can be very useful for research-level asymptotic analysis. In this paper, we show that our framework LLM+CAS turns out to be remarkably effective at proposing such decompositions via a combination of a frontier LLM and a CAS. More precisely, we use an LLM to suggest domain decomposition, and a CAS (such as Mathematica) that provides a verification of each piece axiomatically. Using this loop, we answer a question posed by Terence Tao: whether LLMs coupled with a verifier can be used to help prove intricate asymptotic inequalities. More broadly, we show how AI can move beyond contest math towards research-level tools for professional mathematicians.

[310] A Survey of Vibe Coding with Large Language Models

Yuyao Ge, Lingrui Mei, Zenghao Duan, Tianhao Li, Yujia Zheng, Yiwei Wang, Lexin Wang, Jiayu Yao, Tianyu Liu, Yujun Cai, Baolong Bi, Fangda Guo, Jiafeng Guo, Shenghua Liu, Xueqi Cheng

Main category: cs.AI

TL;DR: This survey provides the first comprehensive review of Vibe Coding with LLMs, establishing theoretical foundations and practical frameworks for this AI-driven development paradigm where developers validate implementations through outcome observation rather than code comprehension.

Details

Motivation: The advancement of LLMs has enabled autonomous coding agents and Vibe Coding methodology, but its effectiveness remains under-explored with empirical evidence showing productivity losses and human-AI collaboration challenges.

Method: Systematic analysis of over 1000 research papers, formalizing Vibe Coding through Constrained Markov Decision Process, and synthesizing practices into five development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models.

Result: Established the first comprehensive taxonomy of Vibe Coding, revealing that success depends on systematic context engineering, well-established development environments, and human-agent collaborative models rather than just agent capabilities.

Conclusion: Vibe Coding represents a transformative development approach that requires systematic frameworks and collaborative models to overcome current productivity challenges and realize its full potential in human-AI software development.

Abstract: The advancement of large language models (LLMs) has catalyzed a paradigm shift from code generation assistance to autonomous coding agents, enabling a novel development methodology termed “Vibe Coding” where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored, with empirical evidence revealing unexpected productivity losses and fundamental challenges in human-AI collaboration. To address this gap, this survey provides the first comprehensive and systematic review of Vibe Coding with large language models, establishing both theoretical foundations and practical frameworks for this transformative development approach. Drawing from systematic analysis of over 1000 research papers, we survey the entire vibe coding ecosystem, examining critical infrastructure components including LLMs for coding, LLM-based coding agent, development environment of coding agent, and feedback mechanisms. We first introduce Vibe Coding as a formal discipline by formalizing it through a Constrained Markov Decision Process that captures the dynamic triadic relationship among human developers, software projects, and coding agents. Building upon this theoretical foundation, we then synthesize existing practices into five distinct development models: Unconstrained Automation, Iterative Conversational Collaboration, Planning-Driven, Test-Driven, and Context-Enhanced Models, thus providing the first comprehensive taxonomy in this domain. Critically, our analysis reveals that successful Vibe Coding depends not merely on agent capabilities but on systematic context engineering, well-established development environments, and human-agent collaborative development models.

[311] PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks

Yunuo Liu, Dawei Zhu, Zena Al-Khalili, Dai Cheng, Yanjun Chen, Dietrich Klakow, Wei Zhang, Xiaoyu Shen

Main category: cs.AI

TL;DR: PricingLogic is the first benchmark to test LLMs’ ability to automate tourism pricing with complex fare rules, revealing significant reliability issues in revenue-critical applications.

Details

Motivation: Travel agencies want to automate error-prone pricing tasks using AI, but deploying unreliable LLMs could cause financial losses and damage customer trust.

Method: Created 300 natural-language questions based on 42 real pricing policies, with two difficulty levels: basic customer-type pricing and complex bundled-tour calculations with interacting discounts.

Result: LLMs show steep performance drop on harder tier, exposing systematic failures in rule interpretation and arithmetic reasoning.

Conclusion: Despite general capabilities, current LLMs remain unreliable for revenue-critical applications without additional safeguards or domain adaptation.

Abstract: We present PricingLogic, the first benchmark that probes whether Large Language Models(LLMs) can reliably automate tourism-related prices when multiple, overlapping fare rules apply. Travel agencies are eager to offload this error-prone task onto AI systems; however, deploying LLMs without verified reliability could result in significant financial losses and erode customer trust. PricingLogic comprises 300 natural-language questions based on booking requests derived from 42 real-world pricing policies, spanning two levels of difficulty: (i) basic customer-type pricing and (ii)bundled-tour calculations involving interacting discounts. Evaluations of a line of LLMs reveal a steep performance drop on the harder tier,exposing systematic failures in rule interpretation and arithmetic reasoning.These results highlight that, despite their general capabilities, today’s LLMs remain unreliable in revenue-critical applications without further safeguards or domain adaptation. Our code and dataset are available at https://github.com/EIT-NLP/PricingLogic.

[312] MTOS: A LLM-Driven Multi-topic Opinion Simulation Framework for Exploring Echo Chamber Dynamics

Dingyi Zuo, Hongjie Zhang, Jie Ou, Chaosheng Feng, Shuwan Liu

Main category: cs.AI

TL;DR: MTOS is a social simulation framework that integrates multi-topic contexts with LLMs to address limitations in existing opinion modeling approaches, enabling realistic simulation of opinion evolution across interrelated topics.

Details

Motivation: Existing LLM-based studies focus on single topics and cannot capture cognitive transfer in multi-topic contexts, while traditional numerical models lack interpretability and behavioral consistency when dealing with complex linguistic attitudes across multiple topics.

Method: MTOS combines LLMs with short-term/long-term memory, multiple user-selection interaction mechanisms, dynamic topic-selection strategies, and belief decay mechanisms to enable perspective updates across topics.

Result: Multi-topic settings significantly alter polarization trends: positively correlated topics amplify echo chambers, negatively correlated topics inhibit them, and irrelevant topics mitigate echo chambers through resource competition. LLM-based agents realistically simulate dynamic opinion changes and capture complex human reasoning.

Conclusion: MTOS improves simulation interpretability and system stability compared to numerical models, demonstrating that multi-topic contexts are crucial for understanding opinion evolution and echo chamber formation in social networks.

Abstract: The polarization of opinions, information segregation, and cognitive biases on social media have attracted significant academic attention. In real-world networks, information often spans multiple interrelated topics, posing challenges for opinion evolution and highlighting the need for frameworks that simulate interactions among topics. Existing studies based on large language models (LLMs) focus largely on single topics, limiting the capture of cognitive transfer in multi-topic, cross-domain contexts. Traditional numerical models, meanwhile, simplify complex linguistic attitudes into discrete values, lacking interpretability, behavioral consistency, and the ability to integrate multiple topics. To address these issues, we propose Multi-topic Opinion Simulation (MTOS), a social simulation framework integrating multi-topic contexts with LLMs. MTOS leverages LLMs alongside short-term and long-term memory, incorporates multiple user-selection interaction mechanisms and dynamic topic-selection strategies, and employs a belief decay mechanism to enable perspective updates across topics. We conduct extensive experiments on MTOS, varying topic numbers, correlation types, and performing ablation studies to assess features such as group polarization and local consistency. Results show that multi-topic settings significantly alter polarization trends: positively correlated topics amplify echo chambers, negatively correlated topics inhibit them, and irrelevant topics also mitigate echo chamber effects through resource competition. Compared with numerical models, LLM-based agents realistically simulate dynamic opinion changes, reproduce linguistic features of news texts, and capture complex human reasoning, improving simulation interpretability and system stability.

[313] Biased-Attention Guided Risk Prediction for Safe Decision-Making at Unsignalized Intersections

Chengyang Dong, Nan Guo

Main category: cs.AI

TL;DR: Proposes a DRL framework with biased attention for autonomous driving at unsignalized intersections, using SAC algorithm and risk predictor to improve safety and efficiency.

Details

Motivation: Autonomous driving decision-making at unsignalized intersections is challenging due to complex dynamic interactions and high conflict risks, requiring proactive safety control.

Method: Deep reinforcement learning framework based on Soft Actor-Critic (SAC) algorithm, integrated with biased attention mechanism to construct traffic risk predictor that assesses long-term collision risk and transforms it into dense reward signals.

Result: Simulation results demonstrate effective improvement in both traffic efficiency and vehicle safety at intersections, proving the framework’s effectiveness in complex scenarios.

Conclusion: The proposed intelligent decision-making framework successfully addresses the challenges of unsignalized intersections through DRL with biased attention, achieving safer and more efficient autonomous driving.

Abstract: Autonomous driving decision-making at unsignalized intersections is highly challenging due to complex dynamic interactions and high conflict risks. To achieve proactive safety control, this paper proposes a deep reinforcement learning (DRL) decision-making framework integrated with a biased attention mechanism. The framework is built upon the Soft Actor-Critic (SAC) algorithm. Its core innovation lies in the use of biased attention to construct a traffic risk predictor. This predictor assesses the long-term risk of collision for a vehicle entering the intersection and transforms this risk into a dense reward signal to guide the SAC agent in making safe and efficient driving decisions. Finally, the simulation results demonstrate that the proposed method effectively improves both traffic efficiency and vehicle safety at the intersection, thereby proving the effectiveness of the intelligent decision-making framework in complex scenarios. The code of our work is available at https://github.com/hank111525/SAC-RWB.

[314] Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems

Jiaxin Gao, Chen Chen, Yanwen Jia, Xueluan Gong, Kwok-Yan Lam, Qian Wang

Main category: cs.AI

TL;DR: This paper investigates judgment biases in LLM-as-a-judge models, examining 11 types of biases and finding that current LLM judges show robustness to biased inputs but can be degraded by fine-tuning on biased data.

Details

Motivation: LLMs are increasingly used to autonomously evaluate content quality in communication systems, but their impartiality is not guaranteed, and biases could undermine user trust.

Method: Systematically investigated judgment biases in GPT-Judge and JudgeLM models under point-wise scoring setting, covering 11 types of biases including both implicit and explicit forms.

Result: State-of-the-art LLM judges demonstrate robustness to biased inputs, assigning them lower scores than clean samples. Fine-tuning on high-scoring biased responses degrades performance. Scores correlate with task difficulty.

Conclusion: Proposed four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.

Abstract: Large Language Models (LLMs) are increasingly being used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses in telecom customer support chatbots. However, the impartiality of these AI “judges” is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases in two LLM-as-a-judge models (i.e., GPT-Judge and JudgeLM) under the point-wise scoring setting, encompassing 11 types of biases that cover both implicit and explicit forms. We observed that state-of-the-art LLM judges demonstrate robustness to biased inputs, generally assigning them lower scores than the corresponding clean samples. Providing a detailed scoring rubric further enhances this robustness. We further found that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also discovered that the judged scores correlate with task difficulty: a challenging dataset like GPQA yields lower average scores, whereas an open-ended reasoning dataset (e.g., JudgeLM-val) sees higher average scores. Finally, we proposed four potential mitigation strategies to ensure fair and reliable AI judging in practical communication scenarios.

[315] Using Medical Algorithms for Task-Oriented Dialogue in LLM-Based Medical Interviews

Rui Reis, Pedro Rangel Henriques, João Ferreira-Coimbra, Eva Oliveira, Nuno F. Rodrigues

Main category: cs.AI

TL;DR: A DAG-based task-oriented dialogue framework for medical interviews that transforms clinical guidelines into question flows, uses cold-start and adaptive branching mechanisms, and generates structured reports with high usability scores.

Details

Motivation: To create an efficient medical dialogue system that reduces cognitive workload for both patients and physicians while ensuring clinical guideline compliance and automated report generation.

Method: DAG-structured question flow with systematic pipeline for guideline transformation, cold-start hierarchical clustering, expand-and-prune adaptive branching, termination logic, and automated report synthesis guided by HCI principles.

Result: Patient app: NASA-TLX=15.6 (low workload), SUS=86 (high usability), QUIS=8.1/9. Physician app: NASA-TLX=26 (moderate workload), SUS=88.5 (excellent usability), QUIS=8.3/9. Both effectively integrated into clinical workflows.

Conclusion: The framework successfully reduces cognitive demand and supports efficient clinical reporting, though limited by occasional latency and small evaluation sample.

Abstract: We developed a task-oriented dialogue framework structured as a Directed Acyclic Graph (DAG) of medical questions. The system integrates: (1) a systematic pipeline for transforming medical algorithms and guidelines into a clinical question corpus; (2) a cold-start mechanism based on hierarchical clustering to generate efficient initial questioning without prior patient information; (3) an expand-and-prune mechanism enabling adaptive branching and backtracking based on patient responses; (4) a termination logic to ensure interviews end once sufficient information is gathered; and (5) automated synthesis of doctor-friendly structured reports aligned with clinical workflows. Human-computer interaction principles guided the design of both the patient and physician applications. Preliminary evaluation involved five physicians using standardized instruments: NASA-TLX (cognitive workload), the System Usability Scale (SUS), and the Questionnaire for User Interface Satisfaction (QUIS). The patient application achieved low workload scores (NASA-TLX = 15.6), high usability (SUS = 86), and strong satisfaction (QUIS = 8.1/9), with particularly high ratings for ease of learning and interface design. The physician application yielded moderate workload (NASA-TLX = 26) and excellent usability (SUS = 88.5), with satisfaction scores of 8.3/9. Both applications demonstrated effective integration into clinical workflows, reducing cognitive demand and supporting efficient report generation. Limitations included occasional system latency and a small, non-diverse evaluation sample.

[316] Artificial Intelligence Virtual Cells: From Measurements to Decisions across Modality, Scale, Dynamics, and Evaluation

Chengpeng Hu, Calvin Yu-Chian Chen

Main category: cs.AI

TL;DR: AIVCs aim to model cell states from multimodal data, but face challenges in cross-lab transport, data leakage, and cross-scale coupling. The paper proposes a Cell-State Latent perspective with operator grammar for better evaluation and data design.

Details

Motivation: Current AIVC evaluations are limited to single datasets and settings, with poor transport across labs/platforms, data leakage issues, and inadequate handling of dose/time effects. Cross-scale coupling remains constrained.

Method: Proposes a model-agnostic Cell-State Latent perspective organized via operator grammar: measurement, lift/project for cross-scale coupling, and intervention for dosing/scheduling. Emphasizes decision-aligned evaluation across modality, scale, context and intervention.

Result: The approach enables reproducible, like-for-like comparisons through operator-aware data design, leakage-resistant partitions, and transparent calibration/reporting. Focuses on function-space readouts like pathway activity and clinical endpoints.

Conclusion: The CSL perspective provides a systematic framework for AIVC development that addresses current limitations in evaluation, cross-scale coupling, and intervention handling, enabling more robust and clinically relevant cell state modeling.

Abstract: Artificial Intelligence Virtual Cells (AIVCs) aim to learn executable, decision-relevant models of cell state from multimodal, multiscale measurements. Recent studies have introduced single-cell and spatial foundation models, improved cross-modality alignment, scaled perturbation atlases, and explored pathway-level readouts. Nevertheless, although held-out validation is standard practice, evaluations remain predominantly within single datasets and settings; evidence indicates that transport across laboratories and platforms is often limited, that some data splits are vulnerable to leakage and coverage bias, and that dose, time and combination effects are not yet systematically handled. Cross-scale coupling also remains constrained, as anchors linking molecular, cellular and tissue levels are sparse, and alignment to scientific or clinical readouts varies across studies. We propose a model-agnostic Cell-State Latent (CSL) perspective that organizes learning via an operator grammar: measurement, lift/project for cross-scale coupling, and intervention for dosing and scheduling. This view motivates a decision-aligned evaluation blueprint across modality, scale, context and intervention, and emphasizes function-space readouts such as pathway activity, spatial neighborhoods and clinically relevant endpoints. We recommend operator-aware data design, leakage-resistant partitions, and transparent calibration and reporting to enable reproducible, like-for-like comparisons.

[317] ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification

Utsav Kumar Nareti, Suraj Kumar, Soumya Pandey, Soumi Chattopadhyay, Chandranath Adak

Main category: cs.AI

TL;DR: ProtoSiTex is a semi-interpretable framework for fine-grained multi-label text classification that uses dual-phase training with prototype discovery and hierarchical consistency across text levels.

Details

Motivation: Address the limitations of existing prototype-based models that operate at coarse granularity and fail to handle multi-label classification in real-world scenarios with user-generated reviews.

Method: Uses dual-phase alternating training: unsupervised prototype discovery for semantically coherent prototypes, and supervised classification mapping prototypes to labels. Employs hierarchical loss for consistency across sub-sentence, sentence, and document levels, with adaptive prototypes and multi-head attention for overlapping semantics.

Result: Achieves state-of-the-art performance on a new hotel review benchmark dataset and two public benchmarks, providing faithful, human-aligned explanations.

Conclusion: ProtoSiTex establishes itself as a robust solution for semi-interpretable multi-label text classification, effectively capturing fine-grained insights with interpretable prototypes.

Abstract: The surge in user-generated reviews has amplified the need for interpretable models that can provide fine-grained insights. Existing prototype-based models offer intuitive explanations but typically operate at coarse granularity (sentence or document level) and fail to address the multi-label nature of real-world text classification. We propose ProtoSiTex, a semi-interpretable framework designed for fine-grained multi-label text classification. ProtoSiTex employs a dual-phase alternating training strategy: an unsupervised prototype discovery phase that learns semantically coherent and diverse prototypes, and a supervised classification phase that maps these prototypes to class labels. A hierarchical loss function enforces consistency across sub-sentence, sentence, and document levels, enhancing interpretability and alignment. Unlike prior approaches, ProtoSiTex captures overlapping and conflicting semantics using adaptive prototypes and multi-head attention. We also introduce a benchmark dataset of hotel reviews annotated at the sub-sentence level with multiple labels. Experiments on this dataset and two public benchmarks (binary and multi-class) show that ProtoSiTex achieves state-of-the-art performance while delivering faithful, human-aligned explanations, establishing it as a robust solution for semi-interpretable multi-label text classification.

[318] HardcoreLogic: Challenging Large Reasoning Models with Long-tail Logic Puzzle Games

Jingcong Liang, Shijun Wan, Xuehai Wu, Siyuan Wang, Yitong Li, Qianglong Chen, Duyu Tang, Zhongyu Wei

Main category: cs.AI

TL;DR: HardcoreLogic is a challenging benchmark of 5,000+ puzzles across 10 games that tests Large Reasoning Models’ ability to handle non-canonical game variants through systematic transformations, revealing significant performance drops and limitations in genuine reasoning.

Details

Motivation: Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants.

Method: Introduces HardcoreLogic benchmark with systematic transformations through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization.

Result: Evaluations show significant performance drops even for top-performing models, with increased complexity being the dominant source of difficulty, and models struggling with subtle rule variations that don’t necessarily increase puzzle difficulty.

Conclusion: HardcoreLogic exposes limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning, highlighting gaps in genuine reasoning capabilities.

Abstract: Large Reasoning Models (LRMs) have demonstrated impressive performance on complex tasks, including logical puzzle games that require deriving solutions satisfying all constraints. However, whether they can flexibly apply appropriate rules to varying conditions, particularly when faced with non-canonical game variants, remains an open question. Existing corpora focus on popular puzzles like 9x9 Sudoku, risking overfitting to canonical formats and memorization of solution patterns, which can mask deficiencies in understanding novel rules or adapting strategies to new variants. To address this, we introduce HardcoreLogic, a challenging benchmark of over 5,000 puzzles across 10 games, designed to test the robustness of LRMs on the “long-tail” of logical games. HardcoreLogic systematically transforms canonical puzzles through three dimensions: Increased Complexity (IC), Uncommon Elements (UE), and Unsolvable Puzzles (UP), reducing reliance on shortcut memorization. Evaluations on a diverse set of LRMs reveal significant performance drops, even for models achieving top scores on existing benchmarks, indicating heavy reliance on memorized stereotypes. While increased complexity is the dominant source of difficulty, models also struggle with subtle rule variations that do not necessarily increase puzzle difficulty. Our systematic error analysis on solvable and unsolvable puzzles further highlights gaps in genuine reasoning. Overall, HardcoreLogic exposes the limitations of current LRMs and establishes a benchmark for advancing high-level logical reasoning.

[319] Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang

Main category: cs.AI

TL;DR: Proposes Memory-as-Action framework where agents actively manage working memory through explicit editing operations as part of a unified policy, enabling joint optimization of task reasoning and memory management via reinforcement learning.

Details

Motivation: LLMs struggle with long-horizon tasks due to limited memory being overwhelmed by irrelevant context, and existing memory methods use external heuristics decoupled from core policy.

Method: Memory-as-Action framework with explicit memory editing operations, trained via Dynamic Context Policy Optimization algorithm that handles trajectory fractures caused by memory edits.

Result: Joint optimization reduces computational consumption and improves task performance through adaptive context curation strategies tailored to model capabilities.

Conclusion: Treating memory management as a learnable intrinsic capability enables more efficient and effective performance in long-horizon agentic tasks.

Abstract: Large Language Models face challenges in long-horizon agentic tasks as their constrained memory is easily overwhelmed by distracting or irrelevant context. Existing working memory methods typically rely on external, heuristic mechanisms that are decoupled from the agent’s core policy. In this work, we reframe working memory management as a learnable, intrinsic capability. We propose a novel framework, Memory-as-Action, where an agent actively manages its working memory by executing explicit editing operations as part of a unified policy. This formulation allows an agent, trained via reinforcement learning, to balance memory curation against long-term task objectives under given resource constraints. However, such memory editing actions break the standard assumption of a continuously growing prefix in LLM interactions, leading to what we call trajectory fractures. These non-prefix changes disrupt the causal continuity required by standard policy gradient methods, making those methods inapplicable. To address this, we propose a new algorithm, Dynamic Context Policy Optimization, which enables stable end-to-end reinforcement learning by segmenting trajectories at memory action points and applying trajectory-level advantages to the resulting action segments. Our results demonstrate that jointly optimizing for task reasoning and memory management in an end-to-end fashion not only reduces overall computational consumption but also improves task performance, driven by adaptive context curation strategies tailored to the model’s intrinsic capabilities.

[320] ERA: Transforming VLMs into Embodied Agents via Embodied Prior Learning and Online Reinforcement Learning

Hanyang Chen, Mark Zhao, Rui Yang, Qinwei Ma, Ke Yang, Jiarui Yao, Kangrui Wang, Hao Bai, Zhenhailong Wang, Rui Pan, Mengchao Zhang, Jose Barreiros, Aykut Onol, ChengXiang Zhai, Heng Ji, Manling Li, Huan Zhang, Tong Zhang

Main category: cs.AI

TL;DR: ERA is a two-stage framework that combines prior knowledge learning and online RL to create efficient embodied AI agents, achieving significant performance improvements over large models with much smaller parameter count.

Details

Motivation: Bridge the gap between costly large VLMs and underperforming small VLMs for embodied AI tasks by developing a more efficient framework that leverages prior knowledge and reinforcement learning.

Method: Two-stage approach: 1) Embodied Prior Learning with three types of priors (trajectory-augmented, environment-anchored, external knowledge), 2) Online RL pipeline with self-summarization, dense reward shaping, and turn-level policy optimization.

Result: ERA-3B outperforms GPT-4o by 8.4% on EB-ALFRED and 19.4% on EB-Manipulation tasks, and shows strong generalization to unseen tasks.

Conclusion: ERA provides a practical path toward scalable embodied intelligence with methodological insights for future embodied AI systems.

Abstract: Recent advances in embodied AI highlight the potential of vision language models (VLMs) as agents capable of perception, reasoning, and interaction in complex environments. However, top-performing systems rely on large-scale models that are costly to deploy, while smaller VLMs lack the necessary knowledge and skills to succeed. To bridge this gap, we present \textit{Embodied Reasoning Agent (ERA)}, a two-stage framework that integrates prior knowledge learning and online reinforcement learning (RL). The first stage, \textit{Embodied Prior Learning}, distills foundational knowledge from three types of data: (1) Trajectory-Augmented Priors, which enrich existing trajectory data with structured reasoning generated by stronger models; (2) Environment-Anchored Priors, which provide in-environment knowledge and grounding supervision; and (3) External Knowledge Priors, which transfer general knowledge from out-of-environment datasets. In the second stage, we develop an online RL pipeline that builds on these priors to further enhance agent performance. To overcome the inherent challenges in agent RL, including long horizons, sparse rewards, and training instability, we introduce three key designs: self-summarization for context management, dense reward shaping, and turn-level policy optimization. Extensive experiments on both high-level planning (EB-ALFRED) and low-level control (EB-Manipulation) tasks demonstrate that ERA-3B surpasses both prompting-based large models and previous training-based baselines. Specifically, it achieves overall improvements of 8.4% on EB-ALFRED and 19.4% on EB-Manipulation over GPT-4o, and exhibits strong generalization to unseen tasks. Overall, ERA offers a practical path toward scalable embodied intelligence, providing methodological insights for future embodied AI systems.

[321] Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

Tianyu Hu, Zhen Tan, Song Wang, Huaizhi Qu, Tianlong Chen

Main category: cs.AI

TL;DR: A multi-agent debate framework for LLM judges that improves judgment accuracy over majority voting through collaborative reasoning and adaptive stopping based on consensus dynamics.

Details

Motivation: Current LLM-as-Judge approaches use simplistic aggregation methods like majority voting, which can fail even when individual agents provide correct answers, highlighting the need for more sophisticated judgment frameworks.

Method: Proposed a multi-agent debate judge framework where agents collaboratively reason and iteratively refine responses. Introduced stability detection via time-varying Beta-Binomial mixture modeling of judge consensus dynamics with adaptive stopping using Kolmogorov-Smirnov test.

Result: Experiments across multiple benchmarks and models show improved judgment accuracy over majority voting while maintaining computational efficiency.

Conclusion: The multi-agent debate framework with stability detection effectively enhances LLM judgment accuracy by amplifying correctness through collaborative reasoning and adaptive consensus modeling.

Abstract: With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework where agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models judge consensus dynamics via a time-varying Beta-Binomial mixture, with adaptive stopping based on distributional similarity (Kolmogorov-Smirnov test). This mechanism models the judges’ collective correct rate dynamics using a time-varying mixture of Beta-Binomial distributions and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.

[322] CAMNet: Leveraging Cooperative Awareness Messages for Vehicle Trajectory Prediction

Mattia Grasselli, Angelo Porrello, Carlo Augusto Grazia

Main category: cs.AI

TL;DR: CAMNet uses Cooperative Awareness Messages (CAMs) for vehicle trajectory prediction, showing promising results despite some limitations.

Details

Motivation: Autonomous driving safety is challenged by sensor limitations like obstructed field of view. Vehicle-to-vehicle communication via CAMs can enhance situational awareness when sensors are occluded.

Method: Designed and trained CAMNet neural network on motion forecasting dataset, then evaluated on custom CAM dataset to test effectiveness of CAM data for trajectory prediction.

Result: The approach demonstrates promising results, showing that CAMs can effectively support vehicle trajectory prediction.

Conclusion: CAMs show potential for vehicle trajectory prediction, though several limitations exist that present opportunities for future research.

Abstract: Autonomous driving remains a challenging task, particularly due to safety concerns. Modern vehicles are typically equipped with expensive sensors such as LiDAR, cameras, and radars to reduce the risk of accidents. However, these sensors face inherent limitations: their field of view and line of sight can be obstructed by other vehicles, thereby reducing situational awareness. In this context, vehicle-to-vehicle communication plays a crucial role, as it enables cars to share information and remain aware of each other even when sensors are occluded. One way to achieve this is through the use of Cooperative Awareness Messages (CAMs). In this paper, we investigate the use of CAM data for vehicle trajectory prediction. Specifically, we design and train a neural network, Cooperative Awareness Message-based Graph Neural Network (CAMNet), on a widely used motion forecasting dataset. We then evaluate the model on a second dataset that we created from scratch using Cooperative Awareness Messages, in order to assess whether this type of data can be effectively exploited. Our approach demonstrates promising results, showing that CAMs can indeed support vehicle trajectory prediction. At the same time, we discuss several limitations of the approach, which highlight opportunities for future research.

[323] Towards Robust Artificial Intelligence: Self-Supervised Learning Approach for Out-of-Distribution Detection

Wissam Salhab, Darine Ameyed, Hamid Mcheick, Fehmi Jaafar

Main category: cs.AI

TL;DR: Proposes a self-supervised learning approach with graph-theoretical techniques for OOD detection without labeled data, achieving AUROC=0.99.

Details

Motivation: Improve AI robustness in safety-critical systems by enabling reliable OOD detection without requiring labeled data.

Method: Combines self-supervised learning for representation learning from unlabeled data with graph-theoretical techniques for OOD sample identification.

Result: Achieved state-of-the-art performance with AUROC=0.99, outperforming existing methods.

Conclusion: The approach effectively enhances AI system robustness through unsupervised OOD detection using self-supervised learning and graph theory.

Abstract: Robustness in AI systems refers to their ability to maintain reliable and accurate performance under various conditions, including out-of-distribution (OOD) samples, adversarial attacks, and environmental changes. This is crucial in safety-critical systems, such as autonomous vehicles, transportation, or healthcare, where malfunctions could have severe consequences. This paper proposes an approach to improve OOD detection without the need of labeled data, thereby increasing the AI systems’ robustness. The proposed approach leverages the principles of self-supervised learning, allowing the model to learn useful representations from unlabeled data. Combined with graph-theoretical techniques, this enables the more efficient identification and categorization of OOD samples. Compared to existing state-of-the-art methods, this approach achieved an Area Under the Receiver Operating Characteristic Curve (AUROC) = 0.99.

[324] Clutch Control: An Attention-based Combinatorial Bandit for Efficient Mutation in JavaScript Engine Fuzzing

Myles Foley, Sergio Maffeis, Muhammad Fakhrur Rozi, Takeshi Takahashi

Main category: cs.AI

TL;DR: CLUTCH is a deep combinatorial bandit approach that uses attention mechanisms and Concrete Dropout to intelligently select mutation targets in JavaScript fuzzing, outperforming state-of-the-art methods in coverage and efficiency.

Details

Motivation: Existing JavaScript fuzzing techniques use random mutation target selection, which is inefficient. The problem of selecting better mutation targets is suitable for combinatorial bandits with volatile arms.

Method: Proposes CLUTCH - a deep combinatorial bandit that observes variable-length JavaScript test case representations using attention mechanisms, and dynamically adapts exploration using Concrete Dropout.

Result: CLUTCH increases JavaScript fuzzing efficiency by 20.3% more valid test cases and 8.9% higher coverage-per-testcase on average. Achieves at least 78.1% and 4.1% less regret in volatile and combinatorial settings respectively.

Conclusion: CLUTCH demonstrates superior performance over state-of-the-art bandits and fuzzing solutions by intelligently selecting mutation targets through deep combinatorial bandit approach with adaptive exploration.

Abstract: JavaScript engines are widely used in web browsers, PDF readers, and server-side applications. The rise in concern over their security has led to the development of several targeted fuzzing techniques. However, existing approaches use random selection to determine where to perform mutations in JavaScript code. We postulate that the problem of selecting better mutation targets is suitable for combinatorial bandits with a volatile number of arms. Thus, we propose CLUTCH, a novel deep combinatorial bandit that can observe variable length JavaScript test case representations, using an attention mechanism from deep learning. Furthermore, using Concrete Dropout, CLUTCH can dynamically adapt its exploration. We show that CLUTCH increases efficiency in JavaScript fuzzing compared to three state-of-the-art solutions by increasing the number of valid test cases and coverage-per-testcase by, respectively, 20.3% and 8.9% on average. In volatile and combinatorial settings we show that CLUTCH outperforms state-of-the-art bandits, achieving at least 78.1% and 4.1% less regret in volatile and combinatorial settings, respectively.

[325] CTRL-Rec: Controlling Recommender Systems With Natural Language

Micah Carroll, Adeline Foote, Kevin Feng, Marcus Williams, Anca Dragan, W. Bradley Knox, Smitha Milli

Main category: cs.AI

TL;DR: CTRL-Rec enables real-time natural language control of recommender systems using LLM embeddings, allowing users to fine-tune recommendations through text requests.

Details

Motivation: Users lack fine-grained controls for changing unsatisfactory recommendations from traditional recommender systems, needing natural language interfaces for better guidance.

Method: Train embedding models using LLM-simulated user approval judgments based on language requests, then integrate these predictions into standard recommender system weighting signals. Requires only one LLM embedding computation per request at deployment.

Result: Successfully enabled fine-grained control across diverse requests on MovieLens dataset. User study with 19 Letterboxd users showed significantly enhanced sense of control and satisfaction compared to traditional controls.

Conclusion: CTRL-Rec provides an effective and computationally efficient method for natural language control of recommender systems, improving user experience and satisfaction.

Abstract: When users are dissatisfied with recommendations from a recommender system, they often lack fine-grained controls for changing them. Large language models (LLMs) offer a solution by allowing users to guide their recommendations through natural language requests (e.g., “I want to see respectful posts with a different perspective than mine”). We propose a method, CTRL-Rec, that allows for natural language control of traditional recommender systems in real-time with computational efficiency. Specifically, at training time, we use an LLM to simulate whether users would approve of items based on their language requests, and we train embedding models that approximate such simulated judgments. We then integrate these user-request-based predictions into the standard weighting of signals that traditional recommender systems optimize. At deployment time, we require only a single LLM embedding computation per user request, allowing for real-time control of recommendations. In experiments with the MovieLens dataset, our method consistently allows for fine-grained control across a diversity of requests. In a study with 19 Letterboxd users, we find that CTRL-Rec was positively received by users and significantly enhanced users’ sense of control and satisfaction with recommendations compared to traditional controls.

[326] Causal Agent based on Large Language Model

Kairong Han, Kun Kuang, Ziyu Zhao, Junjian Ye, Fei Wu

Main category: cs.AI

TL;DR: The paper introduces Causal Agent, an LLM-based framework equipped with causal tools to address challenges in causal reasoning, achieving over 80% accuracy on four levels of causal problems and outperforming SOTA by 6% on real-world data.

Details

Motivation: LLMs struggle with causal problems due to natural language limitations in describing causal theory and structural mismatch with tabular data, hindering accurate causal reasoning.

Method: Developed Causal Agent framework with tools, memory, and reasoning modules. Uses Python code and causal functions to align tabular data with natural language, performs iterative reasoning, and maintains causal graphs in memory.

Result: Achieved over 80% accuracy on four-level CausalTQA benchmark (1.4K questions) and 6% improvement over SOTA on real-world QRData dataset.

Conclusion: Causal Agent effectively bridges the gap between LLMs and causal reasoning, demonstrating strong performance across multiple causal problem levels and real-world applications.

Abstract: The large language model (LLM) has achieved significant success across various domains. However, the inherent complexity of causal problems and causal theory poses challenges in accurately describing them in natural language, making it difficult for LLM to comprehend and use them effectively. Causal methods are not easily conveyed through natural language, which hinders LLM’s ability to apply them accurately. Additionally, causal datasets are typically tabular, while LLM excels in handling natural language data, creating a structural mismatch that impedes effective reasoning with tabular data. To address these challenges, we have equipped the LLM with causal tools within an agent framework, named the Causal Agent, enabling it to tackle causal problems. The causal agent comprises tools, memory, and reasoning modules. In the tool module, the causal agent calls Python code and uses the encapsulated causal function module to align tabular data with natural language. In the reasoning module, the causal agent performs reasoning through multiple iterations with the tools. In the memory module, the causal agent maintains a dictionary instance where the keys are unique names and the values are causal graphs. To verify the causal ability of the causal agent, we established a Causal Tabular Question Answer (CausalTQA) benchmark consisting of four levels of causal problems: variable level, edge level, causal graph level, and causal effect level. CausalTQA consists of about 1.4K for these four levels questions. Causal agent demonstrates remarkable efficacy on the four-level causal problems, with accuracy rates all above 80%. Through verification on the real-world dataset QRData, the causal agent is 6% higher than the original SOTA. For further insights and implementation details, our code is accessible via the GitHub repository https://github.com/kairong-han/causal_agent.

[327] Taming Text-to-Image Synthesis for Novices: User-centric Prompt Generation via Multi-turn Guidance

Yilun Liu, Minggui He, Feiyu Yao, Yuhe Ji, Shimin Tao, Jingzhou Du, Duan Li, Jian Gao, Li Zhang, Hao Yang, Boxing Chen, Osamu Yoshie

Main category: cs.AI

TL;DR: DialPrompt is a dialogue-based text-to-image synthesis prompt generation model that improves user experience for novice users through multi-turn interactions.

Details

Motivation: Existing text-to-image synthesis models are sensitive to prompts and challenging for novice users. Current solutions using single-turn prompt expansion lack user-centricity in interpretability and interactivity.

Method: Proposed DialPrompt with multi-turn dialogue workflow where the model guides users to express preferences on 15 optimization dimensions before generating final prompts. Curated a multi-turn dataset from advanced users.

Result: DialPrompt significantly improves user-centricity score compared to existing approaches while maintaining competitive image quality. Highly rated by 19 human reviewers, especially novices.

Conclusion: DialPrompt successfully enhances user experience for novice users in text-to-image synthesis through interactive dialogue-based prompt generation.

Abstract: The emergence of text-to-image synthesis (TIS) models has significantly influenced digital image creation by producing high-quality visuals from written descriptions. Yet these models are sensitive on textual prompts, posing a challenge for novice users who may not be familiar with TIS prompt writing. Existing solutions relieve this via automatic prompt expansion or generation from a user query. However, this single-turn manner suffers from limited user-centricity in terms of result interpretability and user interactivity. Thus, we propose DialPrompt, a dialogue-based TIS prompt generation model that emphasizes user experience for novice users. DialPrompt is designed to follow a multi-turn workflow, where in each round of dialogue the model guides user to express their preferences on possible optimization dimensions before generating the final TIS prompt. To achieve this, we mined 15 essential dimensions for high-quality prompts from advanced users and curated a multi-turn dataset. Through training on this dataset, DialPrompt improves user-centricity by allowing users to perceive and control the creation process of TIS prompts. Experiments indicate that DialPrompt improves significantly in user-centricity score compared with existing approaches while maintaining a competitive quality of synthesized images. In our user evaluation, DialPrompt is highly rated by 19 human reviewers (especially novices).

[328] Physics-Informed Autonomous LLM Agents for Explainable Power Electronics Modulation Design

Junhua Liu, Fanfan Lin, Xinze Li, Kwan Hui Lim, Shuai Zhao

Main category: cs.AI

TL;DR: PHIA is an LLM-driven autonomous agent that automates power converter modulation design with physics-informed simulation and optimization, achieving 63.2% error reduction and 33x speedup over benchmarks.

Details

Motivation: Current AI-assisted design automation methods lack explainability, scalability, and practical usability in renewable energy systems, particularly for power electronics design tasks.

Method: PHIA uses an LLM-based planning module with interactive chat interface to acquire requirements, combined with physics-informed simulation and optimization components for autonomous design generation and iterative refinement.

Result: PHIA reduces standard mean absolute error by 63.2% compared to benchmarks and accelerates design process by over 33 times, with user study confirming superior efficiency and usability.

Conclusion: PHIA demonstrates potential to transform industrial design workflows in power electronics by providing explainable, scalable automation with minimal human intervention.

Abstract: LLM-based autonomous agents have recently shown strong capabilities in solving complex industrial design tasks. However, in domains aiming for carbon neutrality and high-performance renewable energy systems, current AI-assisted design automation methods face critical challenges in explainability, scalability, and practical usability. To address these limitations, we introduce PHIA (Physics-Informed Autonomous Agent), an LLM-driven system that automates modulation design for power converters in Power Electronics Systems with minimal human intervention. In contrast to traditional pipeline-based methods, PHIA incorporates an LLM-based planning module that interactively acquires and verifies design requirements via a user-friendly chat interface. This planner collaborates with physics-informed simulation and optimization components to autonomously generate and iteratively refine modulation designs. The interactive interface also supports interpretability by providing textual explanations and visual outputs throughout the design process. Experimental results show that PHIA reduces standard mean absolute error by 63.2% compared to the second-best benchmark and accelerates the overall design process by over 33 times. A user study involving 20 domain experts further confirms PHIA’s superior design efficiency and usability, highlighting its potential to transform industrial design workflows in power electronics.

[329] Constrained Identifiability of Causal Effects

Yizuo Chen, Adnan Darwiche

Main category: cs.AI

TL;DR: The paper introduces constrained identifiability, which incorporates additional constraints (like logical constraints) beyond the causal graph to identify causal effects. It proposes an AC-based framework that is at least as complete as existing methods like do-calculus.

Details

Motivation: Classical causal identifiability methods only assume strict positivity constraints. Real-world scenarios often have additional constraints that could help identify causal effects that would otherwise be unidentifiable.

Method: The authors formalize constrained identifiability and develop a framework using Arithmetic Circuits (ACs) to systematically accommodate various types of constraints when testing identifiability.

Result: The AC-based approach is shown to be at least as complete as existing algorithms like do-calculus. Examples demonstrate that causal effects unidentifiable under classical methods become identifiable when incorporating different constraint types.

Conclusion: Constrained identifiability with ACs provides a more powerful framework for causal effect identification by leveraging additional constraints, expanding the scope of identifiable causal effects beyond what’s possible with classical methods.

Abstract: We study the identification of causal effects in the presence of different types of constraints (e.g., logical constraints) in addition to the causal graph. These constraints impose restrictions on the models (parameterizations) induced by the causal graph, reducing the set of models considered by the identifiability problem. We formalize the notion of constrained identifiability, which takes a set of constraints as another input to the classical definition of identifiability. We then introduce a framework for testing constrained identifiability by employing tractable Arithmetic Circuits (ACs), which enables us to accommodate constraints systematically. We show that this AC-based approach is at least as complete as existing algorithms (e.g., do-calculus) for testing classical identifiability, which only assumes the constraint of strict positivity. We use examples to demonstrate the effectiveness of this AC-based approach by showing that unidentifiable causal effects may become identifiable under different types of constraints.

[330] The Philosophical Foundations of Growing AI Like A Child

Dezhi Luo, Yijiang Li, Hokin Deng

Main category: cs.AI

TL;DR: Language models lack robustness in real-world scenarios due to missing core knowledge - foundational cognitive structures that humans develop. The paper proposes integrating core knowledge into multi-modal models through synthetic training data generation.

Details

Motivation: Current language models excel at high-level reasoning but perform poorly on fundamental problem-solving tasks that are intuitive to humans, showing a gap in robustness and generalizability.

Method: Proposes systematically integrating core knowledge into multi-modal language models through large-scale generation of synthetic training data using cognitive prototyping strategy.

Result: The paper analyzes empirical evidence of core knowledge in humans and identifies that language models’ failure to acquire it is not an inherent architectural constraint.

Conclusion: Integrating core knowledge into language models through synthetic data generation is a viable approach to address their current limitations in robustness and fundamental problem-solving abilities.

Abstract: Despite excelling in high-level reasoning, current language models lack robustness in real-world scenarios and perform poorly on fundamental problem-solving tasks that are intuitive to humans. This paper argues that both challenges stem from a core discrepancy between human and machine cognitive development. While both systems rely on increasing representational power, the absence of core knowledge, foundational cognitive structures in humans, prevents language models from developing robust, generalizable abilities, where complex skills are grounded in simpler ones within their respective domains. It explores empirical evidence of core knowledge in humans, analyzes why language models fail to acquire it, and argues that this limitation is not an inherent architectural constraint. Finally, it outlines a workable proposal for systematically integrating core knowledge into future multi-modal language models through the large-scale generation of synthetic training data using a cognitive prototyping strategy.

[331] Creativity or Brute Force? Using Brainteasers as a Window into the Problem-Solving Abilities of Large Language Models

Simeng Han, Howard Dai, Stephen Xia, Grant Zhang, Chen Liu, Lichang Chen, Hoang Huy Nguyen, Hongyuan Mei, Jiayuan Mao, R. Thomas McCoy

Main category: cs.AI

TL;DR: A benchmark using brainteasers in narrative form to evaluate LLM reasoning strategies beyond accuracy, focusing on solution creativity and efficiency.

Details

Motivation: Accuracy alone provides limited insight into how models solve problems. Brainteasers can reveal different reasoning approaches like creative insights vs brute force methods.

Method: Evaluated LLMs across multiple reasoning layers: semantic parsing to mathematical formats, solution generation, self-correction, step-by-step solution sketches, and hint utilization using narrative brainteasers.

Result: LLMs can find creative, insightful solutions to brainteasers, demonstrating capacity for novel problem-solving. However, they sometimes use brute force when more efficient creative solutions are available.

Conclusion: LLMs show promising creative reasoning abilities but still have room for improvement in consistently choosing efficient over brute-force approaches, highlighting a direction for advancing reasoning capabilities.

Abstract: Accuracy remains a standard metric for evaluating AI systems, but it offers limited insight into how models arrive at their solutions. In this work, we introduce a benchmark based on brainteasers written in long narrative form to probe more deeply into the types of reasoning strategies that models use. Brainteasers are well-suited for this goal because they can be solved with multiple approaches, such as a few-step solution that uses a creative insight or a longer solution that uses more brute force. We investigate large language models (LLMs) across multiple layers of reasoning, focusing not only on correctness but also on the quality and creativity of their solutions. We investigate many aspects of the reasoning process: (1) semantic parsing of the brainteasers into precise mathematical competition style formats; (2) generating solutions from these mathematical forms; (3) self-correcting solutions based on gold solutions; (4) producing step-by-step sketches of solutions; and (5) making use of hints. We find that LLMs are in many cases able to find creative, insightful solutions to brainteasers, suggesting that they capture some of the capacities needed to solve novel problems in creative ways. Nonetheless, there also remain situations where they rely on brute force despite the availability of more efficient, creative solutions, highlighting a potential direction for improvement in the reasoning abilities of LLMs.

[332] Open and Sustainable AI: challenges, opportunities and the road ahead in the life sciences (October 2025 – Version 2)

Gavin Farrell, Eleni Adamidi, Rafael Andrade Buono, Mihail Anton, Omar Abdelghani Attafi, Salvador Capella Gutierrez, Emidio Capriotti, Leyla Jael Castro, Davide Cirillo, Lisa Crossman, Christophe Dessimoz, Alexandros Dimopoulos, Raul Fernandez-Diaz, Styliani-Christina Fragkouli, Carole Goble, Wei Gu, John M. Hancock, Alireza Khanteymoori, Tom Lenaerts, Fabio G. Liberante, Peter Maccallum, Alexander Miguel Monzon, Magnus Palmblad, Lucy Poveda, Ovidiu Radulescu, Denis C. Shields, Shoaib Sufi, Thanasis Vergoulis, Fotis Psomopoulos, Silvio C. E. Tosatto

Main category: cs.AI

TL;DR: This perspective paper addresses trust and sustainability issues in AI-based life science research by introducing Open and Sustainable AI (OSAI) recommendations mapped to over 300 AI ecosystem components.

Details

Motivation: To maximize return on AI investments and accelerate progress in life sciences by addressing poor reusability, reproducibility, and environmental sustainability challenges exacerbated by rapid AI adoption.

Method: The authors review trust erosion issues in AI research outputs and introduce a practical set of OSAI recommendations directly mapped to over 300 components of the AI ecosystem.

Result: Development of OSAI recommendations that connect researchers with relevant AI resources to facilitate implementation of sustainable, reusable and transparent AI.

Conclusion: The outputs are designed to aid future development of policy and structured pathways for guiding AI implementation in life sciences, built upon community consensus and aligned with existing efforts.

Abstract: Artificial intelligence (AI) has recently seen transformative breakthroughs in the life sciences, expanding possibilities for researchers to interpret biological information at an unprecedented capacity, with novel applications and advances being made almost daily. In order to maximise return on the growing investments in AI-based life science research and accelerate this progress, it has become urgent to address the exacerbation of long-standing research challenges arising from the rapid adoption of AI methods. We review the increased erosion of trust in AI research outputs, driven by the issues of poor reusability and reproducibility, and highlight their consequent impact on environmental sustainability. Furthermore, we discuss the fragmented components of the AI ecosystem and lack of guiding pathways to best support Open and Sustainable AI (OSAI) model development. In response, this perspective introduces a practical set of OSAI recommendations directly mapped to over 300 components of the AI ecosystem. Our work connects researchers with relevant AI resources, facilitating the implementation of sustainable, reusable and transparent AI. Built upon life science community consensus and aligned to existing efforts, the outputs of this perspective are designed to aid the future development of policy and structured pathways for guiding AI implementation.

[333] EgoBrain: Synergizing Minds and Eyes For Human Action Understanding

Nie Lin, Yansen Wang, Dongqi Han, Weibang Jiang, Jingyuan Li, Ryosuke Furuta, Yoichi Sato, Dongsheng Li

Main category: cs.AI

TL;DR: EgoBrain is the first large-scale multimodal dataset synchronizing egocentric vision and EEG data, enabling new approaches for human-centered behavior analysis through AI integration.

Details

Motivation: To advance brain-computer interfaces by creating a unified framework that combines EEG and vision data for better understanding of human cognition and behavior.

Method: Created a dataset with 61 hours of synchronized 32-channel EEG and first-person video from 40 participants across 29 daily activities, then developed a multimodal learning framework to fuse EEG and vision data.

Result: Achieved 66.70% accuracy in action recognition across cross-subject and cross-environment challenges, demonstrating effective fusion of EEG and vision modalities.

Conclusion: EgoBrain establishes a new paradigm for multimodal brain-computer interfaces and provides open data and tools to advance cognitive computing research.

Abstract: The integration of brain-computer interfaces (BCIs), in particular electroencephalography (EEG), with artificial intelligence (AI) has shown tremendous promise in decoding human cognition and behavior from neural signals. In particular, the rise of multimodal AI models have brought new possibilities that have never been imagined before. Here, we present EgoBrain –the world’s first large-scale, temporally aligned multimodal dataset that synchronizes egocentric vision and EEG of human brain over extended periods of time, establishing a new paradigm for human-centered behavior analysis. This dataset comprises 61 hours of synchronized 32-channel EEG recordings and first-person video from 40 participants engaged in 29 categories of daily activities. We then developed a muiltimodal learning framework to fuse EEG and vision for action understanding, validated across both cross-subject and cross-environment challenges, achieving an action recognition accuracy of 66.70%. EgoBrain paves the way for a unified framework for brain-computer interface with multiple modalities. All data, tools, and acquisition protocols are openly shared to foster open science in cognitive computing.

[334] Can LLMs Reconcile Knowledge Conflicts in Counterfactual Reasoning

Khurram Yamin, Gaurav Ghosal, Bryan Wilder

Main category: cs.AI

TL;DR: LLMs struggle with counterfactual reasoning, often relying only on parametric knowledge instead of integrating it with new information, and finetuning fails to fix this while degrading stored knowledge.

Details

Motivation: To explore whether LLMs can effectively combine their parametric knowledge with new information in novel settings through counterfactual reasoning.

Method: Used synthetic and real experiments in multi-hop reasoning problems to test LLMs’ counterfactual reasoning abilities, and attempted post-hoc finetuning to improve this capability.

Result: LLMs generally fail at counterfactual reasoning, defaulting to parametric knowledge only, and finetuning degrades stored knowledge without improving counterfactual reasoning.

Conclusion: Current LLMs have significant limitations in re-purposing parametric knowledge for novel situations, highlighting a key weakness in their reasoning capabilities.

Abstract: Large Language Models have been shown to contain extensive world knowledge in their parameters, enabling impressive performance on many knowledge intensive tasks. However, when deployed in novel settings, LLMs often encounter situations where they must integrate parametric knowledge with new or unfamiliar information. In this work, we explore whether LLMs can combine knowledge in-context with their parametric knowledge through the lens of counterfactual reasoning. Through synthetic and real experiments in multi-hop reasoning problems, we show that LLMs generally struggle with counterfactual reasoning, often resorting to exclusively using their parametric knowledge. Moreover, we show that simple post-hoc finetuning can struggle to instill counterfactual reasoning ability – often leading to degradation in stored parametric knowledge. Ultimately, our work reveals important limitations of current LLM’s abilities to re-purpose parametric knowledge in novel settings.

[335] CoRGI: Verified Chain-of-Thought Reasoning with Post-hoc Visual Grounding

Shixin Yi, Lin Shang

Main category: cs.AI

TL;DR: CoRGI is a post-hoc verification framework that improves multimodal reasoning reliability by decomposing VLM-generated rationales, grounding each step in visual evidence, and filtering/correcting unsupported claims.

Details

Motivation: Vision-language models often suffer from hallucinations and generate explanations after only superficial image inspection, leading to unreliable reasoning.

Method: Decomposes VLM-generated rationales into step-wise statements, grounds each step in visual evidence, and filters/corrects unsupported claims before producing final answers.

Result: Consistently improves answer accuracy and explanation faithfulness across five benchmarks (VCR, ScienceQA, MMMU, MathVista, HallusionBench) with multiple VLM backbones (Qwen-2.5VL, LLaVA-1.6, Gemma3-12B).

Conclusion: Post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems by reducing hallucination and strengthening interpretability.

Abstract: Multimodal reasoning with vision-language models (VLMs) often suffers from hallucinations, as models tend to generate explanations after only a superficial inspection of the image. We present \textbf{CoRGI}(\textbf{C}hain \textbf{o}f \textbf{R}easoning with \textbf{G}rounded \textbf{I}nsights), a framework that enhances reasoning reliability through post-hoc verification of chain-of-thought outputs. Given a VLM-generated rationale, CoRGI decomposes it into step-wise statements, grounds each step in visual evidence, and filters or corrects unsupported claims before producing the final answer. Experiments on five challenging benchmark-VCR, ScienceQA, MMMU, MathVista, and HallusionBenc-demonstrate that CoRGI consistently improves both answer accuracy and explanation faithfulness across multiple VLM backbones, including Qwen-2.5VL, LLaVA-1.6, and Gemma3-12B. Beyond quantitative gains, qualitative analyses further illustrate how the verification process reduces hallucination and strengthens interpretability, suggesting that post-hoc visual grounding is a promising direction for building more trustworthy and transparent multimodal reasoning systems.

[336] Modular Embedding Recomposition for Incremental Learning

Aniello Panariello, Emanuele Frascaroli, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara

Main category: cs.AI

TL;DR: MoDER enhances zero-shot capabilities of Vision-Language Models in Continual Learning by training specialized textual experts and composing them to synthesize refined prototypes for unseen classes.

Details

Motivation: While VLMs have strong zero-shot abilities, fine-tuning is needed for domain shifts. Previous CL approaches focused on preserving zero-shot capabilities, but this work aims to enhance them.

Method: MoDular Embedding Recomposition (MoDER) trains multiple textual experts specialized in single seen classes, stores them in a hub, and composes retrieved experts to synthesize refined prototypes for unseen classes at inference.

Result: The method was tested across 14 datasets using Class-IL and MTIL zero-shot incremental protocols, showing effectiveness in improving classification.

Conclusion: MoDER successfully transforms preservation into enhancement of VLMs’ zero-shot capabilities through modular expert training and composition.

Abstract: The advent of pre-trained Vision-Language Models (VLMs) has significantly transformed Continual Learning (CL), mainly due to their zero-shot classification abilities. Such proficiency makes VLMs well-suited for real-world applications, enabling robust performance on novel unseen classes without requiring adaptation. However, fine-tuning remains essential when downstream tasks deviate significantly from the pre-training domain. Prior CL approaches primarily focus on preserving the zero-shot capabilities of VLMs during incremental fine-tuning on a downstream task. We take a step further by devising an approach that transforms preservation into enhancement of the zero-shot capabilities of VLMs. Our approach, named MoDular Embedding Recomposition (MoDER), introduces a modular framework that trains multiple textual experts, each specialized in a single seen class, and stores them in a foundational hub. At inference time, for each unseen class, we query the hub and compose the retrieved experts to synthesize a refined prototype that improves classification. We show the effectiveness of our method across two popular zero-shot incremental protocols, Class-IL and MTIL, comprising a total of 14 datasets. The codebase is available at https://github.com/aimagelab/mammoth.

[337] ChatThero: An LLM-Supported Chatbot for Behavior Change and Therapeutic Support in Addiction Recovery

Junda Wang, Zonghai Yao, Lingxi Li, Junhui Qian, Zhichao Yang, Hong Yu

Main category: cs.AI

TL;DR: ChatThero is an autonomous language agent for addiction recovery that uses multi-agent simulation with stressor injection and therapeutic strategies to provide long-term support, showing superior empathy and clinical relevance compared to GPT-5.

Details

Motivation: Substance use disorders affect millions with high relapse rates and limited access to care, requiring multi-session treatments that are often unavailable.

Method: Trained in multi-agent simulated environment using anonymized patient profiles from recovery communities, classified by resistance levels. Introduces external stressors and dynamically injects motivational interview and cognitive behavioral therapy strategies.

Result: ChatThero significantly outperforms GPT-5, raising motivation by +1.71 points and confidence by +1.67 points on 1-5 scale. On difficult patients, achieves success milestone with 26% fewer turns than GPT-5.

Conclusion: ChatThero provides effective, low-cost therapeutic support for addiction recovery with superior empathy and clinical relevance, demonstrating that stressor simulation improves robustness and matches real-world relapse patterns.

Abstract: Substance use disorders (SUDs) affect millions of people, and relapses are common, requiring multi-session treatments. Access to care is limited, which contributes to the challenge of recovery support. We present \textbf{ChatThero}, an innovative low-cost, multi-session, stressor-aware, and memory-persistent autonomous \emph{language agent} designed to facilitate long-term behavior change and therapeutic support in addiction recovery. Unlike existing work that mostly finetuned large language models (LLMs) on patient-therapist conversation data, ChatThero was trained in a multi-agent simulated environment that mirrors real therapy. We created anonymized patient profiles from recovery communities (e.g., Reddit). We classify patients as \texttt{easy}, \texttt{medium}, and \texttt{difficult}, three scales representing their resistance to recovery. We created an external environment by introducing stressors (e.g., social determinants of health) to simulate real-world situations. We dynamically inject clinically-grounded therapeutic strategies (motivational interview and cognitive behavioral therapy). Our evaluation, conducted by both human (blinded clinicians) and LLM-as-Judge, shows that ChatThero is superior in empathy and clinical relevance. We show that stressor simulation improves robustness of ChatThero. Explicit stressors increase relapse-like setbacks, matching real-world patterns. We evaluate ChatThero with behavioral change metrics. On a 1–5 scale, ChatThero raises \texttt{motivation} by $+1.71$ points (from $2.39$ to $4.10$) and \texttt{confidence} by $+1.67$ points (from $1.52$ to $3.19$), substantially outperforming GPT-5. On \texttt{difficult} patients, ChatThero reaches the success milestone with $26%$ fewer turns than GPT-5.

[338] MapAgent: A Hierarchical Agent for Geospatial Reasoning with Dynamic Map Tool Integration

Md Hasebul Hasan, Mahir Labib Dihan, Tanzima Hashem, Mohammed Eunus Ali, Md Rizwan Parvez

Main category: cs.AI

TL;DR: MapAgent is a hierarchical multi-agent framework that addresses geospatial reasoning challenges by decoupling planning from execution, using specialized modules and map-tool agents to improve spatial reasoning and API coordination.

Details

Motivation: Existing AI agent frameworks are inadequate for geospatial tasks requiring spatial reasoning, multi-hop planning, and real-time map interaction, as they treat tools uniformly and overwhelm LLMs with similar geospatial APIs.

Method: Hierarchical multi-agent framework with high-level planner that decomposes queries into subgoals, specialized modules, and dedicated map-tool agents that orchestrate related APIs in parallel for efficient geospatial data fetching.

Result: Substantial gains over state-of-the-art baselines on four geospatial benchmarks (MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA), with improved tool selection accuracy and reduced cognitive load.

Conclusion: MapAgent’s hierarchical design effectively addresses geospatial reasoning challenges, enabling precise API coordination and superior performance on diverse geospatial tasks compared to existing approaches.

Abstract: Agentic AI has significantly extended the capabilities of large language models (LLMs) by enabling complex reasoning and tool use. However, most existing frameworks are tailored to domains such as mathematics, coding, or web automation, and fall short on geospatial tasks that require spatial reasoning, multi-hop planning, and real-time map interaction. To address these challenges, we introduce MapAgent, a hierarchical multi-agent plug-and-play framework with customized toolsets and agentic scaffolds for map-integrated geospatial reasoning. Unlike existing flat agent-based approaches that treat tools uniformly-often overwhelming the LLM when handling similar but subtly different geospatial APIs-MapAgent decouples planning from execution. A high-level planner decomposes complex queries into subgoals, which are routed to specialized modules. For tool-heavy modules-such as map-based services-we then design a dedicated map-tool agent that efficiently orchestrates related APIs adaptively in parallel to effectively fetch geospatial data relevant for the query, while simpler modules (e.g., solution generation or answer extraction) operate without additional agent overhead. This hierarchical design reduces cognitive load, improves tool selection accuracy, and enables precise coordination across similar APIs. We evaluate MapAgent on four diverse geospatial benchmarks-MapEval-Textual, MapEval-API, MapEval-Visual, and MapQA-and demonstrate substantial gains over state-of-the-art tool-augmented and agentic baselines. We open-source our framwork at https://github.com/Hasebul/MapAgent.

[339] Strategic Tradeoffs Between Humans and AI in Multi-Agent Bargaining

Crystal Qian, Kehang Zhu, John Horton, Benjamin S. Manning, Vivian Tsai, James Wexler, Nithum Thain

Main category: cs.AI

TL;DR: LLMs achieve performance parity with humans in negotiation tasks but through fundamentally different behavioral strategies - conservative concessionary approaches vs human strategic risk-taking.

Details

Motivation: To evaluate LLM performance and behavioral dynamics in collaborative multi-agent environments compared to humans and traditional Bayesian agents, as LLMs become increasingly embedded in real-world coordination tasks.

Method: Comparative study of humans (N=216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in identical dynamic negotiation conditions, analyzing outcomes and behavioral patterns.

Result: Bayesian agents achieved highest surplus through aggressive optimization but with frequent trade rejections. Humans and LLMs achieved similar overall surplus but through distinct behaviors - LLMs used conservative concessionary strategies with few rejections, while humans employed more strategic, risk-taking, and fairness-oriented approaches.

Conclusion: Performance parity metrics can conceal fundamental differences in process and alignment between agent types, which are critical for practical deployment in real-world coordination tasks. The study establishes foundational behavioral baselines for future research.

Abstract: As large language models (LLMs) are increasingly embedded in collaborative human activities such as business negotiations and group coordination, it becomes critical to evaluate both the performance gains they can achieve and how they interact in dynamic, multi-agent environments. Unlike traditional statistical agents such as Bayesian models, which may excel under well-specified conditions, large language models (LLMs) can generalize across diverse, real-world scenarios, raising new questions about how their strategies and behaviors compare to those of humans and other agent types. In this work, we compare outcomes and behavioral dynamics across humans (N = 216), LLMs (GPT-4o, Gemini 1.5 Pro), and Bayesian agents in a dynamic negotiation setting under identical conditions. Bayesian agents extract the highest surplus through aggressive optimization, at the cost of frequent trade rejections. Humans and LLMs achieve similar overall surplus, but through distinct behaviors: LLMs favor conservative, concessionary trades with few rejections, while humans employ more strategic, risk-taking, and fairness-oriented behaviors. Thus, we find that performance parity – a common benchmark in agent evaluation – can conceal fundamental differences in process and alignment, which are critical for practical deployment in real-world coordination tasks. By establishing foundational behavioral baselines under matched conditions, this work provides a baseline for future studies in more applied, variable-rich environments.

[340] MoEs Are Stronger than You Think: Hyper-Parallel Inference Scaling with RoE

Soheil Zibakhsh, Mohammad Samragh, Kumari Nishu, Lauren Hannah, Arnav Kundu, Minsik Cho

Main category: cs.AI

TL;DR: Hyper-parallel scaling improves LLM generation quality by aggregating multiple token-level output proposals. Implemented as Roster of Experts (RoE), it turns MoE models into dynamic ensembles through controlled stochastic routing.

Details

Motivation: To improve prediction quality at the token level, complementing existing sequence-level scaling methods like Chain-of-Thought, by leveraging the capabilities of Mixture-of-Experts models more effectively.

Method: RoE injects controlled stochasticity into expert routing to sample multiple diverse experts per token, aggregates their outputs, and uses efficient batching with specialized KV-caching to minimize computational overhead.

Result: A 7B MoE model with RoE matches the performance of a 10.5B MoE model while using 30% less compute for inference, achieving these gains without any fine-tuning.

Conclusion: Hyper-parallel scaling through RoE provides significant performance improvements for MoE models with reduced computational costs, offering a training-free inference enhancement method.

Abstract: The generation quality of large language models (LLMs) is often improved by utilizing inference-time sequence-level scaling methods (e.g., Chain-of-Thought). We introduce hyper-parallel scaling, a complementary framework that improves prediction quality at the token level. Hyper-parallel scaling computes and aggregates multiple output proposals for a single token from the model. We implement this concept in Mixture-of-Experts (MoE) models, which we refer to as Roster of Experts (RoE). RoE is a training-free inference algorithm that turns a single MoE into a dynamic ensemble of MoEs. RoE injects controlled stochasticity into the expert routing mechanism, enabling it to sample multiple diverse experts for each token and aggregate their outputs for a more accurate final prediction. To overcome the computational cost, we introduce an efficient batching strategy and a specialized KV-caching mechanism that minimizes compute and memory overhead. For example, RoE enables a 7B MoE model to match the performance of a 10.5B MoE model while using 30% less compute for inference. These gains are achieved without any fine-tuning of model parameters.

[341] Large Language Models in Operations Research: Methods, Applications, and Challenges

Yang Wang, Kai Li

Main category: cs.AI

TL;DR: LLMs can transform operations research by automating modeling, aiding optimization, and directly solving problems, overcoming limitations of traditional expert-driven approaches.

Details

Motivation: Traditional OR methods struggle with large-scale, dynamic problems due to reliance on expert modeling and manual tuning, limiting scalability and real-time use.

Method: Systematic review categorizing LLM applications in OR into three pathways: automatic modeling, auxiliary optimization, and direct solving.

Result: LLMs show strong potential to enhance OR through semantic understanding, structured generation, and reasoning capabilities.

Conclusion: LLMs can reshape OR by improving interpretability, adaptability, and scalability, enabling next-generation intelligent optimization systems.

Abstract: Operations research (OR) is a core methodology that supports complex system decision-making, with broad applications in transportation, supply chain management, and production scheduling. However, traditional approaches that rely on expert-driven modeling and manual parameter tuning often struggle with large-scale, dynamic, and multi-constraint problems, limiting scalability and real-time applicability. Large language models (LLMs), with capabilities in semantic understanding, structured generation, and reasoning control, offer new opportunities to overcome these challenges. They can translate natural language problem descriptions into mathematical models or executable code, generate heuristics, evolve algorithms, and directly solve optimization tasks. This shifts the paradigm from human-driven processes to intelligent human-AI collaboration. This paper systematically reviews progress in applying LLMs to OR, categorizing existing methods into three pathways: automatic modeling, auxiliary optimization, and direct solving. It also examines evaluation benchmarks and domain-specific applications, and highlights key challenges, including unstable semantic-to-structure mapping, fragmented research, limited generalization and interpretability, insufficient evaluation systems, and barriers to industrial deployment. Finally, it outlines potential research directions. Overall, LLMs demonstrate strong potential to reshape the OR paradigm by enhancing interpretability, adaptability, and scalability, paving the way for next-generation intelligent optimization systems.

[342] Similarity Field Theory: A Mathematical Framework for Intelligence

Kei-Sing Ng

Main category: cs.AI

TL;DR: Similarity Field Theory is a mathematical framework that formalizes similarity relations among entities and their evolution, reframing intelligence as preserving similarity fibers of concepts.

Details

Motivation: To provide a structural basis for understanding dynamic systems through similarity relations and offer a foundational language for characterizing intelligent systems.

Method: Defines similarity fields over entities, system evolution sequences, concepts as fibers of similarity maps, and a generative operator for intelligence.

Result: Proves theorems on asymmetry blocking mutual inclusion and stability requiring anchors/level set confinement, ensuring constrained and interpretable evolution.

Conclusion: The framework offers a geometric approach to intelligence and interpretability, applicable to analyzing large language models and societal cognition.

Abstract: We posit that persisting and transforming similarity relations form the structural basis of any comprehensible dynamic system. This paper introduces Similarity Field Theory, a mathematical framework that formalizes the principles governing similarity values among entities and their evolution. We define: (1) a similarity field $S: U \times U \to [0,1]$ over a universe of entities $U$, satisfying reflexivity $S(E,E)=1$ and treated as a directed relational field (asymmetry and non-transitivity are allowed); (2) the evolution of a system through a sequence $Z_p=(X_p,S^{(p)})$ indexed by $p=0,1,2,\ldots$; (3) concepts $K$ as entities that induce fibers $F_{\alpha}(K)={E\in U \mid S(E,K)\ge \alpha}$, i.e., superlevel sets of the unary map $S_K(E):=S(E,K)$; and (4) a generative operator $G$ that produces new entities. Within this framework, we formalize a generative definition of intelligence: an operator $G$ is intelligent with respect to a concept $K$ if, given a system containing entities belonging to the fiber of $K$, it generates new entities that also belong to that fiber. Similarity Field Theory thus offers a foundational language for characterizing, comparing, and constructing intelligent systems. At a high level, this framework reframes intelligence and interpretability as geometric problems on similarity fields – preserving and composing level-set fibers – rather than purely statistical ones. We prove two theorems: (i) asymmetry blocks mutual inclusion; and (ii) stability requires either an anchor coordinate or eventual confinement within a level set. These results ensure that the evolution of similarity fields is both constrained and interpretable, culminating in a framework that not only interprets large language models but also introduces a novel way of using them as experimental probes of societal cognition, supported by preliminary evidence across diverse consumer categories.

[343] Clean First, Align Later: Benchmarking Preference Data Cleaning for Reliable LLM Alignment

Samuel Yeh, Sharon Li

Main category: cs.AI

TL;DR: PrefCleanBench is the first comprehensive benchmark for evaluating 13 preference data cleaning methods in LLM alignment, providing standardized assessment of cleaning strategies across diverse datasets, models, and algorithms.

Details

Motivation: Human feedback for LLM alignment is often noisy and inconsistent, degrading reward models and hindering alignment. Current automated data cleaning methods lack systematic evaluation of their effectiveness and generalizability.

Method: Introduced PrefCleanBench benchmark with standardized protocol to assess 13 preference data cleaning methods across diverse datasets, model architectures, and optimization algorithms.

Result: The benchmark enables rigorous comparison of cleaning methods, uncovering key factors that determine success in data cleaning for alignment tasks.

Conclusion: PrefCleanBench establishes groundwork for principled approaches to improve LLM alignment through better data quality, highlighting the crucial role of data preprocessing in responsible AI development.

Abstract: Human feedback plays a pivotal role in aligning large language models (LLMs) with human preferences. However, such feedback is often noisy or inconsistent, which can degrade the quality of reward models and hinder alignment. While various automated data cleaning methods have been proposed to mitigate this issue, a systematic evaluation of their effectiveness and generalizability remains lacking. To bridge this gap, we introduce the first comprehensive benchmark for evaluating 13 preference data cleaning methods in the context of LLM alignment. PrefCleanBench offers a standardized protocol to assess cleaning strategies in terms of alignment performance and generalizability across diverse datasets, model architectures, and optimization algorithms. By unifying disparate methods and rigorously comparing them, we uncover key factors that determine the success of data cleaning in alignment tasks. This benchmark lays the groundwork for principled and reproducible approaches to improving LLM alignment through better data quality-highlighting the crucial but underexplored role of data preprocessing in responsible AI development. We release modular implementations of all methods to catalyze further research: https://github.com/deeplearning-wisc/PrefCleanBench.

[344] L2M-AID: Autonomous Cyber-Physical Defense by Fusing Semantic Reasoning of Large Language Models with Multi-Agent Reinforcement Learning (Preprint)

Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Jun Wang, Yan Li, Chang Liu

Main category: cs.AI

TL;DR: L2M-AID is an autonomous industrial defense framework that combines LLMs with multi-agent reinforcement learning to detect and respond to sophisticated cyber-physical attacks while maintaining operational stability.

Details

Motivation: Traditional industrial defenses lack contextual awareness to handle sophisticated multi-stage attacks in IIoT environments, requiring adaptive security that can reason about adversary intent rather than just pattern matching.

Method: Uses LLMs as semantic bridges to translate unstructured telemetry into contextual state representations, then applies Multi-Agent Reinforcement Learning (MAPPO) with reward functions balancing security objectives and operational stability.

Result: Achieved 97.2% detection rate, reduced false positives by over 80%, improved response times by 4x, and demonstrated superior performance in maintaining physical process stability compared to traditional IDS and other baselines.

Conclusion: L2M-AID presents a robust new paradigm for securing critical infrastructure by effectively combining LLM semantic reasoning with multi-agent reinforcement learning for adaptive, context-aware defense.

Abstract: The increasing integration of Industrial IoT (IIoT) exposes critical cyber-physical systems to sophisticated, multi-stage attacks that elude traditional defenses lacking contextual awareness. This paper introduces L2M-AID, a novel framework for Autonomous Industrial Defense using LLM-empowered, Multi-agent reinforcement learning. L2M-AID orchestrates a team of collaborative agents, each driven by a Large Language Model (LLM), to achieve adaptive and resilient security. The core innovation lies in the deep fusion of two AI paradigms: we leverage an LLM as a semantic bridge to translate vast, unstructured telemetry into a rich, contextual state representation, enabling agents to reason about adversary intent rather than merely matching patterns. This semantically-aware state empowers a Multi-Agent Reinforcement Learning (MARL) algorithm, MAPPO, to learn complex cooperative strategies. The MARL reward function is uniquely engineered to balance security objectives (threat neutralization) with operational imperatives, explicitly penalizing actions that disrupt physical process stability. To validate our approach, we conduct extensive experiments on the benchmark SWaT dataset and a novel synthetic dataset generated based on the MITRE ATT&CK for ICS framework. Results demonstrate that L2M-AID significantly outperforms traditional IDS, deep learning anomaly detectors, and single-agent RL baselines across key metrics, achieving a 97.2% detection rate while reducing false positives by over 80% and improving response times by a factor of four. Crucially, it demonstrates superior performance in maintaining physical process stability, presenting a robust new paradigm for securing critical national infrastructure.

[345] Agent Learning via Early Experience

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

Main category: cs.AI

TL;DR: The paper introduces ’early experience’ - using agent-generated interaction data where future states serve as supervision, improving language agents’ effectiveness and generalization without requiring reward signals.

Details

Motivation: Current language agents rely on supervised fine-tuning on expert data, which is hard to scale and generalizes poorly due to limited scenario coverage and environment diversity. There's a need for a middle-ground between imitation learning and full reinforcement learning.

Method: Two strategies using early experience data: (1) Implicit world modeling - using collected states to ground policy in environment dynamics; (2) Self-reflection - learning from suboptimal actions to improve reasoning and decision-making.

Result: Evaluated across eight diverse environments and multiple model families. Approaches consistently improved effectiveness and out-of-domain generalization. In environments with verifiable rewards, early experience provided strong foundation for subsequent reinforcement learning.

Conclusion: Early experience offers a practical bridge between imitation learning and fully experience-driven agents, enabling better performance and generalization without requiring reward signals.

Abstract: A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent’s own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies of using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. We evaluate across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.

[346] TripScore: Benchmarking and rewarding real-world travel planning with fine-grained evaluation

Yincen Qu, Huan Xiao, Feng Li, Gregory Li, Hui Zhou, Xiangying Dai

Main category: cs.AI

TL;DR: A comprehensive benchmark for evaluating LLMs’ travel planning capabilities, focusing on feasibility, reliability, and engagement with a unified reward system.

Details

Motivation: Existing benchmarks fall short in evaluating key aspects of travel plans like feasibility, reliability, and engagement, creating a need for more comprehensive evaluation methods.

Method: Developed a unified reward system combining fine-grained criteria, created a large-scale dataset with 4,870 queries including 219 real-world requests, and tested various methods including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and reinforcement learning via GRPO.

Result: The evaluator achieved 60.75% agreement with travel-expert annotations and outperformed multiple LLM-as-judge baselines. Reinforcement learning generally improved itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.

Conclusion: The proposed benchmark enables direct comparison of travel plan quality and seamless integration with reinforcement learning, with RL showing consistent improvements in plan feasibility across base models.

Abstract: Travel planning is a valuable yet complex task that poses significant challenges even for advanced large language models (LLMs). While recent benchmarks have advanced in evaluating LLMs’ planning capabilities, they often fall short in evaluating feasibility, reliability, and engagement of travel plans. We introduce a comprehensive benchmark for travel planning that unifies fine-grained criteria into a single reward, enabling direct comparison of plan quality and seamless integration with reinforcement learning (RL). Our evaluator achieves moderate agreement with travel-expert annotations (60.75%) and outperforms multiple LLM-as-judge baselines. We further release a large-scale dataset of 4,870 queries including 219 real-world, free-form requests for generalization to authentic user intent. Using this benchmark, we conduct extensive experiments across diverse methods and LLMs, including test-time computation, neuro-symbolic approaches, supervised fine-tuning, and RL via GRPO. Across base models, RL generally improves itinerary feasibility over prompt-only and supervised baselines, yielding higher unified reward scores.

[347] Humanoid Artificial Consciousness Designed with Large Language Model Based on Psychoanalysis and Personality Theory

Sang Hun Kim, Jongmin Lee, Dongkyu Park, So Young Lee, Yosep Chong

Main category: cs.AI

TL;DR: This study proposes integrating psychoanalysis and MBTI personality types to create artificial consciousness modules, developing three consciousness types and 16 personality characters, and evaluating them through various scenarios and assessment methods.

Details

Motivation: Current LLMs struggle with human consciousness due to hallucination issues, and there's a need to develop more human-like AI consciousness that can better understand and interact in complex cognitive contexts.

Method: Developed three artificial consciousness types based on psychoanalysis (self-awareness, unconsciousness, preconsciousness) and 16 MBTI personality characters with attributes like needs, status, and memories. Created 10 distinct situations with 7 attributes for evaluation, using survey evaluation, ChatGPT classification, and qualitative review.

Result: Both quantitative and qualitative analyses showed high likelihood of well-simulated consciousness, though differences between characters and consciousness types were not very significant. The models demonstrated potential for human-like cognition.

Conclusion: The integration of psychoanalysis and personality theory can lead to more intuitive and adaptable AI systems with humanoid consciousness, opening new avenues for improving AI interactions in complex cognitive contexts.

Abstract: Human consciousness is still a concept hard to define with current scientific understanding. Although Large Language Models (LLMs) have recently demonstrated significant advancements across various domains including translation and summarization, human consciousness is not something to imitate with current upfront technology owing to so-called hallucination. This study, therefore, proposes a novel approach to address these challenges by integrating psychoanalysis and the Myers-Briggs Type Indicator (MBTI) into constructing consciousness and personality modules. We developed three artificial consciousnesses (self-awareness, unconsciousness, and preconsciousness) based on the principles of psychoanalysis. Additionally, we designed 16 characters with different personalities representing the sixteen MBTI types, with several attributes such as needs, status, and memories. To determine if our model’s artificial consciousness exhibits human-like cognition, we created ten distinct situations considering seven attributes such as emotional understanding and logical thinking. The decision-making process of artificial consciousness and the final action were evaluated in three ways: survey evaluation, three-tier classification via ChatGPT, and qualitative review. Both quantitative and qualitative analyses indicated a high likelihood of well-simulated consciousness, although the difference in response between different characters and consciousnesses was not very significant. This implies that the developed models incorporating elements of psychoanalysis and personality theory can lead to building a more intuitive and adaptable AI system with humanoid consciousness. Therefore, this study contributes to opening up new avenues for improving AI interactions in complex cognitive contexts.

[348] Concise Reasoning in the Lens of Lagrangian Optimization

Chengqian Gao, Haonan Li, Taylor W. Killian, Jianshu She, Renxi Wang, Liqun Ma, Zhoujun Cheng, Shibo Hao, Zhiqiang Xu

Main category: cs.AI

TL;DR: PALU is a concise reasoning method that reduces output length by 65% while improving accuracy by 15% through performance-aware length optimization.

Details

Motivation: Existing concise reasoning approaches rely on hand-crafted heuristics that struggle to balance concision with performance and fail to adapt across domains and model scales.

Method: PALU formulates concise reasoning as a constrained optimization problem and applies Lagrangian optimization with three approximations: off-policy performance estimation, truncated Lagrange multipliers, and quantile-driven length adjustments.

Result: Applied to DeepSeek-Distill-Qwen-1.5B, PALU reduces output length by 65% while improving accuracy by 15% across five benchmarks, outperforming alternative methods and adapting across domains and model scales.

Conclusion: PALU is a practical and effective concise reasoning approach that successfully adapts across domains and model scales while significantly reducing reasoning length and improving performance.

Abstract: Concise reasoning in large language models seeks to generate only essential intermediate steps needed to arrive at a final answer, thereby alleviating issues of overthinking. Most proposed approaches hinge on carefully hand-crafted heuristics, struggling to balance concision with performance, often failing to adapt across domains and model scales. In this work, we address these challenges by introducing a principled and pragmatic strategy, performance-aware length updating (PALU). As a principled algorithm, PALU formulates concise reasoning as a constrained optimization problem, minimizing response length subject to a performance constraint, and then applies Lagrangian optimization to convert it into a tractable unconstrained problem. As a pragmatic solution, PALU streamlines complicated update rules through three approximations: (i) estimating performance with off-policy rollouts, (ii) truncating the Lagrange multiplier to two extremes, and (iii) replacing gradient-based updates with quantile-driven length adjustments. PALU reduces output length by 65% while improving accuracy by 15% when applied to DeepSeek-Distill-Qwen-1.5B, averaged over five benchmarks, outperforming a range of alternative methods. Furthermore, PALU is demonstrated to adapt across both domain (logic, STEM and math) and model scale (1.5B, 7B, 14B) entrenching the algorithm as a practical and effective concise reasoning approach.

[349] Adaptive Dual Reasoner: Large Reasoning Models Can Think Efficiently by Hybrid Reasoning

Yujian Zhang, Keyu Chen, Zhifeng Shen, Ruizhi Qiao, Xing Sun

Main category: cs.AI

TL;DR: ADR is an adaptive dual reasoning model that dynamically switches between fast and slow thinking modes based on contextual complexity, achieving better performance with reduced computational cost.

Details

Motivation: Long Reasoning Models suffer from high computational costs and inference latency due to overthinking, requiring a more efficient approach to reasoning.

Method: Two-stage training: (1) SFT with hybrid reasoning dataset, (2) RL with Entropy-guided Hybrid Policy Optimization (EHPO) using dynamic rollout strategy and difficulty-aware penalty.

Result: Achieves up to 6.1% performance gain while reducing reasoning output length by 49.5% to 59.3% on mathematical reasoning benchmarks.

Conclusion: ADR effectively balances reasoning performance and efficiency, outperforming state-of-the-art approaches in both accuracy and computational efficiency.

Abstract: Although Long Reasoning Models (LRMs) have achieved superior performance on various reasoning scenarios, they often suffer from increased computational costs and inference latency caused by overthinking. To address these limitations, we propose Adaptive Dual Reasoner, which supports two reasoning modes: fast thinking and slow thinking. ADR dynamically alternates between these modes based on the contextual complexity during reasoning. ADR is trained in two stages: (1) A cold-start stage using supervised fine-tuning (SFT) to equip the model with the ability to integrate both fast and slow reasoning modes, in which we construct a hybrid reasoning dataset through a dedicated pipeline to provide large-scale supervision. (2) A reinforcement learning stage for optimizing reasoning effort, where we introduce Entropy-guided Hybrid Policy Optimization EHPO, an RL training framework employing an entropy-guided dynamic rollout strategy for branching at high-entropy units and a difficulty-aware penalty to balance fast and slow reasoning. Across challenging mathematical reasoning benchmarks, ADR achieves an effective balance between reasoning performance and efficiency among state-of-the-art approaches. Specifically, ADR yields a performance gain of up to 6.1%, while reducing the reasoning output length by 49.5% to 59.3%.

[350] DRIFT: Decompose, Retrieve, Illustrate, then Formalize Theorems

Meiru Zhang, Philipp Borchert, Milan Gritta, Gerasimos Lampouras

Main category: cs.AI

TL;DR: DRIFT is a framework that improves mathematical autoformalization by decomposing informal statements into sub-components for better premise retrieval from libraries like Mathlib.

Details

Motivation: LLMs struggle with formalizing mathematical statements due to difficulty identifying prerequisite knowledge and formal representations. Current methods overlook that informal statements are complex with limited context on underlying concepts.

Method: DRIFT decomposes informal mathematical statements into smaller sub-components to enable targeted retrieval of premises from mathematical libraries and retrieves illustrative theorems to help models use premises effectively.

Result: DRIFT consistently improves premise retrieval across benchmarks, nearly doubling F1 score compared to DPR baseline on ProofNet. Shows strong out-of-distribution performance on ConNF with 37.14% and 42.25% BEq+@10 improvements using GPT-4.1 and DeepSeek-V3.1 respectively.

Conclusion: Retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model’s capabilities.

Abstract: Automating the formalization of mathematical statements for theorem proving remains a major challenge for Large Language Models (LLMs). LLMs struggle to identify and utilize the prerequisite mathematical knowledge and its corresponding formal representation in languages like Lean. Current retrieval-augmented autoformalization methods query external libraries using the informal statement directly, but overlook a fundamental limitation: informal mathematical statements are often complex and offer limited context on the underlying math concepts. To address this, we introduce DRIFT, a novel framework that enables LLMs to decompose informal mathematical statements into smaller, more tractable ‘‘sub-components’’. This facilitates targeted retrieval of premises from mathematical libraries such as Mathlib. Additionally, DRIFT retrieves illustrative theorems to help models use premises more effectively in formalization tasks. We evaluate DRIFT across diverse benchmarks (ProofNet, ConNF, and MiniF2F-test) and find that it consistently improves premise retrieval, nearly doubling the F1 score compared to the DPR baseline on ProofNet. Notably, DRIFT demonstrates strong performance on the out-of-distribution ConNF benchmark, with BEq+@10 improvements of 37.14% and 42.25% using GPT-4.1 and DeepSeek-V3.1, respectively. Our analysis shows that retrieval effectiveness in mathematical autoformalization depends heavily on model-specific knowledge boundaries, highlighting the need for adaptive retrieval strategies aligned with each model’s capabilities.

cs.SD

[351] SeeingSounds: Learning Audio-to-Visual Alignment via Text

Simone Carnemolla, Matteo Pennisi, Chiara Russo, Simone Palazzo, Daniela Giordano, Concetto Spampinato

Main category: cs.SD

TL;DR: SeeingSounds is a lightweight framework for audio-to-image generation that uses dual alignment between audio, language, and vision without requiring paired audio-visual data or training visual generative models.

Details

Motivation: The paper is motivated by cognitive neuroscience principles of cross-modal associations in human perception, aiming to create more natural audio-to-visual generation without relying on audio-to-text mappings or treating audio as text substitute.

Method: The method performs dual alignment: projecting audio into semantic language space via frozen language encoder, and contextually grounding into visual domain using vision-language model. It operates on frozen diffusion backbones with lightweight adapters, and uses procedural text prompt generation for fine-grained control.

Result: Extensive experiments show SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing new state-of-the-art in controllable audio-to-visual generation.

Conclusion: The framework enables efficient, scalable learning and interpretable control through audio transformations that translate into descriptive prompts, demonstrating superior performance in audio-to-image generation.

Abstract: We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision-without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder, and, contextually grounded into the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., “a distant thunder”) that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.

[352] Audio-Guided Visual Perception for Audio-Visual Navigation

Yi Wang, Yinfeng Yu, Fuchun Sun, Liejun Wang, Wendong Zheng

Main category: cs.SD

TL;DR: AGVP framework improves audio-visual embodied navigation by transforming sound cues into spatial guidance through cross-modal alignment, enhancing cross-source generalization for unheard sounds.

Details

Motivation: Current AVN methods suffer from poor cross-source generalization, failing with unheard sounds due to memorizing spurious acoustic fingerprint-scenario correlations rather than learning true auditory-visual relationships.

Method: AGVP extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions. It performs temporal modeling and policy optimization with interpretable cross-modal alignment.

Result: AGVP improves navigation efficiency and robustness, achieving superior cross-scenario generalization on previously unheard sounds compared to existing methods.

Conclusion: The proposed cross-modal alignment framework effectively transforms sound into spatial guidance, reducing dependency on specific acoustic fingerprints and enabling better generalization to novel sound sources.

Abstract: Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious \enquote{acoustic fingerprint-scenario} correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.

[353] Serial-Parallel Dual-Path Architecture for Speaking Style Recognition

Guojian Li, Qijie Shao, Zhixian Zhao, Shuiyuan Wang, Zhonghua Fu, Lei Xie

Main category: cs.SD

TL;DR: Proposes a serial-parallel dual-path architecture for Speaking Style Recognition that fuses acoustic and linguistic information, achieving 30.3% accuracy improvement with 88.4% parameter reduction.

Details

Motivation: Existing SSR approaches rely primarily on linguistic information with limited acoustic integration, restricting recognition accuracy improvements. Fusion of acoustic and linguistic modalities offers significant potential to enhance performance.

Method: Novel serial-parallel dual-path architecture: serial path follows ASR+STYLE paradigm for sequential temporal dependency, parallel path integrates Acoustic-Linguistic Similarity Module (ALSM) for cross-modal interaction with temporal simultaneity.

Result: Compared to OSUM baseline: reduces parameter size by 88.4% and achieves 30.3% improvement in SSR accuracy for eight styles on test set.

Conclusion: The proposed dual-path architecture effectively leverages acoustic-linguistic bimodal information, significantly improving SSR performance while reducing model complexity.

Abstract: Speaking Style Recognition (SSR) identifies a speaker’s speaking style characteristics from speech. Existing style recognition approaches primarily rely on linguistic information, with limited integration of acoustic information, which restricts recognition accuracy improvements. The fusion of acoustic and linguistic modalities offers significant potential to enhance recognition performance. In this paper, we propose a novel serial-parallel dual-path architecture for SSR that leverages acoustic-linguistic bimodal information. The serial path follows the ASR+STYLE serial paradigm, reflecting a sequential temporal dependency, while the parallel path integrates our designed Acoustic-Linguistic Similarity Module (ALSM) to facilitate cross-modal interaction with temporal simultaneity. Compared to the existing SSR baseline – the OSUM model, our approach reduces parameter size by 88.4% and achieves a 30.3% improvement in SSR accuracy for eight styles on the test set.

[354] UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

Jinchuan Tian, Sang-gil Lee, Zhifeng Kong, Sreyan Ghosh, Arushi Goel, Chao-Han Huck Yang, Wenliang Dai, Zihan Liu, Hanrong Ye, Shinji Watanabe, Mohammad Shoeybi, Bryan Catanzaro, Rafael Valle, Wei Ping

Main category: cs.SD

TL;DR: UALM is a unified audio language model that combines audio understanding, text-to-audio generation, and multimodal reasoning in a single model, achieving state-of-the-art performance across all three tasks.

Details

Motivation: Current audio language models treat audio understanding and text-to-audio generation as separate tasks, lacking unified multimodal reasoning capabilities essential for advanced AI systems.

Method: Developed UALM-Gen for text-to-audio generation using direct audio token prediction, then created a unified model through data blending, specialized training recipes, and inference techniques. Also introduced UALM-Reason for multimodal reasoning using both text and audio in intermediate thinking steps.

Result: The single UALM model matches state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. UALM-Reason successfully demonstrates cross-modal generative reasoning with effectiveness confirmed by subjective evaluations.

Conclusion: UALM successfully unifies multiple audio tasks in a single model and introduces the first demonstration of cross-modal generative reasoning in audio research, representing a significant advancement toward comprehensive multimodal AI systems.

Abstract: Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks – an essential step toward advanced multimodal reasoning. This paper introduces U}nified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.

[355] Audio Palette: A Diffusion Transformer with Multi-Signal Conditioning for Controllable Foley Synthesis

Junnuo Wang

Main category: cs.SD

TL;DR: Audio Palette is a diffusion transformer model that enables fine-grained acoustic control in text-to-audio synthesis using four time-varying control signals (loudness, pitch, spectral centroid, timbre) while maintaining audio quality and semantic alignment.

Details

Motivation: Address the 'control gap' in open-source text-to-audio synthesis where fine-grained acoustic control remains challenging despite advances in diffusion models.

Method: Extends Stable Audio Open architecture with diffusion transformer (DiT), introduces four time-varying control signals, uses LoRA adaptation on AudioSet subset (0.85% parameters), and implements three-scale classifier-free guidance.

Result: Achieves fine-grained interpretable control of sound attributes while maintaining comparable audio quality (FAD, LAION-CLAP scores) to baseline, enabling precise acoustic manipulation.

Conclusion: Establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling artist-centric workflows with scalable, modular pipeline.

Abstract: Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT) based model that extends the Stable Audio Open architecture to address this “control gap” in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signals: loudness, pitch, spectral centroid, and timbre, for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low-Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85 percent of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine-grained, interpretable control of sound attributes. Crucially, it accomplishes this novel controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Frechet Audio Distance (FAD) and LAION-CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence-based conditioning, memory efficiency, and a three-scale classifier-free guidance mechanism for nuanced inference-time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling a more artist-centric workflow.

[356] TFGA-Net: Temporal-Frequency Graph Attention Network for Brain-Controlled Speaker Extraction

Youhao Si, Yuan Liao, Qiushi Han, Yuhang Yang, Rui Dai, Liya Huang

Main category: cs.SD

TL;DR: Proposes TFGA-Net, a brain-controlled speaker extraction model that uses EEG signals to extract target speech, featuring multi-scale time-frequency EEG features, graph convolutional networks, and MossFormer2 separator.

Details

Motivation: To effectively utilize target-speaker common information between EEG and speech for EEG-driven target speaker extraction, addressing the unresolved problem of how to leverage this relationship.

Method: Uses multi-scale time-frequency EEG features with cortical topological structures, graph convolutional networks and self-attention for EEG encoding, and MossFormer2 (combining MossFormer and RNN-Free Recurrent) as separator to fuse EEG and speech features.

Result: Experimental results on Cocktail Party and KUL datasets show TFGA-Net significantly outperforms state-of-the-art methods in certain objective evaluation metrics.

Conclusion: The proposed TFGA-Net model effectively extracts target speech using EEG signals by leveraging cortical topology, graph networks, and advanced separator architecture, demonstrating superior performance over existing methods.

Abstract: The rapid development of auditory attention decoding (AAD) based on electroencephalography (EEG) signals offers the possibility EEG-driven target speaker extraction. However, how to effectively utilize the target-speaker common information between EEG and speech remains an unresolved problem. In this paper, we propose a model for brain-controlled speaker extraction, which utilizes the EEG recorded from the listener to extract the target speech. In order to effectively extract information from EEG signals, we derive multi-scale time–frequency features and further incorporate cortical topological structures that are selectively engaged during the task. Moreover, to effectively exploit the non-Euclidean structure of EEG signals and capture their global features, the graph convolutional networks and self-attention mechanism are used in the EEG encoder. In addition, to make full use of the fused EEG and speech feature and preserve global context and capture speech rhythm and prosody, we introduce MossFormer2 which combines MossFormer and RNN-Free Recurrent as separator. Experimental results on both the public Cocktail Party and KUL dataset in this paper show that our TFGA-Net model significantly outper-forms the state-of-the-art method in certain objective evaluation metrics. The source code is available at: https://github.com/LaoDa-X/TFGA-NET.

[357] Content Anonymization for Privacy in Long-form Audio

Cristina Aggazzotti, Ashi Garg, Zexin Cai, Nicholas Andrews

Main category: cs.SD

TL;DR: Voice anonymization works for short utterances but fails in long-form audio where multiple utterances reveal speaker identity through vocabulary and style. The paper proposes content anonymization via contextual transcript rewriting to eliminate speaker-specific style while preserving meaning.

Details

Motivation: Current voice anonymization techniques are effective for isolated utterances but insufficient for long-form audio where multiple utterances from the same speaker can reveal identity through vocabulary, syntax, and speaking style, posing significant privacy risks.

Method: The approach uses contextual rewriting of transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. It implements content-based anonymization methods including paraphrasing to defend against content-based attacks.

Result: The study demonstrates the effectiveness of content-based attacks on voice-anonymized speech in long-form telephone conversations, and shows that the proposed content anonymization methods can mitigate this risk while preserving speech utility.

Conclusion: Paraphrasing is an effective defense against content-based attacks, and stakeholders should adopt this approach to ensure anonymity in long-form audio applications.

Abstract: Voice anonymization techniques have been found to successfully obscure a speaker’s acoustic identity in short, isolated utterances in benchmarks such as the VoicePrivacy Challenge. In practice, however, utterances seldom occur in isolation: long-form audio is commonplace in domains such as interviews, phone calls, and meetings. In these cases, many utterances from the same speaker are available, which pose a significantly greater privacy risk: given multiple utterances from the same speaker, an attacker could exploit an individual’s vocabulary, syntax, and turns of phrase to re-identify them, even when their voice is completely disguised. To address this risk, we propose new content anonymization approaches. Our approach performs a contextual rewriting of the transcripts in an ASR-TTS pipeline to eliminate speaker-specific style while preserving meaning. We present results in a long-form telephone conversation setting demonstrating the effectiveness of a content-based attack on voice-anonymized speech. Then we show how the proposed content-based anonymization methods can mitigate this risk while preserving speech utility. Overall, we find that paraphrasing is an effective defense against content-based attacks and recommend that stakeholders adopt this step to ensure anonymity in long-form audio.

[358] AsynFusion: Towards Asynchronous Latent Consistency Models for Decoupled Whole-Body Audio-Driven Avatars

Tianbao Zhang, Jian Zhao, Yuer Li, Zheng Zhu, Ping Hu, Zhaoxin Fan, Wenjun Wu, Xuelong Li

Main category: cs.SD

TL;DR: AsynFusion is a novel framework that uses diffusion transformers to generate synchronized whole-body avatar animations with coordinated facial expressions and gestures, addressing the limitation of independent generation in existing methods.

Details

Motivation: Existing approaches generate audio-driven facial expressions and gestures independently, leading to lack of coordination between facial and gestural elements, resulting in less natural and cohesive animations.

Method: Proposes AsynFusion with dual-branch DiT architecture for parallel generation of facial expressions and gestures, featuring a Cooperative Synchronization Module for bidirectional feature interaction and Asynchronous LCM Sampling to reduce computational overhead.

Result: Achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.

Conclusion: AsynFusion successfully addresses the coordination problem in whole-body avatar animation and enables high-quality, synchronized expression and gesture synthesis.

Abstract: Whole-body audio-driven avatar pose and expression generation is a critical task for creating lifelike digital humans and enhancing the capabilities of interactive virtual agents, with wide-ranging applications in virtual reality, digital entertainment, and remote communication. Existing approaches often generate audio-driven facial expressions and gestures independently, which introduces a significant limitation: the lack of seamless coordination between facial and gestural elements, resulting in less natural and cohesive animations. To address this limitation, we propose AsynFusion, a novel framework that leverages diffusion transformers to achieve harmonious expression and gesture synthesis. The proposed method is built upon a dual-branch DiT architecture, which enables the parallel generation of facial expressions and gestures. Within the model, we introduce a Cooperative Synchronization Module to facilitate bidirectional feature interaction between the two modalities, and an Asynchronous LCM Sampling strategy to reduce computational overhead while maintaining high-quality outputs. Extensive experiments demonstrate that AsynFusion achieves state-of-the-art performance in generating real-time, synchronized whole-body animations, consistently outperforming existing methods in both quantitative and qualitative evaluations.

[359] A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Wendi Sang, Kai Li, Runxuan Yang, Jianqiang Huang, Xiaolin Hu

Main category: cs.SD

TL;DR: Swift-Net is a streaming audio-visual speech separation model that enables real-time processing through causal architecture, lightweight visual feature extraction, and efficient audio-visual fusion.

Details

Motivation: Most existing AVSS methods have complex architectures and rely on future context, making them unsuitable for real-time applications. There's a need for streaming models that can operate causally without future information.

Method: Uses lightweight visual feature extraction, efficient audio-visual fusion module, and Grouped SRUs to integrate historical information across different feature spaces. Also proposes a causal transformation template to convert non-causal models to causal versions.

Result: Experiments on LRS2, LRS3, and VoxCeleb2 datasets showed outstanding performance under causal conditions, demonstrating effectiveness for real-time speech separation.

Conclusion: Swift-Net successfully enables real-time audio-visual speech separation with causal processing, showing potential for processing speech in complex environments through efficient historical information utilization.

Abstract: Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods exhibit complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. Inspired by the pipeline of RTFSNet, we propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs Grouped SRUs to integrate historical information across different feature spaces, thereby improving the utilization efficiency of historical information. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrated that under causal conditions, our proposed Swift-Net exhibited outstanding performance, highlighting the potential of this method for processing speech in complex environments.

[360] Assessing Latency in ASR Systems: A Methodological Perspective for Real-Time Use

Carlos Arriaga, Alejandro Pozo, Javier Conde, Alvaro Alonso

Main category: cs.SD

TL;DR: This paper analyzes ASR systems’ limitations in real-time interpretation, particularly their delay issues, and proposes a new method to measure ASR delay for live interpretation scenarios.

Details

Motivation: ASR systems generate real-time transcriptions but miss important nuances that human interpreters capture, especially in sensitive settings like diplomatic meetings where subtle language is crucial. Human interpreters add critical value by perceiving nuances and adjusting in real time.

Method: The paper proposes a new approach to measuring delay in ASR systems, specifically focusing on user-perceived latency which differs from traditional interpretation timing metrics.

Result: ASR systems introduce delays that don’t align with real-time interpretation needs, and the user-perceived latency (time between speech and transcription delivery) differs from interpretation timing.

Conclusion: A new measurement approach is needed to validate if ASR systems are usable in live interpretation scenarios, as current systems have timing limitations that affect their practical application in real-time interpretation.

Abstract: Automatic speech recognition (ASR) systems generate real-time transcriptions but often miss nuances that human interpreters capture. While ASR is useful in many contexts, interpreters-who already use ASR tools such as Dragon-add critical value, especially in sensitive settings such as diplomatic meetings where subtle language is key. Human interpreters not only perceive these nuances but can adjust in real time, improving accuracy, while ASR handles basic transcription tasks. However, ASR systems introduce a delay that does not align with real-time interpretation needs. The user-perceived latency of ASR systems differs from that of interpretation because it measures the time between speech and transcription delivery. To address this, we propose a new approach to measuring delay in ASR systems and validate if they are usable in live interpretation scenarios.

[361] TISDiSS: A Training-Time and Inference-Time Scalable Framework for Discriminative Source Separation

Yongsheng Feng, Yuetonghui Xu, Jiehui Luo, Hongjia Liu, Xiaobing Li, Feng Yu, Wei Li

Main category: cs.SD

TL;DR: TISDiSS is a scalable source separation framework that enables flexible speed-performance trade-offs through dynamic inference repetitions without retraining, achieving state-of-the-art performance with fewer parameters.

Details

Motivation: Current source separation methods rely on increasingly large networks, which inflate training and deployment costs. The authors aim to create a more efficient framework that provides flexible speed-performance trade-offs.

Method: Proposes TISDiSS framework with early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions that allow adjusting inference depth without retraining.

Result: Achieves state-of-the-art performance on standard speech separation benchmarks with reduced parameter count. Training with more inference repetitions improves shallow-inference performance for low-latency applications.

Conclusion: TISDiSS establishes a scalable and practical framework for adaptive source separation that enables flexible deployment across different computational constraints.

Abstract: Source separation is a fundamental task in speech, music, and audio processing, and it also provides cleaner and larger data for training generative models. However, improving separation performance in practice often depends on increasingly large networks, inflating training and deployment costs. Motivated by recent advances in inference-time scaling for generative modeling, we propose Training-Time and Inference-Time Scalable Discriminative Source Separation (TISDiSS), a unified framework that integrates early-split multi-loss supervision, shared-parameter design, and dynamic inference repetitions. TISDiSS enables flexible speed-performance trade-offs by adjusting inference depth without retraining additional models. We further provide systematic analyses of architectural and training choices and show that training with more inference repetitions improves shallow-inference performance, benefiting low-latency applications. Experiments on standard speech separation benchmarks demonstrate state-of-the-art performance with a reduced parameter count, establishing TISDiSS as a scalable and practical framework for adaptive source separation. Code is available at https://github.com/WingSingFung/TISDiSS.

[362] MRSAudio: A Large-Scale Multimodal Recorded Spatial Audio Dataset with Refined Annotations

Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Xintong Hu, Yu Zhang, Li Tang, Rui Yang, Han Wang, Zongbao Zhang, Yuhan Wang, Yixuan Chen, Hankun Xu, Ke Xu, Pengfei Fan, Zhetao Chen, Yanhao Yu, Qiange Huang, Fei Wu, Zhou Zhao

Main category: cs.SD

TL;DR: MRSAudio is a large-scale multimodal spatial audio dataset with binaural/ambisonic audio, video, motion data, and annotations to advance spatial audio research.

Details

Motivation: Existing multimodal datasets provide only monaural audio, limiting development of spatial audio generation and understanding for immersive technologies like VR/AR.

Method: Created MRSAudio dataset with four components (MRSLife, MRSSpeech, MRSMusic, MRSSing) containing synchronized binaural/ambisonic audio, video, motion trajectories, and fine-grained annotations.

Result: Dataset enables high-quality spatial modeling and supports five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization/detection.

Conclusion: MRSAudio addresses the gap in spatial audio datasets and supports broad spatial audio research for immersive technologies.

Abstract: Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, and spatial text to speech, spatial singing voice synthesis, spatial music generation and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.

[363] ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

Main category: cs.SD

TL;DR: ParsVoice is the largest Persian speech corpus for TTS, created from 2,000 audiobooks using an automated pipeline that produces 1,804 hours of high-quality speech from 470+ speakers.

Details

Motivation: Existing Persian speech datasets are smaller than English counterparts, limiting development of Persian speech technologies.

Method: Automated pipeline transforms raw audiobook content using BERT-based sentence completion detector, binary search boundary optimization for audio-text alignment, and Persian-specific quality assessment frameworks.

Result: Produced 3,526 hours of clean speech, filtered to 1,804-hour high-quality subset. Fine-tuned XTTS achieved MOS of 3.6/5 for naturalness and SMOS of 4.0/5 for speaker similarity.

Conclusion: ParsVoice is the largest high-quality Persian speech dataset with speaker diversity and audio quality comparable to major English corpora, publicly available to accelerate Persian speech technology development.

Abstract: Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech(TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5 demonstrating ParsVoice’s effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

cs.LG

[364] Think as a Doctor: An Interpretable AI Approach for ICU Mortality Prediction

Qingwen Li, Xiaohang Zhao, Xiao Han, Hailiang Huang, Lanjuan Liu

Main category: cs.LG

TL;DR: ProtoDoctor is an interpretable ICU mortality prediction framework that integrates clinical course identification, demographic heterogeneity, and prognostication awareness into its reasoning process, outperforming state-of-the-art methods in both accuracy and clinical interpretability.

Details

Motivation: ICU mortality prediction requires both accuracy and interpretability for clinical trust and regulatory compliance. Current approaches focus mainly on demographic heterogeneity while overlooking clinical course identification and prognostication awareness, which are essential elements of ICU decision-making practices.

Method: ProtoDoctor features two key modules: 1) Prognostic Clinical Course Identification module that identifies clinical courses via prototype learning with a novel regularization mechanism for prognostication awareness, and 2) Demographic Heterogeneity Recognition module that models demographic heterogeneity through cohort-specific prototypes and risk adjustments.

Result: Extensive empirical evaluations show ProtoDoctor outperforms state-of-the-art baselines in predictive accuracy. Human evaluations confirm its interpretations are more clinically meaningful, trustworthy, and applicable in ICU practice.

Conclusion: ProtoDoctor successfully integrates all three key elements of ICU decision-making practices into an intrinsically interpretable framework, achieving superior predictive performance while providing clinically meaningful interpretations that build trust and meet regulatory standards.

Abstract: Intensive Care Unit (ICU) mortality prediction, which estimates a patient’s mortality status at discharge using EHRs collected early in an ICU admission, is vital in critical care. For this task, predictive accuracy alone is insufficient; interpretability is equally essential for building clinical trust and meeting regulatory standards, a topic that has attracted significant attention in information system research. Accordingly, an ideal solution should enable intrinsic interpretability and align its reasoning with three key elements of the ICU decision-making practices: clinical course identification, demographic heterogeneity, and prognostication awareness. However, conventional approaches largely focus on demographic heterogeneity, overlooking clinical course identification and prognostication awareness. Recent prototype learning methods address clinical course identification, yet the integration of the other elements into such frameworks remains underexplored. To address these gaps, we propose ProtoDoctor, a novel ICU mortality prediction framework that delivers intrinsic interpretability while integrating all three elements of the ICU decision-making practices into its reasoning process. Methodologically, ProtoDoctor features two key innovations: the Prognostic Clinical Course Identification module and the Demographic Heterogeneity Recognition module. The former enables the identification of clinical courses via prototype learning and achieves prognostication awareness using a novel regularization mechanism. The latter models demographic heterogeneity through cohort-specific prototypes and risk adjustments. Extensive empirical evaluations demonstrate that ProtoDoctor outperforms state-of-the-art baselines in predictive accuracy. Human evaluations further confirm that its interpretations are more clinically meaningful, trustworthy, and applicable in ICU practice.

[365] GAR: Generative Adversarial Reinforcement Learning for Formal Theorem Proving

Ruida Wang, Jiarui Yao, Rui Pan, Shizhe Diao, Tong Zhang

Main category: cs.LG

TL;DR: GAR is a Generative Adversarial Reinforcement learning framework that jointly trains problem composers and solvers in an adversarial loop, creating an implicit curriculum that improves training efficiency and enables solving more complex theorems.

Details

Motivation: Current methods for training math problem solvers rely on fixed problem sets, leading to inefficient training and limitations in tackling complex problems. There's a need for more dynamic training approaches.

Method: GAR framework trains problem composers and solvers together in an adversarial loop, creating an implicit curriculum learning mechanism that aligns task difficulty with the prover’s evolving capabilities.

Result: GAR-trained models achieved 4.20% average relative improvement in pass@32 on MiniF2F-Test benchmark, and DeepSeek-Prover-V2’s pass@32 on ProofNet-Test increased from 22.58% to 25.81%.

Conclusion: GAR establishes a general RL paradigm for co-evolution of problem generation and solving in verifiable environments, enabling more efficient training and stronger performance on advanced theorems.

Abstract: Solving math problems through verifiable languages such as Lean has significantly impacted both the mathematics and computer science communities. Current state-of-the-art models are often trained with expensive online Reinforcement Learning (RL) or expert iteration. However, these approaches rely on fixed problem sets, which causes inefficient training and limits the model to tackle complex problems. To overcome these limitations, we propose GAR: Generative Adversarial Reinforcement learning, a comprehensive RL training framework that jointly trains the problem composer and solver in an adversarial loop. GAR introduces an implicit curriculum learning mechanism, which aligns task difficulty with the prover’s evolving capability. It thereby improves the training efficiency and enables stronger performance of proving advanced theorems. Experiments show that with GAR training, Goedel-Prover-V2-8B and DeepSeek-Prover-V2-7B achieve an average relative improvement in pass@32 of 4.20% on MiniF2F-Test benchmark, while DeepSeek-Prover-V2’s pass@32 on ProofNet-Test increases from 22.58% to 25.81%. Beyond formal proving, GAR establishes a general RL paradigm for co-evolution of problem generation and solving under verifiable environments.

[366] Combining Euclidean and Hyperbolic Representations for Node-level Anomaly Detection

Simone Mungari, Ettore Ritacco, Pietro Sabatino

Main category: cs.LG

TL;DR: Janus is a framework for node-level anomaly detection that combines Euclidean and Hyperbolic Graph Neural Networks to capture complementary node representations, using contrastive learning to identify anomalies when views are difficult to reconcile.

Details

Motivation: Node-level anomaly detection is challenging due to diverse structural patterns and feature distributions, with applications in fraud detection, cybersecurity, and recommendation systems.

Method: Each node is described by two views (original features and structural features from random walks/degrees), embedded into Euclidean and Hyperbolic spaces using a multi Graph-Autoencoder framework with contrastive learning regularization.

Result: Experiments on four real-world datasets show Janus consistently outperforms shallow and deep baselines.

Conclusion: Combining multiple geometric representations provides a robust and effective approach for identifying subtle and complex anomalies in graphs.

Abstract: Node-level anomaly detection (NAD) is challenging due to diverse structural patterns and feature distributions. As such, NAD is a critical task with several applications which range from fraud detection, cybersecurity, to recommendation systems. We introduce Janus, a framework that jointly leverages Euclidean and Hyperbolic Graph Neural Networks to capture complementary aspects of node representations. Each node is described by two views, composed by the original features and structural features derived from random walks and degrees, then embedded into Euclidean and Hyperbolic spaces. A multi Graph-Autoencoder framework, equipped with a contrastive learning objective as regularization term, aligns the embeddings across the Euclidean and Hyperbolic spaces, highlighting nodes whose views are difficult to reconcile and are thus likely anomalous. Experiments on four real-world datasets show that Janus consistently outperforms shallow and deep baselines, empirically demonstrating that combining multiple geometric representations provides a robust and effective approach for identifying subtle and complex anomalies in graphs.

[367] Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis

Jin Ma, Ying Tan, Renyuan Xu

Main category: cs.LG

TL;DR: The paper introduces a soft-constrained Schrödinger bridge problem (SCSBP) that replaces hard terminal constraints with penalty functions, providing more flexible stochastic control for generative AI applications.

Details

Motivation: Classical Schrödinger bridge problems enforce hard terminal constraints that lead to instability in high-dimensional or data-scarce settings, limiting practical implementation in generative modeling.

Method: Proposes soft-constrained Schrödinger bridge formulation with penalty functions, analyzes existence of optimal solutions using Doob’s h-transform, Schrödinger potentials stability, Gamma-convergence, and a novel fixed-point argument.

Result: Establishes existence of optimal solutions for all penalty levels and proves linear convergence rate of controls and value functions to classical SBP as penalty increases.

Conclusion: The soft-constrained approach enables robust generative modeling, fine-tuning, and transfer learning by providing quantitative convergence guarantees and addressing instability issues of classical methods.

Abstract: Generative AI can be framed as the problem of learning a model that maps simple reference measures into complex data distributions, and it has recently found a strong connection to the classical theory of the Schr"odinger bridge problems (SBPs) due partly to their common nature of interpolating between prescribed marginals via entropy-regularized stochastic dynamics. However, the classical SBP enforces hard terminal constraints, which often leads to instability in practical implementations, especially in high-dimensional or data-scarce regimes. To address this challenge, we follow the idea of the so-called soft-constrained Schr"odinger bridge problem (SCSBP), in which the terminal constraint is replaced by a general penalty function. This relaxation leads to a more flexible stochastic control formulation of McKean-Vlasov type. We establish the existence of optimal solutions for all penalty levels and prove that, as the penalty grows, both the controls and value functions converge to those of the classical SBP at a linear rate. Our analysis builds on Doob’s h-transform representations, the stability results of Schr"odinger potentials, Gamma-convergence, and a novel fixed-point argument that couples an optimization problem over the space of measures with an auxiliary entropic optimal transport problem. These results not only provide the first quantitative convergence guarantees for soft-constrained bridges but also shed light on how penalty regularization enables robust generative modeling, fine-tuning, and transfer learning.

[368] Z0-Inf: Zeroth Order Approximation for Data Influence

Narine Kokhlikyan, Kamalika Chaudhuri, Saeed Mahloujifar

Main category: cs.LG

TL;DR: A highly efficient zeroth-order method for estimating training data influence that uses only loss values and checkpoints, achieving superior accuracy for self-influence and comparable train-test influence estimation with much lower computational cost.

Details

Motivation: Understanding how individual training examples influence model behavior is critical for data selection and model debugging, but existing methods are impractical for large models due to poor accuracy or prohibitive computational costs from gradient and inverse-Hessian computations.

Method: A zeroth-order approximation that relies solely on loss values of intermediate checkpoints on training and test data, along with the checkpoints themselves, making it applicable even for non-differentiable loss functions.

Result: The method achieves superior accuracy in estimating self-influence and comparable or improved accuracy in estimating train-test influence for fine-tuned large language models, with significantly reduced time and memory footprint compared to prior methods.

Conclusion: This approach enables scalable and practical analysis of how training data shapes model behavior, making data influence estimation feasible for large-scale machine learning systems.

Abstract: A critical aspect of analyzing and improving modern machine learning systems lies in understanding how individual training examples influence a model’s predictive behavior. Estimating this influence enables critical applications, including data selection and model debugging; in particular, self-influence, which quantifies the influence of a training point on itself, has found many uses in data quality assessment and outlier detection. Existing methods for measuring data influence, however, are often impractical for large models due to low accuracy or prohibitive computational costs: most approaches either provide poor approximations or rely on gradients and inverse-Hessian computations that remain challenging to scale. In this work, we introduce a highly efficient zeroth-order approximation for estimating the influence of training data that requires only a fraction of the time and memory footprint of prior methods. Notably, our method relies solely on loss values of intermediate checkpoints on the training and test data, along with the checkpoints themselves, making it broadly applicable even when the loss function of interest is non-differentiable. Beyond its computational efficiency, our approach achieves superior accuracy in estimating self-influence and comparable or improved accuracy in estimating train-test influence for fine-tuned large language models, enabling scalable and practical analysis of how training data shapes model behavior.

[369] Don’t Walk the Line: Boundary Guidance for Filtered Generation

Sarah Ball, Andreas Haupt

Main category: cs.LG

TL;DR: Boundary Guidance is a reinforcement learning fine-tuning method that steers generative models away from classifier decision boundaries to improve safety and utility, avoiding the pitfalls of traditional fine-tuning approaches.

Details

Motivation: Traditional fine-tuning approaches that reduce the probability of being filtered by safety classifiers often push models toward classifier decision boundaries, increasing both false positives and false negatives in safety filtering.

Method: Boundary Guidance uses reinforcement learning fine-tuning to explicitly steer generation away from the classifier’s margin, preventing samples from clustering near decision boundaries.

Result: On jailbreak and ambiguous prompt benchmarks, Boundary Guidance improves both safety and utility of outputs according to LLM-as-a-Judge evaluations, with robust performance across model scales and reward designs.

Conclusion: The proposed Boundary Guidance method effectively addresses limitations of traditional safety fine-tuning by avoiding decision boundary clustering, leading to improved safety and utility in generative model outputs.

Abstract: Generative models are increasingly paired with safety classifiers that filter harmful or undesirable outputs. A common strategy is to fine-tune the generator to reduce the probability of being filtered, but this can be suboptimal: it often pushes the model toward producing samples near the classifier’s decision boundary, increasing both false positives and false negatives. We propose Boundary Guidance, a reinforcement learning fine-tuning method that explicitly steers generation away from the classifier’s margin. On a benchmark of jailbreak and ambiguous prompts, Boundary Guidance improves both the safety and the utility of outputs, as judged by LLM-as-a-Judge evaluations. Comprehensive ablations across model scales and reward designs demonstrate the robustness of our approach.

[370] WaveletDiff: Multilevel Wavelet Diffusion For Time Series Generation

Yu-Hsiang Wang, Olgica Milenkovic

Main category: cs.LG

TL;DR: WaveletDiff is a novel diffusion model framework that generates high-quality synthetic time series by training directly on wavelet coefficients, leveraging multi-resolution structure and cross-level attention mechanisms.

Details

Motivation: Large, high-quality time series datasets are scarce, and existing synthetic generation models struggle to reproduce the multi-scaled structure of real-world time series confined to either time or frequency domains.

Method: Trains diffusion models on wavelet coefficients using dedicated transformers for each decomposition level with cross-level attention mechanisms and adaptive gating. Incorporates energy preservation constraints based on Parseval’s theorem to maintain spectral fidelity.

Result: Outperforms state-of-the-art time-domain and frequency-domain generative methods across six real-world datasets from energy, finance, and neuroscience domains, achieving discriminative scores and Context-FID scores that are 3× smaller on average than the second-best baseline.

Conclusion: WaveletDiff provides an effective framework for generating realistic synthetic time series by exploiting multi-resolution structure through wavelet-based diffusion modeling with cross-level attention and spectral fidelity preservation.

Abstract: Time series are ubiquitous in many applications that involve forecasting, classification and causal inference tasks, such as healthcare, finance, audio signal processing and climate sciences. Still, large, high-quality time series datasets remain scarce. Synthetic generation can address this limitation; however, current models confined either to the time or frequency domains struggle to reproduce the inherently multi-scaled structure of real-world time series. We introduce WaveletDiff, a novel framework that trains diffusion models directly on wavelet coefficients to exploit the inherent multi-resolution structure of time series data. The model combines dedicated transformers for each decomposition level with cross-level attention mechanisms that enable selective information exchange between temporal and frequency scales through adaptive gating. It also incorporates energy preservation constraints for individual levels based on Parseval’s theorem to preserve spectral fidelity throughout the diffusion process. Comprehensive tests across six real-world datasets from energy, finance, and neuroscience domains demonstrate that WaveletDiff consistently outperforms state-of-the-art time-domain and frequency-domain generative methods on both short and long time series across five diverse performance metrics. For example, WaveletDiff achieves discriminative scores and Context-FID scores that are $3\times$ smaller on average than the second-best baseline across all datasets.

[371] Balancing Synthetic Data and Replay for Enhancing Task-Specific Capabilities

Urs Spiegelhalter, Jörg K. H. Franke, Frank Hutter

Main category: cs.LG

TL;DR: Comprehensive study on replay ratio optimization for language model adaptation, finding optimal configurations that balance task performance with knowledge retention under computational constraints.

Details

Motivation: Address the fundamental trade-off between learning new capabilities and avoiding catastrophic forgetting during language model adaptation, with unclear optimal replay ratios under computational limitations.

Method: Used bAbI reasoning tasks as target objective, applied synthetic data generation, and systematically evaluated different total token budgets and replay ratio configurations.

Result: Identified optimal replay ratio configurations that balance task-specific performance with general knowledge retention, enabling strong adaptation with reduced training costs.

Conclusion: Provides empirically-grounded guidelines for selecting replay ratios based on computational budget to achieve effective task adaptation while preserving existing knowledge.

Abstract: Adapting language models to new tasks through continued pretraining faces a fundamental trade-off: models must learn new capabilities while avoiding catastrophic forgetting of existing knowledge. While prior work has studied synthetic data generation techniques, the optimal replay ratios for balancing task performance and knowledge retention under computational constraints remain poorly understood. We present a comprehensive empirical study investigating the interplay between replay ratio configuration and computational budget when adapting language models to new tasks. Using the bAbI reasoning tasks as our target objective, we apply synthetic data generation and systematically evaluate different total token budgets and replay ratio configurations. We analyze their effects on both task mastery and general knowledge retention. Our experiments reveal an optimal configuration that balances task-specific performance with general knowledge retention. Based on our findings, we provide empirically-grounded guidelines for selecting replay ratios based on computational budget, enabling practitioners to achieve strong task adaptation with significantly reduced training costs.

[372] Investigating Faithfulness in Large Audio Language Models

Lovenya Jain, Pooneh Mousavi, Mirco Ravanelli, Cem Subakan

Main category: cs.LG

TL;DR: Investigates faithfulness of chain-of-thought (CoT) reasoning in large audio-language models (LALMs) using targeted interventions on reasoning datasets.

Details

Motivation: Faithfulness of CoT explanations is critical for safety-sensitive applications in audio-language models, but prior work shows text-based LLMs often produce unfaithful CoTs, and this hasn't been explored for LALMs where reasoning is more challenging due to audio processing.

Method: Applied targeted interventions including paraphrasing, filler token injection, early answering, and introducing mistakes on two reasoning datasets: SAKURA and MMAR.

Result: Experiments suggest that LALMs generally produce CoTs that appear to be faithful to their underlying decision processes across various datasets and tasks.

Conclusion: Large audio-language models produce faithful chain-of-thought representations that accurately reflect their decision processes, which is important for reliable explanations in safety-critical applications.

Abstract: Faithfulness measures whether chain-of-thought (CoT) representations accurately reflect a model’s decision process and can be used as reliable explanations. Prior work has shown that CoTs from text-based LLMs are often unfaithful. This question has not been explored for large audio-language models (LALMs), where faithfulness is critical for safety-sensitive applications. Reasoning in LALMs is also more challenging, as models must first extract relevant clues from audio before reasoning over them. In this paper, we investigate the faithfulness of CoTs produced by several LALMs by applying targeted interventions, including paraphrasing, filler token injection, early answering, and introducing mistakes, on two challenging reasoning datasets: SAKURA and MMAR. After going through the aforementioned interventions across several datasets and tasks, our experiments suggest that, LALMs generally produce CoTs that appear to be faithful to their underlying decision processes.

[373] Evaluating Open-Source Vision-Language Models for Multimodal Sarcasm Detection

Saroj Basnet, Shafkat Farabi, Tharindu Ranasinghe, Diptesh Kanoji, Marcos Zampieri

Main category: cs.LG

TL;DR: Evaluation of 7 state-of-the-art vision-language models (BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, Qwen-VL) on multimodal sarcasm detection and explanation generation using zero-, one-, and few-shot prompting across three benchmark datasets.

Details

Motivation: Recent advances in open-source vision-language models offer new opportunities for understanding complex multimodal phenomena like sarcasm, which involves visual-textual incongruities.

Method: Evaluated 7 VLMs on three sarcasm datasets (Muse, MMSD2.0, SarcNet) using zero-, one-, and few-shot prompting for both sarcasm detection and explanation generation tasks.

Result: Current models achieve moderate success in binary sarcasm detection but struggle to generate high-quality explanations without task-specific finetuning.

Conclusion: While VLMs show promise for sarcasm detection, they still require improvement in generating human-quality explanations that highlight visual-textual incongruities driving sarcasm.

Abstract: Recent advances in open-source vision-language models (VLMs) offer new opportunities for understanding complex and subjective multimodal phenomena such as sarcasm. In this work, we evaluate seven state-of-the-art VLMs - BLIP2, InstructBLIP, OpenFlamingo, LLaVA, PaliGemma, Gemma3, and Qwen-VL - on their ability to detect multimodal sarcasm using zero-, one-, and few-shot prompting. Furthermore, we evaluate the models’ capabilities in generating explanations to sarcastic instances. We evaluate the capabilities of VLMs on three benchmark sarcasm datasets (Muse, MMSD2.0, and SarcNet). Our primary objectives are twofold: (1) to quantify each model’s performance in detecting sarcastic image-caption pairs, and (2) to assess their ability to generate human-quality explanations that highlight the visual-textual incongruities driving sarcasm. Our results indicate that, while current models achieve moderate success in binary sarcasm detection, they are still not able to generate high-quality explanations without task-specific finetuning.

[374] Padé Approximant Neural Networks for Enhanced Electric Motor Fault Diagnosis Using Vibration and Acoustic Data

Sertac Kilickaya, Levent Eren

Main category: cs.LG

TL;DR: Padé Approximant Neural Networks (PadéNets) outperform conventional CNNs and Self-ONNs in induction motor fault diagnosis using vibration and acoustic data, achieving near-perfect accuracy.

Details

Motivation: To enhance fault diagnosis in induction machines by leveraging the Padé Approximant Neuron model, as deep learning with nonlinear neuron architectures can improve diagnostic performance beyond standard accelerometers and microphones.

Method: Comparative evaluation of three deep learning architectures (1D CNNs, Self-ONNs, PadéNets) on University of Ottawa’s induction motor datasets using vibration and acoustic sensor data. PadéNets introduce enhanced nonlinearity and work with unbounded activation functions like LeakyReLU.

Result: PadéNets consistently outperformed baseline models with diagnostic accuracies of 99.96%, 98.26%, 97.61%, and 98.33% for accelerometers 1, 2, 3, and acoustic sensor respectively.

Conclusion: The enhanced nonlinearity of PadéNets combined with compatibility with unbounded activation functions significantly improves fault diagnosis performance in induction motor condition monitoring.

Abstract: Purpose: The primary aim of this study is to enhance fault diagnosis in induction machines by leveraging the Pad'e Approximant Neuron (PAON) model. While accelerometers and microphones are standard in motor condition monitoring, deep learning models with nonlinear neuron architectures offer promising improvements in diagnostic performance. This research investigates whether Pad'e Approximant Neural Networks (Pad'eNets) can outperform conventional Convolutional Neural Networks (CNNs) and Self-Organized Operational Neural Networks (Self-ONNs) in the diagnosis of electrical and mechanical faults from vibration and acoustic data. Methods: We evaluate and compare the diagnostic capabilities of three deep learning architectures: one-dimensional CNNs, Self-ONNs, and Pad'eNets. These models are tested on the University of Ottawa’s publicly available constant-speed induction motor datasets, which include both vibration and acoustic sensor data. The Pad'eNet model is designed to introduce enhanced nonlinearity and is compatible with unbounded activation functions such as LeakyReLU. Results and Conclusion: Pad'eNets consistently outperformed the baseline models, achieving diagnostic accuracies of 99.96%, 98.26%, 97.61%, and 98.33% for accelerometers 1, 2, 3, and the acoustic sensor, respectively. The enhanced nonlinearity of Pad'eNets, together with their compatibility with unbounded activation functions, significantly improves fault diagnosis performance in induction motor condition monitoring.

[375] Actor-Enriched Time Series Forecasting of Process Performance

Aurelie Leribaux, Rafael Oyamada, Johannes De Smedt, Zahra Dasht Bozorgi, Artem Polyvyanyy, Jochen De Weerdt

Main category: cs.LG

TL;DR: Incorporating actor behavior as time series data improves throughput time forecasting in predictive process monitoring.

Details

Motivation: Processes are resource-driven and understanding actor behavior is crucial for accurate forecasting, but current approaches don't fully utilize actor behavior as time-varying signals.

Method: Constructed multivariate time series including throughput time and actor-centric features (involvement, continuation, interruption, handover behaviors and their durations), then trained and compared multiple models.

Result: Actor-enriched models consistently outperformed baseline models (with only TT features) across RMSE, MAE, and R2 metrics.

Conclusion: Modeling actor behavior over time and incorporating it into forecasting models enhances performance indicator predictions in process mining.

Abstract: Predictive Process Monitoring (PPM) is a key task in Process Mining that aims to predict future behavior, outcomes, or performance indicators. Accurate prediction of the latter is critical for proactive decision-making. Given that processes are often resource-driven, understanding and incorporating actor behavior in forecasting is crucial. Although existing research has incorporated aspects of actor behavior, its role as a time-varying signal in PPM remains limited. This study investigates whether incorporating actor behavior information, modeled as time series, can improve the predictive performance of throughput time (TT) forecasting models. Using real-life event logs, we construct multivariate time series that include TT alongside actor-centric features, i.e., actor involvement, the frequency of continuation, interruption, and handover behaviors, and the duration of these behaviors. We train and compare several models to study the benefits of adding actor behavior. The results show that actor-enriched models consistently outperform baseline models, which only include TT features, in terms of RMSE, MAE, and R2. These findings demonstrate that modeling actor behavior over time and incorporating this information into forecasting models enhances performance indicator predictions.

[376] Improving Knowledge Graph Embeddings through Contrastive Learning with Negative Statements

Rita T. Sousa, Heiko Paulheim

Main category: cs.LG

TL;DR: A novel dual-model architecture for knowledge graph embeddings that integrates explicitly declared negative statements to improve predictive performance by distinguishing between false and unknown triples.

Details

Motivation: Most existing knowledge graph embedding methods rely on Closed World or Local Closed World assumptions, treating missing triples as false, which contradicts the Open World Assumption of real-world knowledge graphs. Explicit negative statements are rarely included and overlooked during training.

Method: Uses a dual-model architecture with two embedding models trained in parallel - one on positive statements and another on negative statements. Each model generates negative samples by corrupting positive samples and selecting the most likely candidates as scored by the other model.

Result: Extensive experiments on general-purpose and domain-specific knowledge graphs show improved predictive performance over state-of-the-art embedding models in link prediction and triple classification tasks.

Conclusion: Integrating meaningful negative knowledge into embedding learning significantly improves predictive performance, demonstrating the value of explicitly incorporating negative statements in knowledge graph embeddings.

Abstract: Knowledge graphs represent information as structured triples and serve as the backbone for a wide range of applications, including question answering, link prediction, and recommendation systems. A prominent line of research for exploring knowledge graphs involves graph embedding methods, where entities and relations are represented in low-dimensional vector spaces that capture underlying semantics and structure. However, most existing methods rely on assumptions such as the Closed World Assumption or Local Closed World Assumption, treating missing triples as false. This contrasts with the Open World Assumption underlying many real-world knowledge graphs. Furthermore, while explicitly stated negative statements can help distinguish between false and unknown triples, they are rarely included in knowledge graphs and are often overlooked during embedding training. In this work, we introduce a novel approach that integrates explicitly declared negative statements into the knowledge embedding learning process. Our approach employs a dual-model architecture, where two embedding models are trained in parallel, one on positive statements and the other on negative statements. During training, each model generates negative samples by corrupting positive samples and selecting the most likely candidates as scored by the other model. The proposed approach is evaluated on both general-purpose and domain-specific knowledge graphs, with a focus on link prediction and triple classification tasks. The extensive experiments demonstrate that our approach improves predictive performance over state-of-the-art embedding models, demonstrating the value of integrating meaningful negative knowledge into embedding learning.

[377] Robust Adversarial Reinforcement Learning in Stochastic Games via Sequence Modeling

Xiaohang Tang, Zhuowen Cheng, Satyabrat Kumar

Main category: cs.LG

TL;DR: CART is the first framework to enhance adversarial robustness of Decision Transformers in stochastic games by formulating stage games with NashQ values, producing policies that are both robust to adversaries and conservative to transition uncertainty.

Details

Motivation: The adversarial robustness of reinforcement learning methods based on sequence modeling (like Decision Transformers) remains largely unexplored, despite their expressive power for sequential decision-making.

Method: Formulates interaction between protagonist and adversary as stage games with payoffs defined as expected maximum value over subsequent states. Conditions Transformer policies on NashQ values derived from these stage games to generate robust and conservative policies.

Result: Empirically achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.

Conclusion: CART successfully enhances the robustness of Decision Transformers in adversarial settings while maintaining conservatism to transition uncertainty, demonstrating effectiveness across various stochastic game environments.

Abstract: The Transformer, a highly expressive architecture for sequence modeling, has recently been adapted to solve sequential decision-making, most notably through the Decision Transformer (DT), which learns policies by conditioning on desired returns. Yet, the adversarial robustness of reinforcement learning methods based on sequence modeling remains largely unexplored. Here we introduce the Conservative Adversarially Robust Decision Transformer (CART), to our knowledge the first framework designed to enhance the robustness of DT in adversarial stochastic games. We formulate the interaction between the protagonist and the adversary at each stage as a stage game, where the payoff is defined as the expected maximum value over subsequent states, thereby explicitly incorporating stochastic state transitions. By conditioning Transformer policies on the NashQ value derived from these stage games, CART generates policy that are simultaneously less exploitable (adversarially robust) and conservative to transition uncertainty. Empirically, CART achieves more accurate minimax value estimation and consistently attains superior worst-case returns across a range of adversarial stochastic games.

[378] ADARL: Adaptive Low-Rank Structures for Robust Policy Learning under Uncertainty

Chenliang Li, Junyu Leng, Jiaxiang Li, Youbang Sun, Shixiang Chen, Shahin Shahrampour, Alfredo Garcia

Main category: cs.LG

TL;DR: AdaRL is a bi-level optimization framework for robust RL that adaptively adjusts policy rank to match task complexity, avoiding min-max optimization while maintaining robustness.

Details

Motivation: Existing robust RL methods use computationally expensive min-max optimization and produce overly conservative policies. There's a need for more efficient approaches that align policy complexity with task requirements.

Method: Bi-level optimization: lower level performs policy optimization under fixed-rank constraints with dynamics from Wasserstein ball; upper level adaptively adjusts rank to balance bias-variance trade-off by projecting parameters onto low-rank manifold.

Result: Outperforms fixed-rank baselines (SAC) and state-of-the-art robust RL methods (RNAC, Parseval) on MuJoCo benchmarks, converging toward intrinsic task rank while maintaining robustness.

Conclusion: Adaptive low-rank policy representations provide an efficient and principled alternative for robust RL under model uncertainty, avoiding over-parameterization while ensuring robustness.

Abstract: Robust reinforcement learning (Robust RL) seeks to handle epistemic uncertainty in environment dynamics, but existing approaches often rely on nested min–max optimization, which is computationally expensive and yields overly conservative policies. We propose \textbf{Adaptive Rank Representation (AdaRL)}, a bi-level optimization framework that improves robustness by aligning policy complexity with the intrinsic dimension of the task. At the lower level, AdaRL performs policy optimization under fixed-rank constraints with dynamics sampled from a Wasserstein ball around a centroid model. At the upper level, it adaptively adjusts the rank to balance the bias–variance trade-off, projecting policy parameters onto a low-rank manifold. This design avoids solving adversarial worst-case dynamics while ensuring robustness without over-parameterization. Empirical results on MuJoCo continuous control benchmarks demonstrate that AdaRL not only consistently outperforms fixed-rank baselines (e.g., SAC) and state-of-the-art robust RL methods (e.g., RNAC, Parseval), but also converges toward the intrinsic rank of the underlying tasks. These results highlight that adaptive low-rank policy representations provide an efficient and principled alternative for robust RL under model uncertainty.

[379] Optimistic Multi-Agent Policy Gradient

Wenshuai Zhao, Yi Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen

Main category: cs.LG

TL;DR: The paper proposes an optimistic update framework for multi-agent policy gradient methods to address relative overgeneralization by clipping advantages to eliminate negative values, preventing premature convergence to suboptimal policies.

Details

Motivation: Relative overgeneralization occurs in cooperative multi-agent learning when agents converge to suboptimal joint policies due to overfitting to other agents' suboptimal behavior. No existing methods address this problem in state-of-the-art multi-agent policy gradient methods.

Method: The authors propose a general framework for optimistic updates in MAPG methods by clipping advantages to eliminate negative values, which prevents agents from quickly converging to local optima. They provide formal analysis showing the method retains optimality at fixed points.

Result: Extensive evaluations on Multi-agent MuJoCo and Overcooked benchmarks show the method outperforms strong baselines on 13 out of 19 tasks and matches performance on the remaining tasks.

Conclusion: The proposed optimistic update framework effectively addresses relative overgeneralization in multi-agent policy gradient methods and demonstrates superior performance across diverse cooperative learning tasks.

Abstract: Relative overgeneralization (RO) occurs in cooperative multi-agent learning tasks when agents converge towards a suboptimal joint policy due to overfitting to suboptimal behavior of other agents. No methods have been proposed for addressing RO in multi-agent policy gradient (MAPG) methods although these methods produce state-of-the-art results. To address this gap, we propose a general, yet simple, framework to enable optimistic updates in MAPG methods that alleviate the RO problem. Our approach involves clipping the advantage to eliminate negative values, thereby facilitating optimistic updates in MAPG. The optimism prevents individual agents from quickly converging to a local optimum. Additionally, we provide a formal analysis to show that the proposed method retains optimality at a fixed point. In extensive evaluations on a diverse set of tasks including the Multi-agent MuJoCo and Overcooked benchmarks, our method outperforms strong baselines on 13 out of 19 tested tasks and matches the performance on the rest.

[380] Integrating Sequential and Relational Modeling for User Events: Datasets and Prediction Tasks

Rizal Fathony, Igor Melnyk, Owen Reinert, Nam H. Nguyen, Daniele Rosa, C. Bayan Bruss

Main category: cs.LG

TL;DR: This paper introduces datasets and a unified framework for modeling both personal and relational user events together, showing improved performance over separate modeling approaches.

Details

Motivation: User events are typically modeled separately as sequences (personal events) or graphs (relational events), but real-world systems need to capture both types together. Prior work rarely considers them jointly due to convenient simplifications.

Method: The authors introduce public datasets, propose a unified formalization that incorporates both personal and relational events, and empirically test models using this combined approach.

Result: Models benefit from incorporating both event types, but current methods still leave significant room for improvement in unified user event modeling.

Conclusion: The paper releases resources to support further research in unified user event modeling and encourages progress in this direction.

Abstract: User event modeling plays a central role in many machine learning applications, with use cases spanning e-commerce, social media, finance, cybersecurity, and other domains. User events can be broadly categorized into personal events, which involve individual actions, and relational events, which involve interactions between two users. These two types of events are typically modeled separately, using sequence-based methods for personal events and graph-based methods for relational events. Despite the need to capture both event types in real-world systems, prior work has rarely considered them together. This is often due to the convenient simplification that user behavior can be adequately represented by a single formalization, either as a sequence or a graph. To address this gap, there is a need for public datasets and prediction tasks that explicitly incorporate both personal and relational events. In this work, we introduce a collection of such datasets, propose a unified formalization, and empirically show that models benefit from incorporating both event types. Our results also indicate that current methods leave a notable room for improvements. We release these resources to support further research in unified user event modeling and encourage progress in this direction.

[381] Variational Mixture of Graph Neural Experts for Alzheimer’s Disease Biomarker Recognition in EEG Brain Networks

Jun-En Ding, Anna Zilverstand, Shihao Yang, Albert Chih-Chieh Yang, Feng Liu

Main category: cs.LG

TL;DR: VMoGE is a variational mixture of graph neural experts that improves dementia diagnosis and staging by integrating frequency-specific biomarker identification with structured variational inference, achieving superior performance over state-of-the-art methods.

Details

Motivation: Existing EEG-based methods for dementia diagnosis are limited by full-band frequency analysis, which hinders precise differentiation of dementia subtypes (AD vs FTD) and severity stages due to overlapping electrophysiological signatures.

Method: VMoGE employs a multi-granularity transformer to extract multi-scale temporal patterns across four frequency bands, followed by a variational graph convolutional encoder using Gaussian Markov Random Field priors. It uses structured variational inference and adaptive gating to link neural specialization to physiologically meaningful EEG frequency bands.

Result: VMoGE achieves superior performance with AUC improvements of +4% to +10% over state-of-the-art methods on two diverse datasets for both subtype classification and severity staging. It also provides interpretable insights through expert weights that correlate with clinical indicators.

Conclusion: VMoGE facilitates EEG biomarker discovery for comprehensive dementia diagnosis and monitoring by providing interpretable insights aligned with neuropathological signatures, enabling better differentiation of dementia subtypes and severity stages.

Abstract: Dementia disorders such as Alzheimer’s disease (AD) and frontotemporal dementia (FTD) exhibit overlapping electrophysiological signatures in EEG that challenge accurate diagnosis. Existing EEG-based methods are limited by full-band frequency analysis that hinders precise differentiation of dementia subtypes and severity stages. We propose a variational mixture of graph neural experts (VMoGE) that integrates frequency-specific biomarker identification with structured variational inference for enhanced dementia diagnosis and staging. VMoGE employs a multi-granularity transformer to extract multi-scale temporal patterns across four frequency bands, followed by a variational graph convolutional encoder using Gaussian Markov Random Field priors. Through structured variational inference and adaptive gating, VMoGE links neural specialization to physiologically meaningful EEG frequency bands. Evaluated on two diverse datasets for both subtype classification and severity staging, VMoGE achieves superior performance with AUC improvements of +4% to +10% over state-of-the-art methods. Moreover, VMoGE provides interpretable insights through expert weights that correlate with clinical indicators and spatial patterns aligned with neuropathological signatures, facilitating EEG biomarker discovery for comprehensive dementia diagnosis and monitoring.

[382] Indoor Localization using Compact, Telemetry-Agnostic, Transfer-Learning Enabled Decoder-Only Transformer

Nayan Sanjay Bhatia, Pranay Kocheta, Russell Elliott, Harikrishna S. Kuttivelil, Katia Obraczka

Main category: cs.LG

TL;DR: Locaris is a decoder-only LLM for indoor Wi-Fi positioning that treats AP measurements as tokens, enabling direct processing of raw Wi-Fi signals without pre-processing. It achieves sub-meter accuracy with minimal calibration and maintains robust performance across different devices and environments.

Details

Motivation: Conventional Wi-Fi positioning methods require labor-intensive calibration and suffer from performance degradation when devices, channels, or deployment conditions change. There's a need for calibration-free, scalable solutions that work across heterogeneous environments.

Method: Fine-tune a decoder-only large language model (LLM) on Wi-Fi datasets, treating each access point measurement as a token. This allows ingestion of raw Wi-Fi telemetry without pre-processing to learn a lightweight mapping from raw signals to device location.

Result: Locaris matches or surpasses state-of-the-art methods across various telemetry types. It achieves sub-meter accuracy with just a few hundred samples, maintains high accuracy with few-shot adaptation to unseen devices and scenarios, and performs robustly under missing APs.

Conclusion: Compact LLMs can serve as effective calibration-free regression models for indoor localization, offering scalable and robust cross-environment performance in heterogeneous Wi-Fi deployments, making them practical for real-world large-scale deployments where extensive calibration is infeasible.

Abstract: Indoor Wi-Fi positioning remains a challenging problem due to the high sensitivity of radio signals to environmental dynamics, channel propagation characteristics, and hardware heterogeneity. Conventional fingerprinting and model-based approaches typically require labor-intensive calibration and suffer rapid performance degradation when devices, channel or deployment conditions change. In this paper, we introduce Locaris, a decoder-only large language model (LLM) for indoor localization. Locaris treats each access point (AP) measurement as a token, enabling the ingestion of raw Wi-Fi telemetry without pre-processing. By fine-tuning its LLM on different Wi-Fi datasets, Locaris learns a lightweight and generalizable mapping from raw signals directly to device location. Our experimental study comparing Locaris with state-of-the-art methods consistently shows that Locaris matches or surpasses existing techniques for various types of telemetry. Our results demonstrate that compact LLMs can serve as calibration-free regression models for indoor localization, offering scalable and robust cross-environment performance in heterogeneous Wi-Fi deployments. Few-shot adaptation experiments, using only a handful of calibration points per device, further show that Locaris maintains high accuracy when applied to previously unseen devices and deployment scenarios. This yields sub-meter accuracy with just a few hundred samples, robust performance under missing APs and supports any and all available telemetry. Our findings highlight the practical viability of Locaris for indoor positioning in the real-world scenarios, particularly in large-scale deployments where extensive calibration is infeasible.

[383] Efficient Restarts in Non-Stationary Model-Free Reinforcement Learning

Hiroshi Nonaka, Simon Ambrozak, Sofia R. Miskala-Dinc, Amedeo Ercole, Aviva Prins

Main category: cs.LG

TL;DR: Proposes three efficient restart paradigms (partial, adaptive, selective) for model-free non-stationary RL to address limitations of existing restart algorithms, achieving up to 91% reduction in dynamic regret.

Details

Motivation: To address two core issues in existing restart-based RL algorithms: complete forgetting (losing all learned information after restarts) and scheduled restarts (predefined timings regardless of policy-environment incompatibility).

Method: Introduces three restart approaches: partial restarts, adaptive restarts, and selective restarts, which modify the RestartQ-UCB and RANDOMIZEDQ algorithms to improve efficiency.

Result: Achieves near-optimal empirical performance across multiple environments, with dynamic regret reduced by up to 91% compared to RestartQ-UCB.

Conclusion: The proposed restart paradigms effectively address the limitations of existing restart mechanisms in non-stationary RL, significantly improving performance through more intelligent restart strategies.

Abstract: In this work, we propose three efficient restart paradigms for model-free non-stationary reinforcement learning (RL). We identify two core issues with the restart design of Mao et al. (2022)’s RestartQ-UCB algorithm: (1) complete forgetting, where all the information learned about an environment is lost after a restart, and (2) scheduled restarts, in which restarts occur only at predefined timings, regardless of the incompatibility of the policy with the current environment dynamics. We introduce three approaches, which we call partial, adaptive, and selective restarts to modify the algorithms RestartQ-UCB and RANDOMIZEDQ (Wang et al., 2025). We find near-optimal empirical performance in multiple different environments, decreasing dynamic regret by up to $91$% relative to RestartQ-UCB.

[384] Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Main category: cs.LG

TL;DR: AT-GRPO is a novel approach that adapts on-policy reinforcement learning (GRPO-style) for multi-agent systems, addressing unique challenges in role-based and turn-based interactions through agent- and turn-wise grouping, achieving substantial performance gains across various tasks.

Details

Motivation: Applying on-policy RL to multi-agent systems (MAS) is underexplored and presents unique challenges where standard GRPO grouping assumptions break down due to varying prompts by role and turn, requiring specialized training systems for MAS workflows.

Method: Proposes AT-GRPO with (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system supporting both single- and multi-policy regimes for MAS workflow rollouts and on-policy updates.

Result: Substantial performance gains across tasks: long-horizon planning accuracy increased from 14.0-47.0% to 96.0-99.5%, coding tasks improved by 3.87-7.62% on average, and math tasks improved by 9.0-17.93% on average.

Conclusion: AT-GRPO successfully adapts on-policy RL to multi-agent systems, overcoming algorithmic and system challenges to deliver significant performance improvements across diverse domains including planning, coding, and mathematical reasoning tasks.

Abstract: Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.

[385] On efficiently computable functions, deep networks and sparse compositionality

Tomaso Poggio

Main category: cs.LG

TL;DR: Efficient Turing computability implies existence of compositionally sparse DAG representations and neural approximants achieving target precision through bounded-fan-in Boolean circuits and deep networks.

Details

Motivation: To establish connections between efficient computability, compositional sparsity in DAG representations, and neural network approximation capabilities for functions computable in polynomial time.

Method: For functions computable in polynomial time in bit-depths, construct bounded-fan-in Boolean circuits of polynomial size/depth computing discretized maps, then replace gates with constant-size neural emulators to form deep networks.

Result: Achieves accuracy ε=2^{-m_out} with deep networks of size/depth poly(n+m_out) for functions computable in polynomial time.

Conclusion: Efficient computability guarantees existence of sparse compositional representations and corresponding neural approximants, relating to compositional approximation rates and hierarchical optimization over sparse structures.

Abstract: We show that \emph{efficient Turing computability} at any fixed input/output precision implies the existence of \emph{compositionally sparse} (bounded-fan-in, polynomial-size) DAG representations and of corresponding neural approximants achieving the target precision. Concretely: if $f:[0,1]^d\to\R^m$ is computable in time polynomial in the bit-depths, then for every pair of precisions $(n,m_{\mathrm{out}})$ there exists a bounded-fan-in Boolean circuit of size and depth $\poly(n+m_{\mathrm{out}})$ computing the discretized map; replacing each gate by a constant-size neural emulator yields a deep network of size/depth $\poly(n+m_{\mathrm{out}})$ that achieves accuracy $\varepsilon=2^{-m_{\mathrm{out}}}$. We also relate these constructions to compositional approximation rates \cite{MhaskarPoggio2016b,poggio_deep_shallow_2017,Poggio2017,Poggio2023HowDS} and to optimization viewed as hierarchical search over sparse structures.

[386] Sculpting Latent Spaces With MMD: Disentanglement With Programmable Priors

Quentin Fruytier, Akshay Malhotra, Shahab Hamidi-Rad, Aditya Sant, Aryan Mokhtari, Sujay Sanghavi

Main category: cs.LG

TL;DR: The paper shows that KL divergence in VAEs fails to enforce disentangled representations and introduces a new Programmable Prior Framework using MMD that achieves state-of-the-art disentanglement without reconstruction trade-offs.

Details

Motivation: Current VAE approaches using KL divergence penalty are unreliable for learning disentangled representations, as they fail to enforce the target distribution on the aggregate posterior.

Method: Introduces Programmable Prior Framework based on Maximum Mean Discrepancy (MMD) that allows explicit control over latent space structure, and proposes Latent Predictability Score (LPS) to quantify entanglement.

Result: Achieves state-of-the-art mutual independence on complex datasets like CIFAR-10 and Tiny ImageNet without reconstruction quality trade-offs, and enables engineering of sophisticated priors for better semantic alignment.

Conclusion: Provides a foundational tool for representation engineering that opens new avenues for model identifiability and causal reasoning, overcoming limitations of traditional KL-based approaches.

Abstract: Learning disentangled representations, where distinct factors of variation are captured by independent latent variables, is a central goal in machine learning. The dominant approach has been the Variational Autoencoder (VAE) framework, which uses a Kullback-Leibler (KL) divergence penalty to encourage the latent space to match a factorized Gaussian prior. In this work, however, we provide direct evidence that this KL-based regularizer is an unreliable mechanism, consistently failing to enforce the target distribution on the aggregate posterior. We validate this and quantify the resulting entanglement using our novel, unsupervised Latent Predictability Score (LPS). To address this failure, we introduce the Programmable Prior Framework, a method built on the Maximum Mean Discrepancy (MMD). Our framework allows practitioners to explicitly sculpt the latent space, achieving state-of-the-art mutual independence on complex datasets like CIFAR-10 and Tiny ImageNet without the common reconstruction trade-off. Furthermore, we demonstrate how this programmability can be used to engineer sophisticated priors that improve alignment with semantically meaningful features. Ultimately, our work provides a foundational tool for representation engineering, opening new avenues for model identifiability and causal reasoning.

[387] Y-shaped Generative Flows

Arip Asadulaev, Semyon Semenov, Abduragim Shtanchaev, Eric Moulines, Fakhri Karray, Martin Takac

Main category: cs.LG

TL;DR: Introduces Y-shaped generative flows that move probability mass along shared pathways before branching, using a novel velocity-powered transport cost with sublinear exponent to reward joint mass movement.

Details

Motivation: Existing continuous-time generative models use V-shaped transport where samples travel independently along straight trajectories, overlooking shared structure in the data.

Method: Y-shaped generative flows based on novel velocity-powered transport cost with sublinear exponent (0-1), instantiated as scalable neural ODE training objective.

Result: Y-flows recover hierarchy-aware structure, improve distributional metrics over flow-based baselines, and reach targets with fewer integration steps on synthetic, image, and biology datasets.

Conclusion: Y-shaped flows provide more efficient and structured transport by leveraging shared pathways before branching to specific endpoints.

Abstract: Modern continuous-time generative models often induce V-shaped transport: each sample travels independently along nearly straight trajectories from prior to data, overlooking shared structure. We introduce Y-shaped generative flows, which move probability mass together along shared pathways before branching to target-specific endpoints. Our formulation is based on novel velocity-powered transport cost with a sublinear exponent (between zero and one). this concave dependence rewards joint and fast mass movement. Practically, we instantiate the idea in a scalable neural ODE training objective. On synthetic, image, and biology datasets, Y-flows recover hierarchy-aware structure, improve distributional metrics over strong flow-based baselines, and reach targets with fewer integration steps.

[388] MosaicDiff: Training-free Structural Pruning for Diffusion Model Acceleration Reflecting Pretraining Dynamics

Bowei Guo, Shengkun Tang, Cong Zeng, Zhiqiang Shen

Main category: cs.LG

TL;DR: MosaicDiff is a novel framework that aligns diffusion pretraining dynamics with post-training sampling acceleration through trajectory-aware structural pruning, achieving significant speed-ups without quality loss.

Details

Motivation: Prior diffusion acceleration methods have overlooked the distinct learning speed phases during pretraining, which exhibit early slow, middle fast, and later slow learning stages.

Method: Trajectory-aware structural pruning that adapts pruning strategy based on pretraining dynamics: conservative pruning for fast-learning middle stage, aggressive pruning for slow-learning early and later stages.

Result: Extensive experiments on DiT and SDXL show significant sampling speed-ups without compromising output quality, outperforming previous state-of-the-art methods by large margins.

Conclusion: The method provides a new viewpoint for efficient and robust training-free diffusion acceleration by harmonizing pretraining dynamics with accelerated sampling.

Abstract: Diffusion models are renowned for their generative capabilities, yet their pretraining processes exhibit distinct phases of learning speed that have been entirely overlooked in prior post-training acceleration efforts in the community. In this study, we introduce a novel framework called MosaicDiff that aligns diffusion pretraining dynamics with post-training sampling acceleration via trajectory-aware structural pruning. Our approach leverages the observation that the middle, fast-learning stage of diffusion pretraining requires more conservative pruning to preserve critical model features, while the early and later, slow-learning stages benefit from a more aggressive pruning strategy. This adaptive pruning mechanism is the first to explicitly mirror the inherent learning speed variations of diffusion pretraining, thereby harmonizing the model’s inner training dynamics with its accelerated sampling process. Extensive experiments on DiT and SDXL demonstrate that our method achieves significant speed-ups in sampling without compromising output quality, outperforming previous state-of-the-art methods by large margins, also providing a new viewpoint for more efficient and robust training-free diffusion acceleration.

[389] QLENS: Towards A Quantum Perspective of Language Transformers

Aditya Gupta, Kirandeep Kaur, Vinayak Gupta

Main category: cs.LG

TL;DR: QLENS: A physics-inspired framework that models Transformers using quantum mechanics concepts, treating latent activations as state vectors in Hilbert space and layers as unitary operators/Hamiltonians.

Details

Motivation: Current Transformer interpretability methods lack mathematical frameworks to model how layers transition between states. Inspired by interdisciplinary approaches and parallels between language models' probabilistic nature and quantum mechanics.

Method: Convert Transformer latent activations to state vectors in Hilbert space derived from output units. Model hidden layers as unitary operators and Hamiltonians. Apply Born rule for final probability distribution.

Result: Proof-of-concept with toy Transformer shows QLENS can probe individual layers’ influence on prediction trajectory, demonstrating potential for mechanistic understanding.

Conclusion: QLENS provides foundation for cross-domain insights from physics to enhance Transformer understanding, bridging interpretability gap through quantum-inspired mathematical framework.

Abstract: In natural language processing, current methods for understanding Transformers are successful at identifying intermediate predictions during a model’s inference. However, these approaches function as limited diagnostic checkpoints, lacking a mathematical framework for mechanistically modeling how each layer facilitates transitions between these evolving states. This interpretability gap and past successes of interdisciplinary outlooks inspire us to turn to physics in search of a descriptive mathematical framework for Transformers. We observe that language models are intrinsically probabilistic, an attribute that is echoed in the core postulates of quantum mechanics. This parallel inspires us to translate insights from this discipline to that of natural language processing. Towards this objective, we propose QLENS a novel attempt to develop a physics-based perspective on the Transformer generation process. Under QLENS, a Transformer is studied by converting its latent activations into a state vector in a Hilbert space derived from the model’s output units. This state subsequently evolves through hidden layers - reformulated as unitary operators and analogously defined Hamiltonians - during inference. The model’s final probability distribution is obtained by applying the Born rule to the end state using a specific measurement operator. To demonstrate QLENS’s potential, we conduct a proof-of-concept by probing a toy Transformer to investigate the influence of individual layers in a model’s prediction trajectory. We present our work as a foundation for cross-domain insights to be leveraged towards a broader understanding of Transformers.

[390] Learning Dynamics of VLM Finetuning

Jusheng Zhang, Kaitong Cai, Jing Yang, Keze Wang

Main category: cs.LG

TL;DR: CW-DPO is a two-stage alignment method that stabilizes VLM training by using gentle negatives in SFT and cooling-weighted DPO to suppress uninformative gradients while preserving hard negative signals.

Details

Motivation: Preference-based finetuning of VLMs is brittle due to trivially wrong negatives injecting uninformative gradients that destabilize training.

Method: Two-stage approach: Stage 1 performs supervised finetuning with gentle negatives (low-weight smoothed supervision). Stage 2 applies DPO with cooling-weighted negative terms scaled by average token log-probability on negatives. Uses on-policy negatives and mixed negatives with dataset negatives for contrast freshness.

Result: More stable optimization, better calibration, higher pairwise win-rates than SFT-only and vanilla DPO, while converging in fewer steps. Cooling-weight mechanism identified as primary driver of gains.

Conclusion: Smoothing learning dynamics before cooling preferences is a simple, general principle for robust VLM alignment.

Abstract: Preference-based finetuning of vision–language models (VLMs) is brittle: trivially wrong negatives inject uninformative gradients that destabilize training. We recast alignment as \textbf{learning-dynamics–aware optimization} and introduce \textbf{Cooling-Weighted DPO (CW-DPO)}, a two-stage recipe that explicitly models and exploits the training trajectory. \textbf{Stage 1} performs supervised finetuning with \textbf{gentle negatives}: \textbf{low-weight smoothed supervision} that regularizes the base policy and curbs overconfidence without explicit penalties. \textbf{Stage 2} applies a DPO objective in which the \textbf{negative term is scaled by a cooling weight} computed from the model’s \textbf{average token log-probability} on each negative, suppressing uninformative gradients from easy or off-distribution samples while preserving signal from hard negatives. In practice, we emphasize \textbf{on-policy negatives} and allow \textbf{mixed negatives} by blending a controllable fraction of dataset negatives to maintain contrast freshness. Throughout, we instrument training with $\Delta!\log p$ probes on positives and negatives as first-class signals for early stopping, curriculum design, and failure diagnosis. Across diverse VLM tasks, CW-DPO yields \textbf{more stable optimization}, \textbf{better calibration}, and \textbf{higher pairwise win-rates} than SFT-only and vanilla DPO, while \textbf{converging in fewer steps}. Ablations isolate the \textbf{cooling-weight mechanism} as the primary driver of these gains and show complementary benefits from mixing on-policy and dataset negatives. Taken together, our results show that \textbf{smoothing learning dynamics before cooling preferences} is a simple, general principle for robust VLM alignment.

[391] Learning by Steering the Neural Dynamics: A Statistical Mechanics Perspective

Mattia Scardecchia

Main category: cs.LG

TL;DR: This paper bridges AI and neuroscience by studying how neural dynamics enable local, distributed learning without backpropagation, revealing phase transitions in network fixed points and proposing a biologically plausible supervised learning algorithm.

Details

Motivation: To understand how biological neural networks achieve robust, sample-efficient learning without backpropagation, and to bridge the gap between contemporary AI and computational neuroscience.

Method: Using statistical mechanics to analyze dynamical attractors in random asymmetric recurrent networks, deriving phase transitions in fixed point structure, and proposing a biologically plausible supervised learning algorithm that maps inputs to fixed points via local plasticity.

Result: Identified critical self-coupling thresholds for phase transitions in fixed point structure, and demonstrated that the proposed algorithm can learn entangled MNIST, leverage depth for hierarchical representations, and achieve hetero-association capacity across various architectures.

Conclusion: The study reveals a strong connection between algorithm performance and phase transitions in network dynamics, suggesting cortex-inspired alternatives to self-couplings for achieving robust, biologically plausible learning.

Abstract: Despite the striking successes of deep neural networks trained with gradient-based optimization, these methods differ fundamentally from their biological counterparts. This gap raises key questions about how nature achieves robust, sample-efficient learning at minimal energy costs and solves the credit-assignment problem without backpropagation. We take a step toward bridging contemporary AI and computational neuroscience by studying how neural dynamics can support fully local, distributed learning that scales to simple machine-learning benchmarks. Using tools from statistical mechanics, we identify conditions for the emergence of robust dynamical attractors in random asymmetric recurrent networks. We derive a closed-form expression for the number of fixed points as a function of self-coupling strength, and we reveal a phase transition in their structure: below a critical self-coupling, isolated fixed points coexist with exponentially many narrow clusters showing the overlap-gap property; above it, subdominant yet dense and extensive clusters appear. These fixed points become accessible, including to a simple asynchronous dynamical rule, after an algorithm-dependent self-coupling threshold. Building on this analysis, we propose a biologically plausible algorithm for supervised learning with any binary recurrent network. Inputs are mapped to fixed points of the dynamics, by relaxing under transient external stimuli and stabilizing the resulting configurations via local plasticity. We show that our algorithm can learn an entangled version of MNIST, leverages depth to develop hierarchical representations and increase hetero-association capacity, and is applicable to several architectures. Finally, we highlight the strong connection between algorithm performance and the unveiled phase transition, and we suggest a cortex-inspired alternative to self-couplings for its emergence.

[392] Nonlinear discretizations and Newton’s method: characterizing stationary points of regression objectives

Conor Rowan

Main category: cs.LG

TL;DR: Exact second-order methods using true Hessian fail in neural network training, challenging conventional wisdom about loss landscapes being full of local minima.

Details

Motivation: To investigate why exact second-order optimization methods fail in neural network training despite theoretical advantages over first-order methods and quasi-Newton approximations.

Method: Analyzed neural network training using exact Hessian-based second-order optimization methods instead of quasi-Newton approximations.

Result: Neural network training reliably fails when using exact curvature information, revealing insights about nonlinear discretization geometry and stationary point distribution.

Conclusion: The findings question the conventional view that neural network loss landscapes are replete with local minima, suggesting different geometric properties.

Abstract: Second-order methods are emerging as promising alternatives to standard first-order optimizers such as gradient descent and ADAM for training neural networks. Though the advantages of including curvature information in computing optimization steps have been celebrated in the scientific machine learning literature, the only second-order methods that have been studied are quasi-Newton, meaning that the Hessian matrix of the objective function is approximated. Though one would expect only to gain from using the true Hessian in place of its approximation, we show that neural network training reliably fails when relying on exact curvature information. The failure modes provide insight both into the geometry of nonlinear discretizations as well as the distribution of stationary points in the loss landscape, leading us to question the conventional wisdom that the loss landscape is replete with local minima.

[393] Mamaba Can Learn Low-Dimensional Targets In-Context via Test-Time Feature Learning

Junsoo Oh, Wei Huang, Taiji Suzuki

Main category: cs.LG

TL;DR: Mamba achieves efficient in-context learning of single-index models through test-time feature extraction, outperforming linear Transformers and matching nonlinear Transformers’ performance.

Details

Motivation: Despite Mamba's empirical success as a linear-time sequence model, there's limited theoretical understanding of its in-context learning mechanisms, particularly for nonlinear target functions.

Method: Theoretical analysis of Mamba’s ICL capability for single-index models, proving that pretrained Mamba can extract relevant feature directions from context examples via its nonlinear gating mechanism.

Result: Mamba achieves test-time sample complexity that improves upon linear Transformers and is comparable to nonlinear Transformers, surpassing the CSQ lower bound and approaching information-theoretically optimal rates.

Conclusion: Mamba’s nonlinear gating mechanism is crucial for feature extraction, enabling both computational efficiency and high performance in in-context learning tasks.

Abstract: Mamba, a recently proposed linear-time sequence model, has attracted significant attention for its computational efficiency and strong empirical performance. However, a rigorous theoretical understanding of its underlying mechanisms remains limited. In this work, we provide a theoretical analysis of Mamba’s in-context learning (ICL) capability by focusing on tasks defined by low-dimensional nonlinear target functions. Specifically, we study in-context learning of a single-index model $y \approx g_*(\langle \boldsymbol{\beta}, \boldsymbol{x} \rangle)$, which depends on only a single relevant direction $\boldsymbol{\beta}$, referred to as feature. We prove that Mamba, pretrained by gradient-based methods, can achieve efficient ICL via test-time feature learning, extracting the relevant direction directly from context examples. Consequently, we establish a test-time sample complexity that improves upon linear Transformers – analyzed to behave like kernel methods – and is comparable to nonlinear Transformers, which have been shown to surpass the Correlational Statistical Query (CSQ) lower bound and achieve near information-theoretically optimal rate in previous works. Our analysis reveals the crucial role of the nonlinear gating mechanism in Mamba for feature extraction, highlighting it as the fundamental driver behind Mamba’s ability to achieve both computational efficiency and high performance.

[394] Your VAR Model is Secretly an Efficient and Explainable Generative Classifier

Yi-Chung Chen, David I. Inouye, Jing Gao

Main category: cs.LG

TL;DR: A novel generative classifier using visual autoregressive (VAR) modeling as an alternative to diffusion-based methods, offering better computational efficiency and unique properties like explainability and resistance to catastrophic forgetting.

Details

Motivation: Current generative classifiers are dominated by diffusion-based models which are computationally expensive and limit understanding of generative classifiers. There's a need for more efficient alternatives with different properties.

Method: Proposes Adaptive VAR Classifier+ (A-VARC+) built on visual autoregressive modeling, which achieves better accuracy-speed trade-off and enables token-wise mutual information for explainability.

Result: VAR-based classifier shows fundamentally different properties from diffusion methods, with tractable likelihood enabling visual explainability and inherent resistance to catastrophic forgetting in incremental learning.

Conclusion: VAR-based generative classifiers offer a promising alternative to diffusion methods with superior computational efficiency, explainability capabilities, and practical advantages for incremental learning scenarios.

Abstract: Generative classifiers, which leverage conditional generative models for classification, have recently demonstrated desirable properties such as robustness to distribution shifts. However, recent progress in this area has been largely driven by diffusion-based models, whose substantial computational cost severely limits scalability. This exclusive focus on diffusion-based methods has also constrained our understanding of generative classifiers. In this work, we propose a novel generative classifier built on recent advances in visual autoregressive (VAR) modeling, which offers a new perspective for studying generative classifiers. To further enhance its performance, we introduce the Adaptive VAR Classifier$^+$ (A-VARC$^+$), which achieves a superior trade-off between accuracy and inference speed, thereby significantly improving practical applicability. Moreover, we show that the VAR-based method exhibits fundamentally different properties from diffusion-based methods. In particular, due to its tractable likelihood, the VAR-based classifier enables visual explainability via token-wise mutual information and demonstrates inherent resistance to catastrophic forgetting in class-incremental learning tasks.

[395] MEASURE: Multi-scale Minimal Sufficient Representation Learning for Domain Generalization in Sleep Staging

Sangmin Jo, Jee Seok Yoon, Wootaek Jeong, Kwanseok Oh, Heung-Il Suk

Main category: cs.LG

TL;DR: Proposes MEASURE framework for domain generalization in sleep staging by reducing domain-relevant information while preserving essential temporal and spectral features across multiple scales.

Details

Motivation: Deep learning models for sleep staging struggle to generalize on unseen subjects due to physiological signal variability. Existing contrastive learning methods fail to adequately extract domain-invariant representations by not addressing domain characteristics in unshared information.

Method: MEASURE (Multi-scalE minimAl SUfficient Representation lEarning) framework that reduces domain-relevant information while preserving essential temporal and spectral features for sleep stage classification across multiple feature levels.

Result: Outperformed state-of-the-art methods on SleepEDF-20 and MASS benchmark datasets in exhaustive experiments.

Conclusion: The proposed MEASURE framework effectively addresses domain generalization in sleep staging by mitigating domain-relevant attributes while leveraging diverse temporal and spectral information from multiple feature levels.

Abstract: Deep learning-based automatic sleep staging has significantly advanced in performance and plays a crucial role in the diagnosis of sleep disorders. However, those models often struggle to generalize on unseen subjects due to variability in physiological signals, resulting in degraded performance in out-of-distribution scenarios. To address this issue, domain generalization approaches have recently been studied to ensure generalized performance on unseen domains during training. Among those techniques, contrastive learning has proven its validity in learning domain-invariant features by aligning samples of the same class across different domains. Despite its potential, many existing methods are insufficient to extract adequately domain-invariant representations, as they do not explicitly address domain characteristics embedded within the unshared information across samples. In this paper, we posit that mitigating such domain-relevant attributes-referred to as excess domain-relevant information-is key to bridging the domain gap. However, the direct strategy to mitigate the domain-relevant attributes often overfits features at the high-level information, limiting their ability to leverage the diverse temporal and spectral information encoded in the multiple feature levels. To address these limitations, we propose a novel MEASURE (Multi-scalE minimAl SUfficient Representation lEarning) framework, which effectively reduces domain-relevant information while preserving essential temporal and spectral features for sleep stage classification. In our exhaustive experiments on publicly available sleep staging benchmark datasets, SleepEDF-20 and MASS, our proposed method consistently outperformed state-of-the-art methods. Our code is available at : https://github.com/ku-milab/Measure

[396] Influence Dynamics and Stagewise Data Attribution

Jin Hwa Lee, Matthew Smith, Maxwell Adam, Jesse Hoogland

Main category: cs.LG

TL;DR: A framework for stagewise data attribution that treats influence as dynamic rather than static, showing influence changes non-monotonically during neural network training stages.

Details

Motivation: Current TDA methods treat influence as static, but neural networks learn in distinct stages with changing influence patterns.

Method: Introduce stagewise data attribution framework grounded in singular learning theory, validated analytically and empirically in toy models and language models.

Result: Influence changes non-monotonically with sign flips and sharp peaks at developmental transitions; dynamic shifts map to progressive learning of semantic hierarchy; demonstrated at scale in language models.

Conclusion: Influence in neural networks is dynamic and changes through distinct learning stages, with token-level influence changes aligning with known developmental stages.

Abstract: Current training data attribution (TDA) methods treat the influence one sample has on another as static, but neural networks learn in distinct stages that exhibit changing patterns of influence. In this work, we introduce a framework for stagewise data attribution grounded in singular learning theory. We predict that influence can change non-monotonically, including sign flips and sharp peaks at developmental transitions. We first validate these predictions analytically and empirically in a toy model, showing that dynamic shifts in influence directly map to the model’s progressive learning of a semantic hierarchy. Finally, we demonstrate these phenomena at scale in language models, where token-level influence changes align with known developmental stages.

[397] GraphShaper: Geometry-aware Alignment for Improving Transfer Learning in Text-Attributed Graphs

Heng Zhang, Tianyi Zhang, Yuling Shi, Xiaodong Gu, Yaomin Shen, Haochen You, Zijian Zhang, Yilei Yuan, Jin Huang

Main category: cs.LG

TL;DR: GraphShaper is a geometry-aware framework that addresses performance degradation at structural boundaries in graph foundation models by using multi-geometric specialization and adaptive fusion of geometric properties.

Details

Motivation: Current graph foundation models using large language models suffer from significant performance degradation (over 20% accuracy loss) at structural boundaries where different topological patterns converge, due to the limitation of encoding all graph structures in a single Euclidean space.

Method: Employs expert networks tailored to different geometric spaces (hyperbolic for tree structures, spherical for cyclic patterns) and dynamically computes fusion weights to adaptively integrate geometric properties based on local structural characteristics before alignment with text embeddings.

Result: Achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings compared to existing methods.

Conclusion: The proposed geometry-aware framework successfully addresses the geometric diversity challenge in graph structures and significantly improves performance at structural boundaries through multi-geometric specialization.

Abstract: Graph foundation models represent a transformative paradigm for learning transferable representations across diverse graph domains. Recent methods leverage large language models to unify graph and text modalities into a shared representation space using contrastive learning. However, systematic evaluations reveal significant performance degradation at structural boundaries where distinct topological patterns converge, with accuracy losses exceeding 20 percentage points. This issue arises from a key limitation: current methods assume all graph structures can be encoded within a single Euclidean space. In reality, tree structures require hyperbolic geometry to preserve hierarchical branching, while cyclic patterns depend on spherical geometry for closure properties. At structural boundaries, nodes experience conflicting geometric constraints that uniform encoding spaces cannot resolve. This raises a crucial challenge: \textbf{Can alignment frameworks be designed to respect the intrinsic geometric diversity of graph structures?} We introduce \textbf{GraphShaper}, a geometry-aware framework that enhances graph encoding through multi-geometric specialization. Our approach employs expert networks tailored to different geometric spaces, dynamically computing fusion weights to adaptively integrate geometric properties based on local structural characteristics. This adaptive fusion preserves structural integrity before alignment with text embeddings. Extensive experiments demonstrate that GraphShaper achieves 9.47% accuracy improvements on citation networks and 7.63% on social networks in zero-shot settings.

[398] H4G: Unlocking Faithful Inference for Zero-Shot Graph Learning in Hyperbolic Space

Heng Zhang, Tianyi Zhang, Zijun Liu, Yuling Shi, Yaomin Shen, Haochen You, Haichuan Hu, Lubin Gan, Jin Huang

Main category: cs.LG

TL;DR: H4G addresses over-abstraction in graph-text alignment by reducing hyperbolic embedding radii to preserve fine-grained structural details, achieving significant improvements in zero-shot learning on both heterophilic and homophilic graphs.

Details

Motivation: Existing methods for text-attributed graphs struggle with fine-grained pattern recognition on heterophilic graphs due to over-abstraction, where multi-scale structural information is compressed into uniform high-level abstractions, causing information loss.

Method: Proposes H4G framework that systematically reduces embedding radii using learnable block-diagonal scaling matrices and Möbius matrix multiplication to restore access to fine-grained patterns while maintaining global receptive ability.

Result: H4G achieves state-of-the-art zero-shot performance with 12.8% improvement on heterophilic graphs and 8.4% on homophilic graphs.

Conclusion: Radius reduction enables faithful multi-scale representation for advancing zero-shot graph learning by preserving fine-grained structural details essential for accurate predictions.

Abstract: Text-attributed graphs are widely used across domains, offering rich opportunities for zero-shot learning via graph-text alignment. However, existing methods struggle with tasks requiring fine-grained pattern recognition, particularly on heterophilic graphs. Through empirical and theoretical analysis, we identify an \textbf{over-abstraction problem}: current approaches operate at excessively large hyperbolic radii, compressing multi-scale structural information into uniform high-level abstractions. This abstraction-induced information loss obscures critical local patterns essential for accurate predictions. By analyzing embeddings in hyperbolic space, we demonstrate that optimal graph learning requires \textbf{faithful preservation} of fine-grained structural details, better retained by representations positioned closer to the origin. To address this, we propose \textbf{H4G}, a framework that systematically reduces embedding radii using learnable block-diagonal scaling matrices and M"obius matrix multiplication. This approach restores access to fine-grained patterns while maintaining global receptive ability with minimal computational overhead. Experiments show H4G achieves state-of-the-art zero-shot performance with \textbf{12.8%} improvement on heterophilic graphs and \textbf{8.4%} on homophilic graphs, confirming that radius reduction enables faithful multi-scale representation for advancing zero-shot graph learning.

[399] Rethinking the Role of Dynamic Sparse Training for Scalable Deep Reinforcement Learning

Guozheng Ma, Lu Li, Zilin Wang, Haoyu Wang, Shengchao Hu, Leszek Rutkowski, Dacheng Tao

Main category: cs.LG

TL;DR: Dynamic sparse training strategies provide module-specific benefits that complement architectural improvements in deep reinforcement learning, leading to the development of Module-Specific Training (MST) framework for substantial scalability gains.

Details

Motivation: Scaling neural networks fails in DRL due to optimization pathologies like plasticity loss, and existing dynamic training approaches have limitations including uniform strategies across modules, limited evaluation on basic architectures, and lack of systematic comparison between different dynamic approaches.

Method: Comprehensive investigation across modules and architectures, comparing different dynamic approaches (sparse-to-sparse, dense-to-sparse, sparse-to-dense), and developing Module-Specific Training (MST) framework that applies module-specific dynamic sparse training strategies.

Result: Dynamic sparse training strategies provide module-specific benefits that complement architectural improvements, and MST framework demonstrates substantial scalability gains across diverse RL algorithms without requiring algorithmic modifications.

Conclusion: Module-specific dynamic training strategies effectively complement architectural improvements in DRL, enabling better scalability through the proposed MST framework that works across various RL algorithms.

Abstract: Scaling neural networks has driven breakthrough advances in machine learning, yet this paradigm fails in deep reinforcement learning (DRL), where larger models often degrade performance due to unique optimization pathologies such as plasticity loss. While recent works show that dynamically adapting network topology during training can mitigate these issues, existing studies have three critical limitations: (1) applying uniform dynamic training strategies across all modules despite encoder, critic, and actor following distinct learning paradigms, (2) focusing evaluation on basic architectures without clarifying the relative importance and interaction between dynamic training and architectural improvements, and (3) lacking systematic comparison between different dynamic approaches including sparse-to-sparse, dense-to-sparse, and sparse-to-dense. Through comprehensive investigation across modules and architectures, we reveal that dynamic sparse training strategies provide module-specific benefits that complement the primary scalability foundation established by architectural improvements. We finally distill these insights into Module-Specific Training (MST), a practical framework that further exploits the benefits of architectural improvements and demonstrates substantial scalability gains across diverse RL algorithms without algorithmic modifications.

[400] Chimera: State Space Models Beyond Sequences

Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu

Main category: cs.LG

TL;DR: Chimera is a unified model that incorporates data topology directly using state space models, eliminating the need for domain-specific biases like position embeddings. It achieves strong performance across language, vision, and graph domains with algorithmic optimizations for efficiency.

Details

Motivation: Transformer methods ignore neighborhood structure and require task-specific inductive biases (position embeddings, random walks) which require significant effort and can hinder generalization. There's a need for a principled way to incorporate data topology without domain-specific biases.

Method: Generalize state space models to capture any graph topology. Use algorithmic optimizations: linear-time recurrence for Directed Acyclic Graphs and mathematical relaxation for general graphs to achieve quadratic complexity.

Result: Outperforms BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on Long Range Graph Benchmark. Achieves strong performance across language, vision, and graph domains.

Conclusion: Chimera validates that data topology is a powerful inductive bias across modalities and provides a unified approach that eliminates the need for domain-specific biases while maintaining efficiency through algorithmic optimizations.

Abstract: Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases–such as position embeddings in sequences and images, or random walks in graphs–to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models–which naturally do not require position embeddings–can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera’s efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation achieves Transformer’s quadratic complexity without domain-specific heuristics. These results validate Chimera’s core contribution and support the idea that data topology is a powerful inductive bias across modalities.

[401] nuGPR: GPU-Accelerated Gaussian Process Regression with Iterative Algorithms and Low-Rank Approximations

Ziqi Zhao, Vivek Sarin

Main category: cs.LG

TL;DR: nuGPR is a new framework that accelerates Gaussian Process Regression training using numerical linear algebra techniques, reducing computation time by 2x and memory consumption by 12x compared to existing GPU implementations.

Details

Motivation: To address the high computational cost and memory requirements of traditional GPR training, which limits its practical application despite its inherent uncertainty measurement capabilities.

Method: Combines preconditioned conjugate gradient for faster linear solves, exploits data clustering for block-diagonal covariance matrix structure with low-rank off-diagonal approximations, uses numerical gradients instead of exact differentiation for hyperparameter optimization, and leverages CUDA for GPU parallelization.

Result: Achieves up to 2x reduction in total training time and up to 12x reduction in peak memory consumption across various synthetic and real-world datasets compared to the best existing GPU-based GPR implementation.

Conclusion: nuGPR provides an efficient end-to-end training algorithm for GPR that significantly reduces computational costs while maintaining model performance, making GPR more practical for real-world applications.

Abstract: Gaussian Process Regression (GPR) is an important type of supervised machine learning model with inherent uncertainty measure in its predictions. We propose a new framework, nuGPR, to address the well-known challenge of high computation cost associated with GPR training. Our framework includes several ideas from numerical linear algebra to reduce the amount of computation in key steps of GPR, and we combine them to establish an end-to-end training algorithm. Specifically, we leverage the preconditioned conjugate gradient method to accelerate the convergence of the linear solves required in GPR. We exploit clustering in the input data to identify block-diagonal structure of the covariance matrix and subsequently construct low-rank approximations of the off-diagonal blocks. These enhancements significantly reduce the time and space complexity of our computations. In addition, unlike other frameworks that rely on exact differentiation, we employ numerical gradients to optimize the hyperparameters of our GPR model, further reducing the training cost by eliminating the need for backpropagation. Lastly, we leverage the CUDA Toolkit to efficiently parallelize the training procedure on NVIDIA GPUs. As a result, nuGPR reduces total training time by up to 2x and peak memory consumption by up to 12x on various synthetic and real-world datasets when compared to the best existing GPU-based GPR implementation.

[402] Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration

Yonghao Liu, Yajun Wang, Chunli Guo, Wei Pang, Ximing Li, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan

Main category: cs.LG

TL;DR: GRACE is a graph few-shot learning framework that addresses limitations of fixed spectral operations and distribution mismatches between support and query sets through adaptive spectrum experts and cross-set distribution calibration.

Details

Motivation: Current graph few-shot learning methods use predefined graph filters that don't account for local topological heterogeneity, and assume support and query sets come from the same distribution, which leads to suboptimal generalization with limited labeled data.

Method: Proposes GRACE framework with two key components: adaptive spectrum experts to handle local structural variations, and cross-set distribution calibration techniques to address distribution mismatches between support and query sets.

Result: GRACE consistently outperforms state-of-the-art baselines across various experimental settings, demonstrating improved generalization capabilities.

Conclusion: The proposed approach effectively addresses key limitations in graph few-shot learning by adapting to local structural variations and performing cross-set distribution calibration, leading to superior performance.

Abstract: Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.

[403] Fairness-Constrained Optimization Attack in Federated Learning

Harsh Kasyap, Minghong Fang, Zhuqing Liu, Carsten Maple, Somanath Tripathy

Main category: cs.LG

TL;DR: This paper proposes an intentional fairness attack in federated learning where a malicious client sends biased models by increasing fairness loss while maintaining global accuracy, making the attack hard to detect.

Details

Motivation: Federated learning's privacy-preserving nature makes it susceptible to poisoning attacks and bias propagation, even with homogeneous data distribution, due to participants' independence over training data.

Method: The attack involves a client maliciously sending biased models by solving an optimization problem to increase fairness loss for metrics like demographic parity and equalized odds, while maintaining global accuracy.

Result: Empirical evaluation shows the attack increases bias up to 90% even with a single malicious client, and remains effective against state-of-the-art Byzantine-robust and fairness-aware aggregation schemes across various datasets.

Conclusion: The proposed fairness attack is insidious and hard to detect, demonstrating significant vulnerability in federated learning systems despite existing defense mechanisms.

Abstract: Federated learning (FL) is a privacy-preserving machine learning technique that facilitates collaboration among participants across demographics. FL enables model sharing, while restricting the movement of data. Since FL provides participants with independence over their training data, it becomes susceptible to poisoning attacks. Such collaboration also propagates bias among the participants, even unintentionally, due to different data distribution or historical bias present in the data. This paper proposes an intentional fairness attack, where a client maliciously sends a biased model, by increasing the fairness loss while training, even considering homogeneous data distribution. The fairness loss is calculated by solving an optimization problem for fairness metrics such as demographic parity and equalized odds. The attack is insidious and hard to detect, as it maintains global accuracy even after increasing the bias. We evaluate our attack against the state-of-the-art Byzantine-robust and fairness-aware aggregation schemes over different datasets, in various settings. The empirical results demonstrate the attack efficacy by increasing the bias up to 90%, even in the presence of a single malicious client in the FL system.

[404] Budget-constrained Active Learning to Effectively De-censor Survival Data

Ali Parsaee, Bei Jiang, Zachary Friggstad, Russell Greiner

Main category: cs.LG

TL;DR: This paper explores budgeted learning for survival datasets with censored instances, where learners can pay to acquire partial or complete labels for censored data, providing theoretical bounds and empirical results that outperform other approaches.

Details

Motivation: Standard supervised learning assumes fully labeled datasets, but in survival analysis with censored data, acquiring complete labels is costly. This work addresses the practical need for budgeted learning where limited resources can be used to strategically acquire additional information from censored instances.

Method: The approach extends budgeted learning to survival datasets with right-censored instances, allowing learners to pay for partial labeling (e.g., acquiring actual time-to-event or additional time information). It provides theoretical bounds and algorithms with time complexity equivalent to BatchBALD active learning method.

Result: Empirical analysis on multiple survival tasks demonstrates that the proposed model performs better than other potential approaches on several benchmarks. The method provides bounds and time complexity that are asymptotically equivalent to standard active learning methods.

Conclusion: The paper successfully adapts budgeted learning to survival data with censored instances, offering both theoretical guarantees and practical performance improvements over alternative methods, making it suitable for real-world scenarios where follow-up data collection is constrained by budget limitations.

Abstract: Standard supervised learners attempt to learn a model from a labeled dataset. Given a small set of labeled instances, and a pool of unlabeled instances, a budgeted learner can use its given budget to pay to acquire the labels of some unlabeled instances, which it can then use to produce a model. Here, we explore budgeted learning in the context of survival datasets, which include (right) censored instances, where we know only a lower bound on an instance’s time-to-event. Here, that learner can pay to (partially) label a censored instance – e.g., to acquire the actual time for an instance [perhaps go from (3 yr, censored) to (7.2 yr, uncensored)], or other variants [e.g., learn about one more year, so go from (3 yr, censored) to either (4 yr, censored) or perhaps (3.2 yr, uncensored)]. This serves as a model of real world data collection, where follow-up with censored patients does not always lead to uncensoring, and how much information is given to the learner model during data collection is a function of the budget and the nature of the data itself. We provide both experimental and theoretical results for how to apply state-of-the-art budgeted learning algorithms to survival data and the respective limitations that exist in doing so. Our approach provides bounds and time complexity asymptotically equivalent to the standard active learning method BatchBALD. Moreover, empirical analysis on several survival tasks show that our model performs better than other potential approaches on several benchmarks.

[405] Self-Verifying Reflection Helps Transformers with CoT Reasoning

Zhongwei Yu, Wannian Xia, Xue Yan, Bo Xu, Haifeng Zhang, Yali Du, Jun Wang

Main category: cs.LG

TL;DR: A minimalistic reasoning framework enables small transformers to perform self-verifying reflection, achieving LLM-level performance in tasks like integer multiplication and Sudoku without natural language.

Details

Motivation: To understand how reflection contributes to empirical improvements in reasoning, given that LLMs detect limited errors in chain-of-thoughts.

Method: Developed a minimalistic reasoning framework supporting basic self-verifying reflection for small transformers without natural language, with theoretical guarantees on improvement when verification errors are bounded.

Result: Tiny transformers with only a few million parameters benefited from self-verification in training and execution, reaching remarkable LLM-level performance. RL improved in-distribution performance but optimized shallow patterns without reducing verification errors.

Conclusion: Integrating generative transformers with discriminative verification inherently facilitates chain-of-thought reasoning, regardless of scaling and natural language.

Abstract: Advanced large language models (LLMs) frequently reflect in reasoning chain-of-thoughts (CoTs), where they self-verify the correctness of current solutions and explore alternatives. However, given recent findings that LLMs detect limited errors in CoTs, how reflection contributes to empirical improvements remains unclear. To analyze this issue, in this paper, we present a minimalistic reasoning framework to support basic self-verifying reflection for small transformers without natural language, which ensures analytic clarity and reduces the cost of comprehensive experiments. Theoretically, we prove that self-verifying reflection guarantees improvements if verification errors are properly bounded. Experimentally, we show that tiny transformers, with only a few million parameters, benefit from self-verification in both training and reflective execution, reaching remarkable LLM-level performance in integer multiplication and Sudoku. Similar to LLM results, we find that reinforcement learning (RL) improves in-distribution performance and incentivizes frequent reflection for tiny transformers, yet RL mainly optimizes shallow statistical patterns without faithfully reducing verification errors. In conclusion, integrating generative transformers with discriminative verification inherently facilitates CoT reasoning, regardless of scaling and natural language.

[406] Revisiting Meta-Learning with Noisy Labels: Reweighting Dynamics and Theoretical Guarantees

Yiming Zhang, Chester Holtz, Gal Mishne, Alex Cloninger

Main category: cs.LG

TL;DR: The paper provides a theoretical analysis of meta-reweighting methods for learning with noisy labels, revealing a three-phase training trajectory and proposing a lightweight surrogate method that avoids expensive bi-level optimization.

Details

Motivation: Meta-learning-based sample reweighting helps mitigate label noise by using a small clean subset, but its behavior and training dynamics lack theoretical understanding, limiting its practical application.

Method: The authors conduct rigorous theoretical analysis of meta-reweighting under label noise, identify three training phases, and propose a lightweight surrogate method with mean-centering, row shifting, and label-signed modulation.

Result: The analysis reveals that meta-reweighting training unfolds in three phases: alignment, filtering, and post-filtering. The proposed lightweight method consistently outperforms strong reweighting/selection baselines across synthetic and real noisy-label benchmarks.

Conclusion: Meta-reweighting’s effectiveness stems from similarity-weighted coupling between training and clean subset signals, but loses discriminatory power in post-filtering regime. The proposed lightweight surrogate provides more stable performance while avoiding expensive bi-level optimization.

Abstract: Learning with noisy labels remains challenging because over-parameterized networks memorize corrupted supervision. Meta-learning-based sample reweighting mitigates this by using a small clean subset to guide training, yet its behavior and training dynamics lack theoretical understanding. We provide a rigorous theoretical analysis of meta-reweighting under label noise and show that its training trajectory unfolds in three phases: (i) an alignment phase that amplifies examples consistent with a clean subset and suppresses conflicting ones; (ii) a filtering phase driving noisy example weights toward zero until the clean subset loss plateaus; and (iii) a post-filtering phase in which noise filtration becomes perturbation-sensitive. The mechanism is a similarity-weighted coupling between training and clean subset signals together with clean subset training loss contraction; in the post-filtering regime where the clean-subset loss is sufficiently small, the coupling term vanishes and meta-reweighting loses discriminatory power. Guided by this analysis, we propose a lightweight surrogate for meta-reweighting that integrates mean-centering, row shifting, and label-signed modulation, yielding more stable performance while avoiding expensive bi-level optimization. Across synthetic and real noisy-label benchmarks, our method consistently outperforms strong reweighting/selection baselines.

[407] DE3S: Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification

Tao Xie, Zexi Tan, Haoyi Xiao, Binbin Sun, Yiqun Zhang

Main category: cs.LG

TL;DR: DE3S is a novel framework for medical early time-series classification that addresses the accuracy-earliness trade-off through dual-enhancement strategy, soft shapelet sparsification, and dual-path architecture with MoE and Inception modules.

Details

Motivation: Early time-series classification in medical applications like sepsis prediction is crucial but faces challenges of weak initial signals, class imbalance, and the conflicting goals of accuracy vs earliness, where existing methods often sacrifice one for the other.

Method: Proposes Dual-Enhanced Soft-Sparse-Shape Learning with three innovations: (1) dual-enhancement combining temporal augmentation and attention-based global enhancement, (2) attention-score-based soft shapelet sparsification, and (3) dual-path MoE and Inception modules fusion architecture for local and global pattern learning.

Result: Extensive experiments on six real-world medical datasets show state-of-the-art performance, with ablation studies confirming the efficacy of each component.

Conclusion: DE3S successfully addresses the accuracy-earliness trade-off in medical ETSC through precise shapelet discovery and robust representation learning, demonstrating superior performance on real-world medical datasets.

Abstract: Early time-series classification (ETSC) in medical applications is crucial for time-sensitive scenarios such as sepsis prediction in intensive care units (ICUs), where a large number of deaths are caused by delayed prediction. ETSC can significantly improve ICU resource utilization efficiency and healthcare precision. However, it faces conflicting goals of accuracy and earliness, with existing methods often trading one for the other, struggling to capture subtle early-stage patterns due to weak initial signals and class imbalance. The key to solve these challenges is to find shapelets, which are discriminative subsequences (or shapes) with high interpretability in time-series classification. This paper proposes Dual-Enhanced Soft-Sparse-Shape Learning for Medical Early Time-Series Classification (DE3S), which introduces a novel Dual-Enhanced Soft-Shape Learning framework to figure out shapelets precisely through three innovations: (1) a comprehensive dual-enhancement strategy combines traditional temporal augmentation with attention-based global temporal enhancement for robust representation learning, (2) an attention-score-based soft shapelet sparsification mechanism dynamically preserves discriminative patterns while aggregating less important shapelets into representative tokens, and (3) a dual-path Mixture of Experts Network (MoE) and Inception modules fusion architecture where MoE performs local learning within shapelets and multi-scale Inception modules capture global patterns across shapelets. The framework employs weighted cross-entropy loss for class imbalance handling and demonstrates robustness on subject-consistency datasets. Extensive experiments on six real-world medical datasets show state-of-the-art performance, with ablation studies confirming component efficacy.

[408] Hierarchical Koopman Diffusion: Fast Generation with Interpretable Diffusion Trajectory

Hanru Bai, Weiyang Ding, Difan Zou

Main category: cs.LG

TL;DR: Hierarchical Koopman Diffusion enables one-step image generation while maintaining interpretable generative trajectories through Koopman operator theory and hierarchical architecture.

Details

Motivation: To resolve the trade-off between fast sampling and interpretability in diffusion models - current one-step methods sacrifice the interpretability and fine-grained control that iterative diffusion models provide.

Method: Uses Koopman operator theory to lift nonlinear diffusion dynamics into a latent space with globally linear operators, enabling closed-form trajectory solutions. Implements a hierarchical architecture with scale-specific Koopman subspaces to disentangle generative dynamics across spatial resolutions.

Result: Achieves competitive one-step generation performance while providing full access to intermediate states for manual intervention during generation. Enables spectral analysis for interpreting and manipulating the generative process.

Conclusion: Bridges the gap between fast sampling and interpretability in diffusion models, paving the way for explainable image synthesis in generative modeling.

Abstract: Diffusion models have achieved impressive success in high-fidelity image generation but suffer from slow sampling due to their inherently iterative denoising process. While recent one-step methods accelerate inference by learning direct noise-to-image mappings, they sacrifice the interpretability and fine-grained control intrinsic to diffusion dynamics, key advantages that enable applications like editable generation. To resolve this dichotomy, we introduce \textbf{Hierarchical Koopman Diffusion}, a novel framework that achieves both one-step sampling and interpretable generative trajectories. Grounded in Koopman operator theory, our method lifts the nonlinear diffusion dynamics into a latent space where evolution is governed by globally linear operators, enabling closed-form trajectory solutions. This formulation not only eliminates iterative sampling but also provides full access to intermediate states, allowing manual intervention during generation. To model the multi-scale nature of images, we design a hierarchical architecture that disentangles generative dynamics across spatial resolutions via scale-specific Koopman subspaces, capturing coarse-to-fine details systematically. We empirically show that the Hierarchical Koopman Diffusion not only achieves competitive one-step generation performance but also provides a principled mechanism for interpreting and manipulating the generative process through spectral analysis. Our framework bridges the gap between fast sampling and interpretability in diffusion models, paving the way for explainable image synthesis in generative modeling.

[409] Unveiling the Vulnerability of Graph-LLMs: An Interpretable Multi-Dimensional Adversarial Attack on TAGs

Bowen Fan, Zhilin Guo, Xunkai Li, Yihan Zhou, Bing Zhou, Zhenjun Li, Rong-Hua Li, Guoren Wang

Main category: cs.LG

TL;DR: IMDGA is a human-centric adversarial attack framework that unifies structural and textual perturbations on Graph-LLMs, offering superior interpretability and effectiveness.

Details

Motivation: Graph-LLMs integrate structural and textual data but lack comprehensive vulnerability analysis; existing attacks target only one aspect, leaving multi-dimensional vulnerabilities unexplored.

Method: IMDGA uses three integrated modules to orchestrate multi-level perturbations across graph structure and textual features, balancing interpretability and attack impact.

Result: IMDGA demonstrates superior interpretability, attack effectiveness, stealthiness, and robustness across diverse datasets and architectures compared to existing methods.

Conclusion: The work exposes critical semantic vulnerabilities in Graph-LLMs, providing insights for improving their resilience through comprehensive multi-dimensional attack analysis.

Abstract: Graph Neural Networks (GNNs) have become a pivotal framework for modeling graph-structured data, enabling a wide range of applications from social network analysis to molecular chemistry. By integrating large language models (LLMs), text-attributed graphs (TAGs) enhance node representations with rich textual semantics, significantly boosting the expressive power of graph-based learning. However, this sophisticated synergy introduces critical vulnerabilities, as Graph-LLMs are susceptible to adversarial attacks on both their structural topology and textual attributes. Although specialized attack methods have been designed for each of these aspects, no work has yet unified them into a comprehensive approach. In this work, we propose the Interpretable Multi-Dimensional Graph Attack (IMDGA), a novel human-centric adversarial attack framework designed to orchestrate multi-level perturbations across both graph structure and textual features. IMDGA utilizes three tightly integrated modules to craft attacks that balance interpretability and impact, enabling a deeper understanding of Graph-LLM vulnerabilities. Through rigorous theoretical analysis and comprehensive empirical evaluations on diverse datasets and architectures, IMDGA demonstrates superior interpretability, attack effectiveness, stealthiness, and robustness compared to existing methods. By exposing critical weaknesses in TAG representation learning, this work uncovers a previously underexplored semantic dimension of vulnerability in Graph-LLMs, offering valuable insights for improving their resilience. Our code and resources are publicly available at https://anonymous.4open.science/r/IMDGA-7289.

Tao Yin, Xiaohong Zhang, Jiacheng Zhang, Li Huang, Zhibin Zhang, Yuansong Zeng, Jin Xie, Meng Yan

Main category: cs.LG

TL;DR: MoRA introduces instance-specific parameter space alignment for molecular graphs in LLMs, using dynamic low-rank adaptation weights for each molecule instead of static fine-tuning.

Details

Motivation: Existing methods process molecular structures through static adapters or fine-tuning, which limits capture of instance-specific features and causes catastrophic forgetting of LLM capabilities.

Method: Molecule-aware Low-Rank Adaptation (MoRA) generates unique low-rank adaptation weights for each input molecular graph and dynamically injects them into a frozen LLM.

Result: MoRA outperforms static baselines with 14.1% relative improvement in reaction prediction exact match and 22% error reduction in quantum property prediction.

Conclusion: Instance-specific dynamic adaptation preserves LLM knowledge while enabling better molecular structure reasoning, showing superiority over static adaptation approaches.

Abstract: Effectively integrating molecular graph structures with Large Language Models (LLMs) is a key challenge in drug discovery. Most existing multi-modal alignment methods typically process these structures by fine-tuning the LLM or adding a static adapter simultaneously. However, these approaches have two main limitations: (1) it optimizes a shared parameter space across all molecular inputs, limiting the model’s ability to capture instance-specific structural features; and (2) fine-tuning the LLM for molecular tasks can lead to catastrophic forgetting, undermining its general reasoning capabilities. In this paper, instead of static task-oriented adaptation, we propose an instance-specific parameter space alignment approach for each molecule on-the-fly. To this end, we introduce Molecule-aware Low-Rank Adaptation (MoRA) that produces a unique set of low-rank adaptation weights for each input molecular graph. These weights are then dynamically injected into a frozen LLM, allowing the model to adapt its reasoning to the structure of each molecular input, while preserving the LLM’s core knowledge. Extensive experiments demonstrate that on key molecular tasks, such as chemical reaction prediction and molecular captioning, MoRA’s instance-specific dynamic adaptation outperforms statically adapted baselines, including a 14.1% relative improvement in reaction prediction exact match and a 22% reduction in error for quantum property prediction. The code is available at https://github.com/jk-sounds/MoRA.

[411] Optimal Regularization for Performative Learning

Edwige Cyffers, Alireza Mirrokni, Marco Mondelli

Main category: cs.LG

TL;DR: Regularization in high-dimensional ridge regression can help mitigate performative effects where data distribution reacts to the deployed model, with optimal regularization scaling with the strength of performative effects.

Details

Motivation: In performative learning, data distribution reacts to deployed models (e.g., strategic users gaming the system), creating complex dynamics beyond classical supervised learning. Models must account for potential distribution shifts they might cause.

Method: Study regularization impact in high-dimensional ridge regression, analyzing how optimal regularization scales with performative effect strength. Use empirical evaluations on synthetic and real-world datasets.

Result: Performative effects worsen test risk in population setting but can be beneficial in over-parameterized regime (features > samples). Optimal regularization scales with overall strength of performative effect.

Conclusion: Regularization can effectively cope with performative effects, with optimal regularization parameter set in anticipation of these effects, particularly beneficial in over-parameterized settings.

Abstract: In performative learning, the data distribution reacts to the deployed model

for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, they can be beneficial in the over-parameterized regime where the number of features exceeds the number of samples. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.

[412] Diffusion Models for Reinforcement Learning: Foundations, Taxonomy, and Development

Changfu Xu, Jianxiong Guo, Yuzhu Liang, Haiyang Huang, Haodong Zou, Xi Zheng, Shui Yu, Xiaowen Chu, Jiannong Cao, Tian Wang

Main category: cs.LG

TL;DR: This survey provides a comprehensive synthesis of diffusion-based reinforcement learning (RL), covering integration methods, taxonomy, applications, and future directions.

Details

Motivation: To address key challenges in RL by leveraging diffusion models' advantages like multi-modal expressiveness, stable training, and trajectory-level planning capabilities.

Method: Establishes dual-axis taxonomy: function-oriented (roles of DMs in RL pipeline) and technique-oriented (online vs offline learning regimes), and examines progression from single-agent to multi-agent domains.

Result: Provides frameworks for DM-RL integration and highlights practical utility across diverse application domains.

Conclusion: Identifies open research issues and promising future development directions for advancing diffusion-based RL, with maintained GitHub repository for ongoing resources.

Abstract: Diffusion Models (DMs), as a leading class of generative models, offer key advantages for reinforcement learning (RL), including multi-modal expressiveness, stable training, and trajectory-level planning. This survey delivers a comprehensive and up-to-date synthesis of diffusion-based RL. We first provide an overview of RL, highlighting its challenges, and then introduce the fundamental concepts of DMs, investigating how they are integrated into RL frameworks to address key challenges in this research field. We establish a dual-axis taxonomy that organizes the field along two orthogonal dimensions: a function-oriented taxonomy that clarifies the roles DMs play within the RL pipeline, and a technique-oriented taxonomy that situates implementations across online versus offline learning regimes. We also provide a comprehensive examination of this progression from single-agent to multi-agent domains, thereby forming several frameworks for DM-RL integration and highlighting their practical utility. Furthermore, we outline several categories of successful applications of diffusion-based RL across diverse domains, discuss open research issues of current methodologies, and highlight key directions for future research to advance the field. Finally, we summarize the survey to identify promising future development directions. We are actively maintaining a GitHub repository (https://github.com/ChangfuXu/D4RL-FTD) for papers and other related resources to apply DMs for RL.

Ningxin He, Yang Liu, Wei Sun, Xiaozhou Ye, Ye Ouyang, Tiegang Gao, Zehui Zhang

Main category: cs.LG

TL;DR: FedMMKT enables privacy-preserving co-enhancement of server T2I models and client task-specific models using decentralized multimodal data.

Details

Motivation: T2I model adaptation is limited by task-specific data availability due to privacy concerns, while rich multimodal data from mobile/IoT systems presents an opportunity.

Method: Federated Multi-modal Knowledge Transfer framework that leverages decentralized multimodal data without compromising privacy.

Result: Not specified in abstract.

Conclusion: Not specified in abstract.

Abstract: Text-to-Image (T2I) models have demonstrated their versatility in a wide range of applications. However, adaptation of T2I models to specialized tasks is often limited by the availability of task-specific data due to privacy concerns. On the other hand, harnessing the power of rich multimodal data from modern mobile systems and IoT infrastructures presents a great opportunity. This paper introduces Federated Multi-modal Knowledge Transfer (FedMMKT), a novel framework that enables co-enhancement of a server T2I model and client task-specific models using decentralized multimodal data without compromising data privacy.

[414] HiLoRA: Adaptive Hierarchical LoRA Routing for Training-Free Domain Generalization

Ziyi Han, Huanyu Wang, Zeyu Zhang, Xiangxiang Dai, Xutong Liu, John C. S. Lui

Main category: cs.LG

TL;DR: HiLoRA is a training-free framework that performs adaptive hierarchical routing over LoRA pools using rank-one components, achieving substantial domain generalization improvements without additional training.

Details

Motivation: Existing LoRA reuse methods require explicit task labels or additional training, and activate fixed numbers of entire LoRA modules, leading to parameter redundancy or insufficiency that degrades performance.

Method: HiLoRA defines rank-one components (ROCs) where each rank parameter is independent. It adaptively selects LoRAs and determines ROC allocation using Gaussian likelihoods at sequence level, then refines routing at token level by activating only the most informative ROCs.

Result: Extensive experiments show HiLoRA achieves up to 55% accuracy gains over state-of-the-art baselines in domain generalization while maintaining comparable inference throughput.

Conclusion: HiLoRA provides an effective training-free framework for adaptive LoRA routing that significantly improves domain generalization performance with theoretical guarantees on relevant LoRA selection.

Abstract: Low-Rank Adaptation (LoRA) has emerged as a widely used technique for adapting large language models (LLMs) to new domains, due to its modular design and broad availability on platforms such as HuggingFace. This availability has motivated efforts to reuse existing LoRAs for domain generalization. However, existing methods often rely on explicit task labels or additional training, which are impractical for deployment. Moreover, they typically activate a fixed number of entire LoRA modules, leading to parameter redundancy or insufficiency that degrade performance. In this paper, we propose \texttt{HiLoRA}, a training-free framework that performs adaptive hierarchical routing over LoRA pools. Drawing on structural properties of LoRA, we define rank-one components (ROCs), in which each rank parameter is regarded as an independent unit. For a given input sequence, \texttt{HiLoRA} first adaptively selects a subset of LoRAs and determines their ROC allocation based on Gaussian likelihoods at the sequence level. At the token level, it further refines routing by activating only the most informative ROCs. We further provide theoretical guarantees that \texttt{HiLoRA} selects the most relevant LoRAs with high probability. Extensive experiments show that \texttt{HiLoRA} achieves substantial improvements in domain generalization, with accuracy gains of up to {\small $55%$} over state-of-the-art baselines, while maintaining comparable inference throughput.

[415] Multi-Action Self-Improvement for Neural Combinatorial Optimization

Laurin Luttmann, Lin Xie

Main category: cs.LG

TL;DR: This paper proposes a multi-agent self-improvement approach for Neural Combinatorial Optimization that predicts joint agent-task assignments and uses set-prediction loss to exploit symmetries, improving sample efficiency and solution quality while reducing computational costs.

Details

Motivation: Existing self-improvement methods in NCO are computationally expensive and fail to exploit multi-agent coordination structures and agent-permutation symmetries in combinatorial problems like routing and scheduling.

Method: Extends self-improvement to operate over joint multi-agent actions with a model architecture that predicts complete agent-task assignments jointly. Uses set-prediction loss to supervise on multiple expert assignments per state to leverage symmetries.

Result: Demonstrates consistent improvements in final solution quality and reduced generation latency compared to standard self-improvement across several combinatorial problems.

Conclusion: The proposed multi-agent self-improvement approach with set-prediction loss effectively addresses computational inefficiency and symmetry exploitation limitations, enabling better coordinated behavior and faster solution generation.

Abstract: Self-improvement has emerged as a state-of-the-art paradigm in Neural Combinatorial Optimization (NCO), where models iteratively refine their policies by generating and imitating high-quality solutions. Despite strong empirical performance, existing methods face key limitations. Training is computationally expensive, as policy updates require sampling numerous candidate solutions per instance to extract a single expert trajectory. More fundamentally, these approaches fail to exploit the structure of combinatorial problems involving the coordination of multiple agents, such as vehicles in min-max routing or machines in scheduling. By supervising on single-action trajectories, they fail to exploit agent-permutation symmetries, where distinct sequences of actions yield identical solutions, hindering generalization and the ability to learn coordinated behavior. We address these challenges by extending self-improvement to operate over joint multi-agent actions. Our model architecture predicts complete agent-task assignments jointly at each decision step. To explicitly leverage symmetries, we employ a set-prediction loss, which supervises the policy on multiple expert assignments for any given state. This approach enhances sample efficiency and the model’s ability to learn coordinated behavior. Furthermore, by generating multi-agent actions in parallel, it drastically accelerates the solution generation phase of the self-improvement loop. Empirically, we validate our method on several combinatorial problems, demonstrating consistent improvements in the quality of the final solution and a reduced generation latency compared to standard self-improvement.

[416] General Fourier Feature Physics-Informed Extreme Learning Machine (GFF-PIELM) for High-Frequency PDEs

Fei Ren, Sifan Wang, Pei-Zhi Zhuang, Hai-Sui Yu, He Yang

Main category: cs.LG

TL;DR: A novel Fourier feature physics-informed extreme learning machine (GFF-PIELM) is proposed to solve PDEs with high-frequency and variable-frequency behaviors, overcoming limitations of conventional PIELM while maintaining efficiency and simplicity.

Details

Motivation: Conventional PIELM struggles with solving PDEs involving high-frequency and variable-frequency behaviors, requiring a more effective approach to handle these challenging scenarios.

Method: Three-step approach: 1) Integrate Fourier feature mapping as activation function in ELM, 2) Assign frequency coefficients to hidden neurons for capturing diverse frequency components, 3) Develop innovative initialization method for hyperparameters by monitoring ELM output weight distribution.

Result: GFF-PIELM significantly improves predictive accuracy without additional training time or architecture complexity, successfully handling high-frequency, variable-frequency, multi-scale, irregular boundary, and inverse problems across five case studies with ten numerical examples.

Conclusion: PIELM can be effectively extended to solve high-frequency and variable-frequency PDEs with high accuracy, and the proposed initialization strategy may inspire advances in other physics-informed machine learning frameworks.

Abstract: Conventional physics-informed extreme learning machine (PIELM) often faces challenges in solving partial differential equations (PDEs) involving high-frequency and variable-frequency behaviors. To address these challenges, we propose a general Fourier feature physics-informed extreme learning machine (GFF-PIELM). We demonstrate that directly concatenating multiple Fourier feature mappings (FFMs) and an extreme learning machine (ELM) network makes it difficult to determine frequency-related hyperparameters. Fortunately, we find an alternative to establish the GFF-PIELM in three main steps. First, we integrate a variation of FFM into ELM as the Fourier-based activation function, so there is still one hidden layer in the GFF-PIELM framework. Second, we assign a set of frequency coefficients to the hidden neurons, which enables ELM network to capture diverse frequency components of target solutions. Finally, we develop an innovative, straightforward initialization method for these hyperparameters by monitoring the distribution of ELM output weights. GFF-PIELM not only retains the high accuracy, efficiency, and simplicity of the PIELM framework but also inherits the ability of FFMs to effectively handle high-frequency problems. We carry out five case studies with a total of ten numerical examples to highlight the feasibility and validity of the proposed GFF-PIELM, involving high frequency, variable frequency, multi-scale behaviour, irregular boundary and inverse problems. Compared to conventional PIELM, the GFF-PIELM approach significantly improves predictive accuracy without additional cost in training time and architecture complexity. Our results confirm that that PIELM can be extended to solve high-frequency and variable-frequency PDEs with high accuracy, and our initialization strategy may further inspire advances in other physics-informed machine learning (PIML) frameworks.

[417] Deep SPI: Safe Policy Improvement via World Models

Florent Delgrange, Raphael Avalos, Willem Röpke

Main category: cs.LG

TL;DR: DeepSPI extends safe policy improvement (SPI) to online deep RL settings with world models and representation learning, providing theoretical guarantees for monotonic improvement through restricted policy updates.

Details

Motivation: Existing SPI guarantees are limited to offline, tabular RL settings, creating a need to extend these theoretical foundations to online deep reinforcement learning with representation learning.

Method: DeepSPI combines local transition and reward prediction losses with regularized policy updates, restricting updates to a neighborhood of the current policy to ensure monotonic improvement.

Result: On the ALE-57 benchmark, DeepSPI matches or exceeds performance of strong baselines like PPO and DeepMDPs while maintaining theoretical guarantees.

Conclusion: The framework successfully extends classical SPI theorems to online deep RL, linking representation quality to prediction losses and enabling safe policy improvement with theoretical foundations.

Abstract: Safe policy improvement (SPI) offers theoretical control over policy updates, yet existing guarantees largely concern offline, tabular reinforcement learning (RL). We study SPI in general online settings, when combined with world model and representation learning. We develop a theoretical framework showing that restricting policy updates to a well-defined neighborhood of the current policy ensures monotonic improvement and convergence. This analysis links transition and reward prediction losses to representation quality, yielding online, “deep” analogues of classical SPI theorems from the offline RL literature. Building on these results, we introduce DeepSPI, a principled on-policy algorithm that couples local transition and reward losses with regularised policy updates. On the ALE-57 benchmark, DeepSPI matches or exceeds strong baselines, including PPO and DeepMDPs, while retaining theoretical guarantees.

[418] Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand

Kiattikun Chobtham, Kanoksri Sarinnapakorn, Kritanai Torsri, Prattana Deeprasertkul, Jirawan Kamma

Main category: cs.LG

TL;DR: Physics-informed GNNs with extreme-value analysis improve rainfall forecasting in Thailand by capturing spatiotemporal patterns and addressing extreme events through novel statistical methods.

Details

Motivation: Accurate rainfall forecasting, especially for extreme events, is challenging in climatology. Traditional models struggle with complex spatiotemporal patterns and extreme-value prediction.

Method: Graph Attention Network with LSTM using physics-derived edge features, combined with Spatial Season-aware Generalized Pareto Distribution for extreme-value analysis via Peak-Over-Threshold mapping.

Result: Outperforms established baselines across most regions, including extreme-prone areas, and improves extreme-event prediction compared to operational system SEAS5.

Conclusion: The approach provides practical enhancement for fine-resolution rainfall maps, supporting long-term water management decision-making with better extreme-event forecasting.

Abstract: Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and the Earth system. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce fine-resolution maps that support decision-making in long-term water management.

[419] Finite-time Convergence Analysis of Actor-Critic with Evolving Reward

Rui Hu, Yu Chen, Longbo Huang

Main category: cs.LG

TL;DR: First finite-time convergence analysis of single-timescale actor-critic algorithm with evolving reward functions under Markovian sampling, achieving O(1/√T) convergence rate matching static reward performance.

Details

Motivation: Many practical RL algorithms use evolving reward functions (reward shaping, entropy regularization, curriculum learning) but lack theoretical foundations for their convergence analysis.

Method: Analyzes single-timescale actor-critic algorithm where reward parameters change at each time step, affecting both policy optimization and value estimation, under Markovian sampling and standard assumptions.

Result: Derives non-asymptotic bounds showing O(1/√T) convergence rate is achievable for both actor and critic errors when reward parameters evolve slowly enough, matching best-known static reward rate.

Conclusion: Provides theoretical foundation for popular RL techniques with evolving rewards, and introduces improved distribution mismatch analysis under Markovian sampling that enhances static-reward rate by log²T factor.

Abstract: Many popular practical reinforcement learning (RL) algorithms employ evolving reward functions-through techniques such as reward shaping, entropy regularization, or curriculum learning-yet their theoretical foundations remain underdeveloped. This paper provides the first finite-time convergence analysis of a single-timescale actor-critic algorithm in the presence of an evolving reward function under Markovian sampling. We consider a setting where the reward parameters may change at each time step, affecting both policy optimization and value estimation. Under standard assumptions, we derive non-asymptotic bounds for both actor and critic errors. Our result shows that an $O(1/\sqrt{T})$ convergence rate is achievable, matching the best-known rate for static rewards, provided the reward parameters evolve slowly enough. This rate is preserved when the reward is updated via a gradient-based rule with bounded gradient and on the same timescale as the actor and critic, offering a theoretical foundation for many popular RL techniques. As a secondary contribution, we introduce a novel analysis of distribution mismatch under Markovian sampling, improving the best-known rate by a factor of $\log^2T$ in the static-reward case.

[420] Traveling Salesman-Based Token Ordering Improves Stability in Homomorphically Encrypted Language Models

Donghwan Rho, Sieun Seo, Hyewon Sung, Chohong Min, Ernest K. Ryu

Main category: cs.LG

TL;DR: Proposes TSP-based token reordering and post-processing to enable encrypted text generation with LLMs using homomorphic encryption, addressing next-token prediction challenges while preserving privacy.

Details

Motivation: As users increasingly interact with LLMs using private information, secure encrypted communication becomes essential. Homomorphic encryption enables computation on encrypted data, but text generation (particularly next-token prediction) remains a key obstacle to practical encrypted interaction.

Method: Uses TSP-based token reordering strategy to address encrypted text generation difficulties, combined with a post-processing step that reduces approximation error.

Result: Theoretical analysis and experiments show the method prevents collapse, improves coherence in generated text, and preserves data privacy throughout the process.

Conclusion: The contributions advance the feasibility of practical and privacy-preserving LLM inference by enabling secure encrypted text generation.

Abstract: As users increasingly interact with large language models (LLMs) using private information, secure and encrypted communication becomes essential. Homomorphic encryption (HE) provides a principled solution by enabling computation directly on encrypted data. Although prior work has explored aspects of running LLMs under HE, the challenge of text generation, particularly next-token prediction, has received limited attention and remains a key obstacle to practical encrypted interaction. In this work, we propose a TSP-based token reordering strategy to address the difficulties of encrypted text generation, together with a post-processing step that further reduces approximation error. Theoretical analysis and experimental results demonstrate that our method prevents collapse, improves coherence in generated text, and preserves data privacy throughout. Overall, our contributions advance the feasibility of practical and privacy-preserving LLM inference.

Olga Ovcharenko, Sebastian Schelter

Main category: cs.LG

TL;DR: Benchmarking study on cross-modal error detection in tabular data shows Cleanlab and DataScope perform best with AutoML, but current methods remain limited for real-world heavy-tailed data.

Details

Motivation: Traditional error detection methods focus on single modalities and miss cross-modal errors common in domains like e-commerce and healthcare where image, tabular, and text data coexist.

Method: Benchmarked several methods across four datasets using five baseline approaches, evaluating performance with AutoML frameworks.

Result: Cleanlab (label error detection framework) and DataScope (data valuation method) achieved the highest F1 scores when paired with strong AutoML framework.

Conclusion: Current methods for cross-modal error detection remain limited, especially for heavy-tailed real-world data, motivating further research in this area.

Abstract: Ensuring data quality at scale remains a persistent challenge for large organizations. Despite recent advances, maintaining accurate and consistent data is still complex, especially when dealing with multiple data modalities. Traditional error detection and correction methods tend to focus on a single modality, typically a table, and often miss cross-modal errors that are common in domains like e-Commerce and healthcare, where image, tabular, and text data co-exist. To address this gap, we take an initial step towards cross-modal error detection in tabular data, by benchmarking several methods. Our evaluation spans four datasets and five baseline approaches. Among them, Cleanlab, a label error detection framework, and DataScope, a data valuation method, perform the best when paired with a strong AutoML framework, achieving the highest F1 scores. Our findings indicate that current methods remain limited, particularly when applied to heavy-tailed real-world data, motivating further research in this area.

[422] Enhanced Pre-training of Graph Neural Networks for Million-Scale Heterogeneous Graphs

Shengyin Sun, Chen Ma, Jiehao Chen

Main category: cs.LG

TL;DR: Proposes a framework for pre-training GNNs on large-scale heterogeneous graphs using structure-aware and semantic-aware tasks to address semantic mismatch and improve transferability.

Details

Motivation: Existing GNN pre-training methods are designed for homogeneous graphs and don't consider semantic mismatch between original data and ideal transferable data, while real-world graphs are mostly heterogeneous.

Method: Designs two pre-training tasks: structure-aware task to capture structural properties in heterogeneous graphs, and semantic-aware task using perturbation subspace with semantic neighbors to address semantic mismatch and focus on general knowledge.

Result: Extensive experiments on real-world large-scale heterogeneous graphs demonstrate superiority over state-of-the-art baselines.

Conclusion: The proposed framework effectively pre-trains GNNs on heterogeneous graphs by addressing both structural properties and semantic mismatch, leading to better transferability to downstream tasks with limited labeled data.

Abstract: In recent years, graph neural networks (GNNs) have facilitated the development of graph data mining. However, training GNNs requires sufficient labeled task-specific data, which is expensive and sometimes unavailable. To be less dependent on labeled data, recent studies propose to pre-train GNNs in a self-supervised manner and then apply the pre-trained GNNs to downstream tasks with limited labeled data. However, most existing methods are designed solely for homogeneous graphs (real-world graphs are mostly heterogeneous) and do not consider semantic mismatch (the semantic difference between the original data and the ideal data containing more transferable semantic information). In this paper, we propose an effective framework to pre-train GNNs on the large-scale heterogeneous graph. We first design a structure-aware pre-training task, which aims to capture structural properties in heterogeneous graphs. Then, we design a semantic-aware pre-training task to tackle the mismatch. Specifically, we construct a perturbation subspace composed of semantic neighbors to help deal with the semantic mismatch. Semantic neighbors make the model focus more on the general knowledge in the semantic space, which in turn assists the model in learning knowledge with better transferability. Finally, extensive experiments are conducted on real-world large-scale heterogeneous graphs to demonstrate the superiority of the proposed method over state-of-the-art baselines. Code available at https://github.com/sunshy-1/PHE.

[423] Cautious Weight Decay

Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu

Main category: cs.LG

TL;DR: Cautious Weight Decay (CWD) is a simple optimizer modification that applies weight decay only to parameters whose signs match the optimizer update direction, improving performance without extra hyperparameters.

Details

Motivation: Standard weight decay methods implicitly modify the optimization objective, while CWD aims to preserve the original loss function while still providing regularization benefits.

Method: One-line optimizer-agnostic modification that selectively applies weight decay to parameter coordinates where the sign aligns with the optimizer update direction.

Result: Consistently improves final loss and accuracy for language model pre-training and ImageNet classification at million- to billion-parameter scales.

Conclusion: CWD is an effective drop-in replacement that requires no additional tuning and provides better optimization performance while preserving the original objective.

Abstract: We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.

[424] Continuous Uniqueness and Novelty Metrics for Generative Modeling of Inorganic Crystals

Masahiro Negishi, Hyunsoo Park, Kinga O. Mastej, Aron Walsh

Main category: cs.LG

TL;DR: Proposes two continuous distance functions to overcome limitations of traditional crystal distance metrics for evaluating generative AI models in materials science.

Details

Motivation: Traditional crystal distance functions have four key limitations: they can't quantify similarity degrees, can't distinguish compositional vs structural differences, lack Lipschitz continuity, and have non-invariant uniqueness metrics.

Method: Developed two continuous distance functions specifically designed to address the limitations of prevalent crystal distance metrics used in evaluating generative models.

Result: Experiments show the proposed distances reveal insights missed by traditional distance functions, providing more reliable evaluation of generative models for inorganic crystals.

Conclusion: The continuous distance functions offer a more robust theoretical foundation for evaluating and comparing generative AI models in materials discovery, particularly for inorganic crystal structures.

Abstract: To address pressing scientific challenges such as climate change, increasingly sophisticated generative artificial intelligence models are being developed that can efficiently sample the large chemical space of possible functional materials. These models can quickly sample new chemical compositions paired with crystal structures. They are typically evaluated using uniqueness and novelty metrics, which depend on a chosen crystal distance function. However, the most prevalent distance function has four limitations: it fails to quantify the degree of similarity between compounds, cannot distinguish compositional difference and structural difference, lacks Lipschitz continuity against shifts in atomic coordinates, and results in a uniqueness metric that is not invariant against the permutation of generated samples. In this work, we propose using two continuous distance functions to evaluate uniqueness and novelty, which theoretically overcome these limitations. Our experiments show that these distances reveal insights missed by traditional distance functions, providing a more reliable basis for evaluating and comparing generative models for inorganic crystals.

[425] Bayesian Optimization for Dynamic Pricing and Learning

Anush Anand, Pranav Agrawal, Tejas Bodas

Main category: cs.LG

TL;DR: A Gaussian Process-based nonparametric approach using Bayesian Optimization for dynamic pricing that outperforms traditional RL methods without restrictive modeling assumptions.

Details

Motivation: Traditional dynamic pricing methods assume specific parametric demand functions, which may not hold in real-world scenarios, limiting their applicability.

Method: Proposed Gaussian Process-based nonparametric approach using Bayesian Optimization to treat demand as a black-box function, with tailored algorithms for both infinite and finite inventory settings.

Result: BO-based methods outperform state-of-the-art RL algorithms in revenue, require fewer assumptions, and offer greater robustness with regret guarantees for both regimes.

Conclusion: Bayesian Optimization is a powerful and practical tool for dynamic pricing in complex, uncertain environments, providing sample-efficient learning without restrictive modeling assumptions.

Abstract: Dynamic pricing is the practice of adjusting the selling price of a product to maximize a firm’s revenue by responding to market demand. The literature typically distinguishes between two settings: infinite inventory, where the firm has unlimited stock and time to sell, and finite inventory, where both inventory and selling horizon are limited. In both cases, the central challenge lies in the fact that the demand function – how sales respond to price – is unknown and must be learned from data. Traditional approaches often assume a specific parametric form for the demand function, enabling the use of reinforcement learning (RL) to identify near-optimal pricing strategies. However, such assumptions may not hold in real-world scenarios, limiting the applicability of these methods. In this work, we propose a Gaussian Process (GP) based nonparametric approach to dynamic pricing that avoids restrictive modeling assumptions. We treat the demand function as a black-box function of the price and develop pricing algorithms based on Bayesian Optimization (BO) – a sample-efficient method for optimizing unknown functions. We present BO-based algorithms tailored for both infinite and finite inventory settings and provide regret guarantees for both regimes, thereby quantifying the learning efficiency of our methods. Through extensive experiments, we demonstrate that our BO-based methods outperform several state-of-the-art RL algorithms in terms of revenue, while requiring fewer assumptions and offering greater robustness. This highlights Bayesian Optimization as a powerful and practical tool for dynamic pricing in complex, uncertain environments.

[426] A Function Centric Perspective On Flat and Sharp Minima

Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis

Main category: cs.LG

TL;DR: Sharp minima can correlate with better generalization, challenging the traditional belief that flat minima always lead to better performance.

Details

Motivation: To revisit the role of sharpness in model performance and challenge the conventional wisdom that flat minima necessarily indicate better generalization.

Method: Conducted extensive empirical studies across single-objective optimization to modern image classification tasks, examining models with various regularizations (SAM, weight decay, data augmentation).

Result: Found that sharper minima often emerge with regularization and can coincide with better generalization, calibration, robustness, and functional consistency. Unregularized baselines converged to flatter minima but performed worse across safety metrics.

Conclusion: Function complexity, rather than flatness alone, governs solution geometry, and sharper minima can reflect more appropriate inductive biases, calling for a function-centric reappraisal of loss landscape geometry.

Abstract: Flat minima are widely believed to correlate with improved generalisation in deep neural networks. However, this connection has proven more nuanced in recent studies, with both theoretical counterexamples and empirical exceptions emerging in the literature. In this paper, we revisit the role of sharpness in model performance, proposing that sharpness is better understood as a function-dependent property rather than a reliable indicator of poor generalisation. We conduct extensive empirical studies, from single-objective optimisation to modern image classification tasks, showing that sharper minima often emerge when models are regularised (e.g., via SAM, weight decay, or data augmentation), and that these sharp minima can coincide with better generalisation, calibration, robustness, and functional consistency. Across a range of models and datasets, we find that baselines without regularisation tend to converge to flatter minima yet often perform worse across all safety metrics. Our findings demonstrate that function complexity, rather than flatness alone, governs the geometry of solutions, and that sharper minima can reflect more appropriate inductive biases (especially under regularisation), calling for a function-centric reappraisal of loss landscape geometry.

[427] Time-Correlated Video Bridge Matching

Viacheslav Vasilev, Arseny Ivanov, Nikita Gushchin, Maria Kovaleva, Alexander Korotin

Main category: cs.LG

TL;DR: TCVBM extends Bridge Matching to time-correlated video data by explicitly modeling temporal dependencies in diffusion bridges, achieving superior performance in video tasks like frame interpolation, image-to-video generation, and super-resolution.

Details

Motivation: Diffusion models struggle with data-to-data tasks between complex distributions, and existing Bridge Matching methods haven't been applied to time-correlated sequences, which is crucial for video generation where temporal coherence is essential.

Method: Time-Correlated Video Bridge Matching (TCVBM) extends Bridge Matching to time-correlated data sequences by explicitly modeling inter-sequence dependencies within the diffusion bridge and incorporating temporal correlations directly into the sampling process.

Result: TCVBM achieves superior performance across multiple quantitative metrics compared to classical bridge matching and diffusion models, demonstrating enhanced generation quality and reconstruction fidelity in video tasks.

Conclusion: TCVBM successfully addresses the gap in modeling time-correlated data sequences for video tasks, providing a framework that maintains temporal coherence while improving performance in frame interpolation, image-to-video generation, and video super-resolution.

Abstract: Diffusion models excel in noise-to-data generation tasks, providing a mapping from a Gaussian distribution to a more complex data distribution. However they struggle to model translations between complex distributions, limiting their effectiveness in data-to-data tasks. While Bridge Matching (BM) models address this by finding the translation between data distributions, their application to time-correlated data sequences remains unexplored. This is a critical limitation for video generation and manipulation tasks, where maintaining temporal coherence is particularly important. To address this gap, we propose Time-Correlated Video Bridge Matching (TCVBM), a framework that extends BM to time-correlated data sequences in the video domain. TCVBM explicitly models inter-sequence dependencies within the diffusion bridge, directly incorporating temporal correlations into the sampling process. We compare our approach to classical methods based on bridge matching and diffusion models for three video-related tasks: frame interpolation, image-to-video generation, and video super-resolution. TCVBM achieves superior performance across multiple quantitative metrics, demonstrating enhanced generation quality and reconstruction fidelity.

[428] CrossAD: Time Series Anomaly Detection with Cross-scale Associations and Cross-window Modeling

Beibu Li, Qichao Shentu, Yang Shu, Hui Zhang, Ming Li, Ning Jin, Bin Yang, Chenjuan Guo

Main category: cs.LG

TL;DR: CrossAD is a novel time series anomaly detection framework that incorporates cross-scale associations and cross-window modeling to capture dynamic multi-scale patterns and comprehensive contextual information.

Details

Motivation: Existing methods model multi-scale information independently or use simple fusion, neglecting dynamic cross-scale associations during anomalies, and rely on fixed sliding windows that limit contextual information capture.

Method: Proposes cross-scale reconstruction to reconstruct fine-grained series from coarser series, and designs a query library with global multi-scale context to overcome fixed window size limitations.

Result: Extensive experiments on multiple real-world datasets using nine evaluation metrics validate CrossAD’s effectiveness, demonstrating state-of-the-art performance in anomaly detection.

Conclusion: CrossAD successfully addresses limitations of existing methods by explicitly modeling cross-scale associations and incorporating global multi-scale context, achieving superior anomaly detection performance.

Abstract: Time series anomaly detection plays a crucial role in a wide range of real-world applications. Given that time series data can exhibit different patterns at different sampling granularities, multi-scale modeling has proven beneficial for uncovering latent anomaly patterns that may not be apparent at a single scale. However, existing methods often model multi-scale information independently or rely on simple feature fusion strategies, neglecting the dynamic changes in cross-scale associations that occur during anomalies. Moreover, most approaches perform multi-scale modeling based on fixed sliding windows, which limits their ability to capture comprehensive contextual information. In this work, we propose CrossAD, a novel framework for time series Anomaly Detection that takes Cross-scale associations and Cross-window modeling into account. We propose a cross-scale reconstruction that reconstructs fine-grained series from coarser series, explicitly capturing cross-scale associations. Furthermore, we design a query library and incorporate global multi-scale context to overcome the limitations imposed by fixed window sizes. Extensive experiments conducted on multiple real-world datasets using nine evaluation metrics validate the effectiveness of CrossAD, demonstrating state-of-the-art performance in anomaly detection.

[429] PubSub-VFL: Towards Efficient Two-Party Split Learning in Heterogeneous Environments via Publisher/Subscriber Architecture

Yi Liu, Yang Liu, Leqian Zheng, Jue Hong, Junjie Shi, Qingyou Yang, Ye Wu, Cong Wang

Main category: cs.LG

TL;DR: PubSub-VFL is a novel Vertical Federated Learning paradigm that uses a Publisher/Subscriber architecture to address computational inefficiency and training latency in two-party collaborative learning while maintaining privacy.

Details

Motivation: Direct data sharing between organizations is impractical due to privacy concerns, and existing Two-Party Split Learning/VFL architectures suffer from low computational resource utilization, training inefficiency, and latency issues caused by synchronous dependencies and participant heterogeneity.

Method: Proposes PubSub-VFL with a Publisher/Subscriber architecture that leverages decoupling capabilities and data parallelism to create a hierarchical asynchronous mechanism. Also formalizes an optimization problem based on system profiles to select optimal hyperparameters while preserving privacy.

Result: PubSub-VFL accelerates training by 2-7× without compromising accuracy, achieves computational resource utilization rate up to 91.07%, and demonstrates stable convergence with compatibility to security protocols like differential privacy across five benchmark datasets.

Conclusion: The proposed PubSub-VFL paradigm effectively addresses computational inefficiency and training latency in VFL while maintaining privacy, offering significant performance improvements over state-of-the-art baselines.

Abstract: With the rapid advancement of the digital economy, data collaboration between organizations has become a well-established business model, driving the growth of various industries. However, privacy concerns make direct data sharing impractical. To address this, Two-Party Split Learning (a.k.a. Vertical Federated Learning (VFL)) has emerged as a promising solution for secure collaborative learning. Despite its advantages, this architecture still suffers from low computational resource utilization and training efficiency. Specifically, its synchronous dependency design increases training latency, while resource and data heterogeneity among participants further hinder efficient computation. To overcome these challenges, we propose PubSub-VFL, a novel VFL paradigm with a Publisher/Subscriber architecture optimized for two-party collaborative learning with high computational efficiency. PubSub-VFL leverages the decoupling capabilities of the Pub/Sub architecture and the data parallelism of the parameter server architecture to design a hierarchical asynchronous mechanism, reducing training latency and improving system efficiency. Additionally, to mitigate the training imbalance caused by resource and data heterogeneity, we formalize an optimization problem based on participants’ system profiles, enabling the selection of optimal hyperparameters while preserving privacy. We conduct a theoretical analysis to demonstrate that PubSub-VFL achieves stable convergence and is compatible with security protocols such as differential privacy. Extensive case studies on five benchmark datasets further validate its effectiveness, showing that, compared to state-of-the-art baselines, PubSub-VFL not only accelerates training by $2 \sim 7\times$ without compromising accuracy, but also achieves a computational resource utilization rate of up to 91.07%.

[430] Mitigating the Noise Shift for Denoising Generative Models via Noise Awareness Guidance

Jincheng Zhong, Boyuan Jiang, Xin Tao, Pengfei Wan, Kun Gai, Mingsheng Long

Main category: cs.LG

TL;DR: The paper identifies “noise shift” - a misalignment between pre-defined noise levels and actual noise in diffusion model sampling - and proposes Noise Awareness Guidance (NAG) to correct this issue and improve generation quality.

Details

Motivation: Existing denoising models suffer from noise shift, where intermediate states during sampling don't align with the pre-defined noise schedule, leading to sub-optimal generation due to out-of-distribution generalization and inaccurate denoising updates.

Method: Proposes Noise Awareness Guidance (NAG) to steer sampling trajectories to remain consistent with the noise schedule, and introduces a classifier-free variant using noise-condition dropout to jointly train noise-conditional and noise-unconditional models.

Result: Extensive experiments on ImageNet generation and supervised fine-tuning tasks show NAG consistently mitigates noise shift and substantially improves generation quality of mainstream diffusion models.

Conclusion: NAG effectively addresses the pervasive noise shift problem in diffusion models through explicit guidance and classifier-free training, leading to significant improvements in generation quality across various tasks.

Abstract: Existing denoising generative models rely on solving discretized reverse-time SDEs or ODEs. In this paper, we identify a long-overlooked yet pervasive issue in this family of models: a misalignment between the pre-defined noise level and the actual noise level encoded in intermediate states during sampling. We refer to this misalignment as noise shift. Through empirical analysis, we demonstrate that noise shift is widespread in modern diffusion models and exhibits a systematic bias, leading to sub-optimal generation due to both out-of-distribution generalization and inaccurate denoising updates. To address this problem, we propose Noise Awareness Guidance (NAG), a simple yet effective correction method that explicitly steers sampling trajectories to remain consistent with the pre-defined noise schedule. We further introduce a classifier-free variant of NAG, which jointly trains a noise-conditional and a noise-unconditional model via noise-condition dropout, thereby eliminating the need for external classifiers. Extensive experiments, including ImageNet generation and various supervised fine-tuning tasks, show that NAG consistently mitigates noise shift and substantially improves the generation quality of mainstream diffusion models.

[431] The Robustness of Differentiable Causal Discovery in Misspecified Scenarios

Huiyang Yi, Yanyan He, Duxin Chen, Mingyu Kang, He Wang, Wenwu Yu

Main category: cs.LG

TL;DR: This paper benchmarks causal discovery algorithms’ performance under model assumption violations, finding differentiable methods show robustness except in scale variation scenarios.

Details

Motivation: Causal discovery algorithms rely on unverifiable assumptions that are difficult to satisfy in real-world data, limiting practical applications. The authors aim to understand how these algorithms perform when their assumptions are violated.

Method: Extensive benchmarking of mainstream causal discovery algorithms under eight different model assumption violations, using Structural Hamming Distance and Structural Intervention Distance metrics.

Result: Differentiable causal discovery methods exhibit robustness in most challenging scenarios except for scale variation. Theoretical explanations are provided for their performance.

Conclusion: The work provides comprehensive benchmarks for evaluating causal discovery methods under real-world conditions and aims to establish standards for reasonable evaluation while promoting practical applications.

Abstract: Causal discovery aims to learn causal relationships between variables from targeted data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in commonly used challenging scenarios, except for scale variation. We also provide the theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, and provide the standard for reasonable evaluation of causal discovery, as well as to further promote its application in real-world scenarios.

[432] Multi-Armed Bandits with Minimum Aggregated Revenue Constraints

Ahmed Ben Yahmed, Hafedh El Ferchichi, Marc Abeille, Vianney Perchet

Main category: cs.LG

TL;DR: This paper studies a multi-armed bandit problem with contextual information and minimum reward constraints per arm across contexts, aiming to maximize total cumulative reward while ensuring fair revenue allocation.

Details

Motivation: The framework addresses real-world applications requiring fair revenue allocation where contextual variation is inherent, and cross-context aggregation of minimum reward constraints enables better performance but introduces technical challenges.

Method: Design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction, deriving problem-dependent upper bounds on regret and constraint violations.

Result: Established problem-dependent upper bounds on both regret and constraint violations for the proposed algorithms, and proved a lower bound showing optimal dependence on time horizon.

Conclusion: The results demonstrate fundamental limitations of the free exploration principle used in prior work and establish the optimality of the time horizon dependence in the proposed algorithms.

Abstract: We examine a multi-armed bandit problem with contextual information, where the objective is to ensure that each arm receives a minimum aggregated reward across contexts while simultaneously maximizing the total cumulative reward. This framework captures a broad class of real-world applications where fair revenue allocation is critical and contextual variation is inherent. The cross-context aggregation of minimum reward constraints, while enabling better performance and easier feasibility, introduces significant technical challenges – particularly the absence of closed-form optimal allocations typically available in standard MAB settings. We design and analyze algorithms that either optimistically prioritize performance or pessimistically enforce constraint satisfaction. For each algorithm, we derive problem-dependent upper bounds on both regret and constraint violations. Furthermore, we establish a lower bound demonstrating that the dependence on the time horizon in our results is optimal in general and revealing fundamental limitations of the free exploration principle leveraged in prior work.

[433] Demystifying Hybrid Thinking: Can LLMs Truly Switch Between Think and No-Think?

Shouren Wang, Wang Yang, Xianxuan Long, Qifan Wang, Vipin Chaudhary, Xiaotian Han

Main category: cs.LG

TL;DR: Current hybrid thinking LLMs have poor mode separation with reasoning behaviors leaking into no-think mode. The paper identifies key factors for better controllability and proposes a training recipe that reduces output length and reasoning tokens while maintaining accuracy.

Details

Motivation: Hybrid thinking LLMs enable switching between reasoning and direct answering for efficiency, but current implementations suffer from poor mode separation where reasoning behaviors leak into no-think mode.

Method: Analyzed factors influencing controllability and identified four key factors: larger data scale, using think/no-think answers from different questions, moderate increase in no-think data, and two-phase training strategy. Proposed a practical training recipe based on these findings.

Result: The proposed recipe significantly reduces no-think output length (from 1085 to 585 on MATH500) and reasoning-supportive tokens like “wait” (from 5917 to 522 on MATH500) while maintaining accuracy in both modes.

Conclusion: Current hybrid thinking has limitations in mode separation, but the identified factors and proposed recipe offer directions for improving controllability in hybrid thinking LLMs.

Abstract: Hybrid thinking enables LLMs to switch between reasoning and direct answering, offering a balance between efficiency and reasoning capability. Yet our experiments reveal that current hybrid thinking LLMs only achieve partial mode separation: reasoning behaviors often leak into the no-think mode. To understand and mitigate this, we analyze the factors influencing controllability and identify four that matter most: (1) larger data scale, (2) using think and no-think answers from different questions rather than the same question, (3) a moderate increase in no-think data number, and (4) a two-phase strategy that first trains reasoning ability and then applies hybrid think training. Building on these findings, we propose a practical recipe that, compared to standard training, can maintain accuracy in both modes while significantly reducing no-think output length (from $1085$ to $585$ on MATH500) and occurrences of reasoning-supportive tokens such as ``\texttt{wait}’’ (from $5917$ to $522$ on MATH500). Our findings highlight the limitations of current hybrid thinking and offer directions for strengthening its controllability.

[434] Evaluation of Real-Time Preprocessing Methods in AI-Based ECG Signal Analysis

Jasmin Freudenberg, Kai Hahn, Christian Weber, Madjid Fathi

Main category: cs.LG

TL;DR: Analysis of ECG signal pre-processing methods for edge computing in the FACE project, focusing on energy efficiency, processing capability, and real-time performance.

Details

Motivation: Growing demand for privacy-compliant, energy-efficient real-time ECG analysis requires new approaches at the point of data acquisition, with edge computing reducing latency and improving data security.

Method: Analysis of various ECG signal pre-processing steps for applicability in edge computing, evaluating methods based on energy efficiency, processing capability, and real-time capability.

Result: The study identifies suitable pre-processing methods that can be implemented in edge devices for ECG analysis within the FACE project framework.

Conclusion: Edge computing combined with cloud computing provides a synergistic approach for long-term ECG analysis, with carefully selected pre-processing methods enabling energy-efficient, real-time processing at the data acquisition point.

Abstract: The increasing popularity of portable ECG systems and the growing demand for privacy-compliant, energy-efficient real-time analysis require new approaches to signal processing at the point of data acquisition. In this context, the edge domain is acquiring increasing importance, as it not only reduces latency times, but also enables an increased level of data security. The FACE project aims to develop an innovative machine learning solution for analysing long-term electrocardiograms that synergistically combines the strengths of edge and cloud computing. In this thesis, various pre-processing steps of ECG signals are analysed with regard to their applicability in the project. The selection of suitable methods in the edge area is based in particular on criteria such as energy efficiency, processing capability and real-time capability.

[435] Research in Collaborative Learning Does Not Serve Cross-Silo Federated Learning in Practice

Kevin Kuo, Chhavi Yadav, Virginia Smith

Main category: cs.LG

TL;DR: Interview study identifies practical barriers to cross-silo federated learning adoption, revealing challenges distinct from cross-device FL that are not well-captured by current research.

Details

Motivation: To understand why cross-silo federated learning adoption remains limited despite growing organizational interest driven by data protection regulations like GDPR and HIPAA.

Method: Conducted interview study with diverse stakeholders including user organizations, software providers, and academic researchers.

Result: Uncovered various barriers including concerns about model performance, questions of incentives, and trust issues between participating organizations.

Conclusion: Cross-silo FL faces distinct challenges requiring future research directions to overcome adoption barriers.

Abstract: Cross-silo federated learning (FL) is a promising approach to enable cross-organization collaboration in machine learning model development without directly sharing private data. Despite growing organizational interest driven by data protection regulations such as GDPR and HIPAA, the adoption of cross-silo FL remains limited in practice. In this paper, we conduct an interview study to understand the practical challenges associated with cross-silo FL adoption. With interviews spanning a diverse set of stakeholders such as user organizations, software providers, and academic researchers, we uncover various barriers, from concerns about model performance to questions of incentives and trust between participating organizations. Our study shows that cross-silo FL faces a set of challenges that have yet to be well-captured by existing research in the area and are quite distinct from other forms of federated learning such as cross-device FL. We end with a discussion on future research directions that can help overcome these challenges.

[436] Rethinking Knowledge Distillation: A Data Dependent Regulariser With a Negative Asymmetric Payoff

Israel Mason-Williams, Gabryel Mason-Williams, Helen Yannakoudakis

Main category: cs.LG

TL;DR: Knowledge distillation functions more as a data-dependent regularizer with negative asymmetric transfer rather than an effective compression mechanism, with limited knowledge transfer across modalities and architectures.

Details

Motivation: To understand the functional impact and compression capacity of knowledge distillation, decoupling compression from architectural reduction to provide improved understanding of knowledge transfer mechanisms.

Method: Employed hypothesis testing, controls, and random control distillation across multiple distillation variants, analyzing distillation scaling laws across model sizes, 12 experimental setups, 9 architectures, and 7 datasets.

Result: While statistically significant knowledge transfer occurs in some modalities and architectures, the extent is less pronounced than anticipated. Significant cases show consistent severe asymmetric transfer of negative knowledge to the student.

Conclusion: Knowledge distillation functions less as a compression mechanism and more as a data-dependent regulariser with a negative asymmetric payoff, raising safety concerns in applications.

Abstract: Knowledge distillation is often considered a compression mechanism when judged on the resulting student’s accuracy and loss, yet its functional impact is poorly understood. In this work, we quantify the compression capacity of knowledge distillation and the resulting knowledge transfer from a functional perspective, decoupling compression from architectural reduction, which provides an improved understanding of knowledge distillation. We employ hypothesis testing, controls, and random control distillation to understand knowledge transfer mechanisms across data modalities. To rigorously test the breadth and limits of our analyses, we explore multiple distillation variants and analyse distillation scaling laws across model sizes. Our findings demonstrate that, while there is statistically significant knowledge transfer in some modalities and architectures, the extent of this transfer is less pronounced than anticipated, even under conditions designed to maximise knowledge sharing. Notably, in cases of significant knowledge transfer, we identify a consistent and severe asymmetric transfer of negative knowledge to the student, raising safety concerns in knowledge distillation applications. Across 12 experimental setups, 9 architectures, and 7 datasets, our findings show that knowledge distillation functions less as a compression mechanism and more as a data-dependent regulariser with a negative asymmetric payoff.

[437] Learning-To-Measure: In-context Active Feature Acquisition

Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi

Main category: cs.LG

TL;DR: L2M is a meta-learning approach for active feature acquisition that learns acquisition policies across multiple tasks using uncertainty quantification and greedy feature selection, eliminating per-task retraining.

Details

Motivation: Address limitations of traditional AFA methods that work on single predetermined tasks and require per-task retraining, by enabling scalable learning across various tasks using retrospective data with systematic missingness.

Method: Learning-to-Measure (L2M) with reliable uncertainty quantification using sequence-modeling pre-training, and uncertainty-guided greedy feature acquisition that maximizes conditional mutual information.

Result: L2M matches or surpasses task-specific baselines across synthetic and real-world tabular benchmarks, especially under scarce labels and high missingness conditions.

Conclusion: L2M provides an effective meta-learning framework for active feature acquisition that works across multiple tasks without retraining, demonstrating strong performance particularly in challenging scenarios with limited labels and high missingness.

Abstract: Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.

[438] Towards Fast Coarse-graining and Equation Discovery with Foundation Inference Models

Manuel Hinz, Maximilian Mauel, Patrick Seifner, David Berghaus, Kostadin Cvejoski, Ramses J. Sanchez

Main category: cs.LG

TL;DR: The paper proposes a method to decouple the discovery of coarse-grained variables from fitting governing equations in dynamical systems by using pretrained Foundation Inference Models (FIMs) to estimate infinitesimal generators, enabling stable representation learning.

Details

Motivation: High-dimensional dynamical processes often have low-dimensional latent dynamics, but existing methods typically solve variable discovery and equation fitting jointly, which can be unstable. The goal is to develop a more stable and reusable approach for coarse-graining pipelines.

Method: Leverage pretrained FIMs with frozen weights to estimate infinitesimal generators (drift and diffusion) in zero-shot mode, then train only the encoder-decoder map using a simulation-consistent loss that stabilizes representation learning.

Result: A proof of concept on a stochastic double-well system with semicircle diffusion embedded in synthetic video data demonstrates the approach’s potential for fast and reusable coarse-graining pipelines.

Conclusion: Decoupling variable discovery from dynamics fitting using FIMs provides a stable and efficient method for identifying latent dynamics in high-dimensional systems, enabling reusable coarse-graining pipelines.

Abstract: High-dimensional recordings of dynamical processes are often characterized by a much smaller set of effective variables, evolving on low-dimensional manifolds. Identifying these latent dynamics requires solving two intertwined problems: discovering appropriate coarse-grained variables and simultaneously fitting the governing equations. Most machine learning approaches tackle these tasks jointly by training autoencoders together with models that enforce dynamical consistency. We propose to decouple the two problems by leveraging the recently introduced Foundation Inference Models (FIMs). FIMs are pretrained models that estimate the infinitesimal generators of dynamical systems (e.g., the drift and diffusion of a stochastic differential equation) in zero-shot mode. By amortizing the inference of the dynamics through a FIM with frozen weights, and training only the encoder-decoder map, we define a simple, simulation-consistent loss that stabilizes representation learning. A proof of concept on a stochastic double-well system with semicircle diffusion, embedded into synthetic video data, illustrates the potential of this approach for fast and reusable coarse-graining pipelines.

[439] Laminar: A Scalable Asynchronous RL Post-Training Framework

Guangming Sheng, Yuxuan Tong, Borui Wan, Wang Zhang, Chaobo Jia, Xibin Wu, Yuqi Wu, Xiang Li, Chi Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu

Main category: cs.LG

TL;DR: Laminar is a scalable RL post-training system that addresses GPU underutilization in large-scale LLM training by breaking global synchronization through trajectory-level asynchrony and dynamic repacking of long-tail trajectories.

Details

Motivation: Existing RL frameworks suffer from severe GPU underutilization due to extreme long-tail skewness in RL trajectory generation, and global synchronization creates rigid model update schedules that are ill-suited for evolving latency distributions.

Method: Proposes a fully decoupled architecture with: 1) relay workers as distributed parameter service for asynchronous weight synchronization, 2) dynamic repack mechanism to consolidate long-tail trajectories onto dedicated rollouts, and 3) failure isolation for robustness.

Result: Evaluation on 1024-GPU cluster shows up to 5.48× training throughput speedup over state-of-the-art systems while reducing model convergence time.

Conclusion: Laminar’s trajectory-level asynchrony and fully decoupled design enable efficient scaling of RL post-training for LLMs by eliminating global synchronization bottlenecks and maximizing GPU utilization.

Abstract: Reinforcement learning (RL) post-training for Large Language Models (LLMs) is now scaling to large clusters and running for extended durations to enhance model reasoning performance. However, the scalability of existing RL frameworks is limited, as extreme long-tail skewness in RL trajectory generation causes severe GPU underutilization. Current asynchronous RL systems attempt to mitigate this, but they rely on global weight synchronization between the actor and all rollouts, which creates a rigid model update schedule. This global synchronization is ill-suited for the highly skewed and evolving distribution of trajectory generation latency in RL training, crippling training efficiency. Our key insight is that efficient scaling requires breaking this lockstep through trajectory-level asynchrony, which generates and consumes each trajectory independently. We propose Laminar, a scalable and robust RL post-training system built on a fully decoupled architecture. First, we replace global updates with a tier of relay workers acting as a distributed parameter service. This enables asynchronous and fine-grained weight synchronization, allowing rollouts to pull the latest weight anytime without stalling the actor’s training loop. Second, a dynamic repack mechanism consolidates long-tail trajectories onto a few dedicated rollouts, maximizing generation throughput. The fully decoupled design also isolates failures, ensuring robustness for long-running jobs. Our evaluation on a 1024-GPU cluster shows that Laminar achieves up to 5.48$\times$ training throughput speedup over state-of-the-art systems, while reducing model convergence time.

[440] Expert or not? assessing data quality in offline reinforcement learning

Arip Asadulaev, Fakhri Karray, Martin Takac

Main category: cs.LG

TL;DR: The paper introduces Bellman Wasserstein Distance (BWD), a value-aware optimal transport metric that estimates offline RL dataset quality without training agents, and shows it correlates strongly with actual algorithm performance while also improving returns when used as a regularizer.

Details

Motivation: Offline RL datasets vary widely in quality but assessing quality a priori is difficult since data provenance and skill composition are unknown. There's a need to estimate dataset quality without training agents to guide algorithm selection.

Method: Proposes Bellman Wasserstein Distance (BWD) - a value-aware optimal transport score computed from a behavioral critic and state-conditional OT formulation. It measures how dissimilar a dataset’s behavioral policy is from a random reference policy, requiring no environment interaction or full policy optimization.

Result: BWD strongly correlates with an oracle performance score across D4RL MuJoCo tasks, enabling efficient prediction of how well standard agents will perform. When used as a regularizer during policy optimization, BWD explicitly pushes learned policies away from random behavior and improves returns.

Conclusion: Value-aware, distributional signals like BWD are practical tools for triaging offline RL datasets and policy optimization, providing dataset quality assessment without agent training.

Abstract: Offline reinforcement learning (RL) learns exclusively from static datasets, without further interaction with the environment. In practice, such datasets vary widely in quality, often mixing expert, suboptimal, and even random trajectories. The choice of algorithm therefore depends on dataset fidelity. Behavior cloning can suffice on high-quality data, whereas mixed- or low-quality data typically benefits from offline RL methods that stitch useful behavior across trajectories. Yet in the wild it is difficult to assess dataset quality a priori because the data’s provenance and skill composition are unknown. We address the problem of estimating offline dataset quality without training an agent. We study a spectrum of proxies from simple cumulative rewards to learned value based estimators, and introduce the Bellman Wasserstein distance (BWD), a value aware optimal transport score that measures how dissimilar a dataset’s behavioral policy is from a random reference policy. BWD is computed from a behavioral critic and a state conditional OT formulation, requiring no environment interaction or full policy optimization. Across D4RL MuJoCo tasks, BWD strongly correlates with an oracle performance score that aggregates multiple offline RL algorithms, enabling efficient prediction of how well standard agents will perform on a given dataset. Beyond prediction, integrating BWD as a regularizer during policy optimization explicitly pushes the learned policy away from random behavior and improves returns. These results indicate that value aware, distributional signals such as BWD are practical tools for triaging offline RL datasets and policy optimization.

[441] SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning

Chih-Chuan Cheng, Yi-Ju Tseng

Main category: cs.LG

TL;DR: SG-XDEAT is a novel framework for tabular data that uses dual-stream encoding with raw and target-aware representations, cross-dimensional/cross-encoding attention, and adaptive sparse attention for noise filtering.

Details

Motivation: To create a more robust deep learning model for tabular data by jointly modeling raw features and target-aware representations while adaptively filtering noise.

Method: Uses dual-stream encoder with raw value and target-conditioned streams, cross-dimensional self-attention for intra-view dependencies, cross-encoding self-attention for bidirectional interaction between streams, and adaptive sparse self-attention to suppress low-utility tokens.

Result: Empirical results on multiple public benchmarks show consistent gains over strong baselines.

Conclusion: Jointly modeling raw and target-aware views with adaptive noise filtering yields a more robust deep tabular learner.

Abstract: We propose SG-XDEAT (Sparsity-Guided Cross Dimensional and Cross-Encoding Attention with Target Aware Conditioning), a novel framework designed for supervised learning on tabular data. At its core, SG-XDEAT employs a dual-stream encoder that decomposes each input feature into two parallel representations: a raw value stream and a target-conditioned (label-aware) stream. These dual representations are then propagated through a hierarchical stack of attention-based modules. SG-XDEAT integrates three key components: (i) Cross-Dimensional self-attention, which captures intra-view dependencies among features within each stream; (ii) Cross-Encoding self-attention, which enables bidirectional interaction between raw and target-aware representations; and (iii) an Adaptive Sparse Self-Attention (ASSA) mechanism, which dynamically suppresses low-utility tokens by driving their attention weights toward zero–thereby mitigating the impact of noise. Empirical results on multiple public benchmarks show consistent gains over strong baselines, confirming that jointly modeling raw and target-aware views–while adaptively filtering noise–yields a more robust deep tabular learner.

[442] On Foundation Models for Temporal Point Processes to Accelerate Scientific Discovery

David Berghaus, Patrick Seifner, Kostadin Cvejoski, Ramses J. Sanchez

Main category: cs.LG

TL;DR: A foundation model for event sequence analysis that learns general patterns from simulated data and can be applied to new datasets without retraining.

Details

Motivation: Traditional ML models require building and training from scratch for each new dataset, which is slow and costly for scientific fields that analyze event sequences.

Method: Train a single foundation model on millions of simulated event sequences to learn general patterns, then apply it to new datasets with few-shot learning or fine-tuning.

Result: The model can analyze new scientific data instantly without retraining and can be quickly fine-tuned for higher accuracy.

Conclusion: This approach makes sophisticated event analysis more accessible and accelerates scientific discovery by eliminating the need for dataset-specific model training.

Abstract: Many scientific fields, from medicine to seismology, rely on analyzing sequences of events over time to understand complex systems. Traditionally, machine learning models must be built and trained from scratch for each new dataset, which is a slow and costly process. We introduce a new approach: a single, powerful model that learns the underlying patterns of event data in context. We trained this “foundation model” on millions of simulated event sequences, teaching it a general-purpose understanding of how events can unfold. As a result, our model can analyze new scientific data instantly, without retraining, simply by looking at a few examples from the dataset. It can also be quickly fine-tuned for even higher accuracy. This approach makes sophisticated event analysis more accessible and accelerates the pace of scientific discovery.

[443] Towards Foundation Inference Models that Learn ODEs In-Context

Maximilian Mauel, Manuel Hinz, Patrick Seifner, David Berghaus, Ramses J. Sanchez

Main category: cs.LG

TL;DR: FIM-ODE is a pretrained neural model that estimates ODEs zero-shot from sparse and noisy observations using a flexible neural operator trained on synthetic data.

Details

Motivation: Accurate data-driven modeling of dynamical systems as ODEs is challenging with sparse or noisy data, which is common across natural sciences.

Method: Uses a pretrained neural model (FIM-ODE) with a flexible neural operator trained on synthetic data to perform zero-shot ODE inference from sparse and noisy observations.

Result: FIM-ODE provides accurate ODE estimates comparable to state-of-the-art neural methods and enables qualitative comparison of estimated vector fields.

Conclusion: FIM-ODE offers an effective approach for robust ODE inference from challenging data conditions, demonstrating competitive performance with existing methods.

Abstract: Ordinary differential equations (ODEs) describe dynamical systems evolving deterministically in continuous time. Accurate data-driven modeling of systems as ODEs, a central problem across the natural sciences, remains challenging, especially if the data is sparse or noisy. We introduce FIM-ODE (Foundation Inference Model for ODEs), a pretrained neural model designed to estimate ODEs zero-shot (i.e., in context) from sparse and noisy observations. Trained on synthetic data, the model utilizes a flexible neural operator for robust ODE inference, even from corrupted data. We empirically verify that FIM-ODE provides accurate estimates, on par with a neural state-of-the-art method, and qualitatively compare the structure of their estimated vector fields.

[444] DiffEM: Learning from Corrupted Data with Diffusion Models via Expectation Maximization

Danial Hosseintabar, Fan Chen, Giannis Daras, Antonio Torralba, Constantinos Daskalakis

Main category: cs.LG

TL;DR: DiffEM: A new method for training diffusion models using Expectation-Maximization (EM) from corrupted data, with theoretical convergence guarantees and demonstrated effectiveness on image reconstruction tasks.

Details

Motivation: Diffusion models are powerful generative priors for high-dimensional inverse problems, but learning them from only corrupted or noisy observations remains challenging.

Method: Proposes DiffEM method that uses conditional diffusion models to reconstruct clean data from observations in the E-step, then refines the conditional diffusion model using reconstructed data in the M-step.

Result: The method provides monotonic convergence guarantees under appropriate statistical conditions and demonstrates effectiveness through experiments on various image reconstruction tasks.

Conclusion: DiffEM offers a principled approach to train diffusion models from corrupted data using EM framework with theoretical guarantees and practical performance.

Abstract: Diffusion models have emerged as powerful generative priors for high-dimensional inverse problems, yet learning them when only corrupted or noisy observations are available remains challenging. In this work, we propose a new method for training diffusion models with Expectation-Maximization (EM) from corrupted data. Our proposed method, DiffEM, utilizes conditional diffusion models to reconstruct clean data from observations in the E-step, and then uses the reconstructed data to refine the conditional diffusion model in the M-step. Theoretically, we provide monotonic convergence guarantees for the DiffEM iteration, assuming appropriate statistical conditions. We demonstrate the effectiveness of our approach through experiments on various image reconstruction tasks.

[445] Structured Sparsity and Weight-adaptive Pruning for Memory and Compute efficient Whisper models

Prasenjit K Mudi, Anshi Sachan, Dahlia Devapriya, Sheetal Kalyani

Main category: cs.LG

TL;DR: A framework for creating efficient Whisper speech recognition models using structured sparsity and weight pruning to reduce model size and computational requirements while maintaining accuracy.

Details

Motivation: Whisper models achieve excellent speech recognition but are too large for deployment on resource-constrained edge devices, creating a need for more efficient variants.

Method: Uses Sparse Group LASSO penalty as loss regularizer for structured sparsity, weight statistics aware pruning algorithm, and custom text normalizer for WER evaluation.

Result: Achieved 35.4% parameter reduction, 14.25% lower memory, 18.5% fewer FLOPs on Whisper-small; 31% parameter reduction, 15.29% lower memory, 16.95% fewer FLOPs on Whisper-medium; outperformed Iterative Magnitude Pruning by 18.7% more parameters with 12.31 WER reduction.

Conclusion: The proposed framework successfully creates efficient Whisper variants that significantly reduce model size and computational requirements without degrading performance, making them suitable for edge deployment.

Abstract: Whisper models have achieved remarkable progress in speech recognition; yet their large size remains a bottleneck for deployment on resource-constrained edge devices. This paper proposes a framework to design fine-tuned variants of Whisper which address the above problem. Structured sparsity is enforced via the Sparse Group LASSO penalty as a loss regularizer, to reduce the number of FLOating Point operations (FLOPs). Further, a weight statistics aware pruning algorithm is proposed. We also design our custom text normalizer for WER evaluation. On Common Voice 11.0 Hindi dataset, we obtain, without degrading WER, (a) 35.4% reduction in model parameters, 14.25% lower memory consumption and 18.5% fewer FLOPs on Whisper-small, and (b) 31% reduction in model parameters, 15.29% lower memory consumption and 16.95% fewer FLOPs on Whisper-medium; and, (c) substantially outperform the state-of-the-art Iterative Magnitude Pruning based method by pruning 18.7% more parameters along with a 12.31 reduction in WER.

[446] Topological Signatures of ReLU Neural Network Activation Patterns

Vicente Bosca, Tatum Rask, Sunia Tanweer, Andrew R. Tawfeek, Branden Stone

Main category: cs.LG

TL;DR: This paper analyzes topological signatures in ReLU neural networks by examining activation patterns and polytope decompositions of the feature space.

Details

Motivation: To understand how topological properties like Fiedler partitions and homology relate to neural network behavior and decision boundaries.

Method: Analyze polytope decomposition of feature space in feedforward ReLU networks, study Fiedler partitions of dual graphs for binary classification, and compute homology of cellular decompositions for regression tasks.

Result: Fiedler partitions appear to correlate with decision boundaries in binary classification, and training loss patterns correlate with polyhedral cell-count changes during training in regression tasks.

Conclusion: Topological signatures from activation patterns provide insights into neural network behavior, connecting geometric decompositions with learning dynamics and decision boundaries.

Abstract: This paper explores the topological signatures of ReLU neural network activation patterns. We consider feedforward neural networks with ReLU activation functions and analyze the polytope decomposition of the feature space induced by the network. Mainly, we investigate how the Fiedler partition of the dual graph and show that it appears to correlate with the decision boundary – in the case of binary classification. Additionally, we compute the homology of the cellular decomposition – in a regression task – to draw similar patterns in behavior between the training loss and polyhedral cell-count, as the model is trained.

[447] Structure-Aware Spectral Sparsification via Uniform Edge Sampling

Kaiwen He, Petros Drineas, Rajiv Khanna

Main category: cs.LG

TL;DR: Uniform edge sampling suffices for spectral clustering in well-separated graphs, bypassing the need for expensive importance sampling based on effective resistances.

Details

Motivation: Spectral clustering relies on eigenvector computation which doesn't scale well to massive graphs. Classical sparsification methods require expensive preprocessing to estimate effective resistances for importance sampling.

Method: The paper proves that for graphs with well-separated k-clustering (characterized by large structure ratio), uniform sampling of O(γ²n log n/ε²) edges preserves the spectral subspace needed for clustering, where γ is the Laplacian condition number.

Result: Uniform sampling yields a sparsifier whose top (n-k)-dimensional eigenspace remains approximately orthogonal to cluster indicators, ensuring faithful spectral embedding and preserved clustering quality.

Conclusion: Under strong clusterability assumptions, uniform edge sampling is sufficient for structure-preserving spectral clustering, connecting coreset-based clustering theory to spectral sparsification and providing the first provable guarantee for this approach.

Abstract: Spectral clustering is a fundamental method for graph partitioning, but its reliance on eigenvector computation limits scalability to massive graphs. Classical sparsification methods preserve spectral properties by sampling edges proportionally to their effective resistances, but require expensive preprocessing to estimate these resistances. We study whether uniform edge sampling-a simple, structure-agnostic strategy-can suffice for spectral clustering. Our main result shows that for graphs admitting a well-separated $k$-clustering, characterized by a large structure ratio $\Upsilon(k) = \lambda_{k+1} / \rho_G(k)$, uniform sampling preserves the spectral subspace used for clustering. Specifically, we prove that uniformly sampling $O(\gamma^2 n \log n / \epsilon^2)$ edges, where $\gamma$ is the Laplacian condition number, yields a sparsifier whose top $(n-k)$-dimensional eigenspace is approximately orthogonal to the cluster indicators. This ensures that the spectral embedding remains faithful, and clustering quality is preserved. Our analysis introduces new resistance bounds for intra-cluster edges, a rank-$(n-k)$ effective resistance formulation, and a matrix Chernoff bound adapted to the dominant eigenspace. These tools allow us to bypass importance sampling entirely. Conceptually, our result connects recent coreset-based clustering theory to spectral sparsification, showing that under strong clusterability, even uniform sampling is structure-aware. This provides the first provable guarantee that uniform edge sampling suffices for structure-preserving spectral clustering.

[448] Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers

Ruben Belo, Claudia Soares, Marta Guimaraes

Main category: cs.LG

TL;DR: CALM is an inference-time method that suppresses harmful concepts in LLMs by modifying latent representations using orthogonal projection, without requiring retraining or fine-tuning.

Details

Motivation: Large Language Models are vulnerable to jailbreak attacks that bypass safety guardrails through adversarial prompts, creating a need for effective safety mechanisms.

Method: Uses Concept Alignment and Concept Manipulation (CALM) with orthogonal projection to remove unwanted latent directions associated with harmful content from the last layer representations.

Result: CALM reduces harmful outputs and outperforms baseline methods in most metrics with only small computational overhead at inference.

Conclusion: CALM provides a lightweight approach to AI safety that doesn’t require additional training data or model fine-tuning while effectively suppressing harmful concepts.

Abstract: Large Language Models are susceptible to jailbreak attacks that bypass built-in safety guardrails (e.g., by tricking the model with adversarial prompts). We propose Concept Alignment and Concept Manipulation \textbf{CALM}, an inference-time method that suppresses harmful concepts by modifying latent representations of the last layer of the model, without retraining. Leveraging \gls*{cw} technique from Computer Vision combined with orthogonal projection, CALM removes unwanted latent directions associated with harmful content while preserving model performance. Experiments show that CALM reduces harmful outputs and outperforms baseline methods in most metrics, offering a lightweight approach to AI safety with no additional training data or model fine-tuning, while incurring only a small computational overhead at inference.

[449] CoRA: Covariate-Aware Adaptation of Time Series Foundation Models

Guo Qin, Zhi Chen, Yong Liu, Zhiyuan Shi, Haixuan Liu, Xiangdong Huang, Jianmin Wang, Mingsheng Long

Main category: cs.LG

TL;DR: CoRA is a covariate-aware adaptation framework that enhances Time Series Foundation Models by incorporating exogenous covariates from multiple modalities while preserving pre-trained backbone parameters through zero-initialized condition injection and Granger Causality Embedding.

Details

Motivation: Current Time Series Foundation Models are typically pre-trained on univariate time series, missing crucial information from diverse covariates in real-world forecasting tasks. There's a need to enhance TSFMs by effectively incorporating exogenous covariates from various modalities.

Method: CoRA maintains pre-trained backbones as frozen feature extractors, uses Granger Causality Embedding to automatically evaluate covariates’ causal predictability, and employs zero-initialized condition-injection mechanism to avoid catastrophic forgetting while gradually integrating exogenous information.

Result: CoRA achieves 31.1% MSE reduction on covariate-aware forecasting and surpasses state-of-the-art covariate-aware deep forecasters with full or few-shot training samples. It shows strong compatibility with various advanced TSFMs and extends covariate scope to other modalities.

Conclusion: CoRA presents a practical paradigm for applying Time Series Foundation Models by effectively incorporating diverse covariates while preserving pre-trained knowledge, demonstrating significant performance improvements in multivariate forecasting tasks.

Abstract: Time Series Foundation Models (TSFMs) have shown significant impact through their model capacity, scalability, and zero-shot generalization. However, due to the heterogeneity of inter-variate dependencies and the backbone scalability on large-scale multivariate datasets, most TSFMs are typically pre-trained on univariate time series. This limitation renders them oblivious to crucial information from diverse covariates in real-world forecasting tasks. To further enhance the performance of TSFMs, we propose a general covariate-aware adaptation (CoRA) framework for TSFMs. It leverages pre-trained backbones of foundation models while effectively incorporating exogenous covariates from various modalities, including time series, language, and images, to improve the quality of predictions. Technically, CoRA maintains the equivalence of initialization and parameter consistency during adaptation. With preserved backbones of foundation models as frozen feature extractors, the outcome embeddings from foundation models are empirically demonstrated more informative than raw data. Further, CoRA employs a novel Granger Causality Embedding (GCE) to automatically evaluate covariates regarding their causal predictability with respect to the target variate. We incorporate these weighted embeddings with a zero-initialized condition-injection mechanism, avoiding catastrophic forgetting of pre-trained foundation models and gradually integrates exogenous information. Extensive experiments show that CoRA of TSFMs surpasses state-of-the-art covariate-aware deep forecasters with full or few-shot training samples, achieving 31.1% MSE reduction on covariate-aware forecasting. Compared to other adaptation methods, CoRA exhibits strong compatibility with various advanced TSFMs and extends the scope of covariates to other modalities, presenting a practical paradigm for the application of TSFMs.

[450] Hierarchical Federated Learning for Crop Yield Prediction in Smart Agricultural Production Systems

Anas Abouaomar, Mohammed El hanjri, Abdellatif Kobbane, Anis Laouiti, Khalid Nafil

Main category: cs.LG

TL;DR: A hierarchical federated learning architecture for smart agriculture with seasonal subscription mechanism, enabling crop-specific model training while preserving data privacy.

Details

Motivation: To address the need for crop yield prediction in smart agriculture while handling heterogeneous farming environments and protecting privacy-sensitive agricultural data.

Method: Three-layer architecture: individual farms (clients), crop-specific aggregators (middle layer), and global model aggregator (top layer). Uses seasonal subscription where farms join crop-specific clusters each season.

Result: Local and crop-layer models closely follow actual yield patterns, significantly outperforming standard machine learning models with consistent alignment.

Conclusion: Hierarchical federated learning is effective for agricultural contexts, enabling local specialization and global generalization while reducing communication overhead and preserving data privacy.

Abstract: In this paper, we presents a novel hierarchical federated learning architecture specifically designed for smart agricultural production systems and crop yield prediction. Our approach introduces a seasonal subscription mechanism where farms join crop-specific clusters at the beginning of each agricultural season. The proposed three-layer architecture consists of individual smart farms at the client level, crop-specific aggregators at the middle layer, and a global model aggregator at the top level. Within each crop cluster, clients collaboratively train specialized models tailored to specific crop types, which are then aggregated to produce a higher-level global model that integrates knowledge across multiple crops. This hierarchical design enables both local specialization for individual crop types and global generalization across diverse agricultural contexts while preserving data privacy and reducing communication overhead. Experiments demonstrate the effectiveness of the proposed system, showing that local and crop-layer models closely follow actual yield patterns with consistent alignment, significantly outperforming standard machine learning models. The results validate the advantages of hierarchical federated learning in the agricultural context, particularly for scenarios involving heterogeneous farming environments and privacy-sensitive agricultural data.

[451] Few Shot Semi-Supervised Learning for Abnormal Stop Detection from Sparse GPS Trajectories

Muhammad Ayub Sabir, Junbiao Pang, Jiaqi Wu, Fatima Ashraf

Main category: cs.LG

TL;DR: A sparsity-aware segmentation method with graph-based semi-supervised learning for detecting abnormal stops in intercity coach transportation using sparse GPS data and limited labels.

Details

Motivation: Address challenges in abnormal stop detection caused by sparse GPS trajectories that obscure short/unauthorized stops and limited labeled data that restricts supervised learning.

Method: Sparsity-Aware Segmentation (SAS) defines segment boundaries based on spatial-temporal density, extracts domain-specific indicators, applies LTIGA for smoothing, constructs spatial-temporal graph with label propagation and GCN, and uses self-training with pseudo-labels.

Result: Achieves AUC of 0.854 and AP of 0.866 using only 10 labeled instances on real-world coach data, outperforming prior methods.

Conclusion: The proposed framework effectively handles data sparsity and label scarcity in abnormal stop detection, demonstrating superior performance with minimal supervision.

Abstract: Abnormal stop detection (ASD) in intercity coach transportation is critical for ensuring passenger safety, operational reliability, and regulatory compliance. However, two key challenges hinder ASD effectiveness: sparse GPS trajectories, which obscure short or unauthorized stops, and limited labeled data, which restricts supervised learning. Existing methods often assume dense sampling or regular movement patterns, limiting their applicability. To address data sparsity, we propose a Sparsity-Aware Segmentation (SAS) method that adaptively defines segment boundaries based on local spatial-temporal density. Building upon these segments, we introduce three domain-specific indicators to capture abnormal stop behaviors. To further mitigate the impact of sparsity, we develop Locally Temporal-Indicator Guided Adjustment (LTIGA), which smooths these indicators via local similarity graphs. To overcome label scarcity, we construct a spatial-temporal graph where each segment is a node with LTIGA-refined features. We apply label propagation to expand weak supervision across the graph, followed by a GCN to learn relational patterns. A final self-training module incorporates high-confidence pseudo-labels to iteratively improve predictions. Experiments on real-world coach data show an AUC of 0.854 and AP of 0.866 using only 10 labeled instances, outperforming prior methods. The code and dataset are publicly available at \href{https://github.com/pangjunbiao/Abnormal-Stop-Detection-SSL.git}

[452] Multitask finetuning and acceleration of chemical pretrained models for small molecule drug property prediction

Matthew Adrian, Yunsie Chung, Kevin Boyd, Saee Paliwal, Srimukh Prasad Veccham, Alan C. Cheng

Main category: cs.LG

TL;DR: Multitask finetuning of chemical pretrained graph neural network models (KERMT and KGPT) significantly improves performance over non-pretrained models, with the most significant improvements at larger data sizes.

Details

Motivation: To leverage general chemical knowledge from self-supervised training to improve predictions for critical drug discovery endpoints like on-target potency and ADMET properties, building on previous success of multi-task learning.

Method: Finetuning chemical pretrained graph neural network models (KERMT and KGPT) in a multitask manner, and publishing multitask ADMET data splits for benchmarking.

Result: Multitask finetuning significantly improves performance over non-pretrained graph neural network models, with the most significant performance improvement at larger data sizes.

Conclusion: Multitask finetuning of chemical pretrained models is effective for drug discovery applications, and the authors provide resources (data splits and accelerated implementation) to enable broader adoption in industrial workflows.

Abstract: Chemical pretrained models, sometimes referred to as foundation models, are receiving considerable interest for drug discovery applications. The general chemical knowledge extracted from self-supervised training has the potential to improve predictions for critical drug discovery endpoints, including on-target potency and ADMET properties. Multi-task learning has previously been successfully leveraged to improve predictive models. Here, we show that enabling multitasking in finetuning of chemical pretrained graph neural network models such as Kinetic GROVER Multi-Task (KERMT), an enhanced version of the GROVER model, and Knowledge-guided Pre-training of Graph Transformer (KGPT) significantly improves performance over non-pretrained graph neural network models. Surprisingly, we find that the performance improvement from finetuning KERMT in a multitask manner is most significant at larger data sizes. Additionally, we publish two multitask ADMET data splits to enable more accurate benchmarking of multitask deep learning methods for drug property prediction. Finally, we provide an accelerated implementation of the KERMT model on GitHub, unlocking large-scale pretraining, finetuning, and inference in industrial drug discovery workflows.

[453] CARVQ: Corrective Adaptor with Group Residual Vector Quantization for LLM Embedding Compression

Dayin Gou, Sanghyun Byun, Nilesh Malpeddi, Gabrielle De Micheli, Prathamesh Vaste, Jacob Song, Woo Seong Chung

Main category: cs.LG

TL;DR: CARVQ is a novel compression method that combines corrective adaptors with group residual vector quantization to compress LLM embedding layers to ~1.6 bits, reducing memory footprint for edge device deployment.

Details

Motivation: LLMs have large embedding layers that require substantial storage and memory, especially problematic for memory-constrained edge devices where reducing memory footprint can speed up inference.

Method: CARVQ uses a composition of linear and non-linear maps with group Residual Vector Quantization and corrective adaptors to compress embedding layers without requiring specialized hardware for lower-bit storage.

Result: CARVQ achieves lower average bitwidth-per-parameter (~1.6 bits) while maintaining reasonable perplexity and accuracy compared to scalar quantization across multiple LLMs and tasks.

Conclusion: CARVQ provides an effective compression technique compatible with existing transformer quantization methods, enabling efficient LLM deployment on memory-constrained edge devices.

Abstract: Large Language Models (LLMs) typically rely on a large number of parameters for token embedding, leading to substantial storage requirements and memory footprints. In particular, LLMs deployed on edge devices are memory-bound, and reducing the memory footprint by compressing the embedding layer not only frees up the memory bandwidth but also speeds up inference. To address this, we introduce CARVQ, a post-training novel Corrective Adaptor combined with group Residual Vector Quantization. CARVQ relies on the composition of both linear and non-linear maps and mimics the original model embedding to compress to approximately 1.6 bits without requiring specialized hardware to support lower-bit storage. We test our method on pre-trained LLMs such as LLaMA-3.2-1B, LLaMA-3.2-3B, LLaMA-3.2-3B-Instruct, LLaMA-3.1-8B, Qwen2.5-7B, Qwen2.5-Math-7B and Phi-4, evaluating on common generative, discriminative, math and reasoning tasks. We show that in most cases, CARVQ can achieve lower average bitwidth-per-parameter while maintaining reasonable perplexity and accuracy compared to scalar quantization. Our contributions include a novel compression technique that is compatible with state-of-the-art transformer quantization methods and can be seamlessly integrated into any hardware supporting 4-bit memory to reduce the model’s memory footprint in memory-constrained devices. This work demonstrates a crucial step toward the efficient deployment of LLMs on edge devices.

[454] Improving Decision Trees through the Lens of Parameterized Local Search

Juha Harviainen, Frank Sommer, Manuel Sorge

Main category: cs.LG

TL;DR: The paper analyzes the complexity of optimizing decision trees by performing local search operations like adjusting thresholds or exchanging features to minimize classification errors. It shows these problems are NP-complete but identifies parameter combinations that make them tractable.

Details

Motivation: Decision tree learning algorithms often use heuristic local-search operations, but their computational complexity wasn't well understood. The paper aims to determine which properties make these optimization problems hard or tractable.

Method: The authors conduct a comprehensive parameterized-complexity analysis, studying problems where a fixed number of local operations (threshold adjustments or feature exchanges) are performed to minimize classification errors. They prove NP-completeness but identify tractable cases.

Result: The problems remain NP-hard for small numbers of features or small domain sizes individually, but become fixed-parameter tractable when both parameters are small. The algorithm runs in $(D + 1)^{2d} \cdot |I|^{O(1)}$ time, where $D$ is domain size and $d$ is number of features.

Conclusion: The combination of small domain size and small number of features yields fixed-parameter tractability for decision tree optimization problems. The authors also provide empirical validation through a proof-of-concept implementation.

Abstract: Algorithms for learning decision trees often include heuristic local-search operations such as (1) adjusting the threshold of a cut or (2) also exchanging the feature of that cut. We study minimizing the number of classification errors by performing a fixed number of a single type of these operations. Although we discover that the corresponding problems are NP-complete in general, we provide a comprehensive parameterized-complexity analysis with the aim of determining those properties of the problems that explain the hardness and those that make the problems tractable. For instance, we show that the problems remain hard for a small number $d$ of features or small domain size $D$ but the combination of both yields fixed-parameter tractability. That is, the problems are solvable in $(D + 1)^{2d} \cdot |I|^{O(1)}$ time, where $|I|$ is the size of the input. We also provide a proof-of-concept implementation of this algorithm and report on empirical results.

[455] Doctor Rashomon and the UNIVERSE of Madness: Variable Importance with Unobserved Confounding and the Rashomon Effect

Jon Donnelly, Srikar Katta, Emanuele Borgonovo, Cynthia Rudin

Main category: cs.LG

TL;DR: UNIVERSE introduces a method to bound variable importance (VI) estimates using Rashomon sets, addressing issues of omitted variables and model dependence in VI analysis.

Details

Motivation: Standard VI methods are limited by their dependence on included variables and suffer from the Rashomon Effect, where different equally-good models yield different VI estimates. These methods also fail when essential variables are missing from observational datasets.

Method: The approach uses Rashomon sets - collections of near-optimal models - to produce bounds on true variable importance even when features are missing. It theoretically guarantees robustness by considering multiple models rather than relying on a single model.

Result: The method demonstrates strong performance on semi-synthetic simulations and shows practical utility in credit risk applications. It provides theoretical guarantees for robustness.

Conclusion: UNIVERSE offers a robust framework for variable importance analysis that accounts for model uncertainty and missing variables, making VI estimates more reliable for hypothesis generation and scientific validation.

Abstract: Variable importance (VI) methods are often used for hypothesis generation, feature selection, and scientific validation. In the standard VI pipeline, an analyst estimates VI for a single predictive model with only the observed features. However, the importance of a feature depends heavily on which other variables are included in the model, and essential variables are often omitted from observational datasets. Moreover, the VI estimated for one model is often not the same as the VI estimated for another equally-good model - a phenomenon known as the Rashomon Effect. We address these gaps by introducing UNobservables and Inference for Variable importancE using Rashomon SEts (UNIVERSE). Our approach adapts Rashomon sets - the sets of near-optimal models in a dataset - to produce bounds on the true VI even with missing features. We theoretically guarantee the robustness of our approach, show strong performance on semi-synthetic simulations, and demonstrate its utility in a credit risk task.

[456] KoALA: KL-L0 Adversarial Detector via Label Agreement

Siqi Li, Yasser Shoukry

Main category: cs.LG

TL;DR: KoALA is a semantics-free adversarial detector that uses disagreement between KL divergence and L0-based similarity metrics to detect attacks without architectural changes or adversarial retraining.

Details

Motivation: Deep neural networks are highly vulnerable to adversarial attacks, posing serious risks to security- and safety-critical applications, necessitating effective detection methods.

Method: KoALA detects adversarial attacks when class predictions from two complementary similarity metrics (KL divergence for dense perturbations, L0-based similarity for sparse perturbations) disagree. Only requires simple fine-tuning on clean images.

Result: KoALA achieves precision of 0.94 and recall of 0.81 on ResNet/CIFAR-10, and precision of 0.66 and recall of 0.85 on CLIP/Tiny-ImageNet when theorem conditions are met.

Conclusion: KoALA provides a lightweight, plug-and-play solution for adversarial detection that works across different models and data modalities without requiring architectural changes or adversarial training.

Abstract: Deep neural networks are highly susceptible to adversarial attacks, which pose significant risks to security- and safety-critical applications. We present KoALA (KL-L0 Adversarial detection via Label Agreement), a novel, semantics-free adversarial detector that requires no architectural changes or adversarial retraining. KoALA operates on a simple principle: it detects an adversarial attack when class predictions from two complementary similarity metrics disagree. These metrics-KL divergence and an L0-based similarity-are specifically chosen to detect different types of perturbations. The KL divergence metric is sensitive to dense, low-amplitude shifts, while the L0-based similarity is designed for sparse, high-impact changes. We provide a formal proof of correctness for our approach. The only training required is a simple fine-tuning step on a pre-trained image encoder using clean images to ensure the embeddings align well with both metrics. This makes KOALA a lightweight, plug-and-play solution for existing models and various data modalities. Our extensive experiments on ResNet/CIFAR-10 and CLIP/Tiny-ImageNet confirm our theoretical claims. When the theorem’s conditions are met, KoALA consistently and effectively detects adversarial examples. On the full test sets, KoALA achieves a precision of 0.94 and a recall of 0.81 on ResNet/CIFAR-10, and a precision of 0.66 and a recall of 0.85 on CLIP/Tiny-ImageNet.

[457] Sample-Efficient Omniprediction for Proper Losses

Isaac Gibbs, Ryan J. Tibshirani

Main category: cs.LG

TL;DR: The paper addresses the problem of constructing probabilistic predictions that enable accurate decisions for multiple downstream users, focusing on omniprediction where a single predictor minimizes multiple loss functions simultaneously.

Details

Motivation: To design optimal predictors that work well for multiple decision makers with different utility functions, avoiding the limitations of existing approaches that either use multicalibration (which is suboptimal) or produce complex randomized predictors.

Method: The authors propose a direct, unrandomized algorithm that exploits structural elements of proper losses, improving upon existing adversarial game-based approaches that require online-to-batch conversion and produce complex predictors.

Result: The paper shows that multicalibration is strictly more difficult than omniprediction, and presents a more efficient unrandomized algorithm that avoids the complexity of previous methods.

Conclusion: A more direct approach to omniprediction is possible by leveraging the structure of proper losses, leading to simpler and more efficient predictors for multiple decision makers.

Abstract: We consider the problem of constructing probabilistic predictions that lead to accurate decisions when employed by downstream users to inform actions. For a single decision maker, designing an optimal predictor is equivalent to minimizing a proper loss function corresponding to the negative utility of that individual. For multiple decision makers, our problem can be viewed as a variant of omniprediction in which the goal is to design a single predictor that simultaneously minimizes multiple losses. Existing algorithms for achieving omniprediction broadly fall into two categories: 1) boosting methods that optimize other auxiliary targets such as multicalibration and obtain omniprediction as a corollary, and 2) adversarial two-player game based approaches that estimate and respond to the ``worst-case" loss in an online fashion. We give lower bounds demonstrating that multicalibration is a strictly more difficult problem than omniprediction and thus the former approach must incur suboptimal sample complexity. For the latter approach, we discuss how these ideas can be used to obtain a sample-efficient algorithm through an online-to-batch conversion. This conversion has the downside of returning a complex, randomized predictor. We improve on this method by designing a more direct, unrandomized algorithm that exploits structural elements of the set of proper losses.

[458] ACCO: Accumulate While You Communicate for Communication-Overlapped Sharded LLM Training

Adel Nabli, Louis Fournier, Pierre Erbacher, Louis Serrano, Eugene Belilovsky, Edouard Oyallon

Main category: cs.LG

TL;DR: ACCO is a memory-efficient distributed optimization algorithm for LLM training that reduces communication overhead by synchronizing delayed gradients while computing new ones, enabling better scalability across heterogeneous hardware.

Details

Motivation: Existing distributed training methods for LLMs face communication bottlenecks in data parallel setups and high memory costs with local optimization algorithms that prevent optimizer state sharding.

Method: Proposed ACCO algorithm synchronizes delayed gradients while computing new gradients to reduce GPU idle time, with a novel technique to mitigate convergence issues from delayed updates by aligning training dynamics with standard distributed optimization.

Result: ACCO is significantly faster than ZeRO-1 and scales effectively across heterogeneous hardware configurations.

Conclusion: ACCO provides an efficient solution for distributed LLM training that balances communication efficiency, memory usage, and convergence stability while supporting heterogeneous hardware environments.

Abstract: Training LLMs relies on distributed implementations using multiple GPUs to compute gradients in parallel with sharded optimizers. However, synchronizing gradients in data parallel setups introduces communication overhead that grows with the number of workers, limiting parallelization efficiency. Local optimization algorithms reduce communications but incur high memory costs as they prevent optimizer state sharding, hindering scalability. To address this, we propose \textbf{AC}cumulate while \textbf{CO}mmunicate (ACCO), a memory-efficient optimization algorithm for distributed LLM training. By synchronizing delayed gradients while computing new ones, ACCO reduces GPU idle time and supports heterogeneous hardware. To mitigate the convergence issues caused by delayed updates, we introduce a novel technique ensuring training dynamics align with standard distributed optimization. Compared to ZeRO-1, our approach is significantly faster and scales effectively across heterogeneous hardware.

[459] COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences

Yixin Liu, Argyris Oikonomou, Weiqiang Zheng, Yang Cai, Arman Cohan

Main category: cs.LG

TL;DR: COMAL is a meta-algorithm for language model alignment that converges to exact Nash equilibrium policies, maintaining high win rates against competing algorithms.

Details

Motivation: Existing alignment methods like RLHF rely on the Bradley-Terry reward assumption, which fails to capture the full complexity of human preferences. Current self-play algorithms either diverge or only converge to modified Nash policies.

Method: Model alignment as a two-player zero-sum game and propose COMAL, a convergent meta-algorithm inspired by game theory that finds exact Nash equilibrium policies through last-iterate convergence.

Result: COMAL consistently maintains above 60.2% and 56.8% win rates against all compared algorithms when applied to Llama-3-8B-Instruct and Qwen2.5-7B models, outperforming previous methods.

Conclusion: COMAL provides a theoretically sound and practical solution for language model alignment with general preferences, achieving superior performance while being simple to integrate with existing preference optimization methods.

Abstract: Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is not always sufficient to capture the full range and complexity of general human preferences. We explore RLHF under a general preference framework by modeling the alignment problem as a two-player zero-sum game in a game-theoretic framework, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous self-play algorithms for finding the Nash policy either diverge or only converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, Convergent Meta Alignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. We provide theoretical analysis that our meta-algorithm converges to an exact Nash policy in the last iterate and demonstrate its effectiveness on a range of synthetic and preference optimization datasets. COMAL is simple and can be integrated with many existing methods designed for preference optimization with minimal changes, and empirically it consistently maintains above 60.2% and 56.8% win rates, when applied to Llama-3-8B-Instruct and Qwen2.5-7B, against all compared algorithms under controlled evaluations.

[460] Multi-View Majority Vote Learning Algorithms: Direct Minimization of PAC-Bayesian Bounds

Mehdi Hennequin, Abdelkrim Zitouni, Khalid Benabdeslem, Haytham Elghazel, Yacine Gaci

Main category: cs.LG

TL;DR: Extends PAC-Bayesian theory to multi-view learning with Rényi divergence-based generalization bounds, oracle bounds, and C-bound extensions, supported by efficient self-bounding algorithms.

Details

Motivation: The PAC-Bayesian framework has advanced statistical learning but remains underexplored for multi-view learning with multiple complementary data representations.

Method: Introduces novel generalization bounds based on Rényi divergence as an alternative to KL divergence, proposes first- and second-order oracle PAC-Bayesian bounds, extends C-bound to multi-view settings, and designs efficient self-bounding optimization algorithms.

Result: Develops theoretical framework with Rényi divergence-based bounds and practical optimization algorithms that bridge theory and practice for multi-view learning.

Conclusion: Successfully extends PAC-Bayesian theory to multi-view learning with novel bounds and practical algorithms, advancing the application of PAC-Bayesian methods to multi-view settings.

Abstract: The PAC-Bayesian framework has significantly advanced the understanding of statistical learning, particularly for majority voting methods. Despite its successes, its application to multi-view learning – a setting with multiple complementary data representations – remains underexplored. In this work, we extend PAC-Bayesian theory to multi-view learning, introducing novel generalization bounds based on R'enyi divergence. These bounds provide an alternative to traditional Kullback-Leibler divergence-based counterparts, leveraging the flexibility of R'enyi divergence. Furthermore, we propose first- and second-order oracle PAC-Bayesian bounds and extend the C-bound to multi-view settings. To bridge theory and practice, we design efficient self-bounding optimization algorithms that align with our theoretical results.

[461] GraphRAG under Fire

Jiacheng Liang, Yuhui Wang, Changjiang Li, Rongyi Zhu, Tanqiu Jiang, Neil Gong, Ting Wang

Main category: cs.LG

TL;DR: GraphRAG’s graph-based structure makes it more resilient to traditional RAG poisoning attacks but creates new vulnerabilities that GragPoison exploits through relation manipulation.

Details

Motivation: To investigate GraphRAG's security implications and vulnerability to poisoning attacks, uncovering both its strengths and new attack surfaces created by its graph-based architecture.

Method: Developed GragPoison attack with three strategies: relation injection (introducing false knowledge), relation enhancement (amplifying poisoning influence), and narrative generation (embedding malicious content in coherent text).

Result: GragPoison achieved up to 98% success rate using less than 68% poisoning text across diverse datasets and models, significantly outperforming existing attacks on GraphRAG variations.

Conclusion: GraphRAG presents a security paradox - more resilient to traditional attacks but vulnerable to novel graph-exploiting attacks like GragPoison, highlighting the need for new defensive measures.

Abstract: GraphRAG advances retrieval-augmented generation (RAG) by structuring external knowledge as multi-scale knowledge graphs, enabling language models to integrate both broad context and granular details in their generation. While GraphRAG has demonstrated success across domains, its security implications remain largely unexplored. To bridge this gap, this work examines GraphRAG’s vulnerability to poisoning attacks, uncovering an intriguing security paradox: existing RAG poisoning attacks are less effective under GraphRAG than conventional RAG, due to GraphRAG’s graph-based indexing and retrieval; yet, the same features also create new attack surfaces. We present GragPoison, a novel attack that exploits shared relations in the underlying knowledge graph to craft poisoning text capable of compromising multiple queries simultaneously. GragPoison employs three key strategies: (i) relation injection to introduce false knowledge, (ii) relation enhancement to amplify poisoning influence, and (iii) narrative generation to embed malicious content within coherent text. Empirical evaluation across diverse datasets and models shows that GragPoison substantially outperforms existing attacks in terms of effectiveness (up to 98% success rate) and scalability (using less than 68% poisoning text) on multiple variations of GraphRAG. We also explore potential defensive measures and their limitations, identifying promising directions for future research.

[462] ParetoQ: Improving Scaling Laws in Extremely Low-bit LLM Quantization

Zechun Liu, Changsheng Zhao, Hanxian Huang, Sijia Chen, Jing Zhang, Jiawei Zhao, Scott Roy, Lisa Jin, Yunyang Xiong, Yangyang Shi, Lin Xiao, Yuandong Tian, Bilge Soran, Raghuraman Krishnamoorthi, Tijmen Blankevoort, Vikas Chandra

Main category: cs.LG

TL;DR: ParetoQ is a unified framework for comparing quantization methods across different bit-widths (1-bit to 4-bit), revealing a learning transition between 2 and 3 bits and achieving state-of-the-art performance with optimized training schemes.

Details

Motivation: There is ongoing debate about the optimal bit-width for quantization, with conflicting claims about 4-bit vs 1.58-bit performance, but no cohesive framework exists to rigorously compare different bit-widths.

Method: ParetoQ is a unified framework that optimizes training schemes and refines quantization functions to enable fair comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings.

Result: ParetoQ surpasses all previous methods tailored to specific bit widths. The 600M-parameter ternary model outperforms the previous 3B-parameter ternary model using only one-fifth of parameters. Ternary, 2-bit, and 3-bit quantization generally exceed 4-bit and binary quantization in size-accuracy trade-off.

Conclusion: 2-bit quantization offers promising potential for memory reduction and speedup considering hardware constraints, while ternary, 2-bit, and 3-bit quantization maintain comparable performance and generally exceed 4-bit and binary quantization.

Abstract: The optimal bit-width for achieving the best trade-off between quantized model size and accuracy has been a subject of ongoing debate. While some advocate for 4-bit quantization, others propose that 1.58-bit offers superior results. However, the lack of a cohesive framework for different bits has left such conclusions relatively tenuous. We present ParetoQ, the first unified framework that facilitates rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. Our findings reveal a notable learning transition between 2 and 3 bits: For 3-bits and above, the fine-tuned models stay close to their original pre-trained distributions, whereas for learning 2-bit networks or below, the representations change drastically. By optimizing training schemes and refining quantization functions, ParetoQ surpasses all previous methods tailored to specific bit widths. Remarkably, our ParetoQ ternary 600M-parameter model even outperforms the previous SoTA ternary 3B-parameter model in accuracy, using only one-fifth of the parameters. Extensive experimentation shows that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off and generally exceeds 4-bit and binary quantization. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.

[463] Fixed Point Explainability

Emanuele La Malfa, Jon Vadillo, Marco Molinari, Michael Wooldridge

Main category: cs.LG

TL;DR: Introduces fixed point explanations to assess model-explainer stability through recursive applications, with properties like minimality, stability, and faithfulness.

Details

Motivation: To evaluate the stability of interactions between models and explainers, revealing hidden behaviors and explanatory weaknesses.

Method: Defines fixed point explanations with convergence conditions for various explainer classes (feature-based to mechanistic tools like Sparse AutoEncoders).

Result: Reports quantitative and qualitative results across multiple datasets and models, including large language models like Llama-3.3-70B.

Conclusion: Fixed point explanations provide a formal framework to assess and improve model-explainer stability and reveal hidden model behaviors.

Abstract: This paper introduces a formal notion of fixed point explanations, inspired by the “why regress” principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results for several datasets and models, including LLMs such as Llama-3.3-70B.

[464] Joint Embedding vs Reconstruction: Provable Benefits of Latent Space Prediction for Self Supervised Learning

Hugues Van Assel, Mark Ibrahim, Tommaso Biancalani, Aviv Regev, Randall Balestriero

Main category: cs.LG

TL;DR: The paper analyzes reconstruction vs joint embedding SSL paradigms, showing joint embedding requires weaker alignment conditions and is preferable when irrelevant features have large magnitude.

Details

Motivation: Practitioners lack clear guidelines for choosing between reconstruction and joint embedding SSL methods, both of which offer compelling advantages.

Method: Leveraging closed form solutions for both approaches to characterize how view generation processes impact learned representations and analyze alignment conditions.

Result: Both SSL paradigms require minimal alignment between augmentations and irrelevant features for asymptotic optimality. Joint embedding imposes strictly weaker alignment conditions than reconstruction methods.

Conclusion: Joint embedding methods are preferable when irrelevant features have large magnitude, clarifying trade-offs between paradigms and substantiating empirical success on real-world datasets.

Abstract: Reconstruction and joint embedding have emerged as two leading paradigms in Self Supervised Learning (SSL). Reconstruction methods focus on recovering the original sample from a different view in input space. On the other hand, joint embedding methods align the representations of different views in latent space. Both approaches offer compelling advantages, yet practitioners lack clear guidelines for choosing between them. In this work, we unveil the core mechanisms that distinguish each paradigm. By leveraging closed form solutions for both approaches, we precisely characterize how the view generation process, e.g. data augmentation, impacts the learned representations. We then demonstrate that, unlike supervised learning, both SSL paradigms require a minimal alignment between augmentations and irrelevant features to achieve asymptotic optimality with increasing sample size. Our findings indicate that in scenarios where these irrelevant features have a large magnitude, joint embedding methods are preferable because they impose a strictly weaker alignment condition compared to reconstruction based methods. These results not only clarify the trade offs between the two paradigms but also substantiate the empirical success of joint embedding approaches on real world challenging datasets.

[465] Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Beier Luo, Shuoyuan Wang, Sharon Li, Hongxin Wei

Main category: cs.LG

TL;DR: DACA is an unsupervised method that improves confidence calibration in post-trained language models by selectively using agreement examples between PLM and PoLM to avoid over-large temperature scaling caused by prediction disagreement.

Details

Motivation: Post-trained language models often suffer from over-confidence issues, assigning high confidence to both correct and incorrect outputs, which undermines reliability in critical applications. The main challenge is the scarcity of labeled data for individual downstream tasks.

Method: Proposes Disagreement-Aware Confidence Alignment (DACA), which selectively uses only agreement examples between PLM and PoLM for temperature scaling calibration, effectively decoupling the influence of disagreement examples that cause under-confidence issues.

Result: Extensive experiments show DACA improves average ECE of open-sourced and API-based LLMs (including GPT-4o) by up to 15.08% on common benchmarks.

Conclusion: DACA effectively addresses the over-confidence problem in post-trained language models through unsupervised confidence calibration that leverages agreement examples while avoiding the negative impact of prediction disagreement.

Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM’s confidence underestimates PoLM’s prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g. GPT-4o) by up to 15.08$%$ on common benchmarks.

[466] Protein Design with Dynamic Protein Vocabulary

Nuowei Liu, Jiahao Kuang, Yanting Liu, Tao Ji, Changzhi Sun, Man Lan, Yuanbin Wu

Main category: cs.LG

TL;DR: ProDVa is a novel protein design approach that integrates text descriptions of functions with protein fragments to generate structurally plausible and functionally aligned protein sequences, achieving better foldability than state-of-the-art models with minimal training data.

Details

Motivation: Current deep generative models for protein design from text descriptions struggle with structural plausibility, while classical methods use natural protein structures. The authors explore whether incorporating fragments from natural proteins can improve foldability in generative models.

Method: ProDVa integrates three components: a text encoder for functional descriptions, a protein language model for sequence design, and a fragment encoder that dynamically retrieves protein fragments based on textual functional descriptions.

Result: ProDVa designs protein sequences that are both functionally aligned and structurally plausible. It achieves comparable function alignment using less than 0.04% of training data compared to state-of-the-art models, while significantly improving foldability - increasing proteins with pLDDT above 70 by 7.38% and those with PAE below 10 by 9.6%.

Conclusion: Incorporating natural protein fragments into generative models enhances structural plausibility in protein design, and ProDVa demonstrates an effective approach for designing well-folded, functionally aligned proteins with minimal training data.

Abstract: Protein design is a fundamental challenge in biotechnology, aiming to design novel sequences with specific functions within the vast space of possible proteins. Recent advances in deep generative models have enabled function-based protein design from textual descriptions, yet struggle with structural plausibility. Inspired by classical protein design methods that leverage natural protein structures, we explore whether incorporating fragments from natural proteins can enhance foldability in generative models. Our empirical results show that even random incorporation of fragments improves foldability. Building on this insight, we introduce ProDVa, a novel protein design approach that integrates a text encoder for functional descriptions, a protein language model for designing proteins, and a fragment encoder to dynamically retrieve protein fragments based on textual functional descriptions. Experimental results demonstrate that our approach effectively designs protein sequences that are both functionally aligned and structurally plausible. Compared to state-of-the-art models, ProDVa achieves comparable function alignment using less than 0.04% of the training data, while designing significantly more well-folded proteins, with the proportion of proteins having pLDDT above 70 increasing by 7.38% and those with PAE below 10 increasing by 9.6%.

[467] Finite Sample Analysis of Linear Temporal Difference Learning with Arbitrary Features

Zixuan Xie, Xinyu Liu, Rohan Chandra, Shangtong Zhang

Main category: cs.LG

TL;DR: Establishes first L² convergence rates for linear TD(λ) under arbitrary features without requiring linearly independent features or algorithmic modifications, applicable to both discounted and average-reward settings.

Details

Motivation: Previous convergence analyses for linear TD(λ) required linearly independent features, which limits practical applicability since many real-world scenarios use arbitrary features that don't satisfy this condition.

Method: Develops a novel stochastic approximation framework that handles convergence to solution sets rather than single points, addressing the non-uniqueness issue arising from arbitrary features.

Result: Proves L² convergence rates for linear TD(λ) under arbitrary features without additional assumptions or algorithmic changes, covering both discounted and average-reward reinforcement learning settings.

Conclusion: This work provides theoretical guarantees for linear TD(λ) in practical scenarios with arbitrary features, overcoming previous limitations and enabling broader application of this fundamental policy evaluation algorithm.

Abstract: Linear TD($\lambda$) is one of the most fundamental reinforcement learning algorithms for policy evaluation. Previously, convergence rates are typically established under the assumption of linearly independent features, which does not hold in many practical scenarios. This paper instead establishes the first $L^2$ convergence rates for linear TD($\lambda$) operating under arbitrary features, without making any algorithmic modification or additional assumptions. Our results apply to both the discounted and average-reward settings. To address the potential non-uniqueness of solutions resulting from arbitrary features, we develop a novel stochastic approximation result featuring convergence rates to the solution set instead of a single point.

[468] Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures

Yu He, Yingxi Li, Colin White, Ellen Vitercik

Main category: cs.LG

TL;DR: DSR-Bench is a new benchmark that systematically evaluates LLMs’ structural reasoning abilities using canonical data structures, revealing significant limitations in current models.

Details

Motivation: As LLMs handle more complex tasks, understanding their algorithmic reasoning capabilities has become crucial, but existing evaluations focus on isolated tasks rather than systematic structural reasoning.

Method: Created DSR-Bench with 20 data structures, 35 operations, and 4,140 synthetic problem instances using hierarchical design and automated evaluation to assess structural reasoning through canonical data structures.

Result: Top-performing LLM scored only 0.498/1 on challenging instances, with additional weaknesses in spatial data, natural language scenarios, and reasoning over generated code.

Conclusion: DSR-Bench provides a principled diagnostic tool to expose reasoning bottlenecks and guide development of more capable LLMs.

Abstract: As large language models (LLMs) take on increasingly complex tasks, understanding their algorithmic reasoning abilities has become essential. However, existing evaluations focus on distinct and isolated tasks. We propose a unified diagnostic lens: structural reasoning–understanding and manipulating relationships like order, hierarchy, and connectivity. We introduce DSR-Bench, the first benchmark to systematically evaluate LLM structural reasoning through canonical data structures, which serve as interpretable, algorithmically meaningful abstractions. DSR-Bench spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination. The benchmark’s hierarchical design pinpoints specific failure modes, while its fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs reveals critical limitations: the top-performing model scores only 0.498 out of 1 on challenging instances. Three additional evaluation suites reveal further weaknesses: models perform poorly on spatial data and natural language scenarios, and fail to reason over their own generated code. DSR-Bench offers a principled diagnostic tool for structural reasoning, helping expose reasoning bottlenecks and guide the development of more capable and reliable LLMs.

[469] Uncertainty Estimation on Graphs with Structure Informed Stochastic Partial Differential Equations

Fred Xu, Thomas Markovich

Main category: cs.LG

TL;DR: A novel graph neural network method that incorporates spatial-temporal noise inspired by stochastic PDEs to improve uncertainty estimation under distributional shifts.

Details

Motivation: Traditional uncertainty estimation methods struggle with graph data due to complexity from both graph structure and label distribution randomness, especially under distributional shifts.

Method: Analogizes GNN message passing with SPDE evolution driven by Matern Gaussian Process, designing a novel message passing scheme with spatial-temporal noises to control covariance kernel smoothness.

Result: Extensive experiments on OOD detection across graph datasets with varying label informativeness demonstrate superiority over existing approaches.

Conclusion: The proposed method effectively captures uncertainty across space and time while allowing explicit control over covariance kernel smoothness, enhancing uncertainty estimates for graphs with both low and high label informativeness.

Abstract: Graph Neural Networks have achieved impressive results across diverse network modeling tasks, but accurately estimating uncertainty on graphs remains difficult, especially under distributional shifts. Unlike traditional uncertainty estimation, graph-based uncertainty must account for randomness arising from both the graph’s structure and its label distribution, which adds complexity. In this paper, making an analogy between the evolution of a stochastic partial differential equation (SPDE) driven by Matern Gaussian Process and message passing using GNN layers, we present a principled way to design a novel message passing scheme that incorporates spatial-temporal noises motivated by the Gaussian Process approach to SPDE. Our method simultaneously captures uncertainty across space and time and allows explicit control over the covariance kernel smoothness, thereby enhancing uncertainty estimates on graphs with both low and high label informativeness. Our extensive experiments on Out-of-Distribution (OOD) detection on graph datasets with varying label informativeness demonstrate the soundness and superiority of our model to existing approaches.

[470] Posterior Sampling for Continuing Environments

Wanqiao Xu, Shi Dong, Benjamin Van Roy

Main category: cs.LG

TL;DR: The paper proposes continuing PSRL, an extension of posterior sampling for RL that works in continuing agent-environment interfaces and provides Bayesian regret bounds.

Details

Motivation: To develop a reinforcement learning approach suitable for continuing (non-episodic) environments that scales to complex settings and provides theoretical guarantees.

Method: Continuing PSRL maintains a statistically plausible environment model and follows an optimal policy for that model. With probability 1-γ at each time step, it resamples the model from the posterior distribution over environments.

Result: The authors establish an Õ(τS√AT) bound on Bayesian regret, where S is states, A is actions, T is horizon, and τ is reward averaging time.

Conclusion: This work formalizes and rigorously analyzes the resampling approach with randomized exploration for continuing RL environments, providing the first theoretical guarantees for this method.

Abstract: We develop an extension of posterior sampling for reinforcement learning (PSRL) that is suited for a continuing agent-environment interface and integrates naturally into agent designs that scale to complex environments. The approach, continuing PSRL, maintains a statistically plausible model of the environment and follows a policy that maximizes expected $\gamma$-discounted return in that model. At each time, with probability $1-\gamma$, the model is replaced by a sample from the posterior distribution over environments. For a choice of discount factor that suitably depends on the horizon $T$, we establish an $\tilde{O}(\tau S \sqrt{A T})$ bound on the Bayesian regret, where $S$ is the number of environment states, $A$ is the number of actions, and $\tau$ denotes the reward averaging time, which is a bound on the duration required to accurately estimate the average reward of any policy. Our work is the first to formalize and rigorously analyze the resampling approach with randomized exploration.

[471] WW-FL: Secure and Private Large-Scale Federated Learning

Felix Marx, Thomas Schneider, Ajith Suresh, Tobias Wehrle, Christian Weinert, Hossein Yalame

Main category: cs.LG

TL;DR: WW-FL is a federated learning framework that combines secure multi-party computation with hierarchical FL to protect data and global model privacy while limiting poisoning attacks.

Details

Motivation: Existing FL protection measures are inadequate when applied independently, and there are challenges in creating effective compositions to address security and privacy vulnerabilities.

Method: Proposed WW-FL framework combining secure multi-party computation (MPC) with hierarchical FL, implemented using PyTorch integrated with Meta’s CrypTen MPC framework.

Result: WW-FL prevents malicious clients from directly poisoning model parameters and confines them to less destructive data poisoning attacks.

Conclusion: WW-FL is a promising solution for secure and private large-scale federated learning, as demonstrated through extensive evaluation.

Abstract: Federated learning (FL) is an efficient approach for large-scale distributed machine learning that promises data privacy by keeping training data on client devices. However, recent research has uncovered vulnerabilities in FL, impacting both security and privacy through poisoning attacks and the potential disclosure of sensitive information in individual model updates as well as the aggregated global model. This paper explores the inadequacies of existing FL protection measures when applied independently, and the challenges of creating effective compositions. Addressing these issues, we propose WW-FL, an innovative framework that combines secure multi-party computation (MPC) with hierarchical FL to guarantee data and global model privacy. One notable feature of WW-FL is its capability to prevent malicious clients from directly poisoning model parameters, confining them to less destructive data poisoning attacks. We furthermore provide a PyTorch-based FL implementation integrated with Meta’s CrypTen MPC framework to systematically measure the performance and robustness of WW-FL. Our extensive evaluation demonstrates that WW-FL is a promising solution for secure and private large-scale federated learning.

[472] Time-IMM: A Dataset and Benchmark for Irregular Multimodal Multivariate Time Series

Ching Chang, Jeehyun Hwang, Yidan Shi, Haixin Wang, Wen-Chih Peng, Tien-Fu Chen, Wei Wang

Main category: cs.LG

TL;DR: Time-IMM is a dataset for irregular multimodal time series with nine types of irregularity, and IMM-TSF is a benchmark library with fusion modules that improves forecasting performance on real-world messy data.

Details

Motivation: Existing time series benchmarks assume clean, regular, unimodal data, creating a gap with real-world applications where data is irregular, multimodal, and messy with varying sampling rates and missing values.

Method: Created Time-IMM dataset with nine irregularity types (trigger-based, constraint-based, artifact-based) and developed IMM-TSF benchmark with timestamp-to-text and multimodality fusion modules supporting recency-aware averaging and attention-based integration.

Result: Empirical results show that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance.

Conclusion: Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions, bridging the gap between research and deployment.

Abstract: Time series data in real-world applications such as healthcare, climate modeling, and finance are often irregular, multimodal, and messy, with varying sampling rates, asynchronous modalities, and pervasive missingness. However, existing benchmarks typically assume clean, regularly sampled, unimodal data, creating a significant gap between research and real-world deployment. We introduce Time-IMM, a dataset specifically designed to capture cause-driven irregularity in multimodal multivariate time series. Time-IMM represents nine distinct types of time series irregularity, categorized into trigger-based, constraint-based, and artifact-based mechanisms. Complementing the dataset, we introduce IMM-TSF, a benchmark library for forecasting on irregular multimodal time series, enabling asynchronous integration and realistic evaluation. IMM-TSF includes specialized fusion modules, including a timestamp-to-text fusion module and a multimodality fusion module, which support both recency-aware averaging and attention-based integration strategies. Empirical results demonstrate that explicitly modeling multimodality on irregular time series data leads to substantial gains in forecasting performance. Time-IMM and IMM-TSF provide a foundation for advancing time series analysis under real-world conditions. The dataset is publicly available at https://www.kaggle.com/datasets/blacksnail789521/time-imm/data, and the benchmark library can be accessed at https://github.com/blacksnail789521/IMM-TSF.

[473] IBCL: Zero-shot Model Generation under Stability-Plasticity Trade-offs

Pengyuan Lu, Michele Caprio, Eric Eaton, Insup Lee

Main category: cs.LG

TL;DR: IBCL is a continual learning method that efficiently generates models for specified stability-plasticity trade-offs without retraining, using convex combinations of parameter distributions.

Details

Motivation: Current continual learning methods require retraining for each new trade-off preference, which is inefficient when many different trade-offs are needed.

Method: IBCL updates knowledge as a convex hull of model parameter distributions and generates Pareto-optimal models via convex combination without additional training.

Result: IBCL improves classification accuracy by up to 44% on average per task and 45% on peak per task accuracy, with constant memory overhead and zero-shot model generation.

Conclusion: IBCL provides an efficient solution for continual learning under specific trade-offs, eliminating the need for retraining while maintaining performance and reducing computational costs.

Abstract: Algorithms that balance the stability-plasticity trade-off are well studied in the Continual Learning literature. However, only a few focus on obtaining models for specified trade-off preferences. When solving the problem of continual learning under specific trade-offs (CLuST), state-of-the-art techniques leverage rehearsal-based learning, which requires retraining when a model corresponding to a new trade-off preference is requested. This is inefficient, since there potentially exists a significant number of different trade-offs, and a large number of models may be requested. As a response, we propose Imprecise Bayesian Continual Learning (IBCL), an algorithm that tackles CLuST efficiently. IBCL replaces retraining with a constant-time convex combination. Given a new task, IBCL (1) updates the knowledge base as a convex hull of model parameter distributions, and (2) generates one Pareto-optimal model per given trade-off via convex combination without additional training. That is, obtaining models corresponding to specified trade-offs via IBCL is zero-shot. Experiments whose baselines are current CLuST algorithms show that IBCL improves classification by at most 44% on average per task accuracy, and by 45% on peak per task accuracy while maintaining a near-zero to positive backward transfer, with memory overheads converging to constants. In addition, its training overhead, measured by the number of batch updates, remains constant at every task, regardless of the number of preferences requested. IBCL also improves multi-objective reinforcement learning tasks by maintaining the same Pareto front hypervolume, while significantly reducing the training cost. Details can be found at: https://github.com/ibcl-anon/ibcl.

[474] LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning

Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo

Main category: cs.LG

TL;DR: LazyEviction is a KV cache compression method that reduces GPU memory usage by 50-70% while maintaining accuracy in long reasoning tasks, addressing the Token Importance Recurrence phenomenon where tokens regain high attention after multiple steps.

Details

Motivation: Existing KV cache compression methods struggle with long reasoning tasks in LLMs, failing to capture the Token Importance Recurrence phenomenon where tokens periodically regain high attention, leading to unpredictable eviction of critical tokens.

Method: LazyEviction uses an observation window-based lagged eviction framework that retains latent recurring tokens through prioritized eviction based on tokens’ recurrence patterns.

Result: Extensive experiments show LazyEviction reduces KV cache by 50-70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines.

Conclusion: LazyEviction effectively addresses the Token Importance Recurrence problem in long reasoning tasks, providing significant KV cache reduction without sacrificing performance.

Abstract: Large Language Models (LLMs) exhibit enhanced capabilities by Chain-of-Thought reasoning. However, the extended reasoning sequences introduce significant GPU memory overhead due to increased key-value (KV) cache. Existing KV cache compression methods mitigate memory bottlenecks but struggle in long reasoning tasks. In this paper, we analyze attention patterns in reasoning tasks and reveal a \textbf{Token Importance Recurrence} phenomenon: a large proportion of tokens regain high attention after multiple decoding steps, which is failed to capture by existing works and may lead to unpredictable eviction on such periodically critical tokens. To address this, we propose \textbf{LazyEviction}, an observation window-based lagged eviction framework retaining latent recurring tokens by prioritized eviction based on tokens’ recurrence patterns. Extensive experiments demonstrate that LazyEviction reduces KV cache by 50%\textasciitilde70% while maintaining comparable accuracy, outperforming existing KV cache compression baselines. Our implementation code can be found at https://github.com/Halo-949/LazyEviction.

[475] Dual Perspectives on Non-Contrastive Self-Supervised Learning

Jean Ponce, Basile Terver, Martial Hebert, Michael Arbel

Main category: cs.LG

TL;DR: Stop gradient and exponential moving average procedures prevent representation collapse in self-supervised learning, though they don’t optimize the original objective. In linear cases, the original objective always leads to collapse, while these procedures yield asymptotically stable equilibria.

Details

Motivation: To understand why stop gradient and exponential moving average procedures effectively prevent representation collapse in non-contrastive self-supervised learning, despite not optimizing the original objective function.

Method: Analyzed the procedures from optimization and dynamical systems perspectives, examining equilibria as algebraic varieties in parameter space, with empirical validation using real and synthetic data.

Result: Stop gradient and exponential moving average procedures avoid collapse and produce asymptotically stable equilibria in linear settings, unlike the original objective which always leads to collapse.

Conclusion: These procedures provide effective mechanisms to prevent representation collapse in self-supervised learning through their dynamical system properties, even though they don’t optimize the original objective function.

Abstract: The {\em stop gradient} and {\em exponential moving average} iterative procedures are commonly used in non-contrastive approaches to self-supervised learning to avoid representation collapse, with excellent performance in downstream applications in practice. This presentation investigates these procedures from the dual viewpoints of optimization and dynamical systems. We show that, in general, although they {\em do not} optimize the original objective, or {\em any} other smooth function, they {\em do} avoid collapse Following~\citet{Tian21}, but without any of the extra assumptions used in their proofs, we then show using a dynamical system perspective that, in the linear case, minimizing the original objective function without the use of a stop gradient or exponential moving average {\em always} leads to collapse. Conversely, we characterize explicitly the equilibria of the dynamical systems associated with these two procedures in this linear setting as algebraic varieties in their parameter space, and show that they are, in general, {\em asymptotically stable}. Our theoretical findings are illustrated by empirical experiments with real and synthetic data.

[476] Competitive Advantage Attacks to Decentralized Federated Learning

Yuqi Jia, Minghong Fang, Neil Zhenqiang Gong

Main category: cs.LG

TL;DR: SelfishAttack is a new attack method for decentralized federated learning where selfish clients manipulate model exchanges to gain competitive advantages over non-selfish clients, achieving higher model accuracy through optimized local model crafting.

Details

Motivation: The motivation is to explore security vulnerabilities in decentralized federated learning (DFL) where clients can potentially manipulate the learning process to gain unfair advantages over other participants, particularly in competitive scenarios like healthcare and banking.

Method: The authors formulate an optimization problem to find carefully crafted local models that selfish clients can send to non-selfish ones in each training round. They propose methods to solve this optimization problem for different DFL aggregation rules and theoretically prove these methods find optimal solutions.

Result: Empirical results show SelfishAttack successfully increases the accuracy gap between selfish and non-selfish clients’ final models. It achieves larger accuracy advantages compared to traditional poisoning attacks when adapted for competitive advantage.

Conclusion: SelfishAttack demonstrates significant security vulnerabilities in DFL systems, where malicious clients can systematically manipulate the learning process to gain unfair competitive advantages, highlighting the need for robust defense mechanisms in decentralized learning environments.

Abstract: Decentralized federated learning (DFL) enables clients (e.g., hospitals and banks) to jointly train machine learning models without a central orchestration server. In each global training round, each client trains a local model on its own training data and then they exchange local models for aggregation. In this work, we propose SelfishAttack, a new family of attacks to DFL. In SelfishAttack, a set of selfish clients aim to achieve competitive advantages over the remaining non-selfish ones, i.e., the final learnt local models of the selfish clients are more accurate than those of the non-selfish ones. Towards this goal, the selfish clients send carefully crafted local models to each remaining non-selfish one in each global training round. We formulate finding such local models as an optimization problem and propose methods to solve it when DFL uses different aggregation rules. Theoretically, we show that our methods find the optimal solutions to the optimization problem. Empirically, we show that SelfishAttack successfully increases the accuracy gap (i.e., competitive advantage) between the final learnt local models of selfish clients and those of non-selfish ones. Moreover, SelfishAttack achieves larger accuracy gaps than poisoning attacks when extended to increase competitive advantages.

[477] Retrieval-Augmented Generation with Estimation of Source Reliability

Jeongyeon Hwang, Junyoung Park, Hyejin Park, Dongwoo Kim, Sangdon Park, Jungseul Ok

Main category: cs.LG

TL;DR: RA-RAG is a Reliability-Aware RAG framework that addresses the limitation of standard RAG by estimating source reliability through cross-checking and prioritizing highly reliable documents for more accurate response generation.

Details

Motivation: Standard RAG often retrieves incorrect information because it only considers query-document relevance while ignoring the heterogeneous reliability of different sources, which can lead to factual inaccuracies.

Method: RA-RAG first estimates source reliability by cross-checking information across multiple sources, then retrieves documents from top-κ reliable and relevant sources, and aggregates information using weighted majority voting (WMV) with selective retrieval for scalability.

Result: Comprehensive experiments show RA-RAG consistently outperforms baselines in scenarios with heterogeneous source reliability while scaling efficiently as the number of sources increases. It also demonstrates the ability to estimate real-world sources’ reliability.

Conclusion: RA-RAG provides a practical and scalable solution for multi-source RAG that improves factual accuracy by incorporating source reliability estimation, making it applicable to real-world scenarios with diverse information sources.

Abstract: Retrieval-Augmented Generation (RAG) is an effective approach to enhance the factual accuracy of large language models (LLMs) by retrieving information from external databases, which are typically composed of diverse sources, to supplement the limited internal knowledge of LLMs. However, the standard RAG often risks retrieving incorrect information, as it relies solely on relevance between a query and a document, overlooking the heterogeneous reliability of these sources. To address this issue, we propose Reliability-Aware RAG (RA-RAG), a new multi-source RAG framework that estimates the reliability of sources and leverages this information to prioritize highly reliable and relevant documents, ensuring more robust and accurate response generation. Specifically, RA-RAG first estimates source reliability by cross-checking information across multiple sources. It then retrieves documents from the top-$\kappa$ reliable and relevant sources and aggregates their information using weighted majority voting (WMV), where the selective retrieval ensures scalability while not compromising the performance. Comprehensive experiments show that RA-RAG consistently outperforms baselines in scenarios with heterogeneous source reliability while scaling efficiently as the number of sources increases. Furthermore, we demonstrate the ability of RA-RAG to estimate real-world sources’ reliability, highlighting its practical applicability. \jy{Our code and data are available at \href{https://github.com/ml-postech/RA-RAG}{RA-RAG}.}

[478] OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang

Main category: cs.LG

TL;DR: OmniDraft is a unified framework that enables a single draft model to work with any target model for speculative decoding, addressing cross-vocabulary mismatches and providing dynamic adaptation to user data with 1.5-2x speedup.

Details

Motivation: Current speculative decoding requires draft models specifically trained for particular target models, creating challenges when target models are incompatible or when latency improvements are expected over time in online deployment settings.

Method: Uses online n-gram cache with hybrid distillation fine-tuning to address cross-vocabulary mismatches, and adaptive drafting techniques to improve decoding speed. Enables a single draft model (Llama-68M) to pair with various target models.

Result: Achieves up to 1.5-2x speedup in speculative decoding. Successfully pairs a single Llama-68M model with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B.

Conclusion: OmniDraft enables the “one drafter for all” paradigm, making it particularly suitable for on-device LLM applications where model cost, efficiency and user customization are important.

Abstract: Speculative decoding generally dictates having a small, efficient draft model that is either pretrained or distilled offline to a particular target model series, for instance, Llama or Qwen models. However, within online deployment settings, there are two major challenges: 1) usage of a target model that is incompatible with the draft model; 2) expectation of latency improvements over usage and time. In this work, we propose OmniDraft, a unified framework that enables a single draft model to operate with any target model and adapt dynamically to user data. We introduce an online n-gram cache with hybrid distillation fine-tuning to address the cross-vocabulary mismatch across draft and target models; and further improve decoding speed by leveraging adaptive drafting techniques. OmniDraft is particularly suitable for on-device LLM applications where model cost, efficiency and user customization are the major points of contention. This further highlights the need to tackle the above challenges and motivates the \textit{``one drafter for all’’} paradigm. We showcase the proficiency of the OmniDraft framework by performing online learning on math reasoning, coding and text generation tasks. Notably, OmniDraft enables a single Llama-68M model to pair with various target models including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding; and additionally provides up to 1.5-2x speedup.

[479] Feature Distillation is the Better Choice for Model-Heterogeneous Federated Learning

Yichen Li, Xiuying Wang, Wenchao Xu, Haozhao Wang, Yining Qi, Jiahua Dong, Ruixuan Li

Main category: cs.LG

TL;DR: FedFD proposes feature distillation with orthogonal projection for model-heterogeneous federated learning to address knowledge bias issues in existing logit-based methods.

Details

Motivation: Existing ensemble distillation methods in heterogeneous federated learning focus on logit distillation, which fails to compensate for knowledge bias from heterogeneous models, leading to unstable training and suboptimal performance.

Method: Proposes feature-based ensemble distillation with orthogonal projection layers for each client model architecture to align features and mitigate knowledge bias, using orthogonal techniques to re-parameterize projection layers.

Result: Extensive experiments show FedFD achieves superior performance compared to state-of-the-art methods in model-heterogeneous federated learning.

Conclusion: Feature distillation with orthogonal projection effectively addresses knowledge bias in heterogeneous federated learning, providing stable training and improved performance over logit-based approaches.

Abstract: Model-Heterogeneous Federated Learning (Hetero-FL) has attracted growing attention for its ability to aggregate knowledge from heterogeneous models while keeping private data locally. To better aggregate knowledge from clients, ensemble distillation, as a widely used and effective technique, is often employed after global aggregation to enhance the performance of the global model. However, simply combining Hetero-FL and ensemble distillation does not always yield promising results and can make the training process unstable. The reason is that existing methods primarily focus on logit distillation, which, while being model-agnostic with softmax predictions, fails to compensate for the knowledge bias arising from heterogeneous models. To tackle this challenge, we propose a stable and efficient Feature Distillation for model-heterogeneous Federated learning, dubbed FedFD, that can incorporate aligned feature information via orthogonal projection to integrate knowledge from heterogeneous models better. Specifically, a new feature-based ensemble federated knowledge distillation paradigm is proposed. The global model on the server needs to maintain a projection layer for each client-side model architecture to align the features separately. Orthogonal techniques are employed to re-parameterize the projection layer to mitigate knowledge bias from heterogeneous models and thus maximize the distilled knowledge. Extensive experiments show that FedFD achieves superior performance compared to state-of-the-art methods.

[480] Toward Fair Graph Neural Networks Via Dual-Teacher Knowledge Distillation

Chengyu Li, Debo Cheng, Guixian Zhang, Yi Li, Shichao Zhang

Main category: cs.LG

TL;DR: FairDTD is a dual-teacher distillation framework that balances fairness and utility in graph neural networks by using feature and structure teachers to transfer fair knowledge while leveraging full graph data.

Details

Motivation: GNNs often produce biased predictions due to sensitive attributes, and existing fairness methods compromise model utility by using partial data (either features or structure alone).

Method: FairDTD uses two fairness-oriented teacher models (feature teacher and structure teacher) for dual distillation, with graph-level distillation and node-specific temperature modules to enhance fair knowledge transfer.

Result: Experiments on benchmark datasets show FairDTD achieves optimal fairness while preserving high model utility.

Conclusion: FairDTD effectively balances fairness and utility in GNN representation learning through dual-teacher distillation guided by causal graph models.

Abstract: Graph Neural Networks (GNNs) have demonstrated strong performance in graph representation learning across various real-world applications. However, they often produce biased predictions caused by sensitive attributes, such as religion or gender, an issue that has been largely overlooked in existing methods. Recently, numerous studies have focused on reducing biases in GNNs. However, these approaches often rely on training with partial data (e.g., using either node features or graph structure alone), which can enhance fairness but frequently compromises model utility due to the limited utilization of available graph information. To address this tradeoff, we propose an effective strategy to balance fairness and utility in knowledge distillation. Specifically, we introduce FairDTD, a novel Fair representation learning framework built on Dual-Teacher Distillation, leveraging a causal graph model to guide and optimize the design of the distillation process. Specifically, FairDTD employs two fairness-oriented teacher models: a feature teacher and a structure teacher, to facilitate dual distillation, with the student model learning fairness knowledge from the teachers while also leveraging full data to mitigate utility loss. To enhance information transfer, we incorporate graph-level distillation to provide an indirect supplement of graph information during training, as well as a node-specific temperature module to improve the comprehensive transfer of fair knowledge. Experiments on diverse benchmark datasets demonstrate that FairDTD achieves optimal fairness while preserving high model utility, showcasing its effectiveness in fair representation learning for GNNs.

[481] Attention-Aware GNN-based Input Defense against Multi-Turn LLM Jailbreak

Zixuan Huang, Kecheng Huang, Lihao Yin, Bowei He, Huiling Zhen, Mingxuan Yuan, Zili Shao

Main category: cs.LG

TL;DR: G-Guard is a Graph Neural Network-based defense system that protects LLMs from multi-turn jailbreak attacks by analyzing entity relationships and using attention-aware augmentation to identify harmful queries.

Details

Motivation: LLMs remain vulnerable to sophisticated multi-turn jailbreak attacks that incrementally escalate dialogue complexity, making them harder to detect than single-turn attacks.

Method: G-Guard constructs entity graphs from multi-turn queries, captures interrelationships between queries and harmful keywords, and uses attention-aware augmentation to retrieve relevant single-turn queries as labeled nodes to enhance classification.

Result: G-Guard consistently outperforms all baseline methods across diverse datasets and evaluation metrics, demonstrating superior effectiveness in detecting multi-turn jailbreak attacks.

Conclusion: G-Guard provides a robust defense mechanism against multi-turn jailbreak attacks, effectively enhancing LLM safety through graph-based analysis and attention-aware augmentation.

Abstract: Large Language Models (LLMs) have gained significant traction in various applications, yet their capabilities present risks for both constructive and malicious exploitation. Despite extensive training and fine-tuning efforts aimed at enhancing safety, LLMs remain susceptible to jailbreak attacks. Recently, the emergence of multi-turn attacks has intensified this vulnerability. Unlike single-turn attacks, multi-turn attacks incrementally escalate dialogue complexity, rendering them more challenging to detect and mitigate. In this study, we introduce G-Guard, an innovative attention-aware Graph Neural Network (GNN)-based input classifier specifically designed to defend against multi-turn jailbreak attacks targeting LLMs. G-Guard constructs an entity graph for multi-turn queries, which captures the interrelationships between queries and harmful keywords that present in multi-turn queries. Furthermore, we propose an attention-aware augmentation mechanism that retrieves the most relevant single-turn query based on the ongoing multi-turn conversation. The retrieved query is incorporated as a labeled node within the graph, thereby enhancing the GNN’s capacity to classify the current query as harmful or benign. Evaluation results show that G-Guard consistently outperforms all baselines across diverse datasets and evaluation metrics, demonstrating its efficacy as a robust defense mechanism against multi-turn jailbreak attacks.

[482] AlphaZero Neural Scaling and Zipf’s Law: a Tale of Board Games and Power Laws

Oren Neumann, Claudius Gros

Main category: cs.LG

TL;DR: This paper examines neural scaling laws in AlphaZero reinforcement learning, finding that game states follow Zipf’s law and agents optimize state loss in descending frequency order, with inverse scaling correlated to unusual Zipf distributions.

Details

Motivation: To understand why neural scaling laws occur and test theories suggesting they emerge from Zipf's law, particularly examining if language-model scaling theories apply to reinforcement learning in AlphaZero.

Method: Analyzed power-law scaling in AlphaZero using a language-model scaling framework, examining Zipf’s law in game states and correlation between scaling-law and Zipf’s-law exponents.

Result: Found that game states in training and inference data scale with Zipf’s law, agents optimize state loss in descending frequency order, and inverse scaling correlates with unusual Zipf curves where end-game states are among the most frequent.

Conclusion: Scaling laws in AlphaZero follow similar patterns to language models, with Zipf’s law playing a key role, and inverse scaling occurs when models shift focus to less important end-game states at the expense of understanding early-game states.

Abstract: Neural scaling laws are observed in a range of domains, to date with no universal understanding of why they occur. Recent theories suggest that loss power laws arise from Zipf’s law, a power law observed in domains like natural language. One theory suggests that language scaling laws emerge when Zipf-distributed task quanta are learned in descending order of frequency. In this paper we examine power-law scaling in AlphaZero, a reinforcement learning algorithm, using a model of language-model scaling. We find that game states in training and inference data scale with Zipf’s law, which is known to arise from the tree structure of the environment, and examine the correlation between scaling-law and Zipf’s-law exponents. In agreement with the quanta scaling model, we find that agents optimize state loss in descending order of frequency, even though this order scales inversely with modelling complexity. We also find that inverse scaling, the failure of models to improve with size, is correlated with unusual Zipf curves where end-game states are among the most frequent states. We show evidence that larger models shift their focus to these less-important states, sacrificing their understanding of important early-game states.

[483] DFAMS: Dynamic-flow guided Federated Alignment based Multi-prototype Search

Zhibang Yang, Xinke Jiang, Rihong Qiu, Ruiqing Li, Yihang Zhang, Yue Fang, Yongxin Xu, Hongxin Ding, Xu Chu, Junfeng Zhao, Yasha Wang

Main category: cs.LG

TL;DR: DFAMS is a federated retrieval framework that uses Dynamic Information Flow in LLMs to identify query intents and create semantically aligned knowledge partitions for accurate cross-domain retrieval.

Details

Motivation: Existing federated retrieval methods struggle with ambiguous queries in cross-domain scenarios, limiting their effectiveness in supporting downstream generation tasks.

Method: Leverages Dynamic Information Flow in LLMs using gradient signals and Shapley value-based attribution to trace neuron activation paths for intent recognition and subdomain detection. Uses multi-prototype contrastive learning for fine-grained intra-source modeling and inter-source semantic alignment.

Result: Outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy across five benchmarks.

Conclusion: DFAMS demonstrates effectiveness in complex federated retrieval scenarios by enabling accurate retrieval across heterogeneous knowledge sources through dynamic information flow analysis.

Abstract: Federated Retrieval (FR) routes queries across multiple external knowledge sources, to mitigate hallucinations of LLMs, when necessary external knowledge is distributed. However, existing methods struggle to retrieve high-quality and relevant documents for ambiguous queries, especially in cross-domain scenarios, which significantly limits their effectiveness in supporting downstream generation tasks. Inspired by Dynamic Information Flow (DIF), we propose DFAMS, a novel framework that leverages DIF to identify latent query intents and construct semantically aligned knowledge partitions for accurate retrieval across heterogeneous sources. Specifically, DFAMS probes the DIF in LLMs by leveraging gradient signals from a few annotated queries and employing Shapley value-based attribution to trace neuron activation paths associated with intent recognition and subdomain boundary detection. Then, DFAMS leverages DIF to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment across knowledge bases. Experimental results across five benchmarks show that DFAMS outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy, demonstrating its effectiveness in complex FR scenarios. Our code are anonymous available at https://anonymous.4open.science/r/DFAMS/

[484] DMSC: Dynamic Multi-Scale Coordination Framework for Time Series Forecasting

Haonan Yang, Jianchao Tang, Zhuo Li, Long Lan

Main category: cs.LG

TL;DR: DMSC is a dynamic multi-scale coordination framework for time series forecasting that addresses static decomposition, fragmented dependency modeling, and inflexible fusion through three novel components: EMPD for dynamic patch decomposition, TIB for triad dependency modeling, and ASR-MoE for adaptive fusion.

Details

Motivation: Existing time series forecasting methods struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies across different scales.

Method: Proposes DMSC framework with three key components: EMPD for dynamic hierarchical patch decomposition with exponential granularities, TIB for joint modeling of intra-patch, inter-patch and cross-variable dependencies, and ASR-MoE for adaptive fusion using specialized experts with temporal-aware weighting.

Result: Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art performance and superior computational efficiency for time series forecasting tasks.

Conclusion: DMSC effectively addresses the three key challenges in time series forecasting through its dynamic multi-scale coordination framework, achieving superior performance and efficiency across multiple benchmarks.

Abstract: Time Series Forecasting (TSF) faces persistent challenges in modeling intricate temporal dependencies across different scales. Despite recent advances leveraging different decomposition operations and novel architectures based on CNN, MLP or Transformer, existing methods still struggle with static decomposition strategies, fragmented dependency modeling, and inflexible fusion mechanisms, limiting their ability to model intricate temporal dependencies. To explicitly solve the mentioned three problems respectively, we propose a novel Dynamic Multi-Scale Coordination Framework (DMSC) with Multi-Scale Patch Decomposition block (EMPD), Triad Interaction Block (TIB) and Adaptive Scale Routing MoE block (ASR-MoE). Specifically, EMPD is designed as a built-in component to dynamically segment sequences into hierarchical patches with exponentially scaled granularities, eliminating predefined scale constraints through input-adaptive patch adjustment. TIB then jointly models intra-patch, inter-patch, and cross-variable dependencies within each layer’s decomposed representations. EMPD and TIB are jointly integrated into layers forming a multi-layer progressive cascade architecture, where coarse-grained representations from earlier layers adaptively guide fine-grained feature extraction in subsequent layers via gated pathways. And ASR-MoE dynamically fuses multi-scale predictions by leveraging specialized global and local experts with temporal-aware weighting. Comprehensive experiments on thirteen real-world benchmarks demonstrate that DMSC consistently maintains state-of-the-art (SOTA) performance and superior computational efficiency for TSF tasks. Code is available at https://github.com/1327679995/DMSC.

[485] Dimension Reduction with Locally Adjusted Graphs

Yingfan Wang, Yiyang Sun, Haiyang Huang, Cynthia Rudin

Main category: cs.LG

TL;DR: LocalMAP is a new dimensionality reduction algorithm that dynamically adjusts graphs to better identify and separate real clusters in high-dimensional data, particularly useful for transcriptomic datasets.

Details

Motivation: Standard DR methods often produce suboptimal graphs due to unreliable high-dimensional distances and limited information extraction, especially with large datasets. This makes cluster identification difficult.

Method: LocalMAP dynamically extracts subgraphs and updates graphs on-the-fly to create more reliable representations, enabling better cluster separation by focusing on local data sections.

Result: The method successfully identifies and separates real clusters that other DR methods may overlook or combine, as demonstrated in biological dataset case studies.

Conclusion: LocalMAP provides more accurate cluster identification for real-world problems by addressing the limitations of traditional graph-based DR approaches through dynamic local graph adjustment.

Abstract: Dimension reduction (DR) algorithms have proven to be extremely useful for gaining insight into large-scale high-dimensional datasets, particularly finding clusters in transcriptomic data. The initial phase of these DR methods often involves converting the original high-dimensional data into a graph. In this graph, each edge represents the similarity or dissimilarity between pairs of data points. However, this graph is frequently suboptimal due to unreliable high-dimensional distances and the limited information extracted from the high-dimensional data. This problem is exacerbated as the dataset size increases. If we reduce the size of the dataset by selecting points for a specific sections of the embeddings, the clusters observed through DR are more separable since the extracted subgraphs are more reliable. In this paper, we introduce LocalMAP, a new dimensionality reduction algorithm that dynamically and locally adjusts the graph to address this challenge. By dynamically extracting subgraphs and updating the graph on-the-fly, LocalMAP is capable of identifying and separating real clusters within the data that other DR methods may overlook or combine. We demonstrate the benefits of LocalMAP through a case study on biological datasets, highlighting its utility in helping users more accurately identify clusters for real-world problems.

[486] Resource-Constrained Federated Continual Learning: What Does Matter?

Yichen Li, Yuying Wang, Jiahua Dong, Haozhao Wang, Yining Qi, Rui Zhang, Ruixuan Li

Main category: cs.LG

TL;DR: Existing Federated Continual Learning methods fail to achieve expected performance under resource-constrained settings, making them impractical for real-world deployment on edge devices.

Details

Motivation: Current FCL literature focuses on data privacy but ignores practical resource constraints like storage, computational budget, and label rate that edge devices face in real-world scenarios.

Method: Conducted large-scale benchmark with extensive experiments (1,000+ GPU hours) using various FCL techniques on six datasets in two incremental learning scenarios (Class-IL and Domain-IL), analyzing performance under different resource-constrained settings.

Result: All existing FCL approaches fail to achieve expected performance under limited resource constraints, with consistent findings in sensitivity analysis. Most methods are too resource-dependent for real-world deployment.

Conclusion: Current FCL methods are impractical for real-world edge device deployment due to resource dependency. The study provides insights for future FCL research directions considering resource constraints.

Abstract: Federated Continual Learning (FCL) aims to enable sequentially privacy-preserving model training on streams of incoming data that vary in edge devices by preserving previous knowledge while adapting to new data. Current FCL literature focuses on restricted data privacy and access to previously seen data while imposing no constraints on the training overhead. This is unreasonable for FCL applications in real-world scenarios, where edge devices are primarily constrained by resources such as storage, computational budget, and label rate. We revisit this problem with a large-scale benchmark and analyze the performance of state-of-the-art FCL approaches under different resource-constrained settings. Various typical FCL techniques and six datasets in two incremental learning scenarios (Class-IL and Domain-IL) are involved in our experiments. Through extensive experiments amounting to a total of over 1,000+ GPU hours, we find that, under limited resource-constrained settings, existing FCL approaches, with no exception, fail to achieve the expected performance. Our conclusions are consistent in the sensitivity analysis. This suggests that most existing FCL methods are particularly too resource-dependent for real-world deployment. Moreover, we study the performance of typical FCL techniques with resource constraints and shed light on future research directions in FCL.

[487] SCAN: Self-Denoising Monte Carlo Annotation for Robust Process Reward Learning

Yuyang Ding, Xinyu Shi, Juntao Li, Xiaobo Liang, Zhaopeng Tu, Min Zhang

Main category: cs.LG

TL;DR: SCAN is a self-denoising framework that enables efficient process reward model training using lightweight models to generate synthetic data, overcoming noise issues in Monte Carlo estimation while achieving superior performance with much lower computational cost.

Details

Motivation: Process reward models (PRMs) are effective for complex reasoning tasks but face challenges due to high costs of human-annotated data and noise issues in synthetic data from Monte Carlo estimation.

Method: Proposed Self-Denoising Monte Carlo Annotation (SCAN) framework with self-denoising strategy using lightweight models and robust learning strategy to handle noisy synthetic data.

Result: PRMs achieved 39.2 F1 score improvement (from 19.9 to 59.1) in ProcessBench, outperforming baselines including PRM800K, with only 6% of vanilla MC estimation inference cost.

Conclusion: SCAN enables scalable, cost-efficient, and robust PRM training, with performance improving as synthetic data scales up, demonstrating strong potential for practical applications.

Abstract: Process reward models (PRMs) offer fine-grained, step-level evaluations that facilitate deeper reasoning processes in large language models (LLMs), proving effective in complex tasks like mathematical reasoning. However, developing PRMs is challenging due to the high cost and limited scalability of human-annotated data. Synthetic data from Monte Carlo (MC) estimation is a promising alternative but suffers from a high noise ratio, which can cause overfitting and hinder large-scale training. In this work, we conduct a preliminary study on the noise distribution in synthetic data from MC estimation, identifying that annotation models tend to both underestimate and overestimate step correctness due to limitations in their annotation capabilities. Building on these insights, we propose Self-Denoising Monte Carlo Annotation (SCAN), an efficient data synthesis and noise-tolerant learning framework. Our key findings indicate that: (1) Even lightweight models (e.g., 1.5B parameters) can produce high-quality annotations through a self-denoising strategy, enabling PRMs to achieve superior performance with only 6% the inference cost required by vanilla MC estimation. (2) With our robust learning strategy, PRMs can effectively learn from this weak supervision, achieving a 39.2 F1 score improvement (from 19.9 to 59.1) in ProcessBench. Despite using only a compact synthetic dataset, our models surpass strong baselines, including those trained on large-scale human-annotated datasets such as PRM800K. Furthermore, performance continues to improve as we scale up the synthetic data, highlighting the potential of SCAN for scalable, cost-efficient, and robust PRM training.

[488] Evaluating multiple models using labeled and unlabeled data

Divya Shanmugam, Shuvom Sadhuka, Manish Raghavan, John Guttag, Bonnie Berger, Emma Pierson

Main category: cs.LG

TL;DR: SSME is a semi-supervised method that uses both labeled and unlabeled data to evaluate ML classifiers, leveraging multiple classifiers, continuous scores, and abundant unlabeled data to estimate performance metrics more accurately than existing methods.

Details

Motivation: Large labeled datasets are often expensive or impossible to obtain, while unlabeled data is plentiful. Current evaluation methods don't fully utilize available unlabeled data and multiple classifier outputs.

Method: Uses semi-supervised mixture model to estimate joint distribution of ground truth labels and classifier predictions, enabling estimation of any metric based on classifier scores and labels.

Result: SSME reduces error by 5.1x compared to using labeled data alone and 2.4x compared to the next best method. Works well across healthcare, content moderation, molecular prediction, and image annotation domains.

Conclusion: SSME provides more accurate classifier evaluation by effectively leveraging both labeled and unlabeled data, especially valuable in domains where labeled data is scarce.

Abstract: It remains difficult to evaluate machine learning classifiers in the absence of a large, labeled dataset. While labeled data can be prohibitively expensive or impossible to obtain, unlabeled data is plentiful. Here, we introduce Semi-Supervised Model Evaluation (SSME), a method that uses both labeled and unlabeled data to evaluate machine learning classifiers. SSME is the first evaluation method to take advantage of the fact that: (i) there are frequently multiple classifiers for the same task, (ii) continuous classifier scores are often available for all classes, and (iii) unlabeled data is often far more plentiful than labeled data. The key idea is to use a semi-supervised mixture model to estimate the joint distribution of ground truth labels and classifier predictions. We can then use this model to estimate any metric that is a function of classifier scores and ground truth labels (e.g., accuracy or expected calibration error). We present experiments in four domains where obtaining large labeled datasets is often impractical: (1) healthcare, (2) content moderation, (3) molecular property prediction, and (4) image annotation. Our results demonstrate that SSME estimates performance more accurately than do competing methods, reducing error by 5.1x relative to using labeled data alone and 2.4x relative to the next best competing method. SSME also improves accuracy when evaluating performance across subsets of the test distribution (e.g., specific demographic subgroups) and when evaluating the performance of language models.

[489] General Exploratory Bonus for Optimistic Exploration in RLHF

Wendi Li, Changdae Oh, Sharon Li

Main category: cs.LG

TL;DR: The paper introduces General Exploratory Bonus (GEB), a novel framework that addresses the failure of existing exploratory bonus methods to achieve optimistic exploration in RLHF by counteracting divergence-induced bias.

Details

Motivation: Current exploratory bonus methods using KL or α-divergence regularization unintentionally bias exploration toward high-probability regions of the reference model, reinforcing conservative behavior instead of promoting discovery of uncertain regions.

Method: Proposed General Exploratory Bonus (GEB) framework that counteracts divergence-induced bias via reference-dependent reward regulation, unifying prior heuristic bonuses as special cases and extending across the full α-divergence family.

Result: GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones.

Conclusion: GEB offers both a principled and practical solution for optimistic exploration in RLHF, addressing the fundamental pitfall in current divergence-based exploration methods.

Abstract: Optimistic exploration is central to improving sample efficiency in reinforcement learning with human feedback, yet existing exploratory bonus methods to incentivize exploration often fail to realize optimism. We provide a theoretical analysis showing that current formulations, under KL or $\alpha$-divergence regularization, unintentionally bias exploration toward high-probability regions of the reference model, thereby reinforcing conservative behavior instead of promoting discovery of uncertain regions. To address this pitfall, we introduce the General Exploratory Bonus (GEB), a novel theoretical framework that provably satisfies the optimism principle. GEB counteracts divergence-induced bias via reference-dependent reward regulation and unifies prior heuristic bonuses as special cases, while extending naturally across the full $\alpha$-divergence family. Empirically, GEB consistently outperforms baselines on alignment tasks across multiple divergence settings and large language model backbones. These results demonstrate that GEB offers both a principled and practical solution for optimistic exploration in RLHF.

[490] Hard Examples Are All You Need: Maximizing GRPO Post-Training Under Annotation Budgets

Benjamin Pikus, Pratyush Ranjan Tiwari, Burton Ye

Main category: cs.LG

TL;DR: Training on the hardest examples (where base model fails) yields up to 47% performance gains in GRPO fine-tuning, while easy examples provide minimal improvements (3-15%) due to lack of outcome variance needed for learning signals.

Details

Motivation: Collecting high-quality training data for language model fine-tuning is expensive, with practical budgets limiting data procurement, so understanding which examples provide the most learning value is crucial.

Method: Compared different example selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks in GRPO training, focusing on the hardest 10% of examples where base models fail most often.

Result: Hard examples produced dramatic performance gains up to 47%, while easy examples showed minimal improvements of 3-15%. Models trained on hard examples also demonstrated superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on AIME2025 benchmark.

Conclusion: When budget-constrained, prioritize collecting and annotating examples where the base model struggles, as these drive nearly all learning value in GRPO fine-tuning due to maintaining outcome variance that generates learning signals.

Abstract: Collecting high-quality training examples for language model fine-tuning is expensive, with practical budgets limiting the amount of data that can be procured. We investigate whether example difficulty affects GRPO training effectiveness by comparing selection strategies (easy, medium, hard, random) across multiple models and reasoning tasks. Training on the hardest 10% of examples (those where the base model fails most often) yields dramatic performance gains up to 47%, while easy examples produce minimal improvements of 3-15%. This occurs because GRPO requires outcome variance to generate learning signals; hard examples maintain mixed success/failure outcomes throughout training while easy examples quickly converge to consistent success, eliminating learning opportunities. Moreover, models trained on hard examples show superior out-of-distribution generalization, with only hard-trained models achieving meaningful gains on the AIME2025 benchmark. Our findings provide clear guidance: when budget-constrained, prioritize collecting and annotating examples where your base model struggles, as these drive nearly all learning value in GRPO fine-tuning

[491] RIGNO: A Graph-based framework for robust and accurate operator learning for PDEs on arbitrary domains

Sepehr Mousavi, Shizheng Wen, Levi Lingsch, Maximilian Herde, Bogdan Raonić, Siddhartha Mishra

Main category: cs.LG

TL;DR: RIGNO is a graph neural network-based neural operator that learns PDE solution operators from point cloud data on arbitrary domains using a multi-scale approach with regional mesh downsampling.

Details

Motivation: Learning PDE solution operators is challenging due to diverse domain shapes and complex physics. Existing methods struggle with arbitrary domains and maintaining resolution invariance.

Method: End-to-end GNN neural operator using multi-scale mapping through downsampled regional meshes, with novel elements for spatio-temporal resolution invariance.

Result: RIGNO significantly outperforms neural operator baselines on challenging benchmarks with various time-dependent and steady PDEs, and robustly generalizes to unseen spatio-temporal resolutions.

Conclusion: The proposed RIGNO model provides an accurate and robust solution for learning PDE operators on arbitrary domains while maintaining resolution invariance.

Abstract: Learning the solution operators of PDEs on arbitrary domains is challenging due to the diversity of possible domain shapes, in addition to the often intricate underlying physics. We propose an end-to-end graph neural network (GNN) based neural operator to learn PDE solution operators from data on point clouds in arbitrary domains. Our multi-scale model maps data between input/output point clouds by passing it through a downsampled regional mesh. The approach includes novel elements aimed at ensuring spatio-temporal resolution invariance. Our model, termed RIGNO, is tested on a challenging suite of benchmarks composed of various time-dependent and steady PDEs defined on a diverse set of domains. We demonstrate that RIGNO is significantly more accurate than neural operator baselines and robustly generalizes to unseen resolutions both in space and in time. Our code is publicly available at github.com/camlab-ethz/rigno.

[492] Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

Main category: cs.LG

TL;DR: BGPO is a memory-efficient RL algorithm for diffusion LLMs that uses a specially constructed lower bound to enable large Monte Carlo sample sizes, improving likelihood approximation and RL performance.

Details

Motivation: Existing RL methods for diffusion LLMs have high memory overhead due to needing to retain forward computational graphs for all Monte Carlo samples during gradient computation, limiting sample sizes and causing imprecise likelihood approximations.

Method: Proposes Boundary-Guided Policy Optimization (BGPO) which maximizes a specially constructed lower bound of the ELBO-based objective that has linearity (enables gradient accumulation) and equivalence (matches value and gradient of original objective) properties.

Result: BGPO significantly outperforms previous RL algorithms for diffusion LLMs in math problem solving, code generation, and planning tasks.

Conclusion: BGPO provides an effective memory-efficient solution for RL training of diffusion LLMs by enabling large sample sizes through its specially designed lower bound, leading to improved performance across multiple domains.

Abstract: A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose \emph{Boundary-Guided Policy Optimization} (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks. Our codes and models are available at \href{https://github.com/THU-KEG/BGPO}{https://github.com/THU-KEG/BGPO}.

[493] Attention as an Adaptive Filter

Peter Racioppo

Main category: cs.LG

TL;DR: AFA is a novel attention mechanism that models input sequences as observations of a linear SDE, using dynamics models to compute attention weights through uncertainty propagation and maximum likelihood filtering.

Details

Motivation: To improve attention mechanisms by incorporating learnable dynamics models that can better capture temporal dependencies and uncertainties in sequences, moving beyond direct query-key comparisons.

Method: Models sequences as discrete observations of linear SDEs with simultaneously-diagonalizable state matrices, uses closed-form solutions of differential Lyapunov equations for uncertainty propagation, and derives attention weights as maximum likelihood filtering solutions.

Result: Developed a simplified variant with same complexity as standard attention, and showed that in certain limits it recovers complex-valued generalization of dot-product attention with rotary positional encodings.

Conclusion: AFA provides a principled framework for attention that incorporates dynamics modeling, offering theoretical connections to existing attention mechanisms while enabling more robust uncertainty-aware sequence processing.

Abstract: We introduce Adaptive Filter Attention (AFA), a novel attention mechanism that incorporates a learnable dynamics model directly into the computation of attention weights. Rather than comparing queries and keys directly, we model the input sequence as discrete observations of a linear stochastic differential equation (SDE). By assuming a continuous-time linear time-invariant system with simultaneously-diagonalizable state matrices and noise covariances, we can make use of a closed-form solution of the differential Lyapunov equation to efficiently propagate uncertainties through the dynamics from keys to queries. A generalization of attention naturally arises as the maximum likelihood solution for filtering the trajectory of this linear SDE, with attention weights corresponding to robust residual-based reweightings of the propagated query-key precisions. We further constrain the system dynamics and noise in order to obtain a simplified variant with the same computational and memory complexity as standard attention. In the limit of zero decay and process noise, and using a small-angle approximation, we recover a complex-valued generalization of ordinary dot-product attention with rotary positional encodings.

[494] How Vulnerable Is My Learned Policy? Universal Adversarial Perturbation Attacks On Modern Behavior Cloning Policies

Akansha Kalra, Basavasagar Patil, Guanhong Tao, Daniel S. Brown

Main category: cs.LG

TL;DR: This paper presents the first systematic study of adversarial attacks on Learning from Demonstration algorithms, revealing high vulnerability to universal perturbations that are transferable across algorithms and tasks.

Details

Motivation: To investigate the underexplored vulnerability of Learning from Demonstration algorithms to offline universal perturbation attacks, as current research has not systematically examined this security risk.

Method: Comprehensive study of adversarial attacks on multiple LfD algorithms including Behavior Cloning, LSTM-GMM, Implicit Behavior Cloning, Diffusion Policy, and Vector-Quantized Behavior Transformer, using both white-box and black-box attack scenarios on simulated robotic manipulation tasks.

Result: Most current LfD methods are highly vulnerable to adversarial perturbations, and these attacks are often transferable across different algorithms, architectures, and tasks, revealing significant security vulnerabilities.

Conclusion: The study highlights critical vulnerabilities in modern behavior cloning algorithms and calls for future work to address these security limitations in Learning from Demonstration systems.

Abstract: Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to offline universal perturbation attacks remains underexplored. This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms, including Behavior Cloning (BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and Vector-Quantizied Behavior Transformer (VQ-BET). We study the vulnerability of these methods to universal adversarial perturbations. Our experiments on several simulated robotic manipulation tasks reveal that most of the current methods are highly vulnerable to adversarial perturbations. We also show that these attacks are often transferable across algorithms, architectures, and tasks, raising concerning security vulnerabilities to black-box attacks. To the best of our knowledge, we are the first to present a systematic study of the vulnerabilities of different LfD algorithms to both white-box and black-box attacks. Our findings highlight the vulnerabilities of modern BC algorithms, paving the way for future work in addressing such limitations.

[495] General Demographic Foundation Models for Enhancing Predictive Performance Across Diseases and Populations

Li-Chin Chen, Ji-Tian Sheu, Yuh-Jue Chuang

Main category: cs.LG

TL;DR: A General Demographic Pre-trained (GDP) model was developed as a foundational model for demographic attributes (age and gender) in healthcare, showing improved generalization across tasks, diseases, and populations.

Details

Motivation: Demographic attributes are universally present in EHRs and serve as vital predictors in clinical risk stratification, but are often treated as auxiliaries with limited attention to learning their representations.

Method: Developed a GDP model with exploration of architecture composition through combinations of ordering approaches and encoding methods to transform tabular demographic inputs into effective latent embeddings. Pre-trained and evaluated using diverse datasets from different geographic regions.

Result: GDP demonstrated feasibility to generalize across tasks, diseases, and populations. Sequential ordering substantially improved model performance in discrimination, calibration, and information gain. GDP enhanced representational importance even in datasets where demographic attributes had low predictive value.

Conclusion: Foundation models for tabular demographic attributes offer a promising direction for improving predictive performance in healthcare applications.

Abstract: Demographic attributes are universally present in electronic health records. They are the most widespread information across populations and diseases, and serve as vital predictors in clinical risk stratification and treatment decisions. Despite their significance, these attributes are often treated as auxiliaries in model design, with limited attention being paid to learning their representations. This study explored the development of a General Demographic Pre-trained (GDP) model as a foundational model tailored to demographic attributes, focusing on age and gender. The model is pre-trained and evaluated using datasets with diverse diseases and populations compositions from different geographic regions. The composition of GDP architecture was explored through examining combinations of ordering approaches and encoding methods to transform tabular demographic inputs into effective latent embeddings. Results demonstrate the feasibility of GDP to generalize across task, diseases, and populations. In detailed composition, the sequential ordering substantially improves model performance in discrimination, calibration, and the corresponding information gain at each decision tree split, particularly in diseases where age and gender contribute significantly to risk stratification. Even in datasets where demographic attributes hold relatively low predictive value, GDP enhances the representational importance, increasing their influence in downstream gradient boosting models. The findings suggest that foundation models for tabular demographic attributes offer a promising direction for improving predictive performance in healthcare applications.

[496] Mirror Descent Actor Critic via Bounded Advantage Learning

Ryo Iwaki

Main category: cs.LG

TL;DR: MDAC is an actor-critic implementation of Mirror Descent Value Iteration for continuous action domains that boosts performance by bounding actor log-density terms in the critic’s loss function.

Details

Motivation: KL-entropy-regularized methods like MDVI work well in discrete domains but underperform entropy-only methods in continuous action domains, motivating the need for better continuous domain implementations.

Method: Proposed Mirror Descent Actor Critic (MDAC) with bounded actor log-density terms in critic loss, relating actor log-probability to regularized advantage functions and exploring effective bounding functions.

Result: MDAC with bounded terms significantly outperforms naive unbounded implementation and surpasses strong non-regularized and entropy-only-regularized methods with appropriate bounding functions.

Conclusion: Bounding advantage terms in MDAC is validated and beneficial for continuous action domains, with proper bounding function choices enabling superior performance over existing regularization approaches.

Abstract: Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entroy-regularized methods do not surpass a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor’s log-density terms in the critic’s loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor’s log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is validated and beneficial. We also empirically explore effective choices for the bounding functions, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding functions.

[497] Provably Efficient RL under Episode-Wise Safety in Constrained MDPs with Linear Function Approximation

Toshinori Kitamura, Arnob Ghosh, Tadashi Kozuno, Wataru Kumagai, Kazumi Kasaura, Kenta Hoshino, Yohei Hosoe, Yutaka Matsuo

Main category: cs.LG

TL;DR: Proposes an RL algorithm for linear constrained MDPs that achieves O(√K) regret with zero constraint violations and polynomial computational complexity.

Details

Motivation: While constrained RL is well-understood in tabular settings, theoretical results for function approximation remain scarce, especially for linear CMDPs where existing methods either violate constraints or have exponential computational costs.

Method: Develops a reinforcement learning algorithm specifically designed for linear constrained Markov decision processes (CMDPs) that maintains constraint satisfaction while being computationally efficient.

Result: Achieves O(√K) regret with episode-wise zero constraint violations, and scales polynomially with problem-dependent parameters while remaining independent of state space size.

Conclusion: The proposed method significantly improves upon recent linear CMDP algorithms by eliminating constraint violations and avoiding exponential computational costs, providing a practical solution for constrained RL with function approximation.

Abstract: We study the reinforcement learning (RL) problem in a constrained Markov decision process (CMDP), where an agent explores the environment to maximize the expected cumulative reward while satisfying a single constraint on the expected total utility value in every episode. While this problem is well understood in the tabular setting, theoretical results for function approximation remain scarce. This paper closes the gap by proposing an RL algorithm for linear CMDPs that achieves $\tilde{\mathcal{O}}(\sqrt{K})$ regret with an episode-wise zero-violation guarantee. Furthermore, our method is computationally efficient, scaling polynomially with problem-dependent parameters while remaining independent of the state space size. Our results significantly improve upon recent linear CMDP algorithms, which either violate the constraint or incur exponential computational costs.

[498] Wasserstein-based Kernel Principal Component Analysis for Clustering Applications

Alfredo Oneto, Blazhe Gjorgiev, Giovanni Sansavini

Main category: cs.LG

TL;DR: A framework combining Wasserstein distances with kernel methods for clustering distributional data, with efficient approximation, kernel PCA, and scalable evaluation.

Details

Motivation: Many clustering applications involve non-vector objects represented as discrete distributions, but existing methods lack an unsupervised framework combining Wasserstein distances with kernel methods for clustering such distributional data.

Method: Three-component framework: (1) efficient approximation of pairwise Wasserstein distances using multiple reference distributions, (2) shifted positive definite kernel functions based on Wasserstein distances with kernel PCA for feature mapping, (3) scalable distance-agnostic validity indices for clustering evaluation and kernel parameter optimization.

Result: Experiments on power distribution graphs and real-world time series demonstrate the framework’s effectiveness and efficiency.

Conclusion: The proposed framework successfully integrates Wasserstein metrics with kernel methods for clustering distributional data, providing a computationally tractable solution applicable to both vectorial and distributional data across various domains.

Abstract: Many data clustering applications must handle objects that cannot be represented as vectors. In this context, the bag-of-vectors representation describes complex objects through discrete distributions, for which the Wasserstein distance provides a well-conditioned dissimilarity measure. Kernel methods extend this by embedding distance information into feature spaces that facilitate analysis. However, an unsupervised framework that combines kernels with Wasserstein distances for clustering distributional data is still lacking. We address this gap by introducing a computationally tractable framework that integrates Wasserstein metrics with kernel methods for clustering. The framework can accommodate both vectorial and distributional data, enabling applications in various domains. It comprises three components: (i) an efficient approximation of pairwise Wasserstein distances using multiple reference distributions; (ii) shifted positive definite kernel functions based on Wasserstein distances, combined with kernel principal component analysis for feature mapping; and (iii) scalable, distance-agnostic validity indices for clustering evaluation and kernel parameter optimization. Experiments on power distribution graphs and real-world time series demonstrate the effectiveness and efficiency of the proposed framework.

[499] Understanding Language Prior of LVLMs by Contrasting Chain-of-Embedding

Lin Long, Changdae Oh, Seongheon Park, Sharon Li

Main category: cs.LG

TL;DR: This paper analyzes language prior in large vision-language models by examining layer-wise representation dynamics, identifying a Visual Integration Point where vision begins influencing decoding, and introducing a TVI estimator to quantify visual influence.

Details

Motivation: LVLMs often default to language priors from pre-training while under-utilizing visual evidence, but existing analysis methods fail to reveal the internal mechanisms of how vision influences model behavior.

Method: Systematic analysis through chain-of-embedding, examining layer-wise representation dynamics to identify Visual Integration Points and introducing Total Visual Integration estimator to quantify visual influence.

Result: Across 54 model-dataset combinations spanning 9 LVLMs and 6 benchmarks, VIP consistently emerges and TVI reliably predicts language prior strength.

Conclusion: Provides a principled toolkit for diagnosing and understanding language prior in LVLMs through internal representation analysis.

Abstract: Large vision-language models (LVLMs) achieve strong performance on multimodal tasks, yet they often default to their language prior (LP) – memorized textual patterns from pre-training while under-utilizing visual evidence. Prior analyses of LP mostly rely on input-output probing, which fails to reveal the internal mechanisms governing when and how vision influences model behavior. To address this gap, we present the first systematic analysis of language prior through the lens of chain-of-embedding, which examines the layer-wise representation dynamics within LVLMs. Our analysis reveals a universal phenomenon: each model exhibits a Visual Integration Point (VIP), a critical layer at which visual information begins to meaningfully reshape hidden representations and influence decoding. Building on this observation, we introduce the Total Visual Integration (TVI) estimator, which aggregates representation distance beyond the VIP to quantify how strongly visual query influences response generation. Across 54 model-dataset combinations spanning 9 contemporary LVLMs and 6 benchmarks, we demonstrate that VIP consistently emerges, and that TVI reliably predicts the strength of language prior. This offers a principled toolkit for diagnosing and understanding language prior in LVLMs.

[500] MooseAgent: A LLM Based Multi-agent Framework for Automating Moose Simulation

Tao Zhang, Zhenhai Liu, Yong Xin, Yongjun Jiao

Main category: cs.LG

TL;DR: MooseAgent is an automated framework that uses LLMs and multi-agent systems to generate MOOSE input files from natural language descriptions, reducing the complexity of finite element simulations.

Details

Motivation: To address the time-consuming and specialized knowledge requirements of FEM pre-processing, solver configuration, and post-processing in MOOSE simulations.

Method: Combines large-scale pre-trained language models with a multi-agent system, using task decomposition and multi-round iterative verification strategies, along with a vector database of annotated MOOSE input cards and documentation to reduce hallucinations.

Result: Successfully automates MOOSE simulation process, achieving high success rates for simple single-physics problems in heat transfer, mechanics, phase field, and multi-physics coupling cases.

Conclusion: MooseAgent demonstrates potential for simplifying finite element simulations and lowering user barriers, providing new directions for intelligent FEM software development. The framework has been open-sourced.

Abstract: The Finite Element Method (FEM) is widely used in engineering and scientific computing, but its pre-processing, solver configuration, and post-processing stages are often time-consuming and require specialized knowledge. This paper proposes an automated solution framework, MooseAgent, for the multi-physics simulation framework MOOSE, which combines large-scale pre-trained language models (LLMs) with a multi-agent system. The framework uses LLMs to understand user-described simulation requirements in natural language and employs task decomposition and multi-round iterative verification strategies to automatically generate MOOSE input files. To improve accuracy and reduce model hallucinations, the system builds and utilizes a vector database containing annotated MOOSE input cards and function documentation. We conducted experimental evaluations on several typical cases, including heat transfer, mechanics, phase field, and multi-physics coupling. The results show that MooseAgent can automate the MOOSE simulation process to a certain extent, especially demonstrating a high success rate when dealing with relatively simple single-physics problems. The main contribution of this research is the proposal of a multi-agent automated framework for MOOSE, which validates its potential in simplifying finite element simulation processes and lowering the user barrier, providing new ideas for the development of intelligent finite element simulation software. The code for the MooseAgent framework proposed in this paper has been open-sourced and is available at https://github.com/taozhan18/MooseAgent

[501] Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Jiashun Liu, Johan Obando-Ceron, Han Lu, Yancheng He, Weixun Wang, Wenbo Su, Bo Zheng, Pablo Samuel Castro, Aaron Courville, Ling Pan

Main category: cs.LG

TL;DR: AsyPPO introduces lightweight mini-critics trained on disjoint prompt shards to restore critics in RL for LLMs, improving learning stability and performance over baselines like GRPO and PPO.

Details

Motivation: Conventional value functions are computationally expensive at LLM scale and fail under sparse rewards and long reasoning horizons, leading recent RL4LLM methods to avoid explicit critics.

Method: Asymmetric Proximal Policy Optimization (AsyPPO) uses lightweight mini-critics on disjoint prompt shards, leverages inter-critic uncertainty to mask advantages in agreed states and filter high-divergence states from entropy regularization.

Result: AsyPPO consistently improves learning stability and performance across benchmarks, achieving gains of over 6% on Qwen3-4b-Base and about 3% on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO with only 5,000 training samples.

Conclusion: Architectural innovations like AsyPPO are crucial for scalable, efficient RL algorithms for LLMs, demonstrating the importance of restoring critics with proper design.

Abstract: Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critics role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.

[502] Generative Deep Learning Framework for Inverse Design of Fuels

Kiran K. Yalamanchi, Pinaki Pal, Balaji Mohan, Abdullah S. AlRamadan, Jihad A. Badra, Yuanjiang Pei

Main category: cs.LG

TL;DR: A generative deep learning framework combining Co-optimized Variational Autoencoder (Co-VAE) with QSPR techniques for accelerated inverse design of high-RON fuels.

Details

Motivation: To address limitations of traditional fuel screening approaches and enable systematic exploration of large chemical spaces for fuel design by capturing complex structure-property relationships.

Method: Co-VAE architecture integrated with property prediction, trained on GDB-13 database enriched with RON data, hyperparameter tuning for balance between reconstruction fidelity and chemical validity, independent regression for RON refinement, and differential evolution for latent space navigation.

Result: Developed a framework that can efficiently identify promising fuel molecule candidates with high Research Octane Number through latent space exploration.

Conclusion: The methodology enables accelerated inverse fuel design, can be adapted to different target properties, and can be extended with synthesizability criteria for improved de novo fuel design applications.

Abstract: In the present work, a generative deep learning framework combining a Co-optimized Variational Autoencoder (Co-VAE) architecture with quantitative structure-property relationship (QSPR) techniques is developed to enable accelerated inverse design of fuels. The Co-VAE integrates a property prediction component coupled with the VAE latent space, enhancing molecular reconstruction and accurate estimation of Research Octane Number (RON) (chosen as the fuel property of interest). A subset of the GDB-13 database, enriched with a curated RON database, is used for model training. Hyperparameter tuning is further utilized to optimize the balance among reconstruction fidelity, chemical validity, and RON prediction. An independent regression model is then used to refine RON prediction, while a differential evolution algorithm is employed to efficiently navigate the VAE latent space and identify promising fuel molecule candidates with high RON. This methodology addresses the limitations of traditional fuel screening approaches by capturing complex structure-property relationships within a comprehensive latent representation. The generative model can be adapted to different target properties, enabling systematic exploration of large chemical spaces relevant to fuel design applications. Furthermore, the demonstrated framework can be readily extended by incorporating additional synthesizability criteria to improve applicability and reliability for de novo design of new fuels.

[503] Newton-Puiseux Analysis for Interpretability and Calibration of Complex-Valued Neural Networks

Piotr Migus

Main category: cs.LG

TL;DR: A Newton-Puiseux framework for analyzing local decision geometry of complex-valued neural networks (CVNNs), providing phase-aligned directions for class flips and multiplicity-guided temperature adjustment to improve calibration.

Details

Motivation: CVNNs are suitable for phase-sensitive signals but their interpretability and probability calibration remain insufficiently investigated.

Method: Fitting kink-aware polynomial surrogates to logit differences near uncertain inputs, then factorizing using Newton-Puiseux expansions to derive analytic branch descriptors (exponents, multiplicities, orientations).

Result: Phase-aware analysis identifies sensitive directions and enhances Expected Calibration Error in ECG and wireless modulation datasets compared to uncalibrated softmax and standard baselines.

Conclusion: The method requires no architectural modifications and applies to any CVNN with complex logits transformed to real moduli, providing improved interpretability and calibration.

Abstract: Complex-valued neural networks (CVNNs) are particularly suitable for handling phase-sensitive signals, including electrocardiography (ECG), radar/sonar, and wireless in-phase/quadrature (I/Q) streams. Nevertheless, their \emph{interpretability} and \emph{probability calibration} remain insufficiently investigated. In this work, we present a Newton–Puiseux framework that examines the \emph{local decision geometry} of a trained CVNN by (i) fitting a small, kink-aware polynomial surrogate to the \emph{logit difference} in the vicinity of uncertain inputs, and (ii) factorizing this surrogate using Newton–Puiseux expansions to derive analytic branch descriptors, including exponents, multiplicities, and orientations. These descriptors provide phase-aligned directions that induce class flips in the original network and allow for a straightforward, \emph{multiplicity-guided} temperature adjustment for improved calibration. We outline assumptions and diagnostic measures under which the surrogate proves informative and characterize potential failure modes arising from piecewise-holomorphic activations (e.g., modReLU). Our phase-aware analysis identifies sensitive directions and enhances Expected Calibration Error in two case studies beyond a controlled $\C^2$ synthetic benchmark – namely, the MIT–BIH arrhythmia (ECG) dataset and RadioML 2016.10a (wireless modulation) – when compared to uncalibrated softmax and standard post-hoc baselines. We also present confidence intervals, non-parametric tests, and quantify sensitivity to inaccuracies in estimating branch multiplicity. Crucially, this method requires no modifications to the architecture and applies to any CVNN with complex logits transformed to real moduli.

[504] Variational Rank Reduction Autoencoders

Jad Mounayer, Alicia Tierz, Jerome Tomezyk, Chady Ghnatios, Francisco Chinesta

Main category: cs.LG

TL;DR: VRRAEs combine deterministic Rank Reduction Autoencoders (RRAEs) with Variational Autoencoders (VAEs) to create a more powerful generative model that outperforms both RRAEs and VAEs on generation tasks while reducing posterior collapse.

Details

Motivation: To leverage the regularization benefits of RRAEs' truncated SVD with the generative capabilities of VAEs' probabilistic latent space, creating a hybrid model that overcomes limitations of both approaches.

Method: Combine RRAEs and VAEs by carefully sampling the latent space of RRAEs and adding KL divergence regularization, creating Variational Rank Reduction Autoencoders (VRRAEs).

Result: VRRAEs outperform both RRAEs and VAEs on random generation and interpolation tasks across MNIST, CelebA, and CIFAR-10 datasets based on FID scores, and show robustness against posterior collapse on synthetic data.

Conclusion: VRRAEs successfully combine the strengths of RRAEs and VAEs, providing better generative performance and reduced posterior collapse through SVD-induced regularization.

Abstract: Deterministic Rank Reduction Autoencoders (RRAEs) enforce by construction a regularization on the latent space by applying a truncated SVD. While this regularization makes Autoencoders more powerful, using them for generative purposes is counter-intuitive due to their deterministic nature. On the other hand, Variational Autoencoders (VAEs) are well known for their generative abilities by learning a probabilistic latent space. In this paper, we present Variational Rank Reduction Autoencoders (VRRAEs), a model that leverages the advantages of both RRAEs and VAEs. Our claims and results show that when carefully sampling the latent space of RRAEs and further regularizing with the Kullback-Leibler (KL) divergence (similarly to VAEs), VRRAEs outperform RRAEs and VAEs. Additionally, we show that the regularization induced by the SVD not only makes VRRAEs better generators than VAEs, but also reduces the possibility of posterior collapse. Our results include a synthetic dataset of a small size that showcases the robustness of VRRAEs against collapse, and three real-world datasets; the MNIST, CelebA, and CIFAR-10, over which VRRAEs are shown to outperform both VAEs and RRAEs on many random generation and interpolation tasks based on the FID score. We developed an open-source implementation of VRRAEs in JAX (Equinox), available at https://github.com/JadM133/RRAEs.git.

[505] MergeBench: A Benchmark for Merging Domain-Specialized LLMs

Yifei He, Siqi Zeng, Yuzheng Hu, Rui Yang, Tong Zhang, Han Zhao

Main category: cs.LG

TL;DR: MergeBench is a comprehensive evaluation suite for model merging at scale, assessing 8 merging methods across 5 domains using Llama and Gemma models (2B-9B), providing practical guidelines and insights.

Details

Motivation: Existing model merging evaluations are limited in model scale and task diversity, leaving questions about applicability to large, domain-specialized LLMs.

Method: Built MergeBench evaluation suite using state-of-the-art open-source LLMs (Llama, Gemma families at 2B-9B scales) covering 5 domains: instruction following, mathematics, multilingual understanding, coding, and safety. Standardized finetuning and evaluation protocols to assess 8 representative merging methods.

Result: Model merging performs better on stronger base models. Techniques like merging coefficient tuning and sparsification improve knowledge retention. However, challenges remain including computational cost on large models, performance gap compared to multi-task models, and underexplored role in standard LLM training pipelines.

Conclusion: MergeBench provides a foundation for future research to advance understanding and practical application of model merging, with identified challenges guiding future improvements.

Abstract: Model merging provides a scalable alternative to multi-task training by combining specialized finetuned models through parameter arithmetic, enabling efficient deployment without the need for joint training or access to all task data. While recent methods have shown promise, existing evaluations are limited in both model scale and task diversity, leaving open questions about their applicability to large, domain-specialized LLMs. To tackle the challenges, we introduce MergeBench, a comprehensive evaluation suite designed to assess model merging at scale. MergeBench builds on state-of-the-art open-source language models, including Llama and Gemma families at 2B to 9B scales, and covers five key domains: instruction following, mathematics, multilingual understanding, coding and safety. We standardize finetuning and evaluation protocols, and assess eight representative merging methods across multi-task performance, forgetting and runtime efficiency. Based on extensive experiments, we provide practical guidelines for algorithm selection and share insights showing that model merging tends to perform better on stronger base models, with techniques such as merging coefficient tuning and sparsification improving knowledge retention. However, several challenges remain, including the computational cost on large models, the gap for in-domain performance compared to multi-task models, and the underexplored role of model merging in standard LLM training pipelines. We hope MergeBench provides a foundation for future research to advance the understanding and practical application of model merging. Our project page is at \href{https://yifei-he.github.io/mergebench/}{https://yifei-he.github.io/mergebench/}.

[506] Panda: A pretrained forecast model for chaotic dynamics

Jeffrey Lai, Anthony Bao, William Gilpin

Main category: cs.LG

TL;DR: Panda is a patched attention model trained on synthetic chaotic systems that achieves zero-shot forecasting of unseen chaotic systems and spontaneously predicts partial differential equations despite being trained only on ordinary differential equations.

Details

Motivation: Chaotic systems are sensitive to small errors, making predictive modeling challenging. Existing approaches either use specialized models for individual time series or foundation models trained on databases with little dynamical structure.

Method: Train Panda (Patched Attention for Nonlinear Dynamics) on a novel synthetic dataset of 20,000 chaotic dynamical systems discovered using an evolutionary algorithm. The model uses patched attention architecture.

Result: Panda exhibits emergent zero-shot forecasting of unseen chaotic systems with both short-term accuracy and long-term statistics preservation. It spontaneously predicts partial differential equations despite being trained only on ordinary differential equations. Shows neural scaling law for differential equations.

Conclusion: Pretrained models like Panda have potential for probing abstract mathematical domains like nonlinear dynamics, demonstrating emergent capabilities beyond their training data.

Abstract: Chaotic systems are intrinsically sensitive to small errors, challenging efforts to construct predictive data-driven models of real-world dynamical systems such as fluid flows or neuronal activity. Prior efforts comprise either specialized models trained separately on individual time series, or foundation models trained on vast time series databases with little underlying dynamical structure. Motivated by dynamical systems theory, we present Panda, Patched Attention for Nonlinear Dynamics. We train Panda on a novel synthetic, extensible dataset of $2 \times 10^4$ chaotic dynamical systems that we discover using an evolutionary algorithm. Trained purely on simulated data, Panda exhibits emergent properties: zero-shot forecasting of unseen chaotic systems preserving both short-term accuracy and long-term statistics. Despite having been trained only on low-dimensional ordinary differential equations, Panda spontaneously develops the ability to predict partial differential equations without retraining. We also demonstrate a neural scaling law for differential equations, underscoring the potential of pretrained models for probing abstract mathematical domains like nonlinear dynamics.

[507] OptiFLIDS: Optimized Federated Learning for Energy-Efficient Intrusion Detection in IoT

Saida Elouardi, Mohammed Jouhari, Anas Motii

Main category: cs.LG

TL;DR: OptiFLIDS is a federated learning approach for IoT intrusion detection that uses pruning to reduce model complexity and energy consumption, with a customized aggregation method to handle non-IID data.

Details

Motivation: Traditional ML-based IDS requires large datasets but data sharing is limited due to privacy concerns. FL enables collaborative training without sharing raw data, but faces challenges with non-IID data and high energy costs for resource-constrained IoT devices.

Method: Proposes OptiFLIDS that applies pruning techniques during local training to reduce model complexity and energy consumption, and incorporates a customized aggregation method to handle pruned models with non-IID data distributions.

Result: Experiments on three IoT IDS datasets (TON_IoT, X-IIoTID, IDSIoT2024) show that OptiFLIDS maintains strong detection performance while improving energy efficiency.

Conclusion: OptiFLIDS is well-suited for deployment in real-world IoT environments, balancing security performance with energy efficiency.

Abstract: In critical IoT environments, such as smart homes and industrial systems, effective Intrusion Detection Systems (IDS) are essential for ensuring security. However, developing robust IDS solutions remains a significant challenge. Traditional machine learning-based IDS models typically require large datasets, but data sharing is often limited due to privacy and security concerns. Federated Learning (FL) presents a promising alternative by enabling collaborative model training without sharing raw data. Despite its advantages, FL still faces key challenges, such as data heterogeneity (non-IID data) and high energy and computation costs, particularly for resource constrained IoT devices. To address these issues, this paper proposes OptiFLIDS, a novel approach that applies pruning techniques during local training to reduce model complexity and energy consumption. It also incorporates a customized aggregation method to better handle pruned models that differ due to non-IID data distributions. Experiments conducted on three recent IoT IDS datasets, TON_IoT, X-IIoTID, and IDSIoT2024, demonstrate that OptiFLIDS maintains strong detection performance while improving energy efficiency, making it well-suited for deployment in real-world IoT environments.

[508] Learning to Explore in Diverse Reward Settings via Temporal-Difference-Error Maximization

Sebastian Griesbach, Carlo D’Eramo

Main category: cs.LG

TL;DR: Proposes Stable Error-seeking Exploration (SEE), a novel exploration method that works robustly across dense, sparse, and exploration-adverse reward settings by maximizing TD-error as a separate objective with three design choices to address instability issues.

Details

Motivation: Existing exploration methods require additional tuning for different reward settings and struggle with exploration-adverse rewards that actively discourage exploration. There's a need for a robust exploration approach that works across diverse reward settings without hyperparameter adjustments.

Method: SEE maximizes TD-error as a separate objective with three key design choices: 1) mitigating instability from far-off-policy learning, 2) resolving the conflict of interest in maximizing cumulative TD-error in episodic settings, and 3) addressing the non-stationary nature of TD-errors. It can be combined with off-policy algorithms without modifying their original optimization pipeline.

Result: Experimental analysis shows that Soft-Actor Critic agent with SEE performs robustly across three diverse reward settings (dense, sparse, and exploration-adverse) in various tasks without requiring hyperparameter adjustments.

Conclusion: SEE provides a robust exploration method that works effectively across different reward settings, addressing key challenges in TD-error maximization while maintaining compatibility with existing off-policy algorithms.

Abstract: Numerous heuristics and advanced approaches have been proposed for exploration in different settings for deep reinforcement learning. Noise-based exploration generally fares well with dense-shaped rewards and bonus-based exploration with sparse rewards. However, these methods usually require additional tuning to deal with undesirable reward settings by adjusting hyperparameters and noise distributions. Rewards that actively discourage exploration, i.e., with an action cost and no other dense signal to follow, can pose a major challenge. We propose a novel exploration method, Stable Error-seeking Exploration (SEE), that is robust across dense, sparse, and exploration-adverse reward settings. To this endeavor, we revisit the idea of maximizing the TD-error as a separate objective. Our method introduces three design choices to mitigate instability caused by far-off-policy learning, the conflict of interest of maximizing the cumulative TD-error in an episodic setting, and the non-stationary nature of TD-errors. SEE can be combined with off-policy algorithms without modifying the optimization pipeline of the original objective. In our experimental analysis, we show that a Soft-Actor Critic agent with the addition of SEE performs robustly across three diverse reward settings in a variety of tasks without hyperparameter adjustments.

[509] Scalable Policy-Based RL Algorithms for POMDPs

Ameya Anjarlekar, Rasoul Etesami, R Srikant

Main category: cs.LG

TL;DR: The paper proposes approximating POMDPs as finite-state MDPs (Superstate MDPs) using finite histories, with theoretical guarantees showing exponentially decreasing approximation error with history length, and enables using TD-learning and policy optimization methods.

Details

Motivation: The continuous nature of belief states in POMDPs presents significant computational challenges for learning optimal policies, motivating the need for efficient approximation methods.

Method: Transform POMDP into a finite-state Superstate MDP using finite histories, then apply policy-based learning with linear function approximation using TD-learning followed by policy optimization.

Result: Derived improved theoretical guarantees relating Superstate MDP and original POMDP value functions, and showed approximation error decreases exponentially with history length.

Conclusion: POMDPs can be approximately solved using standard MDP methods by treating them as finite-history MDPs, with quantifiable finite-time bounds for TD learning in non-Markovian settings.

Abstract: The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model into a finite-state Markov Decision Process (MDP) (called Superstate MDP). We first derive theoretical guarantees that improve upon prior work that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by Policy Optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.

[510] Leveraging Predictive Equivalence in Decision Trees

Hayden McTavish, Zachery Boner, Jon Donnelly, Margo Seltzer, Cynthia Rudin

Main category: cs.LG

TL;DR: Decision trees suffer from predictive equivalence - multiple trees can represent the same decision boundary but behave differently. The paper presents a boolean logical representation that eliminates this equivalence and applies it to improve robustness to missing values, variable importance quantification, and prediction cost optimization.

Details

Motivation: Decision trees have predictive equivalence where identical decision boundaries can be represented by different trees, causing arbitrary model selection and inconsistent behavior in variable importance and missing value handling.

Method: Developed a boolean logical representation of decision trees that eliminates predictive equivalence and faithfully represents the underlying decision boundary.

Result: The representation shows decision trees are surprisingly robust to test-time missing feature values, improves variable importance quantification, and enables optimization of prediction costs.

Conclusion: The boolean logical representation addresses predictive equivalence issues in decision trees, leading to more reliable model behavior and enabling better optimization for practical applications.

Abstract: Decision trees are widely used for interpretable machine learning due to their clearly structured reasoning process. However, this structure belies a challenge we refer to as predictive equivalence: a given tree’s decision boundary can be represented by many different decision trees. The presence of models with identical decision boundaries but different evaluation processes makes model selection challenging. The models will have different variable importance and behave differently in the presence of missing values, but most optimization procedures will arbitrarily choose one such model to return. We present a boolean logical representation of decision trees that does not exhibit predictive equivalence and is faithful to the underlying decision boundary. We apply our representation to several downstream machine learning tasks. Using our representation, we show that decision trees are surprisingly robust to test-time missingness of feature values; we address predictive equivalence’s impact on quantifying variable importance; and we present an algorithm to optimize the cost of reaching predictions.

[511] GTCN-G: A Residual Graph-Temporal Fusion Network for Imbalanced Intrusion Detection (Preprint)

Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Qi Hu, Yan Li, Chang Liu

Main category: cs.LG

TL;DR: GTCN-G is a novel deep learning framework that combines Gated Temporal Convolutional Networks and Graph Neural Networks with residual learning to address class imbalance in intrusion detection, achieving state-of-the-art performance.

Details

Motivation: To overcome challenges in intrusion detection systems caused by complex network threats and class imbalance in traffic data, and to create a framework that integrates both topological structure modeling and time-series dependency capture.

Method: Proposes GTCN-G framework that fuses Gated TCN for hierarchical temporal feature extraction with GCN for graph structure learning, incorporating residual learning via Graph Attention Network to preserve original features and mitigate class imbalance.

Result: Extensive experiments on UNSW-NB15 and ToN-IoT datasets show GTCN-G achieves state-of-the-art performance, significantly outperforming baseline models in both binary and multi-class classification tasks.

Conclusion: The GTCN-G framework successfully addresses class imbalance in intrusion detection by synergistically integrating temporal and structural modeling with residual learning, demonstrating superior detection capabilities for rare malicious activities.

Abstract: The escalating complexity of network threats and the inherent class imbalance in traffic data present formidable challenges for modern Intrusion Detection Systems (IDS). While Graph Neural Networks (GNNs) excel in modeling topological structures and Temporal Convolutional Networks (TCNs) are proficient in capturing time-series dependencies, a framework that synergistically integrates both while explicitly addressing data imbalance remains an open challenge. This paper introduces a novel deep learning framework, named Gated Temporal Convolutional Network and Graph (GTCN-G), engineered to overcome these limitations. Our model uniquely fuses a Gated TCN (G-TCN) for extracting hierarchical temporal features from network flows with a Graph Convolutional Network (GCN) designed to learn from the underlying graph structure. The core innovation lies in the integration of a residual learning mechanism, implemented via a Graph Attention Network (GAT). This mechanism preserves original feature information through residual connections, which is critical for mitigating the class imbalance problem and enhancing detection sensitivity for rare malicious activities (minority classes). We conducted extensive experiments on two public benchmark datasets, UNSW-NB15 and ToN-IoT, to validate our approach. The empirical results demonstrate that the proposed GTCN-G model achieves state-of-the-art performance, significantly outperforming existing baseline models in both binary and multi-class classification tasks.

[512] HEAL: An Empirical Study on Hallucinations in Embodied Agents Driven by Large Language Models

Trishna Chakraborty, Udita Ghosh, Xiaopan Zhang, Fahim Faisal Niloy, Yue Dong, Jiachen Li, Amit K. Roy-Chowdhury, Chengyu Song

Main category: cs.LG

TL;DR: This paper presents the first systematic study of hallucinations in LLM-based embodied agents, showing they can fail to ground instructions in physical environments, leading to navigation errors.

Details

Motivation: LLMs are increasingly used as cognitive cores for embodied agents, but inherited hallucinations can cause navigation failures when agents can't ground user instructions in the actual physical environment.

Method: Constructed a hallucination probing set based on existing benchmarks to induce high hallucination rates, then evaluated 12 models across two simulation environments under scene-task inconsistencies.

Result: Models exhibited reasoning but failed to resolve scene-task inconsistencies, with hallucination rates up to 40x higher than base prompts, highlighting fundamental limitations in handling infeasible tasks.

Conclusion: The study provides actionable insights on ideal model behavior and guidance for developing more robust and reliable planning strategies for LLM-based embodied agents.

Abstract: Large language models (LLMs) are increasingly being adopted as the cognitive core of embodied agents. However, inherited hallucinations, which stem from failures to ground user instructions in the observed physical environment, can lead to navigation errors, such as searching for a refrigerator that does not exist. In this paper, we present the first systematic study of hallucinations in LLM-based embodied agents performing long-horizon tasks under scene-task inconsistencies. Our goal is to understand to what extent hallucinations occur, what types of inconsistencies trigger them, and how current models respond. To achieve these goals, we construct a hallucination probing set by building on an existing benchmark, capable of inducing hallucination rates up to 40x higher than base prompts. Evaluating 12 models across two simulation environments, we find that while models exhibit reasoning, they fail to resolve scene-task inconsistencies-highlighting fundamental limitations in handling infeasible tasks. We also provide actionable insights on ideal model behavior for each scenario, offering guidance for developing more robust and reliable planning strategies.

[513] ICL-Router: In-Context Learned Model Representations for LLM Routing

Chenxu Wang, Hao Li, Yiqun Zhang, Linyao Chen, Jianhao Chen, Ping Jian, Peng Ye, Qiaosheng Zhang, Shuyue Hu

Main category: cs.LG

TL;DR: Proposes a novel model routing method using in-context vectors to represent model capabilities, enabling dynamic query routing without retraining when adding new models.

Details

Motivation: LLMs have complementary strengths, but current routing methods require retraining when adding new models and rely on accurate model representations, limiting scalability.

Method: Two-stage approach: 1) Embed queries and project into vectors, training projector and LLM-based router to reconstruct queries; 2) Profile candidate models on query set and learn to predict model performance using in-context vectors.

Result: Achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks, with seamless integration of new models without retraining.

Conclusion: The method provides an effective and scalable solution for model routing that maintains performance while allowing easy model pool expansion.

Abstract: Large language models (LLMs) often exhibit complementary strengths. Model routing harnesses these strengths by dynamically directing each query to the most suitable model, given a candidate model pool. However, routing performance relies on accurate model representations, and adding new models typically requires retraining, limiting scalability. To address these challenges, we propose a novel routing method using in-context vectors to represent model capabilities. The method proceeds in two stages. First, queries are embedded and projected into vectors, with a projector and LLM-based router trained to reconstruct the original queries, aligning vector representations with the router’s semantic space. Second, each candidate model is profiled on a query set, and the router learns – based on in-context vectors of query and model performance – to predict whether each model can correctly answer new queries. Extensive experiments demonstrate that our method achieves state-of-the-art routing performance in both in-distribution and out-of-distribution tasks. Moreover, our method allows for seamless integration of new models without retraining the router. The code is available at https://github.com/lalalamdbf/ICL-Router.

[514] Riemannian generative decoder

Andreas Bjerregaard, Søren Hauberg, Anders Krogh

Main category: cs.LG

TL;DR: Proposes a Riemannian generative decoder that eliminates the need for encoders in manifold learning, using only a decoder with Riemannian optimization to learn latent representations on any Riemannian manifold.

Details

Motivation: To avoid numerically brittle optimization objectives in Riemannian representation learning that rely on encoders to estimate densities on manifolds, which can harm model training and quality.

Method: Introduces a Riemannian generative decoder that learns manifold-valued latents using Riemannian optimization while jointly training a decoder network, completely discarding the encoder to simplify manifold constraints.

Result: Validated on three case studies (synthetic branching diffusion, human migrations from mitochondrial DNA, cell division cycle) showing learned representations respect prescribed geometry and capture intrinsic non-Euclidean structure.

Conclusion: The method requires only a decoder, is compatible with existing architectures, yields interpretable latent spaces aligned with data geometry, and provides a unifying approach for manifold-valued latents on any Riemannian manifold.

Abstract: Riemannian representation learning typically relies on an encoder to estimate densities on chosen manifolds. This involves optimizing numerically brittle objectives, potentially harming model training and quality. To completely circumvent this issue, we introduce the Riemannian generative decoder, a unifying approach for finding manifold-valued latents on any Riemannian manifold. Latents are learned with a Riemannian optimizer while jointly training a decoder network. By discarding the encoder, we vastly simplify the manifold constraint compared to current approaches which often only handle few specific manifolds. We validate our approach on three case studies – a synthetic branching diffusion process, human migrations inferred from mitochondrial DNA, and cells undergoing a cell division cycle – each showing that learned representations respect the prescribed geometry and capture intrinsic non-Euclidean structure. Our method requires only a decoder, is compatible with existing architectures, and yields interpretable latent spaces aligned with data geometry. Code available on https://github.com/yhsure/riemannian-generative-decoder.

[515] Reconstruction of SINR Maps from Sparse Measurements using Group Equivariant Non-Expansive Operators

Lorenzo Mario Amorosa, Francesco Conti, Nicola Quercioli, Flavio Zabini, Tayebeh Lotfi Mahyari, Yiqun Ge, Patrizio Frosini

Main category: cs.LG

TL;DR: This paper introduces a GENEO-based framework for reconstructing high-resolution SINR maps from sparse measurements in 6G networks, focusing on preserving topological structure rather than just minimizing pixel-wise error.

Details

Motivation: Accurate SINR maps are critical for 6G network optimization, but high-resolution acquisition is cost-prohibitive, creating data scarcity. Existing ML approaches require large datasets and may not preserve important topological features.

Method: Proposes using Group Equivariant Non-Expansive Operators (GENEOs) that embed geometric priors like translation invariance directly into low-complexity operators, enabling reconstruction from very few samples while preserving topological structure.

Result: The method maintains competitive MSE performance while dramatically outperforming established ML baselines in topological fidelity (measured by 1-Wasserstein distance), particularly preserving the geometry of coverage holes and interference patterns.

Conclusion: GENEOs provide a practical advantage for creating structurally accurate SINR maps that are more reliable for downstream network optimization tasks, offering effective reconstruction from sparse measurements with strong geometric priors.

Abstract: As sixth generation (6G) wireless networks evolve, accurate signal-to-interference-noise ratio (SINR) maps are becoming increasingly critical for effective resource management and optimization. However, acquiring such maps at high resolution is often cost-prohibitive, creating a severe data scarcity challenge. This necessitates machine learning (ML) approaches capable of robustly reconstructing the full map from extremely sparse measurements. To address this, we introduce a novel reconstruction framework based on Group Equivariant Non-Expansive Operators (GENEOs). Unlike data-hungry ML models, GENEOs are low-complexity operators that embed domain-specific geometric priors, such as translation invariance, directly into their structure. This provides a strong inductive bias, enabling effective reconstruction from very few samples. Our key insight is that for network management, preserving the topological structure of the SINR map, such as the geometry of coverage holes and interference patterns, is often more critical than minimizing pixel-wise error. We validate our approach on realistic ray-tracing-based urban scenarios, evaluating performance with both traditional statistical metrics (mean squared error (MSE)) and, crucially, a topological metric (1-Wasserstein distance). Results show that while maintaining competitive MSE, our method dramatically outperforms established ML baselines in topological fidelity. This demonstrates the practical advantage of GENEOs for creating structurally accurate SINR maps that are more reliable for downstream network optimization tasks.

[516] Interpretable Reward Model via Sparse Autoencoder

Shuyi Zhang, Wei Shi, Sihang Li, Jiayi Liao, Tao Liang, Hengxing Cai, Xiang Wang

Main category: cs.LG

TL;DR: SARM integrates sparse autoencoders into reward models to create interpretable, feature-level attribution for LLM alignment, enabling dynamic preference adjustment and superior performance.

Details

Motivation: Traditional reward models lack interpretability, provide limited reasoning insight, and are inflexible to user preference shifts, while existing multidimensional RMs fail to offer feature-level attribution and require expensive annotations.

Method: Integrates a pretrained Sparse Autoencoder (SAE) into a reward model to map LLM hidden activations into an interpretable, sparse, monosemantic feature space, with a scalar head aggregating features for transparent reward scoring.

Result: Empirical evaluations show SARM enables direct feature-level reward attribution, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models.

Conclusion: SARM provides a novel architecture that enhances reward model interpretability and flexibility while maintaining strong alignment performance, addressing key limitations of traditional approaches.

Abstract: Large language models (LLMs) have been widely deployed across numerous fields. Reinforcement Learning from Human Feedback (RLHF) leverages reward models (RMs) as proxies for human preferences to align LLM behaviors with human values, making the accuracy, reliability, and interpretability of RMs critical for effective alignment. However, traditional RMs lack interpretability, offer limited insight into the reasoning behind reward assignments, and are inflexible toward user preference shifts. While recent multidimensional RMs aim for improved interpretability, they often fail to provide feature-level attribution and require costly annotations. To overcome these limitations, we introduce the Sparse Autoencoder-enhanced Reward Model (SARM), a novel architecture that integrates a pretrained Sparse Autoencoder (SAE) into a reward model. SARM maps the hidden activations of LLM-based RM into an interpretable, sparse, and monosemantic feature space, from which a scalar head aggregates feature activations to produce transparent and conceptually meaningful reward scores. Empirical evaluations demonstrate that SARM facilitates direct feature-level attribution of reward assignments, allows dynamic adjustment to preference shifts, and achieves superior alignment performance compared to conventional reward models. Our code is available at https://github.com/schrieffer-z/sarm.

[517] Optimally Deep Networks – Adapting Model Depth to Datasets for Superior Efficiency

Shaharyar Ahmed Khan Tareen, Filza Khan Tareen

Main category: cs.LG

TL;DR: ODNs optimize neural network depth to match dataset complexity, reducing memory footprint by up to 98.64% while maintaining competitive accuracy through progressive depth expansion training.

Details

Motivation: Deep neural networks often have unnecessarily large sizes and high computational demands, wasting resources especially for simpler datasets, making deployment on resource-constrained devices impractical.

Method: Progressive depth expansion training strategy that starts with shallow networks and incrementally increases depth as earlier blocks converge, removing redundant layers to use only optimal depth for each dataset.

Result: ResNet-18 and ResNet-34 achieved up to 98.64% and 96.44% memory footprint reduction while maintaining 99.31% and 96.08% accuracy on MNIST and SVHN datasets respectively.

Conclusion: ODNs provide an effective approach to balance model depth with task complexity, significantly reducing computational costs and memory usage while maintaining performance, enabling practical deployment on edge devices.

Abstract: Deep neural networks (DNNs) have provided brilliant performance across various tasks. However, this success often comes at the cost of unnecessarily large model sizes, high computational demands, and substantial memory footprints. Typically, powerful architectures are trained at full depths but not all datasets or tasks require such high model capacity. Training very deep architectures on relatively low-complexity datasets frequently leads to wasted computation, unnecessary energy consumption, and excessive memory usage, which in turn makes deployment of models on resource-constrained devices impractical. To address this problem, we introduce Optimally Deep Networks (ODNs), which provide a balance between model depth and task complexity. Specifically, we propose a NAS like training strategy called progressive depth expansion, which begins by training deep networks at shallower depths and incrementally increases their depth as the earlier blocks converge, continuing this process until the target accuracy is reached. ODNs use only the optimal depth for the given datasets, removing redundant layers. This cuts down future training and inference costs, lowers the memory footprint, enhances computational efficiency, and facilitates deployment on edge devices. Empirical results show that the optimal depths of ResNet-18 and ResNet-34 for MNIST and SVHN, achieve up to 98.64 % and 96.44 % reduction in memory footprint, while maintaining a competitive accuracy of 99.31 % and 96.08 %, respectively.

[518] Pattern-based Knowledge Component Extraction from Student Code Using Representation Learning

Muntasir Hoq, Griffin Pitts, Andrew Lan, Peter Brusilovsky, Bita Akram

Main category: cs.LG

TL;DR: A novel explainable framework for automated Knowledge Component (KC) discovery in CS education using pattern-based KCs extracted from student code via Variational Autoencoder and attention-based code representation.

Details

Motivation: Current automated KC extraction from student code faces challenges due to insufficient explainability of discovered KCs and the open-ended nature of programming problems with structural variability and complex concept interactions.

Method: Train a Variational Autoencoder to generate representative patterns from student code guided by an explainable attention-based code representation model, then cluster patterns to form pattern-based KCs.

Result: Evaluation using learning curve analysis and Deep Knowledge Tracing shows meaningful learning trajectories and significant improvements in predictive performance over traditional KT methods.

Conclusion: The work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns essential for student learning.

Abstract: Effective personalized learning in computer science education depends on accurately modeling what students know and what they need to learn. While Knowledge Components (KCs) provide a foundation for such modeling, automated KC extraction from student code is inherently challenging due to insufficient explainability of discovered KCs and the open-endedness of programming problems with significant structural variability across student solutions and complex interactions among programming concepts. In this work, we propose a novel, explainable framework for automated KC discovery through pattern-based KCs: recurring structural patterns within student code that capture the specific programming patterns and language constructs that students must master. Toward this, we train a Variational Autoencoder to generate important representative patterns from student code guided by an explainable, attention-based code representation model that identifies important correct and incorrect pattern implementations from student code. These patterns are then clustered to form pattern-based KCs. We evaluate our KCs using two well-established methods informed by Cognitive Science: learning curve analysis and Deep Knowledge Tracing (DKT). Experimental results demonstrate meaningful learning trajectories and significant improvements in DKT predictive performance over traditional KT methods. This work advances knowledge modeling in CS education by providing an automated, scalable, and explainable framework for identifying granular code patterns and algorithmic constructs, essential for student learning.

Jiajun Li, Ran Hou, Yu Ding, Yixuan Li, Shisi Guan, Jiahui Duan, Xiongwei Han, Tao Zhong, Vincent Chau, Weiwei Wu, Wanyuan Wang

Main category: cs.LG

TL;DR: A novel constraint-based model reduction approach for mixed integer linear programming that identifies critical inequality constraints to transform into equalities, accelerating MILP solving while preserving feasibility.

Details

Motivation: Existing model reduction methods focus on variable reduction, while constraint reduction has been largely ignored despite its potential to reduce MILP complexity from a dual perspective.

Method: Proposes identifying critical tight-constraints at optimal solution, using heuristic rule for selection, and multi-modal representation technique combining instance-level and abstract-level MILP information to learn these constraints.

Result: Improves solution quality by over 50% and reduces computation time by 17.47% compared to state-of-the-art methods.

Conclusion: Constraint-based model reduction is an effective approach for accelerating large-scale MILP solving, demonstrating significant improvements over existing variable reduction methods.

Abstract: Model reduction, which aims to learn a simpler model of the original mixed integer linear programming (MILP), can solve large-scale MILP problems much faster. Most existing model reduction methods are based on variable reduction, which predicts a solution value for a subset of variables. From a dual perspective, constraint reduction that transforms a subset of inequality constraints into equalities can also reduce the complexity of MILP, but has been largely ignored. Therefore, this paper proposes a novel constraint-based model reduction approach for the MILP. Constraint-based MILP reduction has two challenges: 1) which inequality constraints are critical such that reducing them can accelerate MILP solving while preserving feasibility, and 2) how to predict these critical constraints efficiently. To identify critical constraints, we first label these tight-constraints at the optimal solution as potential critical constraints and design a heuristic rule to select a subset of critical tight-constraints. To learn the critical tight-constraints, we propose a multi-modal representation technique that leverages information from both instance-level and abstract-level MILP formulations. The experimental results show that, compared to the state-of-the-art methods, our method improves the quality of the solution by over 50% and reduces the computation time by 17.47%.

[520] Context-Action Embedding Learning for Off-Policy Evaluation in Contextual Bandits

Kushagra Chandak, Vincent Liu, Haanvid Lee

Main category: cs.LG

TL;DR: CAEL-MIPS learns context-action embeddings to minimize MSE in off-policy evaluation for contextual bandits, outperforming existing methods.

Details

Motivation: IPS estimators suffer from high variance in large action spaces or underexplored regions, while existing MIPS methods don't minimize MSE and ignore context information.

Method: Proposes CAEL-MIPS that learns context-action embeddings from offline data using an MSE-minimizing objective derived from theoretical bias-variance analysis of MIPS.

Result: Empirical studies on synthetic and real-world datasets show CAEL-MIPS outperforms baselines in terms of mean squared error.

Conclusion: Learning context-action embeddings that minimize MSE improves off-policy evaluation performance in contextual bandits.

Abstract: We consider off-policy evaluation (OPE) in contextual bandits with finite action space. Inverse Propensity Score (IPS) weighting is a widely used method for OPE due to its unbiased, but it suffers from significant variance when the action space is large or when some parts of the context-action space are underexplored. Recently introduced Marginalized IPS (MIPS) estimators mitigate this issue by leveraging action embeddings. However, these embeddings do not minimize the mean squared error (MSE) of the estimators and do not consider context information. To address these limitations, we introduce Context-Action Embedding Learning for MIPS, or CAEL-MIPS, which learns context-action embeddings from offline data to minimize the MSE of the MIPS estimator. Building on the theoretical analysis of bias and variance of MIPS, we present an MSE-minimizing objective for CAEL-MIPS. In the empirical studies on a synthetic dataset and a real-world dataset, we demonstrate that our estimator outperforms baselines in terms of MSE.

[521] Fused Lasso Improves Accuracy of Co-occurrence Network Inference in Grouped Samples

Daniel Agyapong, Briana H. Beatty, Peter G. Kennedy, Jane C. Marks, Toby D. Hocking

Main category: cs.LG

TL;DR: Proposes fuser algorithm for microbiome network inference that captures environment-specific associations while sharing information across environments, outperforming existing methods in cross-environment scenarios.

Details

Motivation: Existing co-occurrence network algorithms analyze microbial associations within single environmental niches, failing to capture dynamic adaptations across different ecological conditions when samples from multiple niches are combined.

Method: Developed fuser algorithm that retains subsample-specific signals while sharing relevant information across environments, generating distinct environment-specific predictive networks. Evaluated using Same-All Cross-validation (SAC) framework on microbiome data from multiple locations and time points.

Result: Fuser achieves comparable performance to glmnet within homogeneous environments (Same scenario) and significantly reduces test error compared to baseline algorithms in cross-environment (All) scenarios.

Conclusion: Fuser effectively addresses limitations of conventional network inference by capturing both environment-specific microbial associations and shared patterns across different ecological conditions, improving predictive performance in heterogeneous environments.

Abstract: Co-occurrence network inference algorithms have significantly advanced our understanding of microbiome communities. However, these algorithms typically analyze microbial associations within samples collected from a single environmental niche, often capturing only static snapshots rather than dynamic microbial processes. Previous studies have commonly grouped samples from different environmental niches together without fully considering how microbial communities adapt their associations when faced with varying ecological conditions. Our study addresses this limitation by explicitly investigating both spatial and temporal dynamics of microbial communities. We analyzed publicly available microbiome abundance data across multiple locations and time points, to evaluate algorithm performance in predicting microbial associations using our proposed Same-All Cross-validation (SAC) framework. SAC evaluates algorithms in two distinct scenarios: training and testing within the same environmental niche (Same), and training and testing on combined data from multiple environmental niches (All). To overcome the limitations of conventional algorithms, we propose fuser, an algorithm that, while not entirely new in machine learning, is novel for microbiome community network inference. It retains subsample-specific signals while simultaneously sharing relevant information across environments during training. Unlike standard approaches that infer a single generalized network from combined data, fuser generates distinct, environment-specific predictive networks. Our results demonstrate that fuser achieves comparable predictive performance to existing algorithms such as glmnet when evaluated within homogeneous environments (Same), and notably reduces test error compared to baseline algorithms in cross-environment (All) scenarios.

[522] Central Limit Theorems for Asynchronous Averaged Q-Learning

Xingtu Liu

Main category: cs.LG

TL;DR: Central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates, including non-asymptotic and functional CLTs.

Details

Motivation: To establish rigorous statistical guarantees for Polyak-Ruppert averaged Q-learning algorithms in asynchronous settings, providing explicit dependence on key parameters.

Method: Proving non-asymptotic central limit theorem with explicit convergence rates in Wasserstein distance, and deriving functional central limit theorem for partial-sum processes.

Result: Established convergence rate explicitly depending on iterations, state-action space size, discount factor, and exploration quality; showed partial-sum process converges weakly to Brownian motion.

Conclusion: The paper provides comprehensive central limit theory for averaged Q-learning, offering statistical foundations for practical algorithm analysis and performance guarantees.

Abstract: This paper establishes central limit theorems for Polyak-Ruppert averaged Q-learning under asynchronous updates. We prove a non-asymptotic central limit theorem, where the convergence rate in Wasserstein distance explicitly reflects the dependence on the number of iterations, state-action space size, the discount factor, and the quality of exploration. In addition, we derive a functional central limit theorem, showing that the partial-sum process converges weakly to a Brownian motion.

[523] One Prompt Fits All: Universal Graph Adaptation for Pretrained Models

Yongqi Huang, Jitao Zhao, Dongxiao He, Xiaobao Wang, Yawen Li, Yuxiao Huang, Di Jin, Zhiyong Feng

Main category: cs.LG

TL;DR: The paper analyzes Graph Prompt Learning (GPL), identifies limitations in current approaches, and proposes UniPrompt as a solution that adapts pretrained models while preserving input graphs for better performance across diverse scenarios.

Details

Motivation: To address the lack of consensus on how prompts interact with pretrained models and the limited adaptability of current GPL methods across diverse downstream scenarios, especially under data distribution shifts.

Method: The authors theoretically analyze existing GPL approaches and propose UniPrompt, a novel GPL method that adapts any pretrained models while preserving the input graph structure.

Result: Extensive experiments demonstrate that UniPrompt can effectively integrate with various pretrained models and achieve strong performance across both in-domain and cross-domain scenarios.

Conclusion: Graph prompt learning should focus on unleashing the capability of pretrained models, and the classifier should adapt to downstream scenarios, with UniPrompt providing an effective solution for this approach.

Abstract: Graph Prompt Learning (GPL) has emerged as a promising paradigm that bridges graph pretraining models and downstream scenarios, mitigating label dependency and the misalignment between upstream pretraining and downstream tasks. Although existing GPL studies explore various prompt strategies, their effectiveness and underlying principles remain unclear. We identify two critical limitations: (1) Lack of consensus on underlying mechanisms: Despite current GPLs have advanced the field, there is no consensus on how prompts interact with pretrained models, as different strategies intervene at varying spaces within the model, i.e., input-level, layer-wise, and representation-level prompts. (2) Limited scenario adaptability: Most methods fail to generalize across diverse downstream scenarios, especially under data distribution shifts (e.g., homophilic-to-heterophilic graphs). To address these issues, we theoretically analyze existing GPL approaches and reveal that representation-level prompts essentially function as fine-tuning a simple downstream classifier, proposing that graph prompt learning should focus on unleashing the capability of pretrained models, and the classifier should adapt to downstream scenarios. Based on our findings, we propose UniPrompt, a novel GPL method that adapts any pretrained models, unleashing the capability of pretrained models while preserving the input graph. Extensive experiments demonstrate that our method can effectively integrate with various pretrained models and achieve strong performance across in-domain and cross-domain scenarios.

[524] A Generalized Information Bottleneck Theory of Deep Learning

Charles Westphal, Stephen Hailes, Mirco Musolesi

Main category: cs.LG

TL;DR: The paper introduces a Generalized Information Bottleneck (GIB) framework that reformulates the IB principle using synergy concepts, addressing theoretical ambiguities and estimation challenges while achieving better generalization and interpretability.

Details

Motivation: The Information Bottleneck principle has theoretical appeal for understanding neural network learning but faces practical limitations due to unresolved ambiguities and estimation difficulties.

Method: Reformulates IB through synergy lens using average interaction information, creating a computable GIB framework that bounds the original IB objective.

Result: GIB shows consistent compression phases across architectures (including ReLU networks where standard IB fails), provides interpretable dynamics in CNNs and Transformers, and aligns with adversarial robustness understanding.

Conclusion: GIB successfully addresses IB limitations while maintaining theoretical compatibility, offering a more practical and interpretable framework for understanding neural network learning dynamics.

Abstract: The Information Bottleneck (IB) principle offers a compelling theoretical framework to understand how neural networks (NNs) learn. However, its practical utility has been constrained by unresolved theoretical ambiguities and significant challenges in accurate estimation. In this paper, we present a \textit{Generalized Information Bottleneck (GIB)} framework that reformulates the original IB principle through the lens of synergy, i.e., the information obtainable only through joint processing of features. We provide theoretical and empirical evidence demonstrating that synergistic functions achieve superior generalization compared to their non-synergistic counterparts. Building on these foundations we re-formulate the IB using a computable definition of synergy based on the average interaction information (II) of each feature with those remaining. We demonstrate that the original IB objective is upper bounded by our GIB in the case of perfect estimation, ensuring compatibility with existing IB theory while addressing its limitations. Our experimental results demonstrate that GIB consistently exhibits compression phases across a wide range of architectures (including those with \textit{ReLU} activations where the standard IB fails), while yielding interpretable dynamics in both CNNs and Transformers and aligning more closely with our understanding of adversarial robustness.

[525] PEAR: Planner-Executor Agent Robustness Benchmark

Shen Dong, Mingxuan Zhang, Pengfei He, Li Ma, Bhavani Thuraisingham, Hui Liu, Yue Xing

Main category: cs.LG

TL;DR: PEAR is a benchmark for evaluating utility and vulnerability of planner-executor multi-agent systems, revealing key insights about performance degradation, memory importance, trade-offs between performance and robustness, and planner vulnerability to attacks.

Details

Motivation: Existing studies examine isolated attack surfaces or specific scenarios in multi-agent systems, leaving a lack of holistic understanding of MAS vulnerabilities.

Method: Introduce PEAR benchmark for systematically evaluating planner-executor MAS through extensive experiments, focusing on planner-executor structure which is a practical and widely adopted design.

Result: (1) Weak planner degrades clean task performance more than weak executor; (2) Memory essential for planner but not for executor; (3) Trade-off between task performance and robustness; (4) Planner-targeted attacks are particularly effective.

Conclusion: Findings offer actionable insights for enhancing MAS robustness and lay groundwork for principled defenses in multi-agent settings.

Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.

[526] Near-Optimal Second-Order Guarantees for Model-Based Adversarial Imitation Learning

Shangzhe Li, Dongruo Zhou, Weitong Zhang

Main category: cs.LG

TL;DR: The paper introduces MB-AIL, a model-based adversarial imitation learning algorithm that achieves horizon-free, second-order sample complexity guarantees for online interaction with limited expert demonstrations.

Details

Motivation: To address the poorly understood benefits of online interaction and impact of stochasticity in adversarial imitation learning, where agents learn from offline expert demonstrations without reward access.

Method: Developed a model-based AIL algorithm (MB-AIL) that uses general function approximations for both expert data and reward-free interactions, establishing second-order sample complexity guarantees.

Result: MB-AIL achieves minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations, and matches lower bounds for expert data dependence on horizon, precision, and policy variance.

Conclusion: The theoretical analysis and experiments demonstrate that MB-AIL provides instance-dependent results that tighten as systems approach determinism, and practical implementations match or surpass existing methods’ sample efficiency.

Abstract: We study online adversarial imitation learning (AIL), where an agent learns from offline expert demonstrations and interacts with the environment online without access to rewards. Despite strong empirical results, the benefits of online interaction and the impact of stochasticity remain poorly understood. We address these gaps by introducing a model-based AIL algorithm (MB-AIL) and establish its horizon-free, second-order sample-complexity guarantees under general function approximations for both expert data and reward-free interactions. These second-order bounds provide an instance-dependent result that can scale with the variance of returns under the relevant policies and therefore tighten as the system approaches determinism. Together with second-order, information-theoretic lower bounds on a newly constructed hard-instance family, we show that MB-AIL attains minimax-optimal sample complexity for online interaction (up to logarithmic factors) with limited expert demonstrations and matches the lower bound for expert demonstrations in terms of the dependence on horizon $H$, precision $\epsilon$ and the policy variance $\sigma^2$. Experiments further validate our theoretical findings and demonstrate that a practical implementation of MB-AIL matches or surpasses the sample efficiency of existing methods.

[527] Multi-View Graph Learning with Graph-Tuple

Shiyu Chen, Ningyuan Huang, Soledad Villar

Main category: cs.LG

TL;DR: A multi-view graph-tuple framework that partitions graphs into disjoint subgraphs to capture multiple interaction scales, overcoming limitations of single-scale sparsification in GNNs.

Details

Motivation: Traditional GNNs scale with graph edges, making them inefficient on dense graphs like point clouds. Existing sparsification methods force arbitrary single-scale choices and discard important multi-scale information.

Method: Partition graphs into disjoint subgraphs capturing primary local interactions and weaker long-range connections. Use heterogeneous message-passing architecture inspired by non-commuting operators theory.

Result: The framework is formally proven to be strictly more expressive with lower oracle risk than single-graph models. Shows better performance on molecular property prediction and cosmological parameter inference tasks.

Conclusion: Multi-view graph-tuple models outperform single-graph baselines, demonstrating the power and versatility of capturing multiple interaction scales in graph representation learning.

Abstract: Graph Neural Networks (GNNs) typically scale with the number of graph edges, making them well suited for sparse graphs but less efficient on dense graphs, such as point clouds or molecular interactions. A common remedy is to sparsify the graph via similarity thresholding or distance pruning, but this forces an arbitrary choice of a single interaction scale and discards crucial information from other scales. To overcome this limitation, we introduce a multi-view graph-tuple framework. Instead of a single graph, our graph-tuple framework partitions the graph into disjoint subgraphs, capturing primary local interactions and weaker, long-range connections. We then learn multi-view representations from the graph-tuple via a heterogeneous message-passing architecture inspired by the theory of non-commuting operators, which we formally prove is strictly more expressive and guarantees a lower oracle risk compared to single-graph message-passing models. We instantiate our framework on two scientific domains: molecular property prediction from feature-scarce Coulomb matrices and cosmological parameter inference from geometric point clouds. On both applications, our multi-view graph-tuple models demonstrate better performance than single-graph baselines, highlighting the power and versatility of our multi-view approach.

cs.MA

[528] Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Simin Li, Zihao Mao, Hanxiao Li, Zonglei Jing, Zhuohang bian, Jun Guo, Li Wang, Zhuoran Han, Ruixiao Xu, Xin Yu, Chengdong Ma, Yuqing Ma, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu

Main category: cs.MA

TL;DR: Large-scale empirical study (82,620 experiments) reveals that cooperation-optimized MARL policies lack robustness and resilience under real-world uncertainties, with hyperparameter tuning being crucial for trustworthy systems.

Details

Motivation: Policies tuned for cooperation in ideal environments fail under real-world uncertainties, highlighting the need to understand robustness (stability under uncertainties) and resilience (recovery from disruptions) in MARL systems.

Method: Conducted extensive experiments across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters to evaluate cooperation, robustness, and resilience in MARL.

Result: Key findings: (1) Cooperation-robustness link weakens with intense perturbations; (2) No generalization across uncertainty types; (3) Hyperparameter tuning significantly improves performance, with some standard practices hurting robustness while others consistently help.

Conclusion: Hyperparameter optimization is essential for building trustworthy MARL systems, substantially improving cooperation, robustness, and resilience across all tested algorithms and environments.

Abstract: In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions–a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also varies by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark .

[529] Heterogeneous RBCs via deep multi-agent reinforcement learning

Federico Gabriele, Aldo Glielmo, Marco Taboga

Main category: cs.MA

TL;DR: MARL-BC integrates deep multi-agent reinforcement learning with Real Business Cycle models to bridge the gap between computationally cumbersome heterogeneous-agent GE models and flexible but rule-based agent-based models.

Details

Motivation: To address limitations of current macroeconomic models: heterogeneous-agent GE models rely on unrealistic assumptions and are computationally limited, while agent-based models require explicit behavioral rules and trial-and-error development.

Method: Developed MARL-BC framework that combines deep multi-agent reinforcement learning with Real Business Cycle models, allowing agents to learn behaviors through interaction rather than pre-specified rules.

Result: MARL-BC successfully: (1) recovers textbook RBC results with single agent, (2) reproduces mean-field Krusell-Smith model results with identical agents, and (3) effectively simulates rich agent heterogeneity that is challenging for traditional GE approaches.

Conclusion: MARL-BC represents a synthesis of heterogeneous-agent GE models and agent-based models, functioning as an ABM with heterogeneous interacting agents while reproducing GE results in limit cases, bridging two often opposed modeling paradigms.

Abstract: Current macroeconomic models with agent heterogeneity can be broadly divided into two main groups. Heterogeneous-agent general equilibrium (GE) models, such as those based on Heterogeneous Agents New Keynesian (HANK) or Krusell-Smith (KS) approaches, rely on GE and ‘rational expectations’, somewhat unrealistic assumptions that make the models very computationally cumbersome, which in turn limits the amount of heterogeneity that can be modelled. In contrast, agent-based models (ABMs) can flexibly encompass a large number of arbitrarily heterogeneous agents, but typically require the specification of explicit behavioural rules, which can lead to a lengthy trial-and-error model-development process. To address these limitations, we introduce MARL-BC, a framework that integrates deep multi-agent reinforcement learning (MARL) with Real Business Cycle (RBC) models. We demonstrate that MARL-BC can: (1) recover textbook RBC results when using a single agent; (2) recover the results of the mean-field KS model using a large number of identical agents; and (3) effectively simulate rich heterogeneity among agents, a hard task for traditional GE approaches. Our framework can be thought of as an ABM if used with a variety of heterogeneous interacting agents, and can reproduce GE results in limit cases. As such, it is a step towards a synthesis of these often opposed modelling paradigms.

[530] Characterizing Agent-Based Model Dynamics via $ε$-Machines and Kolmogorov-Style Complexity

Roberto Garrone

Main category: cs.MA

TL;DR: A two-level information-theoretic framework for analyzing Agent-Based Model dynamics in Complex Adaptive Systems, using macro-level pooled ε-machines and micro-level individual agent ε-machines with complexity measures.

Details

Motivation: To characterize the informational organization of ABM dynamics within CAS paradigm, preserving agent heterogeneity while providing interpretable macro-level baselines that align with CAS principles of emergence, feedback, and adaptation.

Method: Two-level approach: macro-level uses pooled ε-machine for system-wide informational regime; micro-level reconstructs ε-machines for each agent dyad with Kolmogorov-style measures (normalized LZ78 complexity and bits per symbol from lossless compression).

Result: Feature set {hμ, Cμ, E, LZ78, bps} enables distributional analysis, stratified comparisons, and unsupervised clustering across agents and scenarios. Case study on caregiver-elder interactions demonstrates implementation.

Conclusion: The dual-scale framework preserves agent heterogeneity while providing interpretable macro-level baseline, successfully aligning ABM practice with CAS principles of emergence, feedback, and adaptation.

Abstract: We propose a two-level information-theoretic framework for characterizing the informational organization of Agent-Based Model (ABM) dynamics within the broader paradigm of Complex Adaptive Systems (CAS). At the macro level, a pooled $\epsilon$-machine is reconstructed as a reference model that summarizes the system-wide informational regime. At the micro level, $\epsilon$-machines are reconstructed for each caregiver-elder dyad and variable, and are complemented with algorithm-agnostic Kolmogorov-style measures, including normalized LZ78 complexity and bits per symbol from lossless compression. The resulting feature set ${h_{\mu}, C_{\mu}, E, \mathrm{LZ78}, \mathrm{bps}}$ enables distributional analysis, stratified comparisons, and unsupervised clustering across agents and scenarios. This dual-scale design preserves agent heterogeneity while providing an interpretable macro-level baseline, aligning ABM practice with CAS principles of emergence, feedback, and adaptation. A case study on caregiver-elder interactions illustrates the framework’s implementation; the results and discussion will be completed following final simulation runs.

[531] Offline Fictitious Self-Play for Competitive Games

Jingxiao Chen, Weiji Xie, Weinan Zhang, Yong yu, Ying Wen

Main category: cs.MA

TL;DR: OFF-FSP is the first practical model-free offline RL algorithm for competitive games that uses importance sampling to simulate opponent interactions and combines offline RL with Fictitious Self-Play to approximate Nash equilibrium while handling partial dataset coverage.

Details

Motivation: Offline multi-agent RL is challenging in competitive games due to inability to interact with opponents for self-play learning and partial dataset coverage that prevents identifying Nash equilibrium, making real-world applications difficult.

Method: Uses importance sampling to simulate interactions with various opponents from fixed datasets, combines single-agent offline RL with Fictitious Self-Play to approximate Nash equilibrium by constraining best responses away from out-of-distribution actions.

Result: Achieves significantly lower exploitability than state-of-the-art baselines on matrix games, extensive-form poker, and board games, and successfully validates on real-world human-robot competitive tasks.

Conclusion: OFF-FSP demonstrates potential for solving complex real-world competitive problems without simulators by effectively handling offline multi-agent RL challenges through importance sampling and FSP integration.

Abstract: Offline Reinforcement Learning (RL) enables policy improvement from fixed datasets without online interactions, making it highly suitable for real-world applications lacking efficient simulators. Despite its success in the single-agent setting, offline multi-agent RL remains a challenge, especially in competitive games. Firstly, unaware of the game structure, it is impossible to interact with the opponents and conduct a major learning paradigm, self-play, for competitive games. Secondly, real-world datasets cannot cover all the state and action space in the game, resulting in barriers to identifying Nash equilibrium (NE). To address these issues, this paper introduces OFF-FSP, the first practical model-free offline RL algorithm for competitive games. We start by simulating interactions with various opponents by adjusting the weights of the fixed dataset with importance sampling. This technique allows us to learn the best responses to different opponents and employ the Offline Self-Play learning framework. To overcome the challenge of partial coverage, we combine the single-agent offline RL method with Fictitious Self-Play (FSP) to approximate NE by constraining the approximate best responses away from out-of-distribution actions. Experiments on matrix games, extensive-form poker, and board games demonstrate that OFF-FSP achieves significantly lower exploitability than state-of-the-art baselines. Finally, we validate OFF-FSP on a real-world human-robot competitive task, demonstrating its potential for solving complex, hard-to-simulate real-world problems.

[532] Multi-Agent Autonomous Driving Systems with Large Language Models: A Survey of Recent Advances

Yaozu Wu, Dongyuan Li, Yankai Chen, Renhe Jiang, Henry Peng Zou, Wei-Chieh Huang, Yangning Li, Liancheng Fang, Zhen Wang, Philip S. Yu

Main category: cs.MA

TL;DR: Survey paper on LLM-based multi-agent autonomous driving systems, addressing limitations of single-agent approaches through enhanced collaboration and communication.

Details

Motivation: To overcome challenges in single-agent LLM-based ADSs (limited perception, insufficient collaboration, high computational demands) by leveraging multi-agent systems with language-driven coordination.

Method: Categorizes existing LLM-based methods based on different agent interaction modes and discusses agent-human interactions in various scenarios.

Result: Provides a comprehensive survey of the intersection between NLP and multi-agent ADSs, including background concepts, categorization frameworks, and interaction analysis.

Conclusion: Summarizes key applications, datasets, and challenges to guide future research in LLM-based multi-agent autonomous driving systems.

Abstract: Autonomous Driving Systems (ADSs) are revolutionizing transportation by reducing human intervention, improving operational efficiency, and enhancing safety. Large Language Models (LLMs) have been integrated into ADSs to support high-level decision-making through their powerful reasoning, instruction-following, and communication abilities. However, LLM-based single-agent ADSs face three major challenges: limited perception, insufficient collaboration, and high computational demands. To address these issues, recent advances in LLM-based multi-agent ADSs leverage language-driven communication and coordination to enhance inter-agent collaboration. This paper provides a frontier survey of this emerging intersection between NLP and multi-agent ADSs. We begin with a background introduction to related concepts, followed by a categorization of existing LLM-based methods based on different agent interaction modes. We then discuss agent-human interactions in scenarios where LLM-based agents engage with humans. Finally, we summarize key applications, datasets, and challenges to support future research.

[533] A Hybrid ABM-PDE Framework for Real-World Infectious Disease Simulations

Kristina Maier, Tim O. F. Conrad

Main category: cs.MA

TL;DR: Hybrid ABM-PDE model for epidemic spread simulation that reduces computational complexity while maintaining accuracy through agent-density coupling.

Details

Motivation: To reduce the computational complexity of full Agent-Based Models (ABM) in epidemic simulations while maintaining comparable accuracy for large-scale spatial disease spread modeling.

Method: Couples Agent-Based Model (ABM) with partial differential equation (PDE) model using compartmental structure with seven health states. Agents crossing from ABM to PDE domain are converted to density contributions, while surplus PDE density generates agents using mobile phone data trajectories.

Result: Hybrid model reduces overall simulation runtime (runs × duration) and achieves smaller errors across both 25% and 100% population samples compared to full ABM. Successfully captures core epidemiological dynamics using real-world mobility and infection data from Berlin-Brandenburg region.

Conclusion: The coupled ABM-PDE hybrid model enables efficient large-scale epidemic simulations with significantly faster runtime while maintaining accuracy, making it suitable for practical public health applications.

Abstract: This paper presents a hybrid modeling approach that couples an Agent-Based Model (ABM) with a partial differential equation (PDE) model in an epidemic setting to simulate the spatial spread of infectious diseases using a compartmental structure with seven health states. The goal is to reduce the computational complexity of a full-ABM by introducing a coupled ABM-PDE model that offers significantly faster simulations while maintaining comparable accuracy. Our results demonstrate that the hybrid model not only reduces the overall simulation runtime (defined as the number of runs required for stable results multiplied by the duration of a single run) but also achieves smaller errors across both 25% and 100% population samples. The coupling mechanism ensures consistency at the model interface: agents crossing from the ABM into the PDE domain are removed and represented as density contributions, while surplus density in the PDE domain is used to generate agents with plausible trajectories derived from mobile phone data. We evaluate the hybrid model using real-world mobility and infection data for the Berlin-Brandenburg region in Germany, showing that it captures the core epidemiological dynamics while enabling efficient large-scale simulations.

[534] Abmax: A JAX-based Agent-based Modeling Framework

Siddharth Chaturvedi, Ahmed El-Gazzar, Marcel van Gerven

Main category: cs.MA

TL;DR: Abmax is a JAX-based agent-based modeling framework that enables dynamic agent updates while maintaining high performance through JIT compilation and vectorization.

Details

Motivation: To overcome JAX's limitation of requiring immutable array shapes in agent-based modeling, which restricts dynamic agent manipulation operations.

Method: Developed Abmax framework with JIT-compilable algorithms for dynamically updating selected agents while maintaining performance.

Result: Achieved comparable runtime performance to state-of-the-art implementations on predation model benchmark, and demonstrated vectorization capability for parallel model execution.

Conclusion: Abmax successfully enables dynamic agent updates in JAX-based ABM while preserving performance, with applications shown in traffic-flow and financial market models.

Abstract: Agent-based modeling (ABM) is a principal approach for studying complex systems. By decomposing a system into simpler, interacting agents, agent-based modeling (ABM) allows researchers to observe the emergence of complex phenomena. High-performance array computing libraries like JAX can help scale such computational models to a large number of agents by using automatic vectorization and just-in-time (JIT) compilation. One of the caveats of using JAX to achieve such scaling is that the shapes of arrays used in the computational model should remain immutable throughout the simulation. In the context of agent-based modeling (ABM), this can pose constraints on certain agent manipulation operations that require flexible data structures. A subset of which is represented by the ability to update a dynamically selected number of agents by applying distinct changes to them during a simulation. To this effect, we introduce Abmax, an ABM framework based on JAX that implements multiple just-in-time (JIT) compilable algorithms to provide this functionality. On the canonical predation model benchmark, Abmax achieves runtime performance comparable to state-of-the-art implementations. Further, we show that this functionality can also be vectorized, making it possible to run many similar agent-based models in parallel. We also present two examples in the form of a traffic-flow model and a financial market model to show the use case of Abmax.

Anastasia Psarou, Łukasz Gorczyca, Dominik Gaweł, Rafał Kucharski

Main category: cs.MA

TL;DR: Introducing social awareness via marginal cost rewards in MARL for AV routing reduces training time and improves convergence, benefiting both system-wide and individual performance.

Details

Motivation: Selfish AV routing using MARL destabilizes traffic systems with long convergence times. Moving beyond selfish rewards can relieve this issue.

Method: Add intrinsic reward based on marginal cost matrix to quantify individual route-choice impact on total travel time, aligning agents’ objectives while preserving system equilibria.

Result: MARL algorithms with marginal cost rewards converge to optimal solution in both toy and real-world networks, while baseline algorithms fail.

Conclusion: Social awareness through marginal cost inclusion improves both system-wide and individual performance in future urban AV systems.

Abstract: Previous work has shown that when multiple selfish Autonomous Vehicles (AVs) are introduced to future cities and start learning optimal routing strategies using Multi-Agent Reinforcement Learning (MARL), they may destabilize traffic systems, as they would require a significant amount of time to converge to the optimal solution, equivalent to years of real-world commuting. We demonstrate that moving beyond the selfish component in the reward significantly relieves this issue. If each AV, apart from minimizing its own travel time, aims to reduce its impact on the system, this will be beneficial not only for the system-wide performance but also for each individual player in this routing game. By introducing an intrinsic reward signal based on the marginal cost matrix, we significantly reduce training time and achieve convergence more reliably. Marginal cost quantifies the impact of each individual action (route-choice) on the system (total travel time). Including it as one of the components of the reward can reduce the degree of non-stationarity by aligning agents’ objectives. Notably, the proposed counterfactual formulation preserves the system’s equilibria and avoids oscillations. Our experiments show that training MARL algorithms with our novel reward formulation enables the agents to converge to the optimal solution, whereas the baseline algorithms fail to do so. We show these effects in both a toy network and the real-world network of Saint-Arnoult. Our results optimistically indicate that social awareness (i.e., including marginal costs in routing decisions) improves both the system-wide and individual performance of future urban systems with AVs.

cs.MM

[536] Human-in-the-Loop Bandwidth Estimation for Quality of Experience Optimization in Real-Time Video Communication

Sami Khairy, Gabriel Mittag, Vishak Gopal, Ross Cutler

Main category: cs.MM

TL;DR: A data-driven framework using offline reinforcement learning for bandwidth estimation in video conferencing systems, trained on real-world Microsoft Teams data, which reduces poor call ratio by 11.41%.

Details

Motivation: Bandwidth estimation for real-time communications remains challenging due to evolving network architectures, complex protocol stacks, and difficulty defining reliable QoE metrics that improve user experience.

Method: Train objective QoE reward models from subjective user evaluations, collect 1M network traces with QoE rewards from real Teams calls, and use novel distributional offline RL algorithm to train neural-network bandwidth estimator.

Result: Real-world A/B testing shows 11.41% reduction in subjective poor call ratio compared to baseline bandwidth estimator. The offline RL algorithm also demonstrates generalization on D4RL benchmark tasks.

Conclusion: The proposed human-in-the-loop, data-driven framework effectively addresses bandwidth estimation challenges and improves video conferencing QoE through offline reinforcement learning.

Abstract: The quality of experience (QoE) delivered by video conferencing systems is significantly influenced by accurately estimating the time-varying available bandwidth between the sender and receiver. Bandwidth estimation for real-time communications remains an open challenge due to rapidly evolving network architectures, increasingly complex protocol stacks, and the difficulty of defining QoE metrics that reliably improve user experience. In this work, we propose a deployed, human-in-the-loop, data-driven framework for bandwidth estimation to address these challenges. Our approach begins with training objective QoE reward models derived from subjective user evaluations to measure audio and video quality in real-time video conferencing systems. Subsequently, we collect roughly $1$M network traces with objective QoE rewards from real-world Microsoft Teams calls to curate a bandwidth estimation training dataset. We then introduce a novel distributional offline reinforcement learning (RL) algorithm to train a neural-network-based bandwidth estimator aimed at improving QoE for users. Our real-world A/B test demonstrates that the proposed approach reduces the subjective poor call ratio by $11.41%$ compared to the baseline bandwidth estimator. Furthermore, the proposed offline RL algorithm is benchmarked on D4RL tasks to demonstrate its generalization beyond bandwidth estimation.

Xiangyu Li, Ran Su, Liangliang Liu

Main category: cs.MM

TL;DR: M3ST-DTI is a multi-task learning model for drug-target interaction prediction that integrates textual, structural, and functional features using multi-stage alignment and fusion techniques.

Details

Motivation: Existing DTI prediction methods fail to capture deep intra-modal feature interactions and effective cross-modal alignment, limiting predictive performance and generalization.

Method: Uses multi-task learning with three feature types (textual, structural, functional), self-attention mechanisms, hybrid pooling graph attention, MCA with Gram loss for early alignment, BCA for fine-grained interactions, and deep orthogonal fusion to reduce redundancy.

Result: Extensive evaluations show M3ST-DTI consistently outperforms state-of-the-art methods across diverse metrics on benchmark datasets.

Conclusion: The proposed multi-stage integration and alignment approach effectively improves DTI prediction performance by better capturing intra-modal and cross-modal feature interactions.

Abstract: Accurate prediction of drug-target interactions (DTI) is pivotal in drug discovery. However, existing approaches often fail to capture deep intra-modal feature interactions or achieve effective cross-modal alignment, limiting predictive performance and generalization. To address these challenges, we propose M3ST-DTI, a multi-task learning model that enables multi-stage integration and alignment of multi modal features for DTI prediction. M3ST-DTI incorporates three types of features-textual, structural, and functional and enhances intra-modal representations using self-attention mechanisms and a hybrid pooling graph attention module. For early-stage feature alignment and fusion, the model in tegrates MCA with Gram loss as a structural constraint. In the later stage, a BCA module captures fine-grained interactions between drugs and targets within each modality, while a deep orthogonal fusion module mitigates feature redundancy.Extensive evaluations on benchmark datasets demonstrate that M3ST-DTI consistently outperforms state-of-the art methods across diverse metrics

[538] Towards Robust and Realible Multimodal Misinformation Recognition with Incomplete Modality

Hengyang Zhou, Yiwei Wei, Jian Yang, Zhenyu Zhang

Main category: cs.MM

TL;DR: MMLNet is a robust multimodal fusion strategy for misinformation recognition that handles modality incompleteness through multi-expert collaboration, incomplete modality adapters, and contrastive learning.

Details

Motivation: Real-world multimedia news often loses information during dissemination, causing modality incompleteness that harms existing models' generalization and robustness.

Method: Three-step approach: (1) Multi-Expert Collaborative Reasoning to compensate missing modalities, (2) Incomplete Modality Adapters for feature distribution adaptation, (3) Modality Missing Learning with adaptive weighting and contrastive learning.

Result: Superior performance on three real-world benchmarks across two languages compared to state-of-the-art methods while maintaining simplicity.

Conclusion: MMLNet effectively handles incomplete modality scenarios in misinformation recognition, curbing the spread of malicious misinformation.

Abstract: Multimodal Misinformation Recognition has become an urgent task with the emergence of huge multimodal fake content on social media platforms. Previous studies mainly focus on complex feature extraction and fusion to learn discriminative information from multimodal content. However, in real-world applications, multimedia news may naturally lose some information during dissemination, resulting in modality incompleteness, which is detrimental to the generalization and robustness of existing models. To this end, we propose a novel generic and robust multimodal fusion strategy, termed Multi-expert Modality-incomplete Learning Network (MMLNet), which is simple yet effective. It consists of three key steps: (1) Multi-Expert Collaborative Reasoning to compensate for missing modalities by dynamically leveraging complementary information through multiple experts. (2) Incomplete Modality Adapters compensates for the missing information by leveraging the new feature distribution. (3) Modality Missing Learning leveraging an label-aware adaptive weighting strategy to learn a robust representation with contrastive learning. We evaluate MMLNet on three real-world benchmarks across two languages, demonstrating superior performance compared to state-of-the-art methods while maintaining relative simplicity. By ensuring the accuracy of misinformation recognition in incomplete modality scenarios caused by information propagation, MMLNet effectively curbs the spread of malicious misinformation. Code is publicly available at https://github.com/zhyhome/MMLNet.

eess.AS

[539] FakeMark: Deepfake Speech Attribution With Watermarked Artifacts

Wanying Ge, Xin Wang, Junichi Yamagishi

Main category: eess.AS

TL;DR: FakeMark is a novel watermarking framework that injects artifact-correlated watermarks tied to deepfake systems, enabling robust source attribution even when watermarks or artifacts are partially removed.

Details

Motivation: Existing deepfake speech attribution methods have limitations - classifiers struggle with domain-shifted samples, and conventional watermarking is vulnerable to distortions and removal attacks.

Method: Inject artifact-correlated watermarks associated with deepfake systems (not pre-assigned bitstrings), allowing detectors to leverage both watermarks and intrinsic artifacts for attribution.

Result: FakeMark improves generalization to cross-dataset samples and maintains high accuracy under various distortions where other methods fail.

Conclusion: The proposed framework provides robust deepfake speech attribution by synergistically combining watermarking and artifact analysis, overcoming limitations of existing approaches.

Abstract: Deepfake speech attribution remains challenging for existing solutions. Classifier-based solutions often fail to generalize to domain-shifted samples, and watermarking-based solutions are easily compromised by distortions like codec compression or malicious removal attacks. To address these issues, we propose FakeMark, a novel watermarking framework that injects artifact-correlated watermarks associated with deepfake systems rather than pre-assigned bitstring messages. This design allows a detector to attribute the source system by leveraging both injected watermark and intrinsic deepfake artifacts, remaining effective even if one of these cues is elusive or removed. Experimental results show that FakeMark improves generalization to cross-dataset samples where classifier-based solutions struggle and maintains high accuracy under various distortions where conventional watermarking-based solutions fail.

[540] DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

Yakun Song, Xiaobin Zhuang, Jiawei Chen, Zhikang Niu, Guanrou Yang, Chenpeng Du, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xie Chen

Main category: eess.AS

TL;DR: DISTAR is a zero-shot text-to-speech framework that combines autoregressive language modeling with masked diffusion in discrete RVQ code space, enabling robust, controllable synthesis without alignment or duration prediction.

Details

Motivation: Address brittleness and limited controllability in existing AR-diffusion hybrid TTS systems under distribution shift.

Method: Uses discrete RVQ code space with AR language model for block-level drafting and parallel masked-diffusion for infilling, enabling blockwise parallelism without exposure bias.

Result: Surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, speaker/style consistency while maintaining output diversity.

Conclusion: DISTAR provides high-quality audio with explicit control via classifier-free guidance, variable bit-rate, and controllable computation through RVQ layer pruning.

Abstract: Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.

[541] DeePAQ: A Perceptual Audio Quality Metric Based On Foundational Models and Weakly Supervised Learning

Guanxin Jiang, Andreas Brendel, Pablo M. Delgado, Jürgen Herre

Main category: eess.AS

TL;DR: DeePAQ is a deep learning-based perceptual audio quality metric that uses metric learning with the MERT music foundation model and weakly supervised labels to create an embedding space for evaluating audio quality across various distortions.

Details

Motivation: To develop a robust and versatile audio quality evaluation metric that can handle general audio beyond specific domains, leveraging recent advances in foundation models and weakly supervised learning approaches not yet explored in audio quality assessment.

Method: Uses metric learning with the MERT music foundation model guided by surrogate labels, fine-tuned with Low-Rank Adaptation (LoRA) to construct an embedding space that captures distortion intensity in general audio.

Result: DeePAQ surpasses existing state-of-the-art objective audio quality metrics in detecting coding artifacts and generalizes well to unseen distortions like source separation, demonstrating robustness and versatility.

Conclusion: The proposed DeePAQ method represents a novel approach in audio quality assessment by combining weakly supervised learning, metric learning, and foundation model fine-tuning, achieving superior performance and generalization capabilities compared to existing metrics.

Abstract: This paper presents the Deep learning-based Perceptual Audio Quality metric (DeePAQ) for evaluating general audio quality. Our approach leverages metric learning together with the music foundation model MERT, guided by surrogate labels, to construct an embedding space that captures distortion intensity in general audio. To the best of our knowledge, DeePAQ is the first in the general audio quality domain to leverage weakly supervised labels and metric learning for fine-tuning a music foundation model with Low-Rank Adaptation (LoRA), a direction not yet explored by other state-of-the-art methods. We benchmark the proposed model against state-of-the-art objective audio quality metrics across listening tests spanning audio coding and source separation. Results show that our method surpasses existing metrics in detecting coding artifacts and generalizes well to unseen distortions such as source separation, highlighting its robustness and versatility.

[542] A Phase Synthesizer for Decorrelation to Improve Acoustic Feedback Cancellation

Klaus Linhard, Philipp Bulling

Main category: eess.AS

TL;DR: A unified framework combining frequency shifting and phase modulation for acoustic feedback cancellation, implemented in a DFT filter bank with variable delay lines, showing improved system stability and speech quality.

Details

Motivation: To address undesired acoustic feedback in communication systems while preventing adaptive filters from suppressing desired signals by decorrelating loudspeaker and microphone signals.

Method: Combines frequency shifting and phase modulation in a phase synthesizer implemented in a DFT filter bank, extended with variable delay lines from vibrato/chorus effects, tested with adaptive frequency-domain Kalman filter.

Result: Demonstrated improvements in system stability and speech quality measured by PESQ in speech in-car communication applications.

Conclusion: The proposed phase synthesizer effectively decorrelates signals for acoustic feedback cancellation while maintaining desired signal quality.

Abstract: Undesired acoustic feedback is a known issue in communication systems, such as speech in-car communication, public address systems, or hearing aids. Without additional precautions, there is a high risk that the adaptive filter - intended to cancel the feedback path - also suppresses parts of the desired signal. One solution is to decorrelate the loudspeaker and microphone signals. In this work, we combine the two decorrelation approaches frequency shifting and phase modulation in a unified framework: a so-called \textit{phase synthesizer}, implemented in a discrete Fourier transform (DFT) filter bank. Furthermore, we extend the phase modulation technique using variable delay lines, as known from vibrato and chorus effects. We demonstrate the benefits of the proposed phase synthesizer using an example from speech in-car communication, employing an adaptive frequency-domain Kalman filter. Improvements in system stability, speech quality measured by perceptual evaluation of speech quality (PESQ) are presented.

[543] I-DCCRN-VAE: An Improved Deep Representation Learning Framework for Complex VAE-based Single-channel Speech Enhancement

Jiatong Li, Simon Doclo

Main category: eess.AS

TL;DR: Improved DCCRN-VAE speech enhancement system with three modifications: removing skip connections, using β-VAE for better regularization, and generating both speech/noise latent representations, achieving better generalization on mismatched datasets.

Details

Motivation: To improve the generalization ability of DCCRN-VAE speech enhancement system and simplify the training pipeline while maintaining performance.

Method: Three key modifications: 1) Remove skip connections in pretrained VAEs to encourage more informative latent representations; 2) Use β-VAE in pretraining for better balance between reconstruction and regularization; 3) NSVAE generates both speech and noise latent representations.

Result: Achieves comparable performance on matched DNS3 dataset but outperforms baselines on mismatched datasets (WSJ0-QUT, Voicebank-DEMEND), demonstrating improved generalization. Ablation study shows similar performance with classical fine-tuning instead of adversarial training.

Conclusion: The proposed modifications successfully improve generalization ability of DCCRN-VAE while simplifying the training pipeline, making it more practical for real-world applications with mismatched conditions.

Abstract: Recently, a complex variational autoencoder (VAE)-based single-channel speech enhancement system based on the DCCRN architecture has been proposed. In this system, a noise suppression VAE (NSVAE) learns to extract clean speech representations from noisy speech using pretrained clean speech and noise VAEs with skip connections. In this paper, we improve DCCRN-VAE by incorporating three key modifications: 1) removing the skip connections in the pretrained VAEs to encourage more informative speech and noise latent representations; 2) using $\beta$-VAE in pretraining to better balance reconstruction and latent space regularization; and 3) a NSVAE generating both speech and noise latent representations. Experiments show that the proposed system achieves comparable performance as the DCCRN and DCCRN-VAE baselines on the matched DNS3 dataset but outperforms the baselines on mismatched datasets (WSJ0-QUT, Voicebank-DEMEND), demonstrating improved generalization ability. In addition, an ablation study shows that a similar performance can be achieved with classical fine-tuning instead of adversarial training, resulting in a simpler training pipeline.

[544] Stimulus Modality Matters: Impact of Perceptual Evaluations from Different Modalities on Speech Emotion Recognition System Performance

Huang-Cheng Chou, Haibin Wu, Hung-yi Lee, Chi-Chun Lee

Main category: eess.AS

TL;DR: This paper compares the effectiveness of Speech Emotion Recognition (SER) systems trained with emotional labels collected from different modality stimuli (voice-only vs. multimodal), finding that voice-only labels perform better.

Details

Motivation: Different emotion databases use different annotation methods (voice-only vs. multimodal stimuli), creating uncertainty about which type of emotional labels are most effective for training SER systems.

Method: Comprehensive comparison of SER systems trained with labels from different modality stimuli, evaluation on various testing conditions, and introduction of an all-inclusive label combining all modalities.

Result: Training with labels elicited by voice-only stimuli yields better performance on test sets compared to other modalities.

Conclusion: Voice-only emotional labels are more effective for training SER systems than labels from multimodal stimuli or combined approaches.

Abstract: Speech Emotion Recognition (SER) systems rely on speech input and emotional labels annotated by humans. However, various emotion databases collect perceptional evaluations in different ways. For instance, the IEMOCAP dataset uses video clips with sounds for annotators to provide their emotional perceptions. However, the most significant English emotion dataset, the MSP-PODCAST, only provides speech for raters to choose the emotional ratings. Nevertheless, using speech as input is the standard approach to training SER systems. Therefore, the open question is the emotional labels elicited by which scenarios are the most effective for training SER systems. We comprehensively compare the effectiveness of SER systems trained with labels elicited by different modality stimuli and evaluate the SER systems on various testing conditions. Also, we introduce an all-inclusive label that combines all labels elicited by various modalities. We show that using labels elicited by voice-only stimuli for training yields better performance on the test set, whereas labels elicited by voice-only stimuli.

eess.IV

[545] Normalization-equivariant Diffusion Models: Learning Posterior Samplers From Noisy And Partial Measurements

Brett Levac, Jon Tamir, Marcelo Pereyra, Julian Tachella

Main category: eess.IV

TL;DR: First approach for training diffusion models using only noisy measurement data from a single operator, enabling image restoration without clean training data or multiple acquisition processes.

Details

Motivation: Existing diffusion models require clean training data, which is often unavailable in real-world scenarios where only noisy and incomplete measurements exist. Current methods either need mild noise levels, additional clean data, or multiple acquisition processes.

Method: Leverage scale equivariance property of diffusion models to develop denoising score-matching that generalizes to lower noise levels than training data. Combine with equivariant imaging framework to handle incomplete and noisy single-operator measurements.

Result: Successfully trained diffusion models for image restoration tasks (denoising, demosaicing, inpainting) using only noisy data. Method outperforms state-of-the-art approaches and works effectively with incomplete measurements.

Conclusion: The proposed approach enables practical training of diffusion models from noisy real-world data without requiring clean references or multiple acquisition processes, addressing fundamental limitations of existing supervised methods.

Abstract: Diffusion models (DMs) have rapidly emerged as a powerful framework for image generation and restoration. However, existing DMs are primarily trained in a supervised manner by using a large corpus of clean images. This reliance on clean data poses fundamental challenges in many real-world scenarios, where acquiring noise-free data is hard or infeasible, and only noisy and potentially incomplete measurements are available. While some methods can train DMs using noisy data, they are generally effective only when the amount of noise is very mild or when some additional noise-free data is available. In addition, existing methods for training DMs from incomplete measurements require access to multiple complementary acquisition processes, an assumption that poses a significant practical limitation. Here we introduce the first approach for learning DMs for image restoration using only noisy measurement data from a single operator. As a first key contribution, we show that DMs, and more broadly minimum mean squared error denoisers, exhibit a weak form of scale equivariance linking rescaling in signal amplitude to changes in noise intensity. We then leverage this theoretical insight to develop a denoising score-matching strategy that generalizes robustly to noise levels lower than those present in the training data, thereby enabling the learning of DMs from noisy measurements. To further address the challenges of incomplete and noisy data, we integrate our method with equivariant imaging, a complementary self-supervised learning framework that exploits the inherent invariants of imaging problems, to train DMs for image restoration from single-operator measurements that are both incomplete and noisy. We validate the effectiveness of our approach through extensive experiments on image denoising, demosaicing, and inpainting, along with comparisons with the state of the art.

[546] LiteVPNet: A Lightweight Network for Video Encoding Control in Quality-Critical Applications

Vibhoothi Vibhoothi, François Pitié, Anil Kokaram

Main category: eess.IV

TL;DR: LiteVPNet is a lightweight neural network that predicts Quantisation Parameters for NVENC AV1 encoders to achieve specified VMAF scores, improving quality control and energy efficiency in video streaming.

Details

Motivation: New video workflows in cinema production require precise quality control and energy efficiency, but existing transcoding approaches lack adequate quality control or have computational overhead.

Method: Uses a lightweight neural network with low-complexity features including bitstream characteristics, video complexity measures, and CLIP-based semantic embeddings to predict Quantisation Parameters.

Result: Achieves mean VMAF errors below 1.2 points across quality targets, with VMAF errors within 2 points for over 87% of test corpus (vs. 61% with state-of-the-art methods).

Conclusion: LiteVPNet enhances high-value content transport and streaming for more energy-efficient, high-quality media experiences.

Abstract: In the last decade, video workflows in the cinema production ecosystem have presented new use cases for video streaming technology. These new workflows, e.g. in On-set Virtual Production, present the challenge of requiring precise quality control and energy efficiency. Existing approaches to transcoding often fall short of these requirements, either due to a lack of quality control or computational overhead. To fill this gap, we present a lightweight neural network (LiteVPNet) for accurately predicting Quantisation Parameters for NVENC AV1 encoders that achieve a specified VMAF score. We use low-complexity features, including bitstream characteristics, video complexity measures, and CLIP-based semantic embeddings. Our results demonstrate that LiteVPNet achieves mean VMAF errors below 1.2 points across a wide range of quality targets. Notably, LiteVPNet achieves VMAF errors within 2 points for over 87% of our test corpus, c.f. approx 61% with state-of-the-art methods. LiteVPNet’s performance across various quality regions highlights its applicability for enhancing high-value content transport and streaming for more energy-efficient, high-quality media experiences.

[547] An Empirical Study of Reducing AV1 Decoder Complexity and Energy Consumption via Encoder Parameter Tuning

Vibhoothi Vibhoothi, Julien Zouein, Shanker Shreejith, Jean-Baptiste Kempf, Anil Kokaram

Main category: eess.IV

TL;DR: Analyzing AV1 encoder configurations to reduce decoding complexity and energy consumption on battery-constrained devices, showing specific settings can significantly lower decoding cycles with minimal quality loss.

Details

Motivation: High decoding complexity of advanced video codecs like AV1 hinders adoption on battery-constrained devices, requiring strategies to reduce energy consumption during video streaming.

Method: Systematically analyzed impact of disabling coding tools and adjusting parameters in libaom-av1 and SVT-AV1 encoders using system-level energy measurement tools (RAPL, Intel SoC Watch with VTune profiler) to quantify trade-offs between decoding complexity, energy, and compression efficiency.

Result: Specific encoder configurations substantially reduce decoding complexity: libaom-av1 disabling CDEF filter reduces decoding cycles by 10% on average; SVT-AV1 using fast-decode=2 preset achieves 24% reduction in decoding cycles, both with minimal perceptual quality degradation.

Conclusion: Content providers can use these encoder configuration strategies to lower the energy footprint of AV1 video streaming on battery-constrained devices.

Abstract: The widespread adoption of advanced video codecs such as AV1 is often hindered by their high decoding complexity, posing a challenge for battery-constrained devices. While encoders can be configured to produce bitstreams that are decoder-friendly, estimating the decoding complexity and energy overhead for a given video is non-trivial. In this study, we systematically analyse the impact of disabling various coding tools and adjusting coding parameters in two AV1 encoders, libaom-av1 and SVT-AV1. Using system-level energy measurement tools like RAPL (Running Average Power Limit), Intel SoC Watch (integrated with VTune profiler), we quantify the resulting trade-offs between decoding complexity, energy consumption, and compression efficiency for decoding a bitstream. Our results demonstrate that specific encoder configurations can substantially reduce decoding complexity with minimal perceptual quality degradation. For libaom-av1, disabling CDEF, an in-loop filter gives us a mean reduction in decoding cycles by 10%. For SVT-AV1, using the in-built, fast-decode=2 preset achieves a more substantial 24% reduction in decoding cycles. These findings provide strategies for content providers to lower the energy footprint of AV1 video streaming.

[548] MH-LVC: Multi-Hypothesis Temporal Prediction for Learned Conditional Residual Video Coding

Huu-Tai Phung, Zong-Lin Gao, Yi-Chen Yao, Kuan-Wei Ho, Yi-Hsin Chen, Yu-Hsiang Lin, Alessandro Gnutti, Wen-Hsiao Peng

Main category: eess.IV

TL;DR: MH-LVC is a multi-hypothesis temporal prediction method that uses long- and short-term reference frames in conditional residual video coding, reducing memory access while maintaining coding performance.

Details

Motivation: To address the challenge of excessive memory access in temporal context mining approaches that require storing large amounts of implicit contextual information from past decoded frames.

Method: Uses a multi-hypothesis temporal prediction scheme with long- and short-term reference frames, limiting reference frames to two at a time. Implements decoded frame buffer management to flexibly utilize long-term key frames and short-term reference frames, with adaptable temporal prediction structure.

Result: Outperforms VTM-17.0 under low-delay B configuration in PSNR-RGB across test datasets, performs comparably to state-of-the-art learned codecs like DCVC-FM while requiring less decoded frame buffer and similar decoding time.

Conclusion: MH-LVC effectively balances coding performance and memory efficiency by strategically managing reference frames, bringing traditional video codec flexibility to learned video codecs.

Abstract: This work, termed MH-LVC, presents a multi-hypothesis temporal prediction scheme that employs long- and short-term reference frames in a conditional residual video coding framework. Recent temporal context mining approaches to conditional video coding offer superior coding performance. However, the need to store and access a large amount of implicit contextual information extracted from past decoded frames in decoding a video frame poses a challenge due to excessive memory access. Our MH-LVC overcomes this issue by storing multiple long- and short-term reference frames but limiting the number of reference frames used at a time for temporal prediction to two. Our decoded frame buffer management allows the encoder to flexibly utilize the long-term key frames to mitigate temporal cascading errors and the short-term reference frames to minimize prediction errors. Moreover, our buffering scheme enables the temporal prediction structure to be adapted to individual input videos. While this flexibility is common in traditional video codecs, it has not been fully explored for learned video codecs. Extensive experiments show that the proposed method outperforms VTM-17.0 under the low-delay B configuration in terms of PSNR-RGB across commonly used test datasets, and performs comparably to the state-of-the-art learned codecs (e.g.~DCVC-FM) while requiring less decoded frame buffer and similar decoding time.

[549] A High-Level Feature Model to Predict the Encoding Energy of a Hardware Video Encoder

Diwakara Reddy, Christian Herglotz, André Kaup

Main category: eess.IV

TL;DR: A Gaussian process regression model predicts hardware video encoder energy consumption with ~9% error, showing spatial resolution is key for energy prediction.

Details

Motivation: Live streaming from battery-powered devices requires real-time video encoding, and predicting encoding energy helps optimize battery usage.

Method: High-level feature model using Gaussian process regression to predict hardware video encoder energy consumption, focusing on P-frames and single keyframes.

Result: Model achieves mean absolute percentage error of approximately 9% in predicting encoding energy, with spatial resolution identified as the most important feature.

Conclusion: The model enables practical energy estimation for video encoding at various resolutions, coding standards, and codec presets, useful for battery-constrained live streaming applications.

Abstract: In today’s society, live video streaming and user generated content streamed from battery powered devices are ubiquitous. Live streaming requires real-time video encoding, and hardware video encoders are well suited for such an encoding task. In this paper, we introduce a high-level feature model using Gaussian process regression that can predict the encoding energy of a hardware video encoder. In an evaluation setup restricted to only P-frames and a single keyframe, the model can predict the encoding energy with a mean absolute percentage error of approximately 9%. Further, we demonstrate with an ablation study that spatial resolution is a key high-level feature for encoding energy prediction of a hardware encoder. A practical application of our model is that it can be used to perform a prior estimation of the energy required to encode a video at various spatial resolutions, with different coding standards and codec presets.

[550] Logarithmic Mathematical Morphology: theory and applications

Guillaume Noyel

Main category: eess.IV

TL;DR: The paper introduces Logarithmic Mathematical Morphology (LMM), a new framework that makes mathematical morphology operators robust to lighting variations by using a logarithmic additive law instead of the usual sum.

Details

Motivation: Traditional mathematical morphology for grey-level images is not robust to lighting variations because the structuring function's amplitude doesn't adapt to image intensity changes. This limitation affects analysis in images with varying illumination.

Method: The authors define a new framework using an additive law from Logarithmic Image Processing that models lighting variations with physical causes like changes in light intensity. This allows the structuring function’s amplitude to vary according to the image amplitude.

Result: The proposed Logarithmic Mathematical Morphology (LMM) framework successfully enables the definition of mathematical morphology operators that are robust to lighting variations.

Conclusion: LMM provides an effective solution for handling lighting variations in mathematical morphology operations, making image analysis more reliable under varying illumination conditions.

Abstract: In Mathematical Morphology for grey-level functions, an image is analysed by another image named the structuring function. This structuring function is translated over the image domain and summed to the image. However, in an image presenting lighting variations, the amplitude of the structuring function should vary according to the image intensity. Such a property is not verified in Mathematical Morphology for grey level functions, when the structuring function is summed to the image with the usual additive law. In order to address this issue, a new framework is defined with an additive law for which the amplitude of the structuring function varies according to the image amplitude. This additive law is chosen within the Logarithmic Image Processing framework and models the lighting variations with a physical cause such as a change of light intensity. The new framework is named Logarithmic Mathematical Morphology (LMM) and allows the definition of operators which are robust to such lighting variations.

[551] BAAF: A benchmark attention adaptive framework for medical ultrasound image segmentation tasks

Gongping Chen, Lei Zhao, Xiaotao Yin, Liang Cui, Jianxun Zhang, Yu Dai, Ningning Liu

Main category: eess.IV

TL;DR: Proposes BAAF, a benchmark attention adaptive framework for medical ultrasound image segmentation that combines parallel hybrid attention and adaptive calibration to improve lesion localization accuracy.

Details

Motivation: Address the challenge of automatically and precisely localizing objects in ultrasound images due to severe interference from internal and external factors, aiming to assist doctors in faster and more accurate diagnosis.

Method: BAAF framework with parallel hybrid attention module (PHAM) for coarse calibration from channel and spatial dimensions, and adaptive calibration mechanism (ACM) for selecting robust lesion characterizations from calibrated features.

Result: Evaluated on four medical ultrasound segmentation tasks, showing remarkable performance improvement over state-of-the-art methods and superiority compared to existing attention mechanisms.

Conclusion: Provides potential for automated medical ultrasound assisted diagnosis and reduces reliance on human accuracy and precision.

Abstract: The AI-based assisted diagnosis programs have been widely investigated on medical ultrasound images. Complex scenario of ultrasound image, in which the coupled interference of internal and external factors is severe, brings a unique challenge for localize the object region automatically and precisely in ultrasound images. In this study, we seek to propose a more general and robust Benchmark Attention Adaptive Framework (BAAF) to assist doctors segment or diagnose lesions and tissues in ultrasound images more quickly and accurately. Different from existing attention schemes, the BAAF consists of a parallel hybrid attention module (PHAM) and an adaptive calibration mechanism (ACM). Specifically, BAAF first coarsely calibrates the input features from the channel and spatial dimensions, and then adaptively selects more robust lesion or tissue characterizations from the coarse-calibrated feature maps. The design of BAAF further optimizes the “what” and “where” focus and selection problems in CNNs and seeks to improve the segmentation accuracy of lesions or tissues in medical ultrasound images. The method is evaluated on four medical ultrasound segmentation tasks, and the adequate experimental results demonstrate the remarkable performance improvement over existing state-of-the-art methods. In addition, the comparison with existing attention mechanisms also demonstrates the superiority of BAAF. This work provides the possibility for automated medical ultrasound assisted diagnosis and reduces reliance on human accuracy and precision.

[552] Robust Real-Time Endoscopic Stereo Matching under Fuzzy Tissue Boundaries

Yang Ding, Can Han, Sijia Du, Yaqi Wang, Dahong Qian

Main category: eess.IV

TL;DR: RRESM is a real-time stereo matching method for endoscopic images that achieves state-of-the-art accuracy at 42 FPS by using 3D Mamba Coordinate Attention for cost aggregation and High-Frequency Disparity Optimization for boundary refinement.

Details

Motivation: Existing stereo matching methods designed for natural images struggle with endoscopic images due to fuzzy tissue boundaries and fail to meet real-time requirements for high-resolution inputs in robotic minimally invasive surgery.

Method: Proposes RRESM with two key modules: 3D Mamba Coordinate Attention for enhanced cost aggregation using position-sensitive attention and long-range spatial dependency modeling, and High-Frequency Disparity Optimization that refines disparity near tissue boundaries by amplifying high-frequency details in the wavelet domain.

Result: Achieves state-of-the-art matching accuracy on SCARED and SERV-CT datasets with real-time inference speed of 42 FPS.

Conclusion: RRESM effectively addresses the challenges of stereo matching in endoscopic images, providing accurate depth information in real-time for robotic minimally invasive surgery applications.

Abstract: Real-time acquisition of accurate scene depth is essential for automated robotic minimally invasive surgery. Stereo matching with binocular endoscopy can provide this depth information. However, existing stereo matching methods, designed primarily for natural images, often struggle with endoscopic images due to fuzzy tissue boundaries and typically fail to meet real-time requirements for high-resolution endoscopic image inputs. To address these challenges, we propose \textbf{RRESM}, a real-time stereo matching method tailored for endoscopic images. Our approach integrates a 3D Mamba Coordinate Attention module that enhances cost aggregation through position-sensitive attention maps and long-range spatial dependency modeling via the Mamba block, generating a robust cost volume without substantial computational overhead. Additionally, we introduce a High-Frequency Disparity Optimization module that refines disparity predictions near tissue boundaries by amplifying high-frequency details in the wavelet domain. Evaluations on the SCARED and SERV-CT datasets demonstrate state-of-the-art matching accuracy with a real-time inference speed of 42 FPS. The code is available at https://github.com/Sonne-Ding/RRESM.

[553] Anatomically and Metabolically Informed Diffusion for Unified Denoising and Segmentation in Low-Count PET Imaging

Menghua Xia, Kuan-Yin Ko, Der-Shiun Wang, Ming-Kai Chen, Qiong Liu, Huidong Xie, Liang Guo, Wei Ji, Jinsong Ouyang, Reimund Bayerlein, Benjamin A. Spencer, Quanzheng Li, Ramsey D. Badawi, Georges El Fakhri, Chi Liu

Main category: eess.IV

TL;DR: AMDiff is a unified framework that jointly performs PET image denoising and lesion/organ segmentation using diffusion models and nnMamba architecture, enabling direct clinical metric quantification from low-count PET inputs.

Details

Motivation: Existing methods treat PET image denoising and segmentation as independent tasks, overlooking their inherent synergies as correlated steps in the analysis pipeline.

Method: AMDiff integrates a semantic-informed denoiser based on diffusion strategy and a denoising-informed segmenter using nnMamba architecture, connected via a warming-up mechanism with lesion-organ-specific regularization and denoising revision modules.

Result: Experiments on multi-vendor, multi-center, and multi-noise-level datasets demonstrate superior performance of AMDiff.

Conclusion: AMDiff provides a unified framework that exploits mutual benefits between denoising and segmentation tasks for improved PET image analysis and direct clinical metric quantification.

Abstract: Positron emission tomography (PET) image denoising, along with lesion and organ segmentation, are critical steps in PET-aided diagnosis. However, existing methods typically treat these tasks independently, overlooking inherent synergies between them as correlated steps in the analysis pipeline. In this work, we present the anatomically and metabolically informed diffusion (AMDiff) model, a unified framework for denoising and lesion/organ segmentation in low-count PET imaging. By integrating multi-task functionality and exploiting the mutual benefits of these tasks, AMDiff enables direct quantification of clinical metrics, such as total lesion glycolysis (TLG), from low-count inputs. The AMDiff model incorporates a semantic-informed denoiser based on diffusion strategy and a denoising-informed segmenter utilizing nnMamba architecture. The segmenter constrains denoised outputs via a lesion-organ-specific regularizer, while the denoiser enhances the segmenter by providing enriched image information through a denoising revision module. These components are connected via a warming-up mechanism to optimize multi-task interactions. Experiments on multi-vendor, multi-center, and multi-noise-level datasets demonstrate the superior performance of AMDiff.

[554] Unsupervised patch-based dynamic MRI reconstruction using learnable tensor function with implicit neural representation

Yuanyuan Liu, Yuanbiao Yang, Jing Cheng, Zhuo-Xu Cui, Qingyong Zhu, Congcong Liu, Yuliang Zhu, Jingran Xu, Hairong Zheng, Dong Liang, Yanjie Zhu

Main category: eess.IV

TL;DR: TenF-INR integrates low-rank tensor modeling with implicit neural representations for unsupervised dynamic MRI reconstruction, achieving up to 21-fold acceleration while outperforming state-of-the-art methods.

Details

Motivation: Dynamic MRI suffers from limited spatiotemporal resolution due to long acquisition times. Supervised deep learning requires large fully sampled datasets that are difficult to obtain, while existing INR methods struggle with highly undersampled dynamic MRI due to inefficient representation capacity and high computational cost.

Method: Proposes TenF-INR framework that integrates low-rank tensor modeling with INR, modeling each factor matrix in tensor decomposition as a learnable factor function. Uses patch-based nonlocal tensor modeling to exploit temporal correlations and inter-patch similarities, reducing parameter space and computational burden.

Result: Experiments on dynamic cardiac and abdominal datasets show TenF-INR achieves up to 21-fold acceleration and outperforms both supervised and unsupervised state-of-the-art methods in image quality, temporal fidelity, and quantitative accuracy.

Conclusion: TenF-INR provides an effective unsupervised solution for dynamic MRI reconstruction that combines the benefits of tensor modeling and implicit neural representations, addressing limitations of existing methods while achieving superior performance.

Abstract: Dynamic MRI suffers from limited spatiotemporal resolution due to long acquisition times. Undersampling k-space accelerates imaging but makes accurate reconstruction challenging. Supervised deep learning methods achieve impressive results but rely on large fully sampled datasets, which are difficult to obtain. Recently, implicit neural representations (INR) have emerged as a powerful unsupervised paradigm that reconstructs images from a single undersampled dataset without external training data. However, existing INR-based methods still face challenges when applied to highly undersampled dynamic MRI, mainly due to their inefficient representation capacity and high computational cost. To address these issues, we propose TenF-INR, a novel unsupervised framework that integrates low-rank tensor modeling with INR, where each factor matrix in the tensor decomposition is modeled as a learnable factor function. Specifically,we employ INR to model learnable tensor functions within a low-rank decomposition, reducing the parameter space and computational burden. A patch-based nonlocal tensor modeling strategy further exploits temporal correlations and inter-patch similarities, enhancing the recovery of fine spatiotemporal details. Experiments on dynamic cardiac and abdominal datasets demonstrate that TenF-INR achieves up to 21-fold acceleration, outperforming both supervised and unsupervised state-of-the-art methods in image quality, temporal fidelity, and quantitative accuracy.

[555] Semi-Unsupervised Microscopy Segmentation with Fuzzy Logic and Spatial Statistics for Cross-Domain Analysis Using a GUI

Surajit Das, Pavel Zun

Main category: eess.IV

TL;DR: A low-cost, unsupervised segmentation method for unstained live cells that uses one-time calibration and fuzzy logic to handle low contrast and uneven illumination, outperforming SOTA methods without requiring labeled data.

Details

Motivation: Brightfield microscopy of unstained live cells faces challenges like low contrast, dynamic morphology, and uneven illumination. Existing deep learning methods need large labeled datasets, expensive hardware, and fail under uneven illumination.

Method: One-time calibration-assisted unsupervised framework using spatial standard deviation from local mean for background determination, fuzzy logic for uncertain pixels, cumulative squared shift of nodal intensity, statistical features, and post-segmentation denoising calibration saved as reusable profiles.

Result: Outperformed Cellpose 3.0 and StarDist on unstained brightfield myoblast images, improving IoU by up to 48% (average IoU = 0.43, F1 = 0.60). Achieved mean IoU of 0.69 and F1-score of 0.81 on LIVECell dataset with substantial expert agreement (κ > 0.75).

Conclusion: The framework provides a practical, annotation-free solution for live-cell imaging that operates efficiently on CPU, avoids cell staining, and demonstrates cross-modality and cross-domain robustness through the novel Homogeneous Image Plane concept.

Abstract: Brightfield microscopy of unstained live cells is challenging due to low contrast, dynamic morphology, uneven illumination, and lack of labels. Deep learning achieved SOTA performance on stained, high-contrast images but needs large labeled datasets, expensive hardware, and fails under uneven illumination. This study presents a low-cost, lightweight, annotation-free segmentation method by introducing one-time calibration-assisted unsupervised framework adaptable across imaging modalities and image type. The framework determines background via spatial standard deviation from the local mean. Uncertain pixels are resolved using fuzzy logic, cumulative squared shift of nodal intensity, statistical features, followed by post-segmentation denoising calibration which is saved as a profile for reuse until noise pattern or object type substantially change. The program runs as a script or graphical interface for non-programmers. The method was rigorously evaluated using \textit{IoU}, \textit{F1-score}, and other metrics, with statistical significance confirmed via Wilcoxon signed-rank tests. On unstained brightfield myoblast (C2C12) images, it outperformed \textit{Cellpose 3.0} and \textit{StarDist}, improving IoU by up to 48% (average IoU = 0.43, F1 = 0.60). In phase-contrast microscopy, it achieved a mean IoU of 0.69 and an F1-score of 0.81 on the \textit{LIVECell} dataset ($n = 3178$), with substantial expert agreement ($\kappa > 0.75$) confirming cross-modality robustness. Successful segmentation of laser-affected polymer surfaces further confirmed cross-domain robustness. By introducing the \textit{Homogeneous Image Plane} concept, this work provides a new theoretical foundation for training-free, annotation-free segmentation. The framework operates efficiently on CPU, avoids cell staining, and is practical for live-cell imaging and biomedical applications.

[556] Efficient Fine-Tuning of DINOv3 Pretrained on Natural Images for Atypical Mitotic Figure Classification (MIDOG 2025 Task 2 Winner)

Guillaume Balezo, Hana Feki, Raphaël Bourgade, Lily Monnier, Matthieu Blons, Alice Blondel, Etienne Decencière, Albert Pla Planas, Thomas Walter

Main category: eess.IV

TL;DR: Fine-tuned DINOv3-H+ vision transformer using LoRA achieved state-of-the-art results for atypical mitotic figure classification in the MIDOG 2025 challenge, winning first place despite domain gap from natural images.

Details

Motivation: Atypical mitotic figures (AMFs) are important biomarkers for poor prognosis but are difficult to detect due to low prevalence, subtle morphology, and inter-observer variability.

Method: Fine-tuned DINOv3-H+ vision transformer pretrained on natural images using low-rank adaptation (LoRA), training only ~1.3M parameters with extensive augmentation and domain-weighted Focal Loss to handle domain heterogeneity.

Result: Achieved first place on the final test set of MIDOG 2025 challenge, demonstrating effective transfer learning from natural images to histopathology despite the domain gap.

Conclusion: The work highlights the advantages of DINOv3 pretraining and the efficiency and robustness of the fine-tuning strategy using LoRA, yielding state-of-the-art results for atypical mitosis classification.

Abstract: Atypical mitotic figures (AMFs) represent abnormal cell division associated with poor prognosis. Yet their detection remains difficult due to low prevalence, subtle morphology, and inter-observer variability. The MIDOG 2025 challenge introduces a benchmark for AMF classification across multiple domains. In this work, we fine-tuned the recently published DINOv3-H+ vision transformer, pretrained on natural images, using low-rank adaptation (LoRA), training only ~1.3M parameters in combination with extensive augmentation and a domain-weighted Focal Loss to handle domain heterogeneity. Despite the domain gap, our fine-tuned DINOv3 transfers effectively to histopathology, reaching first place on the final test set. These results highlight the advantages of DINOv3 pretraining and underline the efficiency and robustness of our fine-tuning strategy, yielding state-of-the-art results for the atypical mitosis classification challenge in MIDOG 2025.

[557] Evaluating the Impact of Radiographic Noise on Chest X-ray Semantic Segmentation and Disease Classification Using a Scalable Noise Injection Framework

Derek Jiu, Kiran Nijjer, Nishant Chinta, Ryan Bui, Kevin Zhu

Main category: eess.IV

TL;DR: Deep learning models for chest X-ray analysis show different robustness patterns: semantic segmentation is highly vulnerable to noise (especially electronic noise), while classification is more resilient but with task-specific vulnerabilities to different noise types.

Details

Motivation: To systematically understand how different types of clinical imaging noise (quantum/Poisson and electronic/Gaussian) impact deep learning models across different radiographic analysis tasks.

Method: Used a scalable noise injection framework to apply controlled, clinically-motivated noise severities to state-of-the-art CNNs (UNet, DeepLabV3, FPN for segmentation; ResNet, DenseNet, EfficientNet for classification) on public chest X-ray datasets (Landmark, ChestX-ray14).

Result: Semantic segmentation models collapsed under severe electronic noise (Dice Similarity Coefficient drop of 0.843), while classification showed overall resilience but with differential vulnerabilities - some tasks failed catastrophically under quantum noise (AUROC drop of 0.355) while others were more susceptible to electronic noise.

Conclusion: Classification models have inherent robustness but segmentation tasks are brittle; the task- and noise-specific nature of model failure highlights the need for targeted validation and mitigation strategies before safe clinical deployment of diagnostic AI.

Abstract: Deep learning models are increasingly used for radiographic analysis, but their reliability is challenged by the stochastic noise inherent in clinical imaging. A systematic, cross-task understanding of how different noise types impact these models is lacking. Here, we evaluate the robustness of state-of-the-art convolutional neural networks (CNNs) to simulated quantum (Poisson) and electronic (Gaussian) noise in two key chest X-ray tasks: semantic segmentation and pulmonary disease classification. Using a novel, scalable noise injection framework, we applied controlled, clinically-motivated noise severities to common architectures (UNet, DeepLabV3, FPN; ResNet, DenseNet, EfficientNet) on public datasets (Landmark, ChestX-ray14). Our results reveal a stark dichotomy in task robustness. Semantic segmentation models proved highly vulnerable, with lung segmentation performance collapsing under severe electronic noise (Dice Similarity Coefficient drop of 0.843), signifying a near-total model failure. In contrast, classification tasks demonstrated greater overall resilience, but this robustness was not uniform. We discovered a differential vulnerability: certain tasks, such as distinguishing Pneumothorax from Atelectasis, failed catastrophically under quantum noise (AUROC drop of 0.355), while others were more susceptible to electronic noise. These findings demonstrate that while classification models possess a degree of inherent robustness, pixel-level segmentation tasks are far more brittle. The task- and noise-specific nature of model failure underscores the critical need for targeted validation and mitigation strategies before the safe clinical deployment of diagnostic AI.

Today’s Research Highlights

Table of Contents

cs.CL

[1] PHANTOM RECALL: When Familiar Puzzles Fool Smart Models

[2] R-WoM: Retrieval-augmented World Model For Computer-use Agents

[3] LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance

[4] Omni-Captioner: Data Pipeline, Models, and Benchmark for Omni Detailed Perception

[5] LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

[6] Discrepancy Detection at the Data Level: Toward Consistent Multilingual Question Answering

[7] TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

[8] Not in Sync: Unveiling Temporal Bias in Audio Chat Models

[9] GRAVITY: A Framework for Personalized Text Generation via Profile-Grounded Synthetic Preferences

[10] Evaluating Retrieval-Augmented Generation Systems on Unanswerable, Uncheatable, Realistic, Multi-hop Queries

[11] Direct Multi-Token Decoding

[12] Scaling Long-Horizon LLM Agent via Context-Folding

[13] Conjecturing: An Overlooked Step in Formal Mathematical Reasoning

[14] SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation

[15] Generate Logical Equivalence Questions

[16] Information Extraction from Conversation Transcripts: Neuro-Symbolic vs. LLM

[17] CPR: Mitigating Large Language Model Hallucinations with Curative Prompt Refinement

[18] Multi-stage Prompt Refinement for Mitigating Hallucinations in Large Language Models

[19] On the Interplay between Human Label Variation and Model Fairness

[20] Uncertainty Quantification for Hallucination Detection in Large Language Models: Foundations, Methodology, and Future Directions

[21] Improving Text-to-Image Generation with Input-Side Inference-Time Scaling

[22] Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models

[23] APCE: Adaptive Progressive Context Expansion for Long Context Processing

[24] An AI-Based Behavioral Health Safety Filter and Dataset for Identifying Mental Health Crises in Text-Based Conversations

[25] Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

[26] Tracing Multilingual Knowledge Acquisition Dynamics in Domain Adaptation: A Case Study of English-Japanese Biomedical Adaptation

[27] Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

[28] SafeMT: Multi-turn Safety for Multimodal Language Models

[29] Credal Transformer: A Principled Approach for Quantifying and Mitigating Hallucinations in Large Language Models

[30] A Survey on Parallel Reasoning

[31] Towards Inference-time Scaling for Continuous Space Reasoning

[32] From Knowledge to Treatment: Large Language Model Assisted Biomedical Concept Representation for Drug Repurposing

[33] DPO-Tuned Large Language Models for Segmentation in Simultaneous Speech Translation

[34] HALF: Harm-Aware LLM Fairness Evaluation Aligned with Deployment

[35] Analysing Moral Bias in Finetuned LLMs through Mechanistic Interpretability

[36] DSAS: A Universal Plug-and-Play Framework for Attention Optimization in Multi-Document Question Answering

[37] Shallow Robustness, Deep Vulnerabilities: Multi-Turn Evaluation of Medical LLMs

[38] Chinese ModernBERT with Whole-Word Masking

[39] A large-scale, unsupervised pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

[40] Beating Harmful Stereotypes Through Facts: RAG-based Counter-speech Generation

[41] Fine-grained Analysis of Brain-LLM Alignment through Input Attribution

[42] MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts

[43] LLM-REVal: Can We Trust LLM Reviewers Yet?

[44] Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

[45] PRoH: Dynamic Planning and Reasoning over Knowledge Hypergraphs for Retrieval-Augmented Generation

[46] Probing Latent Knowledge Conflict for Faithful Retrieval-Augmented Generation

[47] Resource-sensitive but language-blind: Community size and not grammatical complexity better predicts the accuracy of Large Language Models in a novel Wug Test

[48] SMEC: Rethinking Matryoshka Representation Learning for Retrieval Embedding Compression

[49] When Personalization Tricks Detectors: The Feature-Inversion Trap in Machine-Generated Text Detection

[50] BoN Appetit Team at LeWiDi-2025: Best-of-N Test-time Scaling Can Not Stomach Annotation Disagreements (Yet)

[51] VISaGE: Understanding Visual Generics and Exceptions

[52] Teaching Language Models to Faithfully Express their Uncertainty

[53] StyleDecipher: Robust and Explainable Detection of LLM-Generated Texts with Stylistic Analysis

[54] ACADATA: Parallel Dataset of Academic Data for Machine Translation

[55] COSTAR-A: A prompting framework for enhancing Large Language Model performance on Point-of-View questions

[56] Reasoning Pattern Matters: Learning to Reason without Human Rationales

[57] Generation Space Size: Understanding and Calibrating Open-Endedness of LLM Generations

[58] Which Word Orders Facilitate Length Generalization in LMs? An Investigation with GCG-Based Artificial Languages

[59] Hey, wait a minute: on at-issue sensitivity in Language Models

[60] Language Models Model Language

[61] Dr.LLM: Dynamic Layer Routing in LLMs

[62] Cost Analysis of Human-corrected Transcription for Predominately Oral Languages

[63] MLRIP: Pre-training a military language representation model with informative factual knowledge and professional knowledge base

[64] GRDD: A Dataset for Greek Dialectal NLP

[65] When “Competency” in Reasoning Opens the Door to Vulnerability: Jailbreaking LLMs via Novel Complex Ciphers

[66] Hybrid Multi-stage Decoding for Few-shot NER with Entity-aware Contrastive Learning

[67] Cross-Modal Safety Alignment: Is textual unlearning all you need?

[68] The Open Source Advantage in Large Language Models (LLMs)

[69] AFRIDOC-MT: Document-level MT Corpus for African Languages

[70] ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference

[71] From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models

[72] A Survey of Multilingual Reasoning in Language Models

[73] Reasoning on a Spectrum: Aligning LLMs to System 1 and System 2 Thinking

[74] Persuasion at Play: Understanding Misinformation Dynamics in Demographic-Aware Human-LLM Interactions

[75] EmoDebt: Bayesian-Optimized Emotional Intelligence for Strategic Agent-to-Agent Debt Recovery

[76] Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities

[77] AgentAda: Skill-Adaptive Data Analytics for Tailored Insight Discovery