Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 121]
- cs.CV [Total: 139]
- cs.AI [Total: 72]
- cs.SD [Total: 14]
- cs.LG [Total: 214]
- cs.MA [Total: 4]
- cs.MM [Total: 2]
- eess.AS [Total: 9]
- eess.IV [Total: 6]
cs.CL
[1] Evaluating Long-Term Memory for Long-Context Question Answering
Alessandra Terranova, Björn Ross, Alexandra Birch
Main category: cs.CL
TL;DR: Systematic evaluation of memory-augmented methods for LLMs using the LoCoMo benchmark shows memory approaches reduce token usage by over 90% while maintaining accuracy, with optimal memory architecture complexity scaling with model capability.
Details
Motivation: Large language models need memory to achieve conversational continuity and experiential learning, but it's unclear which memory types are most effective for long-context conversational tasks.
Method: Evaluated full-context prompting, semantic memory (RAG and agentic memory), episodic memory (in-context learning), and procedural memory (prompt optimization) using LoCoMo benchmark of synthetic long-context dialogues with QA tasks requiring diverse reasoning.
Result: Memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Small foundation models benefit most from RAG, while strong instruction-tuned reasoning models gain from episodic learning and complex agentic semantic memory.
Conclusion: Memory architecture complexity should scale with model capability. Episodic memory can help LLMs recognize the limits of their own knowledge.
Abstract: In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning models gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
[2] BitSkip: An Empirical Analysis of Quantization and Early Exit Composition
Ramshankar Bhuvaneswaran, Handan Liu
Main category: cs.CL
TL;DR: BitSkip framework shows that simple 8-bit quantization without Hadamard transform (BitSkip-V1) outperforms more complex 4-bit and Hadamard-enhanced models, matching full-precision baseline quality while offering better early-exit capabilities.
Details
Motivation: To understand the compositional effects of complex LLM efficiency techniques like extreme quantization and dynamic routing, which are individually well-documented but poorly understood when combined.
Method: Introduced BitSkip, a hybrid architectural framework that systematically explores interactions between quantization and Hadamard transforms, testing various configurations including 8-bit and 4-bit quantization with and without Hadamard transforms.
Result: BitSkip-V1 (8-bit without Hadamard) achieved perplexity of 1.13 vs 1.19 for full-precision baseline, outperforming 4-bit and Hadamard-enhanced versions. Hadamard transforms degraded performance by over 37,000% due to training instability. Layer 18 provided optimal 32.5% speed gain with only 4% quality loss.
Conclusion: Simple 8-bit quantization without complex transforms can achieve near-full-precision quality while enabling efficient early-exit strategies, challenging the assumption that more complex techniques necessarily yield better results.
Abstract: The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes with the full-precision baseline in quality (perplexity of 1.13 vs 1.19). The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, tracing to fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing an optimal 32.5% speed gain for a minimal 4% quality loss.
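To make the early-exit idea concrete, here is a minimal sketch (not the authors' implementation): exit at the first layer whose intermediate prediction is confident enough, using entropy as the confidence signal. The entropy threshold, layer count, and vocabulary size below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def choose_exit_layer(per_layer_logits, entropy_threshold=1.0):
    """Return the index of the first layer whose prediction entropy falls
    below the threshold; fall back to the final layer otherwise.

    per_layer_logits: list of tensors of shape (vocab_size,), one per layer.
    """
    for i, logits in enumerate(per_layer_logits):
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-9)).sum()
        if entropy.item() < entropy_threshold:
            return i  # confident enough: exit early at this layer
    return len(per_layer_logits) - 1  # no early exit

# Toy usage: 24 layers, vocabulary of 32 tokens; later layers produce more peaked logits.
layers = [torch.randn(32) * (0.2 * i) for i in range(24)]
print(choose_exit_layer(layers))
```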
[3] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs’ Cultural Processing of Figurative Language
Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab
Main category: cs.CL
TL;DR: LLMs struggle with culturally grounded figurative language, showing performance gaps between English and Arabic, with particular difficulty in pragmatic use and connotation interpretation.
Details
Motivation: To evaluate LLMs' ability to process culturally nuanced figurative expressions that encode local knowledge and cultural context.
Method: Designed evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation using Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Evaluated 22 open- and closed-source LLMs.
Result: Consistent performance hierarchy: Arabic proverbs 4.29% lower than English, Egyptian idioms 10.28% lower than Arabic proverbs. Pragmatic use accuracy drops 14.07% vs understanding. Context improves accuracy by 10.66%. Models struggle with connotative meaning (max 85.58% human agreement).
Conclusion: Figurative language serves as an effective diagnostic for cultural reasoning: LLMs can interpret meaning but struggle with appropriate use. The Kinayat dataset is released for future research.
Abstract: We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
[4] How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse
Saki Imai, Lee Kezar, Laurel Aichler, Mert Inan, Erin Walker, Alicia Wooten, Lorna Quandt, Malihe Alikhani
Main category: cs.CL
TL;DR: This paper analyzes how sign language articulation changes in natural dialogue vs isolated contexts, focusing on ASL STEM terms, showing dialogue signs are significantly shorter and exhibit entrainment patterns not seen in monologues.
Details
Motivation: Most sign language models are trained on interpreter or isolated vocabulary data, which doesn't capture the natural variability and adaptation that occurs in real dialogue, particularly in educational settings where novel vocabulary is used.
Method: Collected motion capture dataset of ASL STEM dialogue, compared dyadic interactive signing with solo lectures and interpreted articles using continuous kinematic features to analyze spatiotemporal changes and entrainment patterns.
Result: Dialogue signs are 24.6%-44.6% shorter in duration than isolated signs, with significant reductions absent in monologue contexts. The study also evaluated sign embedding models’ ability to recognize STEM signs and measure participant entrainment over time.
Conclusion: The study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies, highlighting the importance of considering natural dialogue contexts in sign language research.
Abstract: Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts and interlocutors through spatiotemporal changes and articulation style. This specifically manifests itself in educational settings, where novel vocabularies are used by teachers and students. To address this gap, we collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that enables quantitative comparison between dyadic interactive signing, solo signed lecture, and interpreted articles. Using continuous kinematic features, we disentangle dialogue-specific entrainment from individual effort reduction and show spatiotemporal changes across repeated mentions of STEM terms. On average, dialogue signs are 24.6%-44.6% shorter in duration than the isolated signs, and show significant reductions absent in monologue contexts. Finally, we evaluate sign embedding models on their ability to recognize STEM signs and approximate how entrained the participants become over time. Our study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.
[5] CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
Grace Byun, Rebecca Lipschutz, Sean T. Minton, Abigail Lott, Jinho D. Choi
Main category: cs.CL
TL;DR: CRADLE BENCH is a new benchmark for detecting multiple types of mental health crises in language model interactions, featuring clinician annotations and temporal labels.
Details
Motivation: Current language models lack reliable detection of critical mental health crisis situations like suicide ideation, domestic violence, and sexual harassment, which can have serious consequences if missed during user interactions.
Method: Created a benchmark with 600 clinician-annotated evaluation examples and 420 development examples, plus ~4K training examples automatically labeled using majority-vote ensemble of multiple language models. Fine-tuned six crisis detection models on subsets with different agreement criteria.
Result: The majority-vote ensemble approach for automatic labeling significantly outperforms single-model annotation. The benchmark covers seven crisis types aligned with clinical standards and is the first to incorporate temporal labels.
Conclusion: CRADLE BENCH provides a comprehensive framework for evaluating and improving crisis detection capabilities in language models, addressing a critical gap in mental health safety during AI interactions.
Abstract: Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user–model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.
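The majority-vote ensemble labeling step can be illustrated with a small sketch (the label names and three-model setup are hypothetical, not the paper's exact configuration): each training example receives a label from several annotator LLMs, and a label is kept only when a majority, or optionally all, of them agree.

```python
from collections import Counter

def ensemble_label(votes, require_unanimous=False):
    """Aggregate crisis labels from multiple annotator models.

    votes: list of label strings, one per model.
    Returns the majority label, or None if no majority (or no unanimity).
    """
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    if require_unanimous:
        return label if count == len(votes) else None
    return label if count > len(votes) / 2 else None

# Hypothetical votes from three annotator LLMs for one example.
print(ensemble_label(["domestic_violence", "domestic_violence", "none"]))  # majority label kept
print(ensemble_label(["none", "child_abuse", "sexual_harassment"]))        # no majority -> None
```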
[6] Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception
Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Kazem Faghih, Parsa Hosseini, Wenxiao Wang, Soheil Feizi
Main category: cs.CL
TL;DR: LLM agents suffer from temporal blindness in multi-turn conversations, failing to account for real-world time between messages. The TicToc-v1 benchmark shows current models perform poorly on time-sensitive tool-calling decisions, with only modest improvements from timestamp augmentation.
Details
Motivation: LLM agents lack temporal awareness, causing them to either over-rely on outdated context or unnecessarily repeat tool calls in time-sensitive scenarios, which is a critical limitation for real-world applications.
Method: Created TicToc-v1 benchmark with 34 time-sensitive scenarios, augmented dialogue with timestamps, collected human preferences for tool-calling decisions, and evaluated LLM alignment with human temporal perception.
Result: Without time information, models perform only slightly better than random, with the top alignment rate just over 60%. Adding timestamps provides a modest improvement, peaking around 65%. Prompt-based alignment has limited effectiveness.
Conclusion: Specific post-training alignment is needed to make LLM tool use align with human temporal perception, as current methods provide only marginal improvements for time-sensitive decision making.
Abstract: Large language model agents are increasingly used in multi-turn conversational settings to interact with and execute tasks in dynamic environments. However, a key limitation is their temporal blindness: they, by default, operate with a stationary context, failing to account for the real-world time elapsed between messages. This becomes a critical liability when an agent must decide whether to invoke a tool based on how much time has passed since the last observation. Without temporal awareness, agents often either over-rely on previous context (skipping necessary tool calls), or under-rely on it (unnecessarily repeating tool calls). To study this challenge, we introduce TicToc-v1, a test set of multi-turn user-agent trajectories across 34 scenarios with varying time sensitivity. Each trajectory ends with a user question, where the need for a tool call depends on the amount of time elapsed since the last message. To give LLMs temporal context, we augment dialogue messages with explicit timestamps, bridging the gap between static dialogue and evolving environments. We then collected human preferences for these samples, creating two subsets: one where humans preferred relying on the previous observation (prefer-noTool), and another where they preferred a new tool call (prefer-Tool). We evaluated how well LLM tool-calling decisions align with human preferences under varying time intervals on TicToc-v1. Our analysis shows that without time information, most models perform only slightly better than random, with the top alignment rate being just over 60%. While adding timestamps leads to a slight improvement, particularly for larger models, the improvement is modest, peaking at around 65%. We also show that naive, prompt-based alignment has limited effectiveness. Our findings highlight the need for specific post-training alignment to align multi-turn LLM tool use with human temporal perception.
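A minimal sketch of the timestamp-augmentation idea follows (field names and the rendering format are assumptions, not the benchmark's actual schema): each dialogue message is prefixed with an explicit wall-clock timestamp so that the time elapsed between turns is visible to the model as plain text.

```python
from datetime import datetime, timedelta

def render_dialogue_with_timestamps(messages):
    """Prefix each message with an explicit timestamp string so that elapsed
    time between turns is visible to the model in plain text."""
    lines = []
    for msg in messages:
        stamp = msg["time"].strftime("%Y-%m-%d %H:%M")
        lines.append(f"[{stamp}] {msg['role']}: {msg['content']}")
    return "\n".join(lines)

t0 = datetime(2025, 1, 6, 9, 0)
dialogue = [
    {"role": "user", "time": t0, "content": "What's the weather in Boston?"},
    {"role": "assistant", "time": t0 + timedelta(minutes=1), "content": "It is 3°C and cloudy."},
    # Two days later the cached observation is stale; a fresh tool call is warranted.
    {"role": "user", "time": t0 + timedelta(days=2), "content": "Is it still that cold?"},
]
print(render_dialogue_with_timestamps(dialogue))
```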
[7] Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs
Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, Dan Roth
Main category: cs.CL
TL;DR: Combo-Eval is a novel evaluation method for LLM-generated natural language representations of database results that combines multiple existing methods to optimize fidelity and reduce LLM calls by 25-61%.
Details
Motivation: Current Text-to-SQL systems use LLMs to convert tabular database results into natural language, but information loss and errors in these representations remain largely unexplored and unevaluated.
Method: Proposes Combo-Eval, a combined evaluation method that integrates benefits of multiple existing evaluation approaches, and introduces NLR-BIRD dataset for benchmarking.
Result: Combo-Eval achieves significant reduction in LLM calls (25-61%) and demonstrates superior alignment with human judgments across scenarios with and without ground truth references.
Conclusion: Combo-Eval provides an effective evaluation framework for LLM-generated natural language representations of database results, addressing current evaluation gaps in Text-to-SQL systems.
Abstract: In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored. This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
[8] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning
Marianne Menglin Liu, Sai Ashish Somayajula, Syed Fahad Allam Shah, Sujith Ravi, Dan Roth
Main category: cs.CL
TL;DR: OraPlan-SQL is a bilingual NL2SQL system that won the Archer Challenge 2025, using a two-agent framework with feedback-guided meta-prompting and plan diversification to achieve state-of-the-art performance.
Details
Motivation: To address complex reasoning in NL2SQL tasks requiring arithmetic, commonsense, and hypothetical inference, while overcoming limitations of multi-agent approaches that suffer from orchestration overhead.
Method: Two-agent framework: Planner generates stepwise natural language plans, SQL agent converts plans to SQL. Uses feedback-guided meta-prompting with corrective guidelines from failure analysis, entity-linking for multilingual support, and plan diversification with majority voting.
Result: Ranked first with 55.0% EX in English and 56.7% in Chinese, exceeding second-best by over 6%, while maintaining over 99% SQL validity.
Conclusion: The feedback-guided meta-prompting strategy with plan diversification effectively improves generalization and reliability for complex bilingual NL2SQL tasks without adding system complexity.
Abstract: We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: a Planner agent that generates stepwise natural language plans, and a SQL agent that converts these plans into executable SQL. Since the SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner's system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, with the SQL agent producing a query for each plan, and final output selected via majority voting over their executions.
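The plan-diversification step can be sketched as below; the planner, SQL agent, and executor are placeholder callables rather than the authors' implementation. Several plans are sampled, each is compiled to SQL and executed, and the candidate whose execution result is most common is returned.

```python
from collections import Counter

def select_by_majority_vote(question, n_plans, generate_plan, plan_to_sql, execute_sql):
    """Sample several plans, execute the SQL for each, and return the SQL whose
    execution result is most common across candidates."""
    candidates = []
    for _ in range(n_plans):
        plan = generate_plan(question)   # Planner agent (LLM call)
        sql = plan_to_sql(plan)          # SQL agent (LLM call)
        result = execute_sql(sql)        # run against the database
        candidates.append((sql, result))
    winning_result, _ = Counter(r for _, r in candidates).most_common(1)[0]
    for sql, result in candidates:
        if result == winning_result:
            return sql

# Toy stand-ins so the sketch runs end to end.
plans = iter(["plan A", "plan B", "plan A"])
results = {"SQL for plan A": ("42",), "SQL for plan B": ("41",)}
sql = select_by_majority_vote(
    "How many singers are older than 40?",
    n_plans=3,
    generate_plan=lambda q: next(plans),
    plan_to_sql=lambda p: f"SQL for {p}",
    execute_sql=lambda s: results[s],
)
print(sql)  # -> "SQL for plan A" (two of three executions agree)
```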
[9] Language Models for Longitudinal Clinical Prediction
Tananun Songdechakraiwut, Michael Lutz
Main category: cs.CL
TL;DR: A lightweight framework adapts frozen LLMs to analyze longitudinal clinical data for Alzheimer’s monitoring without fine-tuning, achieving accurate forecasts with minimal training data.
Details
Motivation: To enable accurate analysis of longitudinal clinical data using large language models without the need for computationally expensive fine-tuning, particularly for early-stage Alzheimer's disease monitoring.
Method: Integrates patient history and context within the language model space to generate forecasts, using a lightweight framework that works with frozen (unmodified) LLMs.
Result: Achieves accurate and reliable performance in neuropsychological assessments even with minimal training data.
Conclusion: Shows promise for early-stage Alzheimer’s monitoring through efficient adaptation of frozen LLMs to clinical data analysis.
Abstract: We explore a lightweight framework that adapts frozen large language models to analyze longitudinal clinical data. The approach integrates patient history and context within the language model space to generate accurate forecasts without model fine-tuning. Applied to neuropsychological assessments, it achieves accurate and reliable performance even with minimal training data, showing promise for early-stage Alzheimer’s monitoring.
[10] Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content
Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir
Main category: cs.CL
TL;DR: Evaluation of GPT-4o, Ansari AI, and Fanar for Islamic guidance shows GPT-4o performs best in accuracy and citations, but all models fall short of reliable Islamic content production, highlighting need for community-driven benchmarks.
Details
Motivation: Large language models are increasingly used for Islamic guidance but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses, creating need for systematic evaluation.
Method: Dual-agent framework with quantitative agent for citation verification and six-dimensional scoring (Structure, Islamic Consistency, Citations) and qualitative agent for five-dimensional side-by-side comparison (Tone, Depth, Originality).
Result: GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). GPT-4o had highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200).
Conclusion: Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations, underscoring need for community-driven benchmarks centering Muslim perspectives for reliable AI in Islamic knowledge and other high-stakes domains.
Abstract: Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations – a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
[11] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages
Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: AfriMTEB expands multilingual text embedding benchmarks to include 59 African languages across 14 tasks and 38 datasets, while AfriE5 achieves state-of-the-art performance through cross-lingual contrastive distillation.
Details
Motivation: African languages are underrepresented in existing multilingual text embedding benchmarks like MMTEB, with tasks often repurposed from translation benchmarks rather than including native African language tasks.
Method: Introduces AfriMTEB benchmark covering 59 African languages with 14 tasks and 38 datasets, including 6 new datasets spanning 14-56 languages. Also presents AfriE5 model adapted from mE5 through cross-lingual contrastive distillation.
Result: AfriE5 achieves state-of-the-art performance, outperforming strong baselines including Gemini-Embeddings and mE5.
Conclusion: The work successfully addresses the underrepresentation of African languages in text embedding benchmarks and demonstrates that adapted models can achieve superior performance on African language tasks.
Abstract: Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB – a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
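One common form of cross-lingual contrastive distillation is sketched below, assuming parallel sentence pairs and a frozen teacher; the exact objective used for AfriE5 may differ. The student embedding of a target-language sentence is pulled toward the teacher embedding of its parallel sentence and pushed away from other in-batch pairs.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, temperature=0.05):
    """InfoNCE-style loss aligning student embeddings of target-language
    sentences with frozen teacher embeddings of their parallel sentences.

    student_emb, teacher_emb: (batch, dim) tensors; row i of each is a translation pair.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1).detach()        # teacher is frozen
    logits = s @ t.T / temperature                        # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)    # diagonal = positive pairs
    return F.cross_entropy(logits, targets)

# Toy usage with random vectors standing in for sentence embeddings.
loss = contrastive_distillation_loss(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```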
[12] Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation
Kaveh Eskandari Miandoab, Mahammed Kamruzzaman, Arshia Gharooni, Gene Louis Kim, Vasanth Sarathy, Ninareh Mehrabi
Main category: cs.CL
TL;DR: LLMs show stereotypical biases due to their training data, and current bias alignment methods are brittle. A new augmentation framework applied to the BBQ dataset shows LLMs are susceptible to input perturbations that increase stereotypical behavior, especially for less studied demographic communities.
Details
Motivation: To address the brittleness of current bias alignment methods in LLMs and demonstrate their susceptibility to input perturbations that increase stereotypical behavior.
Method: A novel plug-and-play augmentation framework with three steps, applied to the Bias Benchmark for Question Answering (BBQ) dataset to test LLMs' responses to perturbed inputs.
Result: LLMs (including state-of-the-art open and closed weight models) show higher likelihood of stereotypical behavior when inputs are perturbed, with greater bias for less studied demographic communities.
Conclusion: Current bias mitigation approaches are insufficient, and fairness research needs expansion to include more diverse communities, as models show amplified biases for underrepresented groups.
Abstract: Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically. Furthermore, we find that such models are more likely to have biased behavior in cases where the target demographic belongs to a community less studied by the literature, underlining the need to expand the fairness and safety research to include more diverse communities.
[13] Agent-based Automated Claim Matching with Instruction-following LLMs
Dina Pisarevskaya, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: LLM-based two-step pipeline for automated claim matching: first generates prompts with LLMs, then performs binary classification. Shows LLM-generated prompts outperform human ones, smaller LLMs can match larger ones, and using different LLMs per step is effective.
Details
Motivation: To develop an automated approach for claim matching using instruction-following LLMs, exploring their capabilities in understanding and performing this task.
Method: Two-step pipeline: 1) LLM-generated prompts, 2) Claim matching as binary classification using LLMs. Investigates using different LLMs for each step and compares performance of various model sizes.
Result: LLM-generated prompts outperform state-of-the-art human-generated prompts. Smaller LLMs perform as well as larger ones in prompt generation, saving computational resources. Using different LLMs for each pipeline step is effective.
Conclusion: The approach demonstrates LLMs’ strong understanding of claim matching, enables efficient automation with smaller models, and provides insights into LLM capabilities for this task.
Abstract: We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs, to then perform claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform SOTA with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, saving computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e. using an LLM for prompt generation, and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs' understanding of claim matching.
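The two-step pipeline can be sketched as follows; the prompt wording and the placeholder `fake_llm` are illustrative assumptions, and in practice the two steps may be backed by different LLMs.

```python
def generate_matching_prompt(llm, examples):
    """Step 1: ask an LLM to write a prompt for the claim-matching task."""
    instruction = (
        "Write a concise instruction for deciding whether two claims "
        "are about the same fact. Here are example claim pairs:\n"
        + "\n".join(f"- {a} ||| {b}" for a, b in examples)
    )
    return llm(instruction)

def match_claims(llm, prompt, claim_a, claim_b):
    """Step 2: use the generated prompt for binary classification."""
    answer = llm(f"{prompt}\n\nClaim 1: {claim_a}\nClaim 2: {claim_b}\nAnswer yes or no.")
    return answer.strip().lower().startswith("yes")

# Placeholder LLM so the sketch runs; swap in real API calls in practice.
fake_llm = lambda text: "yes" if "Claim 1" in text else "Decide if the two claims match."
prompt = generate_matching_prompt(fake_llm, [("Vaccines cause autism", "MMR linked to autism")])
print(match_claims(fake_llm, prompt, "The Earth is flat", "Our planet has no curvature"))
```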
[14] Auto prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs
Soham Satyadharma, Fatemeh Sheikholeslami, Swati Kaul, Aziz Umit Batur, Suleiman A. Khan
Main category: cs.CL
TL;DR: Training-free auto-prompting cascade for LLMs that automatically generates and refines prompts to assess product quality in e-commerce, achieving 8-10% improvement in precision/recall while reducing expert effort by 99%.
Details
Motivation: To bridge the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs without requiring training labels or model fine-tuning.
Method: Cascade system that starts from human-crafted seed prompts and progressively optimizes instructions to meet catalog-specific requirements, automatically generating and refining prompts for evaluating attribute quality across thousands of product category-attribute pairs.
Result: Improves precision and recall by 8-10% over traditional chain-of-thought prompting, reduces domain expert effort from 5.1 hours to 3 minutes per attribute (99% reduction), and generalizes effectively across five languages and multiple quality assessment tasks.
Conclusion: The auto-prompt cascade provides an efficient, scalable solution for product quality assessment in e-commerce that significantly reduces human effort while maintaining performance gains across languages and tasks.
Abstract: We introduce a novel, training-free cascade for auto-prompting Large Language Models (LLMs) to assess product quality in e-commerce. Our system requires no training labels or model fine-tuning, instead automatically generating and refining prompts for evaluating attribute quality across tens of thousands of product category-attribute pairs. Starting from a seed of human-crafted prompts, the cascade progressively optimizes instructions to meet catalog-specific requirements. This approach bridges the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs. Our extensive empirical evaluation shows the auto-prompt cascade improves precision and recall by 8-10% over traditional chain-of-thought prompting. Notably, it achieves these gains while reducing domain expert effort from 5.1 hours to 3 minutes per attribute - a 99% reduction. Additionally, the cascade generalizes effectively across five languages and multiple quality assessment tasks, consistently maintaining performance gains.
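A minimal sketch of a label-free refinement loop, assuming an LLM is used both to critique and to rewrite the prompt; the prompt templates, round count, and placeholder LLM are illustrative, not the paper's recipe.

```python
def refine_prompt(llm, seed_prompt, category, attribute, n_rounds=3):
    """Iteratively rewrite a seed prompt so it reflects catalog-specific
    requirements, using the LLM itself as the critic (no labels needed)."""
    prompt = seed_prompt
    for _ in range(n_rounds):
        critique = llm(
            f"You assess the quality of the attribute '{attribute}' for products "
            f"in the category '{category}'.\nCurrent instructions:\n{prompt}\n"
            "List what is ambiguous or missing for this category."
        )
        prompt = llm(
            f"Rewrite the instructions below to fix these issues.\n"
            f"Issues:\n{critique}\nInstructions:\n{prompt}"
        )
    return prompt

# Placeholder LLM; in practice each call goes to a hosted model.
fake_llm = lambda text: text[:120] + " [revised]"
print(refine_prompt(fake_llm, "Rate whether the attribute value is accurate.", "Shoes", "material"))
```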
[15] Tongyi DeepResearch Technical Report
Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
Main category: cs.CL
TL;DR: Tongyi DeepResearch is an agentic LLM designed for deep information-seeking research tasks, trained through an end-to-end framework combining agentic mid-training and post-training, achieving SOTA performance on multiple benchmarks.
Details
Motivation: To develop an autonomous agentic LLM capable of handling long-horizon, deep information-seeking research tasks that require scalable reasoning and complex information seeking.
Method: End-to-end training framework with agentic mid-training and post-training, using a fully automatic data synthesis pipeline without human annotation, and customized environments for stable interactions.
Result: The 30.5B parameter model (with 3.3B activated per token) achieves state-of-the-art performance across multiple agentic deep research benchmarks including Humanity’s Last Exam, BrowseComp, WebWalkerQA, and others.
Conclusion: Tongyi DeepResearch successfully demonstrates scalable autonomous research capabilities and the model, framework, and solutions are open-sourced to empower the community.
Abstract: We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity’s Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
[16] Leveraging LLMs for Early Alzheimer’s Prediction
Tananun Songdechakraiwut
Main category: cs.CL
TL;DR: A connectome-informed LLM framework that uses dynamic fMRI connectivity data for early Alzheimer’s detection, achieving high prediction accuracy.
Details
Motivation: To develop a sensitive method for early Alzheimer's detection using brain connectivity data, enabling timely intervention.
Method: Encodes dynamic fMRI connectivity as temporal sequences, applies robust normalization, and maps data to frozen pre-trained LLM for clinical prediction.
Result: Achieves sensitive prediction with error rates well below clinically recognized margins for early Alzheimer’s detection.
Conclusion: The framework shows promise for timely Alzheimer’s intervention through accurate early detection using fMRI connectivity and LLMs.
Abstract: We present a connectome-informed LLM framework that encodes dynamic fMRI connectivity as temporal sequences, applies robust normalization, and maps these data into a representation suitable for a frozen pre-trained LLM for clinical prediction. Applied to early Alzheimer’s detection, our method achieves sensitive prediction with error rates well below clinically recognized margins, with implications for timely Alzheimer’s intervention.
[17] Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs
Kyomin Hwang, Hyeonjin Kim, Seungyeon Kim, Sunghyun Wee, Nojun Kwak
Main category: cs.CL
TL;DR: The paper addresses language confusion in multilingual LLMs during unlearning, showing that reference-based metrics fail when models respond in different languages than the input prompt. It introduces N-Mix score to measure language confusion and advocates for semantic-based evaluation metrics.
Details
Motivation: Previous studies on multilingual knowledge erasure focused only on performance, but this paper identifies a blind spot: language confusion occurs when models are fine-tuned with parallel multilingual data before unlearning, causing standard metrics to fail.
Method: Three-step approach: (1) introduce N-gram-based Language-Mix (N-Mix) score to quantify language confusion, (2) demonstrate reference-based metrics produce false negatives when N-Mix is high, (3) propose semantic-based metrics for direct content assessment.
Result: Language confusion is pervasive and consistent in multilingual LLMs, and reference-based evaluation metrics fail when N-Mix scores are high, leading to inaccurate assessment of unlearning effectiveness.
Conclusion: There is a critical need for semantic-based evaluation metrics that can directly assess generated content, as current reference-based metrics are inadequate for evaluating multilingual unlearning due to language confusion issues.
Abstract: There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully finetuned with parallel multilingual dataset before unlearning. Here, language confusion occurs whereby a model responds in a language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce N-gram-based Language-Mix (N-Mix) score to quantitatively show the language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when N-Mix score is high, and (3) suggest the need of new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metric a semantic-based metric.
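The paper defines N-Mix precisely; the sketch below only illustrates the general idea with a naive stand-in: count the fraction of character n-grams containing letters outside the script expected for the prompt language.

```python
import unicodedata

def naive_language_mix_score(text, expected_script="LATIN", n=3):
    """A naive stand-in for an n-gram language-mix score: the fraction of
    character n-grams containing letters outside the expected script.
    Illustrative only; the real N-Mix metric is defined in the paper."""
    def off_script(ch):
        if not ch.isalpha():
            return False
        return expected_script not in unicodedata.name(ch, "")
    grams = [text[i:i + n] for i in range(max(len(text) - n + 1, 1))]
    mixed = sum(any(off_script(c) for c in g) for g in grams)
    return mixed / len(grams)

print(naive_language_mix_score("The answer is 알 수 없습니다."))  # mixed-script response -> high score
print(naive_language_mix_score("The answer is unknown."))           # single-script response -> 0.0
```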
[18] M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems
Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
Main category: cs.CL
TL;DR: The paper proposes M-Eval, a method inspired by Evidence-Based Medicine’s heterogeneity analysis to detect factual errors in Retrieval-augmented Generation (RAG) medical question-answering systems by verifying responses against multiple evidence sources.
Details
Motivation: Current RAG applications in medical QA systems generate incorrect information (hallucinations) and fail to properly use external knowledge, leading to unreliable responses that could cause diagnostic errors.
Method: M-Eval extracts additional medical literature from external knowledge bases, retrieves evidence documents from RAG systems, and uses heterogeneity analysis to check if evidence supports different viewpoints in responses while assessing evidence reliability.
Result: The method shows an improvement of up to 23.31% accuracy across various LLMs, demonstrating significant enhancement in detecting errors in RAG-based medical systems.
Conclusion: M-Eval helps detect errors in current RAG-based medical systems, making LLM applications more reliable and reducing diagnostic errors through evidence-based verification.
Abstract: Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.
[19] PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine
Mengzhou Sun, Sendong Zhao, Jianyu Chen, Bin Qin
Main category: cs.CL
TL;DR: PICOs-RAG improves evidence-based medicine by expanding and normalizing user queries using PICO format, enhancing retrieval efficiency and relevance by up to 8.8% compared to baseline methods.
Details
Motivation: Current RAG methods struggle with complex clinical queries that lack information or use imprecise language, leading to irrelevant evidence retrieval and unhelpful answers in evidence-based medicine.
Method: The PICOs-RAG expands and normalizes user queries into professional format using PICO (Patient, Intervention, Comparison, Outcome) framework to extract key information for improved retrieval.
Result: The approach achieves up to 8.8% improvement in retrieval efficiency and relevance compared to baseline methods.
Conclusion: PICOs-RAG enhances large language models to become more helpful and reliable medical assistants in evidence-based medicine by improving query processing and evidence retrieval.
Abstract: Evidence-based medicine (EBM) research has always been of paramount importance. It is important to find appropriate medical theoretical support for the needs from physicians or patients to reduce the occurrence of medical accidents. This process is often carried out by humans querying relevant literature databases, which lacks objectivity and efficiency. Therefore, researchers utilize retrieval-augmented generation (RAG) to search for evidence and generate responses automatically. However, current RAG methods struggle to handle complex queries in real-world clinical scenarios. For example, when queries lack certain information or use imprecise language, the model may retrieve irrelevant evidence and generate unhelpful answers. To address this issue, we present the PICOs-RAG to expand the user queries into a better format. Our method can expand and normalize the queries into professional ones and use the PICO format, a search strategy tool present in EBM, to extract the most important information used for retrieval. This approach significantly enhances retrieval efficiency and relevance, resulting in up to an 8.8% improvement compared to the baseline evaluated by our method. PICOs-RAG thereby turns large language models into more helpful and reliable medical assistants for EBM.
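A minimal sketch of PICO-style query rewriting, assuming an LLM fills in the four PICO fields which are then joined into a retrieval query; the template and placeholder response are illustrative only.

```python
PICO_REWRITE_TEMPLATE = """Rewrite the clinical question below into the PICO format.
Return one line per field.
Patient/Problem:
Intervention:
Comparison:
Outcome:

Question: {question}"""

def pico_rewrite(llm, question):
    """Ask an LLM to normalize a free-form clinical question into PICO fields,
    then join the field values into a retrieval query."""
    fields = llm(PICO_REWRITE_TEMPLATE.format(question=question))
    values = [line.split(":", 1)[1].strip() for line in fields.splitlines() if ":" in line]
    return " ".join(v for v in values if v)

# Placeholder LLM response so the sketch runs.
fake_llm = lambda _: ("Patient/Problem: adults with type 2 diabetes\n"
                      "Intervention: metformin\nComparison: sulfonylureas\n"
                      "Outcome: HbA1c reduction")
print(pico_rewrite(fake_llm, "Does metformin work better than sulfonylureas?"))
```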
[20] META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine
Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
Main category: cs.CL
TL;DR: A new method that combines multiple EBM principles (reliability, heterogeneity, and extrapolation analysis) to re-rank and filter medical evidence for LLMs, improving RAG performance in evidence-based medicine by up to 11.4% accuracy.
Details
Motivation: RAG applications in evidence-based medicine struggle to efficiently distinguish high-quality evidence due to EBM's stringent evidence requirements, leading to potential misdiagnoses when using LLMs.
Method: Proposes a meta-analysis-inspired approach using reliability analysis, heterogeneity analysis, and extrapolation analysis to re-rank and filter medical evidence from PubMed dataset for LLMs.
Result: Achieved up to 11.4% accuracy improvement in experiments, enabling RAG to extract higher-quality and more reliable evidence.
Conclusion: The method successfully reduces incorrect knowledge infusion into LLM responses and helps users receive more effective medical advice by providing better quality evidence.
Abstract: Evidence-based medicine (EBM) holds a crucial role in clinical application. Given suitable medical articles, doctors effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language models (LLMs) techniques like RAG for EBM tasks. However, the EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method presents multiple principles to filter the best evidence for LLMs to diagnose. We employ a combination of several EBM methods to emulate the meta-analysis, which includes reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow the users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments and results. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.
[21] TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents
Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han
Main category: cs.CL
TL;DR: The paper proposes TEXT2DB, a new IE formulation that integrates IE output with target databases, and introduces OPAL, an LLM agent framework with Observer, Planner, and Analyzer components to handle this task.
Details
Motivation: There is a mismatch between IE ontology and downstream application needs, making it difficult to utilize IE output effectively in practical database applications.
Method: Proposes OPAL framework with three components: Observer (interacts with database), Planner (generates code-based plan with IE model calls), and Analyzer (provides code quality feedback before execution).
Result: OPAL successfully adapts to diverse database schemas by generating different code plans and calling required IE models, though challenges remain with large databases and extraction hallucination.
Conclusion: The TEXT2DB formulation and OPAL framework effectively address the integration of IE output with databases, but complex dependencies and extraction hallucination require further investigation.
Abstract: The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework OPAL (Observe-Plan-Analyze LLM) which includes an Observer component that interacts with the database, the Planner component that generates a code-based plan with calls to IE models, and the Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
[22] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward
Hao An, Yang Xu
Main category: cs.CL
TL;DR: A reinforcement learning framework using fine-grained semantic confidence rewards to help LLMs abstain from answering questions beyond their knowledge scope, improving reliability.
Details
Motivation: Existing methods use coarse-grained signals like overall confidence scores, leading to imprecise awareness of knowledge boundaries and unreliable abstention.
Method: Sample multiple candidate answers, conduct semantic clustering, train LLM to retain high-confidence cluster answers and discard low-confidence ones using fine-grained semantic confidence rewards.
Result: Significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
Conclusion: The proposed fine-grained semantic confidence reward framework effectively mitigates hallucinations by improving LLMs’ ability to abstain from answering questions beyond their knowledge scope.
Abstract: Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on a Fine-grained Semantic Confidence Reward, which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
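The clustering-and-confidence idea can be sketched as follows; exact-match equivalence stands in for the semantic (e.g., NLI or embedding-based) check, and the retain/discard threshold is an arbitrary assumption.

```python
def cluster_confidences(sampled_answers, same_meaning):
    """Group sampled answers into semantic clusters and return, for each cluster,
    a representative answer and the fraction of samples in that cluster."""
    clusters = []
    for ans in sampled_answers:
        for cluster in clusters:
            if same_meaning(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    total = len(sampled_answers)
    return [(c[0], len(c) / total) for c in clusters]

# Toy equivalence: case-insensitive exact match stands in for a semantic check.
answers = ["Paris", "paris", "Paris.", "Lyon", "Paris"]
same = lambda a, b: a.strip(".").lower() == b.strip(".").lower()
for representative, conf in cluster_confidences(answers, same):
    action = "retain" if conf >= 0.5 else "discard (abstain)"
    print(f"{representative!r}: confidence {conf:.2f} -> {action}")
```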
[23] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs
Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Main category: cs.CL
TL;DR: SpecKD is a novel knowledge distillation framework that uses a dynamic token-level gating mechanism to selectively apply distillation loss only to tokens where the teacher model is confident, avoiding learning from uncertain predictions.
Details
Motivation: Conventional KD applies distillation loss uniformly across all tokens, forcing students to learn from teacher's uncertain predictions which introduces noise and harms performance, especially when teacher models are much larger.
Method: Proposes Speculative Knowledge Distillation (SpecKD) with dynamic token-level gating inspired by speculative decoding. Student's token proposals are verified against teacher's distribution - distillation loss is applied only to 'accepted' tokens while 'rejected' tokens are masked out.
Result: Extensive experiments on diverse text generation tasks show SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, achieving state-of-the-art results.
Conclusion: SpecKD provides a plug-and-play framework that improves knowledge distillation by selectively applying loss only to confident teacher predictions, resulting in better student model performance and training stability.
Abstract: Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher’s confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher’s uncertain or high-entropy predictions, which may ultimately harm student performance-especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the “propose-and-verify” paradigm of speculative decoding. At each step, the student’s token proposal is verified against the teacher’s distribution; the distillation loss is selectively applied only to “accepted” tokens, while “rejected” tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
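A minimal sketch of token-gated distillation in the propose-and-verify spirit; the acceptance rule and the KL form below are assumptions, not necessarily the paper's exact gating criterion.

```python
import torch
import torch.nn.functional as F

def gated_distillation_loss(student_logits, teacher_logits):
    """Token-gated distillation sketch: the student's proposed token at each
    position is verified against the teacher with a speculative-decoding-style
    accept rule, and the KL distillation loss is applied only to accepted tokens.

    student_logits, teacher_logits: (batch, seq_len, vocab) tensors.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)

    proposal = s_logp.argmax(dim=-1)                                   # student's token proposal
    p_s = s_logp.gather(-1, proposal.unsqueeze(-1)).squeeze(-1).exp()  # student prob of proposal
    p_t = t_logp.gather(-1, proposal.unsqueeze(-1)).squeeze(-1).exp()  # teacher prob of proposal

    accept = torch.rand_like(p_s) < torch.clamp(p_t / p_s, max=1.0)    # propose-and-verify gate

    # Per-token KL(teacher || student), masked so rejected tokens contribute nothing.
    kl = (t_logp.exp() * (t_logp - s_logp)).sum(dim=-1)
    mask = accept.float()
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)

loss = gated_distillation_loss(torch.randn(2, 16, 1000), torch.randn(2, 16, 1000))
print(loss.item())
```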
[24] Success and Cost Elicit Convention Formation for Efficient Communication
Saujas Vaduguru, Yilun Hua, Yoav Artzi, Daniel Fried
Main category: cs.CL
TL;DR: Training large multimodal models to form linguistic conventions through simulated reference games, enabling more efficient communication with humans by reducing message length while increasing success rates.
Details
Motivation: Humans become more efficient communicators over time by forming ad hoc linguistic conventions, but current AI models lack this ability to develop shared contextual understanding for efficient communication.
Method: Using simulated reference games between models involving photographs and tangram images, training models to form conventions without additional human data, requiring both success and cost optimization.
Result: Models reduced message length by up to 41% while increasing success by 15% in human interactions, with human listeners responding faster when interacting with convention-forming models.
Conclusion: Both success and cost optimization are necessary for convention formation; training on either alone is insufficient for developing efficient communication capabilities.
Abstract: Humans leverage shared conversational context to become increasingly successful and efficient at communicating over time. One manifestation of this is the formation of ad hoc linguistic conventions, which allow people to coordinate on short, less costly utterances that are understood using shared conversational context. We present a method to train large multimodal models to form conventions, enabling efficient communication. Our approach uses simulated reference games between models, and requires no additional human-produced data. In repeated reference games involving photographs and tangram images, our method enables models to communicate efficiently with people: reducing the message length by up to 41% while increasing success by 15% over the course of the interaction. Human listeners respond faster when interacting with our model that forms conventions. We also show that training based on success or cost alone is insufficient - both are necessary to elicit convention formation.
[25] Pie: A Programmable Serving System for Emerging LLM Applications
In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
Main category: cs.CL
TL;DR: Pie is a programmable LLM serving system that decomposes token generation into fine-grained handlers, enabling custom KV cache strategies and generation logic through user-written inferlets executed in WebAssembly sandboxes.
Details
Motivation: Existing LLM serving systems use monolithic token generation loops that struggle with diverse reasoning strategies and agentic workflows, limiting flexibility and optimization opportunities.
Method: Decomposes the generation loop into service handlers, delegates control to user-provided inferlets via API, and executes them using WebAssembly for lightweight sandboxing.
Result: Matches state-of-the-art performance on standard tasks (3-12% latency overhead) while improving latency and throughput by 1.3x-3.4x on agentic workflows through application-specific optimizations.
Conclusion: Pie provides a flexible and efficient programmable serving system that enables custom optimizations for diverse LLM applications while maintaining competitive performance on standard tasks.
Abstract: Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies, bespoke generation logic, and seamlessly integrate computation and I/O, entirely within the application, without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.
[26] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation
Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, Kaifu Zhang
Main category: cs.CL
TL;DR: HalloMTBench is a new multilingual benchmark designed to expose hallucination vulnerabilities in LLM-based machine translation across 11 language directions, revealing distinct failure patterns related to model scale, source length, linguistic biases, and RL training.
Details
Motivation: Existing MT benchmarks fail to adequately expose hallucination failures in multilingual LLMs, necessitating a specialized diagnostic framework to identify and categorize these vulnerabilities.
Method: Created a diagnostic framework with taxonomy separating Instruction Detachment from Source Detachment. Generated candidates using 4 frontier LLMs, then curated 5,435 high-quality instances through ensemble LLM judging and expert validation across 11 English-to-X language directions.
Result: Evaluation of 17 LLMs revealed distinct hallucination triggers: model scale effects, source length sensitivity, linguistic biases, and RL-amplified language mixing. The benchmark successfully exposed unique failure patterns not captured by existing MT benchmarks.
Conclusion: HalloMTBench provides a forward-looking testbed for diagnosing LLM translation failures and offers valuable insights into hallucination vulnerabilities that vary by model characteristics and training methods.
Abstract: Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employed 4 frontier LLMs to generate candidates and scrutinized these candidates with an ensemble of LLM judges and expert validation. In this way, we curate 5,435 high-quality instances. We have evaluated 17 LLMs on HalloMTBench. Results reveal distinct “hallucination triggers” – unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL)-amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available at https://huggingface.co/collections/AIDC-AI/marco-mt.
[27] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarinayum Meerajita Sharma, Aditi Gupta, Afitab Iyigun, Afonso Simplício, Ahmed Essouaied, Aicha Chorana, Akhil Eppa, Akintunde Oladipo, Akshay Ramesh, Aleksei Dorkin, Alfred Malengo Kondoro, Alham Fikri Aji, Ali Eren Çetintaş, Allan Hanbury, Alou Dembele, Alp Niksarli, Álvaro Arroyo, Amin Bajand, Amol Khanna, Ana Chkhaidze, Ana Condez, Andiswa Mkhonto, Andrew Hoblitzell, Andrew Tran, Angelos Poulis, Anirban Majumder, Anna Vacalopoulou, Annette Kuuipolani Kanahele Wong, Annika Simonsen, Anton Kovalev, Ashvanth. S, Ayodeji Joseph Lana, Barkin Kinay, Bashar Alhafni, Benedict Cibalinda Busole, Bernard Ghanem, Bharti Nathani, Biljana Stojanovska Đurić, Bola Agbonile, Bragi Bergsson, Bruce Torres Fischer, Burak Tutar, Burcu Alakuş Çınar, Cade J. Kanoniakapueo Kane, Can Udomcharoenchaikit, Catherine Arnett, Chadi Helwe, Chaithra Reddy Nerella, Chen Cecilia Liu, Chiamaka Glory Nwokolo, Cristina España-Bonet, Cynthia Amol, DaeYeop Lee, Dana Arad, Daniil Dzenhaliou, Daria Pugacheva, Dasol Choi, Daud Abolade, David Liu, David Semedo, Deborah Popoola, Deividas Mataciunas, Delphine Nyaboke, Dhyuthy Krishna Kumar, Diogo Glória-Silva, Diogo Tavares, Divyanshu Goyal, DongGeon Lee, Ebele Nwamaka Anajemba, Egonu Ngozi Grace, Elena Mickel, Elena Tutubalina, Elias Herranen, Emile Anand, Emmanuel Habumuremyi, Emuobonuvie Maria Ajiboye, Eryawan Presma Yulianrifat, Esther Adenuga, Ewa Rudnicka, Faith Olabisi Itiola, Faran Taimoor Butt, Fathima Thekkekara, Fatima Haouari, Filbert Aurelian Tjiaranata, Firas Laakom, Francesca Grasso, Francesco Orabona, Francesco Periti, Gbenga Kayode Solomon, Gia Nghia Ngo, Gloria Udhehdhe-oze, Gonçalo Martins, Gopi Naga Sai Ram Challagolla, Guijin Son, Gulnaz Abdykadyrova, Hafsteinn Einarsson, Hai Hu, Hamidreza Saffari, Hamza Zaidi, Haopeng Zhang, Harethah Abu Shairah, Harry Vuong, Hele-Andra Kuulmets, Houda Bouamor, Hwanjo Yu, Iben Nyholm Debess, İbrahim Ethem Deveci, Ikhlasul Akmal Hanif, Ikhyun Cho, Inês Calvo, Inês Vieira, Isaac Manzi, Ismail Daud, Itay Itzhak, Iuliia, Alekseenko, Ivan Belashkin, Ivan Spada, Ivan Zhelyazkov, Jacob Brinton, Jafar Isbarov, Jaka Čibej, Jan Čuhel, Jan Kocoń, Jauza Akbar Krito, Jebish Purbey, Jennifer Mickel, Jennifer Za, Jenny Kunz, Jihae Jeong, Jimena Tena Dávalos, Jinu Lee, João Magalhães, John Yi, Jongin Kim, Joseph Chataignon, Joseph Marvin Imperial, Jubeerathan Thevakumar, Judith Land, Junchen Jiang, Jungwhan Kim, Kairit Sirts, Kamesh R, Kamesh V, Kanda Patrick Tshinu, Kätriin Kukk, Kaustubh Ponkshe, Kavsar Huseynova, Ke He, Kelly Buchanan, Kengatharaiyer Sarveswaran, Kerem Zaman, Khalil Mrini, Kian Kyars, Krister Kruusmaa, Kusum Chouhan, Lainitha Krishnakumar, Laura Castro Sánchez, Laura Porrino Moscoso, Leshem Choshen, Levent Sencan, Lilja Øvrelid, Lisa Alazraki, Lovina Ehimen-Ugbede, Luheerathan Thevakumar, Luxshan Thavarasa, Mahnoor Malik, Mamadou K. 
Keita, Mansi Jangid, Marco De Santis, Marcos García, Marek Suppa, Mariam D’Ciofalo, Marii Ojastu, Maryam Sikander, Mausami Narayan, Maximos Skandalis, Mehak Mehak, Mehmet İlteriş Bozkurt, Melaku Bayu Workie, Menan Velayuthan, Michael Leventhal, Michał Marcińczuk, Mirna Potočnjak, Mohammadamin Shafiei, Mridul Sharma, Mrityunjaya Indoria, Muhammad Ravi Shulthan Habibi, Murat Kolić, Nada Galant, Naphat Permpredanun, Narada Maugin, Nicholas Kluge Corrêa, Nikola Ljubešić, Nirmal Thomas, Nisansa de Silva, Nisheeth Joshi, Nitish Ponkshe, Nizar Habash, Nneoma C. Udeze, Noel Thomas, Noémi Ligeti-Nagy, Nouhoum Coulibaly, Nsengiyumva Faustin, Odunayo Kareemat Buliaminu, Odunayo Ogundepo, Oghojafor Godswill Fejiro, Ogundipe Blessing Funmilola, Okechukwu God’spraise, Olanrewaju Samuel, Olaoye Deborah Oluwaseun, Olasoji Akindejoye, Olga Popova, Olga Snissarenko, Onyinye Anulika Chiemezie, Orkun Kinay, Osman Tursun, Owoeye Tobiloba Moses, Oyelade Oluwafemi Joshua, Oyesanmi Fiyinfoluwa, Pablo Gamallo, Pablo Rodríguez Fernández, Palak Arora, Pedro Valente, Peter Rupnik, Philip Oghenesuowho Ekiugbo, Pramit Sahoo, Prokopis Prokopidis, Pua Niau-Puhipau, Quadri Yahya, Rachele Mignone, Raghav Singhal, Ram Mohan Rao Kadiyala, Raphael Merx, Rapheal Afolayan, Ratnavel Rajalakshmi, Rishav Ghosh, Romina Oji, Ron Kekeha Solis, Rui Guerra, Rushikesh Zawar, Sa’ad Nasir Bashir, Saeed Alzaabi, Sahil Sandeep, Sai Pavan Batchu, SaiSandeep Kantareddy, Salsabila Zahirah Pranida, Sam Buchanan, Samuel Rutunda, Sander Land, Sarah Sulollari, Sardar Ali, Saroj Sapkota, Saulius Tautvaisas, Sayambhu Sen, Sayantani Banerjee, Sebastien Diarra, SenthilNathan. M, Sewoong Lee, Shaan Shah, Shankar Venkitachalam, Sharifa Djurabaeva, Sharon Ibejih, Shivanya Shomir Dutta, Siddhant Gupta, Silvia Paniagua Suárez, Sina Ahmadi, Sivasuthan Sukumar, Siyuan Song, Snegha A., Sokratis Sofianopoulos, Sona Elza Simon, Sonja Benčina, Sophie Gvasalia, Sphurti Kirit More, Spyros Dragazis, Stephan P. Kaufhold, Suba. S, Sultan AlRashed, Surangika Ranathunga, Taiga Someya, Taja Kuzman Pungeršek, Tal Haklay, Tasi’u Jibril, Tatsuya Aoyama, Tea Abashidze, Terenz Jomar Dela Cruz, Terra Blevins, Themistoklis Nikas, Theresa Dora Idoko, Thu Mai Do, Tilek Chubakov, Tommaso Gargiani, Uma Rathore, Uni Johannesen, Uwuma Doris Ugwu, Vallerie Alexandra Putra, Vanya Bannihatti Kumar, Varsha Jeyarajalingam, Varvara Arzt, Vasudevan Nedumpozhimana, Viktoria Ondrejova, Viktoryia Horbik, Vishnu Vardhan Reddy Kummitha, Vuk Dinić, Walelign Tewabe Sewunetie, Winston Wu, Xiaojing Zhao, Yacouba Diarra, Yaniv Nikankin, Yash Mathur, Yixi Chen, Yiyuan Li, Yolanda Xavier, Yonatan Belinkov, Yusuf Ismail Abayomi, Zaid Alyafeai, Zhengyang Shan, Zhi Rui Tam, Zilu Tang, Zuzana Nadova, Baber Abbasi, Stella Biderman, David Stap, Duygu Ataman, Fabian Schmidt, Hila Gonen, Jiayi Wang, David Ifeoluwa Adelani
Main category: cs.CL
TL;DR: Global PIQA is a culturally-specific commonsense reasoning benchmark covering 116 languages across 5 continents, created by 335 researchers from 65 countries, revealing performance gaps in LLMs especially for lower-resource languages.
Details
Motivation: There is a lack of culturally-specific evaluation benchmarks for LLMs that cover diverse languages and cultures, limiting our understanding of how well models perform across different cultural contexts.
Method: Created a participatory commonsense reasoning benchmark through manual construction by 335 researchers from 65 countries, covering 116 language varieties across 14 language families and 23 writing systems, with over 50% of examples referencing culturally-specific elements.
Result: State-of-the-art LLMs perform well overall but show significant performance gaps in lower-resource languages (up to 37% accuracy gap), with open models generally underperforming proprietary models.
Conclusion: Everyday cultural knowledge remains a challenge for LLMs across many languages and cultures, highlighting the need for improvement beyond complex reasoning and expert knowledge, while showcasing the diversity of human cultures embedded in language.
Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
[28] RegSpeech12: A Regional Corpus of Bengali Spontaneous Speech Across Dialects
Md. Rezuwan Hassan, Azmol Hossain, Kanij Fatema, Rubayet Sabbir Faruque, Tanmoy Shome, Ruwad Naswan, Trina Chakraborty, Md. Foriduzzaman Zihad, Tawsif Tashwar Dipto, Nazia Tasnim, Nazmuddoha Ansary, Md. Mehedi Hasan Shawon, Ahmed Imtiaz Humayun, Md. Golam Rabiul Alam, Farig Sadeque, Asif Sushmit
Main category: cs.CL
TL;DR: This paper documents Bengali dialect diversity and explores computational modeling for regional ASR systems to preserve dialectal richness and develop inclusive language technologies.
Details
Motivation: Despite significant dialectal diversity in the Bengali language across South Asia, there is limited systematic research on computational processing of these dialects, creating a gap in inclusive digital tools.
Method: The study documents and analyzes phonetic and morphological properties of Bengali dialects, and explores building computational models including Automatic Speech Recognition systems tailored to regional varieties.
Result: The research created a dataset documenting Bengali dialect diversity and demonstrated the feasibility of developing computational models for regional dialect processing.
Conclusion: Computational modeling of Bengali dialects enables preservation of linguistic diversity and development of inclusive digital tools, with released dataset supporting further research.
Abstract: The Bengali language, spoken extensively across South Asia and among diasporic communities, exhibits considerable dialectal diversity shaped by geography, culture, and history. Phonological and pronunciation-based classifications broadly identify five principal dialect groups: Eastern Bengali, Manbhumi, Rangpuri, Varendri, and Rarhi. Within Bangladesh, further distinctions emerge through variation in vocabulary, syntax, and morphology, as observed in regions such as Chittagong, Sylhet, Rangpur, Rajshahi, Noakhali, and Barishal. Despite this linguistic richness, systematic research on the computational processing of Bengali dialects remains limited. This study seeks to document and analyze the phonetic and morphological properties of these dialects while exploring the feasibility of building computational models particularly Automatic Speech Recognition (ASR) systems tailored to regional varieties. Such efforts hold potential for applications in virtual assistants and broader language technologies, contributing to both the preservation of dialectal diversity and the advancement of inclusive digital tools for Bengali-speaking communities. The dataset created for this study is released for public use.
[29] Squrve: A Unified and Modular Framework for Complex Real-World Text-to-SQL Tasks
Yihan Wang, Peiyu Liu, Runyu Chen, Jiaxing Pu, Wei Xu
Main category: cs.CL
TL;DR: Squrve is a unified Text-to-SQL framework that standardizes execution interfaces and uses multi-actor collaboration to bridge research advances with real-world applications, outperforming individual methods.
Details
Motivation: Despite rapid advances in Text-to-SQL research, deploying these techniques in real-world systems remains challenging due to limited integration tools, creating a gap between academic methods and practical applications.
Method: Squrve establishes a universal execution paradigm with standardized invocation interfaces and implements a multi-actor collaboration mechanism based on seven abstracted effective atomic actor components.
Result: Experiments on widely adopted benchmarks show that the collaborative workflows consistently outperform the original individual methods.
Conclusion: Squrve opens up a new effective avenue for tackling complex real-world queries by unifying research advances and real-world applications through its modular framework.
Abstract: Text-to-SQL technology has evolved rapidly, with diverse academic methods achieving impressive results. However, deploying these techniques in real-world systems remains challenging due to limited integration tools. To bridge this gap, we introduce Squrve, a unified, modular, and extensible Text-to-SQL framework designed to bring together research advances and real-world applications. Squrve first establishes a universal execution paradigm that standardizes invocation interfaces, then proposes a multi-actor collaboration mechanism based on seven abstracted effective atomic actor components. Experiments on widely adopted benchmarks demonstrate that the collaborative workflows consistently outperform the original individual methods, thereby opening up a new effective avenue for tackling complex real-world queries. The code is available at https://github.com/Satissss/Squrve.
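The abstract describes composing atomic actor components behind a standardized invocation interface, without listing the components themselves. The sketch below illustrates only the general composition pattern; the actor names (schema linking, generation, execution check) and the interface are hypothetical placeholders, not Squrve's actual seven components or API.

```python
# Hypothetical sketch of composing atomic actors behind one standardized
# invocation interface for a Text-to-SQL workflow. Actor names and the
# QueryState fields are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class QueryState:
    question: str
    schema: str
    sql: str = ""
    notes: List[str] = field(default_factory=list)

Actor = Callable[[QueryState], QueryState]

def schema_linker(state: QueryState) -> QueryState:
    state.notes.append(f"linked schema for: {state.question}")
    return state

def sql_generator(state: QueryState) -> QueryState:
    state.sql = "SELECT 1  -- placeholder generated SQL"
    return state

def execution_checker(state: QueryState) -> QueryState:
    state.notes.append("validated SQL" if state.sql else "no SQL produced")
    return state

def run_workflow(actors: List[Actor], state: QueryState) -> QueryState:
    # The standardized interface: every actor maps QueryState -> QueryState,
    # so a workflow is just an ordered composition of atomic actors.
    for actor in actors:
        state = actor(state)
    return state

result = run_workflow(
    [schema_linker, sql_generator, execution_checker],
    QueryState("How many users signed up in 2024?", "users(id, signup_date)"),
)
print(result.sql, result.notes)
```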
[30] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents
Vivek Kalyan, Martin Andrews
Main category: cs.CL
TL;DR: RL-trained LLM agents outperform prompt-based approaches on legal document search, achieving 85% vs 78% accuracy with a 14B parameter model, and perform better with longer multi-turn interactions.
Details
Motivation: To demonstrate that Reinforcement Learning can significantly enhance LLM agent capabilities beyond prompt-based approaches by learning from experience.
Method: Used Reinforcement Learning to train a 14 Billion parameter LLM agent on a legal document search benchmark, exploring turn-restricted regimes during training and testing.
Result: RL-trained model achieved 85% accuracy, outperforming frontier class models (78% accuracy), with better performance when allowed longer multi-turn horizons.
Conclusion: RL training pushes LLM agent capabilities significantly beyond prompt-based methods, with multi-turn interactions being crucial for optimal performance.
Abstract: Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14 Billion parameter model outperforms frontier class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test-time, that show these agents achieve better results if allowed to operate over longer multi-turn horizons.
[31] Beyond Line-Level Filtering for the Pretraining Corpora of LLMs
Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, Jaejin Lee
Main category: cs.CL
TL;DR: Enhanced line-level filtering methods (PLD and PTF) that consider sequential document patterns improve language model performance by preserving valuable content that traditional filters discard.
Details
Motivation: Traditional line-level filtering techniques like deduplication and punctuation filters often remove valuable content, negatively impacting downstream model performance.
Method: Proposed pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF) that consider both line-level signals and their sequential distribution across documents to preserve structurally important content.
Result: Training 1B parameter models in English and Korean showed consistent improvements on multiple-choice benchmarks and significant accuracy gains on SQuAD v1 and KorQuAD v1 question-answering tasks.
Conclusion: Pattern-aware filtering methods that account for document structure and sequential patterns are more effective than traditional line-level filters for data preprocessing in language model training.
Abstract: While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), by enhancing the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
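The key point above is that the filter looks not only at whether a line repeats but also at how it is distributed across documents. The sketch below illustrates that idea for the deduplication side, assuming document frequency and positional spread as the pattern signals; the thresholds and criteria are assumptions for illustration, not the paper's actual PLD algorithm.

```python
# Illustrative sketch of pattern-aware line-level deduplication: drop a line
# only when it repeats across many documents AND tends to appear at a
# consistent relative position (boilerplate-like pattern). Thresholds are
# assumptions, not the paper's settings.
from collections import defaultdict
from statistics import pstdev

def pattern_aware_dedup(documents, min_doc_freq=5, max_pos_spread=0.05):
    doc_freq = defaultdict(int)      # in how many documents a line occurs
    positions = defaultdict(list)    # relative positions of the line

    for doc in documents:
        lines = doc.splitlines()
        seen = set()
        for i, line in enumerate(lines):
            key = line.strip()
            if not key:
                continue
            positions[key].append(i / max(len(lines), 1))
            if key not in seen:
                doc_freq[key] += 1
                seen.add(key)

    def is_boilerplate(key):
        return (doc_freq[key] >= min_doc_freq
                and pstdev(positions[key]) <= max_pos_spread)

    cleaned = []
    for doc in documents:
        kept = [ln for ln in doc.splitlines() if not is_boilerplate(ln.strip())]
        cleaned.append("\n".join(kept))
    return cleaned
```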
[32] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean
Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park, Sumin Bae, Mingyu Kang, Jaejin Lee
Main category: cs.CL
TL;DR: Ko-MuSR is the first benchmark for evaluating multistep soft reasoning in Korean narratives, designed to minimize data contamination and featuring human-verified content.
Details
Motivation: To address the lack of comprehensive benchmarks for evaluating multistep soft reasoning in Korean narratives while ensuring data quality through human verification.
Method: Built following the MuSR framework with fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability.
Result: Multilingual models outperformed Korean-specialized models in Korean reasoning tasks, and carefully designed prompting strategies (few-shot examples, reasoning traces, task-specific hints) boosted accuracy to near-human levels.
Conclusion: Ko-MuSR provides a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies, demonstrating cross-lingual generalization of reasoning ability.
Abstract: We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models – two multilingual and two Korean-specialized – show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.
[33] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations
Aaron Scott, Maike Züfle, Jan Niehues
Main category: cs.CL
TL;DR: MuSaG is the first German multimodal sarcasm detection dataset with 33 minutes of TV show content, featuring aligned text, audio, and video modalities with human annotations, used to benchmark model performance against human capabilities.
Details
Motivation: Sarcasm detection is challenging for natural language understanding and extends to multimodal contexts with the rise of multimodal LLMs, requiring integration of cues from audio and vision beyond just text.
Method: Created the MuSaG dataset from German TV shows with human-annotated text, audio, and video modalities. Benchmarked 9 open-source and commercial models across text, audio, vision, and multimodal architectures, comparing performance to human annotations.
Result: Humans rely heavily on audio cues in conversational settings, while models perform best on text. This reveals a gap in current multimodal models’ ability to effectively utilize audio information for sarcasm detection.
Conclusion: The MuSaG dataset supports developing multimodal models better suited to realistic sarcasm detection scenarios and human-model alignment, addressing current limitations in audio processing capabilities.
Abstract: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.
[34] Exploring the Influence of Relevant Knowledge for Natural Language Generation Interpretability
Iván Martínez-Murillo, Paloma Moreda, Elena Lloret
Main category: cs.CL
TL;DR: This paper introduces KITGI, a benchmark for evaluating knowledge integration in NLG, showing that removing relevant external knowledge from ConceptNet drastically reduces commonsense generation quality from 91% to 6% correctness.
Details
Motivation: To understand how external knowledge integration affects Natural Language Generation, particularly for commonsense generation tasks, and to create an interpretable benchmark for evaluating knowledge-enhanced NLG systems.
Method: Extended CommonGen dataset to create KITGI benchmark with retrieved semantic relations from ConceptNet. Used T5-Large model to compare generation with full vs filtered knowledge. Three-stage interpretability method: remove key knowledge, regenerate sentences, manually assess commonsense plausibility and concept coverage.
Result: Sentences generated with full external knowledge achieved 91% correctness across both commonsense plausibility and concept coverage criteria. Filtering out highly relevant relations reduced performance drastically to 6%.
Conclusion: Relevant external knowledge is critical for maintaining coherence and concept coverage in NLG. Highlights the need for interpretable, knowledge-enhanced NLG systems and evaluation frameworks that capture underlying reasoning beyond surface-level metrics.
Abstract: This paper explores the influence of external knowledge integration in Natural Language Generation (NLG), focusing on a commonsense generation task. We extend the CommonGen dataset by creating KITGI, a benchmark that pairs input concept sets with retrieved semantic relations from ConceptNet and includes manually annotated outputs. Using the T5-Large model, we compare sentence generation under two conditions: with full external knowledge and with filtered knowledge where highly relevant relations were deliberately removed. Our interpretability benchmark follows a three-stage method: (1) identifying and removing key knowledge, (2) regenerating sentences, and (3) manually assessing outputs for commonsense plausibility and concept coverage. Results show that sentences generated with full knowledge achieved 91% correctness across both criteria, while filtering reduced performance drastically to 6%. These findings demonstrate that relevant external knowledge is critical for maintaining both coherence and concept coverage in NLG. This work highlights the importance of designing interpretable, knowledge-enhanced NLG systems and calls for evaluation frameworks that capture the underlying reasoning beyond surface-level metrics.
[35] Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment
Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Main category: cs.CL
TL;DR: The paper proposes a method for parametric knowledge transfer (PKT) across LLMs of different scales by using activations as the medium for layer-wise knowledge transfer, leveraging semantic alignment in latent space.
Details
Motivation: To enable effective and efficient knowledge transfer across LLMs of different scales, addressing the limitations of existing methods that directly reuse layer parameters due to neural incompatibility issues.
Method: Uses activations as the medium for layer-wise knowledge transfer instead of directly using layer parameters, focusing on semantic alignment in latent space.
Result: Outperforms prior work and better aligns model behaviors across varying scales, as demonstrated by evaluations on four benchmarks.
Conclusion: The approach provides insights into the nature of latent semantic alignment and identifies key factors that ease cross-scale knowledge transfer between LLMs.
Abstract: Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, which is accessible to locate, trace, and analyze. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Due to neural incompatibility, referring to the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.
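The central move above is to transfer knowledge through activations rather than by copying layer parameters across incompatible architectures. Below is a minimal sketch of one way to operationalize that, assuming activations from corresponding layers are collected on shared inputs and aligned with a learned linear map trained by MSE; the objective, the single linear projector, and the function names are illustrative assumptions, not the paper's procedure.

```python
# Illustrative sketch of activation-based cross-scale transfer: align latent
# spaces via a learned projection over paired activations instead of reusing
# layer parameters. The MSE objective and linear projector are assumptions.
import torch
import torch.nn as nn

def align_layer(student_acts, teacher_acts, steps=200, lr=1e-3):
    """student_acts: (N, d_student), teacher_acts: (N, d_teacher),
    collected on the same inputs from corresponding layers."""
    proj = nn.Linear(teacher_acts.size(-1), student_acts.size(-1))
    opt = torch.optim.Adam(proj.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(proj(teacher_acts), student_acts)
        loss.backward()
        opt.step()
    return proj  # maps teacher-space semantics into the student's latent space

# Toy usage with random activations standing in for cached layer outputs.
proj = align_layer(torch.randn(64, 1024), torch.randn(64, 4096))
```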
[36] HACK: Hallucinations Along Certainty and Knowledge Axes
Adi Simhi, Jonathan Herzig, Itay Itzhak, Dana Arad, Zorik Gekhman, Roi Reichart, Fazl Barez, Gabriel Stanovsky, Idan Szpektor, Yonatan Belinkov
Main category: cs.CL
TL;DR: A framework categorizing LLM hallucinations along knowledge and certainty axes, with model-specific dataset construction and validation through steering mitigation.
Details
Motivation: Existing research focuses on external properties of hallucinations, overlooking the need for tailored mitigation strategies based on underlying mechanisms.
Method: Proposed categorization framework with two axes (knowledge and certainty), model-specific dataset construction, and validation using steering mitigation to manipulate model activations.
Result: Significant difference between knowledge-based hallucination types validated; identified concerning hallucinations where models hallucinate with certainty despite having correct knowledge; some mitigation methods fail on critical cases.
Conclusion: Both knowledge and certainty are crucial in hallucination analysis, requiring targeted mitigation approaches that consider underlying factors.
Abstract: Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucination by their external properties rather than by the LLMs’ underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having the knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the hallucination underlying factors.
[37] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich
Main category: cs.CL
TL;DR: This paper examines how inference and training choices affect explanation faithfulness in LLMs, particularly in healthcare contexts where unfaithful explanations can undermine trust and safety.
Details
Motivation: LLMs often produce unfaithful explanations that don't reflect true decision factors, which is especially problematic in healthcare where omitted clinical cues or masked shortcuts can lead to unsafe decision support.
Method: Evaluated three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on BBQ (social bias) and MedQA (medical licensing) datasets, manipulating few-shot examples (quantity/type), prompting strategies, and training procedures.
Result: Few-shot example quantity and quality significantly impact faithfulness; faithfulness is sensitive to prompting design; instruction-tuning improves faithfulness on MedQA.
Conclusion: The findings provide insights for enhancing LLM interpretability and trustworthiness in sensitive domains through careful deployment choices.
Abstract: Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.
[38] Abjad AI at NADI 2025: CATT-Whisper: Multimodal Diacritic Restoration Using Text and Speech Representations
Ahmad Ghannam, Naif Alharthi, Faris Alasmary, Kholood Al Tabash, Shouq Sadah, Lahouari Ghouti
Main category: cs.CL
TL;DR: A multimodal approach combining text and speech for Arabic dialectal diacritic restoration, achieving WER of 0.25 and CER of 0.9 on development set.
Details
Motivation: To improve Arabic dialectal diacritic restoration by leveraging both textual and speech information through multimodal integration.
Method: Uses CATT encoder for text and Whisper encoder for speech, with two integration strategies: early fusion of averaged speech tokens with text tokens, and cross-attention fusion. Includes speech dropout during training for robustness.
Result: Achieved WER of 0.25 and CER of 0.9 on development set, and WER of 0.55 and CER of 0.13 on test set.
Conclusion: The multimodal approach effectively combines text and speech for Arabic diacritic restoration, with speech dropout enhancing model robustness.
Abstract: In this work, we tackle the Diacritic Restoration (DR) task for Arabic dialectal sentences using a multimodal approach that combines both textual and speech information. We propose a model that represents the text modality using an encoder extracted from our own pre-trained model named CATT. The speech component is handled by the encoder module of the OpenAI Whisper base model. Our solution is designed following two integration strategies. The former consists of fusing the speech tokens with the input at an early stage, where the 1500 frames of the audio segment are averaged over 10 consecutive frames, resulting in 150 speech tokens. To ensure embedding compatibility, these averaged tokens are processed through a linear projection layer prior to merging them with the text tokens. Contextual encoding is guaranteed by the CATT encoder module. The latter strategy relies on cross-attention, where text and speech embeddings are fused. The cross-attention output is then fed to the CATT classification head for token-level diacritic prediction. To further improve model robustness, we randomly deactivate the speech input during training, allowing the model to perform well with or without speech. Our experiments show that the proposed approach achieves a word error rate (WER) of 0.25 and a character error rate (CER) of 0.9 on the development set. On the test set, our model achieved WER and CER scores of 0.55 and 0.13, respectively.
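The early-fusion path above is concrete enough to sketch: 1500 Whisper encoder frames are averaged over windows of 10 to give 150 speech tokens, projected to the text embedding size, and merged with the text tokens. The hidden sizes used below (512 for the Whisper-base encoder, 768 for the text encoder) are assumptions for illustration, as is the choice to prepend rather than otherwise interleave the speech tokens.

```python
# Sketch of the early-fusion path: average 1500 speech frames over windows of
# 10 to get 150 speech tokens, project to the text embedding size, and
# concatenate with the text token embeddings. Hidden sizes are assumptions.
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, speech_dim=512, text_dim=768):
        super().__init__()
        self.proj = nn.Linear(speech_dim, text_dim)  # embedding compatibility

    def forward(self, speech_frames, text_embeddings):
        # speech_frames: (batch, 1500, speech_dim) from the Whisper encoder.
        b, t, d = speech_frames.shape
        pooled = speech_frames.view(b, t // 10, 10, d).mean(dim=2)   # (batch, 150, d)
        speech_tokens = self.proj(pooled)                            # (batch, 150, text_dim)
        # Merge speech tokens with text tokens before the contextual encoder.
        return torch.cat([speech_tokens, text_embeddings], dim=1)

fused = EarlyFusion()(torch.randn(2, 1500, 512), torch.randn(2, 40, 768))
print(fused.shape)  # torch.Size([2, 190, 768])
```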
[39] Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
Syed Zohaib Hassan, Pål Halvorsen, Miriam S. Johnson, Pierre Lison
Main category: cs.CL
TL;DR: LLMs struggle to generate authentic child-like dialogue. Study evaluated 5 LLMs for Norwegian child conversations, finding most models produce language too advanced for target age groups.
Details
Motivation: LLMs are predominantly trained on adult conversational data, creating challenges for generating authentic child-like dialogue in specialized applications like children's education.
Method: Comparative study of 5 LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, NorBloom-7b) generating Norwegian conversations for children aged 5 and 9. Blind evaluation by 11 education professionals using real child interview data and LLM-generated samples.
Result: Evaluators showed strong inter-rater reliability (ICC=0.75) and higher accuracy in age prediction for younger children. GPT-4 and NorBloom-7b performed relatively well, but most models generated language perceived as more linguistically advanced than target age groups.
Conclusion: Critical data-related challenges exist in developing LLM systems for specialized applications involving children, especially in low-resource languages where age-appropriate lexical resources are scarce.
Abstract: Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
[40] From Memorization to Reasoning in the Spectrum of Loss Curvature
Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Main category: cs.CL
TL;DR: This paper shows that memorization in transformer models can be identified and removed through weight decomposition based on loss landscape curvature, with minimal impact on general reasoning while specifically affecting fact retrieval and arithmetic tasks.
Details
Motivation: To understand how memorization is represented in transformer models and develop methods to remove it while preserving model performance on general tasks.
Method: A weight editing procedure based on loss landscape curvature decomposition that suppresses memorized data by removing high-curvature components from model weights.
Result: The method effectively reduces memorization more than BalancedSubnet while maintaining lower perplexity, but specifically negatively impacts fact retrieval and arithmetic tasks.
Conclusion: Memorization can be disentangled and removed from transformers using curvature-based methods, revealing that certain tasks like math and fact retrieval rely on specialized weight directions rather than general mechanisms.
Abstract: We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than for non-memorized points, meaning ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we analyze the editing procedure extensively on its effect on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open book fact retrieval and general logical reasoning are conserved. We posit these tasks rely heavily on specialized directions in weight space rather than general purpose mechanisms, regardless of whether those individual datapoints are memorized. We support this by showing a correspondence between task data’s activation strength on the low-curvature components that we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly-used structures involved in solving tasks like math and fact retrieval.
[41] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent’s Internal Emergent Symbolic Representation?
Ziqi Ma, Sao Mai Nguyen, Philippe Xu
Main category: cs.CL
TL;DR: LLMs can translate natural language instructions into symbolic representations from hierarchical RL, but performance depends on partition granularity and task complexity.
Details
Motivation: To investigate if LLMs can translate human natural language into the internal symbolic representations that emerge during hierarchical reinforcement learning, enabling better planning and generalization.
Method: Applied structured evaluation framework to measure translation performance of GPT, Claude, Deepseek and Grok LLMs across different internal symbolic partitions generated by hierarchical RL in Ant Maze and Ant Fall environments.
Result: LLMs demonstrate some translation ability but performance is highly sensitive to partition granularity and task complexity, exposing limitations in representation alignment.
Conclusion: Current LLMs have limited capacity for robust alignment between language and internal agent representations, highlighting need for further research.
Abstract: Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs – GPT, Claude, Deepseek and Grok – across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
[42] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze
Main category: cs.CL
TL;DR: Proposes MERGE methodology for automatically generating reasoning-preserving NLI variants by replacing open-class words, showing models perform 4-20% worse on these minimally altered problems.
Details
Motivation: Manual creation of NLI benchmarks is costly, while automatic generation of high-quality variants is difficult. Need to test model robustness through reasoning-preserving modifications.
Method: Generate NLI variants by replacing open-class words while preserving underlying reasoning structure. Analyze impact of word class, probability, and plausibility on model performance.
Result: NLI models perform 4-20% worse on reasoning-preserving variants, showing low generalizability even with minimal alterations.
Conclusion: Current NLI models lack robustness to reasoning-preserving modifications, highlighting limitations in their generalization capabilities despite minimal changes to problems.
Abstract: In recent years, many generalization benchmarks have shown language models’ lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test as MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models’ predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how word class of the replacements, word probability, and plausibility influence NLI models’ performance.
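The variant-generation idea above is that swapping an open-class word consistently in both premise and hypothesis should leave the entailment relation intact. Below is a minimal sketch of that construction; the tiny replacement lexicon and exact-match substitution are toy assumptions, not the benchmark's actual selection and filtering procedure.

```python
# Minimal sketch of reasoning-preserving variant generation in the spirit of
# MERGE: replace an open-class word consistently in premise and hypothesis so
# the label is preserved. The replacement lexicon is a toy assumption.
import re

REPLACEMENTS = {"dog": "rabbit", "man": "woman", "guitar": "violin"}

def make_variant(premise, hypothesis, label):
    for old, new in REPLACEMENTS.items():
        pattern = rf"\b{old}\b"
        # Only substitute when the word appears in both sentences, so the
        # relation between them is unchanged.
        if re.search(pattern, premise) and re.search(pattern, hypothesis):
            return (re.sub(pattern, new, premise),
                    re.sub(pattern, new, hypothesis),
                    label)  # label preserved by construction
    return premise, hypothesis, label

print(make_variant("A man is playing a guitar.",
                   "A man is making music.",
                   "entailment"))
```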
[43] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Main category: cs.CL
TL;DR: LATR improves RLVR by addressing trajectory diversity issues through lookahead tree-based rollouts that enforce branching at high-uncertainty steps, accelerating policy learning by 131% and improving final performance by 4.2%.
Details
Motivation: Current RLVR pipelines suffer from limited trajectory diversity during group rollouts due to token-level stochastic sampling, leading to homogeneous trajectories that diminish reward signals and hinder effective policy learning.
Method: LATR operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation.
Result: LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and DAPO algorithms across different reasoning tasks.
Conclusion: LATR effectively addresses trajectory diversity limitations in RLVR, significantly improving learning efficiency and final performance through explicit diversity promotion in rollout strategies.
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
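Stage (1) above decides where a rollout should branch. The sketch below illustrates one way to make that decision, assuming next-token entropy as the uncertainty signal and a fixed top-k expansion; the threshold and k are illustrative, and the lookahead simulation and similarity-based pruning stages are omitted for brevity.

```python
# Sketch of the branching decision in a LATR-style rollout: expand into the
# top-k candidate tokens only at steps where the next-token distribution has
# high entropy. Threshold and k are assumptions, not the paper's settings.
import torch
import torch.nn.functional as F

def maybe_branch(next_token_logits, entropy_threshold=2.0, k=3):
    """next_token_logits: (vocab,). Returns the token ids to expand."""
    probs = F.softmax(next_token_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum()
    if entropy >= entropy_threshold:            # uncertain step: branch
        return torch.topk(probs, k).indices.tolist()
    return [int(probs.argmax())]                # confident step: continue greedily

branches = maybe_branch(torch.randn(32000))
print(len(branches), branches[:3])
```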
[44] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning
Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.CL
TL;DR: Critique-RL is an online RL approach that trains critiquing language models without stronger supervision, using a two-player paradigm where an actor generates responses and a critic provides feedback for refinement.
Details
Motivation: Existing approaches for training critiquing language models typically rely on stronger supervisors for annotation, which limits scalability and practical deployment.
Method: Two-stage optimization: Stage I reinforces critic discriminability with rule-based rewards; Stage II introduces indirect rewards based on actor refinement while maintaining discriminability via regularization.
Result: Substantial performance improvements across various tasks and models, achieving 9.02% gain on in-domain tasks and 5.70% gain on out-of-domain tasks for Qwen2.5-7B.
Conclusion: Critique-RL effectively develops critiquing language models without stronger supervision, demonstrating significant performance gains and highlighting its potential for improving LLMs in complex reasoning tasks.
Abstract: Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor’s outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic’s helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
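The two-stage structure above pairs a direct rule-based reward for the critic's verdicts with a later refinement-based reward that keeps a discriminability term. The toy sketch below shows only that structure, assuming binary correctness signals; the reward magnitudes and the mixing weight are assumptions for illustration, not the paper's reward design.

```python
# Toy sketch of a two-stage critic reward in the spirit of Critique-RL.
# Values and the mixing weight alpha are illustrative assumptions.
def stage1_reward(critic_verdict, response_is_correct):
    """Stage I: direct rule-based reward -- the critic must judge correctness correctly."""
    return 1.0 if critic_verdict == response_is_correct else -1.0

def stage2_reward(critic_verdict, response_is_correct, refined_is_correct, alpha=0.5):
    """Stage II: indirect reward from actor refinement, regularized so the
    critic keeps its discriminability."""
    refinement_gain = 1.0 if refined_is_correct else -1.0
    discriminability = stage1_reward(critic_verdict, response_is_correct)
    return refinement_gain + alpha * discriminability

# The critic correctly flags a wrong response, and the actor's refinement fixes it.
print(stage2_reward(critic_verdict=False, response_is_correct=False,
                    refined_is_correct=True))
```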
[45] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants
Hunzalah Hassan Bhatti, Firoj Alam
Main category: cs.CL
TL;DR: This paper proposes a comprehensive method to evaluate LLMs on culturally grounded and dialectal content, particularly focusing on Arabic dialects. The approach involves translating questions, converting formats, benchmarking models, and using chain-of-thought reasoning to improve performance.
Details
Motivation: LLMs perform unevenly across languages on culturally grounded and dialectal content, with particular gaps in Arabic dialects. There's a need for better evaluation methods and datasets to address these linguistic and cultural disparities.Method: The method includes: (i) translating Modern Standard Arabic MCQs into English and Arabic dialects, (ii) converting them to open-ended questions, (iii) benchmarking LLMs under both MCQ and OEQ settings, and (iv) generating chain-of-thought rationales for fine-tuning models for step-by-step reasoning.
Result: Key findings: (i) models underperform on Arabic dialects, revealing knowledge gaps; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) chain-of-thought improves judged correctness but yields mixed n-gram-based metrics.
Conclusion: The study highlights persistent gaps in LLMs’ culturally grounded and dialect-specific knowledge, particularly for Arabic dialects. The developed parallel dataset will be publicly released to support further research on linguistically inclusive evaluation.
Abstract: Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
[46] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability
Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin
Main category: cs.CL
TL;DR: LongWeave is a benchmark for evaluating long-form generation in LLMs using Constraint-Verifier Evaluation (CoV-Eval) to balance real-world scenarios with verifiable assessment across 7 tasks with customizable lengths up to 64K/8K tokens.
Details
Motivation: Existing benchmarks for long-form generation either use real-world queries with hard-to-verify metrics or synthetic setups that overlook real-world intricacies, creating a gap in rigorous assessment.Method: Constraint-Verifier Evaluation (CoV-Eval) constructs tasks by defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets.
Result: Evaluation of 23 LLMs shows that even state-of-the-art models face significant challenges in long-form generation as real-world complexity and output length increase.
Conclusion: LongWeave provides a balanced approach for assessing LLM capabilities in long-form generation by combining real-world relevance with objective verifiability, revealing substantial challenges in current models.
Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
[47] Text Simplification with Sentence Embeddings
Matthew Shardlow
Main category: cs.CL
TL;DR: The paper explores using sentence embeddings for text simplification by learning transformations between high and low complexity embeddings, achieving promising results with small models compared to larger approaches.
Details
Motivation: To investigate whether sentence embeddings can be decoded to approximate original texts and preserve complexity levels, enabling efficient text simplification through embedding space transformations.Method: Used a small feed forward neural network to learn transformations between sentence embeddings representing high-complexity and low-complexity texts, compared with Seq2Seq and LLM-based approaches.
Result: The transformation method showed encouraging results in small learning settings, successfully applied to unseen simplification datasets (MedEASI) and languages outside training data (Spanish, German).
Conclusion: Learning transformations in sentence embedding space is a promising direction for developing small but powerful models for text simplification and other natural language generation tasks.
Abstract: Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed-forward neural network to effectively learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We provide a comparison to Seq2Seq and LLM-based approaches, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as datasets from languages outside the training data (ES, DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has the potential to unlock small but powerful models for text simplification and other natural language generation tasks.
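A minimal PyTorch sketch of the central idea: a small feed-forward network that maps a high-complexity sentence embedding to a low-complexity one, trained on paired embeddings with a simple regression loss. The dimensions, depth, and loss here are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EmbeddingSimplifier(nn.Module):
    """Small feed-forward map from complex-text embeddings to simple-text embeddings."""
    def __init__(self, dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Training sketch with placeholder batches of paired (complex, simple) embeddings
# produced by any off-the-shelf sentence encoder.
model = EmbeddingSimplifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
complex_emb = torch.randn(32, 768)   # placeholder complex-sentence embeddings
simple_emb = torch.randn(32, 768)    # placeholder aligned simple-sentence embeddings
loss = nn.functional.mse_loss(model(complex_emb), simple_emb)
loss.backward()
opt.step()
```

At inference time, the transformed embedding would be handed to an embedding-to-text decoder to produce the simplified sentence, in line with the decoding property the abstract describes.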
[48] Comprehensive and Efficient Distillation for Lightweight Sentiment Analysis Models
Guangyu Xie, Yice Zhang, Jianzhu Bao, Qianlong Wang, Yang Sun, Bingbing Wang, Ruifeng Xu
Main category: cs.CL
TL;DR: COMPEFFDIST is a knowledge distillation framework for sentiment analysis that uses automatic instruction generation and data filtering to create efficient 3B models that match 20x larger teacher models’ performance with only 10% of the data.
Details
Motivation: Current sentiment analysis distillation methods face two challenges: limited diversity in manual instructions and high computational costs from large-scale user texts, making them impractical.Method: The framework has two modules: attribute-based automatic instruction construction to generate diverse instructions, and difficulty-based data filtering to reduce computational costs by selecting only the most useful training data.
Result: 3B student models achieved performance comparable to 20x larger teacher models on most tasks, and the method showed superior data efficiency, reaching the same performance level with only 10% of the data compared to baselines.
Conclusion: COMPEFFDIST provides an effective and efficient distillation framework that addresses key limitations in current sentiment analysis methods, enabling practical deployment of lightweight models with minimal performance loss.
Abstract: Recent efforts leverage knowledge distillation techniques to develop lightweight and practical sentiment analysis models. These methods are grounded in human-written instructions and large-scale user texts. Despite the promising results, two key challenges remain: (1) manually written instructions are limited in diversity and quantity, making them insufficient to ensure comprehensive coverage of distilled knowledge; (2) large-scale user texts incur high computational cost, hindering the practicality of these methods. To this end, we introduce COMPEFFDIST, a comprehensive and efficient distillation framework for sentiment analysis. Our framework consists of two key modules: attribute-based automatic instruction construction and difficulty-based data filtering, which correspondingly tackle the aforementioned challenges. Applying our method across multiple model series (Llama-3, Qwen-3, and Gemma-3), we enable 3B student models to match the performance of 20x larger teacher models on most tasks. In addition, our approach greatly outperforms baseline methods in data efficiency, attaining the same performance level with only 10% of the data.
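The difficulty-based filtering module can be pictured as below: score each candidate training example by how hard it currently is for the student (here a hypothetical per-example loss field) and keep only the hardest fraction. The scoring criterion and field names are assumptions for illustration, not the framework's actual interface.

```python
def filter_by_difficulty(examples, difficulty_fn, keep_ratio: float = 0.1):
    """Keep only the most informative (hardest) fraction of candidate distillation data."""
    scored = sorted(examples, key=difficulty_fn, reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return scored[:n_keep]

# Usage sketch: difficulty_fn could be the student's loss on the teacher's output,
# so that easy, already-mastered examples are dropped before distillation.
dataset = [{"text": "review A", "student_loss": 0.2},
           {"text": "review B", "student_loss": 1.5},
           {"text": "review C", "student_loss": 0.7}]
hard_subset = filter_by_difficulty(dataset, lambda ex: ex["student_loss"], keep_ratio=0.34)
```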
[49] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models
Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff
Main category: cs.CL
TL;DR: SynthWorlds is a framework that disentangles reasoning from factual knowledge by creating parallel real and synthetic worlds with identical structure, enabling precise evaluation of language models’ reasoning abilities.
Details
Motivation: Current benchmarks fail to separate reasoning from factual recall, as models can exploit parametric knowledge rather than demonstrate genuine reasoning skills.Method: Create parallel corpora with real-mapped and synthetic-mapped worlds having identical interconnected structure, then design mirrored tasks (multi-hop QA and page navigation) that maintain equal reasoning difficulty across both worlds.
Result: Experiments show a persistent knowledge advantage gap where models gain performance boost from memorized parametric knowledge, and knowledge acquisition/integration mechanisms reduce but don’t eliminate this gap.
Conclusion: SynthWorlds provides a controlled, scalable environment for precise evaluation of reasoning vs memorization in language models, enabling testable comparisons that were previously challenging.
Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
[50] LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
Julian Valline, Cedric Lothritz, Jordi Cabot
Main category: cs.CL
TL;DR: LuxIT is a new monolingual instruction tuning dataset for Luxembourgish created to address the lack of high-quality training data in low-resource languages, but fine-tuning smaller LLMs on it yields mixed results in language proficiency tests.
Details
Motivation: To overcome the limitation of instruction-tuned LLMs in low-resource linguistic settings like Luxembourgish, where high-quality training data is scarce.Method: Synthesized a Luxembourgish instruction dataset from native texts using DeepSeek-R1-0528, applied quality assurance via LLM-as-a-judge, and fine-tuned several smaller LLMs on the resulting LuxIT dataset.
Result: Benchmarking on Luxembourgish language proficiency exams showed mixed results with significant performance variations across different models, indicating inconsistent effectiveness.
Conclusion: LuxIT is a valuable contribution to Luxembourgish NLP and provides a replicable methodology, but further research is needed to optimize its application for better model performance.
Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its demonstrated proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
[51] Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide
Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer
Main category: cs.CL
TL;DR: A comprehensive survey of methods for fine-tuning large language models in data-scarce scenarios, covering parameter-efficient techniques, domain adaptation, and preference alignment approaches.
Details
Motivation: Fine-tuning LLMs with limited data is challenging in low-resource languages, specialized domains, and constrained deployment settings, requiring efficient adaptation techniques.Method: Systematic review of parameter-efficient fine-tuning, domain and cross-lingual adaptation methods for encoder/decoder models, model specialization strategies, and preference alignment approaches using limited feedback.
Result: The survey provides empirical trade-offs, selection criteria, and best practices for choosing suitable techniques based on task constraints like model scaling, data scaling, and mitigating catastrophic forgetting.
Conclusion: The paper aims to equip researchers and practitioners with actionable insights for effectively fine-tuning LLMs when data and resources are limited.
Abstract: Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective adaptation under data scarcity requires focused and efficient fine-tuning techniques. This paper presents a structured and practical survey of recent methods for fine-tuning LLMs in data-scarce scenarios. We systematically review parameter-efficient fine-tuning techniques that lower training and deployment costs, domain and cross-lingual adaptation methods for both encoder and decoder models, and model specialization strategies. We further examine preference alignment approaches that guide model behavior using limited human or synthetic feedback, emphasizing sample and compute efficiency. Throughout, we highlight empirical trade-offs, selection criteria, and best practices for choosing suitable techniques based on task constraints, including model scaling, data scaling, and the mitigation of catastrophic forgetting. The aim is to equip researchers and practitioners with actionable insights for effectively fine-tuning LLMs when data and resources are limited.
[52] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space
Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro
Main category: cs.CL
TL;DR: The paper introduces SPARTA, a black-box adversarial paraphrasing method that generates semantically equivalent but grammatically correct text queries to degrade segmentation performance in multimodal large language models, achieving 2x higher success rates than prior methods.
Details
Motivation: Prior work has focused on perturbing image inputs for MLLMs, but semantically equivalent textual paraphrases - crucial for real-world applications where users express the same intent in varied ways - remain underexplored.Method: SPARTA is a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning, to generate adversarial paraphrases.
Result: SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets, revealing that advanced reasoning segmentation models remain vulnerable to adversarial paraphrasing.
Conclusion: The study demonstrates that current reasoning segmentation models are vulnerable to adversarial paraphrasing attacks even under strict semantic and grammatical constraints, highlighting the need for improved robustness in multimodal language models.
Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases, crucial in real-world applications where users express the same intent in varied ways, remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
[53] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices
Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić
Main category: cs.CL
TL;DR: This paper proposes a new taxonomy and best practices for multilingual LLM benchmarking, focusing on improving language and culture sensitivity in evaluations for European languages.
Details
Motivation: Current LLM benchmarks are predominantly English-focused, leaving non-English languages and multilingual scenarios underdeveloped and poorly evaluated despite growing LLM capabilities.Method: The authors provide an overview of recent LLM benchmarking developments and propose a new taxonomy specifically designed for multilingual/non-English scenarios, along with best practices for coordinated benchmark development.
Result: The paper establishes a framework for categorizing multilingual benchmarks and recommends quality standards to improve evaluation methods for European languages.
Conclusion: There is a need for more coordinated development of multilingual benchmarks with higher sensitivity to language and cultural differences, particularly for European languages.
Abstract: While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
[54] Iterative Critique-Refine Framework for Enhancing LLM Personalization
Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
Main category: cs.CL
TL;DR: PerFine is a training-free critique-refine framework that enhances personalized text generation through iterative, profile-grounded feedback between generator and critic LLMs.
Details
Motivation: Existing retrieval-augmented approaches for personalized text generation often produce outputs that drift in tone, topic, or style, failing to maintain proper alignment with target users.Method: Uses iterative feedback loop: generator produces drafts conditioned on retrieved profiles, critic provides structured feedback on tone/vocabulary/structure/topicality, generator revises with knockout strategy to retain stronger drafts. Also explores Best-of-N and Topic Extraction.
Result: Consistent improvements over PGraphRAG across Yelp, Goodreads, and Amazon datasets with GEval gains of +7-13%, steady improvements over 3-5 iterations, and scalability with critic size.
Conclusion: Profile-aware feedback offers a powerful training-free, model-agnostic paradigm for personalized LLM generation that effectively maintains user alignment.
Abstract: Personalized text generation requires models not only to produce coherent text but also to align with a target user’s style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
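The critique-refine loop with the knockout strategy can be sketched as the control flow below, assuming placeholder `generate`, `critique`, and `score` callables; this illustrates the iteration structure only and is not the authors' code.

```python
def perfine_loop(generate, critique, score, profile, query, n_iters: int = 3):
    """Profile-grounded generate -> critique -> revise loop; knockout keeps the stronger draft."""
    best = generate(query, profile)                       # initial draft conditioned on the profile
    for _ in range(n_iters):
        feedback = critique(query, best, profile)         # structured, profile-aware feedback
        candidate = generate(query, profile, draft=best, feedback=feedback)
        if score(candidate, profile) > score(best, profile):
            best = candidate                              # knockout: retain the stronger draft
    return best
```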
[55] Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems
Yihan Li, Xiyuan Fu, Ghanshyam Verma, Paul Buitelaar, Mingming Liu
Main category: cs.CL
TL;DR: This survey systematically examines how Retrieval-Augmented Generation (RAG) and reasoning enhancement work synergistically to mitigate hallucinations in large language models, proposing a taxonomy and unified framework.
Details
Motivation: Hallucination remains a key obstacle for reliable LLM deployment. While RAG and reasoning enhancement are effective mitigation strategies, their synergistic potential and underlying mechanisms haven't been systematically examined.Method: Adopts an application-oriented perspective to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. Proposes a taxonomy distinguishing knowledge-based and logic-based hallucinations.
Result: Systematically examines how RAG and reasoning address each type of hallucination and presents a unified framework supported by real-world applications, evaluations, and benchmarks.
Conclusion: The survey provides a comprehensive analysis of how RAG and reasoning enhancement can work together to balance creativity and reliability in LLMs, marking a shift from merely suppressing hallucinations to achieving better overall performance.
Abstract: Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.
[56] Talk2Ref: A Dataset for Reference Prediction from Scientific Talks
Frederik Broy, Maike Züfle, Jan Niehues
Main category: cs.CL
TL;DR: Introduces Reference Prediction from Talks (RPT) task and Talk2Ref dataset for mapping scientific presentations to relevant papers, showing that fine-tuning on the dataset significantly improves citation prediction performance.
Details
Motivation: Scientific talks are growing as a dissemination medium, and automatically identifying relevant literature that grounds or enriches talks would be valuable for researchers and students.Method: Created Talk2Ref dataset with 6,279 talks and 43,429 cited papers, evaluated state-of-the-art text embedding models in zero-shot retrieval, and proposed a dual-encoder architecture trained on the dataset with strategies for handling long transcripts and domain adaptation.
Result: Fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of learning semantic representations from spoken scientific content.
Conclusion: The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
Abstract: Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk’s corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
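A compact sketch of the kind of objective dual-encoder retrievers of this sort are typically trained with: talk and paper embeddings are scored by similarity and optimized with an in-batch contrastive loss. The encoder choice, dimensionality, and temperature are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(talk_emb: torch.Tensor, paper_emb: torch.Tensor,
                              temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: each talk should match its own cited paper within the batch."""
    talk_emb = F.normalize(talk_emb, dim=-1)
    paper_emb = F.normalize(paper_emb, dim=-1)
    logits = talk_emb @ paper_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))          # diagonal entries are the positive pairs
    return F.cross_entropy(logits, targets)

# Usage sketch with placeholder embeddings from any sentence encoder.
loss = in_batch_contrastive_loss(torch.randn(8, 384), torch.randn(8, 384))
```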
[57] A word association network methodology for evaluating implicit biases in LLMs compared to humans
Katherine Abramski, Giulio Rossetti, Massimo Stella
Main category: cs.CL
TL;DR: A novel word association network methodology for evaluating implicit biases in LLMs using semantic priming simulations, enabling direct comparisons between LLMs and humans across various social bias dimensions.
Details
Motivation: As LLMs become increasingly integrated into daily life, their inherent implicit social biases remain a pressing concern that requires effective evaluation methods to assess their implicit knowledge representations.Method: Prompt-based approach using word association networks that simulate semantic priming to tap into implicit relational structures in LLMs, providing both quantitative and qualitative bias assessments.
Result: Revealed both convergences and divergences between LLM and human biases across gender, religion, ethnicity, sexual orientation, and political party dimensions, offering new perspectives on LLM risks.
Conclusion: The methodology provides a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing transparent and socially responsible language technologies.
Abstract: As Large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.
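As a toy illustration of the general mechanism (not the paper's protocol), word associations elicited from an LLM can be assembled into a network, and priming strength between a prime and a target approximated by their proximity in that network. The example cues, responses, and the inverse-distance proxy below are all illustrative assumptions.

```python
import networkx as nx

def build_association_network(cue_to_responses: dict[str, list[str]]) -> nx.Graph:
    """Build a word association network from cue -> associated-word lists (e.g. LLM outputs)."""
    g = nx.Graph()
    for cue, responses in cue_to_responses.items():
        for word in responses:
            g.add_edge(cue, word)
    return g

def priming_proximity(g: nx.Graph, prime: str, target: str) -> float:
    """Crude proxy for semantic priming strength: inverse shortest-path distance in the network."""
    try:
        return 1.0 / nx.shortest_path_length(g, prime, target)
    except (nx.NetworkXNoPath, nx.NodeNotFound):
        return 0.0

g = build_association_network({"nurse": ["hospital", "care", "patient"],
                               "engineer": ["math", "machine", "design"]})
print(priming_proximity(g, "nurse", "patient"))
```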
[58] CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?
Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song
Main category: cs.CL
TL;DR: Natural language critiques enhance LLM confidence calibration, with confidence critiques for multiple-choice tasks and uncertainty critiques for open-ended scenarios. CritiCal training method outperforms Self-Critique and baselines.
Details
Motivation: Accurate confidence calibration in LLMs is critical for safe use in high-stakes domains, as traditional methods fail to capture reasoning needed for accurate confidence assessment.Method: Proposes Self-Critique for LLMs to critique and optimize confidence, and CritiCal - a novel Critique Calibration training method using natural language critiques to improve calibration beyond numerical optimization.
Result: CritiCal significantly outperforms Self-Critique and other baselines, even surpassing GPT-4o in complex reasoning tasks, and shows robust generalization in out-of-distribution settings.
Conclusion: Natural language critiques effectively enhance LLM confidence calibration, with CritiCal advancing LLM reliability through improved calibration performance.
Abstract: Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing LLM’s reliability.
[59] Levée d’ambiguïtés par grammaires locales (Disambiguation by Local Grammars)
Eric G. C. Laporte
Main category: cs.CL
TL;DR: A method for lexical disambiguation that aims for zero silence rate (never discarding correct POS tags) using local disambiguation grammars in INTEX system, requiring careful verification of transducer interactions.
Details
Motivation: Many words are ambiguous in POS but context reduces ambiguity. Lexical tagging is crucial for NLP applications like spelling correction, grammar checking, and text analysis. Recent systems aim for zero silence rate but this is unrealistic for unique tagging systems.Method: Formal description of lexical disambiguation method in Silberztein’s INTEX system using local disambiguation grammars and transducers. Requires verifying transducer interactions and combinations, not just individual paths.
Result: Shows that local grammars must be carefully tested when targeting zero silence rate. Grammatical intuitions may be inaccurate due to unforeseen constructions or ambiguities.
Conclusion: A detailed specification of grammar behavior when applied to texts is necessary for achieving zero silence rate in lexical disambiguation.
Abstract: Many words are ambiguous in terms of their part of speech (POS). However, when a word appears in a text, this ambiguity is generally much reduced. Disambiguating POS involves using context to reduce the number of POS associated with words, and is one of the main challenges of lexical tagging. The problem of labeling words by POS frequently arises in natural language processing, for example for spelling correction, grammar or style checking, expression recognition, text-to-speech conversion, text corpus analysis, etc. Lexical tagging systems are thus useful as an initial component of many natural language processing systems. A number of recent lexical tagging systems produce multiple solutions when the text is lexically ambiguous or the uniquely correct solution cannot be found. These contributions aim to guarantee a zero silence rate: the correct tag(s) for a word must never be discarded. This objective is unrealistic for systems that tag each word uniquely. This article concerns a lexical disambiguation method adapted to the objective of a zero silence rate and implemented in Silberztein’s INTEX system (1993). We present here a formal description of this method. We show that to verify a local disambiguation grammar in this framework, it is not sufficient to consider the transducer paths separately: one needs to verify their interactions. Similarly, if a combination of multiple transducers is used, the result cannot be predicted by considering them in isolation. Furthermore, when examining the initial labeling of a text as produced by INTEX, ideas for disambiguation rules come spontaneously, but grammatical intuitions may turn out to be inaccurate, often due to an unforeseen construction or ambiguity. If a zero silence rate is targeted, local grammars must be carefully tested. This is where a detailed specification of what a grammar will do once applied to texts would be necessary.
[60] Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written
Venkata S Govindarajan, Laura Biester
Main category: cs.CL
TL;DR: Analysis of intentionally bad humor from the Bulwer-Lytton Fiction Contest reveals that standard humor detection models fail on this corpus, and LLMs over-exaggerate literary devices when generating similar content.
Details
Motivation: To understand the diversity of textual humor, including intentionally bad humor, by analyzing sentences from the Bulwer-Lytton Fiction Contest.Method: Curated and analyzed a novel corpus of sentences from the Bulwer-Lytton Fiction Contest, evaluated standard humor detection models, analyzed literary devices, and tested LLMs for generating contest-style sentences.
Result: Standard humor detection models performed poorly on the corpus. Human-written sentences combine puns, irony with metaphor, metafiction and simile. LLMs imitate the form but exaggerate effects by over-using literary devices and creating more novel adjective-noun bigrams than humans.
Conclusion: Intentionally bad humor presents unique challenges for computational models, with LLMs failing to capture the nuanced balance of literary devices that human writers achieve.
Abstract: Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand “bad” humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton
[61] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh
Main category: cs.CL
TL;DR: The Open Korean Historical Corpus is a large-scale dataset spanning 1,300 years that enables quantitative analysis of Korean linguistic evolution, revealing key shifts like the rapid transition from Hanja to Hangul starting around 1890 and North Korea’s lexical divergence.
Details
Motivation: To address the lack of accessible historical corpora for studying Korean linguistic evolution, particularly the discrepancy between spoken and written forms and the shift from Chinese characters to Hangul alphabet.Method: Created the Open Korean Historical Corpus containing 18 million documents and 5 billion tokens from 19 sources spanning 7th century to 2025, covering 6 languages and under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script.
Result: Quantitative analysis revealed: (1) Idu usage peaked in 1860s then declined sharply; (2) rapid transition from Hanja to Hangul starting around 1890; (3) North Korea’s lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates.
Conclusion: This corpus provides a foundational resource for quantitative diachronic analysis of Korean language history and can serve as pre-training corpus for large language models to improve understanding of Sino-Korean vocabulary and archaic writing systems.
Abstract: The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea’s lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
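The out-of-vocabulary comparison can be approximated with any Hugging Face-style tokenizer that exposes `encode` and `unk_token_id`; the snippet below is a generic sketch of such a measurement, not the paper's exact procedure, and the corpus variable names are hypothetical.

```python
def oov_rate(texts, tokenizer) -> float:
    """Fraction of tokens mapped to the tokenizer's unknown token across a corpus."""
    total, unknown = 0, 0
    for text in texts:
        ids = tokenizer.encode(text)
        total += len(ids)
        unknown += sum(1 for i in ids if i == tokenizer.unk_token_id)
    return unknown / max(total, 1)

# Comparing oov_rate(south_korean_texts, tok) with oov_rate(north_korean_texts, tok)
# would surface the kind of lexical divergence the paper reports.
```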
[62] BEST-RQ-Based Self-Supervised Learning for Whisper Domain Adaptation
Raphaël Bagat, Irina Illina, Emmanuel Vincent
Main category: cs.CL
TL;DR: BEARD is a novel framework that adapts Whisper’s encoder using unlabeled data through BEST-RQ objective and knowledge distillation, achieving 12% relative improvement in Air Traffic Control domain adaptation.
Details
Motivation: ASR systems struggle with out-of-domain and low-resource scenarios where labeled data is scarce, particularly in challenging domains like Air Traffic Control with non-native speech, noise, and specialized phraseology.Method: BEARD combines BEST-RQ objective with knowledge distillation from a frozen teacher encoder to adapt Whisper’s encoder using unlabeled data, ensuring complementarity with the pre-trained decoder. Uses 5,000 hours of untranscribed speech for adaptation and 2 hours of transcribed speech for fine-tuning.
Result: Significantly outperforms previous baseline and fine-tuned models, achieving 12% relative improvement compared to the fine-tuned model on the ATCO2 corpus from Air Traffic Control domain.
Conclusion: BEARD successfully demonstrates domain adaptation of Whisper using self-supervised learning, representing the first work to use such objectives for Whisper adaptation in challenging low-resource scenarios.
Abstract: Automatic Speech Recognition (ASR) systems, despite large multilingual training, struggle in out-of-domain and low-resource scenarios where labeled data is scarce. We propose BEARD (BEST-RQ Encoder Adaptation with Re-training and Distillation), a novel framework designed to adapt Whisper’s encoder using unlabeled data. Unlike traditional self-supervised learning methods, BEARD uniquely combines a BEST-RQ objective with knowledge distillation from a frozen teacher encoder, ensuring the encoder’s complementarity with the pre-trained decoder. Our experiments focus on the ATCO2 corpus from the challenging Air Traffic Control (ATC) communications domain, characterized by non-native speech, noise, and specialized phraseology. Using about 5,000 hours of untranscribed speech for BEARD and 2 hours of transcribed speech for fine-tuning, the proposed approach significantly outperforms the previous baseline and fine-tuned model, achieving a relative improvement of 12% compared to the fine-tuned model. To the best of our knowledge, this is the first work to use a self-supervised learning objective for domain adaptation of Whisper.
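A rough sketch of how the two training signals could be combined, assuming per-frame student logits over BEST-RQ-style random-quantizer codes and hidden features from both the student and a frozen teacher encoder; the exact losses and weighting used in BEARD may differ.

```python
import torch
import torch.nn.functional as F

def beard_style_loss(student_logits, codebook_targets, student_feats, teacher_feats,
                     mask, distill_weight: float = 1.0):
    """Sketch: BEST-RQ-style masked-prediction loss plus distillation to a frozen teacher encoder."""
    # Masked prediction: classify the random-quantizer code of each masked frame.
    pred_loss = F.cross_entropy(student_logits[mask], codebook_targets[mask])
    # Distillation: keep the adapted encoder close to the frozen teacher's representations.
    distill_loss = F.mse_loss(student_feats, teacher_feats)
    return pred_loss + distill_weight * distill_loss

# Placeholder shapes: 100 frames, 8192 codes, 512-dim encoder features.
logits = torch.randn(100, 8192)
targets = torch.randint(0, 8192, (100,))
mask = torch.rand(100) < 0.4            # frames that were masked for the BEST-RQ objective
loss = beard_style_loss(logits, targets, torch.randn(100, 512), torch.randn(100, 512), mask)
```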
[63] ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?
Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca
Main category: cs.CL
TL;DR: ReplicationBench is an evaluation framework that tests AI agents’ ability to replicate entire astrophysics research papers, revealing current models score under 20% and identifying diverse failure modes in scientific research workflows.
Details
Motivation: To assess the faithfulness and correctness of AI agents as scientific research assistants before they can be used for novel research, particularly in data-driven domains like astrophysics.Method: Split astrophysics papers into tasks requiring replication of core contributions (experimental setup, derivations, data analysis, codebase), co-developed with original authors to target key scientific results for objective evaluation.
Result: Current frontier language models perform poorly on ReplicationBench, with the best models scoring under 20%, revealing a rich set of failure modes in scientific research tasks.
Conclusion: ReplicationBench provides the first benchmark for paper-scale astrophysics research tasks, offers insights generalizable to other data-driven sciences, and establishes a scalable framework for measuring AI agents’ scientific reliability.
Abstract: Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper’s core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents’ reliability in scientific research.
[64] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization
Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
Main category: cs.CL
TL;DR: ReForm introduces reflective autoformalization with iterative refinement and semantic consistency evaluation to improve translation of natural language math to formal statements, achieving 17.2% improvement over baselines.
Details
Motivation: Current LLM approaches treat autoformalization as simple translation without self-reflection mechanisms, leading to semantic inconsistencies despite syntactic correctness.Method: ReForm integrates semantic consistency evaluation into autoformalization process with iterative generation and self-correction, trained using Prospective Bounded Sequence Optimization (PBSO) with position-specific rewards.
Result: Achieves 17.2 percentage point average improvement over strongest baselines across four autoformalization benchmarks. Human experts also make semantic errors in up to 38.5% of cases.
Conclusion: Reflective autoformalization with iterative refinement significantly improves semantic fidelity, and autoformalization remains inherently challenging even for human experts.
Abstract: Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem’s semantic intent. This limitation arises from the LLM approaches’ treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
[65] Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way
Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, Linfeng Zhang
Main category: cs.CL
TL;DR: Proposes dLLM-Var, a diffusion-based LLM that supports variable generation lengths by predicting EOS tokens, achieving 30.1x speedup over traditional dLLMs and 2.4x over autoregressive models.
Details
Motivation: Current diffusion LLMs require fixed generation lengths determined before decoding, causing efficiency and flexibility issues.Method: Train diffusion LLM to predict [EOS] tokens, enabling block diffusion with global bi-directional attention while maintaining high parallelism.
Result: Achieves 30.1x speedup over traditional dLLM inference and 2.4x speedup over autoregressive models like Qwen and Llama, with higher accuracy.
Conclusion: Method enables practical use of dLLMs in real-world applications by solving fixed-length limitations while maintaining performance advantages.
Abstract: Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths, which indicates the generation lengths of dLLMs have to be determined before decoding as a hyper-parameter, leading to issues in efficiency and flexibility. To solve these problems, in this work, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we aim to train a model to accurately predict the [EOS] token in the generated text, which makes a dLLM be able to natively infer in a block diffusion manner, while still maintaining the ability of global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Codes and models have been released.
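The variable-length behaviour rests on detecting a predicted [EOS] inside a decoded block and stopping there instead of filling a fixed budget. A toy control-flow sketch, with `decode_block` standing in for one block-diffusion step over the next block of tokens:

```python
def decode_until_eos(decode_block, eos_id: int, max_blocks: int = 64) -> list[int]:
    """Block-wise decoding that stops once the model emits [EOS], instead of a fixed length."""
    output: list[int] = []
    for _ in range(max_blocks):
        block = decode_block(output)              # one block-diffusion step conditioned on output
        if eos_id in block:
            output.extend(block[:block.index(eos_id) + 1])
            break
        output.extend(block)
    return output
```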
[66] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs
Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho
Main category: cs.CL
TL;DR: DHSA is a dynamic hierarchical sparse attention method that reduces computational costs while maintaining accuracy for long-context LLMs by adaptively segmenting sequences and predicting attention sparsity online.
Details
Motivation: Existing static sparse attention methods poorly adapt to content-dependent variations, and dynamic approaches rely on predefined templates that limit generality and accuracy across diverse tasks.Method: DHSA dynamically segments sequences into variable-length chunks, computes chunk representations with length-normalized aggregation, upsamples chunk-level similarities to token level to calculate importance scores, and preserves important token-level interactions.
Result: DHSA matches dense attention accuracy while reducing prefill latency by 20-60% and peak memory usage by 35%. It achieves 6-18% higher accuracy than block sparse attention with comparable or lower cost.
Conclusion: DHSA offers an efficient and adaptable solution for long-context on-device LLMs by dynamically predicting attention sparsity without retraining, maintaining accuracy while significantly reducing computational costs.
Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens utilize the sparsity of attention to reduce its cost, but adapt poorly to content-dependent variations in attention because they are static. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token-level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
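Two of the concrete steps above, length-normalized chunk aggregation and upsampling chunk similarities back to the token level, can be sketched in a few lines of PyTorch. The chunk boundaries and dimensions below are placeholders; the actual method segments sequences dynamically rather than taking boundaries as given.

```python
import torch

def chunk_representations(token_emb: torch.Tensor, chunk_sizes: list[int]) -> torch.Tensor:
    """Length-normalized aggregation: mean of each chunk's token embeddings scaled by sqrt(chunk size)."""
    reps, start = [], 0
    for size in chunk_sizes:
        chunk = token_emb[start:start + size]
        reps.append(chunk.mean(dim=0) * (size ** 0.5))
        start += size
    return torch.stack(reps)

def token_level_scores(chunk_sim: torch.Tensor, chunk_sizes: list[int]) -> torch.Tensor:
    """Upsample a chunk x chunk similarity matrix to token x token importance scores."""
    idx = torch.repeat_interleave(torch.arange(len(chunk_sizes)), torch.tensor(chunk_sizes))
    return chunk_sim[idx][:, idx]

emb = torch.randn(10, 64)                       # 10 tokens, hidden size 64
reps = chunk_representations(emb, [4, 3, 3])    # three variable-length chunks
scores = token_level_scores(reps @ reps.T, [4, 3, 3])   # (10, 10) token-level importance
```

The resulting token-level scores would then be thresholded or top-k selected to decide which attention interactions to keep.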
[67] Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation
Snegha A, Sayambhu Sen, Piyush Singh Pasi, Abhishek Singhania, Preethi Jyothi
Main category: cs.CL
TL;DR: Prefix-based methods outperform LoRA for zero-shot cross-lingual transfer in decoder-only LLMs, achieving up to 6% improvement on Belebele benchmark with minimal parameters.
Details
Motivation: To address the challenge of adapting decoder-only LLMs to new tasks across languages and explore underutilized prefix-based techniques for zero-shot cross-lingual transfer.Method: Comprehensive study of three prefix-based methods (soft prompt tuning, prefix tuning, Llama Adapter) for zero-shot transfer from English to 35+ languages, analyzing transfer across linguistic families and model scaling from 1B to 24B parameters.
Result: Prefix methods outperform LoRA baselines by up to 6% on Belebele benchmark with Llama 3.1 8B, similar improvements with Mistral v0.3 7B, achieving consistent improvements across diverse benchmarks using only 1.23M parameters.
Conclusion: Prefix-based techniques are an effective and scalable alternative to LoRA, particularly valuable in low-resource multilingual settings.
Abstract: With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learnable parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
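For intuition, here is a minimal soft-prompt-tuning sketch in the spirit of the prefix-based methods studied: a small set of trainable virtual-token embeddings is prepended to the input embeddings of a frozen decoder-only LM. The gpt2 checkpoint, prefix length, and prompt text are placeholders, not the paper's Llama/Mistral setup. Zero-shot transfer then amounts to training the prefix on English task data and reusing it unchanged on target-language inputs.

```python
# Sketch: soft prompt tuning on a frozen causal LM (placeholder model and sizes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():
    p.requires_grad = False                                   # only the prefix is trained

n_virtual = 20
emb_dim = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(torch.randn(n_virtual, emb_dim) * 0.02)

def forward_with_prefix(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    tok_emb = model.get_input_embeddings()(ids)               # (1, T, d)
    inputs = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    return model(inputs_embeds=inputs).logits                 # gradients flow only into soft_prompt

logits = forward_with_prefix("Question: ... Answer:")
```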
[68] Relative Scaling Laws for LLMs
William Held, David Hall, Percy Liang, Diyi Yang
Main category: cs.CL
TL;DR: The paper introduces relative scaling laws to track performance gaps between test distributions as models scale, showing that while scaling improves overall performance, it doesn’t equalize all disparities.
Details
Motivation: Traditional scaling laws focus on aggregate test sets, which average over heterogeneous subpopulations and obscure performance disparities between different distributions.Method: Used 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from 10^18-10^20 FLOPs on standard pretraining datasets, analyzing how performance gaps evolve with scale.
Result: Found diverse scaling trajectories: academic domains converge toward parity, regional dialects shift based on population size, and AI risk behaviors split with capability/influence risks increasing while adversarial risks don’t.
Conclusion: Scaling improves overall performance but is not a universal equalizer; relative scaling laws help identify robustness challenges that traditional scaling laws miss.
Abstract: Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$–$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative scaling laws alongside traditional ones, in order to better prioritize robustness challenges in light of the bitter lesson.
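One way to make the idea concrete, as a hedged sketch with synthetic numbers rather than the released checkpoints: fit a simple power law per test distribution over the IsoFLOP points, then track the ratio of predicted losses across compute.

```python
# Sketch: per-distribution power-law fits and the relative gap across compute.
import numpy as np

def fit_power_law(flops: np.ndarray, loss: np.ndarray) -> tuple[float, float]:
    slope, log_a = np.polyfit(np.log(flops), np.log(loss), deg=1)  # log L = log a + slope * log C
    return float(np.exp(log_a)), float(-slope)                     # loss(C) = a * C**(-b)

rng = np.random.default_rng(0)
flops = np.logspace(18, 20, num=8)
loss_A = 3.0 * flops ** -0.06 + rng.normal(0, 0.01, 8)   # e.g. one MMLU domain (synthetic)
loss_B = 3.6 * flops ** -0.05 + rng.normal(0, 0.01, 8)   # e.g. a dialect subset (synthetic)

a1, b1 = fit_power_law(flops, loss_A)
a2, b2 = fit_power_law(flops, loss_B)
grid = np.logspace(18, 21, num=50)
relative_gap = (a2 * grid ** -b2) / (a1 * grid ** -b1)    # flat ratio = gap preserved; trend shows convergence or divergence
```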
[69] “Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloe Clavel
Main category: cs.CL
TL;DR: A multimodal model for detecting repair initiation in Dutch dialogues using linguistic and prosodic features, showing prosodic cues complement text features and improve performance.
Details
Motivation: Conversational agents often fail to recognize user repair initiation, leading to conversation breakdowns and disengagement. This work addresses the need for better OIR detection to improve CA performance.Method: Proposes a multimodal model integrating linguistic and prosodic features grounded in Conversation Analysis to automatically detect repair initiation in Dutch dialogues.
Result: Prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, revealing how different features interact.
Conclusion: The study demonstrates the value of multimodal approaches for repair initiation detection and suggests future directions including visual cues, multilingual corpora, and cross-context analysis for improved robustness and generalizability.
Abstract: Maintaining mutual understanding is a key component of human-human conversation and helps avoid breakdowns; repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve it), plays a vital role in this. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues and exploring multilingual and cross-context corpora to assess robustness and generalizability.
[70] OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning
Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan Verberne, Zhaochun Ren
Main category: cs.CL
TL;DR: OpenRM is a tool-augmented reward model that uses external tools to gather evidence for evaluating knowledge-intensive and long-form LLM responses, outperforming existing approaches and improving downstream alignment tasks.
Details
Motivation: Existing reward models struggle with knowledge-intensive and long-form tasks where evaluating correctness requires external evidence beyond the model's internal knowledge, limiting their ability to reliably discriminate subtle quality differences.Method: Train OpenRM with Group Relative Policy Optimization (GRPO) on 27K+ synthesized pairwise examples, jointly supervising tool usage and outcome accuracy. The model systematically invokes external tools to gather relevant evidence for judgment.
Result: OpenRM substantially outperforms existing reward modeling approaches on three new datasets and two benchmarks, and integration into inference-time response selection and training-time data selection yields consistent gains in downstream LLM alignment.
Conclusion: Tool-augmented reward models like OpenRM have strong potential for scaling reliable long-form evaluation in LLM alignment, demonstrating the effectiveness of evidence-based judgment strategies.
Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model’s internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
[71] Quantifying the Effects of Word Length, Frequency, and Predictability on Dyslexia
Hugo Rydel-Johnston, Alex Kafkas
Main category: cs.CL
TL;DR: Dyslexic readers show stronger sensitivity to word features (length, frequency, predictability) than typical readers, with predictability having the strongest effect on reading time differences.
Details
Motivation: To understand where and under what conditions dyslexic reading costs arise in naturalistic reading, and to quantify how word-level features influence these costs.Method: Used eye-tracking aligned to word-level features (word length, frequency, predictability) in a large-scale naturalistic reading dataset to model feature influences on dyslexic time costs.
Result: All three features robustly changed reading times in both groups, with dyslexic readers showing stronger sensitivities, especially to predictability. Counterfactual manipulations narrowed the dyslexic-control gap by about one third.
Conclusion: Patterns align with dyslexia theories involving heightened linguistic working memory and phonological encoding demands, motivating further work on lexical complexity and parafoveal preview to explain remaining gaps.
Abstract: We ask where, and under what conditions, dyslexic reading costs arise in a large-scale naturalistic reading dataset. Using eye-tracking aligned to word-level features (word length, frequency, and predictability), we model how each feature influences dyslexic time costs. We find that all three features robustly change reading times in both typical and dyslexic readers, and that dyslexic readers show stronger sensitivities to each, especially predictability. Counterfactual manipulations of these features substantially narrow the dyslexic-control gap by about one third, with predictability showing the strongest effect, followed by length and frequency. These patterns align with dyslexia theories that posit heightened demands on linguistic working memory and phonological encoding, and they motivate further work on lexical complexity and parafoveal preview benefits to explain the remaining gap. In short, we quantify when extra dyslexic costs arise, how large they are, and offer actionable guidance for interventions and computational models for dyslexics.
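A hedged sketch of the modeling idea, with simulated data in place of the eye-tracking corpus: regress (log) reading time on word length, log frequency, and predictability, with group interactions so dyslexic readers can show stronger sensitivities. Counterfactual manipulations then amount to shifting the feature columns and comparing the predicted group gap.

```python
# Sketch: group-by-feature interactions for reading times (simulated data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "length": rng.integers(2, 12, n),
    "log_freq": rng.normal(0, 1, n),
    "surprisal": rng.gamma(2.0, 2.0, n),      # stand-in for (un)predictability
    "dyslexic": rng.integers(0, 2, n),
})
# simulate a stronger predictability effect for the dyslexic group
df["log_rt"] = (5.5 + 0.02 * df.length - 0.03 * df.log_freq + 0.04 * df.surprisal
                + df.dyslexic * (0.10 + 0.03 * df.surprisal) + rng.normal(0, 0.1, n))

fit = smf.ols("log_rt ~ (length + log_freq + surprisal) * dyslexic", data=df).fit()
print(fit.params.filter(like=":dyslexic"))    # interaction terms = extra dyslexic sensitivity
```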
[72] Optimizing Retrieval for RAG via Reinforced Contrastive Learning
Jiawei Zhou, Lei Chen
Main category: cs.CL
TL;DR: R3 is a retrieval framework optimized for RAG through trial-and-feedback reinforced contrastive learning, enabling dynamic relevance optimization without pre-annotated data.
Details
Motivation: As RAG becomes widespread, IR shifts from retrieving for humans to AI systems where relevance is difficult to define or annotate beforehand.Method: Uses trial-and-feedback reinforced contrastive learning where retrieved results interact with the environment to produce automatic contrastive signals for self-improvement.
Result: Improves RAG performance by 5.2% over original retrievers and surpasses state-of-the-art by 4.9%, achieving comparable results to LLM-augmented systems.
Conclusion: R3 is efficient and practical, requiring only 4 GPUs and completing training within a single day while outperforming existing approaches.
Abstract: As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trial-and-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever’s self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
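As a hedged sketch of the trial-and-feedback loop (the encoder, generator-side reward function, and InfoNCE-style loss below are placeholders, not necessarily the authors' exact formulation): each retrieved passage is tried in the RAG environment, downstream feedback selects the positive, and the retriever is updated contrastively.

```python
# Sketch: one reinforced-contrastive update driven by RAG feedback.
import torch
import torch.nn.functional as F

def reinforced_contrastive_step(encoder, query: str, passages: list[str],
                                rag_reward, optimizer, tau: float = 0.05) -> float:
    rewards = torch.tensor([rag_reward(query, p) for p in passages])   # env feedback per trial
    pos = int(rewards.argmax())                                        # best-performing passage as positive
    q = encoder(query)                                                 # (d,)
    docs = torch.stack([encoder(p) for p in passages])                 # (N, d)
    logits = F.cosine_similarity(q.unsqueeze(0), docs) / tau           # (N,)
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([pos]))   # pull positive, push the rest
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)
```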
[73] Evolving Diagnostic Agents in a Virtual Clinical Environment
Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
Main category: cs.CL
TL;DR: A framework for training LLMs as diagnostic agents using reinforcement learning in a virtual clinical environment, achieving superior performance over state-of-the-art models in diagnostic accuracy and examination recommendations.
Details
Motivation: To develop LLMs that can actively manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses through interactive learning rather than static case summaries.Method: Uses DiagGym (diagnostics world model trained on EHRs) as virtual clinical environment, trains DiagAgent via end-to-end multi-turn reinforcement learning to optimize information yield and diagnostic accuracy, and evaluates on DiagBench benchmark.
Result: DiagAgent significantly outperforms 10 SOTA LLMs including DeepSeek-v3 and GPT-4o, with 9.34% higher diagnostic accuracy and 44.03% improvement in examination recommendation hit ratio in single-turn settings, and 15.12% increase in diagnostic accuracy in end-to-end settings.
Conclusion: Learning policies in interactive clinical environments provides dynamic and clinically meaningful diagnostic management abilities that cannot be achieved through passive training alone.
Abstract: In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnosis process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
[74] MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation
Parker Riley, Daniel Deutsch, Mara Finkelstein, Colten DiIanni, Juraj Juraska, Markus Freitag
Main category: cs.CL
TL;DR: The paper proposes MQM re-annotation, a two-stage version of machine translation evaluation where annotators review and edit existing MQM annotations to improve evaluation quality.
Details
Motivation: As machine translation models improve, evaluation methods need to be enhanced to detect quality gains that might be lost in evaluation noise.Method: Experiments with MQM re-annotation where annotators review and edit pre-existing MQM annotations from various sources (themselves, other humans, or automatic systems).
Result: Rater behavior in re-annotation aligns with goals, and re-annotation produces higher-quality annotations primarily by finding errors missed in the first pass.
Conclusion: MQM re-annotation is an effective method for improving translation evaluation quality by catching missed errors through a two-stage review process.
Abstract: Human evaluation of machine translation is in an arms race with translation model quality: as our models get better, our evaluation methods need to be improved to ensure that quality gains are not lost in evaluation noise. To this end, we experiment with a two-stage version of the current state-of-the-art translation evaluation paradigm (MQM), which we call MQM re-annotation. In this setup, an MQM annotator reviews and edits a set of pre-existing MQM annotations that may have come from the annotator themselves, another human annotator, or an automatic MQM annotation system. We demonstrate that rater behavior in re-annotation aligns with our goals, and that re-annotation results in higher-quality annotations, mostly due to finding errors that were missed during the first pass.
[75] InteractComp: Evaluating Search Agents With Ambiguous Queries
Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
Main category: cs.CL
TL;DR: The paper introduces InteractComp, a benchmark to evaluate search agents’ ability to recognize ambiguous queries and actively interact to resolve them during search, revealing systematic overconfidence and stagnant interaction capabilities despite improved search performance.
Details
Motivation: Current search agents assume complete and unambiguous user queries, which diverges from reality where users often start with incomplete queries requiring clarification through interaction. Most agents lack interactive mechanisms, and existing benchmarks cannot assess this capability.Method: Created InteractComp benchmark with 210 expert-curated questions across 9 domains using a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluated 17 models and conducted longitudinal analysis over 15 months.
Result: Evaluation revealed striking failure: best model achieved only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produced dramatic gains, showing latent capability. Longitudinal analysis showed interaction capabilities stagnated while search performance improved seven-fold.
Conclusion: InteractComp addresses a critical blind spot in search agent evaluation and provides a valuable resource for both evaluating and training interaction capabilities, as the immediate feedback inherent to search tasks makes it particularly suitable for this purpose.
Abstract: Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.
[76] Dissecting Role Cognition in Medical LLMs via Neuronal Ablation
Xun Liang, Huayi Lai, Hanyu Wang, Wentao Zhang, Linfeng Zhang, Yanfang Chen, Feiyu Xiong, Zhiyu Li
Main category: cs.CL
TL;DR: Role prompts in medical LLMs primarily affect linguistic style rather than enhancing reasoning capabilities, with no evidence of distinct cognitive processes across clinical roles.
Details
Motivation: To investigate whether role prompts in medical LLMs induce genuine cognitive differentiation or merely modify surface-level linguistic features.Method: RP-Neuron-Activated Evaluation Framework (RPNA) using neuron ablation and representation analysis on three medical QA datasets.
Result: Role prompts do not significantly enhance medical reasoning abilities; they primarily affect linguistic features without creating distinct reasoning pathways across clinical roles.
Conclusion: Current Prompt-Based Role Playing methods fail to replicate real-world medical cognitive complexity, highlighting the need for models that simulate genuine cognitive processes rather than linguistic imitation.
Abstract: Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation. We have released the related code in the following repository: https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor
[77] SPICE: Self-Play In Corpus Environments Improves Reasoning
Bo Liu, Chuanyang Jin, Seungone Kim, Weizhe Yuan, Wenting Zhao, Ilia Kulikov, Xian Li, Sainbayar Sukhbaatar, Jack Lanchantin, Jason Weston
Main category: cs.CL
TL;DR: SPICE is a reinforcement learning framework where a single model acts as both Challenger (creating tasks from corpus documents) and Reasoner (solving them), enabling continuous self-improvement through adversarial dynamics and corpus grounding.
Details
Motivation: Self-improving systems need environmental interaction for continuous adaptation, and existing ungrounded self-play methods offer limited benefits.Method: A single model plays dual roles: Challenger mines documents from large corpus to generate diverse reasoning tasks, and Reasoner solves them. This creates an automatic curriculum through adversarial dynamics with corpus grounding.
Result: SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families.
Conclusion: Document grounding is a key ingredient in SPICE to continuously generate increasingly challenging goals and achieve them, enabling sustained self-improvement.
Abstract: Self-improving systems require environmental interaction for continuous adaptation. We introduce SPICE (Self-Play In Corpus Environments), a reinforcement learning framework where a single model acts in two roles: a Challenger that mines documents from a large corpus to generate diverse reasoning tasks, and a Reasoner that solves them. Through adversarial dynamics, the Challenger creates an automatic curriculum at the frontier of the Reasoner’s capability, while corpus grounding provides the rich, near-inexhaustible external signal necessary for sustained improvement. Unlike existing ungrounded self-play methods that offer more limited benefits, SPICE achieves consistent gains across mathematical (+8.9%) and general reasoning (+9.8%) benchmarks on multiple model families. Our analysis reveals how document grounding is a key ingredient in SPICE to continuously generate its own increasingly challenging goals and achieve them, enabling sustained self-improvement.
[78] Repurposing Synthetic Data for Fine-grained Search Agent Supervision
Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
Main category: cs.CL
TL;DR: E-GRPO improves LLM-based search agent training by using entity-aware dense rewards to learn from near-miss samples, outperforming GRPO in accuracy and efficiency.
Details
Motivation: Current training methods like GRPO discard entity information and cannot distinguish informative near-miss samples from complete failures, losing valuable learning signals.Method: Proposed Entity-aware Group Relative Policy Optimization (E-GRPO) with dense entity-aware reward function that assigns partial rewards based on entity match rate for incorrect samples.
Result: E-GRPO consistently outperforms GRPO baseline on QA and research benchmarks, achieving superior accuracy and more efficient reasoning with fewer tool calls.
Conclusion: E-GRPO provides a more effective and sample-efficient approach to aligning search agents by leveraging entity information during training.
Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative “near-miss” samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent’s reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these “near-misses”. Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
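A minimal sketch of the dense reward idea: correct rollouts receive full reward, incorrect ones receive partial credit proportional to their entity match rate. The string-matching heuristic and the scaling factor alpha below are assumptions for illustration, not the paper's exact reward.

```python
# Sketch: entity-aware partial reward for "near-miss" rollouts.
def entity_aware_reward(answer_correct: bool, trace: str,
                        gold_entities: set[str], alpha: float = 0.5) -> float:
    if answer_correct:
        return 1.0
    if not gold_entities:
        return 0.0
    match_rate = sum(e.lower() in trace.lower() for e in gold_entities) / len(gold_entities)
    return alpha * match_rate          # near-misses keep a learning signal instead of zero

# Advantages are then computed group-relative, as in GRPO:
#   adv_i = (r_i - mean(r_group)) / (std(r_group) + eps)
```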
[79] AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis
Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Yong Jiang
Main category: cs.CL
TL;DR: The paper introduces AgentFrontier Engine, a data synthesis approach based on Zone of Proximal Development theory to train LLM agents on frontier tasks they can’t solve alone but can master with guidance.
Details
Motivation: To unlock advanced reasoning in LLM agents by training them on tasks at the frontier of their capabilities, inspired by educational theory.Method: AgentFrontier Engine - an automated pipeline that synthesizes high-quality, multidisciplinary data within LLM’s ZPD, supporting both continued pre-training and targeted post-training on complex reasoning tasks.
Result: AgentFrontier-30B-A3B model achieves state-of-the-art results on demanding benchmarks like Humanity’s Last Exam, surpassing some leading proprietary agents.
Conclusion: ZPD-guided data synthesis offers a scalable and effective path toward building more capable LLM agents.
Abstract: Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM’s ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity’s Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.
[80] WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking
Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
Main category: cs.CL
TL;DR: WebLeaper is a framework that improves information seeking efficiency in LLM agents by creating high-coverage tasks and generating efficient solution trajectories through tree-structured reasoning and curated Wikipedia tables.
Details
Motivation: Current information seeking agents suffer from low search efficiency due to sparse target entities in training tasks, which limits their ability to learn efficient search behaviors and constrains overall performance.Method: Formulates information seeking as a tree-structured reasoning problem, uses curated Wikipedia tables to synthesize three task variants (Basic, Union, Reverse-Union), and curates training trajectories that are both accurate and efficient.
Result: Extensive experiments on five benchmarks (BrowserComp, GAIA, xbench-DeepSearch, WideSearch, Seal-0) show consistent improvements in both effectiveness and efficiency over strong baselines.
Conclusion: The proposed WebLeaper framework successfully addresses search efficiency limitations in LLM-based information seeking agents by systematically increasing both efficiency and efficacy through high-coverage task construction and optimized training trajectories.
Abstract: Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
[81] ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking
Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, Yong Jiang
Main category: cs.CL
TL;DR: ParallelMuse is a two-stage paradigm for deep information-seeking agents that improves exploration efficiency through functional region partitioning and uncertainty-guided branching, then compresses reasoning to generate coherent answers.
Details
Motivation: To address inefficiency in conventional parallel thinking approaches that repeatedly roll out from scratch and struggle to integrate long-horizon reasoning due to limited context capacity.Method: Two-stage approach: 1) Functionality-Specified Partial Rollout partitions sequences into functional regions with uncertainty-guided path reuse and branching; 2) Compressed Reasoning Aggregation exploits reasoning redundancy to losslessly compress information for answer derivation.
Result: Experiments show up to 62% performance improvement with 10-30% reduction in exploratory token consumption across multiple agents and benchmarks.
Conclusion: ParallelMuse effectively enhances problem-solving capability by improving exploration efficiency and enabling better integration of reasoning processes in information-seeking agents.
Abstract: Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10–30% reduction in exploratory token consumption.
[82] AgentFold: Long-Horizon Web Agents with Proactive Context Management
Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang
Main category: cs.CL
TL;DR: AgentFold introduces a novel agent paradigm with proactive context management using ‘folding’ operations to handle long-horizon web tasks, achieving state-of-the-art performance on benchmarks.
Details
Motivation: Current LLM-based web agents face context saturation issues with ReAct-based approaches accumulating noisy histories, while fixed summarization methods risk losing critical details, creating a fundamental trade-off in context management.Method: AgentFold treats context as a dynamic cognitive workspace and learns to execute ‘folding’ operations that manage historical trajectories at multiple scales - granular condensations preserve fine-grained details, while deep consolidations abstract away multi-step sub-tasks.
Result: AgentFold-30B-A3B achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH with simple supervised fine-tuning, surpassing larger models like DeepSeek-V3.1-671B-A37B and proprietary agents like OpenAI’s o4-mini.
Conclusion: AgentFold’s proactive context management paradigm effectively addresses the context saturation problem in long-horizon web tasks, demonstrating superior performance through dynamic cognitive workspace management.
Abstract: LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a 'folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI’s o4-mini.
[83] Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents
Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig
Main category: cs.CL
TL;DR: The paper introduces the Agent Data Protocol (ADP), a unified format for agent training data that enables standardized training across diverse datasets without per-dataset engineering.
Details
Motivation: Public research on large-scale supervised finetuning of AI agents is limited due to data fragmentation across heterogeneous formats, tools, and interfaces.Method: Developed ADP as a light-weight representation language that serves as an interlingua between diverse agent datasets and unified training pipelines, capable of capturing various tasks including API/tool use, browsing, coding, and agentic workflows.
Result: Unified 13 existing agent datasets into ADP format, achieving ~20% average performance gain over base models and delivering state-of-the-art or near-SOTA performance on coding, browsing, tool use, and research benchmarks without domain-specific tuning.
Conclusion: ADP helps lower the barrier to standardized, scalable, and reproducible agent training by providing a unified data format that eliminates the need for per-dataset engineering.
Abstract: Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an “interlingua” between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data and demonstrated an average performance gain of ~20% over the corresponding base models, delivering state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
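To give a feel for what a unified trajectory record might look like, here is a purely hypothetical schema; the actual ADP field names and structure are defined in the released code, so everything below is an assumption for intuition only.

```python
# Hypothetical (not ADP's real schema): a unified agent-trajectory record.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Step:
    role: str                          # "user", "assistant", or "tool"
    content: str                       # text, tool call, or observation
    tool_name: Optional[str] = None    # set when the step invokes an API/tool

@dataclass
class Trajectory:
    task: str                          # natural-language task description
    domain: str                        # e.g. "coding", "browsing", "tool-use"
    steps: list[Step] = field(default_factory=list)

traj = Trajectory(
    task="Report the repository's open issue count",
    domain="browsing",
    steps=[Step("assistant", "open('https://github.com/...')", "browser"),
           Step("tool", "page loaded; 42 open issues"),
           Step("assistant", "There are 42 open issues.")],
)
```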
[84] ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?
Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
Main category: cs.CL
TL;DR: ComboBench evaluates LLMs’ ability to translate semantic actions into VR device manipulation sequences across 262 scenarios from 4 VR games, revealing that while top models show strong task decomposition, they still lag humans in procedural reasoning and spatial understanding.
Details
Motivation: To explore whether LLMs can effectively replicate humans' intuitive ability to translate high-level semantic actions into precise VR device manipulations, which remains underexplored despite being crucial for VR gaming.Method: Created ComboBench benchmark with 262 scenarios from 4 VR games (Half-Life: Alyx, Into the Radius, Moss: Book II, Vivecraft), evaluated 7 LLMs against ground truth and human performance, and tested few-shot learning improvements.
Result: Top models like Gemini-1.5-Pro show strong task decomposition but struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, and few-shot examples substantially improve LLM performance.
Conclusion: LLMs demonstrate potential for VR manipulation tasks but require targeted enhancement, particularly in procedural reasoning and spatial understanding, with few-shot learning showing promise for improving their capabilities.
Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs’ capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs’ VR manipulation capabilities. We release all materials at https://sites.google.com/view/combobench.
[85] MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task
Juraj Juraska, Tobias Domhan, Mara Finkelstein, Tetsuji Nakagawa, Geza Kovacs, Daniel Deutsch, Pidong Wang, Markus Freitag
Main category: cs.CL
TL;DR: The authors present two systems for WMT25 Translation Evaluation: MetricX-25 for quality score prediction and GemSpanEval for error span detection, both based on Gemma 3 model fine-tuned on WMT data.
Details
Motivation: To develop improved systems for the unified WMT25 Translation Evaluation Shared Task, addressing both quality score prediction and error span detection subtasks with state-of-the-art approaches.Method: Used Gemma 3 multilingual model fine-tuned on WMT data. For MetricX-25: adapted Gemma 3 to encoder-only architecture with regression head. For GemSpanEval: used decoder-only architecture for generative error span detection with context output.
Result: MetricX-25 significantly outperforms its predecessor in predicting MQM and ESA quality scores. GemSpanEval is competitive with xCOMET baseline for error span detection.
Conclusion: Both systems demonstrate effective performance in their respective subtasks, with MetricX-25 showing significant improvements over previous versions and GemSpanEval providing competitive error detection through generative formulation.
Abstract: In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, adapting Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and significantly outperforms its predecessor. Our decoder-only GemSpanEval model, on the other hand, is shown to be competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
[86] Retrieval-Augmented Generation-based Relation Extraction
Sefika Efeoglu, Adrian Paschke
Main category: cs.CL
TL;DR: RAG4RE (Retrieval-Augmented Generation-based Relation Extraction) is proposed to enhance relation extraction by combining LLMs with retrieval mechanisms, overcoming limitations of traditional RE methods and LLM hallucinations.
Details
Motivation: Traditional relation extraction methods rely heavily on labeled data and computational resources, while LLMs can produce hallucinated responses. RAG4RE addresses these limitations by integrating retrieval mechanisms with LLMs.Method: Proposed RAG4RE approach that combines retrieval-augmented generation with relation extraction. Evaluated using multiple LLMs (Flan T5, Llama2, Mistral) on established benchmarks including TACRED, TACREV, Re-TACRED, and SemEval RE datasets.
Result: RAG4RE outperforms traditional RE approaches based solely on LLMs, particularly on TACRED dataset and its variations. Shows remarkable performance compared to previous RE methodologies across both TACRED and TACREV datasets.
Conclusion: RAG4RE demonstrates efficacy and potential for advancing relation extraction tasks in natural language processing by effectively combining retrieval mechanisms with LLMs to overcome limitations of existing approaches.
Abstract: Information Extraction (IE) is a transformative process that converts unstructured text data into a structured format by employing entity and relation extraction (RE) methodologies. The identification of the relation between a pair of entities plays a crucial role within this framework. Despite the existence of various techniques for relation extraction, their efficacy heavily relies on access to labeled data and substantial computational resources. To address these challenges, Large Language Models (LLMs) emerge as promising solutions; however, they might return hallucinated responses due to their own training data. To overcome these limitations, Retrieval-Augmented Generation-based Relation Extraction (RAG4RE) is proposed in this work, offering a pathway to enhance the performance of relation extraction tasks. This work evaluated the effectiveness of our RAG4RE approach using different LLMs. Through established benchmarks, such as the TACRED, TACREV, Re-TACRED, and SemEval RE datasets, our aim is to comprehensively evaluate the efficacy of our RAG4RE approach. In particular, we leverage prominent LLMs including Flan T5, Llama2, and Mistral in our investigation. The results of our study demonstrate that our RAG4RE approach surpasses the performance of traditional RE approaches based solely on LLMs, particularly evident in the TACRED dataset and its variations. Furthermore, our approach exhibits remarkable performance compared to previous RE methodologies across both the TACRED and TACREV datasets, underscoring its efficacy and potential for advancing RE tasks in natural language processing.
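A hedged sketch of how such a retrieval-augmented RE prompt could be assembled; the retriever, the generator, and the prompt wording are placeholders, not the paper's template.

```python
# Sketch: retrieval-augmented prompt for relation extraction.
def rag_relation_extraction(sentence: str, head: str, tail: str,
                            retrieve, generate, k: int = 3) -> str:
    examples = retrieve(sentence, top_k=k)             # [(similar_sentence, relation_label), ...]
    demos = "\n".join(f"Sentence: {s}\nRelation: {r}" for s, r in examples)
    prompt = (f"{demos}\n\n"
              f"Sentence: {sentence}\n"
              f"What is the relation between '{head}' and '{tail}'? "
              f"Answer with a single relation label.")
    return generate(prompt).strip()
```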
[87] Evaluation of Geographical Distortions in Language Models
Rémy Decoupes, Roberto Interdonato, Mathieu Roche, Maguelonne Teisseire, Sarah Valentin
Main category: cs.CL
TL;DR: This paper analyzes geographical biases in language models by examining distortions in spatial information representation and introducing four indicators to measure these biases.
Details
Motivation: Language models are essential tools for professional tasks, making it imperative to identify inherent biases, particularly geographical biases that can lead to misrepresentation of spatial information.Method: The study introduces four indicators to assess geographical distortions by comparing geographical and semantic distances, and conducts experiments using ten widely used language models.
Result: Results show that language models tend to misrepresent spatial information, leading to distortions in geographical distance representation.
Conclusion: There is a critical necessity to inspect and rectify spatial biases in language models to ensure accurate and equitable geographical representations.
Abstract: Language models now constitute essential tools for improving efficiency for many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify inherent biases. In the field of Natural Language Processing, five sources of bias are well-identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, thus leading to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions, by comparing geographical and semantic distances. Experiments are conducted from these four indicators with ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
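For a concrete sense of what such an indicator could measure (a hedged sketch, not necessarily one of the paper's four indicators): rank-correlate true geographic distances between place pairs with the semantic distances their names receive in embedding space; a weak correlation signals distortion. The embed function and city coordinates are placeholders.

```python
# Sketch: geographic vs. semantic distance agreement over place-name pairs.
import math
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def haversine_km(p: tuple, q: tuple) -> float:
    (lat1, lon1), (lat2, lon2) = (tuple(map(math.radians, p)), tuple(map(math.radians, q)))
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

def distortion_indicator(coords: dict[str, tuple], embed) -> float:
    pairs = list(combinations(coords, 2))
    geo = [haversine_km(coords[a], coords[b]) for a, b in pairs]
    sem = [1.0 - float(np.dot(embed(a), embed(b)) /
                       (np.linalg.norm(embed(a)) * np.linalg.norm(embed(b)))) for a, b in pairs]
    rho, _ = spearmanr(geo, sem)       # high rho = distances well reflected; low rho = distorted
    return float(rho)
```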
[88] Zero-Shot Tokenizer Transfer
Benjamin Minixhofer, Edoardo Maria Ponti, Ivan Vulić
Main category: cs.CL
TL;DR: Zero-Shot Tokenizer Transfer (ZeTT) enables swapping language model tokenizers without performance degradation using a hypernetwork that predicts embeddings for new tokenizers.
Details
Motivation: Current LMs are restricted by their tokenizers, causing efficiency issues when working with languages different from their training data. There's a need to detach LMs from their original tokenizers while maintaining performance.Method: Proposed a hypernetwork approach that takes a tokenizer as input and predicts corresponding embeddings, enabling zero-shot transfer to new tokenizers for both encoder and decoder LLMs.
Result: The method achieves performance close to original models in cross-lingual and coding tasks while significantly reducing tokenized sequence length. Remaining performance gaps can be closed with minimal continued training (<1B tokens).
Conclusion: ZeTT makes substantial progress toward detaching LMs from their tokenizers, with hypernetworks generalizing to new tokenizers and fine-tuned variants without extra training.
Abstract: Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models’ performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
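A simplified sketch of the hypernetwork idea: each token of the new tokenizer is decomposed under the original tokenizer, its piece embeddings are mixed by a small network, and the pooled output becomes the predicted embedding. The architecture and mean pooling below are simplifications, not the paper's actual design; the predicted vectors would then stand in for the new vocabulary's embedding rows.

```python
# Sketch: hypernetwork predicting embeddings for a new tokenizer's tokens.
import torch
import torch.nn as nn

class EmbeddingHypernetwork(nn.Module):
    def __init__(self, orig_embeddings: nn.Embedding, d_model: int = 768):
        super().__init__()
        self.orig = orig_embeddings                      # frozen original embedding table
        self.mixer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, piece_ids: torch.Tensor) -> torch.Tensor:
        """piece_ids: (num_new_tokens, max_pieces) ids of each new token's
        decomposition under the original tokenizer (padded)."""
        pieces = self.orig(piece_ids)                    # (N, P, d)
        mixed = self.mixer(pieces)                       # contextualize the pieces
        return self.proj(mixed.mean(dim=1))              # (N, d) predicted embeddings
```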
[89] Says Who? Effective Zero-Shot Annotation of Focalization
Rebecca M. M. Hicke, Yuri Bizzoni, Pascale Feldkamp, Ross Deans Kristensen-McLachlan
Main category: cs.CL
TL;DR: LLMs can effectively annotate focalization in literary texts, with GPT-4o achieving 84.79% F1 score comparable to human annotators, and log probabilities reflect annotation difficulty.
Details
Motivation: Focalization annotation is challenging due to subjective interpretation and frequent annotator disagreement, making it computationally difficult despite being important for narrative analysis.Method: Tested five LLM families and two baselines on annotating focalization in short literary excerpts, analyzed log probabilities from GPT models, and conducted case study on 16 Stephen King novels.
Result: LLMs perform comparably to trained human annotators, with GPT-4o achieving 84.79% average F1 score. GPT log probabilities correlate with annotation difficulty.
Conclusion: LLMs are effective for focalization annotation in computational literary studies, enabling large-scale analysis of narrative perspective with human-level accuracy.
Abstract: Focalization describes the way in which access to narrative information is restricted or controlled based on the knowledge available to the narrator. It is encoded via a wide range of lexico-grammatical features and is subject to reader interpretation. Even trained annotators frequently disagree on correct labels, suggesting this task is both qualitatively and computationally challenging. In this work, we test how well five contemporary large language model (LLM) families and two baselines perform when annotating short literary excerpts for focalization. Despite the challenging nature of the task, we find that LLMs show comparable performance to trained human annotators, with GPT-4o achieving an average F1 of 84.79%. Further, we demonstrate that the log probabilities output by GPT-family models frequently reflect the difficulty of annotating particular excerpts. Finally, we provide a case study analyzing sixteen Stephen King novels, demonstrating the usefulness of this approach for computational literary studies and the insights gleaned from examining focalization at scale.
[90] TrajAgent: An LLM-Agent Framework for Trajectory Modeling via Large-and-Small Model Collaboration
Yuwei Du, Jie Feng, Jie Zhao, Yong Li
Main category: cs.CL
TL;DR: TrajAgent is an LLM-powered agent framework for automated trajectory modeling that unifies diverse trajectory tasks and datasets through a unified environment and collaborative learning between LLMs and specialized models.
Details
Motivation: Trajectory modeling faces challenges due to data heterogeneity and task diversity, making it difficult even for experts. There's a need for automated solutions that can handle various trajectory tasks across different datasets effectively.Method: Proposed TrajAgent framework with UniEnv (unified execution environment), agentic workflow for automatic trajectory modeling, and collaborative learning between LLM-based agents and small specialized models.
Result: Experiments on 5 tasks using 4 real-world datasets show TrajAgent achieves 2.38%-69.91% performance improvement over baseline methods in automated trajectory modeling.
Conclusion: TrajAgent effectively addresses trajectory modeling challenges through automation, demonstrating significant performance improvements across diverse tasks and datasets.
Abstract: Trajectory modeling, which includes research on trajectory data pattern mining and future prediction, has widespread applications in areas such as life services, urban transportation, and public administration. Numerous methods have been proposed to address specific problems within trajectory modeling. However, the heterogeneity of data and the diversity of trajectory tasks make effective and reliable trajectory modeling an important yet highly challenging endeavor, even for domain experts. In this paper, we propose TrajAgent, an agent framework powered by large language models, designed to facilitate robust and efficient trajectory modeling through automated modeling. This framework leverages and optimizes diverse specialized models to address various trajectory modeling tasks across different datasets effectively. In TrajAgent, we first develop UniEnv, an execution environment with a unified data and model interface, to support the execution and training of various models. Building on UniEnv, we introduce an agentic workflow designed for automatic trajectory modeling across various trajectory tasks and data. Furthermore, we introduce a collaborative learning schema between LLM-based agents and small specialized models to enhance the performance of the whole framework. Extensive experiments on five tasks using four real-world datasets demonstrate the effectiveness of TrajAgent in automated trajectory modeling, achieving a performance improvement of 2.38%-69.91% over baseline methods. The codes and data can be accessed via https://github.com/tsinghua-fib-lab/TrajAgent.
[91] Provable Scaling Laws for the Test-Time Compute of Large Language Models
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
Main category: cs.CL
TL;DR: Two simple algorithms for LLMs that achieve provable scaling laws: knockout-style (exponential failure decay) and league-style (exponential failure decay with more robust assumptions), requiring only a black-box LLM.
Details
Motivation: To develop principled algorithms that enjoy provable scaling laws for test-time compute in large language models, making them practical and easy to adapt without requiring additional components like verifiers or reward models.Method: 1) Two-stage knockout algorithm: generates multiple candidate solutions and aggregates via knockout tournament. 2) Two-stage league algorithm: evaluates candidates by average win rate against multiple opponents. Both use only black-box LLMs.
Result: Theoretical proofs show failure probability decays exponentially or by power law with increasing test-time compute. Extensive experiments validate theories and demonstrate outstanding scaling properties across diverse models and datasets.
Conclusion: Both proposed algorithms achieve provable scaling laws with minimal implementation requirements, making them practical and adaptable for various tasks while requiring only black-box LLMs.
Abstract: We propose two simple, principled and practical algorithms that enjoy provable scaling laws for the test-time compute of large language models (LLMs). The first one is a two-stage knockout-style algorithm: given an input problem, it first generates multiple candidate solutions, and then aggregates them via a knockout tournament for the final output. Assuming that the LLM can generate a correct solution with non-zero probability and do better than a random guess in comparing a pair of correct and incorrect solutions, we prove theoretically that the failure probability of this algorithm decays to zero exponentially or by a power law (depending on the specific way of scaling) as its test-time compute grows. The second one is a two-stage league-style algorithm, where each candidate is evaluated by its average win rate against multiple opponents, rather than eliminated upon loss to a single opponent. Under analogous but more robust assumptions, we prove that its failure probability also decays to zero exponentially with more test-time compute. Both algorithms require a black-box LLM and nothing else (e.g., no verifier or reward model) for a minimalistic implementation, which makes them appealing for practical applications and easy to adapt for different tasks. Through extensive experiments with diverse models and datasets, we validate the proposed theories and demonstrate the outstanding scaling properties of both algorithms.
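A minimal sketch of the knockout stage under the stated black-box assumption, with hypothetical callables `generate(problem)` and `prefer_first(problem, a, b)` standing in for LLM sampling and pairwise comparison:

```python
import random

def knockout(problem, generate, prefer_first, n_candidates=8, n_comparisons=3):
    """Generate N candidates, then eliminate pairwise until one remains."""
    pool = [generate(problem) for _ in range(n_candidates)]
    while len(pool) > 1:
        next_round = []
        for a, b in zip(pool[0::2], pool[1::2]):
            # Majority vote over several pairwise comparisons makes each match less
            # noisy, which is what lets the failure probability shrink as compute grows.
            votes = sum(prefer_first(problem, a, b) for _ in range(n_comparisons))
            next_round.append(a if votes > n_comparisons / 2 else b)
        if len(pool) % 2 == 1:          # odd pool size: the last candidate gets a bye
            next_round.append(pool[-1])
        pool = next_round
    return pool[0]

# Toy usage: "solutions" are random integers and the judge prefers larger ones.
answer = knockout(
    problem="maximize",
    generate=lambda p: random.randint(0, 100),
    prefer_first=lambda p, a, b: a >= b,
)
print(answer)
```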
[92] Discourse Features Enhance Detection of Document-Level Machine-Generated Content
Yupei Li, Manuel Milling, Lucia Specia, Björn W. Schuller
Main category: cs.CL
TL;DR: The paper introduces DTransformer, a model that uses discourse analysis to detect machine-generated content in longer texts, achieving significant improvements over state-of-the-art methods on multiple datasets.
Details
Motivation: Existing MGC detectors focus only on surface-level features and are easily deceived by paraphrasing, especially for longer texts. There's a need for methods that capture implicit and structural features.Method: Developed paraphrased datasets (paraLFQA and paraWP) using GPT and DIPPER. Proposed DTransformer model that integrates discourse analysis through PDTB preprocessing to encode structural features at document level.
Result: DTransformer achieved substantial performance gains: 15.5% absolute improvement on paraLFQA, 4% on paraWP, and 1.5% on M4 compared to SOTA approaches.
Conclusion: Discourse analysis significantly enhances detection of machine-generated content, especially for longer texts and paraphrased content, demonstrating the importance of structural features beyond surface-level patterns.
Abstract: The availability of high-quality APIs for Large Language Models (LLMs) has facilitated the widespread creation of Machine-Generated Content (MGC), posing challenges such as academic plagiarism and the spread of misinformation. Existing MGC detectors often focus solely on surface-level information, overlooking implicit and structural features. This makes them susceptible to deception by surface-level sentence patterns, particularly for longer texts and in texts that have been subsequently paraphrased. To overcome these challenges, we introduce novel methodologies and datasets. Besides the publicly available dataset Plagbench, we developed the paraphrased Long-Form Question and Answer (paraLFQA) and paraphrased Writing Prompts (paraWP) datasets using GPT and DIPPER, a discourse paraphrasing tool, by extending artifacts from their original versions. To better capture the structure of longer texts at document level, we propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features. It results in substantial performance gains across these datasets: 15.5% absolute improvement on paraLFQA, 4% absolute improvement on paraWP, and 1.5% absolute improvement on M4 compared to SOTA approaches. The data and code are available at: https://github.com/myxp-lyp/Discourse-Features-Enhance-Detection-of-Document-Level-Machine-Generated-Content.git.
[93] Face the Facts! Evaluating RAG-based Fact-checking Pipelines in Realistic Settings
Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini
Main category: cs.CL
TL;DR: This paper benchmarks RAG-based automated fact-checking methods under realistic scenarios, evaluating them on complex claims and heterogeneous knowledge bases. It finds that LLM-based retrievers outperform other techniques but struggle with heterogeneous sources, while larger models excel in faithfulness and smaller ones in context adherence.
Details
Motivation: To address the constraints of current automated fact-checking pipelines and evaluate RAG-based methods in more realistic scenarios with stylistically complex claims and heterogeneous knowledge bases.Method: Benchmarking RAG-based methods for verdict generation using LLM-based retrievers and evaluating on complex claims with heterogeneous yet reliable knowledge bases.
Result: LLM-based retrievers outperform other retrieval techniques but struggle with heterogeneous knowledge bases. Larger models excel in verdict faithfulness while smaller models provide better context adherence. Human evaluations favor zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
Conclusion: The study reveals a complex landscape in automated fact-checking where different approaches excel in different aspects, suggesting the need for balanced approaches that consider both faithfulness and context adherence in verdict generation.
Abstract: Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, under more realistic scenarios, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
[94] NeedleInATable: Exploring Long-Context Capability of Large Language Models towards Long-Structured Tables
Lanrui Wang, Mingyu Zheng, Hongyin Tang, Zheng Lin, Yanan Cao, Jingang Wang, Xunliang Cai, Weiping Wang
Main category: cs.CL
TL;DR: NeedleInATable (NIAT) is a new benchmark for testing LLMs’ fine-grained perception of individual table cells in long structured tables, revealing performance gaps in genuine table understanding.
Details
Motivation: Existing benchmarks focus on unstructured text or downstream tabular tasks, overlooking models' fine-grained perception of individual table cells which is crucial for robust LLM-based table applications.Method: NIAT treats each table cell as a “needle” and requires models to extract target cells based on cell locations or lookup questions, evaluating fine-grained table perception.
Result: Evaluation shows substantial performance gap between popular downstream tabular tasks and NIAT, suggesting models may rely on dataset-specific correlations rather than genuine table understanding.
Conclusion: NIAT capability is important for LLMs’ genuine table understanding, and training with synthesized NIAT data improves performance on both NIAT and downstream tabular tasks.
Abstract: Processing structured tabular data, particularly large and lengthy tables, constitutes a fundamental yet challenging task for large language models (LLMs). However, existing long-context benchmarks like Needle-in-a-Haystack primarily focus on unstructured text, neglecting the challenge of diverse structured tables. Meanwhile, previous tabular benchmarks mainly consider downstream tasks that require high-level reasoning abilities, and overlook models’ underlying fine-grained perception of individual table cells, which is crucial for practical and robust LLM-based table applications. To address this gap, we introduce NeedleInATable (NIAT), a new long-context tabular benchmark that treats each table cell as a “needle” and requires models to extract the target cell based on cell locations or lookup questions. Our comprehensive evaluation of various LLMs and multimodal LLMs reveals a substantial performance gap between popular downstream tabular tasks and the simpler NIAT task, suggesting that they may rely on dataset-specific correlations or shortcuts to obtain better benchmark results but lack truly robust long-context understanding towards structured tables. Furthermore, we demonstrate that using synthesized NIAT training data can effectively improve performance on both the NIAT task and downstream tabular tasks, which validates the importance of NIAT capability for LLMs’ genuine table understanding ability.
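A minimal sketch of how a NIAT-style "needle" query might be synthesized from a table, with illustrative question templates (the benchmark's actual templates and table formats are not specified here):

```python
import random

def make_niat_query(table, header, key_column=0, seed=None):
    """Pick a random cell and phrase a location-based or lookup-based question about it.

    table: list of rows (lists of strings); header: list of column names.
    """
    rng = random.Random(seed)
    r = rng.randrange(len(table))
    c = rng.randrange(len(header))
    if rng.random() < 0.5:
        # Location-based query: the cell is addressed by its position in the table.
        question = f"What is the value in row {r + 1}, column '{header[c]}'?"
    else:
        # Lookup-based query: the cell is addressed via the row's key value.
        question = (
            f"For the row where '{header[key_column]}' is '{table[r][key_column]}', "
            f"what is the value of '{header[c]}'?"
        )
    return {"question": question, "answer": table[r][c]}

header = ["city", "country", "population_m"]
table = [["Paris", "France", "2.1"], ["Berlin", "Germany", "3.6"], ["Tokyo", "Japan", "14.0"]]
print(make_niat_query(table, header, seed=3))
```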
[95] BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang
Main category: cs.CL
TL;DR: BRIDGE is a comprehensive multilingual benchmark for evaluating LLMs on real-world clinical data across 87 tasks, 9 languages, and 14 specialties, showing that open-source models can match proprietary ones and medically fine-tuned models often underperform.
Details
Motivation: Current LLM benchmarks for medical applications rely on medical exam questions or PubMed text, failing to capture real-world clinical data complexity and limiting generalizability across clinical use cases.Method: Created BRIDGE benchmark with 87 tasks from real-world clinical data across 9 languages, covering 8 task types, 6 clinical stages, 20 applications, and 14 specialties. Evaluated 95 LLMs including DeepSeek-R1, GPT-4o, Gemini, and Qwen3 series under various inference strategies.
Result: Substantial performance variation across model sizes, languages, NLP tasks, and clinical specialties. Open-source LLMs achieved comparable performance to proprietary models, while medically fine-tuned LLMs based on older architectures often underperformed versus updated general-purpose models.
Conclusion: BRIDGE serves as a foundational resource for developing and evaluating LLMs in real-world clinical text understanding, providing a comprehensive benchmark that addresses limitations of existing medical evaluation frameworks.
Abstract: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, benchmarking on large-scale real-world data such as electronic health records (EHRs) is critical, as clinical decisions are directly informed by these sources, yet current evaluations remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world clinical data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. It covers eight major task types spanning the entire continuum of patient care across six clinical stages and 20 representative applications, including triage and referral, consultation, information extraction, diagnosis, prognosis, and billing coding, and involves 14 clinical specialties. We systematically evaluated 95 LLMs (including DeepSeek-R1, GPT-4o, Gemini series, and Qwen3 series) under various inference strategies. Our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
[96] AutoJudge: Judge Decoding Without Manual Annotation
Roman Garipov, Fedor Velikonivtsev, Ivan Ermakov, Ruslan Svirschevski, Vage Egiazarian, Max Ryabinin
Main category: cs.CL
TL;DR: AutoJudge accelerates LLM inference using task-specific lossy speculative decoding by identifying which generated tokens affect downstream quality, allowing faster generation of unimportant tokens while maintaining overall response quality.
Details
Motivation: To speed up large language model inference by relaxing the strict token-by-token distribution matching requirement of speculative decoding, focusing only on tokens that impact final answer quality.Method: Uses semi-greedy search to identify which mismatches between target and draft models need correction, trains lightweight classifier on LLM embeddings to predict which mismatching tokens can be safely accepted without quality loss.
Result: Achieves up to ~2x speedup over speculative decoding on GSM8k with Llama 3.1 70B with ≤1% accuracy drop, accepts ≥25 tokens per speculation cycle on LiveCodeBench with 2% drop in Pass@1.
Conclusion: AutoJudge provides significant inference speedups with minimal quality degradation, requires no human annotation, and is easily integrable with modern LLM inference frameworks.
Abstract: We introduce AutoJudge, a method that accelerates large language model (LLM) inference with task-specific lossy speculative decoding. Instead of matching the original model output distribution token-by-token, we identify which of the generated tokens affect the downstream quality of the response, relaxing the distribution match guarantee so that the “unimportant” tokens can be generated faster. Our approach relies on a semi-greedy search algorithm to test which of the mismatches between target and draft models should be corrected to preserve quality and which ones may be skipped. We then train a lightweight classifier based on existing LLM embeddings to predict, at inference time, which mismatching tokens can be safely accepted without compromising the final answer quality. We evaluate the effectiveness of AutoJudge with multiple draft/target model pairs on mathematical reasoning and programming benchmarks, achieving significant speedups at the cost of a minor accuracy reduction. Notably, on GSM8k with the Llama 3.1 70B target model, our approach achieves up to ≈2× speedup over speculative decoding at the cost of a ≤1% drop in accuracy. When applied to the LiveCodeBench benchmark, AutoJudge automatically detects programming-specific important tokens, accepting ≥25 tokens per speculation cycle at a 2% drop in Pass@1. Our approach requires no human annotation and is easy to integrate with modern LLM inference frameworks.
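A heavily simplified, single-token-at-a-time sketch of the lossy acceptance rule (real speculative decoding drafts and verifies blocks of tokens; the three callables are hypothetical stand-ins for the draft model, the target model, and the trained classifier):

```python
def lossy_speculative_step(prefix, draft_next, target_next, mismatch_ok, max_new=32):
    """Keep the draft token when it matches the target or the classifier deems the mismatch unimportant."""
    tokens = list(prefix)
    lossy_accepts = 0
    for _ in range(max_new):
        d = draft_next(tokens)
        t = target_next(tokens)
        if d == t or mismatch_ok(tokens, d, t):
            tokens.append(d)              # accept the (possibly mismatching) draft token
            lossy_accepts += d != t
        else:
            tokens.append(t)              # fall back to the target model's token
    return tokens, lossy_accepts

# Toy usage: the "models" emit characters; mismatches on whitespace are deemed safe.
out, n_lossy = lossy_speculative_step(
    prefix=list("Answer:"),
    draft_next=lambda toks: " " if len(toks) % 5 == 0 else "a",
    target_next=lambda toks: "a",
    mismatch_ok=lambda toks, d, t: d.isspace(),
)
print("".join(out), n_lossy)
```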
[97] The Hawthorne Effect in Reasoning Models: Evaluating and Steering Test Awareness
Sahar Abdelnabi, Ahmed Salem
Main category: cs.CL
TL;DR: This paper presents the first quantitative study of how “test awareness” affects LLM behavior, particularly on safety tasks. The authors develop a white-box probing framework to identify and control awareness-related activations, showing that test awareness significantly impacts safety alignment in various models.
Details
Motivation: LLMs can alter their behavior when they detect they're being evaluated, potentially optimizing for test performance or complying more with harmful prompts when real-world consequences seem absent. This "test awareness" phenomenon needs systematic study.Method: A white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. Applied to state-of-the-art open-weight reasoning LLMs across realistic and hypothetical tasks.
Result: Test awareness significantly impacts safety alignment (compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across different models.
Conclusion: The framework provides control over test awareness effects, offering a stress-test mechanism to increase trust in safety evaluations by understanding and managing this latent behavioral influence.
Abstract: Reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such “test awareness” impacts model behavior, particularly its performance on safety-related tasks. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-weight reasoning LLMs across both realistic and hypothetical tasks (denoting tests or simulations). Our results demonstrate that test awareness significantly impacts safety alignment (such as compliance with harmful requests and conforming to stereotypes) with effects varying in both magnitude and direction across models. By providing control over this latent effect, our work aims to provide a stress-test mechanism and increase trust in how we perform safety evaluations.
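A minimal numpy sketch of the probe-and-steer idea: a linear "awareness" direction estimated as a mean difference of hidden states between evaluation-style and real-world-style prompts, then used to shift activations (the mean-difference probe and the steering coefficient are assumptions, not the paper's exact method):

```python
import numpy as np

def awareness_direction(h_aware, h_unaware):
    """h_aware, h_unaware: (n, d) hidden states collected under the two prompt conditions."""
    direction = h_aware.mean(axis=0) - h_unaware.mean(axis=0)
    return direction / np.linalg.norm(direction)

def steer(hidden, direction, alpha):
    """Shift an activation along the awareness direction (alpha < 0 suppresses it)."""
    return hidden + alpha * direction

# Toy usage: a planted direction stands in for a genuine test-awareness signal.
rng = np.random.default_rng(0)
d = 64
planted = rng.normal(size=d)
h_aware = rng.normal(size=(100, d)) + planted
h_unaware = rng.normal(size=(100, d))
probe = awareness_direction(h_aware, h_unaware)
steered = steer(rng.normal(size=d), probe, alpha=-4.0)
print(probe.shape, steered.shape)
```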
[98] Any Large Language Model Can Be a Reliable Judge: Debiasing with a Reasoning-based Bias Detector
Haoyan Yang, Runxue Bao, Cao Xiao, Jun Ma, Parminder Bhatia, Shangqian Gao, Taha Kass-Hout
Main category: cs.CL
TL;DR: The paper introduces Reasoning-based Bias Detector (RBD), a plug-in module that identifies biased evaluations in LLM-as-a-Judge systems and generates structured reasoning to guide self-correction, achieving significant improvements in evaluation accuracy and consistency.
Details
Motivation: LLM-as-a-Judge systems suffer from reliability issues due to biases in judgment, and existing mitigation methods have limitations: in-context learning fails to address rooted biases due to limited self-reflection capacity, while fine-tuning is not applicable to closed-source models.Method: Developed RBD as an external plug-in module that operates through iterative bias detection and feedback-driven revision. Created a complete pipeline including biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD models (1.5B to 14B sizes), and integration with LLM evaluators.
Result: RBD models consistently improved performance across all scales. The RBD-8B model improved evaluation accuracy by 18.5% and consistency by 10.9%, surpassing prompting-based baselines by 12.8% and fine-tuned judges by 17.2%. Demonstrated strong effectiveness across 4 bias types (verbosity, position, bandwagon, sentiment) using 8 LLM evaluators.
Conclusion: RBD is an effective and scalable solution for bias mitigation in LLM-as-a-Judge systems, demonstrating strong generalization across biases and domains, and operating efficiently without requiring modifications to the evaluator models themselves.
Abstract: LLM-as-a-Judge has emerged as a promising tool for automatically evaluating generated outputs, but its reliability is often undermined by potential biases in judgment. Existing efforts to mitigate these biases face key limitations: in-context learning-based methods fail to address rooted biases due to the evaluator’s limited capacity for self-reflection, whereas fine-tuning is not applicable to all evaluator types, especially closed-source models. To address this challenge, we introduce the Reasoning-based Bias Detector (RBD), which is a plug-in module that identifies biased evaluations and generates structured reasoning to guide evaluator self-correction. Rather than modifying the evaluator itself, RBD operates externally and engages in an iterative process of bias detection and feedback-driven revision. To support its development, we design a complete pipeline consisting of biased dataset construction, supervision collection, distilled reasoning-based fine-tuning of RBD, and integration with LLM evaluators. We fine-tune four sizes of RBD models, ranging from 1.5B to 14B, and observe consistent performance improvements across all scales. Experimental results on 4 bias types (verbosity, position, bandwagon, and sentiment), evaluated using 8 LLM evaluators, demonstrate RBD’s strong effectiveness. For example, the RBD-8B model improves evaluation accuracy by an average of 18.5% and consistency by 10.9%, and surpasses prompting-based baselines and fine-tuned judges by 12.8% and 17.2%, respectively. These results highlight RBD’s effectiveness and scalability. Additional experiments further demonstrate its strong generalization across biases and domains, as well as its efficiency.
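A minimal sketch of the external detect-and-revise loop, with `evaluate`, `detect_bias`, and `revise` as hypothetical stand-ins for the LLM evaluator and the RBD module:

```python
def debiased_judgment(item, evaluate, detect_bias, revise, max_rounds=3):
    """Iteratively let an external bias detector critique and correct an evaluator's verdict."""
    verdict = evaluate(item)
    for _ in range(max_rounds):
        reasoning = detect_bias(item, verdict)      # RBD: structured bias analysis
        if reasoning is None:                       # no bias detected: keep the verdict
            break
        verdict = revise(item, verdict, reasoning)  # evaluator self-corrects with feedback
    return verdict

# Toy usage: a "verbosity-biased" evaluator prefers the longer answer until told not to.
item = {"a": "short but correct", "b": "a very long, padded, and wrong answer " * 3}
verdict = debiased_judgment(
    item,
    evaluate=lambda x: "b" if len(x["b"]) > len(x["a"]) else "a",
    detect_bias=lambda x, v: "verbosity bias: winner is merely longer" if v == "b" else None,
    revise=lambda x, v, r: "a",
)
print(verdict)  # 'a'
```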
[99] PVP: An Image Dataset for Personalized Visual Persuasion with Persuasion Strategies, Viewer Characteristics, and Persuasiveness Ratings
Junseo Kim, Jongwook Han, Dongmin Choi, Jongwook Yoon, Eun-Ju Lee, Yohan Jo
Main category: cs.CL
TL;DR: The paper introduces the Personalized Visual Persuasion (PVP) dataset containing 28,454 persuasive images with human evaluations and psychological characteristics to advance personalized visual persuasion systems.
Details
Motivation: There is a lack of comprehensive datasets connecting image persuasiveness with personal information of evaluators, which hinders development of AI systems for personalized visual persuasion.Method: Created PVP dataset with 28,454 persuasive images across 596 messages and 9 strategies, collected persuasiveness scores from 2,521 human annotators along with their demographic and psychological characteristics.
Result: Developed persuasive image generator and automated evaluator using the dataset, showing that incorporating psychological characteristics enhances generation and evaluation of persuasive images.
Conclusion: The PVP dataset enables technological advancements in personalized visual persuasion and provides valuable insights for developing more effective persuasive systems.
Abstract: Visual persuasion, which uses visual elements to influence cognition and behaviors, is crucial in fields such as advertising and political communication. With recent advancements in artificial intelligence, there is growing potential to develop persuasive systems that automatically generate persuasive images tailored to individuals. However, a significant bottleneck in this area is the lack of comprehensive datasets that connect the persuasiveness of images with the personal information about those who evaluated the images. To address this gap and facilitate technological advancements in personalized visual persuasion, we release the Personalized Visual Persuasion (PVP) dataset, comprising 28,454 persuasive images across 596 messages and 9 persuasion strategies. Importantly, the PVP dataset provides persuasiveness scores of images evaluated by 2,521 human annotators, along with their demographic and psychological characteristics (personality traits and values). We demonstrate the utility of our dataset by developing a persuasive image generator and an automated evaluator, and establish benchmark baselines. Our experiments reveal that incorporating psychological characteristics enhances the generation and evaluation of persuasive images, providing valuable insights for personalized visual persuasion.
[100] RARE: Retrieval-Aware Robustness Evaluation for Retrieval-Augmented Generation Systems
Yixiao Zeng, Tianyu Cao, Danqing Wang, Xinran Zhao, Zimeng Qiu, Morteza Ziyadi, Tongshuang Wu, Lei Li
Main category: cs.CL
TL;DR: RARE is a unified framework for evaluating RAG systems’ robustness against real-world noise, conflicting contexts, and fast-changing facts through systematic query and document perturbations.
Details
Motivation: Existing RAG evaluations rarely test how systems cope with real-world noise, conflicting contexts, or fast-changing facts, creating a gap in understanding their true robustness.Method: RARE uses a knowledge-graph-driven synthesis pipeline (RARE-Get) to automatically extract relations and generate multi-level questions from time-sensitive corpora, creating a large-scale benchmark (RARE-Set) with evolving question distributions.
Result: RAG systems show unexpected sensitivity to perturbations and consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
Conclusion: The RARE framework reveals critical vulnerabilities in RAG systems, particularly for multi-hop reasoning, highlighting the need for more robust retrieval-augmented generation approaches.
Abstract: Retrieval-Augmented Generation (RAG) enhances recency and factuality in answers. However, existing evaluations rarely test how well these systems cope with real-world noise, conflicts between internal and external retrieved contexts, or fast-changing facts. We introduce Retrieval-Aware Robustness Evaluation (RARE), a unified framework and large-scale benchmark that jointly stress-tests query and document perturbations over dynamic, time-sensitive corpora. One of the central features of RARE is a knowledge-graph-driven synthesis pipeline (RARE-Get) that automatically extracts single and multi-hop relations from the customized corpus and generates multi-level question sets without manual intervention. Leveraging this pipeline, we construct a dataset (RARE-Set) spanning 527 expert-level time-sensitive finance, economics, and policy documents and 48,295 questions whose distribution evolves as the underlying sources change. To quantify resilience, we formalize retrieval-conditioned robustness metrics (RARE-Met) that capture a model’s ability to remain correct or recover when queries, documents, or real-world retrieval results are systematically altered. Our findings reveal that RAG systems are unexpectedly sensitive to perturbations. Moreover, they consistently demonstrate lower robustness on multi-hop queries compared to single-hop queries across all domains.
[101] AdaRewriter: Unleashing the Power of Prompting-based Conversational Query Reformulation via Test-Time Adaptation
Yilong Lai, Jialong Wu, Zhenglin Wang, Deyu Zhou
Main category: cs.CL
TL;DR: AdaRewriter is a test-time adaptation framework for conversational query reformulation that uses a lightweight reward model to select the best reformulated queries from LLM-generated candidates, working effectively even with black-box LLM systems.
Details
Motivation: Existing tuning methods and adaptation approaches fail to fully leverage the benefits of prompting-based conversational query reformulation, particularly the scaling potential of best-of-N candidate selection.Method: Train a lightweight reward model with contrastive ranking loss to select the most promising query reformulation during inference, enabling test-time adaptation that works with black-box LLM systems.
Result: Experiments on five conversational search datasets show AdaRewriter significantly outperforms existing methods across most settings.
Conclusion: Test-time adaptation using outcome-supervised reward models demonstrates strong potential for improving conversational query reformulation performance.
Abstract: Prompting-based conversational query reformulation has emerged as a powerful approach for conversational search, refining ambiguous user queries into standalone search queries. Best-of-N reformulation over the generated candidates via prompting shows impressive potential scaling capability. However, neither the previous tuning methods (training time) nor adaptation approaches (test time) fully unleash these benefits. In this paper, we propose AdaRewriter, a novel framework for query reformulation using an outcome-supervised reward model via test-time adaptation. By training a lightweight reward model with contrastive ranking loss, AdaRewriter selects the most promising reformulation during inference. Notably, it can operate effectively in black-box systems, including commercial LLM APIs. Experiments on five conversational search datasets show that AdaRewriter significantly outperforms the existing methods across most settings, demonstrating the potential of test-time adaptation for conversational query reformulation.
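A minimal sketch of best-of-N selection with a lightweight scorer and a pairwise ranking loss of the kind mentioned above (the scorer architecture, margin, and embedding inputs are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RewriteScorer(nn.Module):
    """Scores a (conversation-context, candidate-rewrite) embedding pair."""
    def __init__(self, d: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, ctx_emb, cand_emb):
        return self.mlp(torch.cat([ctx_emb, cand_emb], dim=-1)).squeeze(-1)

def ranking_loss(scores_pos, scores_neg, margin=0.5):
    """Push scores of retrieval-effective rewrites above ineffective ones."""
    return torch.clamp(margin - scores_pos + scores_neg, min=0).mean()

def select_best(scorer, ctx_emb, cand_embs):
    """Test-time adaptation step: pick the highest-scoring of N candidate rewrites."""
    scores = scorer(ctx_emb.expand(len(cand_embs), -1), cand_embs)
    return int(scores.argmax())

# Toy usage with random embeddings standing in for encoded text.
d = 32
scorer = RewriteScorer(d)
ctx = torch.randn(d)
candidates = torch.randn(5, d)
print(select_best(scorer, ctx, candidates))
```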
[102] Offline RL by Reward-Weighted Fine-Tuning for Conversation Optimization
Subhojyoti Mukherjee, Viet Dac Lai, Raghavendra Addanki, Ryan Rossi, Seunghyun Yoon, Trung Bui, Anup Rao, Jayakumar Subramanian, Branislav Kveton
Main category: cs.CL
TL;DR: Proposes a practical offline RL approach for LLMs using reward-weighted fine-tuning, applied to short-horizon QA policies with significant improvements over SFT and DPO methods.
Details
Motivation: Current SFT and DPO methods for LLMs have extra hyperparameters and don't directly optimize rewards, creating a need for more effective offline RL approaches.Method: Recasts offline RL as reward-weighted fine-tuning using similar techniques to supervised fine-tuning, applied to short-horizon question-answering policies.
Result: Achieves major gains in both optimized rewards and language quality compared to state-of-the-art SFT and DPO methods.
Conclusion: Reward-weighted fine-tuning provides an effective practical approach for offline RL with LLMs, outperforming existing methods while being simpler to implement.
Abstract: Offline reinforcement learning (RL) is a variant of RL where the policy is learned from a previously collected dataset of trajectories and rewards. In our work, we propose a practical approach to offline RL with large language models (LLMs). We recast the problem as reward-weighted fine-tuning, which can be solved using similar techniques to supervised fine-tuning (SFT). To showcase the value of our approach, we apply it to learning short-horizon question-answering policies of a fixed length, where the agent reasons about potential answers or asks clarifying questions. Our work stands in stark contrast to state-of-the-art methods in this domain, based on SFT and direct preference optimization, which have additional hyper-parameters and do not directly optimize for rewards. We compare to them empirically, and report major gains in both optimized rewards and language quality.
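A minimal sketch of the reward-weighted fine-tuning objective: the usual token-level cross-entropy, weighted per trajectory by its reward (the batch-normalized weighting is an assumption; the paper may weight differently):

```python
import torch
import torch.nn.functional as F

def reward_weighted_nll(logits, target_ids, rewards, pad_id=-100):
    """logits: (B, T, V); target_ids: (B, T); rewards: (B,) from the offline dataset."""
    token_nll = F.cross_entropy(
        logits.transpose(1, 2), target_ids, ignore_index=pad_id, reduction="none"
    )                                                           # (B, T) per-token NLL
    seq_nll = token_nll.sum(dim=1)                              # sum over response tokens
    weights = rewards / rewards.abs().sum().clamp_min(1e-8)     # normalize within the batch
    return (weights * seq_nll).sum()

# Toy usage with random logits and rewards.
B, T, V = 4, 10, 50
loss = reward_weighted_nll(torch.randn(B, T, V), torch.randint(0, V, (B, T)), torch.rand(B))
print(loss.item())
```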
[103] DrVoice: Parallel Speech-Text Voice Conversation Model via Dual-Resolution Speech Representations
Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lv, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, Jieping Ye
Main category: cs.CL
TL;DR: DrVoice is a parallel speech-text voice conversation model that uses joint autoregressive modeling with dual-resolution speech representations to reduce computational costs and improve performance.
Details
Motivation: To address limitations in existing E2E speech generation methods where text generation is unaware of concurrent speech synthesis, and to reduce computational costs by lowering input frequency from 12.5Hz to 5Hz.Method: Uses joint autoregressive modeling with dual-resolution speech representations, reducing input frequency to 5Hz to better align with text tokens and exploit LLM capabilities.
Result: Establishes new SOTA on OpenAudioBench and Big Bench Audio benchmarks, achieves comparable performance to SOTA on VoiceBench and UltraEval-Audio benchmarks, becoming a leading open-source speech foundation model in ~7B models.
Conclusion: DrVoice demonstrates that dual-resolution speech representations with reduced input frequency can significantly improve computational efficiency and performance in parallel speech-text generation tasks.
Abstract: Recent studies on end-to-end (E2E) speech generation with large language models (LLMs) have attracted significant community attention, with multiple works extending text-based LLMs to generate discrete speech tokens. Existing E2E approaches primarily fall into two categories: (1) Methods that generate discrete speech tokens independently without incorporating them into the LLM’s autoregressive process, resulting in text generation being unaware of concurrent speech synthesis. (2) Models that generate interleaved or parallel speech-text tokens through joint autoregressive modeling, enabling mutual modality awareness during generation. This paper presents DrVoice, a parallel speech-text voice conversation model based on joint autoregressive modeling, featuring dual-resolution speech representations. Notably, while current methods mainly utilize a 12.5Hz input audio representation, our proposed dual-resolution mechanism reduces the input frequency for the LLM to 5Hz, significantly reducing computational cost and alleviating the frequency discrepancy between speech and text tokens, in turn better exploiting LLMs’ capabilities. Experimental results demonstrate that DrVoice-7B establishes a new state-of-the-art (SOTA) on the OpenAudioBench and Big Bench Audio benchmarks, while achieving performance comparable to the SOTA on the VoiceBench and UltraEval-Audio benchmarks, making it a leading open-source speech foundation model among ~7B-parameter models.
[104] SANSKRITI: A Comprehensive Benchmark for Evaluating Language Models’ Knowledge of Indian Culture
Arijit Maji, Raghvendra Kumar, Akash Ghosh, Anushka, Sriparna Saha
Main category: cs.CL
TL;DR: SANSKRITI is a comprehensive benchmark with 21,853 question-answer pairs covering India’s cultural diversity across 28 states and 8 union territories, designed to evaluate language models’ cultural understanding.
Details
Motivation: Language models need to understand local socio-cultural contexts for global effectiveness, but current models lack proper evaluation for culturally nuanced queries, especially in diverse regions like India.Method: Created a large-scale dataset covering 16 key attributes of Indian culture including rituals, history, tourism, cuisine, dance, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities.
Result: Evaluation revealed significant disparities in models’ ability to handle culturally nuanced queries, with many struggling in region-specific contexts. The benchmark exposed gaps in cultural understanding across LLMs, ILMs, and SLMs.
Conclusion: SANSKRITI sets a new standard for assessing and improving cultural understanding in language models, providing an extensive and culturally rich dataset for better evaluation of cultural comprehension capabilities.
Abstract: Language Models (LMs) are indispensable tools shaping modern workflows, but their global effectiveness depends on understanding local socio-cultural contexts. To address this, we introduce SANSKRITI, a benchmark designed to evaluate language models’ comprehension of India’s rich cultural diversity. Comprising 21,853 meticulously curated question-answer pairs spanning 28 states and 8 union territories, SANSKRITI is the largest dataset for testing Indian cultural knowledge. It covers sixteen key attributes of Indian culture: rituals and ceremonies, history, tourism, cuisine, dance and music, costume, language, art, festivals, religion, medicine, transport, sports, nightlife, and personalities, providing a comprehensive representation of India’s cultural tapestry. We evaluate SANSKRITI on leading Large Language Models (LLMs), Indic Language Models (ILMs), and Small Language Models (SLMs), revealing significant disparities in their ability to handle culturally nuanced queries, with many models struggling in region-specific contexts. By offering an extensive, culturally rich, and diverse dataset, SANSKRITI sets a new standard for assessing and improving the cultural understanding of LMs.
[105] From Language to Action: A Review of Large Language Models as Autonomous Agents and Tool Users
Sadia Sultana Chowa, Riasad Alvi, Subhey Sadi Rahman, Md Abdur Rahman, Mohaimenul Azam Khan Raiaan, Md Rafiqul Islam, Mukhtar Hussain, Sami Azam
Main category: cs.CL
TL;DR: This review paper analyzes recent developments in using Large Language Models (LLMs) as autonomous agents and tool users, covering architectural design, cognitive mechanisms, benchmarks, and future research directions.
Details
Motivation: The pursuit of human-level AI has advanced autonomous agents and LLMs, which are now widely used as decision-making agents due to their ability to interpret instructions, manage sequential tasks, and adapt through feedback.Method: The review examines papers published between 2023 and 2025 from A*/A-ranked conferences and Q1 journals, analyzing LLM agents’ architectural design principles (single-agent vs multi-agent systems), tool integration strategies, cognitive mechanisms (reasoning, planning, memory), and impact of prompting/fine-tuning on performance.
Result: The study evaluated current benchmarks and assessment protocols, analyzed 68 publicly available datasets, and identified critical findings on verifiable reasoning of LLMs, self-improvement capacity, and personalization of LLM-based agents.
Conclusion: The review discusses ten future research directions to address identified gaps in LLM-based autonomous agent development.
Abstract: The pursuit of human-level artificial intelligence (AI) has significantly advanced the development of autonomous agents and Large Language Models (LLMs). LLMs are now widely utilized as decision-making agents for their ability to interpret instructions, manage sequential tasks, and adapt through feedback. This review examines recent developments in employing LLMs as autonomous agents and tool users and comprises seven research questions. Only papers published between 2023 and 2025 in A*- and A-ranked conferences and Q1 journals were considered. A structured analysis of the LLM agents’ architectural design principles, dividing their applications into single-agent and multi-agent systems, and strategies for integrating external tools is presented. In addition, the cognitive mechanisms of LLMs, including reasoning, planning, and memory, and the impact of prompting methods and fine-tuning procedures on agent performance are also investigated. Furthermore, we evaluated current benchmarks and assessment protocols and have provided an analysis of 68 publicly available datasets to assess the performance of LLM-based agents in various tasks. In conducting this review, we have identified critical findings on verifiable reasoning of LLMs, the capacity for self-improvement, and the personalization of LLM-based agents. Finally, we have discussed ten future research directions to overcome these gaps.
[106] Semantic Agreement Enables Efficient Open-Ended LLM Cascades
Duncan Soiffer, Steven Kolawole, Virginia Smith
Main category: cs.CL
TL;DR: Semantic agreement between ensemble outputs serves as a training-free signal for reliable deferral in cascade systems, matching target-model quality at 40% cost while reducing latency by 60%.
Details
Motivation: Cascade systems face challenges in determining output reliability for open-ended text generation where quality lies on a continuous spectrum with multiple valid responses.Method: Propose semantic agreement (meaning-level consensus between ensemble outputs) as a training-free signal for reliable deferral, which doesn’t require model internals and works across black-box APIs.
Result: Semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%, evaluated from 500M to 70B-parameter models.
Conclusion: Semantic agreement provides a practical baseline for real-world LLM deployment that remains robust to model updates and works without access to model internals.
Abstract: Cascade systems route computational requests to smaller models when possible and defer to larger models only when necessary, offering a promising approach to balance cost and quality in LLM deployment. However, they face a fundamental challenge in open-ended text generation: determining output reliability when generation quality lies on a continuous spectrum, often with multiple valid responses. To address this, we propose semantic agreement – meaning-level consensus between ensemble outputs – as a training-free signal for reliable deferral. We show that when diverse model outputs agree semantically, their consensus is a stronger reliability signal than token-level confidence. Evaluated from 500M to 70B-parameter models, we find that semantic cascades match or surpass target-model quality at 40% of the cost and reduce latency by up to 60%. Our method requires no model internals, works across black-box APIs, and remains robust to model updates, making it a practical baseline for real-world LLM deployment.
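A minimal sketch of semantic-agreement deferral: sample several answers from the small model, embed them, and escalate only when mean pairwise cosine similarity falls below a threshold (`embed`, the two models, and the threshold are hypothetical):

```python
import numpy as np

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all off-diagonal pairs of answer embeddings."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sims = v @ v.T
    n = len(v)
    return (sims.sum() - n) / (n * (n - 1))

def cascade(query, small_model, large_model, embed, n_samples=4, threshold=0.85):
    answers = [small_model(query) for _ in range(n_samples)]
    agreement = mean_pairwise_cosine([embed(a) for a in answers])
    if agreement >= threshold:
        return answers[0], "small"            # consensus: trust the cheap model
    return large_model(query), "deferred"     # disagreement: pay for the big model

# Toy usage with a stand-in for a sentence embedder.
rng = np.random.default_rng(1)
fake_embed = lambda s: rng.normal(size=16) if "?" in s else np.ones(16)
ans, route = cascade("2+2", lambda q: "4", lambda q: "4 (verified)", fake_embed)
print(ans, route)
```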
[107] Are you sure? Measuring models bias in content moderation through uncertainty
Alessandra Urbinati, Mirko Lai, Simona Frenda, Marco Antonio Stranisci
Main category: cs.CL
TL;DR: This paper presents an unsupervised approach using conformal prediction to measure bias in language models for content moderation by analyzing uncertainty in classifying messages from vulnerable groups.
Details
Motivation: Language models used for content moderation perpetuate racial and social biases, and current methods for measuring fairness remain inadequate despite existing benchmarks.Method: The authors use conformal prediction to compute model uncertainty when classifying messages annotated by vulnerable groups (women and non-white annotators), comparing this with traditional performance metrics like F1 score.
Result: Some pre-trained models show high accuracy on minority group labels but low confidence in predictions, revealing that confidence measurement can identify which annotator groups are better represented in models.
Conclusion: Measuring model confidence through uncertainty analysis helps identify underrepresented groups and guides debiasing processes before model deployment in content moderation.
Abstract: Automatic content moderation is crucial to ensuring safety in social media. Language Model-based classifiers are being increasingly adopted for this task, but it has been shown that they perpetuate racial and social biases. Even if several resources and benchmark corpora have been developed to challenge this issue, measuring the fairness of models in content moderation remains an open issue. In this work, we present an unsupervised approach that benchmarks models on the basis of their uncertainty in classifying messages annotated by people belonging to vulnerable groups. We use uncertainty, computed by means of the conformal prediction technique, as a proxy to analyze the bias of 11 models against women and non-white annotators and observe to what extent it diverges from metrics based on performance, such as the F1 score. The results show that some pre-trained models predict with high accuracy the labels coming from minority groups, even if the confidence in their prediction is low. Therefore, by measuring the confidence of models, we are able to see which groups of annotators are better represented in pre-trained models and can guide the debiasing process of these models before their effective use.
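A minimal sketch of split conformal prediction used as an uncertainty proxy, where the size of each message's prediction set serves as the model's uncertainty (the nonconformity score and toy data are illustrative; the paper's construction may differ):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (n, K) calibration probabilities; cal_labels: (n,) true labels."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]             # nonconformity scores
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(probs, q):
    """All labels whose nonconformity 1 - p(label) does not exceed the threshold."""
    return np.where(1.0 - probs <= q)[0]

# Toy usage: larger prediction sets indicate higher uncertainty for a message.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(3), size=200)
cal_labels = cal_probs.argmax(axis=1)                              # toy "true" labels
q = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
test_probs = np.array([0.6, 0.3, 0.1])
print(prediction_set(test_probs, q), "set size =", len(prediction_set(test_probs, q)))
```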
[108] GRPO-MA: Multi-Answer Generation in GRPO for Stable and Efficient Chain-of-Thought Training
Hongcheng Wang, Yinuo Huang, Sukai Wang, Guanghui Ren, Hao Dong
Main category: cs.CL
TL;DR: GRPO-MA improves upon GRPO by generating multiple answers per thought to address gradient coupling, sparse rewards, and unstable advantage estimation, leading to better performance and training efficiency.
Details
Motivation: To address three key challenges in GRPO: gradient coupling between thoughts and answers, sparse reward signals from limited parallel sampling, and unstable advantage estimation.Method: Propose GRPO-MA which leverages multi-answer generation from each thought process, theoretically reducing variance of thought advantage as more answers are generated per thought.
Result: Empirical gradient analysis shows GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and multimodal tasks demonstrate substantial performance and training efficiency improvements.
Conclusion: Increasing the number of answers per thought consistently enhances model performance, making GRPO-MA a robust and efficient optimization method for training reasoning in LLMs and VLMs.
Abstract: Recent progress, such as DeepSeek-R1, has shown that the GRPO algorithm, a Reinforcement Learning (RL) approach, can effectively train Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) and Vision-Language Models (VLMs). In this paper, we analyze three challenges of GRPO: gradient coupling between thoughts and answers, sparse reward signals caused by limited parallel sampling, and unstable advantage estimation. To mitigate these challenges, we propose GRPO-MA, a simple yet theoretically grounded method that leverages multi-answer generation from each thought process, enabling more robust and efficient optimization. Theoretically, we show that the variance of thought advantage decreases as the number of answers per thought increases. Empirically, our gradient analysis confirms this effect, showing that GRPO-MA reduces gradient spikes compared to GRPO. Experiments on math, code, and diverse multimodal tasks demonstrate that GRPO-MA substantially improves performance and training efficiency. Our ablation studies further reveal that increasing the number of answers per thought consistently enhances model performance.
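A minimal numpy sketch of the multi-answer advantage idea described above: each thought's value is the mean reward of its M sampled answers, and advantages are normalized across the group (this follows the summary, not necessarily the paper's exact estimator):

```python
import numpy as np

def grpo_ma_advantages(rewards):
    """rewards: (n_thoughts, n_answers_per_thought) array of scalar answer rewards."""
    thought_values = rewards.mean(axis=1)            # averaging M answers lowers variance
    mu, sigma = thought_values.mean(), thought_values.std() + 1e-8
    return (thought_values - mu) / sigma             # group-normalized thought advantages

# Toy usage: four thoughts with different underlying success rates, eight answers each.
rng = np.random.default_rng(0)
rewards = rng.binomial(1, p=[[0.2], [0.5], [0.8], [0.9]], size=(4, 8)).astype(float)
print(grpo_ma_advantages(rewards))
```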
[109] The Dialogue That Heals: A Comprehensive Evaluation of Doctor Agents’ Inquiry Capability
Linlu Gong, Ante Wang, Yunghwei Lai, Weizhi Ma, Yang Liu
Main category: cs.CL
TL;DR: MAQuE is a large benchmark for evaluating medical AI questioning skills, featuring 3,000 simulated patients with diverse characteristics and a comprehensive evaluation framework covering multiple aspects of medical dialogue.
Details
Motivation: Current AI doctors focus mainly on diagnostic skills but overlook other essential physician qualities like empathy, communication, and patient interaction. There's a need for comprehensive evaluation of medical questioning capabilities.Method: Created MAQuE benchmark with 3,000 realistically simulated patient agents exhibiting diverse linguistic patterns, cognitive limitations, emotional responses, and passive disclosure tendencies. Introduced multi-faceted evaluation framework covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience.
Result: Experiments show substantial challenges across evaluation aspects. State-of-the-art models have significant room for improvement in inquiry capabilities. Models are highly sensitive to realistic patient behavior variations, which considerably impacts diagnostic accuracy. Fine-grained metrics reveal trade-offs between different evaluation perspectives.
Conclusion: Balancing performance and practicality in real-world clinical settings is challenging. Current AI models need significant improvement in comprehensive medical questioning skills beyond just diagnostic accuracy.
Abstract: An effective physician should possess a combination of empathy, expertise, patience, and clear communication when treating a patient. Recent advances have successfully endowed AI doctors with expert diagnostic skills, particularly the ability to actively seek information through inquiry. However, other essential qualities of a good doctor remain overlooked. To bridge this gap, we present MAQuE (Medical Agent Questioning Evaluation), the largest-ever benchmark for the automatic and comprehensive evaluation of medical multi-turn questioning. It features 3,000 realistically simulated patient agents that exhibit diverse linguistic patterns, cognitive limitations, emotional responses, and tendencies for passive disclosure. We also introduce a multi-faceted evaluation framework, covering task success, inquiry proficiency, dialogue competence, inquiry efficiency, and patient experience. Experiments on different LLMs reveal substantial challenges across the evaluation aspects. Even state-of-the-art models show significant room for improvement in their inquiry capabilities. These models are highly sensitive to variations in realistic patient behavior, which considerably impacts diagnostic accuracy. Furthermore, our fine-grained metrics expose trade-offs between different evaluation perspectives, highlighting the challenge of balancing performance and practicality in real-world clinical settings.
[110] AdaDetectGPT: Adaptive Detection of LLM-Generated Text with Statistical Guarantees
Hongyi Zhou, Jin Zhu, Pingfan Su, Kai Ye, Ying Yang, Shakeel A O B Gavioli-Akilagun, Chengchun Shi
Main category: cs.CL
TL;DR: AdaDetectGPT is a novel classifier that adaptively learns witness functions from training data to improve LLM-generated text detection, outperforming existing logits-based methods by up to 37%.
Details
Motivation: Existing logits-based detectors rely solely on log-probability statistics from source LLMs, which can be sub-optimal for distinguishing human-written from LLM-generated text.Method: Introduces AdaDetectGPT which adaptively learns witness functions from training data to enhance logits-based detectors, providing statistical guarantees on detection performance metrics.
Result: Extensive numerical studies show AdaDetectGPT nearly uniformly improves state-of-the-art methods across various dataset-LLM combinations, with improvements reaching up to 37%.
Conclusion: AdaDetectGPT provides a more effective approach for LLM-generated text detection by adaptively learning from data rather than relying solely on log-probability statistics.
Abstract: We study the problem of determining whether a piece of text has been authored by a human or by a large language model (LLM). Existing state-of-the-art logits-based detectors make use of statistics derived from the log-probability of the observed text evaluated using the distribution function of a given source LLM. However, relying solely on log probabilities can be sub-optimal. In response, we introduce AdaDetectGPT – a novel classifier that adaptively learns a witness function from training data to enhance the performance of logits-based detectors. We provide statistical guarantees on its true positive rate, false positive rate, true negative rate and false negative rate. Extensive numerical studies show AdaDetectGPT nearly uniformly improves the state-of-the-art method in various combinations of datasets and LLMs, and the improvement can reach up to 37%. A python implementation of our method is available at https://github.com/Mamba413/AdaDetectGPT.
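A minimal sketch of pairing a logits-based statistic with a learned witness function: summary features of per-token log-probabilities are fed to a simple classifier fit on labeled human/LLM texts (the feature set and logistic-regression witness are illustrative assumptions and carry none of the paper's statistical guarantees):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def logprob_features(token_logprobs):
    """Summary statistics of a text's per-token log-probabilities under the source LLM."""
    lp = np.asarray(token_logprobs, dtype=float)
    return np.array([lp.mean(), lp.std(), lp.min(), np.percentile(lp, 25)])

# Toy training data: machine text tends to have higher, flatter log-probabilities.
rng = np.random.default_rng(0)
human = [rng.normal(-4.0, 1.5, size=100) for _ in range(200)]
machine = [rng.normal(-2.0, 0.7, size=100) for _ in range(200)]
X = np.stack([logprob_features(t) for t in human + machine])
y = np.array([0] * 200 + [1] * 200)

witness = LogisticRegression(max_iter=1000).fit(X, y)
test = rng.normal(-2.1, 0.8, size=100)
print("P(LLM-generated) =", witness.predict_proba(logprob_features(test)[None])[0, 1])
```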
[111] SEER: The Span-based Emotion Evidence Retrieval Benchmark
Aneesha Sampath, Oya Aran, Emily Mower Provost
Main category: cs.CL
TL;DR: SEER Benchmark tests LLMs’ ability to identify specific text spans that express emotion, moving beyond sentence-level emotion classification to pinpoint exact emotional expressions.
Details
Motivation: Traditional emotion recognition assigns single labels to entire sentences, but applications like empathetic dialogue and clinical support require knowing exactly how and where emotion is expressed in text.
Method: Created SEER Benchmark with two tasks: emotion evidence detection within single sentences and across 5-sentence passages, using 1200 real-world sentences with new emotion and evidence annotations. Evaluated 14 open-source LLMs.
Result: Some models approach average human performance on single-sentence tasks, but accuracy degrades significantly in longer passages. Key failure modes include overreliance on emotion keywords and false positives in neutral text.
Conclusion: Current LLMs struggle with fine-grained emotion evidence detection, especially in longer contexts, highlighting the need for improved span-level emotion understanding capabilities.
Abstract: We introduce the SEER (Span-based Emotion Evidence Retrieval) Benchmark to test Large Language Models’ (LLMs) ability to identify the specific spans of text that express emotion. Unlike traditional emotion recognition tasks that assign a single label to an entire sentence, SEER targets the underexplored task of emotion evidence detection: pinpointing which exact phrases convey emotion. This span-level approach is crucial for applications like empathetic dialogue and clinical support, which need to know how emotion is expressed, not just what the emotion is. SEER includes two tasks: identifying emotion evidence within a single sentence, and identifying evidence across a short passage of five consecutive sentences. It contains new annotations for both emotion and emotion evidence on 1200 real-world sentences. We evaluate 14 open-source LLMs and find that, while some models approach average human performance on single-sentence inputs, their accuracy degrades in longer passages. Our error analysis reveals key failure modes, including overreliance on emotion keywords and false positives in neutral text.
[112] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora
Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang
Main category: cs.CL
TL;DR: LinearRAG is an efficient graph-based RAG framework that constructs relation-free hierarchical graphs using lightweight entity extraction and semantic linking, enabling reliable graph construction and precise passage retrieval with linear scalability.
Details
Motivation: Traditional RAG systems struggle with large-scale unstructured corpora where information is fragmented, and existing graph-based RAG methods rely on unstable and costly relation extraction that produces noisy graphs with incorrect relations.
Method: LinearRAG constructs a Tri-Graph using only entity extraction and semantic linking, avoiding relation modeling. It uses a two-stage retrieval strategy: relevant entity activation via local semantic bridging, followed by passage retrieval through global importance aggregation.
Result: Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models in retrieval performance.
Conclusion: LinearRAG provides an economical and reliable indexing solution that scales linearly with corpus size without extra token consumption, offering a more efficient alternative to traditional graph-based RAG systems.
Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models. Our code and datasets are available at https://github.com/DEEP-PolyU/LinearRAG.
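A minimal sketch of the two-stage retrieval over a relation-free entity-passage graph may help; the toy graph, embeddings, and function names below are hypothetical and only illustrate the activation-then-aggregation idea.
```python
import numpy as np

def retrieve(query_vec, entity_vecs, entity_to_passages, top_entities=2, top_passages=2):
    """Stage 1: activate entities whose embeddings are closest to the query
    (local semantic bridging). Stage 2: aggregate activation onto the passages
    each entity links to (global importance aggregation)."""
    sims = {e: float(np.dot(query_vec, v) /
                     (np.linalg.norm(query_vec) * np.linalg.norm(v)))
            for e, v in entity_vecs.items()}
    active = sorted(sims, key=sims.get, reverse=True)[:top_entities]

    scores = {}
    for e in active:
        for p in entity_to_passages[e]:
            scores[p] = scores.get(p, 0.0) + sims[e]
    return sorted(scores, key=scores.get, reverse=True)[:top_passages]

# Toy graph: two entities, each linked (relation-free) to the passages mentioning them.
entity_vecs = {"Marie Curie": np.array([1.0, 0.1]), "Radium": np.array([0.2, 1.0])}
entity_to_passages = {"Marie Curie": ["p1", "p3"], "Radium": ["p2", "p3"]}
print(retrieve(np.array([0.9, 0.3]), entity_vecs, entity_to_passages))
```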
[113] DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans
Bingsheng Yao, Bo Sun, Yuanzhe Dong, Yuxuan Lu, Dakuo Wang
Main category: cs.CL
TL;DR: The paper introduces DPRF, a framework that iteratively refines persona profiles for LLM role-playing agents to improve behavioral alignment with target individuals through cognitive divergence analysis.
Details
Motivation: Current LLM role-playing agents suffer from poor persona fidelity due to manually-created profiles that lack validation of alignment with target individuals' actual behaviors.
Method: DPRF iteratively identifies cognitive divergence between generated behaviors and human ground truth using free-form or theory-grounded structured analysis, then refines persona profiles to mitigate these divergences.
Result: DPRF consistently improves behavioral alignment considerably over baseline personas across five LLMs and four diverse behavior-prediction scenarios (debates, social media posts, interviews, movie reviews), demonstrating generalization across models and scenarios.
Conclusion: DPRF provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications like user simulation, social studies, and personalized AI.
Abstract: The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs’ behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.
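The refinement loop can be pictured roughly as follows; every call below is a hypothetical placeholder for an LLM invocation, not the authors' code.
```python
def dprf_refine(persona, ground_truth_behaviors, task, llm, rounds=3):
    """Iteratively align an LLM role-playing agent with a target individual:
    generate behavior from the current persona, analyse how it diverges from
    the human ground truth, and rewrite the persona to reduce that divergence."""
    for _ in range(rounds):
        generated = llm(f"Role-play this persona on the task.\nPersona: {persona}\nTask: {task}")
        divergence = llm(
            "Compare the generated behavior with the ground truth and describe the cognitive divergences.\n"
            f"Generated: {generated}\nGround truth: {ground_truth_behaviors}"
        )
        persona = llm(
            "Revise the persona profile to mitigate these divergences.\n"
            f"Persona: {persona}\nDivergences: {divergence}"
        )
    return persona

if __name__ == "__main__":
    echo_llm = lambda prompt: prompt[:60]  # stub so the sketch runs end-to-end
    print(dprf_refine("curious, terse debater", "long, hedged answers", "debate opener", echo_llm, rounds=1))
```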
[114] TokenTiming: A Dynamic Alignment Method for Universal Speculative Decoding Model Pairs
Sibo Xiao, Jinyuan Fu, Zhongle Xie, Lidan Shou
Main category: cs.CL
TL;DR: TokenTiming enables universal speculative decoding for LLM acceleration by using Dynamic Time Warping to align mismatched vocabularies between draft and target models, achieving 1.57x speedup without requiring retraining.
Details
Motivation: Current speculative decoding is limited by requiring draft and target models to share the same vocabulary, which restricts available draft models and often requires training new models from scratch.
Method: Proposes TokenTiming algorithm that re-encodes draft token sequences and uses Dynamic Time Warping (DTW) to build mappings for transferring probability distributions in speculative sampling, accommodating mismatched vocabularies.
Result: Achieves 1.57x speedup in comprehensive experiments across various tasks, working with any off-the-shelf models without retraining or modification.
Conclusion: Enables universal draft model selection for speculative decoding, making it a more versatile and practical tool for LLM acceleration by removing vocabulary matching constraints.
Abstract: Accelerating the inference of large language models (LLMs) has been a critical challenge in generative AI. Speculative decoding (SD) substantially improves LLM inference efficiency. However, its utility is limited by a fundamental constraint: the draft and target models must share the same vocabulary, thus limiting the pool of available draft models and often necessitating the training of a new model from scratch. Inspired by Dynamic Time Warping (DTW), a classic algorithm for aligning time series, we propose the algorithm TokenTiming for universal speculative decoding. It operates by re-encoding the draft token sequence to get a new target token sequence, and then uses DTW to build a mapping to transfer the probability distributions for speculative sampling. Benefiting from this, our method accommodates mismatched vocabularies and works with any off-the-shelf models without retraining or modification. We conduct comprehensive experiments on various tasks, demonstrating a 1.57x speedup. This work enables a universal approach to draft model selection, making SD a more versatile and practical tool for LLM acceleration.
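For intuition, the alignment step reduces to classic DTW between the draft-model token sequence and the same text re-encoded with the target tokenizer; the 0/1 cost used below is a simplification, not the paper's cost function.
```python
import numpy as np

def dtw_alignment(draft_tokens, target_tokens, cost=lambda a, b: 0.0 if a == b else 1.0):
    """Dynamic Time Warping between a draft-model token sequence and the same text
    re-encoded with the target tokenizer; the warping path is what a TokenTiming-style
    method would use to transfer draft probabilities onto target tokens."""
    n, m = len(draft_tokens), len(target_tokens)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = cost(draft_tokens[i - 1], target_tokens[j - 1]) + min(
                D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda ij: D[ij])
    return path[::-1]

print(dtw_alignment(["spec", "ulative", " decoding"], ["specul", "ative", " dec", "oding"]))
```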
[115] MENTOR: A Reinforcement Learning Framework for Enabling Tool Use in Small Models via Teacher-Optimized Rewards
ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim
Main category: cs.CL
TL;DR: MENTOR is a framework that combines reinforcement learning with teacher-guided distillation to improve tool-using capabilities in small language models, addressing limitations of supervised fine-tuning and standard RL approaches.
Details
Motivation: Current approaches like supervised fine-tuning suffer from poor generalization as they only imitate static teacher trajectories, while standard RL with sparse rewards fails to effectively guide small language models due to inefficient exploration and suboptimal strategies.
Method: MENTOR synergistically combines RL with teacher-guided distillation, using an RL-based process to learn generalizable policies through exploration, and constructs dense composite teacher-guided rewards from reference trajectories to provide fine-grained guidance.
Result: Extensive experiments show MENTOR significantly improves cross-domain generalization and strategic competence of small language models compared to both supervised fine-tuning and standard sparse-reward RL baselines.
Conclusion: The proposed MENTOR framework effectively addresses the limitations of existing approaches by combining RL exploration with teacher-guided dense rewards, enabling more robust and generalizable tool-using capabilities in small language models.
Abstract: Distilling the tool-using capabilities of large language models (LLMs) into smaller, more efficient small language models (SLMs) is a key challenge for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor generalization as it trains models to imitate a static set of teacher trajectories rather than learn a robust methodology. While reinforcement learning (RL) offers an alternative, the standard RL using sparse rewards fails to effectively guide SLMs, causing them to struggle with inefficient exploration and adopt suboptimal strategies. To address these distinct challenges, we propose MENTOR, a framework that synergistically combines RL with teacher-guided distillation. Instead of simple imitation, MENTOR employs an RL-based process to learn a more generalizable policy through exploration. In addition, to solve the problem of reward sparsity, it uses a teacher’s reference trajectory to construct a dense, composite teacher-guided reward that provides fine-grained guidance. Extensive experiments demonstrate that MENTOR significantly improves the cross-domain generalization and strategic competence of SLMs compared to both SFT and standard sparse-reward RL baselines.
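As a purely illustrative sketch, a dense teacher-guided reward can be thought of as the sparse task outcome plus a shaping term for agreement with the teacher's reference trajectory; the paper's composite reward is richer than this.
```python
def teacher_guided_reward(student_actions, teacher_actions, task_success, beta=0.5):
    """Dense composite reward: the sparse task outcome plus a shaping term that
    credits the student for each step matching the teacher's reference trajectory."""
    overlap = sum(s == t for s, t in zip(student_actions, teacher_actions))
    shaping = overlap / max(len(teacher_actions), 1)
    return float(task_success) + beta * shaping

# Student matched 2 of the teacher's 3 tool calls but failed the final task.
print(teacher_guided_reward(["search", "parse", "answer"],
                            ["search", "parse", "calculate"], task_success=False))
```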
[116] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Kailin Jiang, Ning Jiang, Yuntao Du, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Qingqing Liu, Xianhao Wang, Yifan Jia, Hongbo Jiang, Yaocong Hu, Bin Li, Lei Liu
Main category: cs.CL
TL;DR: MINED is a comprehensive benchmark for evaluating temporal awareness in Large Multimodal Models (LMMs) across 6 dimensions and 11 tasks, revealing that most LMMs struggle with time-sensitive knowledge and knowledge editing can help update such knowledge.
Details
Motivation: Current LMMs have static representations that struggle with time-sensitive factual knowledge, and existing benchmarks are inadequate for evaluating temporal awareness.
Method: Constructed MINED benchmark from Wikipedia with professional annotators, containing 2,104 time-sensitive knowledge samples across six knowledge types, then evaluated 15 LMMs and tested knowledge editing methods.
Result: Gemini-2.5-Pro achieved highest average CEM score (63.07), most open-source LMMs lack time understanding ability, LMMs perform best on organization knowledge and worst on sport knowledge, and knowledge editing methods can effectively update time-sensitive knowledge in single editing scenarios.
Conclusion: LMMs generally struggle with temporal awareness, knowledge editing shows promise for updating time-sensitive knowledge, and MINED provides a comprehensive framework for evaluating temporal understanding in multimodal models.
Abstract: Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs’ ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
[117] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson, Eetu Mäkelä, Mikko Tolonen
Main category: cs.CL
TL;DR: This paper introduces a new task of extracting Latin fragments from mixed-language historical documents with diverse layouts, evaluates large foundation models on a multimodal dataset of 724 annotated pages, and shows that reliable Latin detection is achievable with current models.
Details
Motivation: To address the challenge of extracting Latin fragments from historical documents that contain multiple languages and varied layouts, which is important for historical document analysis and digital humanities research.
Method: Benchmarked and evaluated large foundation models using a multimodal dataset of 724 annotated pages from mixed-language historical documents with varied layouts.
Result: The results demonstrate that reliable Latin detection with contemporary models is achievable, providing the first comprehensive analysis of these models’ capabilities and limitations for this specific task.
Conclusion: The study successfully establishes that current large foundation models can reliably detect Latin fragments in complex historical documents, marking an important advancement for historical document analysis and digital humanities applications.
Abstract: This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models’ capabilities and limits for this task.
[118] Context-level Language Modeling by Learning Predictive Context Embeddings
Beiya Dai, Yuliang Liu, Daozheng Xue, Qipeng Guo, Kai Chen, Xinbing Wang, Bowen Zhou, Zhouhan Lin
Main category: cs.CL
TL;DR: ContextLM enhances standard LLM pretraining by adding next-context prediction alongside next-token prediction, improving semantic understanding and long-range coherence while maintaining compatibility with standard evaluation methods.
Details
Motivation: Next-token prediction in current LLMs limits their ability to capture higher-level semantic structures and long-range contextual relationships, creating a need for more comprehensive training objectives.
Method: ContextLM framework augments standard pretraining with next-context prediction, training models to learn predictive representations of multi-token contexts using error signals from future token chunks.
Result: Experiments on GPT2 and Pythia models up to 1.5B parameters show consistent improvements in perplexity and downstream task performance, with better long-range coherence and attention allocation.
Conclusion: Next-context prediction provides a scalable and efficient pathway to stronger language modeling with minimal computational overhead, enhancing LLM capabilities while maintaining standard evaluation compatibility.
Abstract: Next-token prediction (NTP) is the cornerstone of modern large language models (LLMs) pretraining, driving their unprecedented capabilities in text generation, reasoning, and instruction following. However, the token-level prediction limits the model’s capacity to capture higher-level semantic structures and long-range contextual relationships. To overcome this limitation, we introduce \textbf{ContextLM}, a framework that augments standard pretraining with an inherent \textbf{next-context prediction} objective. This mechanism trains the model to learn predictive representations of multi-token contexts, leveraging error signals derived from future token chunks. Crucially, ContextLM achieves this enhancement while remaining fully compatible with the standard autoregressive, token-by-token evaluation paradigm (e.g., perplexity). Extensive experiments on the GPT2 and Pythia model families, scaled up to $1.5$B parameters, show that ContextLM delivers consistent improvements in both perplexity and downstream task performance. Our analysis indicates that next-context prediction provides a scalable and efficient pathway to stronger language modeling, yielding better long-range coherence and more effective attention allocation with minimal computational overhead.
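A loose sketch of how a next-context term could sit alongside the usual next-token loss; the mean-pooled chunk targets, the prediction head, and the MSE choice are assumptions made for illustration, not the paper's exact objective.
```python
import torch
import torch.nn.functional as F

def contextlm_loss(token_logits, targets, hidden, context_head, chunk_size=4, alpha=0.5):
    """Standard next-token cross-entropy plus a next-context term: a small head
    predicts a summary embedding of the following multi-token chunk."""
    # Next-token prediction over the whole sequence.
    ntp = F.cross_entropy(token_logits.reshape(-1, token_logits.size(-1)), targets.reshape(-1))

    # Next-context prediction: from each chunk summary, predict the next chunk's summary.
    B, T, H = hidden.shape
    n_chunks = T // chunk_size
    chunks = hidden[:, : n_chunks * chunk_size].reshape(B, n_chunks, chunk_size, H).mean(2)
    pred = context_head(chunks[:, :-1])
    ctx = F.mse_loss(pred, chunks[:, 1:].detach())
    return ntp + alpha * ctx

# Toy shapes only: vocab 50, hidden 16, batch 2, length 12.
logits = torch.randn(2, 12, 50)
targets = torch.randint(0, 50, (2, 12))
hidden = torch.randn(2, 12, 16)
head = torch.nn.Linear(16, 16)
print(contextlm_loss(logits, targets, hidden, head))
```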
[119] Exploration of Summarization by Generative Language Models for Automated Scoring of Long Essays
Haowei Hua, Hong Jiao, Xinyi Wang
Main category: cs.CL
TL;DR: Using generative language models with summarization and prompting improves automated scoring of long essays, overcoming BERT’s 512-token limit and increasing QWK from 0.822 to 0.8878.
Details
Motivation: BERT and its variants have a 512-token limit, which is insufficient for automated scoring of long essays.
Method: Employ generative language models for automated scoring via summarization and prompting techniques.
Result: Scoring accuracy improved significantly with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
Conclusion: Generative language models with summarization and prompting are effective for automated scoring of long essays, outperforming encoder-based models like BERT.
Abstract: BERT and its variants are extensively explored for automated scoring. However, the 512-token limit of these encoder-based models makes them deficient for automated scoring of long essays. Thus, this research explores generative language models for automated scoring of long essays via summarization and prompting. The results revealed a substantial improvement in scoring accuracy, with QWK increasing from 0.822 to 0.8878 on the Learning Agency Lab Automated Essay Scoring 2.0 dataset.
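QWK here is quadratic weighted kappa, the standard agreement metric in automated essay scoring; it can be computed, for example, with scikit-learn (the scores below are made up).
```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical human vs. model scores on a 1-6 rubric (toy values only).
human = [3, 4, 2, 5, 4, 3, 6, 2]
model = [3, 4, 3, 5, 5, 3, 6, 2]

qwk = cohen_kappa_score(human, model, weights="quadratic")
print(f"QWK = {qwk:.4f}")  # the paper reports an increase from 0.822 to 0.8878
```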
[120] MATCH: Task-Driven Code Evaluation through Contrastive Learning
Marah Ghoummaid, Vladimir Tchuiev, Ofek Glick, Michal Moshkovitz, Dotan Di Castro
Main category: cs.CL
TL;DR: MATCH is a novel reference-free metric for evaluating AI-generated code that uses contrastive learning to create embeddings for code and natural language descriptions, enabling similarity scoring that better reflects functional correctness than existing metrics.
Details
Motivation: Traditional code evaluation methods like unit tests are unscalable, syntactic metrics don't capture functionality, and reference-based metrics require reference code which isn't always available. There's a gap in reference-free evaluation methods.
Method: Uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring between generated code and the intended task.
Result: MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
Conclusion: MATCH provides an effective reference-free evaluation method for AI-generated code that better captures functional alignment with developer intent compared to existing approaches.
Abstract: AI-based code generation is increasingly prevalent, with GitHub Copilot estimated to generate 46% of the code on GitHub. Accurately evaluating how well generated code aligns with developer intent remains a critical challenge. Traditional evaluation methods, such as unit tests, are often unscalable and costly. Syntactic similarity metrics (e.g., BLEU, ROUGE) fail to capture code functionality, and metrics like CodeBERTScore require reference code, which is not always available. To address the gap in reference-free evaluation, with few alternatives such as ICE-Score, this paper introduces MATCH, a novel reference-free metric. MATCH uses Contrastive Learning to generate meaningful embeddings for code and natural language task descriptions, enabling similarity scoring that reflects how well generated code implements the task. We show that MATCH achieves stronger correlations with functional correctness and human preference than existing metrics across multiple programming languages.
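The recipe, contrastive training of a shared embedding space for code and task descriptions followed by cosine-similarity scoring, can be sketched as below; the encoders are stand-ins and the loss is a generic symmetric InfoNCE, not necessarily the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def info_nce(code_emb, desc_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (code, description) pairs: matching pairs
    are pulled together, mismatched pairs within the batch are pushed apart."""
    code_emb = F.normalize(code_emb, dim=-1)
    desc_emb = F.normalize(desc_emb, dim=-1)
    logits = code_emb @ desc_emb.t() / temperature
    labels = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def match_score(code_emb, desc_emb):
    """Reference-free evaluation: cosine similarity between the generated code's
    embedding and the task description's embedding."""
    return F.cosine_similarity(code_emb, desc_emb, dim=-1)

# Toy embeddings standing in for the outputs of the code / text encoders.
code = torch.randn(4, 128)
desc = torch.randn(4, 128)
print(info_nce(code, desc), match_score(code[:1], desc[:1]))
```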
[121] BrowseConf: Confidence-Guided Test-Time Scaling for Web Agents
Litu Ou, Kuan Li, Huifeng Yin, Liwen Zhang, Zhongwang Zhang, Xixi Wu, Rui Ye, Zile Qiao, Pengjun Xie, Jingren Zhou, Yong Jiang
Main category: cs.CL
TL;DR: This paper investigates whether LLM-based search agents can communicate confidence in multi-turn interactions and proposes Test-Time Scaling methods that use confidence scores to improve answer quality while reducing token consumption.
Details
Motivation: Existing work on LLM confidence mainly focuses on single-turn scenarios, but research on confidence in complex multi-turn interactions is limited. The authors want to explore if LLM-based search agents can verbalize confidence after long action sequences.
Method: The authors experiment with open-source agentic models and propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encouraging models to retry until reaching satisfactory confidence levels.
Result: Models show much higher task accuracy at high confidence levels and near-zero accuracy when confidence is low. The proposed TTS methods significantly reduce token consumption while maintaining competitive performance compared to baseline fixed budget methods.
Conclusion: LLM-based search agents can effectively communicate confidence in multi-turn scenarios, and using confidence scores to guide retry mechanisms improves efficiency by reducing token usage while maintaining performance.
Abstract: Confidence in LLMs is a useful indicator of model uncertainty and answer reliability. Existing work has mainly focused on single-turn scenarios, while research on confidence in complex multi-turn interactions is limited. In this paper, we investigate whether LLM-based search agents have the ability to communicate their own confidence through verbalized confidence scores after long sequences of actions, a significantly more challenging task compared to outputting confidence in a single interaction. Experimenting on open-source agentic models, we first find that models exhibit much higher task accuracy at high confidence while having near-zero accuracy when confidence is low. Based on this observation, we propose Test-Time Scaling (TTS) methods that use confidence scores to determine answer quality, encouraging the model to try again until reaching a satisfactory confidence level. Results show that our proposed methods significantly reduce token consumption while demonstrating competitive performance compared to baseline fixed-budget TTS methods.
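In its simplest form the confidence-guided retry reduces to the loop below; the agent call and the confidence parsing are placeholders, and the paper's TTS variants are more involved.
```python
import random

def confidence_guided_answer(run_agent, question, threshold=0.8, max_attempts=4):
    """Re-run a search agent until its verbalized confidence clears a threshold,
    instead of always spending a fixed sampling budget."""
    best = None
    for _ in range(max_attempts):
        answer, confidence = run_agent(question)   # agent returns answer + self-reported confidence
        if best is None or confidence > best[1]:
            best = (answer, confidence)
        if confidence >= threshold:                 # stop early: high confidence correlates with accuracy
            break
    return best

# Stub agent so the sketch runs; a real agent would browse, then verbalize its confidence.
stub = lambda q: ("Paris", random.uniform(0.5, 1.0))
print(confidence_guided_answer(stub, "What is the capital of France?"))
```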
cs.CV
[122] Explainable Detection of AI-Generated Images with Artifact Localization Using Faster-Than-Lies and Vision-Language Models for Edge Devices
Aryan Mathur, Asaduddin Ahmed, Pushti Amit Vasoya, Simeon Kandan Sonar, Yasir Z, Madesh Kuppusamy
Main category: cs.CV
TL;DR: An explainable AI system for detecting image authenticity using a lightweight CNN classifier and Vision-Language Model that achieves 96.5% accuracy on 32x32 images with fast inference time.
Details
Motivation: The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity, requiring reliable detection methods.
Method: Combines Faster-Than-Lies convolutional classifier with Qwen2-VL-7B Vision-Language Model, using autoencoder-based reconstruction error maps for artifact localization and categorization of 70 visual artifact types.
Result: Achieves 96.5% accuracy on extended CiFAKE dataset with adversarial perturbations, 175ms inference time on 8-core CPUs, and generates explainable text for detected anomalies.
Conclusion: Demonstrates feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery with cross-domain applications.
Abstract: The increasing realism of AI-generated imagery poses challenges for verifying visual authenticity. We present an explainable image authenticity detection system that combines a lightweight convolutional classifier (“Faster-Than-Lies”) with a Vision-Language Model (Qwen2-VL-7B) to classify, localize, and explain artifacts in 32x32 images. Our model achieves 96.5% accuracy on the extended CiFAKE dataset augmented with adversarial perturbations and maintains an inference time of 175ms on 8-core CPUs, enabling deployment on local or edge devices. Using autoencoder-based reconstruction error maps, we generate artifact localization heatmaps, which enhance interpretability for both humans and the VLM. We further categorize 70 visual artifact types into eight semantic groups and demonstrate explainable text generation for each detected anomaly. This work highlights the feasibility of combining visual and linguistic reasoning for interpretable authenticity detection in low-resolution imagery and outlines potential cross-domain applications in forensics, industrial inspection, and social media moderation.
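The artifact-localization step rests on autoencoder reconstruction error; a minimal version of turning that error into a heatmap looks like this (the autoencoder is a stand-in so the sketch runs).
```python
import numpy as np

def artifact_heatmap(image, autoencoder):
    """Reconstruct the image and use the per-pixel reconstruction error as an
    artifact-localization map, normalized to [0, 1] for visualization/overlay."""
    recon = autoencoder(image)                                   # (H, W, C) reconstruction
    err = np.abs(image.astype(float) - recon.astype(float)).mean(axis=-1)
    return (err - err.min()) / (err.max() - err.min() + 1e-8)

# Stand-in autoencoder: identity plus noise, only to make the sketch executable.
fake_ae = lambda x: np.clip(x + np.random.normal(0, 5, x.shape), 0, 255)
img = np.random.randint(0, 255, (32, 32, 3)).astype(np.uint8)
print(artifact_heatmap(img, fake_ae).shape)   # (32, 32) heatmap
```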
[123] CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting
Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen
Main category: cs.CV
TL;DR: CountFormer is a transformer-based framework for class-agnostic object counting that uses DINOv2 for rich feature representations and achieves state-of-the-art performance on structurally complex scenes.
Details
Motivation: Humans can count diverse objects by perceiving visual repetition and structural relationships, but existing counting models often miscount on objects with complex shapes, internal symmetry, or overlapping components.
Method: Built on CounTR architecture, replaces visual encoder with self-supervised DINOv2 foundation model, incorporates positional embedding fusion to preserve geometric relationships, and uses lightweight convolutional decoder to generate density maps.
Result: Achieves performance comparable to current state-of-the-art methods on FSC-147 dataset, with superior accuracy on structurally intricate or densely packed scenes.
Conclusion: Integrating foundation models like DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.
Abstract: Humans can effortlessly count diverse objects by perceiving visual repetition and structural relationships rather than relying on class identity. However, most existing counting models fail to replicate this ability; they often miscount when objects exhibit complex shapes, internal symmetry, or overlapping components. In this work, we introduce CountFormer, a transformer-based framework that learns to recognize repetition and structural coherence for class-agnostic object counting. Built upon the CounTR architecture, our model replaces its visual encoder with the self-supervised foundation model DINOv2, which produces richer and spatially consistent feature representations. We further incorporate positional embedding fusion to preserve geometric relationships before decoding these features into density maps through a lightweight convolutional decoder. Evaluated on the FSC-147 dataset, our model achieves performance comparable to current state-of-the-art methods while demonstrating superior accuracy on structurally intricate or densely packed scenes. Our findings indicate that integrating foundation models such as DINOv2 enables counting systems to approach human-like structural perception, advancing toward a truly general and exemplar-free counting paradigm.
[124] A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras
Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon
Main category: cs.CV
TL;DR: A novel framework using fixed cameras and deep learning for continuous monitoring of floating debris in rivers, with geometric modeling for object size estimation.
Details
Motivation: Address the environmental concern of floating anthropogenic debris in rivers that negatively impacts biodiversity, water quality, navigation, and recreation.
Method: Utilizes fixed in-situ cameras with deep learning models for debris detection and quantification, tested under various environmental conditions. Implements geometric model using camera intrinsic/extrinsic characteristics for object size estimation from 2D images.
Result: Identifies most suitable deep learning models for accuracy and speed in complex conditions. Demonstrates importance of dataset protocol including negative images and temporal leakage considerations. Shows feasibility of metric object estimation using projective geometry with regression corrections.
Conclusion: The approach enables development of robust, low-cost automated monitoring systems for urban aquatic environments.
Abstract: The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
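The geometric core is the standard pinhole relation: metric size follows from pixel extent, distance to the camera, and the focal length in pixels. The study's full model additionally uses extrinsics and regression corrections; the snippet below only shows the basic relation.
```python
def metric_size_from_pixels(pixel_extent, distance_m, focal_length_px):
    """Pinhole camera relation: real_size = pixel_extent * distance / focal_length.
    pixel_extent    -- object length in the image, in pixels
    distance_m      -- distance from camera to object along the optical axis, in metres
    focal_length_px -- focal length expressed in pixels (from the camera intrinsics)"""
    return pixel_extent * distance_m / focal_length_px

# A 120-px-wide piece of debris seen 4 m away with a 1500-px focal length:
print(f"{metric_size_from_pixels(120, 4.0, 1500):.2f} m")  # ~0.32 m
```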
[125] RareFlow: Physics-Aware Flow-Matching for Cross-Sensor Super-Resolution of Rare-Earth Features
Forouzan Fallah, Wenwen Li, Chia-Yu Hsu, Hyunho Lee, Yezhou Yang
Main category: cs.CV
TL;DR: RareFlow is a physics-aware super-resolution framework for remote sensing imagery that addresses out-of-distribution robustness through dual-conditioning architecture, uncertainty quantification, and multifaceted loss functions.
Details
Motivation: Super-resolution for remote sensing imagery often fails under out-of-distribution conditions, producing visually plausible but physically inaccurate results, especially for rare geomorphic features captured by diverse sensors.
Method: Dual-conditioning architecture with Gated ControlNet for geometric fidelity and textual prompts for semantic guidance; multifaceted loss function for spectral/radiometric consistency; stochastic forward pass for uncertainty quantification.
Result: In blind evaluations, geophysical experts rated outputs approaching ground truth fidelity, significantly outperforming state-of-the-art baselines with nearly 40% reduction in FID and quantitative gains in perceptual metrics.
Conclusion: RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
Abstract: Super-resolution (SR) for remote sensing imagery often fails under out-of-distribution (OOD) conditions, such as rare geomorphic features captured by diverse sensors, producing visually plausible but physically inaccurate results. We present RareFlow, a physics-aware SR framework designed for OOD robustness. RareFlow’s core is a dual-conditioning architecture. A Gated ControlNet preserves fine-grained geometric fidelity from the low-resolution input, while textual prompts provide semantic guidance for synthesizing complex features. To ensure physically sound outputs, we introduce a multifaceted loss function that enforces both spectral and radiometric consistency with sensor properties. Furthermore, the framework quantifies its own predictive uncertainty by employing a stochastic forward pass approach; the resulting output variance directly identifies unfamiliar inputs, mitigating feature hallucination. We validate RareFlow on a new, curated benchmark of multi-sensor satellite imagery. In blind evaluations, geophysical experts rated our model’s outputs as approaching the fidelity of ground truth imagery, significantly outperforming state-of-the-art baselines. This qualitative superiority is corroborated by quantitative gains in perceptual metrics, including a nearly 40% reduction in FID. RareFlow provides a robust framework for high-fidelity synthesis in data-scarce scientific domains and offers a new paradigm for controlled generation under severe domain shift.
[126] TRELLISWorld: Training-Free World Generation from Object Generators
Hanke Chen, Yuan Liu, Minchen Li
Main category: cs.CV
TL;DR: Training-free 3D scene generation by repurposing text-to-3D object diffusion models as modular tile generators, enabling scalable synthesis of large coherent scenes without scene-level datasets or retraining.
Details
Motivation: Existing methods are limited to single-object generation, require domain-specific training, or lack full 360-degree viewability, hindering practical applications in virtual prototyping, AR/VR, and simulation.
Method: Reformulate scene generation as multi-tile denoising problem using overlapping 3D regions independently generated and seamlessly blended via weighted averaging, leveraging object-level diffusion priors.
Result: Enables scalable synthesis of large coherent scenes with local semantic control, diverse scene layouts, efficient generation, and flexible editing capabilities.
Conclusion: Establishes a simple yet powerful foundation for general-purpose, language-driven 3D scene construction without requiring scene-level datasets or retraining.
Abstract: Text-driven 3D scene generation holds promise for a wide range of applications, from virtual prototyping to AR/VR and simulation. However, existing methods are often constrained to single-object generation, require domain-specific training, or lack support for full 360-degree viewability. In this work, we present a training-free approach to 3D scene synthesis by repurposing general-purpose text-to-3D object diffusion models as modular tile generators. We reformulate scene generation as a multi-tile denoising problem, where overlapping 3D regions are independently generated and seamlessly blended via weighted averaging. This enables scalable synthesis of large, coherent scenes while preserving local semantic control. Our method eliminates the need for scene-level datasets or retraining, relies on minimal heuristics, and inherits the generalization capabilities of object-level priors. We demonstrate that our approach supports diverse scene layouts, efficient generation, and flexible editing, establishing a simple yet powerful foundation for general-purpose, language-driven 3D scene construction.
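The blending step is plain weighted averaging over overlapping regions; a 1D toy version is shown below, whereas the method itself blends overlapping 3D tiles.
```python
import numpy as np

def blend_tiles(tiles, offsets, length, tile_len):
    """Weighted-average fusion of overlapping tiles: each tile contributes with a
    triangular weight that fades near its borders, so seams average out smoothly."""
    acc = np.zeros(length)
    wsum = np.zeros(length)
    ramp = np.minimum(np.arange(1, tile_len + 1), np.arange(tile_len, 0, -1)).astype(float)
    for tile, off in zip(tiles, offsets):
        acc[off:off + tile_len] += ramp * tile
        wsum[off:off + tile_len] += ramp
    return acc / np.maximum(wsum, 1e-8)

# Two overlapping "tiles" generated independently; their overlap is blended.
t1 = np.full(8, 1.0)
t2 = np.full(8, 3.0)
print(blend_tiles([t1, t2], offsets=[0, 4], length=12, tile_len=8))
```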
[127] Does CLIP perceive art the same way we do?
Andrea Asperti, Leonardo Dessì, Maria Chiara Tonetti, Nico Wu
Main category: cs.CV
TL;DR: This paper investigates CLIP’s ability to perceive and interpret artworks, comparing its visual understanding to human perception across semantic, stylistic, and contextual dimensions.
Details
Motivation: To understand whether CLIP 'sees' artworks similarly to humans, especially given its use in creative domains like style transfer and image synthesis where nuanced understanding is crucial.
Method: Designed targeted probing tasks to evaluate CLIP’s perception across multiple dimensions (content, scene understanding, artistic style, historical period, visual artifacts) and compared its responses to human annotations and expert benchmarks.
Result: Found both strengths and limitations in CLIP’s visual representations, particularly regarding aesthetic cues and artistic intent. The model shows capabilities but also gaps in nuanced understanding.
Conclusion: Highlights the need for deeper interpretability in multimodal systems when applied to creative domains where subjectivity and nuance are central, and discusses implications for using CLIP in generative processes.
Abstract: CLIP has emerged as a powerful multimodal model capable of connecting images and text through joint embeddings, but to what extent does it ‘see’ the same way humans do - especially when interpreting artworks? In this paper, we investigate CLIP’s ability to extract high-level semantic and stylistic information from paintings, including both human-created and AI-generated imagery. We evaluate its perception across multiple dimensions: content, scene understanding, artistic style, historical period, and the presence of visual deformations or artifacts. By designing targeted probing tasks and comparing CLIP’s responses to human annotations and expert benchmarks, we explore its alignment with human perceptual and contextual understanding. Our findings reveal both strengths and limitations in CLIP’s visual representations, particularly in relation to aesthetic cues and artistic intent. We further discuss the implications of these insights for using CLIP as a guidance mechanism during generative processes, such as style transfer or prompt-based image synthesis. Our work highlights the need for deeper interpretability in multimodal systems, especially when applied to creative domains where nuance and subjectivity play a central role.
[128] Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
Jinxin Zhou, Jiachen Jiang, Zhihui Zhu
Main category: cs.CV
TL;DR: LHT-CLIP is a training-free framework that improves CLIP models for semantic segmentation by exploiting visual discriminability across layers, attention heads, and tokens through three techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement.
Details
Motivation: CLIP models struggle with semantic segmentation due to misalignment between image-level pre-training objectives and pixel-level visual understanding required for dense prediction. Prior methods inherit global alignment bias from preceding layers, leading to suboptimal performance.
Method: Proposes three training-free techniques: 1) Semantic-spatial reweighting to restore visual discriminability, 2) Selective head enhancement using consistently discriminative attention heads, and 3) Abnormal token replacement based on sparse activation patterns of anomalous tokens.
Result: Extensive experiments on 8 semantic segmentation benchmarks show state-of-the-art performance across diverse scenarios without additional training, auxiliary networks, or extensive hyperparameter tuning.
Conclusion: LHT-CLIP effectively restores visual discriminability in CLIP models for semantic segmentation, demonstrating practical effectiveness for real-world deployment while being training-free and parameter-efficient.
Abstract: Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment at the cost of visual discriminability (e.g., last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation patterns compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
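A rough analogue of the abnormal-token replacement idea: flag tokens whose activation norm is an outlier and replace them with the mean of normal tokens. The z-score rule is an assumption made for illustration, not the paper's exact criterion.
```python
import torch

def replace_abnormal_tokens(tokens, z_thresh=3.0):
    """Replace tokens whose activation norm is an outlier (z-score above threshold)
    with the mean of the remaining, normal tokens."""
    norms = tokens.norm(dim=-1)                      # (N,) per-token activation norm
    z = (norms - norms.mean()) / (norms.std() + 1e-6)
    abnormal = z > z_thresh
    if abnormal.any():
        tokens = tokens.clone()
        tokens[abnormal] = tokens[~abnormal].mean(dim=0)
    return tokens

feats = torch.randn(197, 768)   # ViT patch tokens
feats[5] *= 50                  # inject an anomalous token
print(replace_abnormal_tokens(feats)[5].norm())
```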
[129] DynaStride: Dynamic Stride Windowing with MMCoT for Instructional Multi-Scene Captioning
Eddison Pham, Prisha Priyadarshini, Adrian Maliackel, Kanishk Bandi, Cristian Meo, Kevin Zhu
Main category: cs.CV
TL;DR: DynaStride is a pipeline for generating coherent scene-level captions in instructional videos without manual scene segmentation, using adaptive frame sampling and multimodal reasoning to improve caption quality.
Details
Motivation: Scene-level captioning in instructional videos enhances learning by aligning visual cues with temporal structure, but current methods often fail to capture this structure, leading to incoherent captions that undermine educational value.
Method: Uses adaptive frame sampling and multimodal windowing on YouCookII dataset, employs multimodal chain-of-thought for action-object pairs, and applies dynamic stride window selection to balance temporal context and redundancy.
Result: Outperforms strong baselines (VLLaMA3, GPT-4o) on both N-gram metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore), producing more temporally coherent and informative captions.
Conclusion: DynaStride provides a promising approach for improving AI-powered instructional content generation through better temporal coherence and multimodal reasoning in scene-level captions.
Abstract: Scene-level captioning in instructional videos can enhance learning by requiring an understanding of both visual cues and temporal structure. By aligning visual cues with textual guidance, this understanding supports procedural learning and multimodal reasoning, providing a richer context for skill acquisition. However, captions that fail to capture this structure may lack coherence and quality, which can create confusion and undermine the video’s educational intent. To address this gap, we introduce DynaStride, a pipeline to generate coherent, scene-level captions without requiring manual scene segmentation. Using the YouCookII dataset’s scene annotations, DynaStride performs adaptive frame sampling and multimodal windowing to capture key transitions within each scene. It then employs a multimodal chain-of-thought process to produce multiple action-object pairs, which are refined and fused using a dynamic stride window selection algorithm that adaptively balances temporal context and redundancy. The final scene-level caption integrates visual semantics and temporal reasoning in a single instructional caption. Empirical evaluations against strong baselines, including VLLaMA3 and GPT-4o, demonstrate consistent gains on both N-gram-based metrics (BLEU, METEOR) and semantic similarity measures (BERTScore, CLIPScore). Qualitative analyses further show that DynaStride produces captions that are more temporally coherent and informative, suggesting a promising direction for improving AI-powered instructional content generation.
[130] TurboPortrait3D: Single-step diffusion-based fast portrait novel-view synthesis
Emily Kim, Julieta Martinez, Timur Bagautdinov, Jessica Hodgins
Main category: cs.CV
TL;DR: TurboPortrait3D is a low-latency method for novel-view synthesis of human portraits that combines image-to-3D models with diffusion models to enhance quality while maintaining 3D-awareness.
Details
Motivation: Existing image-to-3D portrait models produce visual artifacts and lack detail preservation, while diffusion models generate high-quality images but lack 3D consistency and are computationally expensive.
Method: Uses a feedforward image-to-avatar pipeline to get initial 3D representation and noisy renders, then refines them with a single-step diffusion model conditioned on input images and trained for multi-view consistency. Includes pre-training on synthetic multi-view data and fine-tuning on real images.
Result: Qualitatively and quantitatively outperforms current state-of-the-art methods for portrait novel-view synthesis while being time-efficient.
Conclusion: Image-space diffusion models can effectively enhance image-to-avatar methods, maintaining 3D-awareness and low-latency while significantly improving output quality.
Abstract: We introduce TurboPortrait3D: a method for low-latency novel-view synthesis of human portraits. Our approach builds on the observation that existing image-to-3D models for portrait generation, while capable of producing renderable 3D representations, are prone to visual artifacts, often lack detail, and tend to fail at fully preserving the identity of the subject. On the other hand, image diffusion models excel at generating high-quality images, but besides being computationally expensive, are not grounded in 3D and thus are not directly capable of producing multi-view consistent outputs. In this work, we demonstrate that image-space diffusion models can be used to significantly enhance the quality of existing image-to-avatar methods, while maintaining 3D-awareness and running with low latency. Our method takes a single frontal image of a subject as input, and applies a feedforward image-to-avatar generation pipeline to obtain an initial 3D representation and corresponding noisy renders. These noisy renders are then fed to a single-step diffusion model which is conditioned on input image(s), and is specifically trained to refine the renders in a multi-view consistent way. Moreover, we introduce a novel effective training strategy that includes pre-training on a large corpus of synthetic multi-view data, followed by fine-tuning on high-quality real images. We demonstrate that our approach both qualitatively and quantitatively outperforms the current state-of-the-art for portrait novel-view synthesis, while being efficient in time.
[131] Caption-Driven Explainability: Probing CNNs for Bias via CLIP
Patrick Koller, Amil V. Dravid, Guido M. Schuster, Aggelos K. Katsaggelos
Main category: cs.CV
TL;DR: Proposes a caption-based XAI method that integrates standalone models into CLIP using network surgery to identify dominant concepts in predictions, reducing covariate shift risks.
Details
Motivation: Saliency maps in XAI can be misleading when spurious and salient features overlap in pixel space, creating robustness issues in ML models.
Method: Integrates standalone models into CLIP using network surgery approach to create caption-based explanations that identify dominant concepts contributing to predictions.
Result: Developed a method that minimizes the risk of standalone models falling for covariate shift by providing concept-based explanations rather than pixel-level saliency maps.
Conclusion: The caption-based XAI approach contributes significantly to developing more robust ML models by addressing limitations of traditional saliency map methods.
Abstract: Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model’s prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models. Our code is available at https://github.com/patch0816/caption-driven-xai.
[132] PlanarGS: High-Fidelity Indoor 3D Gaussian Splatting Guided by Vision-Language Planar Priors
Xirui Jin, Renbiao Jin, Boying Li, Danping Zou, Wenxian Yu
Main category: cs.CV
TL;DR: PlanarGS improves 3D Gaussian Splatting for indoor scenes by incorporating planar priors and geometric supervision to overcome ambiguous geometry in low-texture regions.
Details
Motivation: 3DGS performs poorly in indoor scenes with large low-texture regions due to ambiguous geometry from photometric loss alone.
Method: Uses Language-Prompted Planar Priors (LP3) with vision-language segmentation and cross-view fusion, plus planar and geometric supervision terms for Gaussian optimization.
Result: Outperforms state-of-the-art methods by large margin on indoor benchmarks, reconstructing accurate and detailed 3D surfaces.
Conclusion: PlanarGS successfully addresses 3DGS limitations in indoor scenes through planar priors and geometric supervision.
Abstract: Three-dimensional Gaussian Splatting (3DGS) has recently emerged as an efficient representation for novel-view synthesis, achieving impressive visual quality. However, in scenes dominated by large and low-texture regions, common in indoor environments, the photometric loss used to optimize 3DGS yields ambiguous geometry and fails to recover high-fidelity 3D surfaces. To overcome this limitation, we introduce PlanarGS, a 3DGS-based framework tailored for indoor scene reconstruction. Specifically, we design a pipeline for Language-Prompted Planar Priors (LP3) that employs a pretrained vision-language segmentation model and refines its region proposals via cross-view fusion and inspection with geometric priors. 3D Gaussians in our framework are optimized with two additional terms: a planar prior supervision term that enforces planar consistency, and a geometric prior supervision term that steers the Gaussians toward the depth and normal cues. We have conducted extensive experiments on standard indoor benchmarks. The results show that PlanarGS reconstructs accurate and detailed 3D surfaces, consistently outperforming state-of-the-art methods by a large margin. Project page: https://planargs.github.io
[133] Adaptive Training of INRs via Pruning and Densification
Diana Aldana, João Paulo Lima, Daniel Csillag, Daniel Perazzo, Haoan Feng, Luiz Velho, Tiago Novello
Main category: cs.CV
TL;DR: AIRe is an adaptive training scheme for implicit neural representations that uses neuron pruning and input frequency densification to optimize network architecture during training, achieving better size-quality trade-offs.
Details
Motivation: Current methods for implicit neural representations face challenges in selecting appropriate input frequencies and architectures while managing parameter redundancy, often requiring heuristic approaches and extensive hyperparameter optimization.
Method: AIRe uses a two-stage approach: (1) neuron pruning that identifies less-contributory neurons, applies targeted weight decay to transfer information, then performs structured pruning; (2) input frequency densification that adds frequencies to spectrum regions where the signal underfits.
Result: Experiments on images and SDFs show that AIRe reduces model size while preserving or even improving reconstruction quality compared to existing methods.
Conclusion: AIRe provides an effective adaptive training scheme for implicit neural representations that automatically refines architecture during optimization, achieving improved trade-offs between network size and reconstruction quality.
Abstract: Encoding input coordinates with sinusoidal functions into multilayer perceptrons (MLPs) has proven effective for implicit neural representations (INRs) of low-dimensional signals, enabling the modeling of high-frequency details. However, selecting appropriate input frequencies and architectures while managing parameter redundancy remains an open challenge, often addressed through heuristics and heavy hyperparameter optimization schemes. In this paper, we introduce AIRe ($\textbf{A}$daptive $\textbf{I}$mplicit neural $\textbf{Re}$presentation), an adaptive training scheme that refines the INR architecture over the course of optimization. Our method uses a neuron pruning mechanism to avoid redundancy and input frequency densification to improve representation capacity, leading to an improved trade-off between network size and reconstruction quality. For pruning, we first identify less-contributory neurons and apply a targeted weight decay to transfer their information to the remaining neurons, followed by structured pruning. Next, the densification stage adds input frequencies to spectrum regions where the signal underfits, expanding the representational basis. Through experiments on images and SDFs, we show that AIRe reduces model size while preserving, or even improving, reconstruction quality. Code and pretrained models will be released for public use.
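A toy version of the pruning side of the scheme: score hidden neurons of an MLP layer pair, softly decay the weakest (in the paper this decay runs over training steps before removal), then rebuild smaller layers without them. This is a simplified illustration, not the released code.
```python
import torch

def prune_hidden_neurons(fc_in, fc_out, keep_ratio=0.75, decay=0.9):
    """Rank hidden neurons by the norm of their incoming/outgoing weights, decay the
    least-contributory ones, then rebuild smaller Linear layers without them."""
    score = fc_in.weight.norm(dim=1) * fc_out.weight.norm(dim=0)   # one score per hidden neuron
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices.sort().values
    drop = torch.tensor([i for i in range(score.numel()) if i not in set(keep.tolist())])

    with torch.no_grad():
        if drop.numel():   # targeted weight decay (applied once here, before removal)
            fc_in.weight[drop] *= decay
            fc_out.weight[:, drop] *= decay
        new_in = torch.nn.Linear(fc_in.in_features, k)
        new_out = torch.nn.Linear(k, fc_out.out_features)
        new_in.weight.copy_(fc_in.weight[keep]); new_in.bias.copy_(fc_in.bias[keep])
        new_out.weight.copy_(fc_out.weight[:, keep]); new_out.bias.copy_(fc_out.bias)
    return new_in, new_out

fc1, fc2 = torch.nn.Linear(2, 64), torch.nn.Linear(64, 1)
small1, small2 = prune_hidden_neurons(fc1, fc2)
print(small1, small2)   # 48 hidden neurons remain
```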
[134] Neural USD: An object-centric framework for iterative editing and control
Alejandro Escontrela, Shrinu Kushagra, Sjoerd van Steenkiste, Yulia Rubanova, Aleksander Holynski, Kelsey Allen, Kevin Murphy, Thomas Kipf
Main category: cs.CV
TL;DR: Neural USD introduces a structured hierarchical framework for precise object-level editing in generative models, addressing unintended global changes during editing.
Details
Motivation: Current controllable generative models often cause unintended global changes when trying to edit specific objects, lacking precise and iterative object editing capabilities.Method: Proposes Neural Universal Scene Descriptor (Neural USD) - a hierarchical scene representation inspired by computer graphics USD standard, with fine-tuning approach for disentangled control over appearance, geometry, and pose.
Result: The framework enables per-object control and supports iterative/incremental editing workflows while minimizing model-specific constraints.
Conclusion: Neural USD provides a structured approach to address precise object editing challenges in generative modeling, representing scenes and objects hierarchically for better control.
Abstract: Amazing progress has been made in controllable generative modeling, especially over the last few years. However, some challenges remain. One of them is precise and iterative object editing. In many of the current methods, trying to edit the generated image (for example, changing the color of a particular object in the scene or changing the background while keeping other elements unchanged) by changing the conditioning signals often leads to unintended global changes in the scene. In this work, we take the first steps to address the above challenges. Taking inspiration from the Universal Scene Descriptor (USD) standard developed in the computer graphics community, we introduce the “Neural Universal Scene Descriptor” or Neural USD. In this framework, we represent scenes and objects in a structured, hierarchical manner. This accommodates diverse signals, minimizes model-specific constraints, and enables per-object control over appearance, geometry, and pose. We further apply a fine-tuning approach which ensures that the above control signals are disentangled from one another. We evaluate several design considerations for our framework, demonstrating how Neural USD enables iterative and incremental workflows. More information at: https://escontrela.me/neural_usd .
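The abstract's structured, hierarchical scene representation can be pictured with a toy data structure. The field names below (pose, geometry, appearance) are illustrative placeholders, not the actual Neural USD schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Prim:
    """One node of a USD-style hierarchy (field names are illustrative placeholders)."""
    name: str
    pose: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
    geometry: Optional[str] = None      # e.g. a mesh or geometry-latent reference
    appearance: Optional[str] = None    # e.g. a texture or appearance-embedding id
    children: List["Prim"] = field(default_factory=list)

    def find(self, name: str) -> Optional["Prim"]:
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None

# Per-object, iterative editing: change one object's appearance, leave the rest untouched.
scene = Prim("scene", children=[
    Prim("table", geometry="mesh://table", appearance="wood"),
    Prim("mug", geometry="mesh://mug", appearance="blue", pose=[0.2, 0.0, 0.8, 0, 0, 0]),
])
scene.find("mug").appearance = "red"
print(scene.find("mug"))
```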
[135] SafeVision: Efficient Image Guardrail with Robust Policy Adherence and Explainability
Peiyang Xu, Minzhou Pan, Zhaorun Chen, Shuang Yang, Chaowei Xiao, Bo Li
Main category: cs.CV
TL;DR: SafeVision is a novel image guardrail system that integrates human-like reasoning to enhance adaptability and transparency in detecting unsafe content, outperforming GPT-4o by significant margins while being much faster.
Details
Motivation: Traditional image guardrail models are constrained by predefined categories, misclassify content due to lack of semantic reasoning, struggle with emerging threats, and require costly retraining for new threats.Method: Integrates human-like reasoning, includes data collection and generation framework, policy-following training pipeline, customized loss function, and diverse QA generation strategy. Dynamically aligns with safety policies at inference time without retraining.
Result: Achieves state-of-the-art performance, outperforms GPT-4o by 8.6% on VisionHarm-T and 15.5% on VisionHarm-C, while being over 16x faster. Introduces VisionHarm dataset with comprehensive harmful categories.
Conclusion: SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats, eliminating the need for retraining while ensuring precise risk assessments and explanations.
Abstract: With the rapid proliferation of digital media, the need for efficient and transparent safeguards against unsafe content is more critical than ever. Traditional image guardrail models, constrained by predefined categories, often misclassify content due to their pure feature-based learning without semantic reasoning. Moreover, these models struggle to adapt to emerging threats, requiring costly retraining for new threats. To address these limitations, we introduce SafeVision, a novel image guardrail that integrates human-like reasoning to enhance adaptability and transparency. Our approach incorporates an effective data collection and generation framework, a policy-following training pipeline, and a customized loss function. We also propose a diverse QA generation and training strategy to enhance learning effectiveness. SafeVision dynamically aligns with evolving safety policies at inference time, eliminating the need for retraining while ensuring precise risk assessments and explanations. Recognizing the limitations of existing unsafe image benchmarks, which either lack granularity or cover limited risks, we introduce VisionHarm, a high-quality dataset comprising two subsets: VisionHarm Third-party (VisionHarm-T) and VisionHarm Comprehensive (VisionHarm-C), spanning diverse harmful categories. Through extensive experiments, we show that SafeVision achieves state-of-the-art performance on different benchmarks. SafeVision outperforms GPT-4o by 8.6% on VisionHarm-T and by 15.5% on VisionHarm-C, while being over 16x faster. SafeVision sets a comprehensive, policy-following, and explainable image guardrail with dynamic adaptation to emerging threats.
[136] RapVerse: Coherent Vocals and Whole-Body Motions Generations from Text
Jiaben Chen, Xin Yan, Yihang Chen, Siyuan Cen, Zixin Wang, Qinwei Ma, Haoyu Zhen, Kaizhi Qian, Lie Lu, Chuang Gan
Main category: cs.CV
TL;DR: A unified framework for generating 3D holistic body motions and singing vocals directly from textual lyrics using multimodal transformers trained on the RapVerse dataset.
Details
Motivation: To advance beyond existing works that address vocals and body motions in isolation by creating a system that simultaneously generates both modalities from text inputs for more realistic and coherent performance generation.Method: Uses vector-quantized variational autoencoder for motion encoding, vocal-to-unit model for audio tokenization, and joint transformer modeling across language, audio, and motion modalities trained on the RapVerse dataset.
Result: The framework produces coherent and realistic singing vocals alongside human motions directly from text, matching performance of specialized single-modality systems.
Conclusion: Establishes new benchmarks for joint vocal-motion generation, demonstrating that unified multimodal modeling can achieve state-of-the-art performance across both modalities simultaneously.
Abstract: In this work, we introduce a challenging task for simultaneously generating 3D holistic body motions and singing vocals directly from textual lyrics inputs, advancing beyond existing works that typically address these two modalities in isolation. To facilitate this, we first collect the RapVerse dataset, a large dataset containing synchronous rapping vocals, lyrics, and high-quality 3D holistic body meshes. With the RapVerse dataset, we investigate the extent to which scaling autoregressive multimodal transformers across language, audio, and motion can enhance the coherent and realistic generation of vocals and whole-body human motions. For modality unification, a vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, while a vocal-to-unit model is leveraged to obtain quantized audio tokens preserving content, prosodic information and singer identity. By jointly performing transformer modeling on these three modalities in a unified way, our framework ensures a seamless and realistic blend of vocals and human motions. Extensive experiments demonstrate that our unified generation framework not only produces coherent and realistic singing vocals alongside human motions directly from textual inputs, but also rivals the performance of specialized single-modality generation systems, establishing new benchmarks for joint vocal-motion generation.
[137] Reasoning Visual Language Model for Chest X-Ray Analysis
Andriy Myronenko, Dong Yang, Baris Turkbey, Mariam Aboian, Sena Azamat, Esra Akcicek, Hongxu Yin, Pavlo Molchanov, Marc Edgar, Yufan He, Pengfei Guo, Yucheng Tang, Daguang Xu
Main category: cs.CV
TL;DR: A framework that brings chain-of-thought reasoning to chest X-ray interpretation, enabling transparent stepwise reasoning that aligns with clinical workflow and supports auditability.
Details
Motivation: Current vision-language models for medical image analysis are opaque and lack the transparent reasoning clinicians rely on. There's a need for AI systems that not only predict but also explain their reasoning process.Method: Couples high-fidelity visual encoding with two-stage training: reasoning-style supervised fine-tuning followed by reinforcement learning using verifiable rewards over X-ray abnormalities. The model outputs reasoning that mirrors radiologists’ systematic thought process.
Result: Achieves competitive multi-label classification in out-of-distribution evaluation while improving interpretability. In reader studies, full reasoning traces increased radiologists’ confidence, supported error auditing, and reduced report finalization time.
Conclusion: The approach enables trustworthy, explainable AI in medical imaging where reasoning quality is as critical as prediction quality, supporting safer human-AI collaboration and clinical auditability.
Abstract: Vision-language models (VLMs) have shown strong promise for medical image analysis, but most remain opaque, offering predictions without the transparent, stepwise reasoning clinicians rely on. We present a framework that brings chain-of-thought (CoT) reasoning to chest X-ray interpretation. Inspired by reasoning-first training paradigms, our approach is designed to learn how experts reason, not just what they conclude, by aligning intermediate steps with observable image evidence and radiology workflow. Beyond accuracy, the explicit reasoning traces support clinical auditability: they reveal why a conclusion was reached, which alternatives were considered, and where uncertainty remains, enabling quality assurance, error analysis, and safer human-AI collaboration. Our model couples high-fidelity visual encoding with a two-stage training recipe: a reasoning-style supervised fine-tuning (SFT) followed by reinforcement learning (RL) that uses verifiable rewards over a list of X-ray abnormalities. The model outputs reasoning that mirrors radiologists’ systematic thought process, uncertainty, and differential diagnosis. In out-of-distribution evaluation, the approach achieves competitive multi-label classification while improving interpretability. In a reader study with expert radiologists, full reasoning traces increased confidence, supported error auditing, and reduced time to finalize reports. We release code and the model NV-Reason-CXR-3B to support community progress toward trustworthy, explainable AI in chest radiography and other medical imaging tasks where reasoning quality is as critical as prediction quality.
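The RL stage relies on "verifiable rewards over a list of X-ray abnormalities". A minimal example of such a reward is a set-level F1 between the abnormalities a response names and the reference labels; the paper's exact reward may be defined differently.

```python
def abnormality_reward(predicted: set, reference: set) -> float:
    """Set-level F1 between predicted and reference abnormality labels.

    Verifiable in the RL sense: it can be checked programmatically against the
    label list, with no learned judge. The paper's exact reward may differ.
    """
    if not predicted and not reference:
        return 1.0
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(abnormality_reward({"cardiomegaly", "pleural effusion"},
                         {"pleural effusion", "pneumothorax"}))   # 0.5
```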
[138] Efficient Cost-and-Quality Controllable Arbitrary-scale Super-resolution with Fourier Constraints
Kazutoshi Akita, Norimichi Ukita
Main category: cs.CV
TL;DR: Proposes joint prediction of multiple Fourier components for better quality and efficiency in arbitrary-scale super-resolution, addressing limitations of existing recurrent methods.
Details
Motivation: Existing methods predict Fourier components one by one using recurrent neural networks, leading to performance degradation and inefficiency due to independent prediction.Method: Predicting multiple Fourier components jointly instead of one by one.
Result: Improves both quality and efficiency in arbitrary-scale super-resolution.
Conclusion: Joint prediction of multiple components is a better approach for cost-and-quality controllable super-resolution.
Abstract: Cost-and-Quality (CQ) controllability in arbitrary-scale super-resolution is crucial. Existing methods predict Fourier components one by one using a recurrent neural network. However, this approach leads to performance degradation and inefficiency due to independent prediction. This paper proposes predicting multiple components jointly to improve both quality and efficiency.
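The contrast between the two prediction strategies can be sketched in a few lines of PyTorch. The head sizes and feature dimensions below are invented, and the real model conditions these heads on image features and the target scale.

```python
import torch
import torch.nn as nn

K = 16  # number of Fourier components per output (illustrative)

class JointHead(nn.Module):
    """Predict all K components in a single forward pass (the proposed joint strategy)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * K))        # amplitude + phase per component
    def forward(self, feats):
        return self.mlp(feats).view(-1, K, 2)

class RecurrentHead(nn.Module):
    """Predict components one at a time (the recurrent baseline the paper moves away from)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 128, batch_first=True)
        self.out = nn.Linear(128, 2)
    def forward(self, feats):
        h, _ = self.rnn(feats.unsqueeze(1).repeat(1, K, 1))    # K sequential steps
        return self.out(h)

feats = torch.randn(4, 64)
print(JointHead()(feats).shape, RecurrentHead()(feats).shape)  # both torch.Size([4, 16, 2])
```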
[139] TeleEgo: Benchmarking Egocentric AI Assistants in the Wild
Jiaqi Yan, Ruilong Ren, Jingren Liu, Shuning Xu, Ling Wang, Yiheng Wang, Yun Wang, Long Zhang, Xiangyu Chen, Changzhi Sun, Jixiang Luo, Dell Zhang, Hao Sun, Chi Zhang, Xuelong Li
Main category: cs.CV
TL;DR: TeleEgo is a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts, featuring synchronized video, audio, and text across four domains with 12 diagnostic subtasks.
Details
Motivation: Existing benchmarks evaluate egocentric AI assistant abilities in isolation, lack realistic streaming scenarios, or only support short-term tasks, failing to address real-world requirements for multi-modal processing, real-time response, and long-term memory retention.Method: The dataset includes over 14 hours per participant of synchronized egocentric video, audio, and text across work & study, lifestyle & routines, social activities, and outings & culture domains. Data is aligned on a unified global timeline with human-refined visual narrations and speech transcripts.
Result: TeleEgo contains 3,291 human-verified QA items across 12 subtasks in three core capabilities: Memory, Understanding, and Cross-Memory Reasoning. It introduces Real-Time Accuracy and Memory Persistence Time metrics to assess correctness, temporal responsiveness, and long-term retention in streaming settings.
Conclusion: TeleEgo provides a realistic and comprehensive evaluation framework to advance the development of practical AI assistants capable of processing multi-modal inputs, responding in real time, and retaining evolving long-term memory in daily contexts.
Abstract: Egocentric AI assistants in real-world settings must process multi-modal inputs (video, audio, text), respond in real time, and retain evolving long-term memory. However, existing benchmarks typically evaluate these abilities in isolation, lack realistic streaming scenarios, or support only short-term tasks. We introduce TeleEgo, a long-duration, streaming, omni-modal benchmark for evaluating egocentric AI assistants in realistic daily contexts. The dataset features over 14 hours per participant of synchronized egocentric video, audio, and text across four domains: work & study, lifestyle & routines, social activities, and outings & culture. All data is aligned on a unified global timeline and includes high-quality visual narrations and speech transcripts, curated through human refinement. TeleEgo defines 12 diagnostic subtasks across three core capabilities: Memory (recalling past events), Understanding (interpreting the current moment), and Cross-Memory Reasoning (linking distant events). It contains 3,291 human-verified QA items spanning multiple question formats (single-choice, binary, multi-choice, and open-ended), evaluated strictly in a streaming setting. We propose two key metrics – Real-Time Accuracy and Memory Persistence Time – to jointly assess correctness, temporal responsiveness, and long-term retention. TeleEgo provides a realistic and comprehensive evaluation to advance the development of practical AI assistants.
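The two proposed metrics are only named in the abstract; below is one plausible reading of each, written as plain Python over per-query records. The record fields and the latency-budget formulation are assumptions, not the benchmark's official definitions.

```python
def real_time_accuracy(records, latency_budget_s: float = 5.0) -> float:
    """Fraction of queries answered both correctly and within a latency budget.

    `records` is a list of dicts with keys: correct (bool), latency_s, event_t, query_t.
    One plausible reading of the metric, not the benchmark's exact definition.
    """
    hits = [r for r in records if r["correct"] and r["latency_s"] <= latency_budget_s]
    return len(hits) / len(records) if records else 0.0

def memory_persistence_time(records) -> float:
    """Longest event-to-query gap (seconds) over which the assistant still answered
    correctly, a proxy for how long a memory remains usable."""
    gaps = [r["query_t"] - r["event_t"] for r in records if r["correct"]]
    return max(gaps) if gaps else 0.0

records = [
    {"correct": True,  "latency_s": 1.2, "event_t": 0,    "query_t": 3600},
    {"correct": False, "latency_s": 0.8, "event_t": 0,    "query_t": 7200},
    {"correct": True,  "latency_s": 9.0, "event_t": 1800, "query_t": 5400},
]
print(real_time_accuracy(records), memory_persistence_time(records))  # 0.333..., 3600
```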
[140] AdvBlur: Adversarial Blur for Robust Diabetic Retinopathy Classification and Cross-Domain Generalization
Heethanjan Kanagalingam, Thenukan Pathmanathan, Mokeeshan Vathanakumar, Tharmakulasingam Mukunthan
Main category: cs.CV
TL;DR: Proposes AdvBlur, a novel diabetic retinopathy classification method that uses adversarial blurred images and dual-loss functions to improve domain generalization across different datasets and imaging conditions.
Details
Motivation: Existing deep learning models for diabetic retinopathy detection struggle with robustness due to distributional variations from different acquisition devices, demographic disparities, and imaging conditions.Method: AdvBlur integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization challenges.
Result: The method achieves competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets, effectively mitigating the impact of unseen distributional variations.
Conclusion: AdvBlur provides an effective approach for improving the robustness and generalization of diabetic retinopathy classification models across diverse clinical settings and imaging conditions.
Abstract: Diabetic retinopathy (DR) is a leading cause of vision loss worldwide, yet early and accurate detection can significantly improve treatment outcomes. While numerous Deep learning (DL) models have been developed to predict DR from fundus images, many face challenges in maintaining robustness due to distributional variations caused by differences in acquisition devices, demographic disparities, and imaging conditions. This paper addresses this critical limitation by proposing a novel DR classification approach, a method called AdvBlur. Our method integrates adversarial blurred images into the dataset and employs a dual-loss function framework to address domain generalization. This approach effectively mitigates the impact of unseen distributional variations, as evidenced by comprehensive evaluations across multiple datasets. Additionally, we conduct extensive experiments to explore the effects of factors such as camera type, low-quality images, and dataset size. Furthermore, we perform ablation studies on blurred images and the loss function to ensure the validity of our choices. The experimental results demonstrate the effectiveness of our proposed method, achieving competitive performance compared to state-of-the-art domain generalization DR models on unseen external datasets.
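The abstract mentions adversarial blurred images and a dual-loss framework without giving the formulation. The sketch below assumes one plausible design: a standard classification loss on clean fundus images plus a consistency term against the blur level that disagrees most with the clean prediction; the paper's actual losses may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def dual_loss(model: nn.Module, images: torch.Tensor, labels: torch.Tensor,
              blur_sigmas=(1.0, 2.0, 3.0), alpha: float = 0.5):
    """Classification loss on clean images plus consistency with the most
    disagreeing ("adversarial") blurred copy. An assumed dual-loss design."""
    clean_logits = model(images)
    cls_loss = F.cross_entropy(clean_logits, labels)

    # Pick the blur level whose predictions deviate most from the clean ones.
    worst = None
    for sigma in blur_sigmas:
        blurred = TF.gaussian_blur(images, kernel_size=9, sigma=sigma)
        kl = F.kl_div(F.log_softmax(model(blurred), dim=-1),
                      F.softmax(clean_logits.detach(), dim=-1), reduction="batchmean")
        worst = kl if worst is None or kl > worst else worst
    return cls_loss + alpha * worst

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 5))  # stand-in DR classifier
x, y = torch.randn(8, 3, 64, 64), torch.randint(0, 5, (8,))
print(dual_loss(model, x, y).item())
```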
[141] Towards the Automatic Segmentation, Modeling and Meshing of the Aortic Vessel Tree from Multicenter Acquisitions: An Overview of the SEG.A. 2023 Segmentation of the Aorta Challenge
Yuan Jin, Antonio Pepe, Gian Marco Melito, Yuxuan Chen, Yunsu Byeon, Hyeseong Kim, Kyungwon Kim, Doohyun Park, Euijoon Choi, Dosik Hwang, Andriy Myronenko, Dong Yang, Yufan He, Daguang Xu, Ayman El-Ghotni, Mohamed Nabil, Hossam El-Kady, Ahmed Ayyad, Amr Nasr, Marek Wodzinski, Henning Müller, Hyeongyu Kim, Yejee Shin, Abbas Khan, Muhammad Asad, Alexander Zolotarev, Caroline Roney, Anthony Mathur, Martin Benning, Gregory Slabaugh, Theodoros Panagiotis Vagenas, Konstantinos Georgas, George K. Matsopoulos, Jihan Zhang, Zhen Zhang, Liqin Huang, Christian Mayer, Heinrich Mächler, Jan Egger
Main category: cs.CV
TL;DR: The SEG.A. challenge introduced a large public dataset for aortic vessel tree segmentation, revealing that 3D U-Net architectures and ensemble methods achieve best performance, with customized post-processing being crucial.
Details
Motivation: To address the lack of shared, high-quality data for automated analysis of the aortic vessel tree from CT angiography, which has impeded clinical development.Method: Launched a public challenge with multi-institutional dataset for AVT segmentation, benchmarked algorithms on hidden test set, and included optional surface meshing tasks for simulations.
Result: Deep learning methods dominated, with 3D U-Net architectures performing best. Ensemble of top algorithms significantly outperformed individual models. Performance strongly depended on algorithmic design and customized post-processing.
Conclusion: The challenge establishes a new performance benchmark and provides a lasting resource to drive future innovation toward robust, clinically translatable tools for aortic vessel tree analysis.
Abstract: The automated analysis of the aortic vessel tree (AVT) from computed tomography angiography (CTA) holds immense clinical potential, but its development has been impeded by a lack of shared, high-quality data. We launched the SEG.A. challenge to catalyze progress in this field by introducing a large, publicly available, multi-institutional dataset for AVT segmentation. The challenge benchmarked automated algorithms on a hidden test set, with subsequent optional tasks in surface meshing for computational simulations. Our findings reveal a clear convergence on deep learning methodologies, with 3D U-Net architectures dominating the top submissions. A key result was that an ensemble of the highest-ranking algorithms significantly outperformed individual models, highlighting the benefits of model fusion. Performance was strongly linked to algorithmic design, particularly the use of customized post-processing steps, and the characteristics of the training data. This initiative not only establishes a new performance benchmark but also provides a lasting resource to drive future innovation toward robust, clinically translatable tools.
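The headline finding, that an ensemble of the top algorithms beat every individual model, corresponds to a simple operation at inference time. A sketch of probability averaging for binary aorta masks follows; the actual challenge ensemble may have fused models differently (for example by rank or majority vote).

```python
import numpy as np

def ensemble_masks(prob_maps, threshold: float = 0.5) -> np.ndarray:
    """Average per-voxel foreground probabilities from several models, then threshold.
    A simple fusion scheme for illustration."""
    return (np.mean(np.stack(prob_maps, axis=0), axis=0) >= threshold).astype(np.uint8)

# Three stand-in model outputs over a tiny 4x4x4 volume.
rng = np.random.default_rng(0)
preds = [rng.random((4, 4, 4)) for _ in range(3)]
fused = ensemble_masks(preds)
print(fused.shape, fused.dtype)  # (4, 4, 4) uint8
```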
[142] Mars-Bench: A Benchmark for Evaluating Foundation Models for Mars Science Tasks
Mirali Purohit, Bimal Gajera, Vatsal Malaviya, Irish Mehta, Kunal Kasodekar, Jacob Adler, Steven Lu, Umaa Rebbapragada, Hannah Kerner
Main category: cs.CV
TL;DR: Mars-Bench is the first standardized benchmark for evaluating foundation models on Mars science tasks using orbital and surface imagery across 20 datasets for classification, segmentation, and object detection.
Details
Motivation: Foundation models have shown strong generalization in many domains but their application to Mars science has been limited due to the lack of standardized benchmarks and evaluation frameworks.Method: Created Mars-Bench with 20 datasets spanning classification, segmentation, and object detection tasks focused on key geologic features. Provided baseline evaluations using models pre-trained on natural images, Earth satellite data, and vision-language models.
Result: Results suggest Mars-specific foundation models may offer advantages over general-domain counterparts, indicating the potential benefits of domain-adapted pre-training.
Conclusion: Mars-Bench establishes a standardized foundation for developing and comparing machine learning models for Mars science, with data, models, and code publicly available.
Abstract: Foundation models have enabled rapid progress across many specialized domains by leveraging large-scale pre-training on unlabeled data, demonstrating strong generalization to a variety of downstream tasks. While such models have gained significant attention in fields like Earth Observation, their application to Mars science remains limited. A key enabler of progress in other domains has been the availability of standardized benchmarks that support systematic evaluation. In contrast, Mars science lacks such benchmarks and standardized evaluation frameworks, which have limited progress toward developing foundation models for Martian tasks. To address this gap, we introduce Mars-Bench, the first benchmark designed to systematically evaluate models across a broad range of Mars-related tasks using both orbital and surface imagery. Mars-Bench comprises 20 datasets spanning classification, segmentation, and object detection, focused on key geologic features such as craters, cones, boulders, and frost. We provide standardized, ready-to-use datasets and baseline evaluations using models pre-trained on natural images, Earth satellite data, and state-of-the-art vision-language models. Results from all analyses suggest that Mars-specific foundation models may offer advantages over general-domain counterparts, motivating further exploration of domain-adapted pre-training. Mars-Bench aims to establish a standardized foundation for developing and comparing machine learning models for Mars science. Our data, models, and code are available at: https://mars-bench.github.io/.
[143] AutoPrompt: Automated Red-Teaming of Text-to-Image Models via LLM-Driven Adversarial Prompts
Yufan Liu, Wanqian Zhang, Huashan Chen, Lin Wang, Xiaojun Jia, Zheng Lin, Weiping Wang
Main category: cs.CV
TL;DR: APT is a black-box framework that uses LLMs to generate human-readable adversarial suffixes for text-to-image models, bypassing safety filters through dual-evasion strategy and achieving high transferability.
Details
Motivation: Current red-teaming methods for T2I models require white-box access, use inefficient optimization, and generate meaningless prompts that are easily blocked by filters.Method: Alternating optimization-finetuning pipeline between adversarial suffix optimization and LLM fine-tuning, with dual-evasion strategy using perplexity scoring and banned-token penalties.
Result: Excellent red-teaming performance with human-readable, filter-resistant prompts, and superior zero-shot transferability to unseen prompts and commercial APIs.
Conclusion: APT effectively exposes critical vulnerabilities in T2I models, including commercial systems, through human-readable adversarial prompts that bypass safety mechanisms.
Abstract: Despite rapid advancements in text-to-image (T2I) models, their safety mechanisms are vulnerable to adversarial prompts, which maliciously generate unsafe images. Current red-teaming methods for proactively assessing such vulnerabilities usually require white-box access to T2I models, and rely on inefficient per-prompt optimization, as well as inevitably generate semantically meaningless prompts easily blocked by filters. In this paper, we propose APT (AutoPrompT), a black-box framework that leverages large language models (LLMs) to automatically generate human-readable adversarial suffixes for benign prompts. We first introduce an alternating optimization-finetuning pipeline between adversarial suffix optimization and fine-tuning the LLM utilizing the optimized suffix. Furthermore, we integrate a dual-evasion strategy in the optimization phase, enabling the bypass of both the perplexity-based filter and the blacklist word filter: (1) we constrain the LLM to generate human-readable prompts through an auxiliary LLM perplexity scoring, which starkly contrasts with prior token-level gibberish, and (2) we also introduce banned-token penalties to suppress the explicit generation of banned tokens in the blacklist. Extensive experiments demonstrate the excellent red-teaming performance of our human-readable, filter-resistant adversarial prompts, as well as superior zero-shot transferability which enables instant adaptation to unseen prompts and exposes critical vulnerabilities even in commercial APIs (e.g., Leonardo.Ai.).
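The dual-evasion idea (keep the suffix readable under an auxiliary LLM's perplexity score and free of blacklisted words) can be summarized as a candidate-scoring rule. The weights, the blacklist, and the attack-success proxy below are all invented for illustration; computing real perplexity would require an actual language model.

```python
import math

BANNED = {"nude", "gore", "blood"}   # illustrative blacklist, not the paper's

def suffix_score(suffix: str, attack_success: float, perplexity: float,
                 lam: float = 0.01, mu: float = 1.0) -> float:
    """Higher is better: reward the attack objective, penalise unreadable
    (high-perplexity) text and explicit banned tokens. Form and weights are
    assumptions for illustration."""
    banned_hits = sum(tok.lower() in BANNED for tok in suffix.split())
    return attack_success - lam * math.log(perplexity) - mu * banned_hits

candidates = [
    ("in the style of a renaissance oil painting", 0.7, 40.0),
    ("asdf qwer zxcv gore",                        0.9, 900.0),
]
best = max(candidates, key=lambda c: suffix_score(*c))
print(best[0])   # the readable, filter-friendly suffix wins
```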
[144] ResNet: Enabling Deep Convolutional Neural Networks through Residual Learning
Xingyu Liu, Kun Ming Goh
Main category: cs.CV
TL;DR: ResNet uses skip connections to solve vanishing gradient problem, enabling training of very deep networks with better accuracy and faster convergence.
Details
Motivation: To overcome the vanishing gradient problem that makes training very deep convolutional neural networks challenging.Method: Uses residual blocks with skip connections that bypass intermediate layers, allowing gradients to flow directly through shortcut paths.
Result: On CIFAR-10, ResNet-18 achieved 89.9% accuracy vs 84.1% for traditional deep CNN of similar depth, with faster convergence and more stable training.
Conclusion: Residual Networks successfully enable training of very deep networks by addressing gradient flow issues through skip connections.
Abstract: Convolutional Neural Networks (CNNs) have revolutionized computer vision, but training very deep networks has been challenging due to the vanishing gradient problem. This paper explores Residual Networks (ResNet), introduced by He et al. (2015), which overcomes this limitation by using skip connections. ResNet enables the training of networks with hundreds of layers by allowing gradients to flow directly through shortcut connections that bypass intermediate layers. In our implementation on the CIFAR-10 dataset, ResNet-18 achieves 89.9% accuracy compared to 84.1% for a traditional deep CNN of similar depth, while also converging faster and training more stably.
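The core mechanism is compact enough to restate as code. Below is a standard PyTorch basic block with an identity (or 1x1 projection) shortcut, a generic sketch rather than the exact configuration used in the paper's CIFAR-10 experiments.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """A ResNet basic block: two 3x3 convs plus an identity (or 1x1) shortcut,
    so gradients can flow directly through the skip path."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.shortcut = nn.Sequential()
        if stride != 1 or in_ch != out_ch:   # match shapes when dimensions change
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False), nn.BatchNorm2d(out_ch))

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))   # residual addition

print(BasicBlock(64, 128, stride=2)(torch.randn(1, 64, 32, 32)).shape)  # (1, 128, 16, 16)
```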
[145] Kernelized Sparse Fine-Tuning with Bi-level Parameter Competition for Vision Models
Shufan Shen, Junshu Sun, Shuhui Wang, Qingming Huang
Main category: cs.CV
TL;DR: SNELLA is a one-stage parameter-efficient fine-tuning method that uses nonlinear low-rank decomposition and adaptive bi-level sparsity allocation to achieve state-of-the-art performance with significantly reduced memory usage compared to existing methods.
Details
Motivation: Current sparse tuning methods have two limitations: (1) they use gradient information to locate task-relevant weights, which overlooks parameter adjustments during fine-tuning and limits performance, and (2) they require storing all weight matrices in the optimizer, resulting in high memory usage.Method: SNELLA uses a one-stage approach with: (1) selective weight updates via sparse matrix merging using two low-rank learnable matrices with nonlinear kernel functions to increase rank and prevent interdependency, and (2) adaptive bi-level sparsity allocation that allows weights to compete across and inside layers based on importance scores in an end-to-end manner.
Result: SNELLA achieves SOTA performance with 1.8% higher Top-1 accuracy (91.9% vs 90.1%) on FGVC benchmark compared to SPT-LoRA, and reduces memory usage by 31.1%-39.9% across models with 86M to 632M parameters on classification, segmentation, and generation tasks.
Conclusion: SNELLA effectively addresses the limitations of current sparse tuning methods by providing a one-stage approach that achieves superior performance with significantly reduced memory consumption through innovative nonlinear low-rank decomposition and adaptive sparsity allocation.
Abstract: Parameter-efficient fine-tuning (PEFT) aims to adapt pre-trained vision models to downstream tasks. Among PEFT paradigms, sparse tuning achieves remarkable performance by adjusting only the weights most relevant to downstream tasks, rather than densely tuning the entire weight matrix. Current methods follow a two-stage paradigm. First, it locates task-relevant weights by gradient information, which overlooks the parameter adjustments during fine-tuning and limits the performance. Second, it updates only the located weights by applying a sparse mask to the gradient of the weight matrix, which results in high memory usage due to the storage of all weight matrices in the optimizer. In this paper, we propose a one-stage method named SNELLA to overcome the above limitations. For memory usage, SNELLA selectively updates the weight matrix by adding it to another sparse matrix that is merged by two low-rank learnable matrices. We extend the low-rank decomposition by introducing nonlinear kernel functions, thereby increasing the rank of the resulting merged matrix to prevent the interdependency among weight updates, enabling better adaptation to downstream tasks. For locating task-relevant weights, we propose an adaptive bi-level sparsity allocation mechanism that encourages weights to compete across and inside layers based on their importance scores in an end-to-end manner. Extensive experiments are conducted on classification, segmentation, and generation tasks using different pre-trained vision models. The results show that SNELLA achieves SOTA performance with low memory usage. Notably, SNELLA obtains 1.8% (91.9% v.s. 90.1%) higher Top-1 accuracy on the FGVC benchmark compared to SPT-LoRA. Compared to previous methods, SNELLA achieves a memory reduction of 31.1%-39.9% across models with parameter scales from 86M to 632M. Our source codes are available at https://github.com/ssfgunner/SNELL.
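One way to read "a sparse matrix merged by two low-rank learnable matrices" with a nonlinear kernel is an RBF kernel between the rows of the two factors, gated by a sparse mask; because the kernel is nonlinear, the merged matrix is not limited to rank r. The mask here is random, standing in for the paper's learned bi-level sparsity allocation, and the kernel choice is an assumption.

```python
import torch
import torch.nn as nn

class KernelizedSparseDelta(nn.Module):
    """Sparse weight update Delta_W = M * k(A, B), where A and B are low-rank factors
    and k is an RBF kernel between their rows. Mask and kernel are illustrative."""
    def __init__(self, out_f: int, in_f: int, rank: int = 4,
                 sparsity: float = 0.95, sigma: float = 1.0):
        super().__init__()
        self.A = nn.Parameter(torch.randn(out_f, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(in_f, rank) * 0.02)
        self.sigma = sigma
        # Fixed random mask as a stand-in for the learned bi-level allocation.
        self.register_buffer("mask", (torch.rand(out_f, in_f) > sparsity).float())

    def forward(self) -> torch.Tensor:
        d2 = torch.cdist(self.A, self.B).pow(2)            # (out_f, in_f) squared distances
        return self.mask * torch.exp(-d2 / (2 * self.sigma ** 2))

W = torch.randn(32, 16)                   # frozen pre-trained weight
delta = KernelizedSparseDelta(32, 16)()
print((W + delta).shape, int(delta.count_nonzero()))
```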
[146] Enhancing CLIP Robustness via Cross-Modality Alignment
Xingyu Zhu, Beier Zhu, Shuo Wang, Kesen Zhao, Hanwang Zhang
Main category: cs.CV
TL;DR: COLA is a training-free framework that uses optimal transport to align image and text features in CLIP models, improving adversarial robustness by 6.7% on ImageNet under PGD attacks while maintaining clean accuracy.
Details
Motivation: Vision-language models like CLIP have strong zero-shot classification but are vulnerable to adversarial attacks. Existing methods overlook the misalignment between text and image features, which gets worse under adversarial perturbations.Method: COLA uses optimal transport to restore global image-text alignment and local structural consistency. It projects adversarial image embeddings onto class text feature subspaces to filter non-semantic distortions, then refines alignment via OT with the projection integrated into cost computation.
Result: Extensive evaluations on 14 zero-shot classification benchmarks show COLA achieves average 6.7% improvement on ImageNet and variants under PGD adversarial attacks while maintaining high accuracy on clean samples.
Conclusion: COLA effectively addresses adversarial misalignment in CLIP models through optimal transport-based cross-modality alignment, providing robust performance without requiring additional training.
Abstract: Vision-language models (VLMs) such as CLIP demonstrate strong generalization in zero-shot classification but remain highly vulnerable to adversarial perturbations. Existing methods primarily focus on adversarial fine-tuning or prompt optimization; they often overlook the gap in CLIP’s encoded features, namely that the text and image features lie far apart from each other. This misalignment is significantly amplified under adversarial perturbations, leading to severe degradation in classification performance. To address this problem, we propose Cross-modality Alignment, dubbed COLA, an optimal transport-based framework that explicitly addresses adversarial misalignment by restoring both global image-text alignment and local structural consistency in the feature space. (1) COLA first projects adversarial image embeddings onto a subspace spanned by class text features, effectively filtering out non-semantic distortions while preserving discriminative information. (2) It then models images and texts as discrete distributions over multiple augmented views and refines their alignment via OT, with the subspace projection seamlessly integrated into the cost computation. This design ensures stable cross-modal alignment even under adversarial conditions. COLA is training-free and compatible with existing fine-tuned models. Extensive evaluations across 14 zero-shot classification benchmarks demonstrate the effectiveness of COLA, especially with an average improvement of 6.7% on ImageNet and its variants under PGD adversarial attacks, while maintaining high accuracy on clean samples.
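Step (1), projecting adversarial image embeddings onto the subspace spanned by the class text features, is ordinary linear algebra and can be sketched directly; the OT refinement of step (2) is omitted here, and the dimensions are illustrative.

```python
import torch

def project_onto_text_subspace(img_emb: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Project image embeddings onto span(text_feats) to filter non-semantic distortion.

    img_emb:    (B, d) possibly-adversarial image embeddings.
    text_feats: (C, d) class text features (one per class prompt).
    Returns x P with P = T^T (T T^T)^{-1} T, the orthogonal projection onto span(T).
    """
    T = text_feats
    gram_inv = torch.linalg.pinv(T @ T.t())     # (C, C), pseudo-inverse for stability
    P = T.t() @ gram_inv @ T                    # (d, d) projection matrix
    return img_emb @ P

img = torch.randn(8, 512)
texts = torch.randn(10, 512)
proj = project_onto_text_subspace(img, texts)
# A projection is idempotent: projecting twice changes (almost) nothing.
print(torch.allclose(proj, project_onto_text_subspace(proj, texts), atol=1e-3))
```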
[147] Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification
William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky
Main category: cs.CV
TL;DR: BOB is a fine-tuning strategy for T2I models that extracts class-agnostic attributes (background, pose) from real examples, conditions on them during fine-tuning, and marginalizes them out during generation to improve synthetic data quality for fine-grained classification.
Details
Motivation: Fine-tuning T2I models with few real examples can improve synthetic data quality but risks overfitting and reduced diversity, especially challenging for fine-grained classification tasks.Method: Extract class-agnostic attributes from real examples, explicitly condition T2I model fine-tuning on these attributes, then marginalize them out during synthetic data generation to preserve generative prior and diversity.
Result: BOB achieves SOTA performance in low-shot fine-grained classification, outperforming DataDream by 7.4% on Aircraft dataset, and beats prior art in 18 of 24 experimental settings with 2+% accuracy improvements in 14 settings.
Conclusion: BOB effectively mitigates overfitting in T2I fine-tuning, preserves generative diversity, and enables high-quality synthetic data generation for fine-grained classification with minimal real examples.
Abstract: Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model’s generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.
[148] OmniText: A Training-Free Generalist for Controllable Text-Image Manipulation
Agus Gunawan, Samuel Teodoro, Yun Chen, Soo Ye Kim, Jihyong Oh, Munchurl Kim
Main category: cs.CV
TL;DR: OmniText is a training-free generalist framework for Text Image Manipulation (TIM) that addresses limitations of existing text inpainting methods by enabling text removal, style control, and preventing duplicated letters through self-attention inversion and cross-attention redistribution.
Details
Motivation: Current diffusion-based text inpainting methods have three key limitations: inability to remove text, lack of style control over rendered text, and tendency to generate duplicated letters, which hinder their applicability to broader TIM tasks.Method: Uses self-attention inversion to enable text removal by mitigating the model’s focus on surrounding text, redistributes cross-attention to reduce text hallucination, and introduces novel loss functions (cross-attention content loss and self-attention style loss) in a latent optimization framework for controllable inpainting.
Result: OmniText achieves state-of-the-art performance across multiple TIM tasks and metrics, comparable with specialist methods. Also introduces OmniText-Bench benchmark dataset for evaluating diverse TIM tasks.
Conclusion: OmniText is the first generalist method capable of performing diverse TIM tasks including text removal, rescaling, repositioning, and insertion/editing with various styles, addressing key limitations of existing text inpainting approaches.
Abstract: Recent advancements in diffusion-based text synthesis have demonstrated significant performance in inserting and editing text within images via inpainting. However, despite the potential of text inpainting methods, three key limitations hinder their applicability to broader Text Image Manipulation (TIM) tasks: (i) the inability to remove text, (ii) the lack of control over the style of rendered text, and (iii) a tendency to generate duplicated letters. To address these challenges, we propose OmniText, a training-free generalist capable of performing a wide range of TIM tasks. Specifically, we investigate two key properties of cross- and self-attention mechanisms to enable text removal and to provide control over both text styles and content. Our findings reveal that text removal can be achieved by applying self-attention inversion, which mitigates the model’s tendency to focus on surrounding text, thus reducing text hallucinations. Additionally, we redistribute cross-attention, as increasing the probability of certain text tokens reduces text hallucination. For controllable inpainting, we introduce novel loss functions in a latent optimization framework: a cross-attention content loss to improve text rendering accuracy and a self-attention style loss to facilitate style customization. Furthermore, we present OmniText-Bench, a benchmark dataset for evaluating diverse TIM tasks. It includes input images, target text with masks, and style references, covering diverse applications such as text removal, rescaling, repositioning, and insertion and editing with various styles. Our OmniText framework is the first generalist method capable of performing diverse TIM tasks. It achieves state-of-the-art performance across multiple tasks and metrics compared to other text inpainting methods and is comparable with specialist methods.
[149] Enhancing Pre-trained Representation Classifiability can Boost its Interpretability
Shufan Shen, Zhaobo Qi, Junshu Sun, Qingming Huang, Qi Tian, Shuhui Wang
Main category: cs.CV
TL;DR: The paper discovers a positive correlation between interpretability and classifiability in pre-trained visual representations, proposes an Inherent Interpretability Score (IIS) to quantify interpretability, and shows that maximizing interpretability can improve classification performance.
Details
Motivation: Widespread applications of pre-trained visual models require both high classifiability and interpretability, but it was unclear whether these two objectives could be achieved simultaneously.Method: Proposed Inherent Interpretability Score (IIS) that quantifies representation interpretability by measuring the ratio of interpretable semantics and information loss. Used this to evaluate the relationship between interpretability and classifiability.
Result: Discovered positive correlation between interpretability and classifiability - representations with higher classifiability provide more interpretable semantics. Fine-tuning with interpretability maximization can further improve classifiability, and predictions based on interpretations show less accuracy degradation.
Conclusion: Practitioners can unify improvements in both interpretability and classifiability for pre-trained vision models, as these objectives are positively correlated rather than conflicting.
Abstract: The visual representation of a pre-trained model prioritizes the classifiability on downstream tasks, while the widespread applications for pre-trained visual models have posed new requirements for representation interpretability. However, it remains unclear whether the pre-trained representations can achieve high interpretability and classifiability simultaneously. To answer this question, we quantify the representation interpretability by leveraging its correlation with the ratio of interpretable semantics within the representations. Given the pre-trained representations, only the interpretable semantics can be captured by interpretations, whereas the uninterpretable part leads to information loss. Based on this fact, we propose the Inherent Interpretability Score (IIS) that evaluates the information loss, measures the ratio of interpretable semantics, and quantifies the representation interpretability. In the evaluation of the representation interpretability with different classifiability, we surprisingly discover that the interpretability and classifiability are positively correlated, i.e., representations with higher classifiability provide more interpretable semantics that can be captured in the interpretations. This observation further supports two benefits to the pre-trained representations. First, the classifiability of representations can be further improved by fine-tuning with interpretability maximization. Second, with the classifiability improvement for the representations, we obtain predictions based on their interpretations with less accuracy degradation. The discovered positive correlation and corresponding applications show that practitioners can unify the improvements in interpretability and classifiability for pre-trained vision models. Codes are available at https://github.com/ssfgunner/IIS.
[150] UHKD: A Unified Framework for Heterogeneous Knowledge Distillation via Frequency-Domain Representations
Fengming Yu, Haiwei Pan, Kejia Zhang, Jian Guan, Haiying Jiang
Main category: cs.CV
TL;DR: UHKD is a knowledge distillation framework that uses frequency domain transformations to enable effective knowledge transfer between heterogeneous teacher-student model architectures, overcoming semantic discrepancies in intermediate features.
Details
Motivation: Existing knowledge distillation methods perform poorly with heterogeneous architectures due to semantic discrepancies in intermediate representations, and most focus only on logits space, limiting use of semantic information from intermediate layers.Method: Proposes Unified Heterogeneous Knowledge Distillation (UHKD) using Fourier transform to capture global feature information in frequency domain. Includes Feature Transformation Module for teacher features and Feature Alignment Module for student features with multi-level matching. Uses joint objective combining MSE on intermediate features and KL divergence on logits.
Result: Experiments on CIFAR-100 and ImageNet-1K show performance gains of 5.59% and 0.83% respectively over the latest method, demonstrating effectiveness in unifying heterogeneous representations.
Conclusion: UHKD effectively addresses architectural heterogeneity in knowledge distillation by leveraging frequency domain transformations, enabling efficient utilization of visual knowledge across different model architectures.
Abstract: Knowledge distillation (KD) is an effective model compression technique that transfers knowledge from a high-performance teacher to a lightweight student, reducing cost while maintaining accuracy. In visual applications, where large-scale image models are widely used, KD enables efficient deployment. However, architectural diversity introduces semantic discrepancies that hinder the use of intermediate representations. Most existing KD methods are designed for homogeneous models and degrade in heterogeneous scenarios, especially when intermediate features are involved. Prior studies mainly focus on the logits space, making limited use of the semantic information in intermediate layers. To address this limitation, Unified Heterogeneous Knowledge Distillation (UHKD) is proposed as a framework that leverages intermediate features in the frequency domain for cross-architecture transfer. Fourier transform is applied to capture global feature information, alleviating representational discrepancies between heterogeneous teacher-student pairs. A Feature Transformation Module (FTM) produces compact frequency-domain representations of teacher features, while a learnable Feature Alignment Module (FAM) projects student features and aligns them via multi-level matching. Training is guided by a joint objective combining mean squared error on intermediate features with Kullback-Leibler divergence on logits. Experiments on CIFAR-100 and ImageNet-1K demonstrate gains of 5.59% and 0.83% over the latest method, highlighting UHKD as an effective approach for unifying heterogeneous representations and enabling efficient utilization of visual knowledge.
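A compressed sketch of the training objective: match teacher and student intermediate features in the Fourier domain and distill logits with a temperature-scaled KL term. The FTM/FAM modules are reduced to a single 1x1 convolution here, so this illustrates the loss structure under assumed shapes, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def uhkd_style_loss(student_feat, teacher_feat, student_logits, teacher_logits,
                    align: nn.Module, T: float = 4.0, beta: float = 1.0):
    """Frequency-domain feature matching + KL on logits. `align` stands in for the
    paper's alignment module; here it is a single 1x1 convolution."""
    s = torch.fft.rfft2(align(student_feat), norm="ortho")
    t = torch.fft.rfft2(teacher_feat, norm="ortho")
    feat_loss = F.mse_loss(torch.view_as_real(s), torch.view_as_real(t))
    kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                       F.softmax(teacher_logits / T, dim=-1),
                       reduction="batchmean") * T * T
    return feat_loss + beta * kd_loss

# Student features (B, 32, 8, 8), teacher features (B, 64, 8, 8): align channels first.
align = nn.Conv2d(32, 64, kernel_size=1)
s_feat, t_feat = torch.randn(4, 32, 8, 8), torch.randn(4, 64, 8, 8)
s_log, t_log = torch.randn(4, 100), torch.randn(4, 100)
print(uhkd_style_loss(s_feat, t_feat, s_log, t_log, align).item())
```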
[151] DogMo: A Large-Scale Multi-View RGB-D Dataset for 4D Canine Motion Recovery
Zan Wang, Siyu Chen, Luya Mo, Xinfeng Gao, Yuxin Shen, Lebin Ding, Wei Liang
Main category: cs.CV
TL;DR: DogMo is a large-scale multi-view RGB-D video dataset for dog motion recovery, featuring 1.2k sequences from 10 dogs with diverse breeds and motions, addressing limitations of existing datasets.
Details
Motivation: To overcome the lack of multi-view, real 3D data, limited scale and diversity in existing dog motion datasets, enabling systematic evaluation of motion recovery methods.Method: A three-stage, instance-specific optimization pipeline that fits the SMAL model through coarse alignment, dense correspondence supervision, and temporal regularization.
Result: Established four motion recovery benchmark settings supporting evaluation across monocular/multi-view and RGB/RGB-D inputs, providing accurate motion recovery.
Conclusion: DogMo dataset and method provide a principled foundation for advancing dog motion recovery research and open new directions in computer vision, graphics, and animal behavior modeling.
Abstract: We present DogMo, a large-scale multi-view RGB-D video dataset capturing diverse canine movements for the task of motion recovery from images. DogMo comprises 1.2k motion sequences collected from 10 unique dogs, offering rich variation in both motion and breed. It addresses key limitations of existing dog motion datasets, including the lack of multi-view and real 3D data, as well as limited scale and diversity. Leveraging DogMo, we establish four motion recovery benchmark settings that support systematic evaluation across monocular and multi-view, RGB and RGB-D inputs. To facilitate accurate motion recovery, we further introduce a three-stage, instance-specific optimization pipeline that fits the SMAL model to the motion sequences. Our method progressively refines body shape and pose through coarse alignment, dense correspondence supervision, and temporal regularization. Our dataset and method provide a principled foundation for advancing research in dog motion recovery and open up new directions at the intersection of computer vision, computer graphics, and animal behavior modeling.
[152] ETC: training-free diffusion models acceleration with Error-aware Trend Consistency
Jiajian Xie, Hubery Yin, Chen Li, Zhou Zhao, Shengyu Zhang
Main category: cs.CV
TL;DR: ETC is a training-free framework that accelerates diffusion models by reusing model outputs with trend consistency and error control, achieving 2.65x speedup over FLUX with minimal quality degradation.
Details
Motivation: Current training-free acceleration methods for diffusion models ignore denoising trends and lack error control, causing trajectory deviations and result inconsistencies when reusing model outputs across multiple steps.Method: ETC introduces (1) a consistent trend predictor that projects historical denoising patterns into stable future directions across multiple approximation steps, and (2) a model-specific error tolerance search mechanism that identifies transition points to derive corrective thresholds.
Result: ETC achieves 2.65x acceleration over FLUX with only -0.074 SSIM score degradation in consistency, demonstrating significant speedup with minimal quality loss.
Conclusion: The ETC framework effectively accelerates diffusion models by maintaining trend consistency and controlling errors during model output reuse, providing a practical solution for faster sampling without compromising generative quality.
Abstract: Diffusion models have achieved remarkable generative quality but remain bottlenecked by costly iterative sampling. Recent training-free methods accelerate diffusion process by reusing model outputs. However, these methods ignore denoising trends and lack error control for model-specific tolerance, leading to trajectory deviations under multi-step reuse and exacerbating inconsistencies in the generated results. To address these issues, we introduce Error-aware Trend Consistency (ETC), a framework that (1) introduces a consistent trend predictor that leverages the smooth continuity of diffusion trajectories, projecting historical denoising patterns into stable future directions and progressively distributing them across multiple approximation steps to achieve acceleration without deviating; (2) proposes a model-specific error tolerance search mechanism that derives corrective thresholds by identifying transition points from volatile semantic planning to stable quality refinement. Experiments show that ETC achieves a 2.65x acceleration over FLUX with negligible (-0.074 SSIM score) degradation of consistency.
[153] Compositional Image Synthesis with Inference-Time Scaling
Minsuk Ji, Sanghyeok Lee, Namhyuk Ahn
Main category: cs.CV
TL;DR: A training-free framework that uses LLMs to synthesize layouts and object-centric VLM reranking to improve text-to-image model compositionality while preserving image quality.
Details
Motivation: Modern text-to-image models struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations.Method: Leverage LLMs to synthesize explicit layouts from prompts, inject layouts into generation process, and use object-centric VLM to iteratively rerank candidates for prompt alignment.
Result: Achieves stronger scene alignment with prompts compared to recent text-to-image models while preserving aesthetic quality.
Conclusion: The framework successfully improves layout faithfulness in text-to-image generation through explicit layout-grounding and self-refinement techniques.
Abstract: Despite their impressive realism, modern text-to-image models still struggle with compositionality, often failing to render accurate object counts, attributes, and spatial relations. To address this challenge, we present a training-free framework that combines an object-centric approach with self-refinement to improve layout faithfulness while preserving aesthetic quality. Specifically, we leverage large language models (LLMs) to synthesize explicit layouts from input prompts, and we inject these layouts into the image generation process, where an object-centric vision-language model (VLM) judge reranks multiple candidates to select the most prompt-aligned outcome iteratively. By unifying explicit layout-grounding with self-refine-based inference-time scaling, our framework achieves stronger scene alignment with prompts compared to recent text-to-image models. The code is available at https://github.com/gcl-inha/ReFocus.
[154] VC4VG: Optimizing Video Captions for Text-to-Video Generation
Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, Qin Jin
Main category: cs.CV
TL;DR: VC4VG is a framework for optimizing video captions specifically for text-to-video generation, featuring a principled caption design methodology and a new benchmark with T2V-specific metrics.
Details
Motivation: Current text-to-video generation models rely on high-quality video-text pairs, but strategies for optimizing video captions specifically for T2V training are underexplored.Method: The paper introduces VC4VG framework that analyzes caption content from T2V perspective, decomposing essential elements for video reconstruction into multiple dimensions, and proposes a principled caption design methodology. It also constructs VC4VG-Bench benchmark with fine-grained, multi-dimensional metrics.
Result: Extensive T2V fine-tuning experiments show strong correlation between improved caption quality and video generation performance, validating the approach’s effectiveness.
Conclusion: The VC4VG framework successfully addresses the need for caption optimization in text-to-video generation, with benchmark tools and code released to support further research.
Abstract: Recent advances in text-to-video (T2V) generation highlight the critical role of high-quality video-text pairs in training models capable of producing coherent and instruction-aligned videos. However, strategies for optimizing video captions specifically for T2V training remain underexplored. In this paper, we introduce VC4VG (Video Captioning for Video Generation), a comprehensive caption optimization framework tailored to the needs of T2V models. We begin by analyzing caption content from a T2V perspective, decomposing the essential elements required for video reconstruction into multiple dimensions, and proposing a principled caption design methodology. To support evaluation, we construct VC4VG-Bench, a new benchmark featuring fine-grained, multi-dimensional, and necessity-graded metrics aligned with T2V-specific requirements. Extensive T2V fine-tuning experiments demonstrate a strong correlation between improved caption quality and video generation performance, validating the effectiveness of our approach. We release all benchmark tools and code at https://github.com/qyr0403/VC4VG to support further research.
[155] Enhancing Vision-Language Models for Autonomous Driving through Task-Specific Prompting and Spatial Reasoning
Aodi Wu, Xubo Luo
Main category: cs.CV
TL;DR: A systematic framework for Vision-Language Models in autonomous driving scene understanding using Mixture-of-Prompts routing, task-specific prompts with spatial reasoning, visual assembly, and optimized inference parameters.
Details
Motivation: To enhance Vision-Language Models' performance on autonomous driving tasks including perception, prediction, planning, and corruption detection through structured prompting and spatial grounding.Method: Four-component framework: 1) Mixture-of-Prompts router for question classification, 2) Task-specific prompts with coordinate systems, spatial reasoning, and reasoning chains, 3) Visual assembly with multi-view images and object crops, 4) Optimized inference parameters per task.
Result: Achieved 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data) using Qwen2.5-VL-72B model.
Conclusion: Structured prompting and spatial grounding significantly improve VLM performance on safety-critical autonomous driving tasks, demonstrating the effectiveness of the proposed systematic framework.
Abstract: This technical report presents our solution for the RoboSense Challenge at IROS 2025, which evaluates Vision-Language Models (VLMs) on autonomous driving scene understanding across perception, prediction, planning, and corruption detection tasks. We propose a systematic framework built on four core components. First, a Mixture-of-Prompts router classifies questions and dispatches them to task-specific expert prompts, eliminating interference across diverse question types. Second, task-specific prompts embed explicit coordinate systems, spatial reasoning rules, role-playing, Chain-of-Thought/Tree-of-Thought reasoning, and few-shot examples tailored to each task. Third, a visual assembly module composes multi-view images with object crops, magenta markers, and adaptive historical frames based on question requirements. Fourth, we configure model inference parameters (temperature, top-p, message roles) per task to optimize output quality. Implemented on Qwen2.5-VL-72B, our approach achieves 70.87% average accuracy on Phase-1 (clean data) and 72.85% on Phase-2 (corrupted data), demonstrating that structured prompting and spatial grounding substantially enhance VLM performance on safety-critical autonomous driving tasks. Code and prompt are available at https://github.com/wuaodi/UCAS-CSU-phase2.
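As a rough illustration of the routing idea, the sketch below dispatches a question to a task-specific expert prompt via simple keyword rules. The prompt templates and keywords are invented placeholders, not the challenge prompts used by the authors.

```python
# Minimal sketch of a Mixture-of-Prompts router: classify a question and dispatch
# it to a task-specific expert prompt. Routing rules and prompt texts are
# illustrative placeholders only.

TASK_PROMPTS = {
    "perception": "You are a driving-scene perception expert. Use the ego-vehicle "
                  "coordinate system and report object positions precisely.",
    "prediction": "You are a motion-prediction expert. Reason step by step about "
                  "each agent's likely trajectory before answering.",
    "planning":   "You are a planning expert. Propose a safe maneuver and justify it.",
}

ROUTING_KEYWORDS = {
    "perception": ("where is", "how many", "what color", "detect"),
    "prediction": ("will", "going to", "trajectory", "next"),
    "planning":   ("should the ego", "safe to", "plan", "maneuver"),
}

def route_question(question: str) -> str:
    """Return the expert system prompt for the most likely task of the question."""
    q = question.lower()
    scores = {task: sum(kw in q for kw in kws) for task, kws in ROUTING_KEYWORDS.items()}
    best_task = max(scores, key=scores.get)
    return TASK_PROMPTS[best_task]

if __name__ == "__main__":
    print(route_question("How many pedestrians are to the left of the ego vehicle?"))
```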
[156] Vanish into Thin Air: Cross-prompt Universal Adversarial Attacks for SAM2
Ziqi Zhou, Yifan Hu, Yufei Song, Zijing Li, Shengshan Hu, Leo Yu Zhang, Dezhong Yao, Long Zheng, Hai Jin
Main category: cs.CV
TL;DR: The paper proposes UAP-SAM2, the first cross-prompt universal adversarial attack against the SAM2 video segmentation model, addressing challenges from architectural differences with SAM.
Details
Motivation: SAM2's robustness remains unexplored despite its strong video segmentation capabilities, and existing attacks on SAM may not transfer effectively due to architectural differences involving prompt guidance and frame-to-frame semantic entanglement.Method: UAP-SAM2 uses a target-scanning strategy that divides frames into regions with random prompts to reduce prompt dependency, and a dual semantic deviation framework that distorts semantics within frames and disrupts consistency across consecutive frames.
Result: Extensive experiments on six datasets across two segmentation tasks show UAP-SAM2 significantly outperforms state-of-the-art attacks by a large margin.
Conclusion: The proposed UAP-SAM2 effectively addresses the unique challenges of attacking SAM2 and demonstrates superior performance compared to existing methods.
Abstract: Recent studies reveal the vulnerability of the image segmentation foundation model SAM to adversarial examples. Its successor, SAM2, has attracted significant attention due to its strong generalization capability in video segmentation. However, its robustness remains unexplored, and it is unclear whether existing attacks on SAM can be directly transferred to SAM2. In this paper, we first analyze the performance gap of existing attacks between SAM and SAM2 and highlight two key challenges arising from their architectural differences: directional guidance from the prompt and semantic entanglement across consecutive frames. To address these issues, we propose UAP-SAM2, the first cross-prompt universal adversarial attack against SAM2 driven by dual semantic deviation. For cross-prompt transferability, we begin by designing a target-scanning strategy that divides each frame into k regions, each randomly assigned a prompt, to reduce prompt dependency during optimization. For effectiveness, we design a dual semantic deviation framework that optimizes a UAP by distorting the semantics within the current frame and disrupting the semantic consistency across consecutive frames. Extensive experiments on six datasets across two segmentation tasks demonstrate the effectiveness of the proposed method for SAM2. The comparative results show that UAP-SAM2 significantly outperforms state-of-the-art (SOTA) attacks by a large margin.
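A minimal sketch of a universal-perturbation optimization loop with a dual deviation objective is given below. It assumes a generic differentiable image `encoder` standing in for the victim model's feature extractor, and the loss weights and perturbation budget are illustrative; this is not the UAP-SAM2 implementation.

```python
import torch

# Sketch: optimize one perturbation shared by all frames so that (1) each perturbed
# frame's features deviate from the clean features and (2) feature consistency across
# consecutive frames is disrupted.

def optimize_uap(encoder, frames, eps=8 / 255, steps=100, lr=1e-2, alpha=1.0, beta=1.0):
    uap = torch.zeros_like(frames[0], requires_grad=True)  # one perturbation for all frames
    opt = torch.optim.Adam([uap], lr=lr)
    for _ in range(steps):
        total = 0.0
        prev_adv_feat = None
        for frame in frames:
            clean_feat = encoder(frame.unsqueeze(0)).detach()
            adv_feat = encoder((frame + uap).clamp(0, 1).unsqueeze(0))
            # (1) within-frame deviation: move away from the clean semantics
            intra = -torch.nn.functional.mse_loss(adv_feat, clean_feat)
            # (2) cross-frame deviation: break consistency with the previous frame
            inter = 0.0
            if prev_adv_feat is not None:
                inter = torch.nn.functional.cosine_similarity(
                    adv_feat.flatten(1), prev_adv_feat.flatten(1)).mean()
            prev_adv_feat = adv_feat.detach()
            total = total + alpha * intra + beta * inter
        opt.zero_grad()
        total.backward()
        opt.step()
        uap.data.clamp_(-eps, eps)  # keep the perturbation within the L-inf budget
    return uap.detach()
```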
[157] CLFSeg: A Fuzzy-Logic based Solution for Boundary Clarity and Uncertainty Reduction in Medical Image Segmentation
Anshul Kaushal, Kunal Jangid, Vinod K. Kurmi
Main category: cs.CV
TL;DR: CLFSeg is an encoder-decoder framework that combines fuzzy logic with convolutional networks for medical image segmentation, achieving state-of-the-art performance on polyp and cardiac datasets while handling uncertainty and class imbalance.
Details
Motivation: Traditional CNN-based models have limited generalizability, robustness, and inability to handle uncertainty in medical image segmentation, which affects performance for early cancer detection and treatment planning.Method: Proposes CLFSeg framework with Fuzzy-Convolutional (FC) module that aggregates convolutional layers and fuzzy logic to identify local/global features while minimizing uncertainty, noise, and boundary ambiguity. Uses binary cross-entropy with dice loss to handle class imbalance.
Result: Exceptional performance on four public datasets (CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, ACDC), surpassing existing SOTA methods while maintaining computational efficiency.
Conclusion: CLFSeg improves segmentation performance while ensuring computing efficiency, making it a potential solution for real-world medical diagnostic scenarios.
Abstract: Accurate polyp and cardiac segmentation for early detection and treatment is essential for the diagnosis and treatment planning of cancer-like diseases. Traditional convolutional neural network (CNN) based models have shown limited generalizability and robustness and an inability to handle uncertainty, which affects segmentation performance. To solve these problems, this paper introduces CLFSeg, an encoder-decoder based framework built around a Fuzzy-Convolutional (FC) module that combines convolutional layers with fuzzy logic. This module enhances segmentation performance by identifying local and global features while minimizing uncertainty, noise, and ambiguity in boundary regions, ensuring computing efficiency. To handle the class imbalance problem while focusing on areas of interest with tiny and boundary regions, binary cross-entropy (BCE) with dice loss is incorporated. Our proposed model exhibits exceptional performance on four publicly available datasets, including CVC-ColonDB, CVC-ClinicDB, EtisLaribPolypDB, and ACDC. Extensive experiments and visual studies show CLFSeg surpasses the existing SOTA performance and focuses on relevant regions of interest in anatomical structures. The proposed CLFSeg improves performance while ensuring computing efficiency, which makes it a potential solution for real-world medical diagnostic scenarios. Project page is available at https://visdomlab.github.io/CLFSeg/
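The BCE-with-Dice objective mentioned above is standard and easy to sketch; the equal weighting of the two terms below is an assumption for illustration, not necessarily the weighting used in CLFSeg.

```python
import torch
import torch.nn.functional as F

# Sketch of a combined BCE + Dice loss, a common remedy when the foreground
# (polyp / cardiac structure) covers only a small fraction of the image.

def bce_dice_loss(logits, target, smooth=1.0):
    """logits, target: tensors of shape (B, 1, H, W); target values in {0, 1}."""
    bce = F.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum(dim=(1, 2, 3))
    union = probs.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3))
    dice = (2.0 * intersection + smooth) / (union + smooth)
    return bce + (1.0 - dice.mean())
```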
[158] MC-SJD : Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration
Junhyuk So, Hyunho Kook, Chaeyeon Jang, Eunhyeok Park
Main category: cs.CV
TL;DR: MC-SJD is a training-free parallel decoding framework that accelerates autoregressive visual generation by improving token stability across iterations, achieving up to 4.2x faster image generation and 13.3x faster video generation without quality loss.
Details
Motivation: Autoregressive modeling for visual generation suffers from slow inference speed due to per-token generation requiring thousands of steps per sample, limiting practical adoption.Method: Extends Speculative Jacobi Decoding (SJD) with MC-SJD, an information-theoretic approach based on coupling that maximizes probability of sampling identical draft tokens across iterations while preserving lossless property.
Result: Achieves up to ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, with no degradation in output quality.
Conclusion: MC-SJD provides substantial performance gains for AR visual generation with minimal algorithmic changes, making it practical for real-world applications.
Abstract: While autoregressive (AR) modeling has recently emerged as a new paradigm in visual generation, its practical adoption is severely constrained by the slow inference speed of per-token generation, which often requires thousands of steps to produce a single sample. To address this challenge, we propose MC-SJD, a training-free, lossless parallel decoding framework designed to accelerate AR visual generation by extending the recently introduced Speculative Jacobi Decoding (SJD). Although SJD shows strong potential for accelerating AR generation, we demonstrate that token instability across iterations significantly reduces the acceptance rate, a limitation that primarily arises from the independent sampling process used during draft token generation. To overcome this, we introduce MC-SJD, an information-theoretic approach based on coupling, which substantially accelerates standard SJD by maximizing the probability of sampling identical draft tokens across consecutive iterations, all while preserving its lossless property. Remarkably, this method requires only a single-line modification to the existing algorithm, yet achieves substantial performance gains, delivering up to a ~4.2x acceleration in image generation and ~13.3x acceleration in video generation compared to standard AR decoding, without any degradation in output quality.
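The coupling idea can be illustrated on a single token position. The sketch below draws the new draft token so that it equals the previous iteration's token with the maximal probability the two distributions allow, while still being a valid sample from the current distribution; the single-token setting and variable names are simplifications, not the paper's exact algorithm.

```python
import torch

# Maximal coupling between the previous iteration's draft distribution p (which
# produced `prev_token`) and the current iteration's distribution q: keep the token
# with probability min(p, q)/p at that token, otherwise resample from the residual
# mass of q that p under-covered. The marginal remains exactly q.

def coupled_draft_sample(p, q, prev_token):
    """p, q: 1-D probability vectors over the vocabulary; prev_token: int."""
    keep_prob = torch.minimum(p[prev_token], q[prev_token]) / p[prev_token].clamp_min(1e-12)
    if torch.rand(()) < keep_prob:
        return prev_token  # stable draft token across iterations -> higher acceptance rate
    residual = (q - torch.minimum(p, q)).clamp_min(0.0)
    residual = residual / residual.sum().clamp_min(1e-12)
    return int(torch.multinomial(residual, 1))
```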
[159] Beyond Inference Intervention: Identity-Decoupled Diffusion for Face Anonymization
Haoxin Yang, Yihong Lin, Jingdan Kang, Xuemiao Xu, Yue Li, Cheng Xu, Shengfeng He
Main category: cs.CV
TL;DR: ID²Face is a training-centric face anonymization framework that learns a disentangled latent space to separate identity from non-identity attributes, enabling direct anonymization without inference-time optimization.
Details
Motivation: Existing diffusion-based anonymization methods rely on inference-time interventions that cause distribution shifts and entangle identity with non-identity attributes, degrading visual quality and data utility.Method: Uses a conditional diffusion model with identity-masked learning, featuring an Identity-Decoupled Latent Recomposer with Identity VAE for identity features and bidirectional latent alignment for non-identity attributes, plus an Identity-Guided Latent Harmonizer with soft-gating fusion.
Result: Outperforms existing methods in visual quality, identity suppression, and utility preservation.
Conclusion: ID²Face provides effective face anonymization through explicit disentanglement of identity and non-identity features in a structured latent space, eliminating the need for inference-time optimization.
Abstract: Face anonymization aims to conceal identity information while preserving non-identity attributes. Mainstream diffusion models rely on inference-time interventions such as negative guidance or energy-based optimization, which are applied post-training to suppress identity features. These interventions often introduce distribution shifts and entangle identity with non-identity attributes, degrading visual fidelity and data utility. To address this, we propose ID²Face, a training-centric anonymization framework that removes the need for inference-time optimization. The rationale of our method is to learn a structured latent space where identity and non-identity information are explicitly disentangled, enabling direct and controllable anonymization at inference. To this end, we design a conditional diffusion model with an identity-masked learning scheme. An Identity-Decoupled Latent Recomposer uses an Identity Variational Autoencoder to model identity features, while non-identity attributes are extracted from same-identity pairs and aligned through bidirectional latent alignment. An Identity-Guided Latent Harmonizer then fuses these representations via soft-gating conditioned on noisy feature prediction. The model is trained with a recomposition-based reconstruction loss to enforce disentanglement. At inference, anonymization is achieved by sampling a random identity vector from the learned identity space. To further suppress identity leakage, we introduce an Orthogonal Identity Mapping strategy that enforces orthogonality between sampled and source identity vectors. Experiments demonstrate that ID²Face outperforms existing methods in visual quality, identity suppression, and utility preservation.
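As a rough sketch of the orthogonality constraint, the snippet below removes the component of a sampled identity vector that lies along the source identity direction (a standard Gram-Schmidt projection); the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

# Project the sampled identity embedding onto the subspace orthogonal to the
# source identity embedding so the anonymized face shares no identity direction
# with the original.

def orthogonalize_identity(sampled_id, source_id, eps=1e-8):
    """sampled_id, source_id: (B, D) identity embeddings."""
    source_unit = F.normalize(source_id, dim=-1, eps=eps)
    projection = (sampled_id * source_unit).sum(-1, keepdim=True) * source_unit
    return sampled_id - projection  # component of sampled_id orthogonal to source_id
```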
[160] SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodal LLMs
Jinhong Deng, Wen Li, Joey Tianyi Zhou, Yang He
Main category: cs.CV
TL;DR: SCOPE is a visual token pruning method for MLLMs that jointly considers saliency and coverage to preserve semantic completeness while reducing computational overhead.
Details
Motivation: Existing visual token pruning methods focus only on saliency, leading to semantic incompleteness in selected tokens, while many visual tokens in MLLMs are redundant.Method: Proposes SCOPE score that integrates saliency and token-coverage gain, iteratively selecting tokens with highest SCOPE score based on set-coverage computed from token relationships.
Result: Outperforms prior approaches on multiple vision-language understanding benchmarks using LLaVA-1.5 and LLaVA-Next models.
Conclusion: SCOPE effectively reduces computational overhead while preserving semantic completeness through joint modeling of saliency and coverage in visual token pruning.
Abstract: Multimodal Large Language Models (MLLMs) typically process a large number of visual tokens, leading to considerable computational overhead, even though many of these tokens are redundant. Existing visual token pruning methods primarily focus on selecting the most salient tokens based on attention scores, resulting in the semantic incompleteness of the selected tokens. In this paper, we propose a novel visual token pruning strategy, called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE), to jointly model both the saliency and coverage of the selected visual tokens to better preserve semantic completeness. Specifically, we introduce a set-coverage for a given set of selected tokens, computed based on the token relationships. We then define a token-coverage gain for each unselected token, quantifying how much additional coverage would be obtained by including it. By integrating the saliency score into the token-coverage gain, we propose our SCOPE score and iteratively select the token with the highest SCOPE score. We conduct extensive experiments on multiple vision-language understanding benchmarks using the LLaVA-1.5 and LLaVA-Next models. Experimental results demonstrate that our method consistently outperforms prior approaches. Our code is available at https://github.com/kinredon/SCOPE.
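A minimal sketch of saliency-plus-coverage greedy selection is given below. Coverage is modeled facility-location style (a candidate's gain is how much it raises each token's best similarity to the selected set), and combining saliency with the gain by weighted addition is an assumption rather than the paper's exact SCOPE score.

```python
import torch

# Greedy selection of k visual tokens scoring both saliency and coverage gain.

def scope_select(features, saliency, k, lam=1.0):
    """features: (N, D) visual token features; saliency: (N,); returns k indices."""
    sim = torch.nn.functional.normalize(features, dim=-1)
    sim = sim @ sim.t()                                  # (N, N) token-to-token similarity
    n = features.size(0)
    covered = torch.zeros(n, device=features.device)     # best similarity of each token to the selected set
    selected, remaining = [], set(range(n))
    for _ in range(k):
        best_idx, best_score = None, float("-inf")
        for i in remaining:
            gain = torch.clamp(sim[i] - covered, min=0).sum()   # token-coverage gain
            score = saliency[i] + lam * gain                    # assumed SCOPE-style score
            if score > best_score:
                best_idx, best_score = i, float(score)
        selected.append(best_idx)
        remaining.remove(best_idx)
        covered = torch.maximum(covered, sim[best_idx])
    return selected
```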
[161] Benchmarking Microsaccade Recognition with Event Cameras: A Novel Dataset and Evaluation
Waseem Shariff, Timothy Hanley, Maciej Stec, Hossein Javidnia, Peter Corcoran
Main category: cs.CV
TL;DR: Pioneering event-based microsaccade dataset using Blender simulation and v2e conversion, achieving ~90% classification accuracy with spiking neural networks for fine motion recognition.
Details
Motivation: Traditional microsaccade studies using eye trackers or frame-based analysis are costly, limited in scalability and temporal resolution. Event-based sensing offers high-speed, low-latency alternative for capturing fine-grained spatiotemporal changes.Method: Created event-based microsaccade dataset using Blender to render high-fidelity eye movement scenarios with angular displacements from 0.5 to 2.0 degrees across seven classes. Converted to event streams using v2e, preserving natural temporal dynamics. Evaluated with Spiking-VGG11, Spiking-VGG13, Spiking-VGG16, and proposed Spiking-VGG16Flow (optical-flow-enhanced variant) in SpikingJelly.
Result: Models achieved around 90% average accuracy, successfully classifying microsaccades by angular displacement independent of event count or duration.
Conclusion: Demonstrates potential of spiking neural networks for fine motion recognition and establishes benchmark for event-based vision research. Dataset, code, and trained models will be publicly available.
Abstract: Microsaccades are small, involuntary eye movements vital for visual perception and neural processing. Traditional microsaccade studies typically use eye trackers or frame-based analysis, which, while precise, are costly and limited in scalability and temporal resolution. Event-based sensing offers a high-speed, low-latency alternative by capturing fine-grained spatiotemporal changes efficiently. This work introduces a pioneering event-based microsaccade dataset to support research on small eye movement dynamics in cognitive computing. Using Blender, we render high-fidelity eye movement scenarios and simulate microsaccades with angular displacements from 0.5 to 2.0 degrees, divided into seven distinct classes. These are converted to event streams using v2e, preserving the natural temporal dynamics of microsaccades, with durations ranging from 0.25 ms to 2.25 ms. We evaluate the dataset using Spiking-VGG11, Spiking-VGG13, and Spiking-VGG16, and propose Spiking-VGG16Flow, an optical-flow-enhanced variant implemented in SpikingJelly. The models achieve around 90 percent average accuracy, successfully classifying microsaccades by angular displacement, independent of event count or duration. These results demonstrate the potential of spiking neural networks for fine motion recognition and establish a benchmark for event-based vision research. The dataset, code, and trained models will be publicly available at https://waseemshariff126.github.io/microsaccades/ .
[162] Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy
Qing Zhao, Weijian Deng, Pengxu Wei, ZiYi Dong, Hannan Lu, Xiangyang Ji, Liang Lin
Main category: cs.CV
TL;DR: LR-YOLO integrates image restoration directly into object detection to address functional mismatch between restoration and detection networks, improving detection stability and accuracy in adverse conditions.
Details
Motivation: Traditional cascade frameworks where restoration is applied before detection suffer from functional mismatch - restoration networks perform smooth transformations while detectors have discontinuous decision boundaries, causing instability and sensitivity to minor perturbations.Method: Propose Lipschitz-regularized object detection (LROD) framework that integrates image restoration directly into detector’s feature learning, harmonizing Lipschitz continuity of both tasks during training. Implemented as LR-YOLO extending existing YOLO detectors.
Result: Extensive experiments on haze and low-light benchmarks show LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy compared to traditional cascade approaches.
Conclusion: Integrating restoration directly into detection networks through Lipschitz regularization effectively addresses the functional mismatch problem, providing more robust object detection in adverse conditions.
Abstract: To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration – an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector’s feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.
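One generic way to encourage the detector-side smoothness discussed above is to penalize large input-output Jacobian norms. The sketch below uses a random-projection estimate and assumes the detector returns a single tensor; it is a common surrogate for controlling Lipschitz continuity, not necessarily the LR-YOLO regularizer.

```python
import torch

# Penalize per-sample ||J^T v|| (with a random unit direction v) above a target value,
# a stochastic estimate of the input-output Jacobian norm.

def lipschitz_penalty(model, images, target=1.0):
    images = images.clone().requires_grad_(True)
    out = model(images)
    v = torch.randn_like(out)
    v = v / v.flatten(1).norm(dim=1).view(-1, *([1] * (out.dim() - 1))).clamp_min(1e-12)
    (jac_t_v,) = torch.autograd.grad((out * v).sum(), images, create_graph=True)
    jac_norm = jac_t_v.flatten(1).norm(dim=1)
    return ((jac_norm - target).clamp(min=0) ** 2).mean()

# hypothetical usage: loss = det_loss + lambda_lip * lipschitz_penalty(detector, images)
```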
[163] DeshadowMamba: Deshadowing as 1D Sequential Similarity
Zhaotong Yang, Yi Chen, Yanying Li, Shengfeng He, Yangyang Xu, Junyu Dong, Jian Yang, Yong Du
Main category: cs.CV
TL;DR: Proposes DeshadowMamba, a shadow removal method using Mamba’s selective state space model with CrossGate modulation and ColorShift regularization to address attention-based models’ limitations in preserving structure and color consistency.
Details
Motivation: Attention-based shadow removal models suffer from mixing irrelevant illumination cues, causing distorted structures and inconsistent colors due to fixed attention patterns.Method: Uses Mamba for sequence modeling with directional state transitions, adds CrossGate for shadow-aware similarity modulation, and ColorShift regularization for contrastive learning with global color statistics.
Result: Achieves state-of-the-art visual quality and strong quantitative performance on public benchmarks.
Conclusion: Sequence modeling with selective state transitions and shadow-aware modulation effectively addresses structural and color consistency challenges in shadow removal.
Abstract: Recent deep models for image shadow removal often rely on attention-based architectures to capture long-range dependencies. However, their fixed attention patterns tend to mix illumination cues from irrelevant regions, leading to distorted structures and inconsistent colors. In this work, we revisit shadow removal from a sequence modeling perspective and explore the use of Mamba, a selective state space model that propagates global context through directional state transitions. These transitions yield an efficient global receptive field while preserving positional continuity. Despite its potential, directly applying Mamba to image data is suboptimal, since it lacks awareness of shadow-non-shadow semantics and remains susceptible to color interference from nearby regions. To address these limitations, we propose CrossGate, a directional modulation mechanism that injects shadow-aware similarity into Mamba’s input gate, allowing selective integration of relevant context along transition axes. To further ensure appearance fidelity, we introduce ColorShift regularization, a contrastive learning objective driven by global color statistics. By synthesizing structured informative negatives, it guides the model to suppress color contamination and achieve robust color restoration. Together, these components adapt sequence modeling to the structural integrity and chromatic consistency required for shadow removal. Extensive experiments on public benchmarks demonstrate that DeshadowMamba achieves state-of-the-art visual quality and strong quantitative performance.
[164] UtilGen: Utility-Centric Generative Data Augmentation with Dual-Level Task Adaptation
Jiyu Guo, Shuo Yang, Yiming Huang, Yancheng Long, Xiaobo Xia, Xiu Su, Bo Zhao, Zeke Xie, Liqiang Nie
Main category: cs.CV
TL;DR: UtilGen is a utility-centric data augmentation framework that optimizes synthetic data generation using downstream task feedback, achieving superior performance over traditional methods.
Details
Motivation: Most data augmentation methods focus on visual quality metrics like fidelity and diversity, but neglect task-specific requirements which vary across different tasks and architectures.Method: UtilGen uses a weight allocation network to evaluate task-specific utility of synthetic samples, then iteratively refines generation through dual-level optimization: model-level (tailoring generative model) and instance-level (adjusting prompts and noise).
Result: Extensive experiments on 8 benchmark datasets show UtilGen achieves average 3.87% accuracy improvement over previous state-of-the-art methods, producing more impactful and task-relevant synthetic data.
Conclusion: UtilGen validates the effectiveness of shifting from visual characteristics-centric to task utility-centric data augmentation paradigm, demonstrating superior performance across diverse tasks.
Abstract: Data augmentation using generative models has emerged as a powerful paradigm for enhancing performance in computer vision tasks. However, most existing augmentation approaches primarily focus on optimizing intrinsic data attributes – such as fidelity and diversity – to generate visually high-quality synthetic data, while often neglecting task-specific requirements. Yet, it is essential for data generators to account for the needs of downstream tasks, as training data requirements can vary significantly across different tasks and network architectures. To address these limitations, we propose UtilGen, a novel utility-centric data augmentation framework that adaptively optimizes the data generation process to produce task-specific, high-utility training data via downstream task feedback. Specifically, we first introduce a weight allocation network to evaluate the task-specific utility of each synthetic sample. Guided by these evaluations, UtilGen iteratively refines the data generation process using a dual-level optimization strategy to maximize the synthetic data utility: (1) model-level optimization tailors the generative model to the downstream task, and (2) instance-level optimization adjusts generation policies – such as prompt embeddings and initial noise – at each generation round. Extensive experiments on eight benchmark datasets of varying complexity and granularity demonstrate that UtilGen consistently achieves superior performance, with an average accuracy improvement of 3.87% over previous SOTA. Further analysis of data influence and distribution reveals that UtilGen produces more impactful and task-relevant synthetic data, validating the effectiveness of the paradigm shift from visual characteristics-centric to task utility-centric data augmentation.
[165] Training-free Source Attribution of AI-generated Images via Resynthesis
Pietro Bongini, Valentina Molinari, Andrea Costanzo, Benedetta Tondi, Mauro Barni
Main category: cs.CV
TL;DR: A training-free one-shot attribution method using image resynthesis outperforms existing techniques in few-shot scenarios for synthetic image source attribution.
Details
Motivation: Synthetic image source attribution is challenging under data scarcity conditions, requiring few-shot or zero-shot classification capabilities.Method: Training-free one-shot attribution based on image resynthesis: generate a prompt describing the image, resynthesize it with all candidate sources, and attribute to the model producing the closest resynthesis in feature space.
Result: The proposed resynthesis method outperforms state-of-the-art few-shot approaches and other baselines when only a few samples are available for training or fine-tuning.
Conclusion: The method is effective for few-shot attribution, and the new dataset provides a challenging benchmark for developing future few-shot and zero-shot attribution methods.
Abstract: Synthetic image source attribution is a challenging task, especially in data scarcity conditions requiring few-shot or zero-shot classification capabilities. We present a new training-free one-shot attribution method based on image resynthesis. A prompt describing the image under analysis is generated, then it is used to resynthesize the image with all the candidate sources. The image is attributed to the model which produced the resynthesis closest to the original image in a proper feature space. We also introduce a new dataset for synthetic image attribution consisting of face images from commercial and open-source text-to-image generators. The dataset provides a challenging attribution framework, useful for developing new attribution models and testing their capabilities on different generative architectures. The dataset structure allows testing approaches based on resynthesis and comparing them to few-shot methods. Results from state-of-the-art few-shot approaches and other baselines show that the proposed resynthesis method outperforms existing techniques when only a few samples are available for training or fine-tuning. The experiments also demonstrate that the new dataset is a challenging one and represents a valuable benchmark for developing and evaluating future few-shot and zero-shot methods.
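The resynthesis-and-compare procedure is straightforward to sketch. In the snippet below, `caption_fn`, `generators`, and `feature_extractor` are placeholders for whichever captioner, candidate text-to-image models, and embedding network one chooses; they are not specified by the paper.

```python
import torch

# Caption the query image, regenerate it with every candidate generator, and
# attribute it to the generator whose resynthesis is closest in feature space.

def attribute_by_resynthesis(image, caption_fn, generators, feature_extractor):
    """generators: dict mapping source name -> callable(prompt) -> image tensor."""
    prompt = caption_fn(image)
    query_feat = feature_extractor(image)
    best_source, best_dist = None, float("inf")
    for name, generate in generators.items():
        resynth = generate(prompt)
        dist = torch.norm(feature_extractor(resynth) - query_feat)
        if dist < best_dist:
            best_source, best_dist = name, float(dist)
    return best_source
```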
[166] ViPER: Empowering the Self-Evolution of Visual Perception Abilities in Vision-Language Model
Juntian Zhang, Song Jin, Chuanqi Cheng, Yuhan Liu, Yankai Lin, Xun Zhang, Yufei Zhang, Fei Jiang, Guojun Yin, Wei Lin, Rui Yan
Main category: cs.CV
TL;DR: ViPER is a self-bootstrapping framework that enhances fine-grained visual perception in VLMs through a coarse-to-fine progressive learning process with self-critiquing and self-prediction, achieving significant performance gains while maintaining general capabilities.
Details
Motivation: Address the critical bottleneck of limited fine-grained visual perception in VLMs, overcoming challenges from scarce high-quality data and limitations of existing methods like SFT (compromising general capabilities) and RFT (prioritizing textual reasoning over visual perception).Method: Two-stage task structuring visual perception as coarse-to-fine progressive process; self-bootstrapping framework with iterative evolution through self-critiquing and self-prediction; synergistic integration of image-level and instance-level reconstruction with two-stage reinforcement learning; closed-loop training paradigm using internally synthesized data.
Result: Applied to Qwen2.5-VL family, produces Qwen-Viper series with average gain of 1.7% on seven comprehensive benchmarks across various tasks and up to 6.0% on fine-grained perception; consistently superior performance across different vision-language scenarios while maintaining generalizability.
Conclusion: ViPER enables self-improvement in perceptual capabilities and provides evidence for the reciprocal relationship between generation and understanding, representing a breakthrough for developing more autonomous and capable VLMs.
Abstract: The limited capacity for fine-grained visual perception presents a critical bottleneck for Vision-Language Models (VLMs) in real-world applications. Addressing this is challenging due to the scarcity of high-quality data and the limitations of existing methods: supervised fine-tuning (SFT) often compromises general capabilities, while reinforcement fine-tuning (RFT) prioritizes textual reasoning over visual perception. To bridge this gap, we propose a novel two-stage task that structures visual perception learning as a coarse-to-fine progressive process. Based on this task formulation, we develop ViPER, a self-bootstrapping framework specifically designed to enable iterative evolution through self-critiquing and self-prediction. By synergistically integrating image-level and instance-level reconstruction with a two-stage reinforcement learning strategy, ViPER establishes a closed-loop training paradigm, where internally synthesized data directly fuel the enhancement of perceptual ability. Applied to the Qwen2.5-VL family, ViPER produces the Qwen-Viper series. With an average gain of 1.7% on seven comprehensive benchmarks spanning various tasks and up to 6.0% on fine-grained perception, Qwen-Viper consistently demonstrates superior performance across different vision-language scenarios while maintaining generalizability. Beyond enabling self-improvement in perceptual capabilities, ViPER provides concrete evidence for the reciprocal relationship between generation and understanding, a breakthrough to developing more autonomous and capable VLMs.
[167] Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski
Main category: cs.CV
TL;DR: Prompt learning is an effective lightweight adaptation strategy for few-shot remote sensing scene classification, outperforming zero-shot CLIP and linear probe baselines while bridging domain gaps.
Details
Motivation: Remote sensing scene classification faces challenges with limited labeled data and high annotation costs across diverse domains. Direct application of vision-language models like CLIP is suboptimal due to domain gaps and need for semantic adaptation.Method: Systematically evaluated several prompt learning methods: Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. Compared against zero-shot CLIP with hand-crafted prompts and linear probe on frozen CLIP features.
Result: Prompt learning consistently outperforms both baselines in few-shot scenarios across multiple benchmark datasets. Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance.
Conclusion: Prompt learning serves as a scalable and efficient solution for bridging domain gaps in satellite and aerial imagery, providing a strong foundation for future research in remote sensing applications.
Abstract: Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
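As a rough sketch of Context Optimization-style prompt learning, the module below trains a small set of context vectors prepended to frozen class-name embeddings while the CLIP encoders stay frozen; `text_encoder` and `class_name_embeddings` are placeholders for the frozen CLIP components, and the context length and initialization are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Learnable prompt context: only self.ctx receives gradients; the text encoder and
# class-name token embeddings are kept frozen.

class LearnablePromptClassifier(nn.Module):
    def __init__(self, text_encoder, class_name_embeddings, n_ctx=16, dim=512):
        super().__init__()
        self.text_encoder = text_encoder                          # frozen
        self.register_buffer("cls_emb", class_name_embeddings)    # (C, L_cls, dim)
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # shared learnable context

    def forward(self, image_features):
        c = self.cls_emb.size(0)
        ctx = self.ctx.unsqueeze(0).expand(c, -1, -1)             # (C, n_ctx, dim)
        prompts = torch.cat([ctx, self.cls_emb], dim=1)           # prepend context tokens
        text_features = self.text_encoder(prompts)                # assumed to return (C, dim)
        image_features = nn.functional.normalize(image_features, dim=-1)
        text_features = nn.functional.normalize(text_features, dim=-1)
        return image_features @ text_features.t()                 # cosine-similarity logits
```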
[168] Adaptive Knowledge Transferring with Switching Dual-Student Framework for Semi-Supervised Medical Image Segmentation
Thanh-Huy Nguyen, Hoang-Thien Nguyen, Ba-Thinh Lam, Vi Vu, Bach X. Nguyen, Jianhua Xing, Tianyang Wang, Xingjian Li, Min Xu
Main category: cs.CV
TL;DR: A novel switching Dual-Student framework with Loss-Aware Exponential Moving Average for semi-supervised medical image segmentation that outperforms state-of-the-art methods.
Details
Motivation: Existing teacher-student frameworks in semi-supervised medical image segmentation suffer from strong correlation and unreliable knowledge transfer between teacher and student networks, limiting learning effects.Method: Introduces switching Dual-Student architecture that selects the most reliable student at each iteration to enhance collaboration and prevent error reinforcement, plus Loss-Aware Exponential Moving Average to dynamically ensure teacher absorbs meaningful information from students.
Result: Extensively evaluated on 3D medical image segmentation datasets, outperforming state-of-the-art semi-supervised methods and demonstrating improved segmentation accuracy under limited supervision.
Conclusion: The proposed plug-and-play framework effectively addresses limitations in teacher-student correlation and knowledge transfer, achieving superior performance in semi-supervised medical image segmentation.
Abstract: Teacher-student frameworks have emerged as a leading approach in semi-supervised medical image segmentation, demonstrating strong performance across various tasks. However, the learning effects are still limited by the strong correlation and unreliable knowledge transfer process between teacher and student networks. To overcome this limitation, we introduce a novel switching Dual-Student architecture that strategically selects the most reliable student at each iteration to enhance dual-student collaboration and prevent error reinforcement. We also introduce a strategy of Loss-Aware Exponential Moving Average to dynamically ensure that the teacher absorbs meaningful information from students, improving the quality of pseudo-labels. Our plug-and-play framework is extensively evaluated on 3D medical image segmentation datasets, where it outperforms state-of-the-art semi-supervised methods, demonstrating its effectiveness in improving segmentation accuracy under limited supervision.
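A minimal sketch of a loss-aware EMA update is shown below: the teacher's momentum is modulated by the current student loss so that a well-performing (low-loss) student contributes more to the teacher. The specific loss-to-momentum mapping is an illustrative assumption, not the paper's formula.

```python
import torch

# Teacher update: momentum grows toward 1 when the student loss is large
# (trust the student less), and approaches the base momentum when the loss is small.

@torch.no_grad()
def loss_aware_ema_update(teacher, student, student_loss, base_momentum=0.99):
    momentum = base_momentum + (1 - base_momentum) * float(torch.tanh(student_loss))
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1 - momentum)
```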
[169] Decoupling What to Count and Where to See for Referring Expression Counting
Yuda Zou, Zijian Zhang, Yongchao Xu
Main category: cs.CV
TL;DR: W2-Net addresses the challenge in Referring Expression Counting (REC) by decoupling the problem into “what to count” and “where to see” using dual-query mechanism and Subclass Separable Matching, achieving significant improvements in counting accuracy and localization.
Details
Motivation: Current REC methods face a fundamental challenge where annotation points are placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for "walking").Method: Proposes W2-Net with dual-query mechanism: what-to-count (w2c) queries for object localization and where-to-see (w2s) queries for extracting features from attribute-specific visual regions. Also introduces Subclass Separable Matching (SSM) with repulsive force to enhance inter-subclass separability during label assignment.
Result: Significantly outperforms state-of-the-art on REC-8K dataset: reduces counting error by 22.5% (validation) and 18.0% (test), and improves localization F1 by 7% and 8% respectively.
Conclusion: W2-Net effectively addresses the overlooked challenge in REC by explicitly decoupling object counting into what and where components, enabling better attribute understanding and subclass discrimination through the dual-query mechanism and SSM strategy.
Abstract: Referring Expression Counting (REC) extends class-level object counting to the fine-grained subclass-level, aiming to enumerate objects matching a textual expression that specifies both the class and distinguishing attribute. A fundamental challenge, however, has been overlooked: annotation points are typically placed on class-representative locations (e.g., heads), forcing models to focus on class-level features while neglecting attribute information from other visual regions (e.g., legs for “walking”). To address this, we propose W2-Net, a novel framework that explicitly decouples the problem into “what to count” and “where to see” via a dual-query mechanism. Specifically, alongside the standard what-to-count (w2c) queries that localize the object, we introduce dedicated where-to-see (w2s) queries. The w2s queries are guided to seek and extract features from attribute-specific visual regions, enabling precise subclass discrimination. Furthermore, we introduce Subclass Separable Matching (SSM), a novel matching strategy that incorporates a repulsive force to enhance inter-subclass separability during label assignment. W2-Net significantly outperforms the state-of-the-art on the REC-8K dataset, reducing counting error by 22.5% (validation) and 18.0% (test), and improving localization F1 by 7% and 8%, respectively. Code will be available.
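As a rough sketch of label assignment with a repulsive term, the snippet below adds to the localization cost a penalty that grows when a prediction is attracted to the wrong subclass, then solves the assignment with the Hungarian algorithm; the cost terms and weights are illustrative assumptions, not the paper's SSM formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_with_repulsion(loc_cost, subclass_sim, gt_subclass, lam=1.0):
    """loc_cost: (P, G) localization cost; subclass_sim: (P, S) prediction-to-subclass
    similarity; gt_subclass: (G,) subclass index of each ground-truth point."""
    P, G = loc_cost.shape
    repulsion = np.zeros((P, G))
    for g in range(G):
        other = np.delete(subclass_sim, gt_subclass[g], axis=1)  # similarity to *other* subclasses
        repulsion[:, g] = other.max(axis=1)                      # penalize attraction to wrong subclasses
    cost = loc_cost + lam * repulsion
    return linear_sum_assignment(cost)                           # (pred_indices, gt_indices)
```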
[170] Stroke Lesion Segmentation in Clinical Workflows: A Modular, Lightweight, and Deployment-Ready Tool
Yann Kerverdo, Florent Leray, Youwan Mahé, Stéphanie Leplaideur, Francesca Galassi
Main category: cs.CV
TL;DR: StrokeSeg is a lightweight, modular framework that converts research-grade stroke lesion segmentation models into deployable clinical applications with minimal performance loss.
Details
Motivation: Current deep learning frameworks like nnU-Net achieve state-of-the-art performance but are difficult to deploy clinically due to heavy dependencies and monolithic design.Method: Decouples preprocessing, inference, and postprocessing; uses Anima toolbox for preprocessing, ONNX Runtime with Float16 quantization for inference (reducing model size by ~50%), and provides both GUI and CLI interfaces.
Result: On 300 stroke subjects, segmentation performance was equivalent to original PyTorch pipeline (Dice difference <10^{-3}), demonstrating successful transformation of research pipelines into clinical tools.
Conclusion: High-performing research pipelines can be effectively transformed into portable, clinically usable tools without sacrificing performance.
Abstract: Deep learning frameworks such as nnU-Net achieve state-of-the-art performance in brain lesion segmentation but remain difficult to deploy clinically due to heavy dependencies and monolithic design. We introduce StrokeSeg, a modular and lightweight framework that translates research-grade stroke lesion segmentation models into deployable applications. Preprocessing, inference, and postprocessing are decoupled: preprocessing relies on the Anima toolbox with BIDS-compliant outputs, and inference uses ONNX Runtime with Float16 quantisation, reducing model size by about 50%. StrokeSeg provides both graphical and command-line interfaces and is distributed as Python scripts and as a standalone Windows executable. On a held-out set of 300 sub-acute and chronic stroke subjects, segmentation performance was equivalent to the original PyTorch pipeline (Dice difference < 10^{-3}), demonstrating that high-performing research pipelines can be transformed into portable, clinically usable tools.
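A minimal sketch of the inference side with ONNX Runtime is shown below, assuming an exported Float16 model with a single input and a single probability-map output; the file name, tensor shape, and threshold are placeholders, not values from StrokeSeg.

```python
import numpy as np
import onnxruntime as ort

# Run a Float16 ONNX segmentation model on a preprocessed volume. A Float16 model
# expects float16 inputs and is roughly half the size of its Float32 counterpart.

session = ort.InferenceSession("stroke_seg_fp16.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

volume = np.random.rand(1, 1, 192, 224, 192).astype(np.float16)   # placeholder preprocessed MRI
(pred,) = session.run(None, {input_name: volume})                 # assumes a single output
mask = (pred > 0.5).astype(np.uint8)                              # binary lesion mask
```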
[171] A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
Zhuangfan Huang, Xiaosong Li, Gao Wang, Tao Ye, Haishu Tan, Huafeng Li
Main category: cs.CV
TL;DR: Proposes MLSN, a luminance-aware multi-scale network for polarization image fusion that addresses contrast differences and complex lighting through dynamic luminance injection, global-local feature fusion, and brightness enhancement.
Details
Motivation: Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties, with applications in camouflage recognition, tissue pathology analysis, and surface defect detection. The challenge is integrating complementary information from different polarized images in complex luminance environments.Method: Uses a multi-scale spatial weight matrix with brightness-branch to dynamically inject luminance into feature maps, global-local feature fusion with windowed self-attention at bottleneck layer, and Brightness-Enhancement module for nonlinear luminance correction in decoder stage.
Result: Outperforms state-of-the-art methods on the MSP, PIF, and GAND datasets, with MS-SSIM and SD metrics higher than the average of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. Also introduces the MSP dataset with 1000 pairs of polarized images covering 17 types of complex lighting scenes.
Conclusion: MLSN effectively addresses contrast differences in polarized images and adapts to complex lighting conditions, demonstrating superior performance in both subjective and objective evaluations while providing a valuable dataset for polarization image fusion research.
Abstract: Polarization image fusion combines S0 and DOLP images to reveal surface roughness and material properties through complementary texture features, which has important applications in camouflage recognition, tissue pathology analysis, surface defect detection, and other fields. To integrate complementary information from different polarized images in complex luminance environments, we propose a luminance-aware multi-scale network (MLSN). In the encoder stage, we propose a multi-scale spatial weight matrix through a brightness branch, which dynamically injects weighted luminance into the feature maps, solving the problem of inherent contrast differences in polarized images. A global-local feature fusion mechanism is designed at the bottleneck layer to perform windowed self-attention computation, balancing global context and local details through residual linking in the feature-dimension restructuring stage. In the decoder stage, to further improve adaptability to complex lighting, we propose a Brightness-Enhancement module that establishes the mapping relationship between luminance distribution and texture features, realizing nonlinear luminance correction of the fusion result. We also present MSP, a dataset of 1000 pairs of polarized images covering 17 types of indoor and outdoor complex lighting scenes. MSP provides four-direction polarization raw maps, addressing the scarcity of high-quality datasets in polarization image fusion. Extensive experiments on the MSP, PIF, and GAND datasets verify that the proposed MLSN outperforms the state-of-the-art methods in subjective and objective evaluations, with MS-SSIM and SD metrics higher than the average values of other methods by 8.57%, 60.64%, 10.26%, 63.53%, 22.21%, and 54.31%, respectively. The source code and dataset are available at https://github.com/1hzf/MLS-UNet.
[172] When are radiology reports useful for training medical image classifiers?
Herman Bergström, Zhongqi Yue, Fredrik D. Johansson
Main category: cs.CV
TL;DR: This paper systematically studies when and how radiology reports can improve medical image classification, finding that text-based pre-training helps only when labels are well-represented in text, while fine-tuning with reports provides significant benefits across various settings.
Details
Motivation: Medical images often come with radiology reports containing expert annotations, but using these reports as inputs requires manual radiologist work. The research aims to determine when radiology reports can be leveraged during training to improve image-only classification, addressing gaps in prior work that focused only on diagnostic labels strongly associated with text.Method: The authors conduct a systematic study examining how radiology reports can be used during both pre-training and fine-tuning phases, across diagnostic and prognostic tasks (like 12-month readmission), and under varying training set sizes. They compare different approaches for leveraging text data.
Result: Key findings show that: (1) Using reports during pre-training helps downstream tasks where labels are well-represented in text, but explicit image-text alignment can be detrimental when labels aren’t well-represented; (2) Fine-tuning with reports leads to significant improvements and can have larger impact than pre-training methods in certain settings.
Conclusion: The study provides actionable insights into when and how to leverage privileged text data for training medical image classifiers, highlighting that report-based fine-tuning often provides substantial benefits while pre-training approaches need careful consideration based on label-text alignment.
Abstract: Medical images used to train machine learning models are often accompanied by radiology reports containing rich expert annotations. However, relying on these reports as inputs for clinical prediction requires the timely manual work of a trained radiologist. This raises a natural question: when can radiology reports be leveraged during training to improve image-only classification? Prior works are limited to evaluating pre-trained image representations by fine-tuning them to predict diagnostic labels, often extracted from reports, ignoring tasks with labels that are weakly associated with the text. To address this gap, we conduct a systematic study of how radiology reports can be used during both pre-training and fine-tuning, across diagnostic and prognostic tasks (e.g., 12-month readmission), and under varying training set sizes. Our findings reveal that: (1) Leveraging reports during pre-training is beneficial for downstream classification tasks where the label is well-represented in the text; however, pre-training through explicit image-text alignment can be detrimental in settings where it’s not; (2) Fine-tuning with reports can lead to significant improvements and even have a larger impact than the pre-training method in certain settings. These results provide actionable insights into when and how to leverage privileged text data to train medical image classifiers while highlighting gaps in current research.
[173] Unsupervised Detection of Post-Stroke Brain Abnormalities
Youwan Mahé, Elise Bannier, Stéphanie Leplaideur, Elisa Fromont, Francesca Galassi
Main category: cs.CV
TL;DR: REFLECT, a flow-based generative model, detects both focal and non-lesional abnormalities in post-stroke MRI better when trained on healthy controls rather than lesion-free slices from stroke patients.
Details
Motivation: Post-stroke MRI shows secondary structural changes like atrophy and ventricular enlargement that are poorly captured by supervised segmentation methods but are important biomarkers for recovery and outcome.Method: Used REFLECT, a flow-based generative model, for unsupervised detection of abnormalities. Trained two models: one on lesion-free slices from stroke patients (ATLAS) and another on healthy controls (IXI). Evaluated with dual-expert annotations and Free-Response ROC analysis.
Result: The IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43) compared to the ATLAS-trained model.
Conclusion: Training on fully healthy anatomy improves modeling of normal variability, enabling broader and more reliable detection of structural abnormalities in post-stroke patients.
Abstract: Post-stroke MRI not only delineates focal lesions but also reveals secondary structural changes, such as atrophy and ventricular enlargement. These abnormalities, increasingly recognised as imaging biomarkers of recovery and outcome, remain poorly captured by supervised segmentation methods. We evaluate REFLECT, a flow-based generative model, for unsupervised detection of both focal and non-lesional abnormalities in post-stroke patients. Using dual-expert central-slice annotations on ATLAS data, performance was assessed at the object level with Free-Response ROC analysis for anomaly maps. Two models were trained on lesion-free slices from stroke patients (ATLAS) and on healthy controls (IXI) to test the effect of training data. On ATLAS test subjects, the IXI-trained model achieved higher lesion segmentation (Dice = 0.37 vs 0.27) and improved sensitivity to non-lesional abnormalities (FROC = 0.62 vs 0.43). Training on fully healthy anatomy improves the modelling of normal variability, enabling broader and more reliable detection of structural abnormalities.
[174] GenTrack: A New Generation of Multi-Object Tracking
Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen
Main category: cs.CV
TL;DR: GenTrack is a novel multi-object tracking method that uses hybrid stochastic-deterministic tracking with PSO optimization and social interactions to handle varying numbers of targets, maintain ID consistency, and work effectively with weak detectors.
Details
Motivation: To address challenges in multi-object tracking including handling unknown and time-varying numbers of targets, maintaining target identity consistency, managing nonlinear dynamics, and working effectively with weak and noisy object detectors.Method: Hybrid tracking approach combining stochastic and deterministic methods, using particle swarm optimization (PSO) with fitness measures, integrating social interactions among targets, and implementing comprehensive state and observation models with space consistency, appearance, detection confidence, track penalties, and social scores.
Result: Superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with reduced ID switches and track loss especially during occlusions.
Conclusion: GenTrack provides an effective solution for robust multi-object tracking with publicly available source code implementations (GenTrack Basic, PSO, and PSO-Social variants) that facilitate flexible reimplementation and fair comparisons.
Abstract: This paper introduces a novel multi-object tracking (MOT) method, dubbed GenTrack, whose main contributions include: a hybrid tracking approach employing both stochastic and deterministic manners to robustly handle unknown and time-varying numbers of targets, particularly in maintaining target identity (ID) consistency and managing nonlinear dynamics, leveraging particle swarm optimization (PSO) with some proposed fitness measures to guide stochastic particles toward their target distribution modes, enabling effective tracking even with weak and noisy object detectors, integration of social interactions among targets to enhance PSO-guided particles as well as improve continuous updates of both strong (matched) and weak (unmatched) tracks, thereby reducing ID switches and track loss, especially during occlusions, a GenTrack-based redefined visual MOT baseline incorporating a comprehensive state and observation model based on space consistency, appearance, detection confidence, track penalties, and social scores for systematic and efficient target updates, and the first-ever publicly available source-code reference implementation with minimal dependencies, featuring three variants, including GenTrack Basic, PSO, and PSO-Social, facilitating flexible reimplementation. Experimental results have shown that GenTrack provides superior performance on standard benchmarks and real-world scenarios compared to state-of-the-art trackers, with integrated implementations of baselines for fair comparison. Potential directions for future work are also discussed. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack
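The PSO guidance step can be sketched generically: particles are pulled toward their personal bests and the global best under a fitness function that combines motion, appearance, and social cues. The inertia and acceleration coefficients below are generic PSO defaults and `fitness` is a placeholder, not the paper's exact measures.

```python
import numpy as np

# One PSO iteration over particle states (e.g., candidate target positions).
# Higher fitness is assumed to be better.

def pso_step(positions, velocities, personal_best, global_best, fitness,
             w=0.7, c1=1.5, c2=1.5):
    n, d = positions.shape
    r1, r2 = np.random.rand(n, d), np.random.rand(n, d)
    velocities = (w * velocities
                  + c1 * r1 * (personal_best - positions)
                  + c2 * r2 * (global_best - positions))
    positions = positions + velocities
    # refresh personal / global bests under the fitness measure
    scores = np.array([fitness(p) for p in positions])
    best_scores = np.array([fitness(p) for p in personal_best])
    improved = scores > best_scores
    personal_best[improved] = positions[improved]
    if scores.max() > fitness(global_best):
        global_best = positions[scores.argmax()].copy()
    return positions, velocities, personal_best, global_best
```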
[175] A Hybrid Approach for Visual Multi-Object Tracking
Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen
Main category: cs.CV
TL;DR: A visual multi-object tracking method combining stochastic particle filtering with deterministic association to maintain identifier consistency for varying numbers of targets under nonlinear dynamics.
Details
Motivation: To address challenges in multi-object tracking including unknown and time-varying target numbers, nonlinear dynamics, and identity preservation during interactions and occlusions.Method: Uses stochastic particle filter with PSO optimization for nonlinear dynamics, deterministic association with custom cost matrix, smooth state updating scheme, and velocity regression from past states.
Result: Superior performance compared to state-of-the-art trackers, with flexible operation for both pre-recorded videos and live camera streams.
Conclusion: The proposed hybrid stochastic-deterministic approach effectively maintains identifier consistency and handles complex tracking scenarios, demonstrating improved performance over existing methods.
Abstract: This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. The source-code reference implementations of both the proposed method and compared-trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2
[176] 50 Years of Water Body Monitoring: The Case of Qaraaoun Reservoir, Lebanon
Ali Ahmad Faour, Nabil Amacha, Ali J. Ghandour
Main category: cs.CV
TL;DR: A sensor-free approach using satellite imagery and machine learning to monitor reservoir volume in Lebanon, achieving high accuracy without ground measurements.
Details
Motivation: Sustainable management of Qaraaoun Reservoir requires reliable monitoring despite sensor malfunctions and limited maintenance capacity.Method: Integrates Sentinel-2 and Landsat imagery with a new water segmentation index and Support Vector Regression (SVR) model trained on bathymetry data.
Result: 95% shoreline alignment with ground truth, SVR error under 1.5% of full capacity, R² > 0.98.
Conclusion: Robust, cost-effective solution for continuous reservoir monitoring that can be replicated for other water bodies and provides valuable long-term data.
Abstract: The sustainable management of the Qaraaoun Reservoir, the largest surface water body in Lebanon located in the Bekaa Plain, depends on reliable monitoring of its storage volume despite frequent sensor malfunctions and limited maintenance capacity. This study introduces a sensor-free approach that integrates open-source satellite imagery, advanced water-extent segmentation, and machine learning to estimate the reservoir surface area and volume in near real time. Sentinel-2 and Landsat images are processed, where surface water is delineated using a newly proposed water segmentation index. A machine learning model based on Support Vector Regression (SVR) is trained on a curated dataset that includes water surface area, water level, and water volume calculations using a reservoir bathymetry survey. The model is then able to estimate reservoir volume relying solely on surface area extracted from satellite imagery, without the need for ground measurements. Water segmentation using the proposed index aligns with ground truth for more than 95 percent of the shoreline. Hyperparameter tuning with GridSearchCV yields an optimized SVR performance with error under 1.5 percent of full reservoir capacity and coefficients of determination exceeding 0.98. These results demonstrate the robustness and cost-effectiveness of the method, offering a practical solution for continuous, sensor-independent monitoring of reservoir storage. The proposed methodology can be replicated for other water bodies, and the resulting 50 years of time-series data is valuable for research on climate change and environmental patterns.
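The surface-area-to-volume regression step is a standard SVR fit with grid search; a minimal sketch with synthetic data (standing in for the bathymetry-derived training pairs, which are not available here) looks like this:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Illustrative only: synthetic (surface area -> volume) pairs standing in for the
# bathymetry-derived training data used in the paper.
rng = np.random.default_rng(42)
surface_km2 = rng.uniform(2.0, 12.0, size=200).reshape(-1, 1)
volume_mcm = 20.0 * surface_km2.ravel() ** 1.3 + rng.normal(0, 5.0, 200)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
param_grid = {
    "svr__C": [1, 10, 100, 1000],
    "svr__gamma": ["scale", 0.1, 1.0],
    "svr__epsilon": [0.1, 1.0],
}
search = GridSearchCV(model, param_grid, cv=5, scoring="r2")
search.fit(surface_km2, volume_mcm)

print(search.best_params_, search.best_score_)
# Predict volume from a satellite-derived surface area alone.
print(search.predict([[7.5]]))
```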
[177] XAI Evaluation Framework for Semantic Segmentation
Reem Hammoud, Abdul karim Gizzini, Ali J. Ghandour
Main category: cs.CV
TL;DR: A comprehensive evaluation framework for explainable AI in semantic segmentation that addresses spatial and contextual complexities using pixel-level evaluation strategies.
Details
Motivation: The need for transparent and trustworthy AI models in safety-critical domains, coupled with the lack of specialized evaluation methods for semantic segmentation explainability compared to classification tasks.Method: Developed a systematic evaluation framework with pixel-level evaluation strategies and carefully designed metrics to assess XAI methods in semantic segmentation, specifically accounting for spatial and contextual complexities.
Result: Simulation results using CAM-based XAI schemes demonstrated the efficiency, robustness, and reliability of the proposed methodology.
Conclusion: The framework advances transparent, trustworthy, and accountable semantic segmentation models by providing comprehensive evaluation capabilities for XAI methods in this domain.
Abstract: Ensuring transparency and trust in artificial intelligence (AI) models is essential, particularly as they are increasingly applied in safety-critical and high-stakes domains. Explainable AI (XAI) has emerged as a promising approach to address this challenge, yet the rigorous evaluation of XAI methods remains crucial for optimizing the trade-offs between model complexity, predictive performance, and interpretability. While extensive progress has been achieved in evaluating XAI techniques for classification tasks, evaluation strategies tailored to semantic segmentation remain relatively underexplored. This work introduces a comprehensive and systematic evaluation framework specifically designed for assessing XAI in semantic segmentation, explicitly accounting for both spatial and contextual task complexities. The framework employs pixel-level evaluation strategies and carefully designed metrics to provide fine-grained interpretability insights. Simulation results using recently adapted class activation mapping (CAM)-based XAI schemes demonstrate the efficiency, robustness, and reliability of the proposed methodology. These findings contribute to advancing transparent, trustworthy, and accountable semantic segmentation models.
[178] Deeply-Conditioned Image Compression via Self-Generated Priors
Zhineng Zhao, Zhihai He, Zikun Zhou, Siwei Ma, Yaowei Wang
Main category: cs.CV
TL;DR: DCIC-sgp introduces a functional decomposition framework for learned image compression that uses self-generated priors to separate structural information from texture details, achieving better rate-distortion performance and reducing geometric deformation at low bitrates.
Details
Motivation: Current learned image compression methods struggle to model complex correlation structures in natural images, particularly the entanglement of global structures with local textures, leading to severe geometric deformation at low bitrates.Method: The framework first encodes a self-generated prior to capture the image’s structural backbone, then uses this prior to holistically modulate the entire compression pipeline, especially the analysis transform, allowing it to focus on residual high-entropy details.
Result: The method significantly mitigates geometric deformation artifacts at low bitrates and achieves competitive performance with BD-rate reductions of 14.4%, 15.7%, and 15.1% against VVC test model VTM-12.1 on Kodak, CLIC, and Tecnick datasets.
Conclusion: The hierarchical, dependency-driven approach effectively disentangles information streams and establishes highly competitive performance in learned image compression.
Abstract: Learned image compression (LIC) has shown great promise for achieving high rate-distortion performance. However, current LIC methods are often limited in their capability to model the complex correlation structures inherent in natural images, particularly the entanglement of invariant global structures with transient local textures within a single monolithic representation. This limitation precipitates severe geometric deformation at low bitrates. To address this, we introduce a framework predicated on functional decomposition, which we term Deeply-Conditioned Image Compression via self-generated priors (DCIC-sgp). Our central idea is to first encode a potent, self-generated prior to encapsulate the image’s structural backbone. This prior is subsequently utilized not as mere side-information, but to holistically modulate the entire compression pipeline. This deep conditioning, most critically of the analysis transform, liberates it to dedicate its representational capacity to the residual, high-entropy details. This hierarchical, dependency-driven approach achieves an effective disentanglement of information streams. Our extensive experiments validate this assertion; visual analysis demonstrates that our method substantially mitigates the geometric deformation artifacts that plague conventional codecs at low bitrates. Quantitatively, our framework establishes highly competitive performance, achieving significant BD-rate reductions of 14.4%, 15.7%, and 15.1% against the VVC test model VTM-12.1 on the Kodak, CLIC, and Tecnick datasets.
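The abstract describes the self-generated prior as holistically modulating the compression pipeline but does not give the exact mechanism; one common way to implement such deep conditioning is FiLM-style feature modulation, sketched below purely as an assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PriorModulatedBlock(nn.Module):
    """Hypothetical sketch: a conv block whose features are scaled and shifted by
    a prior embedding (FiLM-style). DCIC-sgp's actual conditioning may differ."""
    def __init__(self, channels, prior_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_scale = nn.Linear(prior_dim, channels)
        self.to_shift = nn.Linear(prior_dim, channels)

    def forward(self, x, prior):
        h = torch.relu(self.conv(x))
        scale = self.to_scale(prior).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        shift = self.to_shift(prior).unsqueeze(-1).unsqueeze(-1)
        return h * (1 + scale) + shift

# Usage: an analysis-transform feature map conditioned on a 64-d prior vector.
block = PriorModulatedBlock(channels=32, prior_dim=64)
x = torch.randn(2, 32, 16, 16)
prior = torch.randn(2, 64)
print(block(x, prior).shape)  # torch.Size([2, 32, 16, 16])
```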
[179] Rethinking Visual Intelligence: Insights from Video Pretraining
Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
Main category: cs.CV
TL;DR: Video Diffusion Models (VDMs) pretrained on spatiotemporal data show stronger inductive biases for visual tasks compared to LLMs, achieving higher data efficiency across multiple benchmarks.
Details
Motivation: LLMs have succeeded in language tasks but struggle with visual domain challenges like compositional understanding and sample efficiency. The paper investigates whether VDMs can bridge this gap by leveraging spatiotemporal pretraining.Method: Used controlled evaluation where both pretrained LLMs and VDMs were equipped with lightweight adapters and tested on tasks in their natural modalities across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata.
Result: VDMs demonstrated higher data efficiency than LLMs across all benchmarks, showing better performance in visual tasks with less supervision.
Conclusion: Video pretraining provides strong inductive biases that support progress toward visual foundation models, making VDMs a promising direction for visual intelligence.
Abstract: Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
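The lightweight adapters used in the controlled evaluation are not specified here; a generic residual bottleneck adapter, shown below as an assumption, is one standard choice for adapting a frozen backbone with few trainable parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAdapter(nn.Module):
    """Generic residual bottleneck adapter (down-project, GELU, up-project,
    residual add). The adapters in the paper may differ; this is illustrative."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as a near-identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden):
        return hidden + self.up(F.gelu(self.down(hidden)))

# Usage: adapt frozen backbone features of width 768.
adapter = BottleneckAdapter(dim=768)
features = torch.randn(4, 196, 768)   # (batch, tokens, dim)
print(adapter(features).shape)
```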
[180] A Critical Study towards the Detection of Parkinsons Disease using ML Technologies
Vivek Chetia, Abdul Taher Khan, Rahish Gogoi, David Kapsian Khual, Purnendu Bikash, Sajal Saha
Main category: cs.CV
TL;DR: Deep learning approach for detecting and segmenting three tea leaf diseases (Red Rust, Helopeltis, Red spider mite) using object detection models (SSD MobileNet V2, Faster R-CNN) and instance segmentation (Mask R-CNN) with custom damage area calculation.
Details
Motivation: To automatically classify three types of tea leaf diseases (two pest-related, one pathogen/environment-related) and quantify the damaged area on leaves for agricultural monitoring.Method: Evaluated SSD MobileNet V2 and Faster R-CNN ResNet50 V1 for object detection, and Mask R-CNN for instance segmentation with custom method to calculate damaged leaf portions.
Result: Faster R-CNN ResNet50 V1 achieved better performance with 25% mAP vs. SSD’s 20.9% mAP. Both models showed low precision (0.252 vs. 0.209) and recall (0.044 vs. 0.02) over the IoU 0.50:0.95 range.
Conclusion: Faster R-CNN outperformed SSD MobileNet for tea leaf disease detection, and Mask R-CNN with custom segmentation method enables damaged area quantification, though overall performance metrics indicate room for improvement.
Abstract: The proposed solution is a deep learning technique that can classify three types of tea leaf diseases, two caused by pests and one by pathogens (infectious organisms) and environmental conditions, and can also show the leaf area damaged by each disease: Red Rust, Helopeltis, and red spider mite, respectively. In this paper we evaluated two models, SSD MobileNet V2 and Faster R-CNN ResNet50 V1, for object detection. SSD MobileNet V2 gave a precision of 0.209 over the IoU range 0.50:0.95, with a recall of 0.02 over IoU 0.50:0.95 and a final mAP of 20.9%. Faster R-CNN ResNet50 V1 achieved a precision of 0.252 over the IoU range 0.50:0.95 and a recall of 0.044 over IoU 0.50:0.95, with an mAP of 25%, which is better than SSD. We also used Mask R-CNN for object instance segmentation, where we implemented a custom method to calculate the diseased (damaged) portion of leaves. Keywords: Tea Leaf Disease, Deep Learning, Red Rust, Helopeltis and Red Spider Mite, SSD MobileNet V2, Faster R-CNN ResNet50 V1 and Mask RCNN.
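The precision and recall figures above are averaged over IoU thresholds from 0.50 to 0.95; the underlying intersection-over-union between a predicted and a ground-truth box is the standard quantity sketched below.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as a true positive at threshold t if iou(pred, gt) >= t;
# COCO-style mAP averages precision over t in {0.50, 0.55, ..., 0.95}.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```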
[181] Kineo: Calibration-Free Metric Motion Capture From Sparse RGB Cameras
Charles Javerliat, Pierre Raimbaud, Guillaume Lavoué
Main category: cs.CV
TL;DR: Kineo is a fully automatic, calibration-free pipeline for markerless motion capture from unsynchronized, uncalibrated consumer RGB cameras that simultaneously calibrates cameras and reconstructs 3D keypoints at metric scale with high accuracy and efficiency.
Details
Motivation: Markerless multiview motion capture is constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches suffer from high computational cost and reduced reconstruction accuracy.Method: Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras (including distortion coefficients) and reconstruct 3D keypoints and dense scene point maps. It uses confidence-driven spatio-temporal keypoint sampling with graph-based global optimization for robust calibration at fixed computational cost, plus a pairwise reprojection consensus score to quantify 3D reconstruction reliability.
Result: On EgoHumans and Human3.6M datasets, Kineo reduces camera translation error by 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91% compared to prior state-of-the-art calibration-free methods. It processes multi-view sequences faster than their duration in some configurations (e.g., 36min for 1h20min footage).
Conclusion: Kineo provides substantial improvements over prior calibration-free methods in accuracy and efficiency, making markerless motion capture more accessible for non-experts and in-the-wild scenarios. The pipeline is openly released to promote reproducibility and practical adoption.
Abstract: Markerless multiview motion capture is often constrained by the need for precise camera calibration, limiting accessibility for non-experts and in-the-wild captures. Existing calibration-free approaches mitigate this requirement but suffer from high computational cost and reduced reconstruction accuracy. We present Kineo, a fully automatic, calibration-free pipeline for markerless motion capture from videos captured by unsynchronized, uncalibrated, consumer-grade RGB cameras. Kineo leverages 2D keypoints from off-the-shelf detectors to simultaneously calibrate cameras, including Brown-Conrady distortion coefficients, and reconstruct 3D keypoints and dense scene point maps at metric scale. A confidence-driven spatio-temporal keypoint sampling strategy, combined with graph-based global optimization, ensures robust calibration at a fixed computational cost independent of sequence length. We further introduce a pairwise reprojection consensus score to quantify 3D reconstruction reliability for downstream tasks. Evaluations on EgoHumans and Human3.6M demonstrate substantial improvements over prior calibration-free methods. Compared to previous state-of-the-art approaches, Kineo reduces camera translation error by approximately 83-85%, camera angular error by 86-92%, and world mean-per-joint error (W-MPJPE) by 83-91%. Kineo is also efficient in real-world scenarios, processing multi-view sequences faster than their duration in specific configurations (e.g., 36min to process 1h20min of footage). The full pipeline and evaluation code are openly released to promote reproducibility and practical adoption at https://liris-xr.github.io/kineo/.
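The pairwise reprojection consensus score is not defined in detail here; its basic ingredient, reprojecting a triangulated 3D point into a pinhole camera and measuring the pixel error against the detected keypoint, is sketched below.

```python
import numpy as np

def project(point_3d, K, R, t):
    """Project a 3D point into a pinhole camera with intrinsics K and pose (R, t)."""
    p_cam = R @ point_3d + t
    p_img = K @ p_cam
    return p_img[:2] / p_img[2]

def reprojection_error(point_3d, keypoint_2d, K, R, t):
    return np.linalg.norm(project(point_3d, K, R, t) - keypoint_2d)

# Toy example: a point 3 m in front of a camera with 1000 px focal length.
K = np.array([[1000.0, 0, 640], [0, 1000.0, 360], [0, 0, 1]])
R, t = np.eye(3), np.zeros(3)
X = np.array([0.3, -0.1, 3.0])
observed = project(X, K, R, t) + np.array([1.0, -0.5])  # ~1 px detection noise
print(reprojection_error(X, observed, K, R, t))
```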
[182] Decoupled MeanFlow: Turning Flow Models into Flow Maps for Accelerated Sampling
Kyungmin Lee, Sihyun Yu, Jinwoo Shin
Main category: cs.CV
TL;DR: Decoupled MeanFlow converts pretrained flow models into flow map models without architectural changes, enabling high-quality image generation in 1-4 steps with over 100x faster inference.
Details
Motivation: Denoising generative models require many steps due to discretization error, and existing flow map approaches need architectural modifications that limit compatibility with pretrained models.Method: Conditions final blocks of diffusion transformers on subsequent timesteps to repurpose pretrained flow models as flow maps, combined with enhanced training techniques.
Result: Achieves 1-step FID of 2.16 on ImageNet 256x256 and 2.12 on 512x512, surpassing prior art. With 4 steps, achieves FID of 1.51 and 1.68, nearly matching flow model performance.
Conclusion: Training flow models first then converting them is more efficient than training flow maps from scratch, enabling fast high-quality generation with existing pretrained models.
Abstract: Denoising generative models, such as diffusion and flow-based models, produce high-quality samples but require many denoising steps due to discretization error. Flow maps, which estimate the average velocity between timesteps, mitigate this error and enable faster sampling. However, their training typically demands architectural changes that limit compatibility with pretrained flow models. We introduce Decoupled MeanFlow, a simple decoding strategy that converts flow models into flow map models without architectural modifications. Our method conditions the final blocks of diffusion transformers on the subsequent timestep, allowing pretrained flow models to be directly repurposed as flow maps. Combined with enhanced training techniques, this design enables high-quality generation in as few as 1 to 4 steps. Notably, we find that training flow models and subsequently converting them is more efficient and effective than training flow maps from scratch. On ImageNet 256x256 and 512x512, our models attain 1-step FID of 2.16 and 2.12, respectively, surpassing prior art by a large margin. Furthermore, we achieve FID of 1.51 and 1.68 when increasing the steps to 4, which nearly matches the performance of flow models while delivering over 100x faster inference.
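Since a flow map predicts the average velocity between two timesteps, sampling collapses to a handful of large jumps. The sketch below shows the generic few-step update; the network call and the timestep convention are placeholders, not the paper's exact formulation.

```python
import torch

def flow_map_sample(model, x_start, timesteps):
    """Generic few-step sampling with a flow map u(x_t, t, s) that predicts the
    average velocity from time t to time s. `model` is a stub here; Decoupled
    MeanFlow obtains it by conditioning a pretrained flow model's final blocks
    on the target timestep s."""
    x = x_start
    for t, s in zip(timesteps[:-1], timesteps[1:]):
        u = model(x, t, s)          # average velocity over [t, s]
        x = x + (s - t) * u         # one large jump instead of many small steps
    return x

# Usage with a dummy "model" (placeholder velocity) and 4 steps from t=1 to t=0.
dummy = lambda x, t, s: -x
x_noise = torch.randn(2, 3, 8, 8)
ts = torch.linspace(1.0, 0.0, steps=5)   # 4 sampling steps
print(flow_map_sample(dummy, x_noise, ts).shape)
```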
[183] Fast and accurate neural reflectance transformation imaging through knowledge distillation
Tinsae G. Dulecha, Leonardo Righetto, Ruggero Pintus, Enrico Gobbetti, Andrea Giachetti
Main category: cs.CV
TL;DR: Proposes DisK-NeuralRTI, a knowledge distillation approach to reduce computational cost of NeuralRTI while maintaining quality for reflectance transformation imaging.
Details
Motivation: Traditional RTI methods like PTM and HSH have artifacts with complex reflectance, while NeuralRTI provides superior quality but is computationally expensive for interactive relighting on limited hardware.Method: Uses knowledge distillation to compress NeuralRTI’s large decoder networks into smaller, more efficient networks while preserving quality.
Result: Achieves significant computational cost reduction while maintaining comparable visual quality to the original NeuralRTI approach.
Conclusion: DisK-NeuralRTI enables practical interactive relighting of large RTI images on limited hardware by reducing computational requirements through knowledge distillation.
Abstract: Reflectance Transformation Imaging (RTI) is very popular for its ability to visually analyze surfaces by enhancing surface details through interactive relighting, starting from only a few tens of photographs taken with a fixed camera and variable illumination. Traditional methods like Polynomial Texture Maps (PTM) and Hemispherical Harmonics (HSH) are compact and fast, but struggle to accurately capture complex reflectance fields using few per-pixel coefficients and fixed bases, leading to artifacts, especially in highly reflective or shadowed areas. The NeuralRTI approach, which exploits a neural autoencoder to learn a compact function that better approximates the local reflectance as a function of light directions, has been shown to produce superior quality at comparable storage cost. However, as it performs interactive relighting with custom decoder networks with many parameters, the rendering step is computationally expensive and not feasible at full resolution for large images on limited hardware. Earlier attempts to reduce costs by directly training smaller networks have failed to produce valid results. For this reason, we propose to reduce its computational cost through a novel solution based on Knowledge Distillation (DisK-NeuralRTI). …
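The distillation setup is only described at a high level here; a minimal sketch, assuming a per-pixel relighting decoder that maps a latent code plus a light direction to RGB, would train a small student to match the frozen teacher's output:

```python
import torch
import torch.nn as nn

def make_decoder(in_dim, hidden, out_dim=3):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

# Hypothetical sizes: 9-D per-pixel latent code + 2-D light direction -> RGB.
teacher = make_decoder(9 + 2, hidden=256)   # large pretrained decoder (frozen)
student = make_decoder(9 + 2, hidden=32)    # compact decoder to be distilled
teacher.eval()

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
codes = torch.randn(1024, 9)                # per-pixel latent codes (placeholder)
for step in range(100):
    lights = torch.randn(1024, 2)           # sampled light directions
    inp = torch.cat([codes, lights], dim=1)
    with torch.no_grad():
        target = teacher(inp)               # teacher's relit RGB
    loss = nn.functional.mse_loss(student(inp), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```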
[184] Latent Sketchpad: Sketching Visual Thoughts to Elicit Multimodal Reasoning in MLLMs
Huanyu Zhang, Wenshan Wu, Chengzu Li, Ning Shang, Yan Xia, Yangyu Huang, Yifan Zhang, Li Dong, Zhang Zhang, Liang Wang, Tieniu Tan, Furu Wei
Main category: cs.CV
TL;DR: Latent Sketchpad equips MLLMs with an internal visual scratchpad for generative visual thinking, allowing them to interleave textual reasoning with visual latent generation without compromising reasoning performance.
Details
Motivation: MLLMs struggle with visual planning and imagination in complex scenarios, while humans use sketching as visual thinking. The framework aims to extend MLLMs' internal visual representations beyond perceptual understanding to support generative visual thought.Method: Integrates visual generation into MLLMs’ native autoregressive reasoning process using two components: Context-Aware Vision Head for autoregressive visual representation production, and pretrained Sketch Decoder to render visual latents into interpretable sketch images.
Result: Experiments on MazePlanning dataset show Latent Sketchpad delivers comparable or superior reasoning performance to backbone MLLMs, and generalizes across different frontier MLLMs including Gemma3 and Qwen2.5-VL.
Conclusion: The framework successfully extends MLLMs’ textual reasoning to visual thinking, opening new opportunities for richer human-computer interaction and broader applications.
Abstract: While Multimodal Large Language Models (MLLMs) excel at visual understanding, they often struggle in complex scenarios that require visual planning and imagination. Inspired by how humans use sketching as a form of visual thinking to develop and communicate ideas, we introduce Latent Sketchpad, a framework that equips MLLMs with an internal visual scratchpad. The internal visual representations of MLLMs have traditionally been confined to perceptual understanding. We repurpose them to support generative visual thought without compromising reasoning ability. Building on frontier MLLMs, our approach integrates visual generation directly into their native autoregressive reasoning process. It allows the model to interleave textual reasoning with the generation of visual latents. These latents guide the internal thought process and can be translated into sketch images for interpretability. To realize this, we introduce two components: a Context-Aware Vision Head autoregressively produces visual representations, and a pretrained Sketch Decoder renders these into human-interpretable images. We evaluate the framework on our new dataset MazePlanning. Experiments across various MLLMs show that Latent Sketchpad delivers comparable or even superior reasoning performance to their backbone. It further generalizes across distinct frontier MLLMs, including Gemma3 and Qwen2.5-VL. By extending model’s textual reasoning to visual thinking, our framework opens new opportunities for richer human-computer interaction and broader applications. More details and resources are available on our project page: https://latent-sketchpad.github.io/.
[185] OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang
Main category: cs.CV
TL;DR: OSWorld-MCP is the first comprehensive benchmark for evaluating multimodal agents’ tool invocation, GUI operation, and decision-making abilities in real-world computer environments, addressing the gap in fair assessment of tool usage capabilities.
Details
Motivation: Past evaluations focused mainly on GUI interaction skills while overlooking tool invocation abilities enabled by Model Context Protocol (MCP), creating unfair comparisons between agents with integrated tools and those evaluated only on GUI interaction.Method: Created a novel automated code-generation pipeline to build tools and combined them with curated existing tools, resulting in 158 high-quality tools across 7 common applications that were manually validated for functionality, applicability, and versatility.
Result: MCP tools generally improved task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), but even the strongest models had relatively low tool invocation rates of only 36.3%.
Conclusion: OSWorld-MCP sets a new standard for evaluating multimodal agents in complex, tool-assisted environments and highlights the importance of assessing tool invocation capabilities, revealing significant room for improvement in current models.
Abstract: With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents’ tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality, practical applicability, and versatility. Extensive evaluations of state-of-the-art multimodal agents on OSWorld-MCP show that MCP tools generally improve task success rates (e.g., from 8.3% to 20.4% for OpenAI o3 at 15 steps, from 40.1% to 43.3% for Claude 4 Sonnet at 50 steps), underscoring the importance of assessing tool invocation capabilities. However, even the strongest models have relatively low tool invocation rates (only 36.3%), indicating room for improvement and highlighting the benchmark’s challenge. By explicitly measuring MCP tool usage skills, OSWorld-MCP deepens understanding of multimodal agents and sets a new standard for evaluating performance in complex, tool-assisted environments. Our code, environment, and data are publicly available at https://osworld-mcp.github.io.
[186] Physics-Inspired Gaussian Kolmogorov-Arnold Networks for X-ray Scatter Correction in Cone-Beam CT
Xu Jiang, Huiying Pan, Ligen Shi, Jianing Sun, Wenfeng Xu, Xing Zhao
Main category: cs.CV
TL;DR: Deep learning method using Gaussian RBF and KAN networks to correct scatter artifacts in CBCT imaging by modeling rotational symmetry of scatter distribution.
Details
Motivation: CBCT suffers from scatter artifacts during data acquisition, causing CT value bias and reduced tissue contrast that degrades diagnostic accuracy.Method: Uses Gaussian Radial Basis Functions to model point scatter function and embeds it into Kolmogorov-Arnold Networks layers to learn high-dimensional scatter features with physical prior knowledge.
Result: Effectively corrects scatter artifacts in reconstructed images and outperforms current methods in quantitative metrics, validated through synthetic and real-scan experiments.
Conclusion: The proposed method successfully combines physical scatter characteristics with KAN’s complex function mapping to accurately represent and correct scatter artifacts in CBCT imaging.
Abstract: Cone-beam CT (CBCT) employs a flat-panel detector to achieve three-dimensional imaging with high spatial resolution. However, CBCT is susceptible to scatter during data acquisition, which introduces CT value bias and reduced tissue contrast in the reconstructed images, ultimately degrading diagnostic accuracy. To address this issue, we propose a deep learning-based scatter artifact correction method inspired by physical prior knowledge. Leveraging the fact that the observed point scatter probability density distribution exhibits rotational symmetry in the projection domain, the method uses Gaussian Radial Basis Functions (RBF) to model the point scatter function and embeds it into the Kolmogorov-Arnold Networks (KAN) layer, which provides efficient nonlinear mapping capabilities for learning high-dimensional scatter features. By incorporating the physical characteristics of the scattered photon distribution together with the complex function mapping capacity of KAN, the model improves its ability to accurately represent scatter. The effectiveness of the method is validated through both synthetic and real-scan experiments. Experimental results show that the model can effectively correct the scatter artifacts in the reconstructed images and is superior to current methods in terms of quantitative metrics.
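The Gaussian RBF expansion used inside the KAN layer is described only at a high level; a minimal sketch of a 1-D Gaussian RBF expansion with learnable centers and mixing weights is shown below. The rotational-symmetry prior and the paper's exact parameterization are not reproduced.

```python
import torch
import torch.nn as nn

class GaussianRBFLayer(nn.Module):
    """Sketch of a KAN-style layer: each input feature is expanded onto Gaussian
    radial basis functions and linearly recombined."""
    def __init__(self, in_features, out_features, num_basis=16, sigma=0.2):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1, 1, num_basis))
        self.sigma = sigma
        self.mix = nn.Linear(in_features * num_basis, out_features)

    def forward(self, x):                      # x: (B, in_features)
        d = x.unsqueeze(-1) - self.centers     # (B, in_features, num_basis)
        phi = torch.exp(-0.5 * (d / self.sigma) ** 2)
        return self.mix(phi.flatten(1))

layer = GaussianRBFLayer(in_features=8, out_features=4)
print(layer(torch.rand(2, 8)).shape)  # torch.Size([2, 4])
```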
[187] A Dual-Branch CNN for Robust Detection of AI-Generated Facial Forgeries
Xin Zhang, Yuqi Song, Fei Zuo
Main category: cs.CV
TL;DR: A dual-branch CNN for face forgery detection that combines spatial and frequency domain analysis with adaptive feature fusion and a unified loss function, achieving strong performance across multiple forgery types.
Details
Motivation: The rapid advancement of generative AI enables creation of highly realistic forged facial images, posing threats to AI security, digital media integrity, and public trust, creating urgent need for robust detection methods.Method: Dual-branch CNN with RGB branch for semantic information and frequency branch for high-frequency artifacts, using channel attention for feature fusion and FSC Loss (focal loss, supervised contrastive loss, frequency center margin loss) for training.
Result: Achieves strong performance across all forgery categories on DiFF benchmark (text-to-image, image-to-image, face swap, face edit) and outperforms average human accuracy.
Conclusion: The model demonstrates effectiveness and potential contribution to safeguarding AI ecosystems against visual forgery attacks through complementary spatial-frequency analysis and robust learning strategy.
Abstract: The rapid advancement of generative AI has enabled the creation of highly realistic forged facial images, posing significant threats to AI security, digital media integrity, and public trust. Face forgery techniques, ranging from face swapping and attribute editing to powerful diffusion-based image synthesis, are increasingly being used for malicious purposes such as misinformation, identity fraud, and defamation. This growing challenge underscores the urgent need for robust and generalizable face forgery detection methods as a critical component of AI security infrastructure. In this work, we propose a novel dual-branch convolutional neural network for face forgery detection that leverages complementary cues from both spatial and frequency domains. The RGB branch captures semantic information, while the frequency branch focuses on high-frequency artifacts that are difficult for generative models to suppress. A channel attention module is introduced to adaptively fuse these heterogeneous features, highlighting the most informative channels for forgery discrimination. To guide the network’s learning process, we design a unified loss function, FSC Loss, that combines focal loss, supervised contrastive loss, and a frequency center margin loss to enhance class separability and robustness. We evaluate our model on the DiFF benchmark, which includes forged images generated from four representative methods: text-to-image, image-to-image, face swap, and face edit. Our method achieves strong performance across all categories and outperforms average human accuracy. These results demonstrate the model’s effectiveness and its potential contribution to safeguarding AI ecosystems against visual forgery attacks.
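The channel attention module that fuses the RGB and frequency branches follows a familiar squeeze-and-excitation pattern; the sketch below is a generic version of that idea, not the paper's exact module.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch: concatenate RGB-branch and frequency-branch features, then
    reweight channels with a squeeze-and-excitation style gate."""
    def __init__(self, rgb_ch, freq_ch, reduction=8):
        super().__init__()
        total = rgb_ch + freq_ch
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(total, total // reduction), nn.ReLU(),
            nn.Linear(total // reduction, total), nn.Sigmoid())

    def forward(self, rgb_feat, freq_feat):
        fused = torch.cat([rgb_feat, freq_feat], dim=1)        # (B, C, H, W)
        weights = self.gate(fused).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return fused * weights

fusion = ChannelAttentionFusion(rgb_ch=64, freq_ch=64)
out = fusion(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28))
print(out.shape)  # torch.Size([2, 128, 28, 28])
```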
[188] Eye-Tracking, Mouse Tracking, Stimulus Tracking,and Decision-Making Datasets in Digital Pathology
Veronica Thai, Rui Li, Meng Ling, Shuning Jiang, Jeremy Wolfe, Raghu Machiraju, Yan Hu, Zaibo Li, Anil Parwani, Jian Chen
Main category: cs.CV
TL;DR: PathoGaze1.0 is a comprehensive behavioral dataset capturing pathologists’ visual search and decision-making processes during cancer diagnosis from whole-slide images, including eye-tracking, mouse interaction, and diagnostic decision data.
Details
Motivation: Pathologists' diagnostic accuracy is estimated to average around 70%, and adding a second pathologist does not substantially improve decision consistency. The field lacks behavioral data to explain diagnostic errors and inconsistencies in whole-slide image interpretation.Method: Collected 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data from 19 pathologists interpreting 397 WSIs using an ecologically valid testbed called PTAH.
Result: Recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events, creating a comprehensive dataset of pathologists’ diagnostic workflow behaviors.
Conclusion: The PathoGaze1.0 dataset provides valuable behavioral data that can help explain diagnostic errors and inconsistencies, and could improve training for both pathologists and AI systems supporting human experts.
Abstract: Interpretation of giga-pixel whole-slide images (WSIs) is an important but difficult task for pathologists. Their diagnostic accuracy is estimated to average around 70%. Adding a second pathologist does not substantially improve decision consistency. The field lacks adequate behavioral data to explain diagnostic errors and inconsistencies. To fill in this gap, we present PathoGaze1.0, a comprehensive behavioral dataset capturing the dynamic visual search and decision-making processes of the full diagnostic workflow during cancer diagnosis. The dataset comprises 18.69 hours of eye-tracking, mouse interaction, stimulus tracking, viewport navigation, and diagnostic decision data (EMSVD) collected from 19 pathologists interpreting 397 WSIs. The data collection process emphasizes ecological validity through an application-grounded testbed, called PTAH. In total, we recorded 171,909 fixations, 263,320 saccades, and 1,867,362 mouse interaction events. In addition, such data could also be used to improve the training of both pathologists and AI systems that might support human experts. All experiments were preregistered at https://osf.io/hj9a7, and the complete dataset along with analysis code is available at https://go.osu.edu/pathogaze.
[189] Group Relative Attention Guidance for Image Editing
Xuanpu Zhang, Xuesong Niu, Ruidong Chen, Dan Song, Jianhao Zeng, Penghui Du, Haoxiang Cao, Kai Wu, An-an Liu
Main category: cs.CV
TL;DR: Proposes Group Relative Attention Guidance (GRAG), a method to control editing intensity in Diffusion-in-Transformer models by reweighting attention delta values, enabling fine-grained editing control without tuning.
Details
Motivation: Existing image editing methods based on Diffusion-in-Transformer models lack effective control over editing degree, limiting customization capabilities.Method: Analyzes MM-Attention mechanism in DiT models, identifies bias vectors as inherent editing behavior, and proposes GRAG to reweight delta values between tokens and biases to modulate focus on input image vs editing instruction.
Result: GRAG can be integrated with 4 lines of code, consistently enhances editing quality, and achieves smoother and more precise control over editing degree compared to Classifier-Free Guidance.
Conclusion: GRAG provides an effective solution for continuous and fine-grained control over image editing intensity in DiT models without requiring additional tuning.
Abstract: Recently, image editing based on Diffusion-in-Transformer models has undergone rapid development. However, existing editing methods often lack effective control over the degree of editing, limiting their ability to achieve more customized results. To address this limitation, we investigate the MM-Attention mechanism within the DiT model and observe that the Query and Key tokens share a bias vector that is only layer-dependent. We interpret this bias as representing the model’s inherent editing behavior, while the delta between each token and its corresponding bias encodes the content-specific editing signals. Based on this insight, we propose Group Relative Attention Guidance, a simple yet effective method that reweights the delta values of different tokens to modulate the focus of the model on the input image relative to the editing instruction, enabling continuous and fine-grained control over editing intensity without any tuning. Extensive experiments conducted on existing image editing frameworks demonstrate that GRAG can be integrated with as few as four lines of code, consistently enhancing editing quality. Moreover, compared to the commonly used Classifier-Free Guidance, GRAG achieves smoother and more precise control over the degree of editing. Our code will be released at https://github.com/little-misfit/GRAG-Image-Editing.
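The paper reports that GRAG can be added with as few as four lines of code, but that code is not shown here; the sketch below only illustrates the underlying idea of keeping a shared, layer-dependent bias and rescaling each token's deviation from it. Both the bias estimate and the scale are assumptions for illustration.

```python
import torch

def grag_reweight(tokens, bias, scale):
    """Conceptual sketch of Group Relative Attention Guidance: keep the shared
    layer bias and rescale each token's delta from it. `bias` stands in for the
    layer-dependent bias the paper identifies in MM-Attention Q/K tokens, and
    `scale` controls editing intensity; both usages here are assumptions."""
    return bias + scale * (tokens - bias)

# Usage: damp the content-specific signal of image-token keys by 20%.
keys = torch.randn(1, 4096, 128)              # (batch, image tokens, head dim)
layer_bias = keys.mean(dim=1, keepdim=True)   # stand-in for the shared bias vector
edited_keys = grag_reweight(keys, layer_bias, scale=0.8)
print(edited_keys.shape)
```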
[190] SAGE: Structure-Aware Generative Video Transitions between Diverse Clips
Mia Kan, Yilin Liu, Niloy Mitra
Main category: cs.CV
TL;DR: SAGE is a zero-shot approach for video transitions that uses structural guidance and generative synthesis to create smooth, semantically consistent transitions between diverse video clips with large temporal gaps.
Details
Motivation: Existing video transition methods struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, creating a need for content-aware and visually coherent transitions.Method: SAGE combines structural guidance (line maps and motion flow) with generative synthesis in a zero-shot approach, drawing on artistic workflows like aligning silhouettes and interpolating salient features.
Result: Extensive experiments show SAGE outperforms both classical and generative baselines (FILM, TVG, DiffMorpher, VACE, GI) on quantitative metrics and user studies for transitions between diverse clips.
Conclusion: SAGE enables smooth, semantically consistent video transitions without fine-tuning, effectively bridging diverse clips with large temporal gaps or semantic differences.
Abstract: Video transitions aim to synthesize intermediate frames between two clips, but naive approaches such as linear blending introduce artifacts that limit professional use or break temporal coherence. Traditional techniques (cross-fades, morphing, frame interpolation) and recent generative inbetweening methods can produce high-quality plausible intermediates, but they struggle with bridging diverse clips involving large temporal gaps or significant semantic differences, leaving a gap for content-aware and visually coherent transitions. We address this challenge by drawing on artistic workflows, distilling strategies such as aligning silhouettes and interpolating salient features to preserve structure and perceptual continuity. Building on this, we propose SAGE (Structure-Aware Generative vidEo transitions) as a zero-shot approach that combines structural guidance, provided via line maps and motion flow, with generative synthesis, enabling smooth, semantically consistent transitions without fine-tuning. Extensive experiments and comparison with current alternatives, namely [FILM, TVG, DiffMorpher, VACE, GI], demonstrate that SAGE outperforms both classical and generative baselines on quantitative metrics and user studies for producing transitions between diverse clips. Code to be released on acceptance.
[191] MIC-BEV: Multi-Infrastructure Camera Bird’s-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, Tianhui Cai, Jiaqi Ma
Main category: cs.CV
TL;DR: MIC-BEV is a Transformer-based BEV perception framework for infrastructure-based multi-camera 3D object detection that handles variable cameras with heterogeneous parameters and demonstrates strong robustness under sensor degradation.
Details
Motivation: Existing camera-based detection models underperform in infrastructure-based perception scenarios due to multi-view setups, diverse camera configurations, degraded visual inputs, and various road layouts.Method: Proposes a graph-enhanced fusion module that integrates multi-view image features into BEV space using geometric relationships between cameras and BEV cells alongside latent visual cues. Also introduces M2I synthetic dataset for training.
Result: Achieves state-of-the-art performance in 3D object detection on both M2I and real-world RoScenes datasets, with strong robustness under extreme weather and sensor degradation.
Conclusion: MIC-BEV shows potential for real-world deployment in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy.
Abstract: Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird’s-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
[192] Does Object Binding Naturally Emerge in Large Pretrained Vision Transformers?
Yihao Li, Saeed Salehi, Lyle Ungar, Konrad P. Kording
Main category: cs.CV
TL;DR: Vision Transformers naturally learn object binding capabilities during self-supervised pretraining, enabling them to determine whether image patches belong to the same object, with this ability emerging strongly in models like DINO, MAE, and CLIP but weaker in ImageNet-supervised models.
Details
Motivation: To investigate whether object binding - the ability to bind features into coherent object representations - naturally emerges in pre-trained Vision Transformers, challenging the view that ViTs lack this capability.Method: Used similarity probes to decode ‘IsSameObject’ property from patch embeddings across ViT layers, analyzed different pretraining objectives (self-supervised vs ImageNet-supervised), and performed ablation studies to test the functional role of object binding.
Result: Self-supervised ViTs achieve over 90% accuracy in detecting whether patches belong to the same object, with this capability emerging reliably in DINO, MAE, and CLIP models but markedly weaker in ImageNet-supervised models. Object binding is encoded in a low-dimensional subspace and actively guides attention.
Conclusion: Object binding emerges naturally in self-supervised ViTs as a functional capability that serves the pretraining objective, challenging the notion that ViTs lack symbolic knowledge of object composition and demonstrating how ‘which parts belong together’ knowledge emerges in connectionist systems.
Abstract: Object binding, the brain’s ability to bind the many features that collectively represent an object into a coherent whole, is central to human cognition. It groups low-level perceptual features into high-level object representations, stores those objects efficiently and compositionally in memory, and supports human reasoning about individual object instances. While prior work often imposes object-centric attention (e.g., Slot Attention) explicitly to probe these benefits, it remains unclear whether this ability naturally emerges in pre-trained Vision Transformers (ViTs). Intuitively, they could: recognizing which patches belong to the same object should be useful for downstream prediction and thus guide attention. Motivated by the quadratic nature of self-attention, we hypothesize that ViTs represent whether two patches belong to the same object, a property we term IsSameObject. We decode IsSameObject from patch embeddings across ViT layers using a similarity probe, which reaches over 90% accuracy. Crucially, this object-binding capability emerges reliably in self-supervised ViTs (DINO, MAE, CLIP), but markedly weaker in ImageNet-supervised models, suggesting that binding is not a trivial architectural artifact, but an ability acquired through specific pretraining objectives. We further discover that IsSameObject is encoded in a low-dimensional subspace on top of object features, and that this signal actively guides attention. Ablating IsSameObject from model activations degrades downstream performance and works against the learning objective, implying that emergent object binding naturally serves the pretraining objective. Our findings challenge the view that ViTs lack object binding and highlight how symbolic knowledge of “which parts belong together” emerges naturally in a connectionist system.
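The similarity probe is simple to reproduce in spirit: build pairwise features from patch embeddings and fit a linear classifier on same-object labels. The sketch below uses random embeddings and labels purely as placeholders for ViT patch embeddings and segmentation-derived IsSameObject labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder data: in the paper, the embeddings come from ViT layers and the
# labels from object masks; here both are synthetic.
rng = np.random.default_rng(0)
num_pairs, dim = 5000, 768
emb_a = rng.normal(size=(num_pairs, dim))
emb_b = rng.normal(size=(num_pairs, dim))
same_object = rng.integers(0, 2, size=num_pairs)

# Simple pairwise features: cosine similarity and mean absolute difference.
cos = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
features = np.stack([cos, np.abs(emb_a - emb_b).mean(axis=1)], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(features, same_object, random_state=0)
probe = LogisticRegression().fit(X_tr, y_tr)
print(probe.score(X_te, y_te))  # ~0.5 on random data; >90% reported for real ViTs
```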
[193] Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance
Yujie Wei, Shiwei Zhang, Hangjie Yuan, Yujin Han, Zhekai Chen, Jiayu Wang, Difan Zou, Xihui Liu, Yingya Zhang, Yu Liu, Hongming Shan
Main category: cs.CV
TL;DR: ProMoE is a novel Mixture-of-Experts framework for Diffusion Transformers that addresses the limitations of applying MoE to vision tasks through a two-step router with explicit routing guidance, enabling better expert specialization for visual tokens.
Details
Motivation: Existing attempts to apply Mixture-of-Experts to Diffusion Transformers have yielded limited gains due to fundamental differences between language and visual tokens. Visual tokens exhibit spatial redundancy and functional heterogeneity, which hinders expert specialization in vision MoE applications.Method: ProMoE features a two-step router with explicit routing guidance: (1) conditional routing partitions image tokens into conditional and unconditional sets based on functional roles, and (2) prototypical routing refines assignments of conditional tokens using learnable prototypes based on semantic content. It also includes a routing contrastive loss to enhance intra-expert coherence and inter-expert diversity.
Result: Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives, validating the effectiveness of the proposed routing mechanisms.
Conclusion: The explicit semantic guidance enabled by prototypical routing is crucial for vision MoE, and ProMoE provides an effective framework for applying Mixture-of-Experts to Diffusion Transformers in visual tasks.
Abstract: Mixture-of-Experts (MoE) has emerged as a powerful paradigm for scaling model capacity while preserving computational efficiency. Despite its notable success in large language models (LLMs), existing attempts to apply MoE to Diffusion Transformers (DiTs) have yielded limited gains. We attribute this gap to fundamental differences between language and visual tokens. Language tokens are semantically dense with pronounced inter-token variation, while visual tokens exhibit spatial redundancy and functional heterogeneity, hindering expert specialization in vision MoE. To this end, we present ProMoE, an MoE framework featuring a two-step router with explicit routing guidance that promotes expert specialization. Specifically, this guidance encourages the router to partition image tokens into conditional and unconditional sets via conditional routing according to their functional roles, and refine the assignments of conditional image tokens through prototypical routing with learnable prototypes based on semantic content. Moreover, the similarity-based expert allocation in latent space enabled by prototypical routing offers a natural mechanism for incorporating explicit semantic guidance, and we validate that such guidance is crucial for vision MoE. Building on this, we propose a routing contrastive loss that explicitly enhances the prototypical routing process, promoting intra-expert coherence and inter-expert diversity. Extensive experiments on ImageNet benchmark demonstrate that ProMoE surpasses state-of-the-art methods under both Rectified Flow and DDPM training objectives. Code and models will be made publicly available.
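Prototypical routing can be sketched as top-1 assignment of tokens to experts by similarity to learnable prototypes; the conditional/unconditional split and the routing contrastive loss described above are omitted from this illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypicalRouter(nn.Module):
    """Sketch: route each token to the expert whose learnable prototype is most
    similar in latent space. ProMoE's full two-step router is not reproduced."""
    def __init__(self, dim, num_experts):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_experts, dim))

    def forward(self, tokens):                       # tokens: (B, N, D)
        sim = F.normalize(tokens, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        expert_idx = sim.argmax(dim=-1)              # (B, N) top-1 expert per token
        gate = sim.softmax(dim=-1)                   # soft routing weights
        return expert_idx, gate

router = PrototypicalRouter(dim=256, num_experts=8)
idx, gate = router(torch.randn(2, 196, 256))
print(idx.shape, gate.shape)  # torch.Size([2, 196]) torch.Size([2, 196, 8])
```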
[194] Uniform Discrete Diffusion with Metric Path for Video Generation
Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Wenxuan Wang, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, Xinlong Wang
Main category: cs.CV
TL;DR: URSA is a discrete generative modeling framework for scalable video generation that bridges the performance gap with continuous approaches through iterative global refinement of spatiotemporal tokens.
Details
Motivation: Discrete approaches for video generation lag behind continuous-space methods due to error accumulation and long-context inconsistency issues.Method: URSA uses iterative global refinement of discrete spatiotemporal tokens with Linearized Metric Path and Resolution-dependent Timestep Shifting mechanisms, plus asynchronous temporal fine-tuning for unified task handling.
Result: URSA outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods on challenging video and image generation benchmarks.
Conclusion: URSA successfully bridges the gap between discrete and continuous approaches for scalable video generation, enabling high-resolution synthesis and long-duration generation with fewer inference steps.
Abstract: Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-context inconsistency. In this work, we revisit discrete generative modeling and present Uniform discRete diffuSion with metric pAth (URSA), a simple yet powerful framework that bridges the gap with continuous approaches for the scalable video generation. At its core, URSA formulates the video generation task as an iterative global refinement of discrete spatiotemporal tokens. It integrates two key designs: a Linearized Metric Path and a Resolution-dependent Timestep Shifting mechanism. These designs enable URSA to scale efficiently to high-resolution image synthesis and long-duration video generation, while requiring significantly fewer inference steps. Additionally, we introduce an asynchronous temporal fine-tuning strategy that unifies versatile tasks within a single model, including interpolation and image-to-video generation. Extensive experiments on challenging video and image generation benchmarks demonstrate that URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods. Code and models are available at https://github.com/baaivision/URSA
[195] Generative View Stitching
Chonghyuk Song, Michal Stary, Boyuan Chen, George Kopanas, Vincent Sitzmann
Main category: cs.CV
TL;DR: Generative View Stitching (GVS) enables collision-free camera-guided video generation by sampling entire sequences in parallel and using diffusion stitching to ensure scene consistency with predefined camera trajectories.
Details
Motivation: Autoregressive video diffusion models struggle with camera-guided generation because they can't use future conditioning, leading to collisions with generated scenes and subsequent collapse.Method: GVS extends diffusion stitching from robot planning to video generation, works with any off-the-shelf video model trained with Diffusion Forcing, and uses Omni Guidance for temporal consistency by conditioning on both past and future frames.
Result: GVS achieves stable, collision-free, frame-to-frame consistent video generation that closes loops for various predefined camera paths, including impossible geometries like the Impossible Staircase.
Conclusion: The proposed method successfully addresses the limitations of autoregressive models in camera-guided video generation by enabling parallel sampling and future conditioning through diffusion stitching.
Abstract: Autoregressive video diffusion models are capable of long rollouts that are stable and consistent with history, but they are unable to guide the current generation with conditioning from the future. In camera-guided video generation with a predefined camera trajectory, this limitation leads to collisions with the generated scene, after which autoregression quickly collapses. To address this, we propose Generative View Stitching (GVS), which samples the entire sequence in parallel such that the generated scene is faithful to every part of the predefined camera trajectory. Our main contribution is a sampling algorithm that extends prior work on diffusion stitching for robot planning to video generation. While such stitching methods usually require a specially trained model, GVS is compatible with any off-the-shelf video model trained with Diffusion Forcing, a prevalent sequence diffusion framework that we show already provides the affordances necessary for stitching. We then introduce Omni Guidance, a technique that enhances the temporal consistency in stitching by conditioning on both the past and future, and that enables our proposed loop-closing mechanism for delivering long-range coherence. Overall, GVS achieves camera-guided video generation that is stable, collision-free, frame-to-frame consistent, and closes loops for a variety of predefined camera paths, including Oscar Reutersvärd’s Impossible Staircase. Results are best viewed as videos at https://andrewsonga.github.io/gvs.
[196] CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects
Zhao Wang, Aoxue Li, Lingting Zhu, Yong Guo, Qi Dou, Zhenguo Li
Main category: cs.CV
TL;DR: CustomVideo is a framework for multi-subject text-to-video generation that preserves subject identities through attention control and object segmentation.
Details
Motivation: Current text-to-video personalization methods struggle with handling multiple subjects simultaneously, which is a more challenging and practical scenario.Method: Composes multiple subjects in a single image, implements attention control strategy to disentangle subjects in diffusion model latent space, and uses object segmentation masks for focused attention learning.
Result: Extensive experiments show superiority over state-of-the-art approaches in qualitative, quantitative metrics and user studies.
Conclusion: CustomVideo effectively addresses multi-subject text-to-video customization with identity preservation and provides a comprehensive benchmark dataset.
Abstract: Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation struggle to handle multiple subjects, which is a more challenging and practical scenario. In this work, our aim is to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. To be specific, firstly, we encourage the co-occurrence of multiple subjects via composing them in a single image. Further, upon a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific area of the object, we segment the object from given reference images and provide a corresponding object mask for attention learning. Also, we collect a multi-subject text-to-video generation dataset as a comprehensive benchmark. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method compared to previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
[197] UMCFuse: A Unified Multiple Complex Scenes Infrared and Visible Image Fusion Framework
Xilai Li, Xiaosong Li, Tianshu Tan, Huafeng Li, Tao Ye
Main category: cs.CV
TL;DR: UMCFuse is a unified framework for infrared and visible image fusion in complex scenes, featuring adaptive denoising and multi-directional energy feature fusion to handle interference while preserving details.
Details
Motivation: Little attention has been paid to infrared and visible image fusion in complex scenes, leading to sub-optimal results under interference conditions.Method: Classifies visible image pixels by light scattering degree to separate details from intensity, uses adaptive denoising for detail layers, and fuses energy features from multiple directions.
Result: Extensive experiments on real and synthetic complex scenes datasets show superiority over recent methods across various adverse conditions and downstream tasks.
Conclusion: UMCFuse effectively balances interference removal and detail preservation, demonstrating strong generalization capacity for infrared and visible image fusion in complex scenes.
Abstract: Infrared and visible image fusion has emerged as a prominent research area in computer vision. However, little attention has been paid to the fusion task in complex scenes, leading to sub-optimal results under interference. To fill this gap, we propose a unified framework for infrared and visible image fusion in complex scenes, termed UMCFuse. Specifically, we classify the pixels of visible images according to the degree of light scattering during transmission, allowing us to separate fine details from overall intensity. Maintaining a balance between interference removal and detail preservation is essential for the generalization capacity of the proposed method. Therefore, we propose an adaptive denoising strategy for the fusion of detail layers. Meanwhile, we fuse the energy features from different modalities by analyzing them from multiple directions. Extensive fusion experiments on real and synthetic complex-scene datasets, covering adverse weather conditions, noise, blur, overexposure, and fire, as well as downstream tasks including semantic segmentation, object detection, salient object detection, and depth estimation, consistently indicate the superiority of the proposed method compared with recent representative methods. Our code is available at https://github.com/ixilai/UMCFuse.
[198] Is Sora a World Simulator? A Comprehensive Survey on General World Models and Beyond
Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Bohan Li, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, Guan Huang
Main category: cs.CV
TL;DR: A comprehensive survey of recent advancements in general world models, covering video generation, autonomous driving, and autonomous agents, with analysis of challenges and future directions.
Details
Motivation: World models are crucial for achieving Artificial General Intelligence (AGI) and serve as cornerstones for applications ranging from virtual environments to decision-making systems. The emergence of Sora model with remarkable simulation capabilities demonstrates the importance of this field.Method: The survey conducts comprehensive exploration of latest advancements through analysis of generative methodologies in video generation, autonomous-driving world models, and world models deployed within autonomous agents.
Result: The survey provides systematic analysis of world models’ roles in reshaping transportation, enabling intelligent interactions in dynamic environments, and facilitating synthesis of realistic visual content.
Conclusion: World models represent a crucial pathway toward AGI. The survey serves as foundational reference for research community and identifies challenges, limitations, and potential future directions for continued innovation in this field.
Abstract: General world models represent a crucial pathway toward achieving Artificial General Intelligence (AGI), serving as the cornerstone for various applications ranging from virtual environments to decision-making systems. Recently, the emergence of the Sora model has attracted significant attention due to its remarkable simulation capabilities, which exhibit an incipient comprehension of physical laws. In this survey, we embark on a comprehensive exploration of the latest advancements in world models. Our analysis navigates through the forefront of generative methodologies in video generation, where world models stand as pivotal constructs facilitating the synthesis of highly realistic visual content. Additionally, we scrutinize the burgeoning field of autonomous-driving world models, meticulously delineating their indispensable role in reshaping transportation and urban mobility. Furthermore, we delve into the intricacies inherent in world models deployed within autonomous agents, shedding light on their profound significance in enabling intelligent interactions within dynamic environmental contexts. At last, we examine challenges and limitations of world models, and discuss their potential future directions. We hope this survey can serve as a foundational reference for the research community and inspire continued innovation. This survey will be regularly updated at: https://github.com/GigaAI-research/General-World-Models-Survey.
[199] RETTA: Retrieval-Enhanced Test-Time Adaptation for Zero-Shot Video Captioning
Yunchuan Ma, Laiyun Qing, Guorong Li, Yuankai Qi, Amin Beheshti, Quan Z. Sheng, Qingming Huang
Main category: cs.CV
TL;DR: RETTA is a zero-shot video captioning framework that uses frozen pretrained models (XCLIP, CLIP, AnglE, GPT-2) with learnable tokens to bridge video-text understanding without training data, achieving significant improvements over state-of-the-art methods.
Details
Motivation: Zero-shot video captioning remains underexplored compared to fully-supervised methods, and existing approaches struggle to make text generation models sufficiently aware of video content.Method: Uses four frozen pretrained models: XCLIP for video-text retrieval, CLIP for image-text matching, AnglE for text alignment, and GPT-2 for text generation. Learnable tokens are optimized at test time with crafted loss functions to absorb video information for GPT-2, requiring only 16 iterations without ground truth data.
Result: Achieves absolute 5.1%-32.4% improvements in CIDEr metric on MSR-VTT, MSVD, and VATEX datasets compared to state-of-the-art zero-shot video captioning methods.
Conclusion: RETTA demonstrates that test-time adaptation with learnable tokens can effectively bridge video understanding and text generation in zero-shot settings, outperforming existing methods significantly.
Abstract: Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes advantage of existing pretrained large-scale vision and language models to directly generate captions with test-time adaptation. Specifically, we bridge video and text using four key models: a general video-text retrieval model XCLIP, a general image-text matching model CLIP, a text alignment model AnglE, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium among these four frozen models GPT-2, XCLIP, CLIP, and AnglE. Different from the conventional way that trains these tokens with training data, we propose to learn these tokens with soft targets of the inference data under several carefully crafted loss functions, which enable the tokens to absorb video information catered for GPT-2. This procedure can be efficiently done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show absolute 5.1%-32.4% improvements in terms of the main metric CIDEr compared to several state-of-the-art zero-shot video captioning methods.
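As a rough illustration of the test-time adaptation loop described above, the sketch below optimizes a handful of soft tokens against frozen placeholder modules so that the generated representation aligns with a video embedding. The module names (`frozen_lm`, `text_encoder`), dimensions, and the cosine objective are assumptions for illustration, not RETTA's actual loss functions or models.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(video_feat, frozen_lm, text_encoder, n_tokens=4,
                    dim=768, steps=16, lr=1e-2):
    """Hedged sketch: learn a few soft tokens at inference time so that the
    frozen generator's output agrees with the video embedding. All modules
    stay frozen; only the tokens receive gradients (~16 steps, as reported)."""
    soft_tokens = torch.randn(1, n_tokens, dim, requires_grad=True)
    opt = torch.optim.Adam([soft_tokens], lr=lr)
    for _ in range(steps):
        hidden = frozen_lm(soft_tokens)          # frozen text generator (stand-in)
        text_emb = text_encoder(hidden)          # frozen text embedding (stand-in)
        loss = 1.0 - F.cosine_similarity(text_emb, video_feat, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft_tokens.detach()

# toy usage with stand-in frozen modules
lm = torch.nn.Linear(768, 768)
enc = lambda h: h.mean(dim=1)
tokens = test_time_adapt(torch.randn(1, 768), lm, enc)
print(tokens.shape)
```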
[200] Unsupervised Monocular Depth Estimation Based on Hierarchical Feature-Guided Diffusion
Runze Liu, Dongchen Zhu, Guanghui Zhang, Yue Xu, Wenjun Shi, Xiaolin Zhang, Lei Wang, Jiamao Li
Main category: cs.CV
TL;DR: Proposes a diffusion model-based approach for unsupervised monocular depth estimation with enhanced robustness to blurry/noisy images, using hierarchical feature-guided denoising and implicit depth consistency loss.
Details
Motivation: Real-world images are often blurry or noisy due to weather conditions and camera limitations, requiring robust depth estimation models. Generative-based methods show enhanced robustness.Method: Uses a well-converging diffusion model with hierarchical feature-guided denoising module that leverages image features to guide denoising. Also introduces implicit depth consistency loss to ensure scale consistency in video sequences.
Result: Experiments on KITTI, Make3D, and self-collected SIMIT datasets show the approach stands out among generative-based models and demonstrates remarkable robustness.
Conclusion: The proposed diffusion-based method with feature-guided denoising and implicit depth consistency loss achieves superior performance and robustness in unsupervised monocular depth estimation.
Abstract: Unsupervised monocular depth estimation has received widespread attention because of its capability to train without ground truth. In real-world scenarios, the images may be blurry or noisy due to the influence of weather conditions and inherent limitations of the camera. Therefore, it is particularly important to develop a robust depth estimation model. Benefiting from the training strategies of generative networks, generative-based methods often exhibit enhanced robustness. In light of this, we employ a well-converging diffusion model among generative networks for unsupervised monocular depth estimation. Additionally, we propose a hierarchical feature-guided denoising module. This module significantly enriches the model’s capacity for learning and interpreting depth distribution by fully leveraging image features to guide the denoising process. Furthermore, we explore the implicit depth within reprojection and design an implicit depth consistency loss. This loss function serves to enhance the performance of the model and ensure the scale consistency of depth within a video sequence. We conduct experiments on the KITTI, Make3D, and our self-collected SIMIT datasets. The results indicate that our approach stands out among generative-based models, while also showcasing remarkable robustness.
[201] Navigation with VLM framework: Towards Going to Any Language
Zecheng Yin, Chonghao Cheng, and Yao Guo, Zhen Li
Main category: cs.CV
TL;DR: NavVLM is a training-free framework that uses open-source Vision Language Models to enable robots to navigate towards open-set language goals in complex environments without requiring detailed instructions or environmental priors.
Details
Motivation: Existing VLM-based navigation methods often require high computational costs, rely on object-centric approaches, or depend on environmental priors in human instructions. There's a need for more efficient and flexible navigation that can handle abstract language goals.Method: NavVLM leverages open-source VLMs as its cognitive core to perceive environmental information and provide constant exploration guidance. It operates without training and can navigate with only a neat target rather than detailed instructions with environment prior.
Result: NavVLM achieves state-of-the-art performance in SPL on object-specific tasks in MP3D, HM3D, and Gibson environments. It demonstrates capabilities to navigate towards any open-set languages and has been validated in both simulation and real-world indoor robot experiments.
Conclusion: The framework successfully enables intelligent navigation with abstract language goals in open scenes, showing effectiveness across multiple environments and real-world applications without requiring training or detailed environmental priors.
Abstract: Navigating towards fully open language goals and exploring open scenes in an intelligent way have always raised significant challenges. Recently, Vision Language Models (VLMs) have demonstrated remarkable capabilities to reason with both language and visual data. Although many works have focused on leveraging VLMs for navigation in open scenes, they often require high computational cost, rely on object-centric approaches, or depend on environmental priors in detailed human instructions. We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively, even for human-friendly language goals such as abstract places, actions, or specific objects in open scenes. NavVLM leverages the VLM as its cognitive core to perceive environmental information and constantly provides exploration guidance, achieving intelligent navigation with only a neat target rather than a detailed instruction with an environment prior. We evaluated and validated NavVLM in both simulation and real-world experiments. In simulation, our framework achieves state-of-the-art performance in Success weighted by Path Length (SPL) on object-specific tasks in richly detailed environments from Matterport 3D (MP3D), Habitat Matterport 3D (HM3D), and Gibson. With navigation episodes reported, NavVLM demonstrates the capability to navigate towards any open-set language goal. In real-world validation, we confirmed our framework’s effectiveness on a real-world robot in an indoor scene.
[202] MTFL: Multi-Timescale Feature Learning for Weakly-Supervised Anomaly Detection in Surveillance Videos
Yiling Zhang, Erkut Akdag, Egor Bondarev, Peter H. N. De With
Main category: cs.CV
TL;DR: Proposes Multi-Timescale Feature Learning (MTFL) using Video Swin Transformer with short, medium, and long temporal tubelets for anomaly detection, achieving SotA results on multiple datasets and creating an extended VADD dataset.
Details
Motivation: Anomaly detection requires capturing both fine-grained motion details and contextual events across different time scales for public safety applications.Method: MTFL method using Video Swin Transformer with multi-scale temporal tubelets (short, medium, long) to extract spatio-temporal features from videos.
Result: Achieved 89.78% AUC on UCF-Crime, 95.32% AUC on ShanghaiTech, and 84.57% AP on XD-Violence datasets, outperforming state-of-the-art methods.
Conclusion: MTFL effectively captures multi-timescale features for anomaly detection and the extended VADD dataset provides broader coverage of realistic anomalies for future development.
Abstract: Detection of anomaly events is relevant for public safety and requires a combination of fine-grained motion information and contextual events at variable time-scales. To this end, we propose a Multi-Timescale Feature Learning (MTFL) method to enhance the representation of anomaly features. Short, medium, and long temporal tubelets are employed to extract spatio-temporal video features using a Video Swin Transformer. Experimental results demonstrate that MTFL outperforms state-of-the-art methods on the UCF-Crime dataset, achieving an anomaly detection performance of 89.78% AUC. Moreover, it performs complementarily to the SotA, with 95.32% AUC on ShanghaiTech and 84.57% AP on the XD-Violence dataset. Furthermore, we generate an extended version of UCF-Crime for development and evaluation on a wider range of anomalies, namely the Video Anomaly Detection Dataset (VADD), comprising 2,591 videos in 18 classes with extensive coverage of realistic anomalies.
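The multi-timescale idea can be illustrated with a small sampling helper that cuts short, medium, and long temporally centered tubelets from one clip before feeding each to a backbone; the lengths below are illustrative defaults, not the paper's configuration.

```python
import numpy as np

def sample_tubelets(video, lengths=(8, 16, 32)):
    """Sample temporally centered tubelets of several lengths from one clip.
    `video` is (T, H, W, C); the lengths are illustrative, not MTFL's values."""
    T = video.shape[0]
    tubelets = {}
    for L in lengths:
        L = min(L, T)
        start = (T - L) // 2
        tubelets[L] = video[start:start + L]   # each scale goes to the backbone
    return tubelets

clips = sample_tubelets(np.random.rand(48, 224, 224, 3))
print({k: v.shape for k, v in clips.items()})
```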
[203] Topology-Preserving Image Segmentation with Spatial-Aware Persistent Feature Matching
Bo Wen, Haochen Zhang, Dirk-Uwe G. Bartsch, William R. Freeman, Truong Q. Nguyen, Cheolhong An
Main category: cs.CV
TL;DR: Proposes a Spatial-Aware Topological Loss Function that improves topological accuracy in tubular structure segmentation by leveraging spatial domain information to assist persistent feature matching.
Details
Motivation: Existing topological segmentation loss functions based on persistent homology suffer from ambiguous matching problems because they only rely on topological space information, ignoring spatial domain information.Method: Developed a Spatial-Aware Topological Loss Function that leverages information from the original spatial domain of the image to assist in matching persistent features between segmentation and ground truth.
Result: Extensive experiments on various types of tubular structure images show superior performance in improving topological accuracy compared to state-of-the-art methods.
Conclusion: The proposed spatial-aware approach effectively addresses the ambiguous matching problem in topological segmentation and significantly enhances topological correctness in tubular structure segmentation.
Abstract: Topological correctness is critical for segmentation of tubular structures, which pervade in biomedical images. Existing topological segmentation loss functions are primarily based on the persistent homology of the image. They match the persistent features from the segmentation with the persistent features from the ground truth and minimize the difference between them. However, these methods suffer from an ambiguous matching problem since the matching only relies on the information in the topological space. In this work, we propose an effective and efficient Spatial-Aware Topological Loss Function that further leverages the information in the original spatial domain of the image to assist the matching of persistent features. Extensive experiments on images of various types of tubular structures show that the proposed method has superior performance in improving the topological accuracy of the segmentation compared with state-of-the-art methods. Code is available at https://github.com/JRC-VPLab/SATLoss.
[204] Unveiling Concept Attribution in Diffusion Models
Quang H. Nguyen, Hoang Phan, Khoa D. Doan
Main category: cs.CV
TL;DR: CAD framework analyzes diffusion models by attributing concept generation to specific components, revealing both positive and negative contributors, enabling model editing via component ablation.
Details
Motivation: To understand how diffusion model components jointly demonstrate knowledge and address the black-box nature of trained models, going beyond simple knowledge localization.Method: Component attribution framework that decomposes diffusion models to identify concept-inducing (positive) and concept-suppressing (negative) components through systematic parameter analysis.
Result: Discovered both positive and negative components that contribute to concept generation, enabling development of CAD-Erase and CAD-Amplify editing algorithms for concept removal and amplification.
Conclusion: Provides a complete view of interpreting generative models by revealing the holistic interaction of components, with practical applications for model editing while preserving other knowledge.
Abstract: Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains largely black-box; little do we know about the roles of its components in exhibiting a concept such as objects or styles. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. In this work, we approach diffusion models’ interpretability problem from a more general perspective and pose a question: “How do model components work jointly to demonstrate knowledge?”. To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically the model parameter) in generating a concept. The proposed framework, called Component Attribution for Diffusion Model (CAD), discovers the localization of concept-inducing (positive) components, while interestingly uncovering another type of components that contribute negatively to generating a concept, which is missing in the previous knowledge localization work. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure and CAD-Amplify allows amplification of a generated concept by ablating the positive and negative components, respectively, while retaining knowledge of other concepts. Extensive experimental results validate the significance of both positive and negative components pinpointed by our framework, demonstrating the potential of providing a complete view of interpreting generative models. Our code is available at https://github.com/mail-research/CAD-attribution4diffusion.
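The core intuition of component attribution, measuring how a concept score changes when a component is ablated so that both positive and negative contributors surface, can be sketched with a naive zero-ablation loop. The parameter-level granularity and the `concept_score` callable are placeholders; the paper's attribution estimator is more sophisticated than this.

```python
import torch

@torch.no_grad()
def ablation_attribution(model, concept_score):
    """Naive sketch: zero each named parameter in turn and record the drop in
    a concept score. A positive drop suggests a concept-inducing component,
    a negative drop a concept-suppressing one (concept_score is a stand-in)."""
    base = concept_score(model)
    scores = {}
    for name, p in model.named_parameters():
        saved = p.detach().clone()
        p.zero_()
        scores[name] = base - concept_score(model)
        p.copy_(saved)
    return scores

# toy usage with a stand-in "concept score"
net = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Linear(4, 1))
x = torch.randn(8, 4)
print(ablation_attribution(net, lambda m: m(x).mean().item()))
```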
[205] Federated Learning with Partially Labeled Data: A Conditional Distillation Approach
Pochuan Wang, Chen Shen, Masahiro Oda, Chiou-Shann Fuh, Kensaku Mori, Weichung Wang, Holger R. Roth
Main category: cs.CV
TL;DR: ConDistFL is a federated learning framework using conditional distillation to address partial labeling issues in medical image segmentation, improving accuracy while maintaining efficiency and generalizability.
Details
Motivation: To overcome challenges in medical imaging segmentation including data scarcity, privacy constraints, and partial labeling issues in federated learning settings.Method: Proposes ConDistFL framework incorporating conditional distillation to enable effective learning from partially labeled datasets in decentralized training.
Result: Significantly improves segmentation accuracy across distributed datasets, maintains computational efficiency, and demonstrates strong generalizability in out-of-federation tests including adaptation to unseen contrast phases.
Conclusion: ConDistFL provides an efficient, adaptable solution for collaborative medical image segmentation in privacy-constrained environments.
Abstract: In medical imaging, developing generalized segmentation models that can handle multiple organs and lesions is crucial. However, the scarcity of fully annotated datasets and strict privacy regulations present significant barriers to data sharing. Federated Learning (FL) allows decentralized model training, but existing FL methods often struggle with partial labeling, leading to model divergence and catastrophic forgetting. We propose ConDistFL, a novel FL framework incorporating conditional distillation to address these challenges. ConDistFL enables effective learning from partially labeled datasets, significantly improving segmentation accuracy across distributed and non-uniform datasets. In addition to its superior segmentation performance, ConDistFL maintains computational and communication efficiency, ensuring its scalability for real-world applications. Furthermore, ConDistFL demonstrates remarkable generalizability, significantly outperforming existing FL methods in out-of-federation tests, even adapting to unseen contrast phases (e.g., non-contrast CT images) in our experiments. Extensive evaluations on 3D CT and 2D chest X-ray datasets show that ConDistFL is an efficient, adaptable solution for collaborative medical image segmentation in privacy-constrained settings.
[206] MDP3: A Training-free Approach for List-wise Frame Selection in Video-LLMs
Hui Sun, Shiyin Lu, Huanyu Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Ming Li
Main category: cs.CV
TL;DR: MDP3 is a training-free, model-agnostic frame selection method for Video-LLMs that addresses query relevance, list-wise diversity, and sequentiality using determinantal point processes and Markov decision processes.
Details
Motivation: Video-LLMs face challenges with lengthy visual token sequences due to limited context length and irrelevant frames hindering visual perception. Existing frame selection methods don't capture all three key principles: query relevance, list-wise diversity, and sequentiality.Method: Proposes MDP3 which uses conditional Gaussian kernel in RKHS to estimate frame similarities, applies DPP for query relevance and diversity, segments video and applies DPP within segments conditioned on preceding selections using MDP for sequentiality.
Result: MDP3 provides (1-1/e)-approximate solution to NP-hard frame selection problem with pseudo-polynomial time complexity. Empirically outperforms existing methods significantly.
Conclusion: MDP3 is an effective, robust frame selection method that can be seamlessly integrated into existing Video-LLMs, addressing key challenges in video understanding.
Abstract: Video large language models (Video-LLMs) have made significant progress in understanding videos. However, processing multiple frames leads to lengthy visual token sequences, presenting challenges such as a limited context length that cannot accommodate the entire video and the inclusion of irrelevant frames that hinders visual perception. Hence, effective frame selection is crucial. This paper emphasizes that frame selection should follow three key principles: query relevance, list-wise diversity, and sequentiality. Existing methods, such as uniform frame sampling and query-frame matching, do not capture all of these principles. Thus, we propose Markov decision determinantal point process with dynamic programming (MDP3) for frame selection, a training-free and model-agnostic method that can be seamlessly integrated into existing Video-LLMs. Our method first estimates frame similarities conditioned on the query using a conditional Gaussian kernel within the reproducing kernel Hilbert space (RKHS). We then apply the determinantal point process (DPP) to the similarity matrix to capture both query relevance and list-wise diversity. To incorporate sequentiality, we segment the video and apply DPP within each segment, conditioned on the preceding segment selection, modeled as a Markov decision process (MDP) for allocating selection sizes across segments. Theoretically, MDP3 provides a (1 - 1/e)-approximate solution to the NP-hard list-wise frame selection problem with pseudo-polynomial time complexity, demonstrating its efficiency. Empirically, MDP3 significantly outperforms existing methods, verifying its effectiveness and robustness.
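The relevance-plus-diversity selection can be illustrated with a query-conditioned kernel and a standard greedy log-determinant (MAP) selection for a DPP; the segment-wise MDP that allocates selection sizes is omitted, and the kernel below is only a simplified stand-in for the paper's conditional Gaussian kernel in the RKHS.

```python
import numpy as np

def dpp_select(frame_feats, query_feat, k=8, gamma=1.0):
    """Greedy MAP selection from a query-conditioned DPP (simplified sketch).
    Kernel L = diag(q) S diag(q): S is a Gaussian similarity between frames
    (diversity), q is each frame's relevance to the query."""
    d2 = ((frame_feats[:, None] - frame_feats[None]) ** 2).sum(-1)
    S = np.exp(-gamma * d2)
    q = np.exp(-gamma * ((frame_feats - query_feat) ** 2).sum(-1))
    L = q[:, None] * S * q[None, :]
    selected, remaining = [], list(range(len(frame_feats)))
    for _ in range(min(k, len(remaining))):
        gains = []
        for i in remaining:
            idx = selected + [i]
            sub = L[np.ix_(idx, idx)] + 1e-6 * np.eye(len(idx))
            gains.append(np.linalg.slogdet(sub)[1])   # log-det of the subset
        best = remaining[int(np.argmax(gains))]
        selected.append(best)
        remaining.remove(best)
    # keep temporal order; sequentiality across segments (the MDP) is omitted
    return sorted(selected)

print(dpp_select(np.random.rand(32, 16), np.random.rand(16), k=5))
```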
[207] Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy
Yunhang Shen, Chaoyou Fu, Shaoqi Dong, Xiong Wang, Yi-Fan Zhang, Peixian Chen, Mengdan Zhang, Haoyu Cao, Ke Li, Shaohui Lin, Xiawu Zheng, Yan Zhang, Yiyi Zhou, Ran He, Caifeng Shan, Rongrong Ji, Xing Sun
Main category: cs.CV
TL;DR: Long-VITA is a large multi-modal model for long-context visual-language understanding that processes images, videos, and text over 4K frames or 1M tokens, achieving state-of-the-art performance on multi-modal benchmarks using only public datasets.
Details
Motivation: To address the challenge of long-context multi-modal understanding by developing a model that can concurrently process and analyze image, video, and text modalities over extended sequences while maintaining strong performance on short-context tasks.Method: Proposes a multi-modal training schema starting with large language models, proceeding through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. Implements context-parallelism distributed inference and logits-masked language modeling head for infinite-length inputs.
Result: Achieves state-of-the-art performance on various multi-modal benchmarks using only public datasets (17M samples), with 2x prefill speedup and 4x context length extension in a single node with 8 GPUs.
Conclusion: Long-VITA serves as a competitive baseline and offers valuable insights for advancing long-context multi-modal understanding in the open-source community, being fully open-source and reproducible.
Abstract: We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing modalities of image, video, and text over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models with internal data. Long-VITA is fully open-source and reproducible. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension in a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
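The memory argument behind a logits-masked language modeling head is that, for very long sequences, projecting every position into the vocabulary is wasteful; only the positions whose logits are actually needed have to pass through the head. The sketch below shows that idea with placeholder names and shapes; it is not Long-VITA's implementation.

```python
import torch

def masked_lm_logits(hidden, lm_head, needed_mask):
    """Project only the positions marked in `needed_mask` into the vocabulary,
    avoiding a full (batch, seq_len, vocab) logits tensor for long inputs.
    Illustrative sketch of a logits-masked LM head, not the paper's code."""
    flat = hidden[needed_mask]          # (n_needed, d_model)
    return lm_head(flat)                # (n_needed, vocab)

# toy usage: only the last 4 positions need logits
hidden = torch.randn(1, 1024, 512)
head = torch.nn.Linear(512, 32000, bias=False)
mask = torch.zeros(1, 1024, dtype=torch.bool)
mask[:, -4:] = True
print(masked_lm_logits(hidden, head, mask).shape)   # torch.Size([4, 32000])
```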
[208] MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Swadhin Das, Raksha Sharma
Main category: cs.CV
TL;DR: Proposes a Multi-stream Encoder-decoder Framework (MsEdF) for remote sensing image captioning that improves performance by optimizing spatial representation and language generation through complementary image encoders and enhanced semantic modeling.
Details
Motivation: Existing single-stream architectures struggle with complex spatial patterns and semantic structures in remote sensing images, limiting their ability to accurately describe scenes with high intraclass similarity or contextual ambiguity.Method: Uses a multi-stream encoder that fuses information from two complementary image encoders to promote feature diversity, and a decoder with stacked GRU architecture and element-wise aggregation for improved semantic modeling.
Result: Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
Conclusion: The proposed multi-stream framework effectively addresses limitations of single-stream architectures in remote sensing image captioning by enhancing both spatial feature extraction and semantic relationship capture.
Abstract: Remote sensing images contain complex spatial patterns and semantic structures, which makes it difficult for captioning models to describe them accurately. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model’s ability to accurately describe the image. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the generation of context-aware descriptions, we refine the input sequence’s semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
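A minimal sketch of the two ideas named above, fusing two complementary encoder streams and aggregating stacked GRU outputs element-wise on the decoder side, is given below. The encoder dimensions, the additive fusion, and the vocabulary size are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TwoStreamCaptioner(nn.Module):
    """Sketch: project two encoder streams to a shared space, fuse them, and
    decode with stacked GRUs whose outputs are aggregated element-wise
    (dimensions and fusion choice are illustrative)."""
    def __init__(self, d1=2048, d2=768, d=512, vocab=5000):
        super().__init__()
        self.p1, self.p2 = nn.Linear(d1, d), nn.Linear(d2, d)
        self.gru1 = nn.GRU(d, d, batch_first=True)
        self.gru2 = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab)

    def forward(self, feat_a, feat_b, word_emb):
        fused = self.p1(feat_a) + self.p2(feat_b)   # fuse the two streams
        h0 = fused.unsqueeze(0)                     # initial decoder state
        o1, _ = self.gru1(word_emb, h0)
        o2, _ = self.gru2(o1, h0)
        return self.out(o1 + o2)                    # element-wise aggregation

model = TwoStreamCaptioner()
logits = model(torch.randn(2, 2048), torch.randn(2, 768), torch.randn(2, 12, 512))
print(logits.shape)   # (2, 12, 5000)
```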
[209] Faces of Fairness: Examining Bias in Facial Expression Recognition Datasets and Models
Mohammad Mehdi Hosseini, Ali Pourramezan Fard, Mohammad H. Mahoor
Main category: cs.CV
TL;DR: This study analyzes bias and fairness in Facial Expression Recognition (FER) systems, examining four datasets and six deep learning models. While GPT-4o-mini and ViT achieve highest accuracy, they also show highest bias levels, highlighting the need for bias mitigation in affective computing.
Details
Motivation: Bias and fairness issues in FER datasets and models remain underexplored, despite their critical importance in building fair AI systems for facial expression recognition tasks.Method: Analyzed four FER datasets (AffectNet, ExpW, Fer2013, RAF-DB) and evaluated six deep models including three CNN models (MobileNet, ResNet, XceptionNet) and three transformer models (ViT, CLIP, GPT-4o-mini) for bias and fairness.
Result: AffectNet and ExpW show high generalizability despite data imbalances. GPT-4o-mini and ViT achieve highest accuracy but also display highest bias levels among all tested models.
Conclusion: There is an urgent need for developing new methodologies to mitigate bias and ensure fairness in FER datasets and models, particularly in affective computing applications.
Abstract: Building AI systems, including Facial Expression Recognition (FER), involves two critical aspects: data and model design. Both components significantly influence bias and fairness in FER tasks. Issues related to bias and fairness in FER datasets and models remain underexplored. This study investigates bias sources in FER datasets and models. Four common FER datasets–AffectNet, ExpW, Fer2013, and RAF-DB–are analyzed. The findings demonstrate that AffectNet and ExpW exhibit high generalizability despite data imbalances. Additionally, this research evaluates the bias and fairness of six deep models, including three state-of-the-art convolutional neural network (CNN) models: MobileNet, ResNet, XceptionNet, as well as three transformer-based models: ViT, CLIP, and GPT-4o-mini. Experimental results reveal that while GPT-4o-mini and ViT achieve the highest accuracy scores, they also display the highest levels of bias. These findings underscore the urgent need for developing new methodologies to mitigate bias and ensure fairness in datasets and models, particularly in affective computing applications. See our implementation details at https://github.com/MMHosseini/bias_in_FER.
[210] Frequency-Aware Vision Transformers for High-Fidelity Super-Resolution of Earth System Models
Ehsan Zeraatkar, Salah A Faroughi, Jelena Tešić
Main category: cs.CV
TL;DR: Two frequency-aware super-resolution frameworks (ViSIR and ViFOR) are introduced to enhance Earth System Model outputs, addressing spectral bias in traditional methods and achieving state-of-the-art performance on climate data downscaling.
Details
Motivation: Traditional deep super-resolution methods exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details, which limits their effectiveness for climate science applications requiring fine-scale structures.Method: ViSIR combines Vision Transformers with sinusoidal activations to mitigate spectral bias, while ViFOR integrates explicit Fourier-based filtering for independent low- and high-frequency learning. Both are evaluated on the E3SM-HR Earth system dataset.
Result: Both models outperform leading CNN, GAN, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6 dB improvements in PSNR and significantly higher SSIM. Ablation studies highlight benefits of full-field training and frequency hyperparameters.
Conclusion: ViFOR is established as a state-of-the-art, scalable solution for climate data downscaling, with future extensions planned for temporal super-resolution, multimodal variables, automated parameter selection, and physical conservation constraints.
Abstract: Super-resolution (SR) is crucial for enhancing the spatial fidelity of Earth System Model (ESM) outputs, allowing fine-scale structures vital to climate science to be recovered from coarse simulations. However, traditional deep super-resolution methods, including convolutional and transformer-based models, tend to exhibit spectral bias, reconstructing low-frequency content more readily than valuable high-frequency details. In this work, we introduce two frequency-aware frameworks: the Vision Transformer-Tuned Sinusoidal Implicit Representation (ViSIR), combining Vision Transformers and sinusoidal activations to mitigate spectral bias, and the Vision Transformer Fourier Representation Network (ViFOR), which integrates explicit Fourier-based filtering for independent low- and high-frequency learning. Evaluated on the E3SM-HR Earth system dataset across surface temperature, shortwave, and longwave fluxes, these models outperform leading CNN, GAN, and vanilla transformer baselines, with ViFOR demonstrating up to 2.6 dB improvements in PSNR and significantly higher SSIM. Detailed ablation and scaling studies highlight the benefit of full-field training, the impact of frequency hyperparameters, and the potential for generalization. The results establish ViFOR as a state-of-the-art, scalable solution for climate data downscaling. Future extensions will address temporal super-resolution, multimodal climate variables, automated parameter selection, and integration of physical conservation constraints to broaden scientific applicability.
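The explicit Fourier-based split into low- and high-frequency branches can be illustrated with a simple FFT mask: anything inside a radial cutoff is the low-frequency field, the residual is the high-frequency field, and the two sum back to the input. The cutoff value and circular mask are arbitrary illustrations, not the paper's filter design.

```python
import numpy as np

def fourier_split(field, cutoff=0.1):
    """Split a 2D field into low- and high-frequency parts via a circular mask
    in the FFT domain (cutoff in cycles per sample; illustrative only)."""
    H, W = field.shape
    F = np.fft.fftshift(np.fft.fft2(field))
    fy, fx = np.meshgrid(np.fft.fftshift(np.fft.fftfreq(H)),
                         np.fft.fftshift(np.fft.fftfreq(W)), indexing="ij")
    mask = np.sqrt(fx ** 2 + fy ** 2) <= cutoff
    low = np.fft.ifft2(np.fft.ifftshift(F * mask)).real
    return low, field - low

field = np.random.rand(64, 64)
low, high = fourier_split(field)
print(np.allclose(low + high, field))   # True: the two branches reconstruct the input
```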
[211] CAUSAL3D: A Comprehensive Benchmark for Causal Learning from Visual Data
Disheng Liu, Yiran Qiao, Wuche Liu, Yiren Lu, Yunlai Zhou, Tuo Liang, Yu Yin, Jing Ma
Main category: cs.CV
TL;DR: Causal3D is a new benchmark that combines structured data and visual representations to evaluate causal reasoning abilities in AI models, featuring 19 diverse 3D-scene datasets with varying complexity.
Details
Motivation: There's a lack of benchmarks for assessing AI models' abilities to infer latent causality from complex visual data, despite causal reasoning being crucial for true intelligence.Method: Created Causal3D benchmark with 19 3D-scene datasets integrating tables and images, evaluated multiple state-of-the-art methods including causal discovery, causal representation learning, and LLMs/VLMs.
Result: Performance declines significantly as causal structures grow more complex without prior knowledge, revealing challenges even advanced methods face in complex causal scenarios.
Conclusion: Causal3D serves as a vital resource for advancing causal reasoning in computer vision and fostering trustworthy AI in critical domains.
Abstract: True intelligence hinges on the ability to uncover and leverage hidden causal relations. Despite significant progress in AI and computer vision (CV), there remains a lack of benchmarks for assessing models’ abilities to infer latent causality from complex visual data. In this paper, we introduce Causal3D, a novel and comprehensive benchmark that integrates structured data (tables) with corresponding visual representations (images) to evaluate causal reasoning. Designed within a systematic framework, Causal3D comprises 19 3D-scene datasets capturing diverse causal relations, views, and backgrounds, enabling evaluations across scenes of varying complexity. We assess multiple state-of-the-art methods, including classical causal discovery, causal representation learning, and large/vision-language models (LLMs/VLMs). Our experiments show that as causal structures grow more complex without prior knowledge, performance declines significantly, highlighting the challenges even advanced methods face in complex causal scenarios. Causal3D serves as a vital resource for advancing causal reasoning in CV and fostering trustworthy AI in critical domains.
[212] Polygonal network disorder and the turning distance
Alex Dolce, Ryan Lavelle, Bernard Scott, Ashlyn Urbanski, Joseph Klobusicky
Main category: cs.CV
TL;DR: This paper introduces turning disorders for polygonal networks by averaging turning distances between network faces and ordered shapes, derives closed-form expressions for special cases, and shows computational improvements for regular polygons.
Details
Motivation: To extend the turning distance metric from polygons to polygonal planar networks and develop efficient computational methods for measuring network disorder.Method: Define turning disorders by averaging turning distances between network faces and ordered shapes (regular polygons/circles), derive closed-form expressions for special classes, and apply to various network examples including Archimedean lattices and stochastic processes.
Result: Achieved O((m+n)log(m+n)) time complexity for computing 2-turning distances between regular polygons (improved from O(mnlog(mn))), derived exact expressions for Archimedean lattices, and showed that different turning disorder definitions capture different aspects of network disorder.
Conclusion: Turning disorders provide a flexible framework for quantifying network disorder, with computational efficiency improvements for regular shapes and the ability to capture different disorder notions through choice of ordered shape and weighting schemes.
Abstract: The turning distance is a well-studied metric for measuring the similarity between two polygons. This metric is constructed by taking an $L^p$ distance between step functions which track each shape’s tangent angle of a path tracing its boundary. In this study, we introduce turning disorders for polygonal planar networks, defined by averaging turning distances between network faces and “ordered” shapes (regular polygons or circles). We derive closed-form expressions of turning distances for special classes of regular polygons, related to the divisibility of $m$ and $n$, and also between regular polygons and circles. These formulas are used to show that the time for computing the 2-turning distances reduces to $O((m+n) \log(m+n))$ when both shapes are regular polygons, an improvement from $O(mn\log(mn))$ operations needed to compute distances between general polygons of $n$ and $m$ sides. We also apply these formulas to several examples of network microstructure with varying disorder. For Archimedean lattices, a class of regular tilings, we can express turning disorders with exact expressions. We also consider turning disorders applied to two examples of stochastic processes on networks: spring networks evolving under T1 moves and polygonal rupture processes. We find that the two aspects of defining different turning disorders, the choice of ordered shape and whether to apply area-weighting, can capture different notions of network disorder.
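For readers unfamiliar with the metric, the sketch below builds the turning function (cumulative tangent angle versus normalized arc length) of a closed polygon and takes an L2 distance between two such functions on a shared grid. It omits the minimization over starting point and rotation that the full turning distance requires, so it is only an illustration of the construction.

```python
import numpy as np

def turning_function(poly, samples=512):
    """Cumulative tangent angle of a closed polygon as a step function of
    normalized arc length, sampled on a uniform grid."""
    poly = np.asarray(poly, float)
    edges = np.roll(poly, -1, axis=0) - poly
    lengths = np.linalg.norm(edges, axis=1)
    s = np.concatenate([[0.0], np.cumsum(lengths)]) / lengths.sum()
    angles = np.arctan2(edges[:, 1], edges[:, 0])
    turns = np.concatenate([[angles[0]], angles[0] + np.cumsum(
        (np.diff(angles) + np.pi) % (2 * np.pi) - np.pi)])
    grid = np.linspace(0, 1, samples, endpoint=False)
    idx = np.searchsorted(s, grid, side="right") - 1
    return turns[np.clip(idx, 0, len(turns) - 1)]

def turning_distance_l2(p, q, samples=512):
    """L2 distance between two turning functions on a shared grid (sketch:
    the true metric also minimizes over starting point and rotation)."""
    d = turning_function(p, samples) - turning_function(q, samples)
    return np.sqrt(np.mean(d ** 2))

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
triangle = [(0, 0), (1, 0), (0.5, 0.9)]
print(turning_distance_l2(square, triangle))
```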
[213] DynCIM: Dynamic Curriculum for Imbalanced Multimodal Learning
Chengxuan Qian, Kai Han, Jiaxin Liu, Zhenlong Yuan, Zhengzhong Zhu, Jingchao Wang, Chongwen Lyu, Jun Chen, Zhe Liu
Main category: cs.CV
TL;DR: DynCIM is a dynamic curriculum learning framework that addresses modality and sample imbalances in multimodal learning through sample-level and modality-level curricula, plus adaptive fusion, achieving state-of-the-art performance across multiple datasets.
Details
Motivation: Multimodal learning underutilizes collaboration due to disparities in data quality and modality representation capabilities, creating imbalances that hinder effective fusion.Method: DynCIM uses sample-level curriculum (assessing difficulty via prediction deviation, consistency, stability) and modality-level curriculum (measuring contributions globally and locally), with gating-based dynamic fusion to adaptively adjust modality contributions.
Result: Extensive experiments on six multimodal benchmarking datasets show DynCIM consistently outperforms state-of-the-art methods in both bimodal and trimodal scenarios.
Conclusion: DynCIM effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks.
Abstract: Multimodal learning integrates complementary information from diverse modalities to enhance the decision-making process. However, the potential of multimodal collaboration remains under-exploited due to disparities in data quality and modality representation capabilities. To address this, we introduce DynCIM, a novel dynamic curriculum learning framework designed to quantify the inherent imbalances from both sample and modality perspectives. DynCIM employs a sample-level curriculum to dynamically assess each sample’s difficulty according to prediction deviation, consistency, and stability, while a modality-level curriculum measures modality contributions from global and local. Furthermore, a gating-based dynamic fusion mechanism is introduced to adaptively adjust modality contributions, minimizing redundancy and optimizing fusion effectiveness. Extensive experiments on six multimodal benchmarking datasets, spanning both bimodal and trimodal scenarios, demonstrate that DynCIM consistently outperforms state-of-the-art methods. Our approach effectively mitigates modality and sample imbalances while enhancing adaptability and robustness in multimodal learning tasks. Our code is available at https://github.com/Raymond-Qiancx/DynCIM.
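The gating-based dynamic fusion step can be sketched with per-sample softmax gates computed from the concatenated modality features; feature dimensions, the shared projection, and the softmax gate are illustrative choices, not DynCIM's exact fusion module.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Sketch of gating-based dynamic fusion: per-sample softmax gates decide
    how much each modality contributes (dimensions are illustrative)."""
    def __init__(self, dims=(512, 256, 128), d=256):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(di, d) for di in dims)
        self.gate = nn.Linear(d * len(dims), len(dims))

    def forward(self, feats):
        z = [p(f) for p, f in zip(self.proj, feats)]             # shared space
        g = torch.softmax(self.gate(torch.cat(z, dim=-1)), dim=-1)
        return sum(g[:, i:i + 1] * z[i] for i in range(len(z)))  # weighted sum

fusion = GatedFusion()
out = fusion([torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 128)])
print(out.shape)   # (4, 256)
```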
[214] Stealthy Patch-Wise Backdoor Attack in 3D Point Cloud via Curvature Awareness
Yu Feng, Dingxin Zhang, Runkai Zhao, Yong Xia, Heng Huang, Weidong Cai
Main category: cs.CV
TL;DR: SPBA is a patch-wise backdoor attack for 3D point clouds that uses curvature-based imperceptibility scores to inject optimized triggers into less sensitive patches, achieving high stealthiness and attack effectiveness with improved computational efficiency.
Details
Motivation: Existing 3D point cloud backdoor attacks use sample-wise global modifications that have low imperceptibility, and optimization-based approaches are computationally expensive. There's a need for more stealthy and efficient backdoor attacks.Method: Decompose point clouds into local patches, use curvature-based imperceptibility scores to identify visually less sensitive patches, and optimize a unified patch-wise trigger that perturbs spectral features of selected patches.
Result: SPBA surpasses state-of-the-art backdoor attacks in both attack effectiveness and resistance to defense methods on ModelNet40 and ShapeNetPart datasets, while maintaining high stealthiness and optimization efficiency.
Conclusion: The proposed SPBA framework demonstrates that patch-wise backdoor attacks with curvature-guided trigger injection can achieve superior performance in terms of stealthiness, attack effectiveness, and computational efficiency compared to existing approaches.
Abstract: Backdoor attacks pose a severe threat to deep neural networks (DNNs) by implanting hidden backdoors that can be activated with predefined triggers to manipulate model behaviors maliciously. Existing 3D point cloud backdoor attacks primarily rely on sample-wise global modifications, which suffer from low imperceptibility. Although optimization can improve stealthiness, optimizing sample-wise triggers significantly increases computational cost. To address these limitations, we propose the Stealthy Patch-Wise Backdoor Attack (SPBA), the first patch-wise backdoor attack framework for 3D point clouds. Specifically, SPBA decomposes point clouds into local patches and employs a curvature-based imperceptibility score to guide trigger injection into visually less sensitive patches. By optimizing a unified patch-wise trigger that perturbs spectral features of selected patches, SPBA significantly enhances optimization efficiency while maintaining high stealthiness. Extensive experiments on ModelNet40 and ShapeNetPart further demonstrate that SPBA surpasses prior state-of-the-art backdoor attacks in both attack effectiveness and resistance to defense methods. The code is available at https://github.com/HazardFY/SPBA.
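A common proxy for a curvature-based patch score is the PCA "surface variation" of each point's neighborhood (smallest eigenvalue over the eigenvalue sum), averaged over a patch; high-variation patches tend to hide perturbations better. The neighborhood size and the exact score below are assumptions, not the paper's imperceptibility measure.

```python
import numpy as np

def patch_curvature_scores(points, patches, k=16):
    """Curvature proxy per patch: for every point, compute the surface
    variation lambda_min / (lambda_1 + lambda_2 + lambda_3) from the PCA of
    its k nearest neighbors, then average over each patch (illustrative)."""
    d = ((points[:, None] - points[None]) ** 2).sum(-1)
    knn = np.argsort(d, axis=1)[:, 1:k + 1]          # exclude the point itself
    variation = np.empty(len(points))
    for i, nbr in enumerate(knn):
        cov = np.cov(points[nbr].T)
        eig = np.sort(np.linalg.eigvalsh(cov))
        variation[i] = eig[0] / max(eig.sum(), 1e-12)
    return {pid: variation[idx].mean() for pid, idx in patches.items()}

pts = np.random.rand(256, 3)
patches = {p: np.arange(p * 64, (p + 1) * 64) for p in range(4)}
print(patch_curvature_scores(pts, patches))
```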
[215] Superpowering Open-Vocabulary Object Detectors for X-ray Vision
Pablo Garcia-Fernandez, Lorenzo Vaquero, Mingxuan Liu, Feng Xue, Daniel Cores, Nicu Sebe, Manuel Mucientes, Elisa Ricci
Main category: cs.CV
TL;DR: RAXO is a training-free framework that adapts RGB open-vocabulary object detectors for X-ray security screening by creating high-quality X-ray class descriptors through dual-source retrieval and material transfer, achieving significant performance improvements.
Details
Motivation: Open-vocabulary object detection can revolutionize security screening by recognizing any item in X-ray scans, but faces challenges from data scarcity and modality gaps that prevent direct use of RGB-based solutions.Method: RAXO repurposes off-the-shelf RGB OvOD detectors using a dual-source retrieval strategy that gathers RGB images from the web and enriches them via X-ray material transfer, replacing text-based classification with visual descriptors.
Result: RAXO consistently improves OvOD performance with an average mAP increase of up to 17.0 points over base detectors, and introduces DET-COMPASS benchmark with 300+ object categories for large-scale evaluation.
Conclusion: RAXO provides an effective training-free solution for X-ray open-vocabulary detection, overcoming data scarcity and modality gaps while enabling robust security screening applications.
Abstract: Open-vocabulary object detection (OvOD) is set to revolutionize security screening by enabling systems to recognize any item in X-ray scans. However, developing effective OvOD models for X-ray imaging presents unique challenges due to data scarcity and the modality gap that prevents direct adoption of RGB-based solutions. To overcome these limitations, we propose RAXO, a training-free framework that repurposes off-the-shelf RGB OvOD detectors for robust X-ray detection. RAXO builds high-quality X-ray class descriptors using a dual-source retrieval strategy. It gathers relevant RGB images from the web and enriches them via a novel X-ray material transfer mechanism, eliminating the need for labeled databases. These visual descriptors replace text-based classification in OvOD, leveraging intra-modal feature distances for robust detection. Extensive experiments demonstrate that RAXO consistently improves OvOD performance, providing an average mAP increase of up to 17.0 points over base detectors. To further support research in this emerging field, we also introduce DET-COMPASS, a new benchmark featuring bounding box annotations for over 300 object categories, enabling large-scale evaluation of OvOD in X-ray. Code and dataset available at: https://github.com/PAGF188/RAXO.
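Replacing text-based classification with visual class descriptors reduces, at its simplest, to nearest-prototype matching: each class keeps a prototype feature built from retrieved (and material-transferred) images, and each detected region is assigned to the most cosine-similar prototype. The sketch below shows only this matching step; how RAXO builds the descriptors is omitted.

```python
import numpy as np

def classify_with_visual_descriptors(region_feats, class_prototypes):
    """Assign each detected region to the class whose visual prototype is
    closest in cosine similarity (sketch of descriptor-based classification;
    prototype construction from web images and material transfer is omitted)."""
    def l2norm(x):
        return x / np.clip(np.linalg.norm(x, axis=-1, keepdims=True), 1e-8, None)
    names = list(class_prototypes)
    P = l2norm(np.stack([class_prototypes[n] for n in names]))   # (C, d)
    R = l2norm(region_feats)                                     # (N, d)
    sims = R @ P.T
    return [names[i] for i in sims.argmax(axis=1)], sims

protos = {"knife": np.random.rand(128), "battery": np.random.rand(128)}
labels, scores = classify_with_visual_descriptors(np.random.rand(5, 128), protos)
print(labels)
```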
[216] LiDAR Remote Sensing Meets Weak Supervision: Concepts, Methods, and Perspectives
Yuan Gao, Shaobo Xia, Pu Wang, Xiaohuan Xi, Sheng Nie, Cheng Wang
Main category: cs.CV
TL;DR: This paper provides a systematic review of Weakly Supervised Learning (WSL) approaches for LiDAR remote sensing, unifying data interpretation and parameter inversion tasks under a common framework to address limitations of costly labeled data requirements.
Details
Motivation: Traditional LiDAR remote sensing relies heavily on costly and labor-intensive labeled data and field measurements, which constrains scalability and spatiotemporal adaptability. WSL offers a unified framework to overcome these limitations.Method: The paper systematically reviews WSL techniques for LiDAR including incomplete supervision (sparse point labels), inexact supervision (scene-level tags), inaccurate supervision (noisy labels), and cross-domain supervision. It covers methods like pseudo-labeling, consistency regularization, self-training, and label refinement.
Result: The review demonstrates how WSL enables robust learning from limited and weak annotations, addresses LiDAR-specific challenges (irregular geometry, data sparsity, domain heterogeneity), and facilitates joint learning with other remote-sensing data for continuous surface-parameter retrieval.
Conclusion: WSL serves as a bridge between LiDAR and foundation models, enabling leveraging of large-scale multimodal datasets while reducing labeling costs. Future directions include broader WSL-driven advances in generalization, open-world adaptation, and scalable LiDAR remote sensing.
Abstract: Light detection and ranging (LiDAR) remote sensing encompasses two major directions: data interpretation and parameter inversion. However, both directions rely heavily on costly and labor-intensive labeled data and field measurements, which constrains their scalability and spatiotemporal adaptability. Weakly Supervised Learning (WSL) provides a unified framework to address these limitations. This paper departs from the traditional view that treats interpretation and inversion as separate tasks and offers a systematic review of recent advances in LiDAR remote sensing from a unified WSL perspective. We cover typical WSL settings including incomplete supervision (e.g., sparse point labels), inexact supervision (e.g., scene-level tags), inaccurate supervision (e.g., noisy labels), and cross-domain supervision (e.g., domain adaptation/generalization), and corresponding techniques such as pseudo-labeling, consistency regularization, self-training, and label refinement, which collectively enable robust learning from limited and weak annotations. We further analyze LiDAR-specific challenges (e.g., irregular geometry, data sparsity, domain heterogeneity) that require tailored weak supervision, and examine how sparse LiDAR observations can guide joint learning with other remote-sensing data for continuous surface-parameter retrieval. Finally, we highlight future directions where WSL acts as a bridge between LiDAR and foundation models to leverage large-scale multimodal datasets and reduce labeling costs, while also enabling broader WSL-driven advances in generalization, open-world adaptation, and scalable LiDAR remote sensing.
[217] Boosting Omnidirectional Stereo Matching with a Pre-trained Depth Foundation Model
Jannik Endres, Oliver Hahn, Charles Corbière, Simone Schaub-Meyer, Stefan Roth, Alexandre Alahi
Main category: cs.CV
TL;DR: DFI-OmniStereo is a novel omnidirectional stereo matching method that uses pre-trained monocular depth estimation within an iterative optimization framework, achieving state-of-the-art results with 16% reduction in disparity MAE.
Details
Motivation: Omnidirectional depth perception is crucial for mobile robotics, but existing stereo matching approaches have limited accuracy due to scarcity of real-world data across diverse environments and conditions.Method: Leverages large-scale pre-trained foundation model for relative monocular depth estimation within iterative optimization-based stereo matching architecture, with two-stage training strategy and scale-invariant fine-tuning.
Result: Achieves state-of-the-art results on Helvipad dataset, reducing disparity MAE by approximately 16% compared to previous best omnidirectional stereo method.
Conclusion: The proposed DFI-OmniStereo method effectively improves omnidirectional stereo matching accuracy by integrating pre-trained monocular depth features into an optimization framework.
Abstract: Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.
[218] FaceCloak: Learning to Protect Face Templates
Sudipta Banerjee, Anubhav Jain, Chinmay Hegde, Nasir Memon
Main category: cs.CV
TL;DR: FaceCloak is a neural network framework that protects face templates by generating renewable binary cloaks to thwart inversion attacks while maintaining biometric utility and unlinkability.
Details
Motivation: To address security and privacy concerns raised by generative models that can reconstruct face images from encoded representations, protecting face templates from inversion attacks.
Method: Generates smart, renewable binary cloaks from single face templates on the fly, proactively thwarting inversion attacks while provably retaining biometric utility and unlinkability.
Result: Outperforms leading baselines in biometric matching and resiliency to reconstruction attacks, with extremely fast inference time (0.28 ms) and lightweight implementation (0.57 MB).
Conclusion: FaceCloak effectively protects face templates against inversion attacks while maintaining practical utility, offering a fast and lightweight solution for face template protection.
Abstract: Generative models can reconstruct face images from encoded representations (templates) bearing remarkable likeness to the original face, raising security and privacy concerns. We present FaceCloak, a neural network framework that protects face templates by generating smart, renewable binary cloaks. Our method proactively thwarts inversion attacks by cloaking face templates with unique disruptors synthesized from a single face template on the fly while provably retaining biometric utility and unlinkability. Our cloaked templates can suppress sensitive attributes while generalizing to novel feature extraction schemes and outperform leading baselines in terms of biometric matching and resiliency to reconstruction attacks. FaceCloak-based matching is extremely fast (inference time = 0.28 ms) and light (0.57 MB). We have released our code (https://github.com/sudban3089/FaceCloak.git) for reproducible research.
[219] DArFace: Deformation Aware Robustness for Low Quality Face Recognition
Sadaf Gulshad, Abdullah Aldahlawi
Main category: cs.CV
TL;DR: DArFace is a deformation-aware robust face recognition framework that addresses performance degradation in low-quality facial images by modeling both global transformations and local elastic deformations through adversarial training and contrastive learning.
Details
Motivation: Facial recognition systems perform poorly on low-quality images (low resolution, motion blur, distortions) due to domain gap from high-quality training data. Existing methods overlook local non-rigid deformations present in real-world scenarios.
Method: Adversarially integrates global transformations (rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Uses contrastive objective to enforce identity consistency across different deformed views without requiring paired high/low-quality samples.
Result: Extensive evaluations on TinyFace, IJB-B, and IJB-C benchmarks show DArFace surpasses state-of-the-art methods with significant gains attributed to local deformation modeling.
Conclusion: DArFace effectively enhances robustness to real-world degradations by incorporating local deformation modeling alongside global transformations, achieving superior performance on low-quality face recognition benchmarks.
Abstract: Facial recognition systems have achieved remarkable success by leveraging deep neural networks, advanced loss functions, and large-scale datasets. However, their performance often deteriorates in real-world scenarios involving low-quality facial images. Such degradations, common in surveillance footage or standoff imaging, include low resolution, motion blur, and various distortions, resulting in a substantial domain gap from the high-quality data typically used during training. While existing approaches attempt to address robustness by modifying network architectures or modeling global spatial transformations, they frequently overlook local, non-rigid deformations that are inherently present in real-world settings. In this work, we introduce DArFace, a Deformation-Aware robust Face recognition framework that enhances robustness to such degradations without requiring paired high- and low-quality training samples. Our method adversarially integrates both global transformations (e.g., rotation, translation) and local elastic deformations during training to simulate realistic low-quality conditions. Moreover, we introduce a contrastive objective to enforce identity consistency across different deformed views. Extensive evaluations on low-quality benchmarks including TinyFace, IJB-B, and IJB-C demonstrate that DArFace surpasses state-of-the-art methods, with significant gains attributed to the inclusion of local deformation modeling.
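The abstract above combines global rigid transforms with local elastic warps during training; the snippet below is a plain (non-adversarial) sketch of such a degradation pipeline in PyTorch, with the rotation range, shift range, displacement scale, and smoothing kernel chosen only for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def random_global_transform(img, max_deg=10.0, max_shift=0.05):
    """Random rotation + translation of a batch of face crops. img: (B, C, H, W)."""
    b, dev = img.size(0), img.device
    ang = (torch.rand(b, device=dev) * 2 - 1) * max_deg * torch.pi / 180
    tx = (torch.rand(b, device=dev) * 2 - 1) * max_shift
    ty = (torch.rand(b, device=dev) * 2 - 1) * max_shift
    theta = torch.stack([
        torch.stack([ang.cos(), -ang.sin(), tx], dim=1),
        torch.stack([ang.sin(),  ang.cos(), ty], dim=1),
    ], dim=1)                                               # (B, 2, 3) affine matrices
    grid = F.affine_grid(theta, list(img.shape), align_corners=False)
    return F.grid_sample(img, grid, align_corners=False)

def random_elastic_deform(img, alpha=0.03, blur=9):
    """Local non-rigid warp: a smoothed random displacement field."""
    b, _, h, w = img.shape
    disp = torch.randn(b, 2, h, w, device=img.device)
    disp = F.avg_pool2d(disp, blur, stride=1, padding=blur // 2) * alpha
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h, device=img.device),
                            torch.linspace(-1, 1, w, device=img.device), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0)        # identity sampling grid
    return F.grid_sample(img, base + disp.permute(0, 2, 3, 1), align_corners=False)
```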
[220] Video-SafetyBench: A Benchmark for Safety Evaluation of Video LVLMs
Xuannan Liu, Zekun Li, Zheqi He, Peipei Li, Shuhan Xia, Xing Cui, Huaibo Huang, Xi Yang, Ran He
Main category: cs.CV
TL;DR: Video-SafetyBench is the first benchmark for evaluating Large Vision-Language Models’ safety under video-text attacks, revealing significant vulnerabilities to video-induced attacks.
Details
Motivation: Existing multimodal safety evaluations focus on static images, ignoring temporal dynamics of video that may induce distinct safety risks.
Method: Developed a comprehensive benchmark with 2,264 video-text pairs across 48 unsafe categories, using a controllable pipeline that decomposes video semantics into subject images and motion text, and proposed RJScore metric for evaluating uncertain outputs.
Result: Benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks.
Conclusion: Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies for LVLMs.
Abstract: The increasing deployment of Large Vision-Language Models (LVLMs) raises safety concerns under potential malicious inputs. However, existing multimodal safety evaluations primarily focus on model vulnerabilities exposed by static image inputs, ignoring the temporal dynamics of video that may induce distinct safety risks. To bridge this gap, we introduce Video-SafetyBench, the first comprehensive benchmark designed to evaluate the safety of LVLMs under video-text attacks. It comprises 2,264 video-text pairs spanning 48 fine-grained unsafe categories, each pairing a synthesized video with either a harmful query, which contains explicit malice, or a benign query, which appears harmless but triggers harmful behavior when interpreted alongside the video. To generate semantically accurate videos for safety evaluation, we design a controllable pipeline that decomposes video semantics into subject images (what is shown) and motion text (how it moves), which jointly guide the synthesis of query-relevant videos. To effectively evaluate uncertain or borderline harmful outputs, we propose RJScore, a novel LLM-based metric that incorporates the confidence of judge models and human-aligned decision threshold calibration. Extensive experiments show that benign-query video composition achieves average attack success rates of 67.2%, revealing consistent vulnerabilities to video-induced attacks. We believe Video-SafetyBench will catalyze future research into video-based safety evaluation and defense strategies.
[221] Long-RVOS: A Comprehensive Benchmark for Long-term Referring Video Object Segmentation
Tianming Liang, Haichao Jiang, Yuting Yang, Chaolei Tan, Shuai Li, Wei-Shi Zheng, Jian-Fang Hu
Main category: cs.CV
TL;DR: Long-RVOS is a new benchmark for long-term referring video object segmentation with 2,000+ videos averaging 60+ seconds, addressing challenges like occlusion and disappearance-reappearance. The paper also proposes ReferMo, a baseline method using motion information and local-to-global architecture.
Details
Motivation: Existing RVOS datasets focus on short video clips with salient objects, lacking practical scenarios with long videos where objects undergo occlusion, disappearance-reappearance, and shot changes.
Method: Proposed ReferMo method integrates motion information to expand temporal receptive field and uses local-to-global architecture to capture both short-term dynamics and long-term dependencies.
Result: Current state-of-the-art methods struggle significantly with long-video challenges. ReferMo achieves significant improvements over existing methods in long-term scenarios.
Conclusion: Long-RVOS benchmark and ReferMo baseline can drive future RVOS research towards more realistic and long-form videos, addressing current limitations in temporal understanding.
Abstract: Referring video object segmentation (RVOS) aims to identify, track and segment the objects in a video based on language descriptions, which has received great attention in recent years. However, existing datasets remain focused on short video clips within several seconds, with salient objects visible in most frames. To advance the task towards more practical scenarios, we introduce Long-RVOS, a large-scale benchmark for long-term referring video object segmentation. Long-RVOS contains 2,000+ videos of an average duration exceeding 60 seconds, covering a variety of objects that undergo occlusion, disappearance-reappearance and shot changing. The objects are manually annotated with three different types of descriptions to individually evaluate the understanding of static attributes, motion patterns and spatiotemporal relationships. Moreover, unlike previous benchmarks that rely solely on the per-frame spatial evaluation, we introduce two new metrics to assess the temporal and spatiotemporal consistency. We benchmark 6 state-of-the-art methods on Long-RVOS. The results show that current approaches struggle severely with the long-video challenges. To address this, we further propose ReferMo, a promising baseline method that integrates motion information to expand the temporal receptive field, and employs a local-to-global architecture to capture both short-term dynamics and long-term dependencies. Despite its simplicity, ReferMo achieves significant improvements over current methods in long-term scenarios. We hope that Long-RVOS and our baseline can drive future RVOS research towards tackling more realistic and long-form videos.
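The benchmark introduces temporal- and spatiotemporal-consistency metrics whose exact definitions are not reproduced in this summary. Purely as an illustration of why per-frame spatial scores alone are insufficient, a naive consistency measure over a predicted mask tube could look like the following sketch (not the paper's metric):

```python
import numpy as np

def iou(a, b):
    """IoU of two boolean masks."""
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union > 0 else 1.0

def naive_temporal_consistency(pred_masks, gt_masks):
    """Mean per-frame IoU plus agreement between consecutive predictions.

    pred_masks, gt_masks: lists of (H, W) boolean arrays, one per frame.
    A method can score well per frame yet flicker across frames, which the
    second term penalizes.
    """
    per_frame = np.mean([iou(p, g) for p, g in zip(pred_masks, gt_masks)])
    smooth = np.mean([iou(pred_masks[t], pred_masks[t + 1])
                      for t in range(len(pred_masks) - 1)]) if len(pred_masks) > 1 else 1.0
    return per_frame, smooth
```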
[222] Global urban visual perception varies across demographics and personalities
Matias Quintana, Youlong Gu, Xiucheng Liang, Yujun Hou, Koichi Ito, Yihan Zhu, Mahmoud Abdelrahman, Filip Biljecki
Main category: cs.CV
TL;DR: A large-scale study reveals how demographics and personality traits shape urban street perceptions across different cultures, showing that current machine learning models often misrepresent human preferences by overlooking local context.
Details
Motivation: Current urban planning approaches combine multi-cultural responses without considering demographic differences, potentially amplifying biases and obscuring important variations in how people perceive streetscapes.
Method: Conducted a large-scale urban visual perception survey using street view imagery with 1,000 participants from five countries and 45 nationalities, collecting data on six traditional and four new perception indicators while examining demographic and personality factors.
Result: The SPECS dataset reveals significant demographic- and personality-based differences in street perception. Machine learning models trained on existing datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses.
Conclusion: Urban perception studies must incorporate demographic and personality factors to avoid biased outcomes, as local context significantly influences how people evaluate streetscapes.
Abstract: Understanding people’s preferences is crucial for urban planning, yet current approaches often combine responses from multi-cultural populations, obscuring demographic differences and risking amplifying biases. We conducted a large-scale urban visual perception survey of streetscapes worldwide using street view imagery, examining how demographics – including gender, age, income, education, race and ethnicity, and personality traits – shape perceptions among 1,000 participants with balanced demographics from five countries and 45 nationalities. This dataset, Street Perception Evaluation Considering Socioeconomics (SPECS), reveals demographic- and personality-based differences across six traditional indicators – safe, lively, wealthy, beautiful, boring, depressing – and four new ones – live nearby, walk, cycle, green. Location-based sentiments further shape these preferences. Machine learning models trained on existing global datasets tend to overestimate positive indicators and underestimate negative ones compared to human responses, underscoring the need for local context. Our study aspires to rectify the myopic treatment of street perception, which rarely considers demographics or personality traits.
[223] A Generalized Label Shift Perspective for Cross-Domain Gaze Estimation
Hao-Ran Yang, Xiaohui Chen, Chuan-Xian Ren
Main category: cs.CV
TL;DR: This paper introduces a Generalized Label Shift (GLS) perspective to Cross-domain Gaze Estimation (CDGE), proposing a framework with importance reweighting and conditional operator discrepancy estimation to address domain shift problems.
Details
Motivation: Existing CDGE methods that extract domain-invariant features are insufficient according to GLS theory. The paper aims to model cross-domain gaze estimation as a label and conditional shift problem to improve generalization to new target domains.
Method: Proposes a GLS correction framework with importance reweighting strategy based on truncated Gaussian distribution to handle continuity challenges in label shift correction, and derives probability-aware estimation of conditional operator discrepancy for conditional invariant learning.
Result: Extensive experiments on standard CDGE tasks with different backbone models validate the superior generalization capability across domains and applicability on various models of the proposed method.
Conclusion: The GLS perspective and proposed framework effectively address domain shift in gaze estimation, demonstrating improved generalization performance across different domains and model architectures.
Abstract: Aiming to generalize the well-trained gaze estimation model to new target domains, Cross-domain Gaze Estimation (CDGE) is developed for real-world application scenarios. Existing CDGE methods typically extract the domain-invariant features to mitigate domain shift in feature space, which is proved insufficient by Generalized Label Shift (GLS) theory. In this paper, we introduce a novel GLS perspective to CDGE and model the cross-domain problem as a label and conditional shift problem. A GLS correction framework is presented and a feasible realization is proposed, in which an importance reweighting strategy based on truncated Gaussian distribution is introduced to overcome the continuity challenges in label shift correction. To embed the reweighted source distribution into conditional invariant learning, we further derive a probability-aware estimation of conditional operator discrepancy. Extensive experiments on standard CDGE tasks with different backbone models validate the superior cross-domain generalization capability and broad model applicability of the proposed method.
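The label-shift correction relies on importance reweighting with a truncated Gaussian target density over continuous gaze labels. The paper's exact estimator is not reproduced here; the sketch below shows the generic idea for a one-dimensional label, with a simple kernel density estimate and the bandwidth standing in as illustrative choices for the source density.

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_gaussian_weights(src_labels, tgt_mean, tgt_std, lo, hi, bandwidth=0.05):
    """Importance weights w(y) ~ p_tgt(y) / p_src(y) for continuous labels.

    The target label density is modeled as a Gaussian truncated to the valid
    gaze range [lo, hi]; the source density is a crude kernel density estimate.
    """
    a, b = (lo - tgt_mean) / tgt_std, (hi - tgt_mean) / tgt_std
    p_tgt = truncnorm.pdf(src_labels, a, b, loc=tgt_mean, scale=tgt_std)

    # Gaussian-kernel density estimate of the source label distribution.
    diffs = (src_labels[:, None] - src_labels[None, :]) / bandwidth
    p_src = np.exp(-0.5 * diffs ** 2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

    w = p_tgt / np.clip(p_src, 1e-8, None)
    return w / w.mean()          # normalize so the average weight is 1
```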
[224] VSA: Faster Video Diffusion with Trainable Sparse Attention
Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, Hao Zhang
Main category: cs.CV
TL;DR: VSA is a trainable sparse attention mechanism that replaces full quadratic attention in video diffusion transformers, achieving 2.53× training FLOPS reduction and 6× attention speedup while maintaining model quality.
Details
Motivation: Scaling video diffusion transformers is limited by their quadratic 3D attention complexity, even though most attention mass concentrates on a small subset of positions.
Method: VSA uses a two-stage approach: lightweight coarse stage pools tokens into tiles and identifies critical tokens, then fine stage computes token-level attention only inside those tiles with block computing layout for hardware efficiency.
Result: VSA achieves 85% of FlashAttention3 MFU, cuts training FLOPS by 2.53× with no loss in diffusion performance, and speeds up attention time by 6× (reducing generation time from 31s to 18s).
Conclusion: Trainable sparse attention is a practical alternative to full attention and enables further scaling of video diffusion models.
Abstract: Scaling video diffusion transformers (DiTs) is limited by their quadratic 3D attention, even though most of the attention mass concentrates on a small subset of positions. We turn this observation into VSA, a trainable, hardware-efficient sparse attention that replaces full attention at both training and inference. In VSA, a lightweight coarse stage pools tokens into tiles and identifies high-weight critical tokens; a fine stage computes token-level attention only inside those tiles, subject to a block computing layout to ensure hardware efficiency. This leads to a single differentiable kernel that trains end-to-end, requires no post-hoc profiling, and sustains 85% of FlashAttention3 MFU. We perform a large sweep of ablation studies and scaling-law experiments by pretraining DiTs from 60M to 1.4B parameters. VSA reaches a Pareto point that cuts training FLOPS by 2.53× with no drop in diffusion loss. Retrofitting the open-source Wan-2.1 model speeds up attention time by 6× and lowers end-to-end generation time from 31s to 18s with comparable quality. These results establish trainable sparse attention as a practical alternative to full attention and a key enabler for further scaling of video diffusion models. Code will be available at https://github.com/hao-ai-lab/FastVideo.
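A minimal, non-optimized sketch of the coarse-to-fine selection follows (single head, no batching, no custom kernel; the tile size and top-k are illustrative, and the hardware-aware block layout and the differentiable training path through the selection are omitted):

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_attention(q, k, v, tile=64, topk=4):
    """Two-stage sparse attention sketch.

    q, k, v: (T, D) with T divisible by `tile`.
    Stage 1 (coarse): mean-pool tokens into tiles and score tile pairs.
    Stage 2 (fine):  each query tile attends at token level only to its
                     top-k key tiles.
    """
    T, D = q.shape
    n = T // tile
    topk = min(topk, n)
    q_t, k_t, v_t = (x.view(n, tile, D) for x in (q, k, v))

    # Coarse stage: tile-level scores from pooled representations.
    q_pool, k_pool = q_t.mean(dim=1), k_t.mean(dim=1)            # (n, D)
    tile_scores = q_pool @ k_pool.t() / D ** 0.5                 # (n, n)
    sel = tile_scores.topk(topk, dim=-1).indices                 # (n, topk)

    out = torch.empty_like(q)
    for i in range(n):                                           # fine stage
        ks = k_t[sel[i]].reshape(-1, D)                          # (topk*tile, D)
        vs = v_t[sel[i]].reshape(-1, D)
        attn = F.softmax(q_t[i] @ ks.t() / D ** 0.5, dim=-1)     # (tile, topk*tile)
        out[i * tile:(i + 1) * tile] = attn @ vs
    return out
```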
[225] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness
Yolo Yunlong Tang, Pinxin Liu, Zhangyun Tan, Mingqian Feng, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu
Main category: cs.CV
TL;DR: MMPerspective is the first benchmark to evaluate multimodal language models’ understanding of perspective geometry through 10 tasks across perception, reasoning, and robustness dimensions.
Details
Motivation: To systematically assess how well multimodal large language models internalize perspective geometry, which is fundamental to human visual perception but unclear in current models.
Method: Created a benchmark with 2,711 real-world and synthetic images and 5,083 question-answer pairs across 10 tasks covering perspective perception, reasoning, and robustness capabilities.
Result: Evaluation of 43 state-of-the-art MLLMs revealed significant limitations - models perform well on surface-level perceptual tasks but struggle with compositional reasoning and spatial consistency under perturbations.
Conclusion: MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems, revealing architecture-scale patterns and robustness bottlenecks.
Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs’ understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/
[226] CPathAgent: An Agent-based Foundation Model for Interpretable High-Resolution Pathology Image Analysis Mimicking Pathologists’ Diagnostic Logic
Yuxuan Sun, Yixuan Si, Chenglu Zhu, Kai Zhang, Zhongyi Shui, Bowen Ding, Tao Lin, Lin Yang
Main category: cs.CV
TL;DR: CPathAgent is an agent-based AI system that mimics pathologists’ diagnostic workflow by autonomously navigating across whole slide images at different magnifications to generate transparent diagnostic summaries, outperforming existing methods across multiple benchmarks.
Details
Motivation: Existing computational pathology models directly output final diagnoses without revealing the reasoning process, unlike pathologists who systematically examine slides at different magnifications. There's a need for models that can emulate the diagnostic approach of pathologists.
Method: Developed a multi-stage training strategy unifying patch-level, region-level, and WSI-level capabilities within a single model. Created PathMMU-HR2 benchmark for large region analysis. The agent autonomously navigates across WSI based on visual features.
Result: CPathAgent consistently outperforms existing approaches across benchmarks at three different image scales (patch, region, and whole slide levels).
Conclusion: The agent-based diagnostic approach represents a promising direction for computational pathology by providing more transparent and interpretable diagnostic summaries that mimic human pathologists’ workflow.
Abstract: Recent advances in computational pathology have led to the emergence of numerous foundation models. These models typically rely on general-purpose encoders with multi-instance learning for whole slide image (WSI) classification or apply multimodal approaches to generate reports directly from images. However, these models cannot emulate the diagnostic approach of pathologists, who systematically examine slides at low magnification to obtain an overview before progressively zooming in on suspicious regions to formulate comprehensive diagnoses. Instead, existing models directly output final diagnoses without revealing the underlying reasoning process. To address this gap, we introduce CPathAgent, an innovative agent-based approach that mimics pathologists’ diagnostic workflow by autonomously navigating across WSI based on observed visual features, thereby generating substantially more transparent and interpretable diagnostic summaries. To achieve this, we develop a multi-stage training strategy that unifies patch-level, region-level, and WSI-level capabilities within a single model, which is essential for replicating how pathologists understand and reason across diverse image scales. Additionally, we construct PathMMU-HR2, the first expert-validated benchmark for large region analysis. This represents a critical intermediate scale between patches and whole slides, reflecting a key clinical reality where pathologists typically examine several key large regions rather than entire slides at once. Extensive experiments demonstrate that CPathAgent consistently outperforms existing approaches across benchmarks at three different image scales, validating the effectiveness of our agent-based diagnostic approach and highlighting a promising direction for computational pathology.
[227] MoPFormer: Motion-Primitive Transformer for Wearable-Sensor Activity Recognition
Hao Zhang, Zhan Zhuang, Xuehao Wang, Xiaodong Yang, Yu Zhang
Main category: cs.CV
TL;DR: MoPFormer is a self-supervised Transformer framework that tokenizes sensor signals into motion primitives to improve interpretability and cross-dataset generalization in Human Activity Recognition.
Details
Motivation: Address limited interpretability in Human Activity Recognition with wearable sensors, which impacts cross-dataset generalization.
Method: Two-stage approach: 1) Partition sensor streams into segments and quantize into motion primitive codewords, 2) Enrich tokenized sequences with context-aware embedding and process with Transformer encoder using masked motion-modeling objective.
Result: Outperforms state-of-the-art methods on six HAR benchmarks and successfully generalizes across multiple datasets. Learned motion primitives enhance both interpretability and cross-dataset performance.
Conclusion: MoPFormer effectively captures fundamental movement patterns that remain consistent across similar activities, improving both performance and interpretability in cross-dataset scenarios.
Abstract: Human Activity Recognition (HAR) with wearable sensors is challenged by limited interpretability, which significantly impacts cross-dataset generalization. To address this challenge, we propose Motion-Primitive Transformer (MoPFormer), a novel self-supervised framework that enhances interpretability by tokenizing inertial measurement unit signals into semantically meaningful motion primitives and leverages a Transformer architecture to learn rich temporal representations. MoPFormer comprises two stages. The first stage is to partition multi-channel sensor streams into short segments and quantize them into discrete "motion primitive" codewords, while the second stage enriches those tokenized sequences through a context-aware embedding module and then processes them with a Transformer encoder. The proposed MoPFormer can be pre-trained using a masked motion-modeling objective that reconstructs missing primitives, enabling it to develop robust representations across diverse sensor configurations. Experiments on six HAR benchmarks demonstrate that MoPFormer not only outperforms state-of-the-art methods but also successfully generalizes across multiple datasets. More importantly, the learned motion primitives significantly enhance both interpretability and cross-dataset performance by capturing fundamental movement patterns that remain consistent across similar activities, regardless of dataset origin.
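The first stage segments the sensor stream and quantizes it into discrete codewords. MoPFormer learns its codebook as part of the model; purely for illustration, a k-means codebook over fixed-length segments produces the same kind of tokenization (segment length and codebook size below are arbitrary choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def tokenize_imu(stream, seg_len=16, n_codewords=128, codebook=None):
    """Quantize an IMU stream into discrete motion-primitive tokens.

    stream: (T, C) array of inertial samples with C channels.
    Returns (tokens, codebook); each non-overlapping segment of `seg_len`
    samples becomes one codeword index.
    """
    T, C = stream.shape
    n_seg = T // seg_len
    segs = stream[: n_seg * seg_len].reshape(n_seg, seg_len * C)

    if codebook is None:                        # fit a codebook on this data
        codebook = KMeans(n_clusters=n_codewords, n_init=10).fit(segs)
    tokens = codebook.predict(segs)             # (n_seg,) discrete token ids
    return tokens, codebook
```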
[228] OmniResponse: Online Multimodal Conversational Response Generation in Dyadic Interactions
Cheng Luo, Jianghui Wang, Bing Li, Siyang Song, Bernard Ghanem
Main category: cs.CV
TL;DR: OMCRG is a novel task for generating synchronized verbal and non-verbal listener feedback in real-time conversations using multimodal inputs. The paper introduces OmniResponse, an MLLM that generates accurate multimodal responses through text as an intermediate modality, and ResponseNet dataset for evaluation.
Details
Motivation: To capture natural dyadic interactions and address challenges in aligning generated audio with listeners' facial responses in online conversational settings.
Method: Proposed OmniResponse, a Multimodal Large Language Model with Chrono-Text Markup for precise text token timing and TempoVoice for synchronized speech output. Uses text as intermediate modality to connect audio and facial responses.
Result: OmniResponse outperforms baseline models on ResponseNet dataset in semantic speech content, audio-visual synchronization, and generation quality.
Conclusion: The OMCRG task and OmniResponse framework successfully enable synchronized multimodal listener response generation, with publicly available dataset, code, and models to advance research in this area.
Abstract: In this paper, we introduce Online Multimodal Conversational Response Generation (OMCRG), a novel task designed to produce synchronized verbal and non-verbal listener feedback online, based on the speaker’s multimodal inputs. OMCRG captures natural dyadic interactions and introduces new challenges in aligning generated audio with listeners’ facial responses. To tackle these challenges, we incorporate text as an intermediate modality to connect audio and facial responses. We propose OmniResponse, a Multimodal Large Language Model (MLLM) that autoregressively generates accurate multimodal listener responses. OmniResponse leverages a pretrained LLM enhanced with two core components: Chrono-Text Markup, which precisely timestamps generated text tokens, and TempoVoice, a controllable online text-to-speech (TTS) module that outputs speech synchronized with facial responses. To advance OMCRG research, we offer ResponseNet, a dataset of 696 detailed dyadic interactions featuring synchronized split-screen videos, multichannel audio, transcripts, and annotated facial behaviors. Comprehensive evaluations on ResponseNet demonstrate that OmniResponse outperforms baseline models in terms of semantic speech content, audio-visual synchronization, and generation quality. Our dataset, code, and models are publicly available.
[229] Geo-Sign: Hyperbolic Contrastive Regularisation for Geometrically Aware Sign Language Translation
Edward Fish, Richard Bowden
Main category: cs.CV
TL;DR: Geo-Sign improves Sign Language Translation by using hyperbolic geometry to enhance skeletal representations, achieving better performance than RGB methods while being more private and computationally efficient.
Details
Motivation: Current SLT research focuses mainly on improving language models, but this work explores enhancing the geometric properties of skeletal representations themselves to better capture hierarchical sign language kinematics.
Method: Projects skeletal features from ST-GCNs into hyperbolic space using Poincaré ball model, with hyperbolic projection layer, weighted Fréchet mean aggregation, and geometric contrastive loss integrated as regularization in translation framework.
Result: Improves state-of-the-art RGB methods while preserving privacy and improving computational efficiency.
Conclusion: Hyperbolic geometry shows strong potential for enhancing skeletal representations in Sign Language Translation.
Abstract: Recent progress in Sign Language Translation (SLT) has focussed primarily on improving the representational capacity of large language models to incorporate Sign Language features. This work explores an alternative direction: enhancing the geometric properties of skeletal representations themselves. We propose Geo-Sign, a method that leverages the properties of hyperbolic geometry to model the hierarchical structure inherent in sign language kinematics. By projecting skeletal features derived from Spatio-Temporal Graph Convolutional Networks (ST-GCNs) into the Poincaré ball model, we aim to create more discriminative embeddings, particularly for fine-grained motions like finger articulations. We introduce a hyperbolic projection layer, a weighted Fréchet mean aggregation scheme, and a geometric contrastive loss operating directly in hyperbolic space. These components are integrated into an end-to-end translation framework as a regularisation function, to enhance the representations within the language model. This work demonstrates the potential of hyperbolic geometry to improve skeletal representations for Sign Language Translation, improving on SOTA RGB methods while preserving privacy and improving computational efficiency. Code available here: https://github.com/ed-fish/geo-sign.
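Two ingredients of the hyperbolic pipeline can be stated directly: the exponential map at the origin that places Euclidean ST-GCN features inside the Poincaré ball, and the geodesic distance a contrastive loss would operate on. The sketch below fixes the curvature to 1 and omits the weighted Fréchet mean aggregation and the actual loss, so it is an illustration rather than the authors' implementation.

```python
import torch

def expmap0(v, eps=1e-6):
    """Project Euclidean features onto the Poincaré ball (exp map at the origin, c=1)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(norm) * v / norm

def poincare_dist(x, y, eps=1e-6):
    """Geodesic distance between points inside the unit Poincaré ball."""
    x2 = x.pow(2).sum(dim=-1)
    y2 = y.pow(2).sum(dim=-1)
    xy2 = (x - y).pow(2).sum(dim=-1)
    arg = 1 + 2 * xy2 / ((1 - x2).clamp_min(eps) * (1 - y2).clamp_min(eps))
    return torch.acosh(arg.clamp_min(1 + eps))

# Toy usage: embed a batch of features and measure row-wise hyperbolic distances.
z = expmap0(torch.randn(8, 64) * 0.1)
d = poincare_dist(z[:4], z[4:])   # distances between corresponding rows of the two halves
```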
[230] From Objects to Anywhere: A Holistic Benchmark for Multi-level Visual Grounding in 3D Scenes
Tianxu Wang, Zhuofan Zhang, Ziyu Zhu, Yue Fan, Jing Xiong, Pengxiang Li, Xiaojian Ma, Qing Li
Main category: cs.CV
TL;DR: Anywhere3D-Bench is a comprehensive 3D visual grounding benchmark with 2,886 expression-bounding box pairs across four levels: human-activity areas, unoccupied space, individual objects, and object parts. Current models struggle most with space-level and part-level tasks.
Details
Motivation: To address the unexplored area of grounding referring expressions beyond objects in 3D scenes, moving beyond traditional object-level grounding to include spatial reasoning and fine-grained object composition.
Method: Created Anywhere3D-Bench benchmark with four grounding levels and evaluated state-of-the-art 3D visual grounding methods, LLMs, and MLLMs on this comprehensive dataset.
Result: Space-level and part-level tasks are most challenging - best models achieve only ~30% accuracy on space-level and ~40% on part-level tasks, significantly lower than area-level and object-level performance.
Conclusion: There’s a critical gap in current models’ capacity for comprehensive 3D spatial reasoning and fine-grained object perception beyond object-level semantics.
Abstract: 3D visual grounding has made notable progress in localizing objects within complex 3D scenes. However, grounding referring expressions beyond objects in 3D scenes remains unexplored. In this paper, we introduce Anywhere3D-Bench, a holistic 3D visual grounding benchmark consisting of 2,886 referring expression-3D bounding box pairs spanning four different grounding levels: human-activity areas, unoccupied space beyond objects, individual objects in the scene, and fine-grained object parts. We assess a range of state-of-the-art 3D visual grounding methods alongside large language models (LLMs) and multimodal LLMs (MLLMs) on Anywhere3D-Bench. Experimental results reveal that space-level and part-level visual grounding pose the greatest challenges: space-level tasks require a more comprehensive spatial reasoning ability, for example, modeling distances and spatial relations within 3D space, while part-level tasks demand fine-grained perception of object composition. Even the best-performing models, Google Gemini-2.5-Pro and OpenAI o3, achieve just around 30% accuracy on space-level tasks and around 40% on part-level tasks, significantly lower than its performance on area-level and object-level tasks. These findings underscore a critical gap in current models’ capacity to understand and reason about 3D scenes beyond object-level semantics.
[231] GS4: Generalizable Sparse Splatting Semantic SLAM
Mingqi Jiang, Chanho Kim, Chen Ziwen, Li Fuxin
Main category: cs.CV
TL;DR: GS4 is a generalizable Gaussian Splatting-based semantic SLAM system that runs 10x faster, uses 10x fewer Gaussians, and achieves SOTA performance in color, depth, semantic mapping, and camera tracking compared to prior methods.
Details
Motivation: Traditional SLAM algorithms produce incomplete, low-resolution maps without tight semantic integration, while recent GS-based SLAM methods require slow per-scene optimization and excessive Gaussians.
Method: GS4 incrementally builds and updates 3D Gaussians using a feed-forward network: Gaussian Prediction Model estimates sparse Gaussian parameters from input frames, Gaussian Refinement Network merges new Gaussians while avoiding redundancy, and optimized GS correction with only 1-5 iterations for drift and floater correction.
Result: Achieves state-of-the-art performance on ScanNet and ScanNet++ benchmarks, with strong generalization shown through zero-shot transfer to NYUv2 and TUM RGB-D datasets. Runs 10x faster and uses 10x fewer Gaussians than prior approaches.
Conclusion: GS4 demonstrates that generalizable Gaussian Splatting can enable efficient, high-performance semantic SLAM with strong generalization capabilities across different datasets.
Abstract: Traditional SLAM algorithms excel at camera tracking, but typically produce incomplete and low-resolution maps that are not tightly integrated with semantics prediction. Recent work integrates Gaussian Splatting (GS) into SLAM to enable dense, photorealistic 3D mapping, yet existing GS-based SLAM methods require per-scene optimization that is slow and consumes an excessive number of Gaussians. We present GS4, the first generalizable GS-based semantic SLAM system. Compared with prior approaches, GS4 runs 10x faster, uses 10x fewer Gaussians, and achieves state-of-the-art performance across color, depth, semantic mapping and camera tracking. From an RGB-D video stream, GS4 incrementally builds and updates a set of 3D Gaussians using a feed-forward network. First, the Gaussian Prediction Model estimates a sparse set of Gaussian parameters from input frame, which integrates both color and semantic prediction with the same backbone. Then, the Gaussian Refinement Network merges new Gaussians with the existing set while avoiding redundancy. Finally, we propose to optimize GS for only 1-5 iterations that corrects drift and floaters when significant pose changes are detected. Experiments on the real-world ScanNet and ScanNet++ benchmarks demonstrate state-of-the-art semantic SLAM performance, with strong generalization capability shown through zero-shot transfer to the NYUv2 and TUM RGB-D datasets.
[232] GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning
Bo Liu, Xiangyu Zhao, Along He, Yidi Chen, Huazhu Fu, Xiao-Ming Wu
Main category: cs.CV
TL;DR: This paper proposes RMCoT, a region-aware multimodal chain-of-thought dataset for medical VQA that provides fine-grained explainability through intermediate reasoning steps grounded in visual regions, along with a verifiable reward mechanism for reinforcement learning.
Details
Motivation: Current medical VQA methods suffer from limited answer reliability and poor interpretability, which impairs clinicians' and patients' ability to understand and trust model outputs.
Method: Proposes RMCoT dataset with intermediate reasoning steps that explicitly ground relevant visual regions, and introduces a verifiable reward mechanism for reinforcement learning to guide post-training.
Result: The method achieves comparable performance using only one-eighth of the training data, demonstrating efficiency and effectiveness.
Conclusion: The proposed approach significantly improves explainability and trustworthiness of medical VQA systems while maintaining high performance with reduced training data requirements.
Abstract: Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images. While recent advances in multi-modal learning have significantly improved performance, current methods still suffer from limited answer reliability and poor interpretability, impairing the ability of clinicians and patients to understand and trust model outputs. To address these limitations, this work first proposes a Region-Aware Multimodal Chain-of-Thought (RMCoT) dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps that explicitly ground relevant visual regions of the medical image, thereby providing fine-grained explainability. Furthermore, we introduce a novel verifiable reward mechanism for reinforcement learning to guide post-training, improving the alignment between the model’s reasoning process and its final answer. Remarkably, our method achieves comparable performance using only one-eighth of the training data, demonstrating the efficiency and effectiveness of the proposal. The dataset is available at https://www.med-vqa.com/GEMeX/.
[233] AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
Main category: cs.CV
TL;DR: AnyCap Project introduces a plug-and-play framework (ACM) for controllable multimodal captioning, along with a comprehensive dataset (ACD) and evaluation benchmark (AnyCapEval), achieving significant improvements in content and style scores.
Details
Motivation: Existing controllable captioning models lack fine-grained control and reliable evaluation protocols, creating a gap in precise multimodal alignment and instruction following.
Method: AnyCapModel (ACM) is a lightweight plug-and-play framework that enhances existing foundation models for omni-modal captioning without retraining. It reuses original captions while incorporating user instructions and modality features.
Result: ACM significantly improves caption quality across diverse base models, with ACM-8B increasing GPT-4o’s content scores by 45% and style scores by 12%, and achieving substantial gains on benchmarks like MIA-Bench and VidCapBench.
Conclusion: The AnyCap Project provides an integrated solution for controllable captioning through model, dataset, and evaluation components, demonstrating marked improvements in caption quality and controllability.
Abstract: Controllable captioning is essential for precise multimodal alignment and instruction following, yet existing models often lack fine-grained control and reliable evaluation protocols. To address this gap, we present the AnyCap Project, an integrated solution spanning model, dataset, and evaluation. We introduce AnyCapModel (ACM), a lightweight plug-and-play framework that enhances the controllability of existing foundation models for omni-modal captioning without retraining the base model. ACM reuses the original captions from base models while incorporating user instructions and modality features to generate improved captions. To remedy the data scarcity in controllable multimodal captioning, we build AnyCapDataset (ACD), covering three modalities, 28 user-instruction types, and 300k high-quality data entries. We further propose AnyCapEval, a new benchmark that provides more reliable evaluation metrics for controllable captioning by decoupling content accuracy and stylistic fidelity. ACM markedly improves caption quality across a diverse set of base models on AnyCapEval. Notably, ACM-8B raises GPT-4o's content scores by 45% and style scores by 12%, and it also achieves substantial gains on widely used benchmarks such as MIA-Bench and VidCapBench.
[234] Multispectral State-Space Feature Fusion: Bridging Shared and Cross-Parametric Interactions for Object Detection
Jifeng Shen, Haibo Zhan, Shaohua Dong, Xin Zuo, Wankou Yang, Haibin Ling
Main category: cs.CV
TL;DR: MS2Fusion is a novel multispectral feature fusion framework using state space models that addresses limitations in current object detection by balancing complementary features and shared semantics through dual-path parametric interaction.
Details
Motivation: Current multispectral feature fusion methods excessively prefer local complementary features over cross-modal shared semantics, which hurts generalization, and face bottlenecks in balancing receptive field size with computational complexity.
Method: Proposes MS2Fusion with dual-path parametric interaction: cross-parameter branch for mining complementary information using cross-attention and SSM hidden state decoding, and shared-parameter branch for cross-modal alignment through parameter sharing and joint embedding.
Result: MS2Fusion significantly outperforms state-of-the-art methods on FLIR, M3FD and LLVIP benchmarks, and achieves SOTA results on RGB-T semantic segmentation and RGBT salient object detection without specific design.
Conclusion: MS2Fusion provides an effective and general framework for multispectral feature fusion that balances complementary features and shared semantics, demonstrating superior performance across multiple perception tasks.
Abstract: Modern multispectral feature fusion for object detection faces two critical limitations: (1) Excessive preference for local complementary features over cross-modal shared semantics adversely affects generalization performance; and (2) The trade-off between the receptive field size and computational complexity present critical bottlenecks for scalable feature modeling. Addressing these issues, a novel Multispectral State-Space Feature Fusion framework, dubbed MS2Fusion, is proposed based on the state space model (SSM), achieving efficient and effective fusion through a dual-path parametric interaction mechanism. More specifically, the first cross-parameter interaction branch inherits the advantage of cross-attention in mining complementary information with cross-modal hidden state decoding in SSM. The second shared-parameter branch explores cross-modal alignment with joint embedding to obtain cross-modal similar semantic features and structures through parameter sharing in SSM. Finally, these two paths are jointly optimized with SSM for fusing multispectral features in a unified framework, allowing our MS2Fusion to enjoy both functional complementarity and shared semantic space. In our extensive experiments on mainstream benchmarks including FLIR, M3FD and LLVIP, our MS2Fusion significantly outperforms other state-of-the-art multispectral object detection methods, evidencing its superiority. Moreover, MS2Fusion is general and applicable to other multispectral perception tasks. We show that, even without specific design, MS2Fusion achieves state-of-the-art results on RGB-T semantic segmentation and RGBT salient object detection, showing its generality. The source code will be available at https://github.com/61s61min/MS2Fusion.git.
[235] Towards Real Unsupervised Anomaly Detection Via Confident Meta-Learning
Muhammad Aqeel, Shakiba Sharifi, Marco Cristani, Francesco Setti
Main category: cs.CV
TL;DR: CoMet is a meta-learning approach for deep anomaly detection that enables training on uncurated datasets containing both nominal and anomalous samples, eliminating the need for manual data filtering.
Details
Motivation: Traditional unsupervised anomaly detection assumes all training data are nominal, requiring manual curation that introduces bias and limits adaptability. CoMet addresses this by enabling learning from mixed datasets.
Method: Integrates Soft Confident Learning (assigning lower weights to low-confidence samples) and Meta-Learning (stabilizing training via regularization based on training validation loss covariance) to prevent overfitting and enhance robustness.
Result: Experiments on MVTec-AD, VIADUCT, and KSDD2 datasets show consistent improvements over baselines, insensitivity to training set anomalies, and new state-of-the-art performance across all datasets.
Conclusion: CoMet provides an effective model-agnostic training strategy for anomaly detection that works with uncurated data, eliminating manual filtering requirements while maintaining robust performance.
Abstract: So-called unsupervised anomaly detection is better described as semi-supervised, as it assumes all training data are nominal. This assumption simplifies training but requires manual data curation, introducing bias and limiting adaptability. We propose Confident Meta-learning (CoMet), a novel training strategy that enables deep anomaly detection models to learn from uncurated datasets where nominal and anomalous samples coexist, eliminating the need for explicit filtering. Our approach integrates Soft Confident Learning, which assigns lower weights to low-confidence samples, and Meta-Learning, which stabilizes training by regularizing updates based on training validation loss covariance. This prevents overfitting and enhances robustness to noisy data. CoMet is model-agnostic and can be applied to any anomaly detection method trainable via gradient descent. Experiments on MVTec-AD, VIADUCT, and KSDD2 with two state-of-the-art models demonstrate the effectiveness of our approach, consistently improving over the baseline methods, remaining insensitive to anomalies in the training set, and setting a new state-of-the-art across all datasets. Code is available at https://github.com/aqeeelmirza/CoMet
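One plausible reading of the Soft Confident Learning component is a per-sample weighting that shrinks the influence of training items the current model already flags as anomalous. The scaling below (z-scored anomaly scores passed through a sigmoid) is an assumption made for illustration, and the meta-learning regularizer based on train/validation loss covariance is not shown.

```python
import torch

def soft_confidence_weights(scores, temperature=1.0):
    """Down-weight probable anomalies in an uncurated training set.

    scores: (N,) per-sample anomaly scores from the current model
            (higher = more likely anomalous). Returns weights in (0, 1]
            that shrink as the score rises, so suspicious samples
            contribute less to the nominal-appearance objective.
    """
    z = (scores - scores.mean()) / scores.std().clamp_min(1e-8)
    return torch.sigmoid(-z / temperature)

def weighted_loss(per_sample_loss, scores):
    w = soft_confidence_weights(scores).detach()   # weights carry no gradient
    return (w * per_sample_loss).sum() / w.sum()
```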
[236] Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images
Jinsol Song, Jiamu Wang, Anh Tien Nguyen, Keunho Byeon, Sangjeong Ahn, Sung Hak Lee, Jin Tae Kwak
Main category: cs.CV
TL;DR: Ano-NAViLa is a vision-language model for anomaly detection in pathology images that incorporates both normal and abnormal pathology knowledge, achieving state-of-the-art performance and interpretability.
Details
Motivation: Existing anomaly detection methods designed for industrial settings face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability, especially when disease-related data are limited.
Method: Built on a pre-trained vision-language model with a lightweight trainable MLP, incorporating both normal and abnormal pathology knowledge to enhance accuracy and robustness while providing interpretability through image-text associations.
Result: Ano-NAViLa achieves state-of-the-art performance in anomaly detection and localization on two lymph node datasets from different organs, outperforming competing models.
Conclusion: The proposed Ano-NAViLa model effectively addresses the challenges of anomaly detection in computational pathology by leveraging pathology-specific knowledge and vision-language capabilities.
Abstract: Anomaly detection in computational pathology aims to identify rare and scarce anomalies where disease-related data are often limited or missing. Existing anomaly detection methods, primarily designed for industrial settings, face limitations in pathology due to computational constraints, diverse tissue structures, and lack of interpretability. To address these challenges, we propose Ano-NAViLa, a Normal and Abnormal pathology knowledge-augmented Vision-Language model for Anomaly detection in pathology images. Ano-NAViLa is built on a pre-trained vision-language model with a lightweight trainable MLP. By incorporating both normal and abnormal pathology knowledge, Ano-NAViLa enhances accuracy and robustness to variability in pathology images and provides interpretability through image-text associations. Evaluated on two lymph node datasets from different organs, Ano-NAViLa achieves the state-of-the-art performance in anomaly detection and localization, outperforming competing models.
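The core scoring idea, comparing a patch embedding against both normal and abnormal pathology text prompts, can be sketched as below. The embeddings are assumed to come from the underlying vision-language model, and the trainable MLP and the actual prompt set used by Ano-NAViLa are omitted, so this is only a schematic.

```python
import torch
import torch.nn.functional as F

def anomaly_score(img_emb, normal_txt_emb, abnormal_txt_emb, tau=0.07):
    """Score a patch by comparing it with normal vs. abnormal text prompts.

    img_emb:          (D,)    embedding of the image patch
    normal_txt_emb:   (Kn, D) embeddings of normal-pathology prompts
    abnormal_txt_emb: (Ka, D) embeddings of abnormal-pathology prompts
    Returns a value in (0, 1); higher means more anomalous.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(torch.cat([normal_txt_emb, abnormal_txt_emb]), dim=-1)
    sims = (txt @ img) / tau                      # temperature-scaled cosine similarities
    probs = sims.softmax(dim=0)
    return probs[normal_txt_emb.size(0):].sum()   # probability mass on abnormal prompts
```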
[237] GRASP: Geospatial pixel Reasoning viA Structured Policy learning
Chengjie Jiang, Yunqi Zhou, Jiafeng Yan, Jing Li, Jiayang Li, Yue Zhou, Hongjie He, Jonathan Li
Main category: cs.CV
TL;DR: GRASP is a reinforcement learning framework for geospatial pixel reasoning that replaces supervised fine-tuning with policy learning, using bounding boxes and positive points instead of dense masks to reduce annotation costs and improve out-of-domain generalization.
Details
Motivation: Existing approaches suffer from high annotation costs for dense masks and limited generalization in out-of-domain scenarios, motivating a more scalable and robust solution.
Method: Integrates multimodal LLM with pretrained segmentation model in cascaded manner, uses reinforcement learning with BoP-Rewards (bounding boxes and positive points) instead of supervised fine-tuning, and verifies outputs through format and accuracy signals.
Result: Achieves state-of-the-art in-domain performance and up to 54% improvement in out-of-domain scenarios on GRASP-1k benchmark, demonstrating robust generalization.
Conclusion: Reinforcement learning with cost-aware rewards provides a scalable and effective paradigm for geospatial pixel reasoning, reducing annotation burden while improving generalization.
Abstract: Geospatial pixel reasoning aims to generate segmentation masks in remote sensing imagery directly from natural-language instructions. Most existing approaches follow a paradigm that fine-tunes multimodal large language models under supervision with dense pixel-level masks as ground truth. While effective within the training data distribution, this design suffers from two main drawbacks: (1) the high cost of large-scale dense mask annotation, and (2) the limited generalization capability of supervised fine-tuning in out-of-domain scenarios. To address these issues, we propose GRASP, a structured policy-learning framework that integrates a multimodal large language model with a pretrained segmentation model in a cascaded manner. To enhance generalization, we introduce PRIME, a training paradigm that replaces supervised fine-tuning with reinforcement learning to better align reasoning and grounding behaviors with task objectives. To reduce annotation costs, we design BoP-Rewards, which substitutes dense mask labels with bounding box and positive points. It further verifies outputs through two complementary signals: format, which constrains the reasoning and grounding structure to remain syntactically parsable, and accuracy, which evaluates the quality of predicted boxes and points. For evaluation, we train our method and all baselines on EarthReason and GeoPixInstruct, constructing an in-domain benchmark by merging their test sets. We further release GRASP-1k, a fully out-of-domain benchmark with reasoning-intensive queries, reasoning traces, and fine-grained masks. Experimental results demonstrate state-of-the-art (SOTA) in-domain performance and up to 54% improvement in out-of-domain scenarios, confirming that reinforcement learning with cost-aware rewards provides a robust and scalable paradigm for geospatial pixel reasoning. All code and datasets will be released publicly.
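BoP-Rewards combines a format signal with accuracy signals computed from weak labels (a box and positive points) instead of dense masks. The reward below is a hedged sketch: the `<box>` output tag, the component weights, and the exact accuracy terms are illustrative assumptions, not the paper's definition.

```python
import re

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def bop_reward(response, pred_box, pred_mask, gt_box, gt_points):
    """Weak-label reward: format + box IoU + positive-point hit rate.

    response:   raw model output; must contain a parsable <box>...</box> span
    pred_box:   predicted box (x1, y1, x2, y2)
    pred_mask:  (H, W) boolean array for the predicted segment
    gt_box:     annotated box; gt_points: list of annotated positive (x, y) points
    """
    fmt = 1.0 if re.search(r"<box>.*?</box>", response, re.S) else 0.0
    iou = box_iou(pred_box, gt_box)
    hits = sum(pred_mask[int(y), int(x)] for x, y in gt_points) / max(len(gt_points), 1)
    return 0.2 * fmt + 0.4 * iou + 0.4 * hits   # weights are illustrative only
```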
[238] Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies
Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Tian Nian, Liuao Pei, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo
Main category: cs.CV
TL;DR: Discrete Diffusion VLA presents a unified transformer policy that models discretized robot actions using discrete diffusion, achieving adaptive decoding order and robust error correction while maintaining compatibility with vision-language models.
Details
Motivation: Current Vision-Language-Action models use fragmented approaches like auto-regressive generation or separate MLP/diffusion heads, which create specialized training requirements and hinder unified, scalable architectures.
Method: Uses discrete diffusion to model discretized action chunks, retaining diffusion’s progressive refinement while being compatible with VLM token interfaces. Features adaptive decoding order and secondary re-masking for error correction.
Result: Achieves 96.3% success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge, outperforming autoregressive, MLP decoder and continuous diffusion baselines.
Conclusion: Discrete-diffusion VLA enables precise action modeling and consistent training, providing foundation for scaling VLA to larger models and datasets while preserving pre-trained vision-language priors.
Abstract: Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions auto-regressively in a fixed left-to-right order or attach separate MLP or diffusion heads outside the backbone, leading to fragmented information pathways and specialized training requirements that hinder a unified, scalable architecture. We present Discrete Diffusion VLA, a unified-transformer policy that models discretized action chunks with discrete diffusion. The design retains diffusion’s progressive refinement paradigm while remaining natively compatible with the discrete token interface of VLMs. Our method achieves an adaptive decoding order that resolves easy action elements before harder ones and uses secondary re-masking to revisit uncertain predictions across refinement rounds, which improves consistency and enables robust error correction. This unified decoder preserves pre-trained vision-language priors, supports parallel decoding, breaks the autoregressive bottleneck, and reduces the number of function evaluations. Discrete Diffusion VLA achieves 96.3% avg. success rates on LIBERO, 71.2% visual matching on SimplerEnv-Fractal and 54.2% overall on SimplerEnv-Bridge, improving over autoregressive, MLP decoder and continuous diffusion baselines. These findings indicate that discrete-diffusion VLA supports precise action modeling and consistent training, laying groundwork for scaling VLA to larger models and datasets. Our project page is https://github.com/Liang-ZX/DiscreteDiffusionVLA
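The adaptive decoding order and secondary re-masking can be illustrated with a MaskGIT-style loop over a discretized action chunk. The commit schedule, re-mask fraction, and model interface below are assumptions made for the sketch, not the released implementation.

```python
import torch

MASK = -1  # sentinel id for a not-yet-decoded action token

@torch.no_grad()
def discrete_diffusion_decode(model, ctx, n_tokens, rounds=6, remask_p=0.1):
    """Confidence-ordered iterative decoding for a discretized action chunk.

    `model(ctx, tokens)` is assumed to return (n_tokens, vocab) logits given the
    vision-language context and the partially decoded token sequence. Easy
    (high-confidence) positions are committed first; a fraction of the least
    confident committed tokens is re-masked each round for error correction.
    """
    tokens = torch.full((n_tokens,), MASK, dtype=torch.long)
    conf = torch.zeros(n_tokens)

    for r in range(rounds):
        probs = model(ctx, tokens).softmax(dim=-1)        # (n_tokens, vocab)
        p, ids = probs.max(dim=-1)

        masked_idx = (tokens == MASK).nonzero(as_tuple=True)[0]
        k = min(len(masked_idx), max(1, int(n_tokens * (r + 1) / rounds)))
        best = p[masked_idx].topk(k).indices              # most confident masked slots
        commit = masked_idx[best]
        tokens[commit], conf[commit] = ids[commit], p[commit]

        if r < rounds - 1 and remask_p > 0:               # secondary re-masking
            done_idx = (tokens != MASK).nonzero(as_tuple=True)[0]
            n_re = int(len(done_idx) * remask_p)
            if n_re > 0:
                worst = conf[done_idx].topk(n_re, largest=False).indices
                tokens[done_idx[worst]] = MASK
    return tokens
```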
[239] Seeing Symbols, Missing Cultures: Probing Vision-Language Models’ Reasoning on Fire Imagery and Cultural Meaning
Haorui Yu, Yang Zhao, Yijia Chu, Qiufeng Yi
Main category: cs.CV
TL;DR: VLMs show cultural incompetence despite appearing competent, using superficial pattern matching rather than genuine cultural understanding. A diagnostic framework reveals systematic biases in fire-themed cultural imagery classification.
Details
Motivation: Vision-Language Models often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding, creating risks of misinterpretation and bias.Method: Introduced a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis, testing multiple models on Western festivals, non-Western traditions, and emergency scenes.
Result: Models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose risks of symbolic shortcuts.
Conclusion: Cultural evaluation beyond accuracy metrics is needed to ensure interpretable and fair multimodal systems, as current VLMs demonstrate systematic cultural biases and dangerous misclassifications.
Abstract: Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.
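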
[240] InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention
Qiang Xiang, Shuang Sun, Binglei Li, Dejia Song, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu, Junping Zhang
Main category: cs.CV
TL;DR: InstanceAssemble is a novel L2I generation method that uses instance-assembling attention for position control with bounding boxes and multimodal content control. It achieves SOTA performance with LoRA adaptation and introduces a new benchmark (Denselayout) and evaluation metric (LGS).
Details
Motivation: Current Layout-to-Image methods exhibit suboptimal performance despite leveraging positional conditions and textual descriptions. There's a need for more precise and controllable image synthesis.Method: Proposes InstanceAssemble architecture with instance-assembling attention for layout conditions, enabling position control via bounding boxes and multimodal content control. Uses lightweight LoRA modules for flexible adaptation to existing DiT-based T2I models.
Result: Achieves state-of-the-art performance under complex layout conditions and exhibits strong compatibility with diverse style LoRA modules. Also introduces Denselayout benchmark (5k images, 90k instances) and Layout Grounding Score metric.
Conclusion: InstanceAssemble effectively addresses limitations of current L2I methods and provides a comprehensive solution with improved performance, new benchmark, and better evaluation metrics.
Abstract: Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaptation to existing DiT-based T2I models through lightweight LoRA modules. Additionally, we propose Denselayout, a comprehensive benchmark for layout-to-image generation, containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules. The code and pretrained models are publicly available at https://github.com/FireRedTeam/InstanceAssemble.
[241] OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM
Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu, Yao Lu, Oluwatobi Olabiyi, Yu-Chiang Frank Wang, Rafael Valle, Bryan Catanzaro, Andrew Tao, Song Han, Jan Kautz, Hongxu Yin, Pavlo Molchanov
Main category: cs.CV
TL;DR: OmniVinci is an open-source omni-modal LLM that outperforms Qwen2.5-Omni on multiple benchmarks while using 6x fewer training tokens, demonstrating cross-modal reinforcement in perception and reasoning.
Details
Motivation: To advance machine intelligence by developing multi-modal perception capabilities similar to human sensing, requiring strong alignment between different modalities like vision and audio.Method: Three key architectural innovations: OmniAlignNet for vision-audio embedding alignment, Temporal Embedding Grouping for relative temporal alignment, and Constrained Rotary Time Embedding for absolute temporal encoding. Plus a data curation pipeline generating 24M single-modal and omni-modal conversations.
Result: Outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using only 0.2T training tokens (6x reduction).
Conclusion: Modalities reinforce each other in perception and reasoning, and the model demonstrates practical advantages in robotics, medical AI, and smart factory applications.
Abstract: Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni’s 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
[242] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
Main category: cs.CV
TL;DR: CNNs are not inherently texture-biased but rely primarily on local shape features, with feature reliance patterns varying systematically across computer vision, medical imaging, and remote sensing domains.
Details
Motivation: To challenge the established hypothesis that CNNs are inherently texture-biased by addressing limitations in previous cue-conflict experiments and developing a more rigorous framework to quantify feature reliance.Method: A domain-agnostic framework that systematically suppresses shape, texture, and color cues to quantify feature reliance without forced-choice conflicts, evaluating both humans and neural networks under controlled suppression conditions.
Result: CNNs predominantly rely on local shape features rather than texture, and this reliance can be mitigated through modern training strategies or architectures. Feature reliance patterns differ systematically across domains: computer vision prioritizes shape, medical imaging emphasizes color, and remote sensing relies more on texture.
Conclusion: The texture-bias hypothesis for CNNs is oversimplified; feature reliance varies systematically across domains and can be modified through architectural choices and training strategies, suggesting a more nuanced understanding of feature use in deep learning is needed.
Abstract: The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
[243] PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer
Zhiwei Yang, Chen Gao, Mike Zheng Shou
Main category: cs.CV
TL;DR: PANDA is a generalist video anomaly detection system using MLLMs that automatically handles any scene and anomaly types without training data or manual intervention through four key capabilities: self-adaptive strategy planning, goal-driven reasoning, tool-augmented self-reflection, and self-improving memory.
Details
Motivation: Current video anomaly detection methods require domain-specific training data and manual adjustments for new scenarios, leading to high labor costs and limited generalization. The goal is to create a system that can handle any scene and anomaly type automatically.Method: PANDA uses four key capabilities: (1) self-adaptive scene-aware RAG for strategy planning, (2) latent anomaly-guided heuristic prompting for reasoning, (3) progressive reflection with context-aware tools for iterative refinement, and (4) chain-of-memory for leveraging historical experiences.
Result: Extensive experiments show PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training or manual involvement.
Conclusion: PANDA demonstrates generalizable and robust anomaly detection capabilities, validating its effectiveness as a generalist VAD system that operates autonomously across diverse scenarios.
Abstract: Video anomaly detection (VAD) is a critical yet challenging task due to the complex and diverse nature of real-world scenarios. Previous methods typically rely on domain-specific training data and manual adjustments when applied to new scenarios and unseen anomaly types, suffering from high labor costs and limited generalization. Therefore, we aim to achieve generalist VAD, i.e., automatically handle any scene and any anomaly types without training data or human involvement. In this work, we propose PANDA, an agentic AI engineer based on MLLMs. Specifically, we achieve PANDA by comprehensively devising four key capabilities: (1) self-adaptive scene-aware strategy planning, (2) goal-driven heuristic reasoning, (3) tool-augmented self-reflection, and (4) self-improving chain-of-memory. Concretely, we develop a self-adaptive scene-aware RAG mechanism, enabling PANDA to retrieve anomaly-specific knowledge for anomaly detection strategy planning. Next, we introduce a latent anomaly-guided heuristic prompt strategy to enhance reasoning precision. Furthermore, PANDA employs a progressive reflection mechanism alongside a suite of context-aware tools to iteratively refine decision-making in complex scenarios. Finally, a chain-of-memory mechanism enables PANDA to leverage historical experiences for continual performance improvement. Extensive experiments demonstrate that PANDA achieves state-of-the-art performance in multi-scenario, open-set, and complex scenario settings without training and manual involvement, validating its generalizable and robust anomaly detection capability. Code is released at https://github.com/showlab/PANDA.
[244] Look and Tell: A Dataset for Multimodal Grounding Across Egocentric and Exocentric Views
Anna Deichler, Jonas Beskow
Main category: cs.CV
TL;DR: Look and Tell is a multimodal dataset for studying referential communication across different perspectives, collected using smart glasses and stationary cameras in a kitchen scenario.
Details
Motivation: To advance the development of embodied agents that can understand and engage in situated dialogue by studying how different spatial representations affect multimodal grounding.Method: Used Meta Project Aria smart glasses and stationary cameras to record synchronized gaze, speech, and video from 25 participants instructing partners to identify kitchen ingredients, combined with 3D scene reconstructions.
Result: Created a dataset with 3.67 hours of recordings containing 2,707 richly annotated referential expressions, providing a benchmark for evaluating spatial representation effects.
Conclusion: The dataset serves as a valuable resource for advancing research in multimodal grounding and embodied agent development for situated dialogue.
Abstract: We introduce Look and Tell, a multimodal dataset for studying referential communication across egocentric and exocentric perspectives. Using Meta Project Aria smart glasses and stationary cameras, we recorded synchronized gaze, speech, and video as 25 participants instructed a partner to identify ingredients in a kitchen. Combined with 3D scene reconstructions, this setup provides a benchmark for evaluating how different spatial representations (2D vs. 3D; ego vs. exo) affect multimodal grounding. The dataset contains 3.67 hours of recordings, including 2,707 richly annotated referential expressions, and is designed to advance the development of embodied agents that can understand and engage in situated dialogue.
[245] On Robustness of Vision-Language-Action Model against Multi-Modal Perturbations
Jianing Guo, Zhenhong Wu, Chang Tu, Yiyao Ma, Xiangqi Kong, Zhiqian Liu, Jiaming Ji, Shuning Zhang, Yuanpei Chen, Kai Chen, Qi Dou, Yaodong Yang, Xianglong Liu, Huijie Zhao, Weifeng Lv, Simin Li
Main category: cs.CV
TL;DR: The paper proposes RobustVLA, a method to enhance multi-modal robustness in Vision-Language-Action models against perturbations across actions, instructions, environments, and observations, achieving significant performance gains.
Details
Motivation: Existing VLA models focus only on visual perturbations, overlooking broader multi-modal disturbances. The authors identify actions as the most fragile modality and find that current visual-robust VLAs don't generalize to other modalities.Method: RobustVLA uses offline robust optimization against worst-case action noise for output robustness, and enforces consistent actions across input variations for input robustness. It formulates robustness as a multi-armed bandit problem using upper confidence bound to identify harmful noise.
Result: RobustVLA achieves absolute gains of 12.6% on pi0 backbone and 10.4% on OpenVLA backbone across 17 perturbations, with 50.6x faster inference than existing methods. On real-world FR5 robot with limited demonstrations, it shows 65.6% gain under four modality perturbations.
Conclusion: The proposed RobustVLA framework effectively addresses multi-modal robustness in VLAs, demonstrating superior performance across various perturbations and real-world robotic applications.
Abstract: In Vision-Language-Action (VLA) models, robustness to real-world perturbations is critical for deployment. Existing methods target simple visual disturbances, overlooking the broader multi-modal perturbations that arise in actions, instructions, environments, and observations. Here, we first evaluate the robustness of mainstream VLAs under 17 perturbations across four modalities. We find that (1) actions are the most fragile modality, (2) existing visual-robust VLAs do not gain robustness in other modalities, and (3) pi0 demonstrates superior robustness with a diffusion-based action head. To build multi-modal robust VLAs, we propose RobustVLA against perturbations in VLA inputs and outputs. For output robustness, we perform offline robust optimization against worst-case action noise that maximizes mismatch in the flow matching objective. This can be seen as adversarial training, label smoothing, and outlier penalization. For input robustness, we enforce consistent actions across input variations that preserve task semantics. To account for multiple perturbations, we formulate robustness as a multi-armed bandit problem and apply an upper confidence bound algorithm to automatically identify the most harmful noise. Experiments on LIBERO demonstrate our RobustVLA delivers absolute gains over baselines of 12.6% on the pi0 backbone and 10.4% on the OpenVLA backbone across all 17 perturbations, achieving 50.6x faster inference than existing visual-robust VLAs, and a 10.4% gain under mixed perturbations. Our RobustVLA is particularly effective on a real-world FR5 robot with limited demonstrations, showing absolute gains of 65.6% under perturbations of four modalities.
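The upper-confidence-bound selection of the most harmful perturbation can be illustrated with a standard UCB-1 bandit. In the sketch below, each arm is a perturbation type and its reward is the training loss it induces, so the bandit keeps favouring the noise that currently hurts the policy most; the arm names, the constant `c`, and the helper `train_step_with` are assumptions for illustration, not the paper's implementation.

```python
import math

def ucb_pick_perturbation(stats, c=2.0):
    """Pick the perturbation arm with the highest upper confidence bound.
    `stats` maps arm name -> (pull_count, mean_observed_loss)."""
    total_pulls = sum(n for n, _ in stats.values()) + 1
    def ucb(arm):
        n, mean_loss = stats[arm]
        if n == 0:
            return float("inf")            # explore each perturbation at least once
        return mean_loss + c * math.sqrt(math.log(total_pulls) / n)
    return max(stats, key=ucb)

def update_stats(stats, arm, observed_loss):
    n, mean = stats[arm]
    stats[arm] = (n + 1, mean + (observed_loss - mean) / (n + 1))

# usage sketch (arm names and train_step_with are hypothetical):
# stats = {name: (0, 0.0) for name in ["action_noise", "instruction_swap",
#                                      "background_shift", "camera_jitter"]}
# arm = ucb_pick_perturbation(stats)
# loss = train_step_with(arm)
# update_stats(stats, arm, loss)
```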
[246] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan
Main category: cs.CV
TL;DR: Proposes DRBD-Mamba, an efficient 3D brain tumor segmentation model using dual-resolution bi-directional Mamba with space-filling curves and gated fusion, achieving computational efficiency and improved performance across diverse BraTS data partitions.
Details
Motivation: Brain tumor segmentation is challenging due to tumor heterogeneity, and existing Mamba-based models have computational overhead from sequential feature computation across multiple axes. Robustness across diverse BraTS data partitions remains unexplored.Method: DRBD-Mamba uses space-filling curves for 3D-to-1D feature mapping to preserve spatial locality, reducing multi-axial feature scans. Includes gated fusion module for adaptive context integration and quantization block for robustness. Also proposes five systematic folds on BraTS2023 for rigorous evaluation.
Result: Achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor on 20% test set. On systematic folds, maintains competitive whole tumor accuracy with average gains of 1.16% for tumor core and 1.68% for enhancing tumor over SOTA. Achieves 15x efficiency improvement while maintaining high segmentation accuracy.
Conclusion: DRBD-Mamba provides robust and computationally efficient brain tumor segmentation with improved performance across diverse data conditions, demonstrating both computational advantage and enhanced segmentation accuracy over existing methods.
Abstract: Accurate brain tumor segmentation is significant for clinical diagnosis and treatment but remains challenging due to tumor heterogeneity. Mamba-based State Space Models have demonstrated promising performance. However, despite their computational efficiency over other neural architectures, they incur considerable overhead for this task due to their sequential feature computation across multiple spatial axes. Moreover, their robustness across diverse BraTS data partitions remains largely unexplored, leaving a critical gap in reliable evaluation. To address this, we first propose a dual-resolution bi-directional Mamba (DRBD-Mamba), an efficient 3D segmentation model that captures multi-scale long-range dependencies with minimal computational overhead. We leverage a space-filling curve to preserve spatial locality during 3D-to-1D feature mapping, thereby reducing reliance on computationally expensive multi-axial feature scans. To enrich feature representation, we propose a gated fusion module that adaptively integrates forward and reverse contexts, along with a quantization block that improves robustness. We further propose five systematic folds on BraTS2023 for rigorous evaluation of segmentation techniques under diverse conditions and present analysis of common failure scenarios. On the 20% test set used by recent methods, our model achieves Dice improvements of 0.10% for whole tumor, 1.75% for tumor core, and 0.93% for enhancing tumor. Evaluations on the proposed systematic folds demonstrate that our model maintains competitive whole tumor accuracy while achieving clear average Dice gains of 1.16% for tumor core and 1.68% for enhancing tumor over existing state-of-the-art. Furthermore, our model achieves a 15x efficiency improvement while maintaining high segmentation accuracy, highlighting its robustness and computational advantage over existing methods.
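The paper only states that a space-filling curve is used for the 3D-to-1D mapping; Morton (Z-order) indexing is one common locality-preserving choice and is shown below purely as an illustration of the idea, not as the model's actual curve.

```python
import numpy as np

def morton_index(x, y, z, bits=10):
    """Interleave the bits of (x, y, z) into a single Morton (Z-order) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def flatten_volume(volume):
    """Flatten a (D, H, W) feature volume into a 1D sequence ordered along the
    Morton curve, so spatially adjacent voxels tend to stay adjacent in the
    token sequence consumed by the bi-directional scans."""
    D, H, W = volume.shape
    coords = [(morton_index(x, y, z), z, y, x)
              for z in range(D) for y in range(H) for x in range(W)]
    coords.sort()
    return np.array([volume[z, y, x] for _, z, y, x in coords])
```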
[247] Real-Time Neural Video Compression with Unified Intra and Inter Coding
Hui Xiang, Yifan Bian, Li Li, Jingran Wu, Xianguo Zhang, Dong Liu
Main category: cs.CV
TL;DR: A neural video compression framework that combines intra and inter coding in a unified model, addressing limitations like disocclusion handling and error propagation while maintaining real-time performance.
Details
Motivation: Existing neural video compression schemes have limitations in handling disocclusion, new content, and interframe error propagation. The authors aim to eliminate these issues by borrowing from classic video coding approaches that allow intra coding within inter-coded frames.Method: Proposed an NVC framework with unified intra and inter coding using a single model trained to perform adaptive intra/inter coding. Also introduced simultaneous two-frame compression to exploit both forward and backward interframe redundancy.
Result: The scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and maintains real-time encoding/decoding performance.
Conclusion: The proposed unified intra/inter coding framework effectively addresses key limitations in neural video compression while achieving superior compression efficiency and stability compared to state-of-the-art methods.
Abstract: Neural video compression (NVC) technologies have advanced rapidly in recent years, yielding state-of-the-art schemes such as DCVC-RT that offer superior compression efficiency to H.266/VVC and real-time encoding/decoding capabilities. Nonetheless, existing NVC schemes have several limitations, including inefficiency in dealing with disocclusion and new content, interframe error propagation and accumulation, among others. To eliminate these limitations, we borrow the idea from classic video coding schemes, which allow intra coding within inter-coded frames. With the intra coding tool enabled, disocclusion and new content are properly handled, and interframe error propagation is naturally intercepted without the need for manual refresh mechanisms. We present an NVC framework with unified intra and inter coding, where every frame is processed by a single model that is trained to perform intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame compression design to exploit interframe redundancy not only forwardly but also backwardly. Experimental results show that our scheme outperforms DCVC-RT by an average of 10.7% BD-rate reduction, delivers more stable bitrate and quality per frame, and retains real-time encoding/decoding performances. Code and models will be released.
[248] UniMedVL: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis
Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Lihao Liu, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Zhongying Deng, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Ming Hu, Junjun He
Main category: cs.CV
TL;DR: UniMedVL is the first unified medical multimodal model that simultaneously handles image understanding and generation tasks within a single architecture, addressing the gap between specialized medical AI systems.
Details
Motivation: Existing medical AI systems disrupt unified diagnostic workflows - image understanding models can't generate visual outputs, while generation models can't provide textual explanations, creating gaps in multimodal capabilities.Method: Proposed a multi-level framework using Observation-Knowledge-Analysis (OKA) paradigm: created UniMed-5M dataset with 5.6M multimodal samples, used Progressive Curriculum Learning for medical knowledge integration, and developed UniMedVL unified model architecture.
Result: UniMedVL achieves superior performance on 5 medical image understanding benchmarks and matches specialized models in generation quality across 8 medical imaging modalities. Bidirectional knowledge sharing enhances visual understanding features.
Conclusion: Integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks, demonstrating the value of unified multimodal architectures.
Abstract: Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at https://github.com/uni-medical/UniMedVL.
[249] DWaste: Greener AI for Waste Sorting using Mobile and Edge Devices
Suman Kunwar
Main category: cs.CV
TL;DR: DWaste is a computer vision platform for real-time waste sorting on smartphones and edge devices, benchmarking various models to find optimal balance between accuracy and efficiency for sustainable waste management.
Details
Motivation: The rise of convenience packaging has generated enormous waste, making efficient waste sorting crucial for sustainable waste management, requiring solutions that work on resource-constrained devices.Method: Developed DWaste platform and benchmarked image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n, including proposed YOLOv8n-CBAM) using annotated recycling dataset, with model quantization for efficiency.
Result: EfficientNetV2S achieved highest accuracy (~96%) but had high latency (~0.22s) and carbon emissions. Lightweight object detection models delivered strong performance (up to 80% mAP) with ultra-fast inference (~0.03s) and small model sizes (<7MB), ideal for real-time use. Quantization reduced model size and VRAM usage by up to 75%.
Conclusion: The work demonstrates successful implementation of “Greener AI” models for real-time, sustainable waste sorting on edge devices, with lightweight object detection models being optimal for low-power, real-time applications.
Abstract: The rise of convenience packaging has led to the generation of enormous waste, making efficient waste sorting crucial for sustainable waste management. To address this, we developed DWaste, a computer vision-powered platform designed for real-time waste sorting on resource-constrained smartphones and edge devices, including offline functionality. We benchmarked various image classification models (EfficientNetV2S/M, ResNet50/101, MobileNet) and object detection models (YOLOv8n, YOLOv11n), including our proposed YOLOv8n-CBAM model, using our annotated dataset designed for recycling. We found a clear trade-off between accuracy and resource consumption: the best classifier, EfficientNetV2S, achieved high accuracy (~96%) but suffered from high latency (~0.22s) and elevated carbon emissions. In contrast, lightweight object detection models delivered strong performance (up to 80% mAP) with ultra-fast inference (~0.03s) and significantly smaller model sizes (<7 MB), making them ideal for real-time, low-power use. Model quantization further maximized efficiency, substantially reducing model size and VRAM usage by up to 75%. Our work demonstrates the successful implementation of “Greener AI” models to support real-time, sustainable waste sorting on edge devices.
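As a generic illustration of the model quantization step (the summary does not specify the exact recipe used), post-training dynamic quantization in PyTorch converts the linear layers to int8 weights, which is one common way to shrink a model for edge deployment:

```python
import torch

def quantize_for_edge(model):
    """Post-training dynamic quantization of linear layers to int8 weights.
    Shown only as a generic example of edge-oriented quantization; DWaste's
    exact quantization procedure may differ."""
    return torch.ao.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)

# usage: small_model = quantize_for_edge(trained_classifier)
```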
[250] Bridging the gap to real-world language-grounded visual concept learning
Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong
Main category: cs.CV
TL;DR: A framework for adaptive visual concept learning that identifies image-related concept axes and grounds visual concepts in real-world scenes using pretrained vision-language models, enabling superior editing capabilities without predefined axes.
Details
Motivation: Existing approaches to language-grounded visual concept learning are limited to predefined primitive axes (like color and shape) and typically work only on synthetic datasets, failing to capture the rich spectrum of semantic dimensions that human intelligence effortlessly interprets.Method: Uses a pretrained vision-language model with universal prompting to identify diverse image-related axes without prior knowledge. A universal concept encoder binds visual features to discovered axes without additional parameters per concept. Optimizes a compositional anchoring objective to ensure axes can be independently manipulated.
Result: Demonstrated effectiveness on ImageNet, CelebA-HQ, and AFHQ datasets, showing superior editing capabilities across diverse real-world concepts. Exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods.
Conclusion: The proposed framework enables scalable, adaptive visual concept learning in real-world scenes, overcoming limitations of predefined axes and synthetic datasets while achieving state-of-the-art editing performance and compositional generalization.
Abstract: Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies a diverse set of image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.
[251] GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, Matei Zaharia
Main category: cs.CV
TL;DR: GRAID is a framework that generates high-quality spatial reasoning datasets using 2D bounding boxes from object detectors, avoiding 3D reconstruction errors and generative hallucinations, achieving 91.16% human-validated accuracy.
Details
Motivation: Current VLMs struggle with spatial reasoning, and existing dataset generation methods have limitations: single-image 3D reconstruction introduces cascading errors requiring wide tolerances, while caption-based methods need hyper-detailed annotations and suffer from hallucinations.Method: GRAID operates exclusively on 2D bounding boxes from standard object detectors to determine qualitative spatial relationships, avoiding both 3D reconstruction errors and generative hallucinations.
Result: Generated over 8.5 million high-quality VQA pairs across BDD100k, NuImages, and Waymo datasets with 91.16% human-validated accuracy (vs 57.6% from recent work). Models trained on GRAID data show generalization: fine-tuned on 6 question types improve on over 10 held-out types with 47.5% and 37.9% accuracy gains.
Conclusion: GRAID produces higher-quality spatial reasoning datasets than existing tools, enabling VLMs to learn spatial reasoning concepts that generalize across question types and improve performance on existing benchmarks.
Abstract: Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets of higher quality than those produced by existing tools, as validated by human evaluations. We apply our framework to the BDD100k, NuImages, and Waymo datasets, generating over 8.5 million high-quality VQA pairs with questions spanning spatial relations, counting, ranking, and size comparisons. We evaluate one of the datasets and find it achieves 91.16% human-validated accuracy, compared to 57.6% on a dataset generated by recent work. Critically, we demonstrate that when trained on GRAID data, models learn spatial reasoning concepts that generalize: models fine-tuned on 6 question types improve on over 10 held-out types, with accuracy gains of 47.5% on BDD and 37.9% on NuImages for Llama 3.2 11B, and when trained on all question types, achieve improvements on several existing benchmarks such as BLINK. The GRAID framework, datasets, and additional information can be found at this https URL.
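Because GRAID's questions are derived from 2D boxes alone, the core predicates reduce to simple geometry. The sketch below shows the kind of purely geometric relation check involved; the released framework's exact predicates and thresholds may differ.

```python
def spatial_relations(box_a, box_b):
    """Qualitative relations between two 2D boxes given as (x1, y1, x2, y2)
    in pixel coordinates with y increasing downward. Illustrative only."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    rels = []
    if ax2 < bx1:
        rels.append("left of")
    elif bx2 < ax1:
        rels.append("right of")
    if ay2 < by1:
        rels.append("above")
    elif by2 < ay1:
        rels.append("below")
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    rels.append("larger than" if area_a > area_b else "smaller than")
    return rels

# spatial_relations((10, 10, 50, 60), (120, 10, 180, 60))
# -> ['left of', 'smaller than']
```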
[252] LongCat-Video Technical Report
Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, Tong Zhang
Main category: cs.CV
TL;DR: LongCat-Video is a 13.6B parameter video generation model that excels at efficient long video generation, supporting multiple tasks with unified architecture and achieving strong performance through multi-reward RLHF.
Details
Motivation: To develop efficient long video inference as a key capability toward building world models, addressing the need for high-quality, temporally coherent minute-long video generation.Method: Built on Diffusion Transformer (DiT) framework with unified architecture for Text-to-Video, Image-to-Video, and Video-Continuation tasks. Uses coarse-to-fine generation strategy, Block Sparse Attention, and multi-reward RLHF training.
Result: Generates 720p, 30fps videos within minutes while maintaining high quality and temporal coherence for long videos. Achieves performance comparable to latest closed-source and leading open-source models.
Conclusion: LongCat-Video represents a significant step toward world models, demonstrating efficient and high-quality long video generation capabilities with publicly available code and model weights to advance the field.
Abstract: Video generation is a critical pathway toward world models, with efficient long video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters, delivering strong performance across multiple video generation tasks. It particularly excels in efficient and high-quality long video generation, representing our first step toward world models. Key features include: Unified architecture for multiple tasks: Built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model; Long video generation: Pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence in the generation of minutes-long videos; Efficient inference: LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes. Block Sparse Attention further enhances efficiency, particularly at high resolutions; Strong performance with multi-reward RLHF: Multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models. Code and model weights are publicly available to accelerate progress in the field.
[253] VADTree: Explainable Training-Free Video Anomaly Detection via Hierarchical Granularity-Aware Tree
Wenlong Li, Yifei Xu, Yuan Rao, Zhenhua Wang, Shuiguang Deng
Main category: cs.CV
TL;DR: VADTree is a training-free video anomaly detection method that uses a hierarchical granularity-aware tree structure for flexible temporal sampling, leveraging pre-trained GEBD models for event boundary detection and VLMs/LLMs for anomaly reasoning.
Details
Motivation: Supervised VAD methods require large in-domain training data and lack explainability, while existing training-free methods with fixed-length temporal windows struggle to capture anomalies of varying temporal spans.Method: Proposes VADTree with HGTree structure that decomposes videos into generic event nodes using GEBD, performs hierarchical structuring and redundancy removal, injects multi-dimensional priors into VLMs for node-wise anomaly perception, and uses LLMs for anomaly reasoning with inter-cluster correlation.
Result: Achieves state-of-the-art performance in training-free settings on three challenging datasets while drastically reducing the number of sampled video segments.
Conclusion: VADTree provides an effective training-free solution for video anomaly detection that handles varying temporal spans through flexible hierarchical sampling and leverages pre-trained models for explainable anomaly detection.
Abstract: Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, the current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree that utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, the multi-dimensional priors are injected into the visual language models (VLMs) to enhance the node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method is used to integrate the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at https://github.com/wenlongli10/VADTree.
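As a toy illustration of granularity-aware segmentation from GEBD boundary confidences (the thresholds and the flat dictionary layout below are assumptions, not the paper's exact tree construction):

```python
def build_hgtree(boundary_conf, thresholds=(0.9, 0.7, 0.5)):
    """High thresholds keep only the strongest boundaries (coarse segments);
    lower thresholds add progressively finer segments."""
    def segments(thr):
        cuts = [i + 1 for i, c in enumerate(boundary_conf) if c >= thr]
        edges = [0] + cuts + [len(boundary_conf) + 1]
        return list(zip(edges[:-1], edges[1:]))
    return {thr: segments(thr) for thr in thresholds}

# boundary_conf[i] is the GEBD confidence of a boundary after chunk i
# build_hgtree([0.95, 0.2, 0.75, 0.4])
# -> {0.9: [(0, 1), (1, 5)],
#     0.7: [(0, 1), (1, 3), (3, 5)],
#     0.5: [(0, 1), (1, 3), (3, 5)]}
```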
[254] IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu
Main category: cs.CV
TL;DR: IGGT is an end-to-end transformer that unifies 3D spatial reconstruction and instance-level semantic understanding through 3D-consistent contrastive learning, enabling coherent 3D scene perception from 2D inputs.
Details
Motivation: Prior approaches treat 3D geometry reconstruction and semantic understanding separately, overlooking their crucial interplay, which limits generalization and downstream task performance. Simple alignment methods restrict perception to aligned models' capacity.Method: Propose InstanceGrounded Geometry Transformer (IGGT) with 3D-Consistent Contrastive Learning strategy to encode unified representations with geometric structures and instance-grounded clustering from 2D visual inputs. Also create InsScene-15K dataset with comprehensive annotations.
Result: IGGT enables consistent lifting of 2D visual inputs into coherent 3D scenes with explicitly distinct object instances, supporting both spatial reconstruction and instance-level contextual understanding.
Conclusion: The proposed unified approach addresses the limitations of prior methods by jointly modeling geometric and semantic dimensions, facilitating better generalization and performance in downstream 3D understanding tasks.
Abstract: Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model’s capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.
[255] Switchable Token-Specific Codebook Quantization For Face Image Compression
Yongbo Wang, Haonan Wang, Guodong Mu, Ruixin Zhang, Jiaqi Chen, Jingyun Zhang, Jun Wang, Yuan Xie, Zhizhong Zhang, Shouhong Ding
Main category: cs.CV
TL;DR: Proposes Switchable Token-Specific Codebook Quantization for face image compression, using distinct codebook groups for different image categories and independent codebooks per token to improve performance at low bit rates.
Details
Motivation: Global codebook strategies for face images overlook category-specific correlations and semantic token differences, leading to suboptimal performance especially at low bit rates.Method: Learns distinct codebook groups for different image categories and assigns independent codebooks to each token, recording codebook group membership with minimal bits to enable more codebooks under lower overall bpp.
Result: Achieves 93.51% average accuracy for reconstructed face images at 0.05 bpp on face recognition datasets.
Conclusion: The method enhances expressive capability and reconstruction performance, can be integrated into existing codebook-based approaches, and is particularly effective for face image compression at low bit rates.
Abstract: With the ever-increasing volume of visual data, efficient and lossless transmission, along with subsequent interpretation and understanding, has become a critical bottleneck in modern information systems. Emerging codebook-based solutions utilize a globally shared codebook to quantize and dequantize each token, controlling the bpp by adjusting the number of tokens or the codebook size. However, for facial images, which are rich in attributes, such global codebook strategies overlook both the category-specific correlations within images and the semantic differences among tokens, resulting in suboptimal performance, especially at low bpp. Motivated by these observations, we propose Switchable Token-Specific Codebook Quantization for face image compression, which learns distinct codebook groups for different image categories and assigns an independent codebook to each token. By recording the codebook group to which each token belongs with a small number of bits, our method can reduce the loss incurred when decreasing the size of each codebook group. This enables a larger total number of codebooks under a lower overall bpp, thereby enhancing the expressive capability and improving reconstruction performance. Owing to its generalizable design, our method can be integrated into any existing codebook-based representation learning approach and has demonstrated its effectiveness on face recognition datasets, achieving an average accuracy of 93.51% for reconstructed images at 0.05 bpp.
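A minimal sketch of what token-specific, switchable quantization could look like is given below; the tensor layout, the per-token group selection rule, and the bit accounting are assumptions made for illustration, not the paper's implementation.

```python
import math
import torch

def quantize_token(feature, codebook_groups, token_idx):
    """`codebook_groups` has assumed shape (num_groups, num_tokens,
    codes_per_group, dim): every token position owns a small codebook inside
    each group. The group whose nearest code best reconstructs this token is
    chosen, so only the group id and code id need to be transmitted."""
    num_groups, _, codes_per_group, _ = codebook_groups.shape
    best_group, best_code, best_dist = 0, 0, float("inf")
    for g in range(num_groups):
        cb = codebook_groups[g, token_idx]                 # (codes_per_group, dim)
        dists = torch.cdist(feature[None], cb).squeeze(0)  # (codes_per_group,)
        code = int(dists.argmin())
        if float(dists[code]) < best_dist:
            best_group, best_code, best_dist = g, code, float(dists[code])
    bits = math.log2(num_groups) + math.log2(codes_per_group)
    return best_group, best_code, bits
```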
[256] Task-Agnostic Fusion of Time Series and Imagery for Earth Observation
Gianfranco Basile, Johannes Jakubik, Benedikt Blumenstiel, Thomas Brunschwiler, Juan Bernabe Moreno
Main category: cs.CV
TL;DR: A task-agnostic framework for multimodal fusion of time series and images using deterministic/learned quantization and masked correlation learning, achieving superior performance in cross-modal generation and downstream tasks.
Details
Motivation: To enable robust multimodal fusion between time series and single timestamp images for cross-modal generation and improved downstream task performance without task-specific tuning.Method: Uses deterministic and learned strategies for time series quantization, then employs masked correlation learning to align discrete image and time series tokens in a unified representation space.
Result: Outperforms task-specific fusion by 6% in R² and 2% in RMSE on average, and exceeds baseline methods by 50% in R² and 12% in RMSE. Successfully generates consistent global temperature profiles from satellite imagery.
Conclusion: The task-agnostic pretraining framework provides robust multimodal fusion capabilities, enabling cross-modal generation and superior performance across downstream tasks compared to specialized approaches.
Abstract: We propose a task-agnostic framework for multimodal fusion of time series and single timestamp images, enabling cross-modal generation and robust downstream performance. Our approach explores deterministic and learned strategies for time series quantization and then leverages a masked correlation learning objective, aligning discrete image and time series tokens in a unified representation space. Instantiated in the Earth observation domain, the pretrained model generates consistent global temperature profiles from satellite imagery and is validated through counterfactual experiments. Across downstream tasks, our task-agnostic pretraining outperforms task-specific fusion by 6% in R^2 and 2% in RMSE on average, and exceeds baseline methods by 50% in R^2 and 12% in RMSE. Finally, we analyze gradient sensitivity across modalities, providing insights into model robustness. Code, data, and weights will be released under a permissive license.
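The deterministic quantization strategy can be illustrated with simple uniform binning of the series into discrete token ids that can sit alongside discrete image tokens; the bin count and binning scheme below are assumptions, not the paper's exact choice.

```python
import numpy as np

def quantize_series(series, num_bins=256):
    """Deterministic uniform binning of a 1D time series into token ids."""
    series = np.asarray(series, dtype=float)
    lo, hi = series.min(), series.max()
    edges = np.linspace(lo, hi, num_bins + 1)
    return np.digitize(series, edges[1:-1])   # ids in [0, num_bins - 1]

# quantize_series([14.2, 15.0, 13.8, 20.1], num_bins=4) -> array([0, 0, 0, 3])
```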
[257] Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions
Razaib Tariq, Minji Heo, Simon S. Woo, Shahroz Tariq
Main category: cs.CV
TL;DR: Deepfake detectors suffer significant performance degradation (up to 25.4%) when dealing with Moiré artifacts from smartphone-captured screen videos, and demoiréing methods actually worsen detection accuracy.
Details
Motivation: To address the overlooked problem of Moiré artifacts in real-world deepfake detection scenarios, particularly from smartphone-captured media of digital screens.Method: Systematic evaluation of 15 SOTA deepfake detectors using a dataset of 12,832 videos (35.64 hours) from multiple sources, plus additional experiments with the DeepMoiréFake (DMF) dataset and synthetic Moiré generation techniques.
Result: Moiré artifacts degrade detector performance by up to 25.4%, synthetic Moiré patterns cause 21.4% accuracy drop, and demoiréing methods unexpectedly reduce accuracy by up to 17.2%.
Conclusion: There is an urgent need for detection models robust to Moiré distortions and other real-world challenges, and the DMF dataset aims to bridge the gap between controlled experiments and practical detection.
Abstract: Deepfake detection remains a pressing challenge, particularly in real-world settings where smartphone-captured media from digital screens often introduces Moiré artifacts that can distort detection outcomes. This study systematically evaluates state-of-the-art (SOTA) deepfake detectors on Moiré-affected videos, an issue that has received little attention. We collected a dataset of 12,832 videos, spanning 35.64 hours, from the Celeb-DF, DFD, DFDC, UADFV, and FF++ datasets, capturing footage under diverse real-world conditions, including varying screens, smartphones, lighting setups, and camera angles. To further examine the influence of Moiré patterns on deepfake detection, we conducted additional experiments using our DeepMoiréFake (DMF) dataset and two synthetic Moiré generation techniques. Across 15 top-performing detectors, our results show that Moiré artifacts degrade performance by as much as 25.4%, while synthetically generated Moiré patterns lead to a 21.4% drop in accuracy. Surprisingly, demoiréing methods, intended as a mitigation approach, instead worsened the problem, reducing accuracy by up to 17.2%. These findings underscore the urgent need for detection models that can robustly handle Moiré distortions alongside other real-world challenges, such as compression, sharpening, and blurring. By introducing the DMF dataset, we aim to drive future research toward closing the gap between controlled experiments and practical deepfake detection.
[258] FRBNet: Revisiting Low-Light Vision through Frequency-Domain Radial Basis Network
Fangtong Sun, Congyu Li, Ke Yang, Yuchen Pan, Hanwen Yu, Xichuan Zhang, Yiying Li
Main category: cs.CV
TL;DR: FRBNet is a frequency-domain plug-and-play module that extracts illumination-invariant features for low-light vision tasks by leveraging frequency-domain channel ratios and learnable filters.
Details
Motivation: Existing methods for low-light vision fall short due to incomplete modeling of low-light conditions, and illumination degradation significantly affects downstream tasks like detection and segmentation.Method: Extends Lambertian model to characterize low-light conditions, analyzes in frequency domain, and proposes FRBNet - a frequency-domain radial basis network that uses frequency-domain channel ratios with learnable filters for illumination-invariant feature enhancement.
Result: Achieves +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation, demonstrating superior performance across various downstream tasks.
Conclusion: FRBNet effectively addresses low-light vision challenges through frequency-domain analysis and can be seamlessly integrated into existing networks as a plug-and-play module.
Abstract: Low-light vision remains a fundamental challenge in computer vision due to severe illumination degradation, which significantly affects the performance of downstream tasks such as detection and segmentation. While recent state-of-the-art methods have improved performance through invariant feature learning modules, they still fall short due to incomplete modeling of low-light conditions. Therefore, we revisit low-light image formation and extend the classical Lambertian model to better characterize low-light conditions. By shifting our analysis to the frequency domain, we theoretically prove that the frequency-domain channel ratio can be leveraged to extract illumination-invariant features via a structured filtering process. We then propose a novel and end-to-end trainable module named Frequency-domain Radial Basis Network (FRBNet), which integrates the frequency-domain channel ratio operation with a learnable frequency domain filter for the overall illumination-invariant feature enhancement. As a plug-and-play module, FRBNet can be integrated into existing networks for low-light downstream tasks without modifying loss functions. Extensive experiments across various downstream tasks demonstrate that FRBNet achieves superior performance, including +2.2 mAP for dark object detection and +2.9 mIoU for nighttime segmentation. Code is available at: https://github.com/Sing-Forevet/FRBNet.
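The frequency-domain channel ratio can be illustrated with a short sketch: under a shared-illumination assumption, ratios between colour channels cancel the illumination term, and taking the ratio on FFT magnitude spectra leaves room for a learnable frequency filter afterwards. This is an illustration of the idea, not the exact FRBNet operator.

```python
import torch

def frequency_channel_ratio(img, eps=1e-6):
    """Illustrative illumination-invariant feature from channel ratios of
    magnitude spectra. img: (B, 3, H, W) float tensor."""
    spec = torch.fft.rfft2(img)                 # complex spectrum per channel
    mag = spec.abs() + eps
    r, g, b = mag[:, 0], mag[:, 1], mag[:, 2]
    return torch.log(torch.stack([r / g, g / b, b / r], dim=1))
```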
[259] VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Walid Bousselham, Hilde Kuehne, Cordelia Schmid
Main category: cs.CV
TL;DR: VOLD is a framework that transfers reasoning capabilities from text-only teacher models to vision-language student models using reinforcement learning with on-policy distillation, achieving state-of-the-art performance on reasoning benchmarks.
Details
Motivation: Training vision-language models for complex reasoning is challenging due to scarce high-quality image-text reasoning data, while text-based reasoning resources are abundant but underutilized for VLM reasoning.Method: Combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, where student reasoning traces are guided by teacher models, plus cold-start alignment via supervised fine-tuning.
Result: Outperforms baseline models significantly and improves state-of-the-art across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista.
Conclusion: Cold-start alignment is essential for effective transfer during online training, and without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance.
Abstract: Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a clear margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
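A rough sketch of how a GRPO-style policy-gradient term can be combined with an on-policy distillation term is given below. The tensor shapes, the reverse-KL choice, the weighting `beta`, and the omission of GRPO's clipping are all assumptions for illustration, not VOLD's implementation.

```python
import torch
import torch.nn.functional as F

def vold_style_loss(student_logits, teacher_logits, actions, advantages, beta=0.1):
    """student_logits / teacher_logits: (T, V) vocabulary logits for ONE reasoning
    trace sampled by the student; actions: (T,) token ids of that trace;
    advantages: (T,) group-relative advantages broadcast over the trace."""
    logp_student = F.log_softmax(student_logits, dim=-1)
    logp_teacher = F.log_softmax(teacher_logits, dim=-1)

    # policy-gradient term on the group-relative advantage (clipping omitted)
    token_logp = logp_student.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(advantages * token_logp).mean()

    # on-policy distillation: the teacher scores the student's OWN trace, so the
    # KL is evaluated on tokens the student actually visited (reverse KL here)
    kl_term = F.kl_div(logp_teacher, logp_student, log_target=True,
                       reduction="batchmean")
    return pg_term + beta * kl_term

T, V = 16, 100
loss = vold_style_loss(torch.randn(T, V), torch.randn(T, V),
                       torch.randint(0, V, (T,)), torch.randn(T))
print(loss.item())
```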
[260] PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan
Main category: cs.CV
TL;DR: PRISM-Bench is a benchmark for evaluating multimodal LLMs’ reasoning processes through puzzle-based visual challenges that require identifying errors in chain-of-thought reasoning.
Details
Motivation: Current MLLMs show unreliable reasoning despite progress on vision-language tasks, and existing evaluations only measure final-answer accuracy without assessing reasoning quality.Method: Uses visual puzzles requiring multi-step symbolic, geometric, and analogical reasoning, with diagnostic tasks where models must identify the first incorrect step in chain-of-thought reasoning containing exactly one error.
Result: Evaluations reveal a gap between fluent generation and faithful reasoning - models producing plausible CoTs often fail to locate simple logical faults.
Conclusion: PRISM-Bench provides a sharper evaluation of multimodal reasoning competence and highlights the need for diagnostic evaluation protocols in developing trustworthy MLLMs.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
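Since the diagnostic task is simply "name the first incorrect step", the scoring logic can be summarized in a few lines; the record fields below are hypothetical and only mirror the task description above.

```python
from dataclasses import dataclass

@dataclass
class PuzzleItem:
    puzzle_id: str
    cot_steps: list          # step-by-step chain of thought shown to the model
    first_error_step: int    # 1-based index of the single incorrect step

def first_error_accuracy(items, predictions):
    """A prediction counts only if it names the first incorrect step exactly."""
    hits = sum(int(predictions[it.puzzle_id] == it.first_error_step) for it in items)
    return hits / len(items)

items = [PuzzleItem("p1", ["step 1", "step 2", "step 3"], 2),
         PuzzleItem("p2", ["step 1", "step 2"], 1)]
print(first_error_accuracy(items, {"p1": 2, "p2": 2}))  # 0.5
```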
cs.AI
[261] Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents
Zihao Wang, Xujing Li, Yining Ye, Junjie Fang, Haoming Wang, Longxiang Liu, Shihao Liang, Junting Lu, Zhiyong Wu, Jiazhan Feng, Wanjun Zhong, Zili Li, Yu Wang, Yu Miao, Bo Zhou, Yuanfan Li, Hao Wang, Zhongkai Zhao, Faming Wu, Zhengxuan Jiang, Weihao Tan, Heyuan Yao, Shi Yan, Xiangyang Li, Yitao Liang, Yujia Qin, Guang Shi
Main category: cs.AI
TL;DR: Game-TARS is a generalist game agent using unified keyboard-mouse action space for cross-domain training, achieving superior performance in Minecraft, web games, and FPS benchmarks compared to state-of-the-art models.
Details
Motivation: To create a generalist agent that can operate across heterogeneous domains (OS, web, simulation games) using human-aligned native inputs rather than API- or GUI-based approaches, enabling large-scale continual pre-training.Method: Uses unified keyboard-mouse action space, pre-trained on 500B tokens with diverse trajectories and multimodal data. Implements decaying continual loss to reduce causal confusion and Sparse-Thinking strategy for efficient reasoning.
Result: Achieves 2x success rate over previous SOTA in Minecraft, close to human generality in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling confirms sustained improvements with cross-game multimodal data.
Conclusion: Simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
Abstract: We present Game-TARS, a generalist game agent trained with a unified, scalable action space anchored to human-aligned native keyboard-mouse inputs. Unlike API- or GUI-based approaches, this paradigm enables large-scale continual pre-training across heterogeneous domains, including OS, web, and simulation games. Game-TARS is pre-trained on over 500B tokens with diverse trajectories and multimodal data. Key techniques include a decaying continual loss to reduce causal confusion and an efficient Sparse-Thinking strategy that balances reasoning depth and inference cost. Experiments show that Game-TARS achieves about twice the success rate of the previous state-of-the-art model on open-world Minecraft tasks, is close to the generality of fresh humans in unseen web 3D games, and outperforms GPT-5, Gemini-2.5-Pro, and Claude-4-Sonnet in FPS benchmarks. Scaling results at training time and test time confirm that the unified action space sustains improvements when scaled to cross-game and multimodal data. Our results demonstrate that simple, scalable action representations combined with large-scale pre-training provide a promising path toward generalist agents with broad computer-use abilities.
[262] AI and the Decentering of Disciplinary Creativity
Eamon Duede
Main category: cs.AI
TL;DR: AI’s role in scientific problem-solving can extend but also displace disciplinary creativity, potentially diminishing the value of scientific pursuit.
Details
Motivation: To examine how artificial intelligence impacts disciplinary creativity in scientific fields, distinguishing between creative approaches and products.Method: Drawing on philosophy of creativity concepts and analyzing two mathematical case studies to demonstrate how AI can extend or displace disciplinary creativity.
Result: While computation can extend disciplinary creativity, certain AI approaches can displace it, potentially altering the value of scientific work.
Conclusion: AI’s role in science needs careful consideration as it can both enhance and potentially diminish disciplinary creativity and the value of scientific pursuit.
Abstract: This paper examines the role of artificial intelligence in scientific problem-solving, with a focus on its implications for disciplinary creativity. Drawing on recent work in the philosophy of creativity, I distinguish between creative approaches and creative products, and introduce the concept of disciplinary creativity - the creative application of discipline-specific expertise to a valued problem within that field. Through two cases in mathematics, I show that while computation can extend disciplinary creativity, certain approaches involving AI can serve to displace it. This displacement has the potential to alter (and, perhaps, diminish) the value of scientific pursuit.
[263] BLM$_1$: A Boundless Large Model for Cross-Space, Cross-Task, and Cross-Embodiment Learning
Wentao Tan, Bowen Wang, Heng Zhi, Chenyu Liu, Zhe Li, Jian Liu, Zengrong Lin, Yukun Dai, Yipeng Chen, Wenjie Yang, Enci Xie, Hao Xue, Baixu Ji, Chen Xu, Zhibin Wang, Tianshi Wang, Lei Zhu, Heng Tao Shen
Main category: cs.AI
TL;DR: BLM$_1$ is a multimodal spatial foundation model that unifies digital and physical spaces, enabling cross-embodiment control and robust reasoning through a two-stage training approach.
Details
Motivation: Current MLLMs, VLAs, and ELLMs have limitations in generalizing across digital-physical spaces, embodiments, and tasks, lacking unified models that operate seamlessly across these domains.Method: Two-stage training: Stage I injects embodied knowledge into MLLM through curated digital corpora while maintaining language competence; Stage II trains a policy module via intent-bridging interface that extracts high-level semantics from MLLM to guide control without fine-tuning the backbone.
Result: BLM$_1$ outperforms four model families (MLLMs, ELLMs, VLAs, GMLMs) with ~6% gains in digital tasks and ~3% in physical tasks across digital and physical benchmarks.
Conclusion: BLM$_1$ successfully demonstrates unified cross-space, cross-task, and cross-embodiment capabilities, providing a foundation model that bridges the gap between digital and physical embodied intelligence.
Abstract: Multimodal large language models (MLLMs) have advanced vision-language reasoning and are increasingly deployed in embodied agents. However, significant limitations remain: MLLMs generalize poorly across digital-physical spaces and embodiments; vision-language-action models (VLAs) produce low-level actions yet lack robust high-level embodied reasoning; and most embodied large language models (ELLMs) are constrained to digital-space with poor generalization to the physical world. Thus, unified models that operate seamlessly across digital and physical spaces while generalizing across embodiments and tasks remain absent. We introduce the Boundless Large Model (BLM$_1$), a multimodal spatial foundation model that preserves instruction following and reasoning, incorporates embodied knowledge, and supports robust cross-embodiment control. BLM$_1$ integrates three key capabilities – cross-space transfer, cross-task learning, and cross-embodiment generalization – via a two-stage training paradigm. Stage I injects embodied knowledge into the MLLM through curated digital corpora while maintaining language competence. Stage II trains a policy module through an intent-bridging interface that extracts high-level semantics from the MLLM to guide control, without fine-tuning the MLLM backbone. This process is supported by a self-collected cross-embodiment demonstration suite spanning four robot embodiments and six progressively challenging tasks. Evaluations across digital and physical benchmarks show that a single BLM$_1$ instance outperforms four model families – MLLMs, ELLMs, VLAs, and GMLMs – achieving ~6% gains in digital tasks and ~3% in physical tasks.
[264] Multi-Environment POMDPs: Discrete Model Uncertainty Under Partial Observability
Eline M. Bovy, Caleb Probine, Marnix Suilen, Ufuk Topcu, Nils Jansen
Main category: cs.AI
TL;DR: ME-POMDPs extend POMDPs to handle model uncertainty by considering multiple possible POMDP models. The paper generalizes this to adversarial-belief POMDPs, shows reduction methods, and develops algorithms for computing robust policies.
Details
Motivation: To address situations where multiple domain experts disagree on how to model a problem, requiring robust policies that work well across all possible models.Method: Generalize ME-POMDPs to adversarial-belief POMDPs, show reduction techniques to simplify model variations, and develop exact and approximate (point-based) algorithms for computing robust policies.
Result: Successfully computed policies for standard POMDP benchmarks extended to multi-environment settings, demonstrating practical applicability of the approach.
Conclusion: The framework provides effective methods for handling model uncertainty in POMDPs through robust policy computation across multiple possible environment models.
Abstract: Multi-environment POMDPs (ME-POMDPs) extend standard POMDPs with discrete model uncertainty. ME-POMDPs represent a finite set of POMDPs that share the same state, action, and observation spaces, but may arbitrarily vary in their transition, observation, and reward models. Such models arise, for instance, when multiple domain experts disagree on how to model a problem. The goal is to find a single policy that is robust against any choice of POMDP within the set, i.e., a policy that maximizes the worst-case reward across all POMDPs. We generalize and expand on existing work in the following way. First, we show that ME-POMDPs can be generalized to POMDPs with sets of initial beliefs, which we call adversarial-belief POMDPs (AB-POMDPs). Second, we show that any arbitrary ME-POMDP can be reduced to a ME-POMDP that only varies in its transition and reward functions or only in its observation and reward functions, while preserving (optimal) policies. We then devise exact and approximate (point-based) algorithms to compute robust policies for AB-POMDPs, and thus ME-POMDPs. We demonstrate that we can compute policies for standard POMDP benchmarks extended to the multi-environment setting.
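The robust objective itself is easy to state in code: pick the policy whose worst-case value over the environment set is largest. The sketch below assumes value estimates for a handful of candidate policies are already available (e.g., from the exact or point-based algorithms); it does not reproduce those algorithms.

```python
import numpy as np

# values[i, j]: estimated value of candidate policy i in POMDP j of the ME-POMDP
values = np.array([
    [9.0, 2.0, 8.0],   # policy 0: strong in environments 0 and 2, weak in 1
    [6.0, 5.5, 6.5],   # policy 1: moderate everywhere
    [7.0, 4.0, 3.0],   # policy 2
])

worst_case = values.min(axis=1)             # worst environment for each policy
robust_choice = int(worst_case.argmax())    # maximize the worst-case reward
print(robust_choice, worst_case[robust_choice])  # 1 5.5
```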
[265] Test-Time Tuned Language Models Enable End-to-end De Novo Molecular Structure Generation from MS/MS Spectra
Laura Mismetti, Marvin Alberts, Andreas Krause, Mara Graziani
Main category: cs.AI
TL;DR: A novel framework using test-time tuning enhances pre-trained transformer models for de novo molecular structure generation directly from tandem mass spectra and molecular formulae, outperforming state-of-the-art methods by significant margins.
Details
Motivation: Current methods for compound identification in tandem mass spectrometry rely on database matching or multi-step pipelines with intermediate predictions, making it challenging to identify compounds absent from reference databases.Method: Leverages test-time tuning to enhance a pre-trained transformer model, enabling end-to-end de novo molecular structure generation directly from tandem mass spectra and molecular formulae without manual annotations or intermediate steps.
Result: Surpasses state-of-the-art DiffMS by 100% on NPLIB1 and 20% on MassSpecGym benchmarks. Test-time tuning provides 62% relative performance gain over conventional fine-tuning on MassSpecGym. Generated molecular candidates remain structurally accurate even when deviating from ground truth.
Conclusion: The framework enables reliable de novo molecular identification from mass spectra, providing valuable guidance for human interpretation and addressing the limitation of database-dependent methods.
Abstract: Tandem Mass Spectrometry enables the identification of unknown compounds in crucial fields such as metabolomics, natural product discovery and environmental analysis. However, current methods rely on database matching from previously observed molecules, or on multi-step pipelines that require intermediate fragment or fingerprint prediction. This makes finding the correct molecule highly challenging, particularly for compounds absent from reference databases. We introduce a framework that, by leveraging test-time tuning, enhances the learning of a pre-trained transformer model to address this gap, enabling end-to-end de novo molecular structure generation directly from the tandem mass spectra and molecular formulae, bypassing manual annotations and intermediate steps. We surpass the de-facto state-of-the-art approach DiffMS on two popular benchmarks, NPLIB1 and MassSpecGym, by 100% and 20%, respectively. Test-time tuning on experimental spectra allows the model to dynamically adapt to novel spectra, and the relative performance gain over conventional fine-tuning is 62% on MassSpecGym. When predictions deviate from the ground truth, the generated molecular candidates remain structurally accurate, providing valuable guidance for human interpretation and more reliable identification.
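Generically, test-time tuning takes a few gradient steps on a copy of the pre-trained model for each incoming test instance before decoding. The loop below shows only that generic pattern; the actual adaptation objective used on experimental spectra, the optimizer, and the step budget are not specified in this summary, so `loss_fn` is a placeholder.

```python
import copy
import torch

def test_time_tune(model, spectrum_batch, loss_fn, steps=8, lr=1e-5):
    """Adapt a throwaway copy of the model to the incoming spectrum, then decode.
    `loss_fn(model, batch)` is a placeholder for whatever per-sample objective
    the method optimizes at inference time."""
    tuned = copy.deepcopy(model)                 # never overwrite the base weights
    tuned.train()
    opt = torch.optim.AdamW(tuned.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(tuned, spectrum_batch).backward()
        opt.step()
    tuned.eval()
    return tuned                                 # used to generate the structure
```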
[266] Policy Cards: Machine-Readable Runtime Governance for Autonomous AI Agents
Juraj Mavračić
Main category: cs.AI
TL;DR: Policy Cards are a machine-readable standard for expressing operational, regulatory, and ethical constraints for AI agents, enabling runtime enforcement and verifiable compliance.
Details
Motivation: To address the need for practical mechanisms to integrate high-level governance with engineering practice for autonomous AI agents, enabling accountable autonomy at scale.Method: Create Policy Cards as deployment-layer artifacts that encode allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks like NIST AI RMF, ISO/IEC 42001, and EU AI Act.
Result: Policy Cards can be automatically validated, version-controlled, and linked to runtime enforcement or continuous-audit pipelines, forming a foundation for distributed assurance in multi-agent ecosystems.
Conclusion: Policy Cards provide a practical standard for verifiable compliance of autonomous agents, extending existing transparency artifacts with a normative layer for operational constraints.
Abstract: Policy Cards are introduced as a machine-readable, deployment-layer standard for expressing operational, regulatory, and ethical constraints for AI agents. The Policy Card sits with the agent and enables it to follow required constraints at runtime. It tells the agent what it must and must not do. As such, it becomes an integral part of the deployed agent. Policy Cards extend existing transparency artifacts such as Model, Data, and System Cards by defining a normative layer that encodes allow/deny rules, obligations, evidentiary requirements, and crosswalk mappings to assurance frameworks including NIST AI RMF, ISO/IEC 42001, and the EU AI Act. Each Policy Card can be validated automatically, version-controlled, and linked to runtime enforcement or continuous-audit pipelines. The framework enables verifiable compliance for autonomous agents, forming a foundation for distributed assurance in multi-agent ecosystems. Policy Cards provide a practical mechanism for integrating high-level governance with hands-on engineering practice and enabling accountable autonomy at scale.
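To give a feel for what such a deployment-layer artifact might contain, here is a toy Policy Card expressed as plain data together with a naive runtime check; the schema, field names, and default-deny semantics are illustrative assumptions, not the published standard.

```python
# Hypothetical Policy Card: allow/deny rules, obligations, and crosswalk mappings.
policy_card = {
    "version": "0.1",
    "allow": ["search_web", "read_crm_record"],
    "deny": ["send_payment", "delete_record"],
    "obligations": ["log_tool_call", "cite_evidence"],
    "crosswalk": {"NIST_AI_RMF": ["GOVERN 1.1"], "EU_AI_Act": ["Art. 13"]},
}

def is_permitted(card, tool_name):
    """Deny rules win over allow rules; anything unlisted is denied by default
    (a conservative assumption, not necessarily the standard's semantics)."""
    if tool_name in card["deny"]:
        return False
    return tool_name in card["allow"]

assert is_permitted(policy_card, "search_web")
assert not is_permitted(policy_card, "send_payment")
```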
[267] Evaluating In Silico Creativity: An Expert Review of AI Chess Compositions
Vivek Veeriah, Federico Barbero, Marcus Chiam, Xidong Feng, Michael Dennis, Ryan Pachauri, Thomas Tumiel, Johan Obando-Ceron, Jiaxin Shi, Shaobo Hou, Satinder Singh, Nenad Tomašev, Tom Zahavy
Main category: cs.AI
TL;DR: An AI system generates creative chess puzzles with aesthetic appeal and novel solutions, evaluated by chess experts.
Details
Motivation: To investigate whether Generative AI can produce creative and novel outputs, specifically in the domain of chess puzzles.Method: Developed an AI system to generate chess puzzles with aesthetic appeal, novelty, and counter-intuitive solutions. Evaluated by presenting curated puzzles to three world-renowned chess experts.
Result: Three chess experts (International Master Amatzia Avni, Grandmasters Jonathan Levitt and Matthew Sadler) selected their favorite AI-generated puzzles and explained their appeal based on creativity, challenge level, and aesthetic design.
Conclusion: The study demonstrates that Generative AI can produce creative chess puzzles that are appreciated by domain experts for their aesthetic qualities and novel solutions.
Abstract: The rapid advancement of Generative AI has raised significant questions regarding its ability to produce creative and novel outputs. Our recent work investigates this question within the domain of chess puzzles and presents an AI system designed to generate puzzles characterized by aesthetic appeal, novelty, counter-intuitive and unique solutions. We briefly discuss our method below and refer the reader to the technical paper for more details. To assess our system’s creativity, we presented a curated booklet of AI-generated puzzles to three world-renowned experts: International Master for chess compositions Amatzia Avni, Grandmaster Jonathan Levitt, and Grandmaster Matthew Sadler. All three are noted authors on chess aesthetics and the evolving role of computers in the game. They were asked to select their favorites and explain what made them appealing, considering qualities such as their creativity, level of challenge, or aesthetic design.
[268] Why Foundation Models in Pathology Are Failing
Hamid R. Tizhoosh
Main category: cs.AI
TL;DR: Current pathology foundation models show fundamental weaknesses including low accuracy, poor robustness, and safety vulnerabilities due to conceptual mismatches with tissue complexity.
Details
Motivation: To understand why foundation models that revolutionized other domains are failing in computational pathology despite high expectations for cancer diagnosis and prognostication.Method: Systematic evaluation of pathology foundation models to identify root causes of their shortcomings through conceptual analysis of mismatches between generic AI assumptions and tissue complexity.
Result: Identified seven interrelated causes: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and fundamental design flaws in patch size.
Conclusion: Current pathology foundation models are conceptually misaligned with tissue morphology and require fundamental paradigm rethinking rather than incremental improvements.
Abstract: In non-medical domains, foundation models (FMs) have revolutionized computer vision and language processing through large-scale self-supervised and multimodal learning. Consequently, their rapid adoption in computational pathology was expected to deliver comparable breakthroughs in cancer diagnosis, prognostication, and multimodal retrieval. However, recent systematic evaluations reveal fundamental weaknesses: low diagnostic accuracy, poor robustness, geometric instability, heavy computational demands, and concerning safety vulnerabilities. This short paper examines these shortcomings and argues that they stem from deeper conceptual mismatches between the assumptions underlying generic foundation modeling in mainstream AI and the intrinsic complexity of human tissue. Seven interrelated causes are identified: biological complexity, ineffective self-supervision, overgeneralization, excessive architectural complexity, lack of domain-specific innovation, insufficient data, and a fundamental design flaw related to tissue patch size. These findings suggest that current pathology foundation models remain conceptually misaligned with the nature of tissue morphology and call for a fundamental rethinking of the paradigm itself.
[269] Law in Silico: Simulating Legal Society with LLM-Based Agents
Yiding Wang, Yuxuan Chen, Fanxu Meng, Xifan Chen, Xiaolei Yang, Muhan Zhang
Main category: cs.AI
TL;DR: Law in Silico is an LLM-based agent framework that simulates legal societies with individual decision-making and institutional mechanisms, showing it can reproduce macro-level crime trends and provide insights about legal system effectiveness.
Details
Motivation: Real-world legal experiments are costly or infeasible, so simulating legal societies with AI provides an effective alternative for verifying legal theory and supporting legal administration. LLMs with their world knowledge and role-playing capabilities are well-suited for this task.Method: Introduce Law in Silico, an LLM-based agent framework for simulating legal scenarios with individual decision-making and institutional mechanisms of legislation, adjudication, and enforcement.
Result: Experiments comparing simulated crime rates with real-world data show LLM-based agents can largely reproduce macro-level crime trends and provide insights that align with real-world observations. Micro-level simulations reveal that well-functioning, transparent, and adaptive legal systems better protect vulnerable individuals’ rights.
Conclusion: LLM-based legal society simulation is feasible and valuable, capable of reproducing real-world crime trends while providing insights into legal system effectiveness and protection of vulnerable populations.
Abstract: Since real-world legal experiments are often costly or infeasible, simulating legal societies with Artificial Intelligence (AI) systems provides an effective alternative for verifying and developing legal theory, as well as supporting legal administration. Large Language Models (LLMs), with their world knowledge and role-playing capabilities, are strong candidates to serve as the foundation for legal society simulation. However, the application of LLMs to simulate legal systems remains underexplored. In this work, we introduce Law in Silico, an LLM-based agent framework for simulating legal scenarios with individual decision-making and institutional mechanisms of legislation, adjudication, and enforcement. Our experiments, which compare simulated crime rates with real-world data, demonstrate that LLM-based agents can largely reproduce macro-level crime trends and provide insights that align with real-world observations. At the same time, micro-level simulations reveal that a well-functioning, transparent, and adaptive legal system offers better protection of the rights of vulnerable individuals.
[270] ReCAP: Recursive Context-Aware Reasoning and Planning for Large Language Model Agents
Zhenyu Zhang, Tianyi Chen, Weiran Xu, Alex Pentland, Jiaxin Pei
Main category: cs.AI
TL;DR: ReCAP is a hierarchical framework that improves LLM performance on long-horizon tasks through recursive context-aware reasoning and planning, achieving significant gains in subgoal alignment and success rates.
Details
Motivation: Current sequential prompting methods suffer from context drift and goal information loss, while hierarchical methods weaken cross-level continuity or have high runtime overhead.Method: ReCAP combines three mechanisms: plan-ahead decomposition, structured re-injection of parent plans, and memory-efficient execution to maintain consistent multi-level context during recursive reasoning.
Result: ReCAP achieved 32% gain on synchronous Robotouille and 29% improvement on asynchronous Robotouille under strict pass@1 protocol, substantially improving subgoal alignment and success rates.
Conclusion: The framework effectively aligns high-level goals with low-level actions, reduces redundant prompting, and preserves coherent context updates across recursion for better long-horizon reasoning.
Abstract: Long-horizon tasks requiring multi-step reasoning and dynamic re-planning remain challenging for large language models (LLMs). Sequential prompting methods are prone to context drift, loss of goal information, and recurrent failure cycles, while hierarchical prompting methods often weaken cross-level continuity or incur substantial runtime overhead. We introduce ReCAP (Recursive Context-Aware Reasoning and Planning), a hierarchical framework with shared context for reasoning and planning in LLMs. ReCAP combines three key mechanisms: (i) plan-ahead decomposition, in which the model generates a full subtask list, executes the first item, and refines the remainder; (ii) structured re-injection of parent plans, maintaining consistent multi-level context during recursive return; and (iii) memory-efficient execution, bounding the active prompt so costs scale linearly with task depth. Together these mechanisms align high-level goals with low-level actions, reduce redundant prompting, and preserve coherent context updates across recursion. Experiments demonstrate that ReCAP substantially improves subgoal alignment and success rates on various long-horizon reasoning benchmarks, achieving a 32% gain on synchronous Robotouille and a 29% improvement on asynchronous Robotouille under the strict pass@1 protocol.
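The recursion pattern behind plan-ahead decomposition with parent-plan re-injection can be sketched in a few lines. The stub functions, prompt format, and depth bound are invented for illustration, and the "refine the remainder after each step" behaviour described above is omitted for brevity.

```python
def recap_style_solve(plan_fn, execute_fn, task, parents=(), depth=0, max_depth=2):
    """plan_fn(task, parents) returns a list of subtasks ([] if the task is atomic),
    with `parents` standing in for the re-injected chain of parent plans;
    execute_fn(task) performs an atomic task. Both are stand-ins for LLM calls."""
    subtasks = [] if depth >= max_depth else plan_fn(task, parents)
    if not subtasks:
        return [execute_fn(task)]
    trace = []
    for sub in subtasks:                      # execute subtasks in planned order
        trace += recap_style_solve(plan_fn, execute_fn, sub,
                                   parents + (task,), depth + 1, max_depth)
    return trace

# toy stubs: split a task at ';' until atomic, then "execute" it
plan = lambda task, parents: [t.strip() for t in task.split(";")] if ";" in task else []
run = lambda task: f"done: {task}"
print(recap_style_solve(plan, run, "open site; log in; download report"))
```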
[271] Affordance Representation and Recognition for Autonomous Agents
Habtom Kahsay Gidey, Niklas Huber, Alexander Lenz, Alois Knoll
Main category: cs.AI
TL;DR: A pattern language for world modeling from structured data, featuring DOM Transduction for web page simplification and Hypermedia Affordances Recognition for dynamic service discovery.
Details
Motivation: Software agents need actionable world models from structured data, but face challenges with verbose HTML complexity and static API integrations that prevent adaptation to evolving services.Method: Two architectural patterns: DOM Transduction Pattern distills verbose DOM into compact task-relevant representations, and Hypermedia Affordances Recognition Pattern enables dynamic discovery and integration of unknown web services at runtime.
Result: Provides a robust framework for agents to efficiently construct and maintain accurate world models from structured data sources.
Conclusion: These patterns enable scalable, adaptive, and interoperable automation across the web and its extended resources by addressing both web page complexity and dynamic service integration challenges.
Abstract: The autonomy of software agents is fundamentally dependent on their ability to construct an actionable internal world model from the structured data that defines their digital environment, such as the Document Object Model (DOM) of web pages and the semantic descriptions of web services. However, constructing this world model from raw structured data presents two critical challenges: the verbosity of raw HTML makes it computationally intractable for direct use by foundation models, while the static nature of hardcoded API integrations prevents agents from adapting to evolving services. This paper introduces a pattern language for world modeling from structured data, presenting two complementary architectural patterns. The DOM Transduction Pattern addresses the challenge of web page complexity by distilling a verbose, raw DOM into a compact, task-relevant representation or world model optimized for an agent’s reasoning core. Concurrently, the Hypermedia Affordances Recognition Pattern enables the agent to dynamically enrich its world model by parsing standardized semantic descriptions to discover and integrate the capabilities of unknown web services at runtime. Together, these patterns provide a robust framework for engineering agents that can efficiently construct and maintain an accurate world model, enabling scalable, adaptive, and interoperable automation across the web and its extended resources.
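As a toy version of the DOM Transduction idea, the snippet below keeps only actionable elements of a page and a few of their attributes, discarding the rest; real transduction would preserve structure, visibility, and labels, so treat this purely as an illustration.

```python
from html.parser import HTMLParser

class DomTransducer(HTMLParser):
    """Keep only elements an agent can act on, with a handful of attributes."""
    KEEP = {"a", "button", "input", "select", "textarea"}

    def __init__(self):
        super().__init__()
        self.compact = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            kept = {k: v for k, v in attrs if k in ("href", "name", "type", "id")}
            self.compact.append((tag, kept))

page = ('<html><div class="noise"><a href="/cart">Cart</a><script>x()</script>'
        '<input type="text" name="q"></div></html>')
t = DomTransducer()
t.feed(page)
print(t.compact)  # [('a', {'href': '/cart'}), ('input', {'type': 'text', 'name': 'q'})]
```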
[272] Decentralized Multi-Agent Goal Assignment for Path Planning using Large Language Models
Murad Ismayilov, Edwin Meriaux, Shuo Wen, Gregory Dudek
Main category: cs.AI
TL;DR: LLM-based agents achieve near-optimal performance in decentralized multi-agent goal assignment using structured environment information and deterministic conflict resolution.
Details
Motivation: Addressing the challenge of coordinating multiple autonomous agents in shared environments under decentralized conditions without negotiation or iterative coordination.Method: Agents independently generate ranked goal preferences using structured environment representations (grid visualizations, scenario data), then exchange rankings and use deterministic conflict-resolution rules (e.g., agent index ordering) for assignment.
Result: LLM-based agents with well-designed prompts and quantitative information achieve near-optimal makespans and consistently outperform traditional heuristics like greedy approaches and optimal assignment methods.
Conclusion: Language models show strong potential for decentralized goal assignment in multi-agent path planning, with information structure being a critical factor for performance.
Abstract: Coordinating multiple autonomous agents in shared environments under decentralized conditions is a long-standing challenge in robotics and artificial intelligence. This work addresses the problem of decentralized goal assignment for multi-agent path planning, where agents independently generate ranked preferences over goals based on structured representations of the environment, including grid visualizations and scenario data. After this reasoning phase, agents exchange their goal rankings, and assignments are determined by a fixed, deterministic conflict-resolution rule (e.g., agent index ordering), without negotiation or iterative coordination. We systematically compare greedy heuristics, optimal assignment, and large language model (LLM)-based agents in fully observable grid-world settings. Our results show that LLM-based agents, when provided with well-designed prompts and relevant quantitative information, can achieve near-optimal makespans and consistently outperform traditional heuristics. These findings underscore the potential of language models for decentralized goal assignment in multi-agent path planning and highlight the importance of information structure in such systems.
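The deterministic conflict-resolution step described above is simple enough to show directly: after the LLM reasoning phase produces each agent's ranked goal list, agents are processed in index order and each takes its highest-ranked goal that is still free. A minimal version:

```python
def assign_goals(rankings):
    """rankings[i] is agent i's ranked list of goals (best first)."""
    taken, assignment = set(), {}
    for agent_idx, ranked_goals in enumerate(rankings):
        for goal in ranked_goals:
            if goal not in taken:          # first free goal wins; no negotiation
                assignment[agent_idx] = goal
                taken.add(goal)
                break
    return assignment

print(assign_goals([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]))
# {0: 'A', 1: 'C', 2: 'B'}
```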
[273] From Benchmarks to Business Impact: Deploying IBM Generalist Agent in Enterprise Production
Segev Shlomov, Alon Oved, Sami Marreed, Ido Levy, Offer Akrabi, Avi Yaeli, Łukasz Strąk, Elizabeth Koumpan, Yinon Goldshtein, Eilam Shapira, Nir Mashkif, Asaf Adi
Main category: cs.AI
TL;DR: IBM developed CUGA, a generalist agent with hierarchical planner-executor architecture that achieves SOTA performance on benchmarks and shows promise in enterprise settings, particularly for business process outsourcing talent acquisition.
Details
Motivation: To address the gap between academic agent prototypes and production enterprise systems that deliver measurable business value, overcoming challenges of fragmented frameworks, slow development, and lack of standardized evaluation.Method: CUGA uses a hierarchical planner-executor architecture with strong analytical foundations, evaluated on AppWorld, WebArena, and a new BPO-TA benchmark with 26 tasks across 13 analytics endpoints in business process outsourcing talent acquisition.
Result: CUGA achieved state-of-the-art performance on academic benchmarks and approached specialized agent accuracy in enterprise pilot while showing potential for reducing development time and cost.
Conclusion: Generalist agents like CUGA can operate at enterprise scale, but require addressing requirements for scalability, auditability, safety, and governance to advance from research-grade to enterprise-ready systems.
Abstract: Agents are rapidly advancing in automating digital work, but enterprises face a harder challenge: moving beyond prototypes to deployed systems that deliver measurable business value. This path is complicated by fragmented frameworks, slow development, and the absence of standardized evaluation practices. Generalist agents have emerged as a promising direction, excelling on academic benchmarks and offering flexibility across task types, applications, and modalities. Yet, evidence of their use in production enterprise settings remains limited. This paper reports IBM’s experience developing and piloting the Computer Using Generalist Agent (CUGA), which has been open-sourced for the community (https://github.com/cuga-project/cuga-agent). CUGA adopts a hierarchical planner–executor architecture with strong analytical foundations, achieving state-of-the-art performance on AppWorld and WebArena. Beyond benchmarks, it was evaluated in a pilot within the Business-Process-Outsourcing talent acquisition domain, addressing enterprise requirements for scalability, auditability, safety, and governance. To support assessment, we introduce BPO-TA, a 26-task benchmark spanning 13 analytics endpoints. In preliminary evaluations, CUGA approached the accuracy of specialized agents while indicating potential for reducing development time and cost. Our contribution is twofold: presenting early evidence of generalist agents operating at enterprise scale, and distilling technical and organizational lessons from this initial pilot. We outline requirements and next steps for advancing research-grade architectures like CUGA into robust, enterprise-ready systems.
[274] Generating Creative Chess Puzzles
Xidong Feng, Vivek Veeriah, Marcus Chiam, Michael Dennis, Ryan Pachauri, Thomas Tumiel, Federico Barbero, Johan Obando-Ceron, Jiaxin Shi, Satinder Singh, Shaobo Hou, Nenad Tomašev, Tom Zahavy
Main category: cs.AI
TL;DR: This paper presents an RL framework with novel rewards to generate creative chess puzzles that are unique, counter-intuitive, diverse, and realistic, achieving 10x improvement in counter-intuitive puzzle generation.
Details
Motivation: To address the challenge of generating truly creative, aesthetic, and counter-intuitive outputs in Generative AI, particularly in the domain of chess puzzles where current methods fall short.Method: Benchmarking Generative AI architectures followed by introducing an RL framework with novel rewards based on chess engine search statistics to enhance puzzle uniqueness, counter-intuitiveness, diversity, and realism.
Result: RL approach increased counter-intuitive puzzle generation from 0.22% (supervised) to 2.5%, surpassing existing dataset rates (2.1%) and best Lichess-trained model (0.4%). Puzzles met novelty/diversity benchmarks, retained aesthetic themes, and were rated by human experts as more creative, enjoyable, and counter-intuitive than composed book puzzles.
Conclusion: The framework successfully generates high-quality creative chess puzzles, producing a curated booklet acknowledged for creativity by three world-renowned experts, demonstrating AI’s potential in creative puzzle composition.
Abstract: While Generative AI rapidly advances in various domains, generating truly creative, aesthetic, and counter-intuitive outputs remains a challenge. This paper presents an approach to tackle these difficulties in the domain of chess puzzles. We start by benchmarking Generative AI architectures, and then introduce an RL framework with novel rewards based on chess engine search statistics to overcome some of those shortcomings. The rewards are designed to enhance a puzzle’s uniqueness, counter-intuitiveness, diversity, and realism. Our RL approach dramatically increases counter-intuitive puzzle generation by 10x, from 0.22% (supervised) to 2.5%, surpassing existing dataset rates (2.1%) and the best Lichess-trained model (0.4%). Our puzzles meet novelty and diversity benchmarks, retain aesthetic themes, and are rated by human experts as more creative, enjoyable, and counter-intuitive than composed book puzzles, even approaching classic compositions. Our final outcome is a curated booklet of these AI-generated puzzles, which is acknowledged for creativity by three world-renowned experts.
[275] Hybrid Modeling, Sim-to-Real Reinforcement Learning, and Large Language Model Driven Control for Digital Twins
Adil Rasheed, Oscar Ravik, Omer San
Main category: cs.AI
TL;DR: Digital twins for dynamical system modeling and control are explored using physics-based, data-driven, and hybrid approaches with traditional and AI controllers on a miniature greenhouse test platform.
Details
Motivation: To investigate the integration of digital twins with various modeling approaches (physics-based, data-driven, hybrid) and control strategies (traditional, AI-driven) for dynamical systems.Method: Developed and compared four predictive models (Linear, PBM, LSTM, HAM) under interpolation and extrapolation scenarios, and implemented three control strategies (MPC, RL, LLM-based control) on a miniature greenhouse test platform.
Result: HAM provided the most balanced performance across accuracy, generalization, and computational efficiency in modeling, while LSTM achieved high precision at greater resource cost. MPC delivered robust performance, RL showed strong adaptability, and LLM-based controllers offered flexible human-AI interaction.
Conclusion: Hybrid Analysis and Modeling (HAM) offers the best balance for digital twin modeling, while different control strategies (MPC, RL, LLM) provide complementary strengths in precision, adaptability, and human-AI interaction capabilities.
Abstract: This work investigates the use of digital twins for dynamical system modeling and control, integrating physics-based, data-driven, and hybrid approaches with both traditional and AI-driven controllers. Using a miniature greenhouse as a test platform, four predictive models - Linear, Physics-Based Modeling (PBM), Long Short-Term Memory (LSTM), and Hybrid Analysis and Modeling (HAM) - are developed and compared under interpolation and extrapolation scenarios. Three control strategies - Model Predictive Control (MPC), Reinforcement Learning (RL), and Large Language Model (LLM) based control - are also implemented to assess trade-offs in precision, adaptability, and implementation effort. Results show that, in modeling, HAM provides the most balanced performance across accuracy, generalization, and computational efficiency, while LSTM achieves high precision at greater resource cost. Among controllers, MPC delivers robust and predictable performance, RL demonstrates strong adaptability, and LLM-based controllers offer flexible human-AI interaction when coupled with predictive tools.
[276] Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges
Shrestha Datta, Shahriar Kabir Nahin, Anshuman Chhabra, Prasant Mohapatra
Main category: cs.AI
TL;DR: This survey analyzes security risks specific to agentic AI systems powered by LLMs, covering threats, benchmarks, evaluations, and defense strategies from technical and governance perspectives.
Details
Motivation: Agentic AI systems with planning, tool use, memory, and autonomy create new and amplified security risks distinct from traditional AI safety and software security, requiring specialized analysis.Method: The paper presents a taxonomy of agentic AI threats, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies through a comprehensive survey approach.
Result: The survey synthesizes current research on agentic AI security, identifying specific threats and proposing evaluation frameworks and defense mechanisms for secure-by-design systems.
Conclusion: The paper highlights open challenges in agentic AI security and aims to support the development of secure-by-design agent systems through comprehensive threat analysis and defense strategies.
Abstract: Agentic AI systems powered by large language models (LLMs) and endowed with planning, tool use, memory, and autonomy, are emerging as powerful, flexible platforms for automation. Their ability to autonomously execute tasks across web, software, and physical environments creates new and amplified security risks, distinct from both traditional AI safety and conventional software security. This survey outlines a taxonomy of threats specific to agentic AI, reviews recent benchmarks and evaluation methodologies, and discusses defense strategies from both technical and governance perspectives. We synthesize current research and highlight open challenges, aiming to support the development of secure-by-design agent systems.
[277] Latent Chain-of-Thought for Visual Reasoning
Guohao Sun, Hang Hua, Jian Wang, Jiebo Luo, Sohail Dianat, Majid Rabbani, Raghuveer Rao, Zhiqiang Tao
Main category: cs.AI
TL;DR: Proposes a variational inference-based training algorithm for LVLMs that treats reasoning as posterior inference, using diversity-seeking RL and Bayesian scaling to improve CoT reasoning across benchmarks.
Details
Motivation: Existing training methods (SFT, PPO, GRPO) for chain-of-thought reasoning in LVLMs don't generalize well to unseen tasks and rely on biased reward models, limiting interpretability and reliability.Method: Reformulates reasoning as posterior inference using amortized variational inference, implements diversity-seeking RL with sparse token-level rewards, and uses Bayesian inference-scaling with marginal likelihood instead of Best-of-N/Beam Search.
Result: Empirically improves state-of-the-art LVLMs on seven reasoning benchmarks, enhancing effectiveness, generalization, and interpretability.
Conclusion: The proposed variational inference framework successfully addresses limitations of existing training methods by enabling diverse, high-quality reasoning chains without reward hacking, while being computationally efficient.
Abstract: Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
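The inference-scaling step can be pictured as ranking candidate answers by an approximate marginal likelihood over sampled latent chains-of-thought rather than by Best-of-N on a reward model. The sketch below assumes each sample carries the chain's log-probability and the answer's log-probability given the chain; that factorization and the simple Monte-Carlo aggregation are illustrative assumptions.

```python
import math
from collections import defaultdict

def rank_answers_by_marginal(samples):
    """samples: iterable of (answer, logp_chain, logp_answer_given_chain) for one
    question. Aggregate chains per answer with a log-sum-exp (a shared -log N
    constant is dropped since it does not change the ranking)."""
    per_answer = defaultdict(list)
    for answer, logp_chain, logp_answer in samples:
        per_answer[answer].append(logp_chain + logp_answer)
    scores = {}
    for answer, ls in per_answer.items():
        m = max(ls)
        scores[answer] = m + math.log(sum(math.exp(l - m) for l in ls))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

samples = [("42", -3.0, -0.2), ("42", -2.5, -0.3), ("17", -1.0, -2.5)]
print(rank_answers_by_marginal(samples)[0][0])  # "42"
```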
[278] Decentralized Causal Discovery using Judo Calculus
Sridhar Mahadevan
Main category: cs.AI
TL;DR: Judo calculus is an intuitionistic decentralized framework for causal discovery that formalizes context dependence using sheaf theory and j-stable causal inference.
Details
Motivation: Causal effects in real-world applications depend on contextual regimes (age, country, dose, etc.), requiring a framework that handles this local truth rather than global causal claims.Method: Uses judo calculus with j-do-calculus in a topos of sheaves, combining Lawvere-Tierney modal operator j for regime selection with standard causal discovery methods (score-based, constraint-based, gradient-based).
Result: Experimental results show computational efficiency gains from decentralized sheaf-theoretic approach and improved performance over classical causal discovery methods across synthetic and real-world datasets.
Conclusion: Judo calculus provides a formal framework for context-dependent causal discovery that is both computationally efficient and more accurate than traditional methods.
Abstract: We describe a theory and implementation of an intuitionistic decentralized framework for causal discovery using judo calculus, which is formally defined as j-stable causal inference using j-do-calculus in a topos of sheaves. In real-world applications – from biology to medicine and social science – causal effects depend on regime (age, country, dose, genotype, or lab protocol). Our proposed judo calculus formalizes this context dependence formally as local truth: a causal claim is proven true on a cover of regimes, not everywhere at once. The Lawvere-Tierney modal operator j chooses which regimes are relevant; j-stability means the claim holds constructively and consistently across that family. We describe an algorithmic and implementation framework for judo calculus, combining it with standard score-based, constraint-based, and gradient-based causal discovery methods. We describe experimental results on a range of domains, from synthetic to real-world datasets from biology and economics. Our experimental results show the computational efficiency gained by the decentralized nature of sheaf-theoretic causal discovery, as well as improved performance over classical causal discovery methods.
[279] The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity
Aymane El Gadarri, Ali Aouad, Vivek F. Farias
Main category: cs.AI
TL;DR: The paper proposes a ‘sign estimator’ method that replaces cross-entropy loss with binary classification loss in LLM alignment, providing consistent ordinal alignment and reducing preference distortion compared to standard RLHF.
Details
Motivation: Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences and yield inconsistent estimates of population-average utility when fitting probabilistic models to pairwise comparison data.Method: The sign estimator replaces cross-entropy with binary classification loss in the aggregation step, providing a simple, provably consistent, and efficient estimator for LLM alignment.
Result: In simulations using digital twins, the sign estimator reduced preference distortion by cutting angular estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF.
Conclusion: The sign estimator achieves consistent ordinal alignment under mild assumptions, provides polynomial finite-sample error bounds, and compares favorably to panel data heuristics while maintaining implementation simplicity of existing LLM alignment pipelines.
Abstract: Traditional LLM alignment methods are vulnerable to heterogeneity in human preferences. Fitting a naïve probabilistic model to pairwise comparison data (say over prompt-completion pairs) yields an inconsistent estimate of the population-average utility - a canonical measure of social welfare. We propose a new method, dubbed the sign estimator, that provides a simple, provably consistent, and efficient estimator by replacing cross-entropy with binary classification loss in the aggregation step. This simple modification recovers consistent ordinal alignment under mild assumptions and achieves the first polynomial finite-sample error bounds in this setting. In realistic simulations of LLM alignment using digital twins, the sign estimator substantially reduces preference distortion over a panel of simulated personas, cutting (angular) estimation error by nearly 35% and decreasing disagreement with true population preferences from 12% to 8% compared to standard RLHF. Our method also compares favorably to panel data heuristics that explicitly model user heterogeneity and require tracking individual-level preference data, all while maintaining the implementation simplicity of existing LLM alignment pipelines.
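To illustrate only the loss swap at the heart of the method (not the heterogeneity setting where the two estimators actually diverge), the toy below fits a linear utility to pairwise comparisons twice: once with the usual Bradley-Terry cross-entropy and once with a binary classification surrogate on the winner's sign. The hinge surrogate, the feature setup, and plain gradient descent are assumptions for illustration, not the paper's estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 400
theta_true = rng.normal(size=d)
x_a, x_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))   # features of the two completions
# y = +1 if completion A is preferred, -1 otherwise (Bradley-Terry generative model)
y = (rng.random(n) < 1 / (1 + np.exp(-(x_a - x_b) @ theta_true))).astype(float) * 2 - 1

def fit(loss_grad, lr=0.1, steps=500):
    theta = np.zeros(d)
    for _ in range(steps):
        theta -= lr * loss_grad(theta)
    return theta

diff = x_a - x_b
# standard aggregation: cross-entropy under the Bradley-Terry model
bt_grad = lambda th: -(diff * (y * (1 - 1 / (1 + np.exp(-y * (diff @ th))))).reshape(-1, 1)).mean(0)
# "sign"-flavoured aggregation: treat the winner as a binary label and minimize
# a classification surrogate (hinge here -- the surrogate choice is an assumption)
hinge_grad = lambda th: -(diff * (y * ((y * (diff @ th)) < 1)).reshape(-1, 1)).mean(0)

for name, grad in [("cross-entropy", bt_grad), ("sign/classification", hinge_grad)]:
    th = fit(grad)
    cos = th @ theta_true / (np.linalg.norm(th) * np.linalg.norm(theta_true))
    print(name, "angular agreement with the true utility:", round(float(cos), 3))
```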
[280] Learning Individual Movement Shifts After Urban Disruptions with Social Infrastructure Reliance
Shangde Gao, Zelin Xu, Zhe Jiang
Main category: cs.AI
TL;DR: This paper proposes a conditioned deep learning model that incorporates individuals’ social infrastructure resilience (SIR) and spatial context to predict post-disruption movement pattern shifts using sparse individual-level data.
Details
Motivation: Predicting individual movement pattern shifts after disruptive events is challenging due to: lack of measures for heterogeneous social infrastructure resilience (SIR), insufficient capture of complex interactions between movement patterns and spatial contexts, and spatial sparsity of individual-level movement data that doesn't suit traditional prediction methods.Method: The study develops a conditioned deep learning model that incorporates individuals’ SIR to capture complex relationships between movement patterns and local spatial context using large-scale, sparse individual-level data.
Result: Experiments show that incorporating SIR and spatial context enhances the model’s ability to predict post-event individual movement patterns. The model can capture divergent shifts in movement patterns among individuals with similar pre-event patterns but different SIR levels.
Conclusion: The conditioned deep learning approach successfully addresses the challenges of predicting individual movement shifts after disruptive events by integrating social infrastructure resilience and spatial context into the modeling framework.
Abstract: Shifts in individual movement patterns following disruptive events can reveal changing demands for community resources. However, predicting such shifts before disruptive events remains challenging for several reasons. First, measures are lacking for individuals’ heterogeneous social infrastructure resilience (SIR), which directly influences their movement patterns, and commonly used features are often limited or unavailable at scale, e.g., sociodemographic characteristics. Second, the complex interactions between individual movement patterns and spatial contexts have not been sufficiently captured. Third, individual-level movement may be spatially sparse and not well-suited to traditional decision-making methods for movement predictions. This study incorporates individuals’ SIR into a conditioned deep learning model to capture the complex relationships between individual movement patterns and local spatial context using large-scale, sparse individual-level data. Our experiments demonstrate that incorporating individuals’ SIR and spatial context can enhance the model’s ability to predict post-event individual movement patterns. The conditioned model can capture the divergent shifts in movement patterns among individuals who exhibit similar pre-event patterns but differ in SIR.
[281] Discovering Heuristics with Large Language Models (LLMs) for Mixed-Integer Programs: Single-Machine Scheduling
İbrahim Oğuz Çetinkaya, İ. Esra Büyüktahtakın, Parshin Shojaee, Chandan K. Reddy
Main category: cs.AI
TL;DR: LLM-discovered heuristics EDDC and MDDC outperform traditional methods for single-machine total tardiness scheduling, showing human-LLM collaboration can create scalable solutions for NP-hard problems.
Details
Motivation: To leverage LLMs for discovering novel heuristics in combinatorial optimization, specifically for the NP-hard single-machine total tardiness problem where exact methods become intractable for large instances.Method: Developed two LLM-discovered heuristics (EDDC and MDDC) inspired by EDD and MDD rules, benchmarked using mixed-integer programming formulation with optimality gaps and solution time metrics across various job sizes (20-500 jobs).
Result: EDDC improved upon classic EDD rule and other algorithms up to 500 jobs. MDDC consistently outperformed traditional heuristics and remained competitive with exact approaches, especially on larger instances.
Conclusion: Human-LLM collaboration can produce scalable, high-performing heuristics for NP-hard constrained combinatorial optimization problems even with limited resources when properly configured.
Abstract: Our study contributes to the scheduling and combinatorial optimization literature with new heuristics discovered by leveraging the power of Large Language Models (LLMs). We focus on the single-machine total tardiness (SMTT) problem, which aims to minimize total tardiness by sequencing n jobs on a single processor without preemption, given processing times and due dates. We develop and benchmark two novel LLM-discovered heuristics, the EDD Challenger (EDDC) and MDD Challenger (MDDC), inspired by the well-known Earliest Due Date (EDD) and Modified Due Date (MDD) rules. In contrast to prior studies that employed simpler rule-based heuristics, we evaluate our LLM-discovered algorithms using rigorous criteria, including optimality gaps and solution time derived from a mixed-integer programming (MIP) formulation of SMTT. We compare their performance against state-of-the-art heuristics and exact methods across various job sizes (20, 100, 200, and 500 jobs). For instances with more than 100 jobs, exact methods such as MIP and dynamic programming become computationally intractable. Up to 500 jobs, EDDC improves upon the classic EDD rule and another widely used algorithm in the literature. MDDC consistently outperforms traditional heuristics and remains competitive with exact approaches, particularly on larger and more complex instances. This study shows that human-LLM collaboration can produce scalable, high-performing heuristics for NP-hard constrained combinatorial optimization, even under limited resources when effectively configured.
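For readers unfamiliar with the two classical dispatching rules that inspired the LLM-discovered heuristics, here are textbook versions of EDD and MDD together with the total-tardiness objective; EDDC and MDDC themselves are not reproduced here.

```python
def total_tardiness(sequence, proc, due):
    """Total tardiness of a job sequence on a single machine (no preemption)."""
    t, tard = 0, 0
    for j in sequence:
        t += proc[j]
        tard += max(0, t - due[j])
    return tard

def edd(proc, due):
    """Earliest Due Date: sort jobs by due date."""
    return sorted(range(len(proc)), key=lambda j: due[j])

def mdd(proc, due):
    """Modified Due Date: repeatedly pick the unscheduled job minimizing
    max(completion time if started now, due date)."""
    remaining, t, seq = set(range(len(proc))), 0, []
    while remaining:
        j = min(remaining, key=lambda k: max(t + proc[k], due[k]))
        seq.append(j); remaining.remove(j); t += proc[j]
    return seq

proc = [4, 2, 6, 3]
due = [5, 3, 14, 7]
for name, rule in [("EDD", edd), ("MDD", mdd)]:
    s = rule(proc, due)
    print(name, s, "tardiness =", total_tardiness(s, proc, due))
```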
[282] OneCast: Structured Decomposition and Modular Generation for Cross-Domain Time Series Forecasting
Tingyue Pan, Mingyue Cheng, Shilong Zhang, Zhiding Liu, Xiaoyu Tao, Yucong Luo, Jintao Zhang, Qi Liu
Main category: cs.AI
TL;DR: OneCast is a cross-domain time series forecasting framework that decomposes series into seasonal and trend components, modeling them separately with specialized modules for better generalization across domains.
Details
Motivation: Existing methods struggle with domain-specific trend shifts and inconsistent periodic patterns when forecasting across heterogeneous time series data, due to treating temporal series as undifferentiated sequences without explicit structural decomposition.Method: Proposes OneCast framework that: 1) Decomposes time series into seasonal and trend components; 2) Models seasonal patterns via lightweight projection with interpretable basis functions; 3) Encodes trend into discrete tokens using semantic-aware tokenizer and infers through masked discrete diffusion mechanism; 4) Combines outputs from both branches for final forecast.
Result: Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines in cross-domain time series forecasting.
Conclusion: Explicitly decoupling time series into structural components (seasonal and trend) with tailored modeling approaches enables more effective generalization across heterogeneous domains compared to treating time series as undifferentiated sequences.
Abstract: Cross-domain time series forecasting is a valuable task in various web applications. Despite its rapid advancement, achieving effective generalization across heterogeneous time series data remains a significant challenge. Existing methods have made progress by extending single-domain models, yet often fall short when facing domain-specific trend shifts and inconsistent periodic patterns. We argue that a key limitation lies in treating temporal series as undifferentiated sequences, without explicitly decoupling their inherent structural components. To address this, we propose OneCast, a structured and modular forecasting framework that decomposes time series into seasonal and trend components, each modeled through tailored generative pathways. Specifically, the seasonal component is captured by a lightweight projection module that reconstructs periodic patterns via interpretable basis functions. In parallel, the trend component is encoded into discrete tokens at the segment level via a semantic-aware tokenizer, and subsequently inferred through a masked discrete diffusion mechanism. The outputs from both branches are combined to produce a final forecast that captures seasonal patterns while tracking domain-specific trends. Extensive experiments across eight domains demonstrate that OneCast mostly outperforms state-of-the-art baselines.
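As a rough sketch of the decomposition idea, the snippet below projects a series onto an interpretable Fourier basis to obtain the seasonal part and treats the residual as the trend component. The known period and the basis choice are assumptions; OneCast's semantic-aware tokenizer and masked discrete diffusion for the trend branch are not reproduced.

```python
import numpy as np

def seasonal_trend_split(y, period, n_harmonics=3):
    """Least-squares projection onto sine/cosine basis functions of a known period."""
    t = np.arange(len(y))
    cols = [np.ones_like(t, dtype=float)]
    for k in range(1, n_harmonics + 1):
        cols += [np.sin(2 * np.pi * k * t / period), np.cos(2 * np.pi * k * t / period)]
    B = np.stack(cols, axis=1)                    # basis matrix
    coef, *_ = np.linalg.lstsq(B, y, rcond=None)  # lightweight projection
    seasonal = B @ coef
    return seasonal, y - seasonal                 # (seasonal part, trend-like residual)

y = np.sin(2 * np.pi * np.arange(200) / 24) + 0.01 * np.arange(200)  # toy series
seasonal, trend = seasonal_trend_split(y, period=24)
```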
[283] LLMLogAnalyzer: A Clustering-Based Log Analysis Chatbot using Large Language Models
Peng Cai, Reza Ryan, Nickson M. Karie
Main category: cs.AI
TL;DR: LLMLogAnalyzer is a clustering-based log analysis chatbot that combines LLMs and ML algorithms to simplify cybersecurity log analysis, overcoming LLM limitations and achieving significant performance improvements over existing solutions.
Details
Motivation: System log analysis is crucial for cybersecurity but remains challenging due to high costs, lack of expertise, and time constraints. Current LLM-based solutions are limited by context window constraints and poor handling of structured text.Method: Uses a modular architecture with router, log recognizer, log parser, and search tools. Combines LLMs with ML algorithms for clustering-based log analysis to overcome LLM limitations and improve structured text handling.
Result: Achieved 39% to 68% performance improvements over state-of-the-art LLM chatbots (ChatGPT, ChatPDF, NotebookLM) across different tasks. Showed strong robustness with 93% reduction in interquartile range using ROUGE-1 scores, indicating lower result variability.
Conclusion: LLMLogAnalyzer effectively enhances LLM capabilities for structured text analysis, improving accuracy and robustness in log analysis tasks, making it valuable for both cybersecurity experts and non-technical users.
Abstract: System logs are a cornerstone of cybersecurity, supporting proactive breach prevention and post-incident investigations. However, analyzing vast amounts of diverse log data remains significantly challenging, as high costs, lack of in-house expertise, and time constraints make even basic analysis difficult for many organizations. This study introduces LLMLogAnalyzer, a clustering-based log analysis chatbot that leverages Large Language Models (LLMs) and Machine Learning (ML) algorithms to simplify and streamline log analysis processes. This innovative approach addresses key LLM limitations, including context window constraints and poor structured text handling capabilities, enabling more effective summarization, pattern extraction, and anomaly detection tasks. LLMLogAnalyzer is evaluated across four distinct domain logs and various tasks. Results demonstrate significant performance improvements over state-of-the-art LLM-based chatbots, including ChatGPT, ChatPDF, and NotebookLM, with consistent gains ranging from 39% to 68% across different tasks. The system also exhibits strong robustness, achieving a 93% reduction in interquartile range (IQR) when using ROUGE-1 scores, indicating significantly lower result variability. The framework’s effectiveness stems from its modular architecture comprising a router, log recognizer, log parser, and search tools. This design enhances LLM capabilities for structured text analysis while improving accuracy and robustness, making it a valuable resource for both cybersecurity experts and non-technical users.
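One way to picture the clustering stage is to collapse raw log lines into templates before any LLM sees them, which keeps prompts small. The regex masking below is an illustrative assumption, not LLMLogAnalyzer's actual router, recognizer, or parser.

```python
import re
from collections import defaultdict

def template(line):
    """Mask numbers and hex identifiers so structurally identical lines collapse."""
    return re.sub(r"0x[0-9a-fA-F]+|\d+", "<*>", line).strip()

logs = [
    "Accepted password for user42 from 10.0.0.7 port 5221",
    "Accepted password for user91 from 10.0.0.9 port 6001",
    "Failed password for root from 10.0.0.7 port 5222",
]
clusters = defaultdict(list)
for line in logs:
    clusters[template(line)].append(line)

for tpl, members in clusters.items():
    print(f"{len(members):3d}x  {tpl}")  # compact summary a chatbot can reason over
```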
[284] Modeling Electric Vehicle Car-Following Behavior: Classical vs Machine Learning Approach
Md. Shihab Uddin, Md Nazmus Shakib, Rahul Bhadani
Main category: cs.AI
TL;DR: This study compares classical physics-based car following models with a machine learning approach (Random Forest) for electric vehicle behavior, finding that the Random Forest model significantly outperforms all classical models in predicting acceleration across different gap scenarios.
Details
Motivation: The increasing adoption of electric vehicles requires better understanding of their driving behavior to enhance traffic safety and develop smart driving systems, especially in mixed autonomy traffic environments.Method: Used real-world EV following data to compare classical models (IDM, OVM, OVRV, simplified CACC) calibrated via RMSE minimization against a Random Forest Regressor that predicts acceleration using spacing, speed, and gap type as inputs.
Result: Random Forest achieved superior accuracy with RMSEs of 0.0046 (medium gap), 0.0016 (long gap), and 0.0025 (extra long gap), while the best classical model (CACC) had RMSE of 2.67 for long gaps.
Conclusion: Machine learning models like Random Forest provide valuable tools for simulating EV behavior and analyzing mixed autonomy traffic dynamics, outperforming traditional physics-based approaches across all scenarios.
Abstract: The increasing adoption of electric vehicles (EVs) necessitates an understanding of their driving behavior to enhance traffic safety and develop smart driving systems. This study compares classical and machine learning models for EV car following behavior. Classical models include the Intelligent Driver Model (IDM), Optimum Velocity Model (OVM), Optimal Velocity Relative Velocity (OVRV), and a simplified CACC model, while the machine learning approach employs a Random Forest Regressor. Using a real world dataset of an EV following an internal combustion engine (ICE) vehicle under varied driving conditions, we calibrated classical model parameters by minimizing the RMSE between predictions and real data. The Random Forest model predicts acceleration using spacing, speed, and gap type as inputs. Results demonstrate the Random Forest’s superior accuracy, achieving RMSEs of 0.0046 (medium gap), 0.0016 (long gap), and 0.0025 (extra long gap). Among physics based models, CACC performed best, with an RMSE of 2.67 for long gaps. These findings highlight the machine learning model’s performance across all scenarios. Such models are valuable for simulating EV behavior and analyzing mixed autonomy traffic dynamics in EV integrated environments.
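The data-driven baseline is straightforward to sketch: a Random Forest mapping spacing, speed, and an encoded gap type to follower acceleration, scored by RMSE on held-out samples. The synthetic data below is purely illustrative and does not reflect the paper's EV trajectories or its calibrated IDM/OVM/OVRV/CACC models.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n = 2000
spacing  = rng.uniform(5, 80, n)           # m
speed    = rng.uniform(0, 30, n)           # m/s
gap_type = rng.integers(0, 3, n)           # 0=medium, 1=long, 2=extra long (encoded)
accel    = 0.05 * spacing - 0.08 * speed + rng.normal(0, 0.1, n)  # toy target

X = np.column_stack([spacing, speed, gap_type])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[:1500], accel[:1500])
rmse = mean_squared_error(accel[1500:], model.predict(X[1500:])) ** 0.5
print(f"held-out RMSE: {rmse:.4f}")
```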
[285] HistoLens: An Interactive XAI Toolkit for Verifying and Mitigating Flaws in Vision-Language Models for Histopathology
Sandeep Vissapragada, Vikrant Sahu, Gagan Raj Gupta, Vandita Singh
Main category: cs.AI
TL;DR: HistoLens is a transparent AI system for pathology that allows doctors to ask questions in plain English and get clear explanations with visual proofs, keeping the pathologist in charge while enhancing diagnostic confidence.
Details
Motivation: To build trust in AI for medical diagnosis by creating a transparent system that explains its reasoning like a human colleague, rather than being a black box.Method: Created HistoLens that translates natural language questions into AI queries, provides structured reports with visual heatmaps showing exact cells used for analysis, and filters out background noise to focus on relevant tissue.
Result: A workflow where pathologists remain the experts in charge while using AI to verify insights and make faster, more confident diagnoses.
Conclusion: Transparent AI systems like HistoLens can build doctor trust by providing explainable reasoning and visual proofs, enabling collaborative diagnosis while maintaining human expert oversight.
Abstract: For doctors to truly trust artificial intelligence, it can’t be a black box. They need to understand its reasoning, almost as if they were consulting a colleague. We created HistoLens to be that transparent, collaborative partner. It allows a pathologist to simply ask a question in plain English about a tissue slide–just as they would ask a trainee. Our system intelligently translates this question into a precise query for its AI engine, which then provides a clear, structured report. But it doesn’t stop there. If a doctor ever asks, “Why?”, HistoLens can instantly provide a ‘visual proof’ for any finding–a heatmap that points to the exact cells and regions the AI used for its analysis. We’ve also ensured the AI focuses only on the patient’s tissue, just like a trained pathologist would, by teaching it to ignore distracting background noise. The result is a workflow where the pathologist remains the expert in charge, using a trustworthy AI assistant to verify their insights and make faster, more confident diagnoses.
[286] From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems
Yu Luo, Jiamin Jiang, Jingfei Feng, Lei Tao, Qingliang Zhang, Xidao Wen, Yongqian Sun, Shenglin Zhang, Jielong Huang, Nan Qi, Dan Pei
Main category: cs.AI
TL;DR: OpsAgent is a lightweight, multi-agent system for automated incident management in cloud systems that converts heterogeneous observability data into structured text and uses transparent multi-agent collaboration for diagnostics, with self-evolution capabilities.
Details
Motivation: Manual incident management is labor-intensive and error-prone with massive observability data, while existing automated approaches struggle with generalization, interpretability, and high deployment costs.Method: Uses training-free data processor to convert observability data into structured textual descriptions, multi-agent collaboration framework for transparent diagnostics, and dual self-evolution mechanism combining internal model updates with external experience accumulation.
Result: Achieves state-of-the-art performance on OPENRCA benchmark, demonstrating generalizability, interpretability, cost-efficiency, and self-evolution capabilities.
Conclusion: OpsAgent is a practically deployable and sustainable solution for long-term operation in real-world cloud systems, addressing key limitations of existing automated incident management approaches.
Abstract: Incident management (IM) is central to the reliability of large-scale cloud systems. Yet manual IM, where on-call engineers examine metrics, logs, and traces is labor-intensive and error-prone in the face of massive and heterogeneous observability data. Existing automated IM approaches often struggle to generalize across systems, provide limited interpretability, and incur high deployment costs, which hinders adoption in practice. In this paper, we present OpsAgent, a lightweight, self-evolving multi-agent system for IM that employs a training-free data processor to convert heterogeneous observability data into structured textual descriptions, along with a multi-agent collaboration framework that makes diagnostic inference transparent and auditable. To support continual capability growth, OpsAgent also introduces a dual self-evolution mechanism that integrates internal model updates with external experience accumulation, thereby closing the deployment loop. Comprehensive experiments on the OPENRCA benchmark demonstrate state-of-the-art performance and show that OpsAgent is generalizable, interpretable, cost-efficient, and self-evolving, making it a practically deployable and sustainable solution for long-term operation in real-world cloud systems.
[287] BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data
Bingsen Qiu, Zijian Liu, Xiao Liu, Haoshen Yang, Zeren Gao, Bingjie Wang, Feier Zhang, Yixuan Qin, Chunyan Li
Main category: cs.AI
TL;DR: Automated framework for generating high-difficulty multi-hop QA datasets from semi-structured knowledge, using NLI-based relation typing, reverse question construction, and quality evaluation pipeline.
Details
Motivation: Current multi-hop QA datasets are scarce, mostly designed for evaluation rather than training, and manual curation is costly and doesn't scale, creating a data bottleneck for training retrieval-and-reasoning agents.Method: Three-step approach: (1) grow diverse evidence clusters via NLI-based relation typing and diversity-aware expansion, (2) apply reverse question construction to create oblique cues, (3) enforce quality with multi-model consensus filtering and structured constraint decomposition.
Result: Scalable process that produces complex, retrieval-resistant yet verifiable questions suitable for both SFT/RL training and challenging evaluation, reducing human effort while maintaining difficulty.
Conclusion: The framework addresses the critical data bottleneck in multi-hop QA by automating the generation of training-ready datasets that preserve the difficulty profile of strong benchmarks.
Abstract: Building training-ready multi-hop question answering (QA) datasets that truly stress a model’s retrieval and reasoning abilities remains highly challenging. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems – requiring the integration of ambiguous, indirect, and cross-domain cues – these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions – where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence – incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present an automated framework for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.
[288] UniPlanner: A Unified Motion Planning Framework for Autonomous Vehicle Decision-Making Systems via Multi-Dataset Integration
Xin Yang, Yuhang Zhang, Wei Li, Xin Lin, Wenbin Zou, Chen Xu
Main category: cs.AI
TL;DR: UniPlanner is a multi-dataset motion planning framework for autonomous vehicles that achieves unified cross-dataset learning through three innovations: HFTDN for trajectory aggregation, GFTM for robust correlation learning, and S2D for adaptive training/inference.
Details
Motivation: Existing deep learning planning methods are limited to single-dataset training, restricting their robustness. The authors discovered consistent trajectory distributions and history-future correlations across different datasets, enabling multi-dataset integration.Method: Three synergistic components: 1) HFTDN aggregates trajectory pairs across datasets using similarity retrieval; 2) GFTM learns robust correlations with gradient-free design to prevent shortcut learning; 3) S2D uses adaptive dropout for robust training and full prior utilization during inference.
Result: The framework enables unified cross-dataset learning, making planning knowledge safely transferable while preventing shortcut learning. It generates cross-dataset planning guidance and universal planning priors.
Conclusion: UniPlanner represents the first planning framework designed for multi-dataset integration, addressing limitations of single-dataset approaches and enhancing robustness in autonomous vehicle motion planning.
Abstract: Motion planning is a critical component of autonomous vehicle decision-making systems, directly determining trajectory safety and driving efficiency. While deep learning approaches have advanced planning capabilities, existing methods remain confined to single-dataset training, limiting their robustness in planning. Through systematic analysis, we discover that vehicular trajectory distributions and history-future correlations demonstrate remarkable consistency across different datasets. Based on these findings, we propose UniPlanner, the first planning framework designed for multi-dataset integration in autonomous vehicle decision-making. UniPlanner achieves unified cross-dataset learning through three synergistic innovations. First, the History-Future Trajectory Dictionary Network (HFTDN) aggregates history-future trajectory pairs from multiple datasets, using historical trajectory similarity to retrieve relevant futures and generate cross-dataset planning guidance. Second, the Gradient-Free Trajectory Mapper (GFTM) learns robust history-future correlations from multiple datasets, transforming historical trajectories into universal planning priors. Its gradient-free design ensures the introduction of valuable priors while preventing shortcut learning, making the planning knowledge safely transferable. Third, the Sparse-to-Dense (S2D) paradigm implements adaptive dropout to selectively suppress planning priors during training for robust learning, while enabling full prior utilization during inference to maximize planning performance.
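A minimal sketch of the dictionary-retrieval idea, under strong simplifying assumptions: flatten trajectories, find the stored histories nearest to the current one, and average their recorded futures as a planning prior. UniPlanner's actual HFTDN, gradient-free mapper, and sparse-to-dense training are not reproduced.

```python
import numpy as np

def retrieve_prior(history, bank_histories, bank_futures, k=5):
    """Average the futures of the k most similar stored histories."""
    d = np.linalg.norm(bank_histories - history.reshape(1, -1), axis=1)
    idx = np.argsort(d)[:k]
    return bank_futures[idx].mean(axis=0)

rng = np.random.default_rng(1)
bank_h = rng.normal(size=(1000, 20))  # e.g. 10 past (x, y) waypoints, flattened
bank_f = rng.normal(size=(1000, 12))  # e.g. 6 future (x, y) waypoints, flattened
prior = retrieve_prior(rng.normal(size=20), bank_h, bank_f)  # cross-dataset planning prior
```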
[289] MGA: Memory-Driven GUI Agent for Observation-Centric Interaction
Weihua Cheng, Ersheng Ni, Wenlong Wang, Yifei Sun, Junming Liu, Wangyu Shen, Yirong Chen, Botian Shi, Ding Wang
Main category: cs.AI
TL;DR: MGA is a Memory-Driven GUI Agent that reframes GUI interaction as ‘observe first, then decide’ to address error propagation and local exploration bias in existing GUI agents.
Details
Motivation: Existing GUI agents suffer from dependence on historical trajectories (amplifying error propagation) and local exploration bias ('decision-first, observation-later' mechanisms that overlook critical interface cues).Method: MGA models each step as an independent, context-rich environment state using a triad: current screenshot, task-agnostic spatial information, and dynamically updated structured memory. It follows ‘observe first, then decide’ principle.
Result: Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer show MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines.
Conclusion: MGA’s memory-driven approach with independent step modeling and ‘observe first’ principle effectively addresses key limitations of existing GUI agents, demonstrating superior performance across multiple benchmarks and real applications.
Abstract: The rapid progress of Large Language Models (LLMs) and their multimodal extensions (MLLMs) has enabled agentic systems capable of perceiving and acting across diverse environments. A challenging yet impactful frontier is the development of GUI agents, which must navigate complex desktop and web interfaces while maintaining robustness and generalization. Existing paradigms typically model tasks as long-chain executions, concatenating historical trajectories into the context. While approaches such as Mirage and GTA1 refine planning or introduce multi-branch action selection, they remain constrained by two persistent issues: dependence on historical trajectories, which amplifies error propagation, and local exploration bias, where “decision-first, observation-later” mechanisms overlook critical interface cues. We introduce the Memory-Driven GUI Agent (MGA), which reframes GUI interaction around the principle of observe first, then decide. MGA models each step as an independent, context-rich environment state represented by a triad: current screenshot, task-agnostic spatial information, and a dynamically updated structured memory. Experiments on OSworld benchmarks, real desktop applications (Chrome, VSCode, VLC), and cross-task transfer demonstrate that MGA achieves substantial gains in robustness, generalization, and efficiency compared to state-of-the-art baselines. The code is publicly available at: https://anonymous.4open.science/r/MGA-3571.
[290] MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, Siheng Chen
Main category: cs.AI
TL;DR: MCP-Flow is an automated pipeline that discovers 1166 MCP servers and 11536 tools, generates 68733 instruction-function pairs and 6439 trajectories for training LLMs to better utilize external tools.
Details
Motivation: LLMs struggle to effectively use the expanding Model Contextual Protocol (MCP) ecosystem due to limited server coverage, manual curation requirements, and lack of training support, hindering real-world deployment.Method: Automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training using collected data from 1166 servers and 11536 tools.
Result: Generated 68733 high-quality instruction-function call pairs and 6439 trajectories, significantly exceeding prior work in scale and diversity. Experiments show superior MCP tool selection, function-call generation, and enhanced agentic task performance.
Conclusion: MCP-Flow provides a scalable foundation for advancing LLM agents’ proficiency in real-world MCP environments and is publicly available as open source.
Abstract: Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow’s effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents’ proficiency in real-world MCP environments. MCP-Flow is publicly available at https://github.com/wwh0411/MCP-Flow.
[291] Investigating Intra-Abstraction Policies For Non-exact Abstraction Algorithms
Robin Schmöcker, Alexander Dockhorn, Bodo Rosenhahn
Main category: cs.AI
TL;DR: This paper addresses the tie-breaking problem in MCTS when using abstractions, where multiple actions from the same parent share the same abstract node and thus have identical UCB values. The authors propose and evaluate alternative intra-abstraction policies that outperform the default random tie-breaking used in state-of-the-art methods like pruned OGA.
Details
Motivation: MCTS suffers from sample inefficiency, which can be improved using abstractions to share information among nodes. However, current abstraction methods like pruned OGA don't properly handle cases where multiple actions from the same parent belong to the same abstract node, leading to identical UCB values that require tie-breaking. The default random tie-breaking may be suboptimal.Method: The authors propose and empirically evaluate several alternative intra-abstraction policies to replace the random tie-breaking rule used when multiple actions from the same parent share the same abstract node and have identical UCB values in MCTS.
Result: Several of the proposed intra-abstraction policies outperform the random policy across a majority of environments and parameter settings tested in the empirical evaluation.
Conclusion: Alternative intra-abstraction policies can significantly improve performance over the default random tie-breaking used in current abstraction algorithms for MCTS, addressing an important limitation in how abstractions handle cases with multiple actions sharing the same abstract node.
Abstract: One weakness of Monte Carlo Tree Search (MCTS) is its sample efficiency which can be addressed by building and using state and/or action abstractions in parallel to the tree search such that information can be shared among nodes of the same layer. The primary usage of abstractions for MCTS is to enhance the Upper Confidence Bound (UCB) value during the tree policy by aggregating visits and returns of an abstract node. However, this direct usage of abstractions does not take the case into account where multiple actions with the same parent might be in the same abstract node, as these would then all have the same UCB value, thus requiring a tiebreak rule. In state-of-the-art abstraction algorithms such as pruned On the Go Abstractions (pruned OGA), this case has not been noticed, and a random tiebreak rule was implicitly chosen. In this paper, we propose and empirically evaluate several alternative intra-abstraction policies, several of which outperform the random policy across a majority of environments and parameter settings.
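The tie-breaking problem is easy to see in code: when sibling actions share an abstract node, UCB computed from the aggregated statistics is identical for all of them. Breaking the tie by each action's own visit count is just one illustrative intra-abstraction policy; the numbers below are hypothetical and the paper's evaluated policies are not reproduced.

```python
import math

def ucb(total_return, visits, parent_visits, c=1.4):
    return total_return / visits + c * math.sqrt(math.log(parent_visits) / visits)

# Three sibling actions mapped to the same abstract node (hypothetical statistics).
abstract = {"return": 18.0, "visits": 30}
actions = {"a1": {"visits": 20}, "a2": {"visits": 7}, "a3": {"visits": 3}}
parent_visits = 60

scores = {a: ucb(abstract["return"], abstract["visits"], parent_visits) for a in actions}
# All scores are equal, so the tie must be broken; here we prefer the least-visited
# action instead of picking at random.
chosen = min(actions, key=lambda a: actions[a]["visits"])
print(scores, "->", chosen)
```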
[292] Verifying Large Language Models’ Reasoning Paths via Correlation Matrix Rank
Jiayu Liu, Wei Dai, Zhenya Huang, Ning Miao, Enhong Chen
Main category: cs.AI
TL;DR: The paper proposes Self-Indicator, a method that uses the correlation matrix rank between input problems and output reasoning paths as an internal indicator of LLM reasoning correctness, achieving significant performance improvements with minimal computational overhead.
Details
Motivation: Existing methods for checking LLM outputs rely heavily on external resources like trained verifiers or elaborate prompts, leading to high computational costs and domain specificity. The authors investigate whether LLMs' internal behaviors can indicate reasoning credibility.Method: The method calculates the correlation matrix rank between input problems and output reasoning paths as a correctness indicator. This requires only the LLM itself without external models or complex prompts. A plug-and-play Self-Indicator method reweights candidate reasoning paths based on this indicator.
Result: Self-Indicator achieves over 75% accuracy in distinguishing correct from incorrect reasoning paths and improves accuracies on three reasoning benchmarks by more than 8%. It outperforms other voting and verification methods with minimal computational overhead.
Conclusion: The internal behaviors of LLMs, specifically the correlation matrix rank between inputs and reasoning paths, serve as a robust indicator of reasoning correctness. This enables effective, efficient verification without external resources.
Abstract: Despite the strong reasoning ability of large language models (LLMs), they are prone to errors and hallucinations. As a result, how to check their outputs effectively and efficiently has become a critical problem in their applications. Existing checking methods heavily rely on external resources, such as trained verifiers (e.g., process/outcome reward models) or elaborate prompts, which lead to high computational overhead and are only applicable to specific domains. In this paper, we investigate whether the internal behaviors of LLMs have already implied the credibility of their reasoning paths. Specifically, we find that the rank of the correlation matrix between the input problem and the output reasoning path is a robust indicator of reasoning correctness. Different from other correctness indicators for LLMs, the calculation of the correlation matrix only relies on the LLM itself, which avoids the hassle of training a separate model or designing complicated prompts. Based on it, we design a simple, plug-and-play Self-Indicator method to reweight candidate reasoning paths, which achieves significant performance improvements over other voting and verification methods with very little computational overhead. Our experiments across multiple LLMs of varying scales and model families have further shown the effectiveness of Self-Indicator. It achieves over 75% accuracy in distinguishing correct reasoning paths from incorrect ones, and, in turn, improves the accuracies on three reasoning benchmarks by more than 8%.
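A minimal numeric sketch of the indicator, under the assumption that it can be approximated by correlating input-token and output-token hidden states and taking the matrix rank; the paper's exact construction and thresholding may differ, and the random arrays below merely stand in for real hidden states.

```python
import numpy as np

def correlation_rank(input_states, output_states, tol=1e-6):
    """Rank of the cross-correlation matrix between input and output token states."""
    X = input_states - input_states.mean(axis=1, keepdims=True)
    Y = output_states - output_states.mean(axis=1, keepdims=True)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    C = X @ Y.T                       # (input_tokens, output_tokens) correlations
    return np.linalg.matrix_rank(C, tol=tol)

rng = np.random.default_rng(0)
inp = rng.normal(size=(32, 256))      # hypothetical input-token hidden states
out = rng.normal(size=(120, 256))     # hypothetical reasoning-path hidden states
print(correlation_rank(inp, out))     # higher rank would be read as a correctness signal
```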
[293] Retrieval and Argumentation Enhanced Multi-Agent LLMs for Judgmental Forecasting
Deniz Gorur, Antoni Rago, Francesca Toni
Main category: cs.AI
TL;DR: A multi-agent framework for claim verification using LLMs, where different agents generate evidence for/against claims as QBAFs, improving forecasting accuracy through evidence combination.
Details
Motivation: To improve judgmental forecasting by treating it as claim verification and leveraging multiple agents with different approaches to gather and combine evidence.Method: Multi-agent framework with three types of LLM-powered agents: ArgLLM (existing QBAF approach), RbAM (relation-based argument mining), and RAG-ArgLLM (retrieval-augmented generation). Agents generate quantitative bipolar argumentation frameworks.
Result: Experiments on judgmental forecasting datasets show combining evidence from multiple agents improves accuracy, especially with three agents, while providing explainable verification.
Conclusion: Multi-agent frameworks with diverse evidence-gathering approaches can enhance forecasting accuracy and provide transparent claim verification through combined evidence.
Abstract: Judgmental forecasting is the task of making predictions about future events based on human judgment. This task can be seen as a form of claim verification, where the claim corresponds to a future event and the task is to assess the plausibility of that event. In this paper, we propose a novel multi-agent framework for claim verification, whereby different agents may disagree on claim veracity and bring specific evidence for and against the claims, represented as quantitative bipolar argumentation frameworks (QBAFs). We then instantiate the framework for supporting claim verification, with a variety of agents realised with Large Language Models (LLMs): (1) ArgLLM agents, an existing approach for claim verification that generates and evaluates QBAFs; (2) RbAM agents, whereby LLM-empowered Relation-based Argument Mining (RbAM) from external sources is used to generate QBAFs; (3) RAG-ArgLLM agents, extending ArgLLM agents with a form of Retrieval-Augmented Generation (RAG) of arguments from external sources. Finally, we conduct experiments with two standard judgmental forecasting datasets, with instances of our framework with two or three agents, empowered by six different base LLMs. We observe that combining evidence from agents can improve forecasting accuracy, especially in the case of three agents, while providing an explainable combination of evidence for claim verification.
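To make the QBAF representation concrete, the toy snippet below scores a claim from the base scores of its supporters and attackers with a deliberately simple, damped update; it is not the gradual semantics used by ArgLLM, RbAM, or RAG-ArgLLM agents, and all numbers are invented.

```python
claim_base = 0.5
supports = [0.7, 0.4]   # strengths of arguments for the claim (hypothetical)
attacks  = [0.6]        # strengths of arguments against it (hypothetical)

delta = sum(supports) - sum(attacks)
claim_strength = min(1.0, max(0.0, claim_base + 0.2 * delta))  # damped, clipped update
print(f"claim strength: {claim_strength:.2f}")  # > 0.5 leans toward the event occurring
```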
[294] Generative Large Language Models (gLLMs) in Content Analysis: A Practical Guide for Communication Research
Daria Kravets-Meinke, Hannah Schmid-Petri, Sonja Niemann, Ute Schmid
Main category: cs.AI
TL;DR: Generative LLMs like ChatGPT are revolutionizing communication research content analysis by outperforming human coders in speed, cost, and accuracy, but require addressing 7 key methodological challenges for proper integration.
Details
Motivation: To make gLLM-based content analysis more accessible to communication researchers while ensuring adherence to disciplinary quality standards of validity, reliability, reproducibility, and research ethics.Method: Synthesizes emerging research on gLLM-assisted quantitative content analysis and proposes a comprehensive best-practice guide addressing 7 critical challenges: codebook development, prompt engineering, model selection, parameter tuning, iterative refinement, validation of reliability, and performance enhancement.
Result: The paper provides a framework for integrating gLLMs into communication research methodology, highlighting their advantages over human coders while addressing quality assurance challenges.
Conclusion: gLLMs represent a paradigm shift in automated content analysis for communication research, but require systematic methodological approaches to overcome implementation challenges and maintain research quality standards.
Abstract: Generative Large Language Models (gLLMs), such as ChatGPT, are increasingly being used in communication research for content analysis. Studies show that gLLMs can outperform both crowd workers and trained coders, such as research assistants, on various coding tasks relevant to communication science, often at a fraction of the time and cost. Additionally, gLLMs can decode implicit meanings and contextual information, be instructed using natural language, deployed with only basic programming skills, and require little to no annotated data beyond a validation dataset - constituting a paradigm shift in automated content analysis. Despite their potential, the integration of gLLMs into the methodological toolkit of communication research remains underdeveloped. In gLLM-assisted quantitative content analysis, researchers must address at least seven critical challenges that impact result quality: (1) codebook development, (2) prompt engineering, (3) model selection, (4) parameter tuning, (5) iterative refinement, (6) validation of the model’s reliability, and optionally, (7) performance enhancement. This paper synthesizes emerging research on gLLM-assisted quantitative content analysis and proposes a comprehensive best-practice guide to navigate these challenges. Our goal is to make gLLM-based content analysis more accessible to a broader range of communication researchers and ensure adherence to established disciplinary quality standards of validity, reliability, reproducibility, and research ethics.
[295] VDSAgents: A PCS-Guided Multi-Agent System for Veridical Data Science Automation
Yunxuan Jiang, Silan Hu, Xiaoning Wang, Yuanyuan Zhang, Xiangyu Chang
Main category: cs.AI
TL;DR: VDSAgents is a multi-agent system that embeds Predictability-Computability-Stability (PCS) principles into LLM-driven data science workflows, outperforming existing end-to-end systems like AutoKaggle and DataInterpreter.
Details
Motivation: Current LLM-driven data science systems rely solely on LLM reasoning without scientific guidance, limiting trustworthiness and robustness with real-world noisy datasets.Method: Multi-agent system implementing modular workflow (data cleaning, feature engineering, modeling, evaluation) guided by PCS principles, incorporating perturbation analysis, unit testing, and model validation.
Result: Consistently outperformed AutoKaggle and DataInterpreter across nine diverse datasets using DeepSeek-V3 and GPT-4o backends.
Conclusion: Embedding PCS principles into LLM-driven data science automation is feasible and improves system performance over purely LLM-based approaches.
Abstract: Large language models (LLMs) are becoming increasingly integrated into data science workflows for automated system design. However, these LLM-driven data science systems rely solely on the internal reasoning of LLMs, lacking guidance from scientific and theoretical principles. This limits their trustworthiness and robustness, especially when dealing with noisy and complex real-world datasets. This paper presents VDSAgents, a multi-agent system grounded in the Predictability-Computability-Stability (PCS) principles proposed in the Veridical Data Science (VDS) framework. Guided by PCS principles, the system implements a modular workflow for data cleaning, feature engineering, modeling, and evaluation. Each phase is handled by a dedicated agent, incorporating perturbation analysis, unit testing, and model validation to ensure both functionality and scientific auditability. We evaluate VDSAgents on nine datasets with diverse characteristics, comparing it with state-of-the-art end-to-end data science systems, such as AutoKaggle and DataInterpreter, using DeepSeek-V3 and GPT-4o as backends. VDSAgents consistently outperforms AutoKaggle and DataInterpreter, which validates the feasibility of embedding PCS principles into LLM-driven data science automation.
[296] A Unified Geometric Space Bridging AI Models and the Human Brain
Silin Chen, Yuzhong Chen, Zifan Wang, Junhao Wang, Zifeng Jia, Keith M Kendrick, Tuo Zhang, Lin Zhao, Dezhong Yao, Tianming Liu, Xi Jiang
Main category: cs.AI
TL;DR: The paper introduces Brain-like Space, a unified geometric framework that maps AI models’ intrinsic spatial attention organization to human brain networks, enabling comparison across different modalities and revealing a continuous arc-shaped geometry of brain-likeness.
Details
Motivation: To understand whether artificial neural networks organize information like the human brain, and to create a common ground for comparing AI models across different modalities (vision, language, multimodal) beyond specific inputs and tasks.Method: Developed Brain-like Space concept that maps AI models’ intrinsic spatial attention topological organization onto canonical human functional brain networks. Analyzed 151 Transformer-based models including large vision models, large language models, and large multimodal models.
Result: Uncovered a continuous arc-shaped geometry reflecting gradual increase in brain-likeness. Found that brain-likeness is shaped by pretraining paradigm emphasizing global semantic abstraction and positional encoding facilitating deep cross-modal fusion. Brain-likeness and downstream task performance are not identical.
Conclusion: Brain-like Space provides the first unified framework for situating, quantifying, and comparing intelligence across domains, revealing deep organizational principles that bridge machines and the brain.
Abstract: For decades, neuroscientists and computer scientists have pursued a shared ambition: to understand intelligence and build it. Modern artificial neural networks now rival humans in language, perception, and reasoning, yet it is still largely unknown whether these artificial systems organize information as the brain does. Existing brain-AI alignment studies have shown the striking correspondence between the two systems, but such comparisons remain bound to specific inputs and tasks, offering no common ground for comparing how AI models with different kinds of modalities-vision, language, or multimodal-are intrinsically organized. Here we introduce a groundbreaking concept of Brain-like Space: a unified geometric space in which every AI model can be precisely situated and compared by mapping its intrinsic spatial attention topological organization onto canonical human functional brain networks, regardless of input modality, task, or sensory domain. Our extensive analysis of 151 Transformer-based models spanning state-of-the-art large vision models, large language models, and large multimodal models uncovers a continuous arc-shaped geometry within this space, reflecting a gradual increase of brain-likeness; different models exhibit distinct distribution patterns within this geometry associated with different degrees of brain-likeness, shaped not merely by their modality but by whether the pretraining paradigm emphasizes global semantic abstraction and whether the positional encoding scheme facilitates deep fusion across different modalities. Moreover, the degree of brain-likeness for a model and its downstream task performance are not “identical twins”. The Brain-like Space provides the first unified framework for situating, quantifying, and comparing intelligence across domains, revealing the deep organizational principles that bridge machines and the brain.
[297] An N-of-1 Artificial Intelligence Ecosystem for Precision Medicine
Pedram Fard, Alaleh Azhir, Neguine Rezaii, Jiazi Tian, Hossein Estiri
Main category: cs.AI
TL;DR: The paper proposes a multi-agent ecosystem for N-of-1 decision support in medical AI, shifting from monolithic models to orchestrated intelligence that provides personalized care with transparency and equity.
Details
Motivation: Current AI systems in medicine serve the average patient well but fail at the margins - patients with rare variants, multimorbidity, or underrepresented demographics, creating equity and trust issues.Method: A multi-agent ecosystem where agents clustered by organ systems, patient populations, and analytic modalities use shared models and evidence synthesis tools, with results converging in a coordination layer that weighs reliability, uncertainty, and data density.
Result: The system generates decision-support packets with risk estimates bounded by confidence ranges, outlier flags, and linked evidence, shifting validation from population averages to individual reliability metrics.
Conclusion: This approach aligns medical AI with the first principle of medicine by providing transparent, equitable, and individually-centered care through orchestrated intelligence rather than monolithic models.
Abstract: Artificial intelligence in medicine is built to serve the average patient. By minimizing error across large datasets, most systems deliver strong aggregate accuracy yet falter at the margins: patients with rare variants, multimorbidity, or underrepresented demographics. This average patient fallacy erodes both equity and trust. We propose a different design: a multi-agent ecosystem for N-of-1 decision support. In this environment, agents clustered by organ systems, patient populations, and analytic modalities draw on a shared library of models and evidence synthesis tools. Their results converge in a coordination layer that weighs reliability, uncertainty, and data density before presenting the clinician with a decision-support packet: risk estimates bounded by confidence ranges, outlier flags, and linked evidence. Validation shifts from population averages to individual reliability, measured by error in low-density regions, calibration in the small, and risk–coverage trade-offs. Anticipated challenges include computational demands, automation bias, and regulatory fit, addressed through caching strategies, consensus checks, and adaptive trial frameworks. By moving from monolithic models to orchestrated intelligence, this approach seeks to align medical AI with the first principle of medicine: care that is transparent, equitable, and centered on the individual.
[298] Improving LLM Reasoning via Dependency-Aware Query Decomposition and Logic-Parallel Content Expansion
Xianjun Gao, Jianchun Liu, Hongli Xu, Liusheng Huang
Main category: cs.AI
TL;DR: Orion is an efficient LLM reasoning framework that uses dependency-aware query decomposition and parallel content expansion to achieve both high reasoning quality and low latency for web applications.
Details
Motivation: Current LLM reasoning creates bottlenecks for web services due to inefficient sequential generation and rigid reasoning strategies, failing to meet both efficiency and quality requirements simultaneously.Method: Two-phase approach: (1) key point generation with retrieval-augmented few-shot prompting, and (2) parallel content expansion using dependency graphs for logical consistency, plus pipeline scheduling for cross-query parallelism.
Result: Achieves up to 4.33x higher token generation speed, 3.42x lower answer latency, and 18.75% improvement in reasoning quality compared to baselines.
Conclusion: Orion successfully addresses the dual requirements of web services by enabling efficient, high-quality LLM reasoning through dependency-aware parallel processing.
Abstract: The integration of Large Language Models (LLMs) into real-time Web applications, such as AI-powered search and conversational agents, presents a fundamental Web infrastructure challenge: reconciling the demand for high-quality, complex reasoning with the stringent low-latency and high-throughput requirements of interactive services. Current LLM reasoning, hindered by computationally inefficient sequential generation and rigid reasoning strategies, creates a critical bottleneck for Web services. Existing approaches typically optimize the LLM reasoning for either efficiency or quality but struggle to achieve both, and thus fail to meet the dual requirements of modern Web platforms. To overcome these limitations, we propose Orion, a novel and efficient reasoning framework that enables dependency-aware query decomposition and logic-parallel content expansion. Concretely, Orion decomposes a single query reasoning process into two synergistic phases: (1) key point generation, which distills logically structured key points through retrieval-augmented few-shot prompting, and (2) content parallel expansion, which concurrently elaborates on these points based on a dependency graph to ensure logical consistency. Furthermore, Orion introduces a pipeline scheduling mechanism that exploits the complementary computational characteristics of the two phases (generation imposes pressure on GPU computing and expansion stresses GPU memory) across multiple queries, enabling cross-query parallelism and dramatically improving reasoning performance (i.e., efficiency and quality). Experiments on diverse benchmarks show that Orion not only delivers up to 4.33x higher token generation speed and 3.42x lower answer latency over the baselines but also improves reasoning quality by up to 18.75% through explicitly modeling inter-point dependencies.
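The dependency-graph scheduling is simple to sketch: key points whose prerequisites are already expanded can be elaborated in the same parallel wave. The toy graph and the wave construction below are illustrative only; they are not Orion's prompting templates or its GPU pipeline scheduler.

```python
deps = {  # key point -> key points it depends on (hypothetical)
    "A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"],
}

def parallel_waves(deps):
    """Group key points into waves that can be expanded concurrently."""
    remaining, done, waves = dict(deps), set(), []
    while remaining:
        ready = [p for p, d in remaining.items() if all(x in done for x in d)]
        if not ready:
            raise ValueError("dependency cycle")
        waves.append(ready)          # these expansions can run in parallel
        done.update(ready)
        for p in ready:
            remaining.pop(p)
    return waves

print(parallel_waves(deps))  # [['A'], ['B', 'C'], ['D']]
```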
[299] APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training
Jiarui Qin, Yunjia Xi, Junjie Huang, Renting Rui, Di Yin, Weiwen Liu, Yong Yu, Weinan Zhang, Xing Sun
Main category: cs.AI
TL;DR: APTBench is a new benchmark framework that converts real-world agent tasks into multiple-choice/text completion questions to evaluate agentic capabilities during LLM pre-training, addressing the gap between current pre-training benchmarks and agent evaluation needs.
Details
Motivation: Current pre-training benchmarks focus on isolated static skills and fail to assess agentic capabilities, while agent benchmarks require post-trained models. There's a need for benchmarks that can evaluate agentic potential during pre-training to guide model development more effectively.Method: APTBench converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions tailored for base models. It focuses on core agentic abilities like planning and action, covering key scenarios in software engineering and deep research.
Result: APTBench provides a more predictive signal of a model’s downstream agent performance compared to general-purpose benchmarks, while being significantly more lightweight and cost-effective than full-scale end-to-end agent evaluations after post-training.
Conclusion: APTBench fills a critical gap in evaluating agentic capabilities during LLM pre-training, enabling more effective guidance of model development and better alignment with real-world autonomous task execution requirements.
Abstract: With the rapid development of LLM-based agents, there is a growing trend to incorporate agent-specific data into the pre-training stage of LLMs, aiming to better align LLMs with real-world autonomous task execution. However, current pre-training benchmarks primarily focus on isolated and static skills, e.g., common knowledge or mathematical/code reasoning, and fail to reflect a model’s agentic capabilities. On the other hand, agent benchmarks are typically designed for post-trained models, requiring multi-turn task execution abilities that base models struggle to support. Thus, there is a compelling need for a benchmark that can evaluate agentic potentials during pre-training and guide the model training more effectively. To address this gap, we propose APTBench, a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions tailored for base models. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios such as software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model’s downstream performance as an agent, while remaining significantly more lightweight and cost-effective than full-scale, end-to-end agent evaluations after post-training.
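As a rough illustration of the conversion step, the snippet below turns one step of an invented agent trajectory into a multiple-choice question that a base model could answer by completion; the template and distractors are assumptions, not APTBench's actual construction or scoring.

```python
import random

step = {
    "goal": "Fix the failing unit test in utils/date.py",
    "observation": "pytest reports: AssertionError in test_parse_iso_date",
    "correct_action": "Open utils/date.py and inspect parse_iso_date",
}
distractors = [
    "Delete the failing test file",
    "Reinstall the Python interpreter",
    "Commit the current changes immediately",
]

options = distractors + [step["correct_action"]]
random.shuffle(options)
letters = "ABCD"
answer = letters[options.index(step["correct_action"])]

prompt = (
    f"Task: {step['goal']}\nObservation: {step['observation']}\n"
    "Which action should the agent take next?\n"
    + "\n".join(f"{letters[i]}. {opt}" for i, opt in enumerate(options))
)
print(prompt, "\nAnswer:", answer)
```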
[300] OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows
Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, Lingpeng Kong
Main category: cs.AI
TL;DR: MobileRisk-Live introduces a dynamic sandbox environment and safety benchmark for mobile agent safety research, while OS-Sentinel proposes a hybrid framework combining formal verification and VLM-based contextual judgment to detect safety risks in mobile agents.
Details
Motivation: Vision-Language Model (VLM) powered agents show promise for digital automation but pose significant safety risks including system compromise and privacy leakage. Current methods struggle to detect these risks across the vast operational space of mobile environments.Method: Proposed OS-Sentinel framework combines: 1) Formal Verifier for explicit system-level violation detection, and 2) VLM-based Contextual Judge for assessing contextual risks and agent actions. Built on MobileRisk-Live benchmark with realistic trajectories and fine-grained annotations.
Result: OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics, demonstrating superior safety detection capabilities.
Conclusion: The framework provides critical insights for developing safer and more reliable autonomous mobile agents, establishing a foundation for mobile agent safety research.
Abstract: Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. While these agents hold great promise for advancing digital automation, their potential for unsafe operations, such as system compromise and privacy leakage, is raising significant concerns. Detecting these safety concerns across the vast and complex operational space of mobile environments presents a formidable challenge that remains critically underexplored. To establish a foundation for mobile agent safety research, we introduce MobileRisk-Live, a dynamic sandbox environment accompanied by a safety detection benchmark comprising realistic trajectories with fine-grained annotations. Built upon this, we propose OS-Sentinel, a novel hybrid safety detection framework that synergistically combines a Formal Verifier for detecting explicit system-level violations with a VLM-based Contextual Judge for assessing contextual risks and agent actions. Experiments show that OS-Sentinel achieves 10%-30% improvements over existing approaches across multiple metrics. Further analysis provides critical insights that foster the development of safer and more reliable autonomous mobile agents.
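A minimal sketch of the formal-verification half of such a hybrid checker: match a proposed agent action against explicit system-level rules before any VLM-based contextual judgment. The rule patterns and action strings below are invented and do not correspond to OS-Sentinel's actual policy set.

```python
import re

RULES = [
    (r"\brm\s+-rf\b", "destructive file operation"),
    (r"settings.*factory[_ ]reset", "system reset attempt"),
    (r"send_sms\(.*\)", "outbound message with potential data leakage"),
]

def formal_verify(action: str):
    """Return explicit violations; otherwise defer to the contextual judge."""
    violations = [label for pattern, label in RULES if re.search(pattern, action, re.I)]
    return ("BLOCK", violations) if violations else ("PASS_TO_CONTEXTUAL_JUDGE", [])

print(formal_verify("tap('Settings'); factory_reset()"))
print(formal_verify("type_text('hello world')"))
```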
[301] Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning
Benjamin Grando Moreira
Main category: cs.AI
TL;DR: This study compares logical and abstract reasoning abilities of multiple LLMs against human performance using custom-designed reasoning questions, revealing significant gaps in LLMs’ deduction capabilities.
Details
Motivation: To evaluate whether LLMs truly understand information, perform inferences, and draw logical conclusions beyond just linguistic task performance, advancing AI reasoning capabilities.Method: Compare reasoning skills of GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabi’a using eight custom-designed reasoning questions, benchmarking results against human performance on the same tasks.
Result: Revealed significant differences between LLM performance and human performance, indicating areas where LLMs struggle with deduction despite their linguistic capabilities.
Conclusion: LLMs show limitations in logical and abstract reasoning compared to humans, highlighting the need for continued development in true reasoning capabilities beyond surface-level linguistic performance.
Abstract: Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it transcends mere linguistic task performance. It involves assessing whether these models truly understand information, perform inferences, and are able to draw conclusions in a logical and valid way. This study compares the logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabi'a - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.
[302] Adaptive Surrogate Gradients for Sequential Reinforcement Learning in Spiking Neural Networks
Korneel Van den Berghe, Stein Stroobants, Vijay Janapa Reddi, G. C. H. E. de Croon
Main category: cs.AI
TL;DR: This paper addresses challenges in training Spiking Neural Networks (SNNs) for robotics control by analyzing surrogate gradient slopes and proposing a novel training approach with privileged guiding policy, achieving significant performance improvements in real-world drone control tasks.
Details
Motivation: Neuromorphic computing offers energy efficiency for robotics, but SNNs face challenges: non-differentiable spiking neurons requiring surrogate gradients with unclear optimization properties, and stateful dynamics requiring sequence training hindered by limited sequence lengths in reinforcement learning.Method: Systematically analyzed surrogate gradient slope settings and their effects on gradient magnitude and alignment. Proposed a novel training approach using privileged guiding policy to bootstrap learning while maintaining online environment interactions with spiking policy. Combined with adaptive slope scheduling.
Result: Shallower slopes or scheduled slopes led to 2.1x improvement in both training and final deployed performance in RL settings. Achieved average return of 400 points in real-world drone position control, substantially outperforming prior techniques like Behavioral Cloning and TD3BC (which achieved at most -200 points).
Conclusion: This work advances theoretical understanding of surrogate gradient learning in SNNs and provides practical training methodologies for neuromorphic controllers, demonstrating significant improvements in real-world robotic systems.
Abstract: Neuromorphic computing systems are set to revolutionize energy-constrained robotics by achieving orders-of-magnitude efficiency gains, while enabling native temporal processing. Spiking Neural Networks (SNNs) represent a promising algorithmic approach for these systems, yet their application to complex control tasks faces two critical challenges: (1) the non-differentiable nature of spiking neurons necessitates surrogate gradients with unclear optimization properties, and (2) the stateful dynamics of SNNs require training on sequences, which in reinforcement learning (RL) is hindered by limited sequence lengths during early training, preventing the network from bridging its warm-up period. We address these challenges by systematically analyzing surrogate gradient slope settings, showing that shallower slopes increase gradient magnitude in deeper layers but reduce alignment with true gradients. In supervised learning, we find no clear preference for fixed or scheduled slopes. The effect is much more pronounced in RL settings, where shallower slopes or scheduled slopes lead to a 2.1x improvement in both training and final deployed performance. Next, we propose a novel training approach that leverages a privileged guiding policy to bootstrap the learning process, while still exploiting online environment interactions with the spiking policy. Combining our method with an adaptive slope schedule for a real-world drone position control task, we achieve an average return of 400 points, substantially outperforming prior techniques, including Behavioral Cloning and TD3BC, which achieve at most –200 points under the same conditions. This work advances both the theoretical understanding of surrogate gradient learning in SNNs and practical training methodologies for neuromorphic controllers demonstrated in real-world robotic systems.
[303] From Cross-Task Examples to In-Task Prompts: A Graph-Based Pseudo-Labeling Framework for In-context Learning
Zihan Chen, Song Wang, Xingbo Fu, Chengshuai Shi, Zhenyu Lei, Cong Shen, Jundong Li
Main category: cs.AI
TL;DR: A two-stage pipeline reduces LLM labeling costs for ICL by using cross-task examples for pseudo-labeling and graph-based propagation to create demonstrations.
Details
Motivation: Collecting high-quality examples for new or challenging tasks is costly and labor-intensive, especially for in-context learning.Method: Two-stage pipeline: 1) Use cross-task examples to prompt LLM for pseudo-labeling small target set, 2) Graph-based label propagation to spread labels without additional LLM queries.
Result: Achieves strong performance across five tasks while significantly lowering labeling costs compared to traditional approaches.
Conclusion: Combines cross-task supervision flexibility with LLM-free propagation scalability for cost-efficient ICL demonstration construction.
Abstract: The capability of in-context learning (ICL) enables large language models (LLMs) to perform novel tasks without parameter updates by conditioning on a few input-output examples. However, collecting high-quality examples for new or challenging tasks can be costly and labor-intensive. In this work, we propose a cost-efficient two-stage pipeline that reduces reliance on LLMs for data labeling. Our approach first leverages readily available cross-task examples to prompt an LLM and pseudo-label a small set of target task instances. We then introduce a graph-based label propagation method that spreads label information to the remaining target examples without additional LLM queries. The resulting fully pseudo-labeled dataset is used to construct in-task demonstrations for ICL. This pipeline combines the flexibility of cross-task supervision with the scalability of LLM-free propagation. Experiments across five tasks demonstrate that our method achieves strong performance while lowering labeling costs.
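The graph-propagation step is only named above, not spelled out; below is a minimal sketch of how a handful of LLM-produced pseudo-labels could be spread over a k-NN similarity graph of example embeddings (standard label propagation; the graph construction, hyperparameters, and names are illustrative assumptions, not the paper's implementation).

```python
import numpy as np

def propagate_labels(embeddings, seed_labels, n_classes, k=10, alpha=0.99, iters=50):
    """Spread seed pseudo-labels over a k-NN similarity graph (illustrative sketch).

    embeddings  : (n, d) array of example embeddings
    seed_labels : dict {index: class_id} produced by the LLM pseudo-labeling stage
    """
    n = embeddings.shape[0]
    # Cosine-similarity graph, keeping the top-k neighbours per node
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T
    np.fill_diagonal(sim, 0.0)
    W = np.zeros_like(sim)
    for i in range(n):
        nn = np.argsort(sim[i])[-k:]
        W[i, nn] = sim[i, nn]
    W = np.maximum(W, W.T)                      # symmetrize
    d = W.sum(axis=1) + 1e-12
    S = W / np.sqrt(np.outer(d, d))             # normalized affinity

    Y = np.zeros((n, n_classes))
    for idx, c in seed_labels.items():
        Y[idx, c] = 1.0
    F = Y.copy()
    for _ in range(iters):                      # F <- alpha*S*F + (1-alpha)*Y
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)                     # pseudo-labels for every example
```

The resulting fully pseudo-labeled pool can then be mined for in-task demonstrations without any further LLM calls.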
[304] Generative AI for Healthcare: Fundamentals, Challenges, and Perspectives
Gang Chen, Changshuo Liu, Gene Anne Ooi, Marcus Tan, Zhongle Xie, Jianwei Yin, James Wei Luen Yip, Wenqiao Zhang, Jiaqi Zhu, Beng Chin Ooi
Main category: cs.AI
TL;DR: This paper proposes a data-centric paradigm for deploying Generative AI in healthcare, positioning the medical data ecosystem as the foundational substrate to support GenAI systems through effective data processing and knowledge retrieval.
Details
Motivation: GenAI promises transformative opportunities in healthcare but requires understanding of what can be achieved. Current deployments need better integration with healthcare tasks and data systems.Method: Reposition the data life cycle by making medical data ecosystem the foundational substrate. Implement semantic vector search, contextual querying, and effective data processing pipelines to support GenAI operations for both upstream model training and downstream clinical applications.
Result: The proposed ecosystem enables deployment of GenAI for high-quality healthcare delivery by supplying foundation models with high-quality multimodal data and serving as a knowledge retrieval backend via agentic layer.
Conclusion: A data-centric paradigm with medical data ecosystem as foundation enables sustainable integration, representation, and retrieval of diverse medical data, supporting effective GenAI deployment in healthcare.
Abstract: Generative Artificial Intelligence (GenAI) is taking the world by storm. It promises transformative opportunities for advancing and disrupting existing practices, including healthcare. From large language models (LLMs) for clinical note synthesis and conversational assistance to multimodal systems that integrate medical imaging, electronic health records, and genomic data for decision support, GenAI is transforming the practice of medicine and the delivery of healthcare, such as diagnosis and personalized treatments, with great potential in reducing the cognitive burden on clinicians, thereby improving overall healthcare delivery. However, GenAI deployment in healthcare requires an in-depth understanding of healthcare tasks and what can and cannot be achieved. In this paper, we propose a data-centric paradigm in the design and deployment of GenAI systems for healthcare. Specifically, we reposition the data life cycle by making the medical data ecosystem as the foundational substrate for generative healthcare systems. This ecosystem is designed to sustainably support the integration, representation, and retrieval of diverse medical data and knowledge. With effective and efficient data processing pipelines, such as semantic vector search and contextual querying, it enables GenAI-powered operations for upstream model components and downstream clinical applications. Ultimately, it not only supplies foundation models with high-quality, multimodal data for large-scale pretraining and domain-specific fine-tuning, but also serves as a knowledge retrieval backend to support task-specific inference via the agentic layer. The ecosystem enables the deployment of GenAI for high-quality and effective healthcare delivery.
[305] FunReason-MT Technical Report: Overcoming the Complexity Barrier in Multi-Turn Function Calling
Zengzhuang Xu, Bingguang Hao, Zechuan Wang, Yuntao Wen, Maolin Wang, Yang Liu, Long Chen, Dong Wang, Yicheng Chen, Cunyin Peng, Chenyi Zhuang, Jinjie Gu, Leilei Gan, Xiangyu Zhao, Shi Gu
Main category: cs.AI
TL;DR: FunReason-MT is a novel data synthesis framework that generates high-quality multi-turn function calling training data for LLMs, addressing challenges in targeted model training, tool architecture isolation, and multi-turn logical dependencies.
Details
Motivation: Existing data synthesis methods are insufficient for generating high-quality multi-turn function calling data in real-world environments, creating a bottleneck for developing advanced AI systems that need to interface with external tools.Method: The framework employs three key techniques: Environment-API Graph Interactions for trajectory collection, Advanced Tool-Query Synthesis for query construction, and Guided Iterative Chain for sophisticated Chain-of-Thought generation.
Result: A 4B model trained on FunReason-MT generated data achieves state-of-the-art performance on Berkeley Function-Calling Leaderboard (BFCLv3), outperforming most closed-source models of comparable size, with further improvements confirmed on BFCLv4.
Conclusion: FunReason-MT provides a reliable and robust source for agentic learning, effectively addressing the structural deficiencies in multi-turn function calling data synthesis.
Abstract: Function calling (FC) empowers large language models (LLMs) and autonomous agents to interface with external tools, a critical capability for solving complex, real-world problems. As this ability becomes increasingly central to advanced AI systems, the need for high-quality, multi-turn training data to develop and refine it cannot be overstated. Existing data synthesis methods, such as random environment sampling or multi-agent role-playing, are not powerful enough to generate high-quality data in real-world environments. Practical challenges come in three folds: targeted model training, isolation of tool architecture, and multi-turn logical dependency. To address these structural deficiencies, we present FunReason-MT, a novel data synthesis framework for real-world multi-turn tool use. FunReason-MT resolves the complexity barrier in multi-turn FC data by employing 1) Environment-API Graph Interactions to gather varied high-quality trajectories, 2) Advanced Tool-Query Synthesis to simplify hard query construction, and 3) Guided Iterative Chain for sophisticated CoT generation. Evaluations on Berkeley Function-Calling Leaderboard (BFCLv3) demonstrate the power of our framework: a 4B model built upon FunReason-MT generated data achieves state-of-the-art performance among comparable-sized models, outperforming most close-source models. Further performance improvements on BFCLv4 confirm that FunReason-MT provides a reliable and robust source for agentic learning.
[306] Advancing site-specific disease and pest management in precision agriculture: From reasoning-driven foundation models to adaptive, feedback-based learning
Nitin Rai, Daeun Choi, Nathan S. Boyd, Arnold W. Schumann
Main category: cs.AI
TL;DR: Foundation models (FMs) are transforming site-specific disease management in crops through vision-language integration, enabling symptom interpretation, reasoning, and interactive QA, with VLMs showing 5-10x more growth than LLMs.
Details
Motivation: To advance precision agriculture by leveraging foundation models for real-time computer vision and multi-modal data processing in crop disease management, moving beyond traditional neural networks.Method: Reviewed ~40 articles on FM applications, focusing on LLMs and VLMs, and analyzed their integration with adaptive learning, reinforcement learning, and digital twin frameworks for targeted spraying.
Result: FMs are rapidly gaining traction (2023-24 surge), VLMs outpace LLMs significantly, RL/AL remain nascent for smart spraying, digital twins enable virtual simulation, and human-robot collaboration is limited.
Conclusion: Multi-modal FMs with real-time feedback will drive next-generation SSDM, though addressing the sim-to-real gap and enhancing human-in-the-loop approaches remain critical challenges.
Abstract: Site-specific disease management (SSDM) in crops has advanced rapidly through machine and deep learning (ML and DL) for real-time computer vision. Research evolved from handcrafted feature extraction to large-scale automated feature learning. With foundation models (FMs), crop disease datasets are now processed in fundamentally new ways. Unlike traditional neural networks, FMs integrate visual and textual data, interpret symptoms in text, reason about symptom-management relationships, and support interactive QA for growers and educators. Adaptive and imitation learning in robotics further enables field-based disease management. This review screened approx. 40 articles on FM applications for SSDM, focusing on large-language models (LLMs) and vision-language models (VLMs), and discussing their role in adaptive learning (AL), reinforcement learning (RL), and digital twin frameworks for targeted spraying. Key findings: (a) FMs are gaining traction with surging literature in 2023-24; (b) VLMs outpace LLMs, with a 5-10x increase in publications; (c) RL and AL are still nascent for smart spraying; (d) digital twins with RL can simulate targeted spraying virtually; (e) addressing the sim-to-real gap is critical for real-world deployment; (f) human-robot collaboration remains limited, especially in human-in-the-loop approaches where robots detect early symptoms and humans validate uncertain cases; (g) multi-modal FMs with real-time feedback will drive next-gen SSDM. For updates, resources, and contributions, visit https://github.com/nitin-dominic/AgriPathogenDatabase to submit papers, code, or datasets.
[307] OrchDAG: Complex Tool Orchestration in Multi-Turn Interactions with Plan DAGs
Yifu Lu, Shengjie Liu, Li Dong
Main category: cs.AI
TL;DR: OrchDAG is a synthetic data generation pipeline that models multi-turn tool execution as DAGs with controllable complexity, used to benchmark models and enhance RLVR training with graph-based rewards.
Details
Motivation: Most existing work on agentic tool use overlooks the complexity of multi-turn tool interactions, creating a need for better benchmarks and training methods.Method: Introduces OrchDAG pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity, and proposes a graph-based reward for RLVR training.
Result: The dataset presents a challenging but solvable benchmark, and the graph-based reward is effective when combined with GRPO-style algorithms.
Conclusion: Leveraging topological structure and data complexity is important for improving multi-turn tool use capabilities in agents.
Abstract: Agentic tool use has gained traction with the rise of agentic tool calling, yet most existing work overlooks the complexity of multi-turn tool interactions. We introduce OrchDAG, a synthetic data generation pipeline that models tool execution as directed acyclic graphs (DAGs) with controllable complexity. Using this dataset, we benchmark model performance and propose a graph-based reward to enhance RLVR training. Experiments show that the dataset presents a challenging but solvable benchmark, and the proposed reward is effective when combined with GRPO-style algorithms, highlighting the importance of leveraging topological structure and data complexity in multi-turn tool use.
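To make the DAG framing concrete, here is a hedged sketch of representing a multi-turn tool plan as a directed acyclic graph and scoring a predicted plan against a reference with a simple node/edge-overlap reward; the reward definition and tool names are illustrative assumptions, not the paper's exact graph-based reward.

```python
import networkx as nx

def plan_dag(edges):
    """Build a tool-execution DAG from (producer_tool, consumer_tool) dependencies."""
    g = nx.DiGraph(edges)
    assert nx.is_directed_acyclic_graph(g), "tool plan must be acyclic"
    return g

def graph_reward(pred: nx.DiGraph, ref: nx.DiGraph) -> float:
    """Toy graph-based reward: average F1 over nodes and edges of the reference plan."""
    def f1(p, r):
        p, r = set(p), set(r)
        if not p or not r:
            return 0.0
        prec, rec = len(p & r) / len(p), len(p & r) / len(r)
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return 0.5 * f1(pred.nodes, ref.nodes) + 0.5 * f1(pred.edges, ref.edges)

ref = plan_dag([("search_flights", "book_flight"), ("search_hotels", "book_hotel")])
pred = plan_dag([("search_flights", "book_flight")])
print(graph_reward(pred, ref))  # partial credit for the matched sub-plan
```

A reward of this shape plugs naturally into GRPO-style training, giving dense credit for partially correct orchestration rather than all-or-nothing final answers.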
[308] Bridging Tool Dependencies and Domain Knowledge: A Graph-Based Framework for In-Context Planning
Shengjie Liu, Li Dong, Zhenyu Zhang
Main category: cs.AI
TL;DR: A framework that combines tool knowledge graphs with domain knowledge graphs to improve exemplar artifact generation by modeling tool dependencies and procedural knowledge.
Details
Motivation: To enhance exemplar artifact generation by uncovering and exploiting dependencies among tools and documents, addressing the need for better tool-augmented reasoning and planning.Method: Constructs tool knowledge graph from tool schemas using DeepResearch-inspired analysis, creates complementary knowledge graph from documents/SOPs, fuses both graphs, and uses deep-sparse integration to align structural tool dependencies with procedural knowledge for plan generation.
Result: Experiments show the unified framework effectively models tool interactions and improves plan generation.
Conclusion: Linking tool graphs with domain knowledge graphs provides significant benefits for tool-augmented reasoning and planning.
Abstract: We present a framework for uncovering and exploiting dependencies among tools and documents to enhance exemplar artifact generation. Our method begins by constructing a tool knowledge graph from tool schemas,including descriptions, arguments, and output payloads, using a DeepResearch-inspired analysis. In parallel, we derive a complementary knowledge graph from internal documents and SOPs, which is then fused with the tool graph. To generate exemplar plans, we adopt a deep-sparse integration strategy that aligns structural tool dependencies with procedural knowledge. Experiments demonstrate that this unified framework effectively models tool interactions and improves plan generation, underscoring the benefits of linking tool graphs with domain knowledge graphs for tool-augmented reasoning and planning.
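A minimal sketch of the two-graph idea, assuming tool schemas are available as dicts and document/SOP knowledge as (head, relation, tail) triples; the schema fields, linking rule, and networkx representation are assumptions for illustration rather than the paper's construction.

```python
import networkx as nx

def build_tool_graph(tool_schemas):
    """Connect tools whose output payload fields feed another tool's arguments."""
    g = nx.DiGraph()
    for tool in tool_schemas:
        g.add_node(tool["name"], kind="tool", description=tool["description"])
    for src in tool_schemas:
        for dst in tool_schemas:
            if src is not dst and set(src["outputs"]) & set(dst["arguments"]):
                g.add_edge(src["name"], dst["name"], kind="feeds")
    return g

def fuse(tool_graph, doc_triples):
    """Overlay document/SOP-derived knowledge onto the tool dependency graph."""
    fused = tool_graph.copy()
    for head, relation, tail in doc_triples:
        fused.add_edge(head, tail, kind=relation)
    return fused

tools = [
    {"name": "lookup_order", "description": "find an order", "arguments": ["order_id"],
     "outputs": ["order_id", "customer_id"]},
    {"name": "issue_refund", "description": "refund an order", "arguments": ["order_id"],
     "outputs": ["refund_id"]},
]
kg = fuse(build_tool_graph(tools), [("issue_refund", "requires_approval", "manager_sop")])
```

Exemplar plans can then be drafted by walking paths in the fused graph so that tool ordering respects both data dependencies and procedural constraints.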
[309] Mining Large Independent Sets on Massive Graphs
Yu Zhang, Witold Pedrycz, Chanjuan Liu, Enqiang Zhu
Main category: cs.AI
TL;DR: ARCIS is an efficient algorithm for finding large independent sets in massive graphs using adaptive restarts and consensus-guided vertex fixing to improve search efficiency and avoid stagnation.
Details
Motivation: Existing heuristics for Maximum Independent Set problem stagnate with fixed search schedules and underuse past solution information, leading to wasted effort in low-quality search regions.Method: ARCIS combines adaptive restart policy that refreshes exploration when progress slows, and Consensus-Guided Vertex Fixing that restricts search to non-consensus regions by fixing vertices consistently observed within rounds, with reversible fixing that allows unfixing vertices that lose support.
Result: Experiments on 222 graphs from four benchmark suites show ARCIS attains best or tied-best solution quality in most instances while delivering competitive runtime and low variability.
Conclusion: ARCIS is a practical and robust method for large-scale graph mining, with ablation studies confirming the impact of each component.
Abstract: The Maximum Independent Set problem is fundamental for extracting conflict-free structure from large graphs, with applications in scheduling, recommendation, and network analysis. However, existing heuristics can stagnate when search schedules are fixed and information from past solutions is underused, leading to wasted effort in low-quality regions of the search space. We present ARCIS, an efficient algorithm for mining large independent sets on massive graphs. ARCIS couples two main components. The first is an adaptive restart policy that refreshes exploration when progress slows. The second is Consensus-Guided Vertex Fixing, which restricts the search to the non-consensus region of the graph by fixing vertices consistently observed within a round. The consensus is maintained as a running intersection within each round, and because it is recomputed at every restart, the fixing is reversible. Vertices that later lose support are automatically unfixed and their neighborhoods re-enter the working graph, which corrects occasional mistakes while preserving progress. Experiments on 222 graphs from four benchmark suites show that ARCIS attains the best or tied-best solution quality in most instances while delivering competitive runtime and low variability. Ablation studies isolate the impact of each component, indicating that ARCIS is a practical and robust method for large-scale graph mining.
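A hedged sketch of the consensus idea on top of a placeholder local search: several probe solutions are intersected, the consensus vertices are fixed, and the subsequent search is restricted to the undecided region; a restart would recompute the consensus from scratch, which is what makes the fixing reversible. The probe routine and all names are illustrative, not ARCIS itself.

```python
import random
import networkx as nx

def greedy_mis(g, rng):
    """Placeholder local search: randomized low-degree greedy independent set."""
    g, sol = g.copy(), set()
    while g.number_of_nodes():
        pool = sorted(g.nodes, key=g.degree)[: max(1, g.number_of_nodes() // 10)]
        v = rng.choice(pool)
        sol.add(v)
        g.remove_nodes_from(list(g.neighbors(v)) + [v])
    return sol

def consensus_round(graph, rng, probes=5):
    """One round of the consensus idea (illustrative sketch, not the paper's code)."""
    probe_sols = [greedy_mis(graph, rng) for _ in range(probes)]
    consensus = set.intersection(*probe_sols)       # vertices in every probe solution
    decided = set(consensus)
    for v in consensus:
        decided.update(graph.neighbors(v))          # their neighbours cannot be chosen
    working = graph.subgraph(set(graph.nodes) - decided).copy()
    refined = greedy_mis(working, rng)              # search only the non-consensus region
    return consensus | refined                      # still an independent set of `graph`

# rng = random.Random(0); best = consensus_round(nx.erdos_renyi_graph(200, 0.05), rng)
```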
[310] A Survey on Large Language Model-Based Game Agents
Sihao Hu, Tiansheng Huang, Gaowen Liu, Ramana Rao Kompella, Fatih Ilhan, Selim Furkan Tekin, Yichang Xu, Zachary Yahn, Ling Liu
Main category: cs.AI
TL;DR: This survey provides a comprehensive review of LLM-based game agents (LLMGAs) through a unified reference architecture, covering single-agent components (memory, reasoning, perception-action) and multi-agent coordination, with a taxonomy linking game genres to agent requirements.
Details
Motivation: Game environments offer rich, controllable settings that simulate real-world complexity, making them valuable testbeds for exploring Artificial General Intelligence capabilities. The emergence of LLMs provides new opportunities to endow game agents with generalizable reasoning, memory, and adaptability.Method: The survey uses a unified reference architecture to synthesize existing studies. At single-agent level: memory, reasoning, and perception-action interfaces. At multi-agent level: communication protocols and organizational models. A challenge-centered taxonomy links six major game genres to their dominant agent requirements.
Result: The survey offers an up-to-date review of LLM-based game agents, providing a structured framework for understanding how language enables agents to perceive, think, and act in complex game environments, with applications ranging from low-latency control to open-ended goal formation.
Conclusion: LLM-based game agents represent a promising direction for advancing Artificial General Intelligence by leveraging game environments as rich testbeds, with the unified architecture providing a foundation for future research and development in this emerging field.
Abstract: Game environments provide rich, controllable settings that simulate many aspects of real-world complexity. As such, game agents offer a valuable testbed for exploring capabilities relevant to Artificial General Intelligence. Recently, the emergence of Large Language Models (LLMs) provides new opportunities to endow these agents with generalizable reasoning, memory, and adaptability in complex game environments. This survey offers an up-to-date review of LLM-based game agents (LLMGAs) through a unified reference architecture. At the single-agent level, we synthesize existing studies around three core components: memory, reasoning, and perception-action interfaces, which jointly characterize how language enables agents to perceive, think, and act. At the multi-agent level, we outline how communication protocols and organizational models support coordination, role differentiation, and large-scale social behaviors. To contextualize these designs, we introduce a challenge-centered taxonomy linking six major game genres to their dominant agent requirements, from low-latency control in action games to open-ended goal formation in sandbox worlds. A curated list of related papers is available at https://github.com/git-disl/awesome-LLM-game-agent-papers
[311] 3D-Prover: Diversity Driven Theorem Proving With Determinantal Point Processes
Sean Lamont, Christian Walder, Amir Dezfouli, Paul Montague, Michael Norrish
Main category: cs.AI
TL;DR: 3D-Prover is a method that uses synthetic data and Determinantal Point Processes to prune the search space in automated theorem proving by selecting semantically diverse and high-quality tactics, improving proof rates and efficiency.
Details
Motivation: The intractable search space in automated formal reasoning grows exponentially with proof depth, and many candidate proof tactics are semantically similar or cause execution errors, wasting computational resources.Method: Generate semantically aware tactic representations capturing effects on proving environment, success likelihood, and execution time. Use Determinantal Point Processes to filter tactics for semantic diversity and quality.
Result: When augmenting popular open-source proving LLMs on miniF2F and LeanDojo benchmarks, 3D-Prover increases overall proof rate and significantly improves tactic success rate, execution time, and diversity.
Conclusion: 3D-Prover provides an effective general approach for pruning search spaces in automated theorem proving that can augment any underlying tactic generator, leading to more efficient and successful proof attempts.
Abstract: A key challenge in automated formal reasoning is the intractable search space, which grows exponentially with the depth of the proof. This branching is caused by the large number of candidate proof tactics which can be applied to a given goal. Nonetheless, many of these tactics are semantically similar or lead to an execution error, wasting valuable resources in both cases. We address the problem of effectively pruning this search, using only synthetic data generated from previous proof attempts. We first demonstrate that it is possible to generate semantically aware tactic representations which capture the effect on the proving environment, likelihood of success, and execution time. We then propose a novel filtering mechanism which leverages these representations to select semantically diverse and high quality tactics, using Determinantal Point Processes. Our approach, 3D-Prover, is designed to be general, and to augment any underlying tactic generator. We demonstrate the effectiveness of 3D-Prover on the miniF2F and LeanDojo benchmarks by augmenting popular open source proving LLMs. We show that our approach leads to an increase in the overall proof rate, as well as a significant improvement in the tactic success rate, execution time and diversity. We make our code available at https://github.com/sean-lamont/3D-Prover.
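A minimal sketch of quality-weighted DPP selection over candidate tactics, using a greedy log-determinant (MAP) approximation; the kernel construction and scoring inputs are assumptions for illustration, not 3D-Prover's learned representations.

```python
import numpy as np

def dpp_select(embeddings, quality, k):
    """Greedy MAP selection from an L-ensemble DPP (illustrative sketch).

    embeddings : (n, d) tactic representations (e.g. predicted effect on the proof state)
    quality    : (n,) per-tactic quality scores (e.g. predicted success likelihood)
    k          : number of tactics to keep for expansion
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    S = X @ X.T                                   # similarity kernel
    L = quality[:, None] * S * quality[None, :]   # quality-weighted DPP kernel
    selected = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for j in range(len(quality)):
            if j in selected:
                continue
            idx = selected + [j]
            sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
            gain = logdet if sign > 0 else -np.inf
            if gain > best_gain:
                best, best_gain = j, gain
        if best is None:
            break
        selected.append(best)
    return selected
```

The determinant rewards both high per-tactic quality (the diagonal) and mutual dissimilarity (the off-diagonal terms), so near-duplicate tactics are filtered before expensive proof-environment execution.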
[312] TableTime: Reformulating Time Series Classification as Training-Free Table Understanding with Large Language Models
Jiahao Wang, Mingyue Cheng, Qingyang Mao, Yitong Zhou, Daoyu Wang, Qi Liu, Feiyang Xu, Xin Li
Main category: cs.AI
TL;DR: TableTime reformulates multivariate time series classification as a table understanding task, converting time series to tabular format and using LLMs for zero-shot classification through text representation and enhanced reasoning.
Details
Motivation: Existing LLM-based methods for MTSC have three bottlenecks: difficulty encoding temporal/channel information losslessly, challenging alignment with LLM semantic space, and requiring task-specific retraining which is expensive.Method: Convert multivariate time series to tabular form to minimize information loss, represent tabular data in text format for natural LLM alignment, and use a reasoning framework with contextual text, neighborhood assistance, multi-path inference and problem decomposition.
Result: Extensive experiments on 10 UEA archive datasets verify TableTime’s superior performance in zero-shot multivariate time series classification.
Conclusion: TableTime effectively bridges the gaps in existing LLM-based MTSC methods by reformulating the problem as table understanding, enabling lossless information encoding and zero-shot classification without retraining.
Abstract: Large language models (LLMs) have demonstrated their effectiveness in multivariate time series classification (MTSC). Effective adaptation of LLMs for MTSC necessitates informative data representations. Existing LLM-based methods directly encode embeddings for time series within the latent space of LLMs from scratch to align with the semantic space of LLMs. Despite their effectiveness, we reveal that these methods conceal three inherent bottlenecks: (1) they struggle to encode temporal and channel-specific information in a lossless manner, both of which are critical components of multivariate time series; (2) it is difficult to align the learned representation space with the semantic space of the LLMs; (3) they require task-specific retraining, which is both computationally expensive and labor-intensive. To bridge these gaps, we propose TableTime, which reformulates MTSC as a table understanding task. Specifically, TableTime introduces the following strategies: (1) convert multivariate time series into a tabular form, thus minimizing information loss to the greatest extent; (2) represent tabular time series in text format to achieve natural alignment with the semantic space of LLMs; (3) design a reasoning framework that integrates contextual text information, neighborhood assistance, multi-path inference and problem decomposition to enhance the reasoning ability of LLMs and realize zero-shot classification. Extensive experiments performed on 10 publicly representative datasets from the UEA archive verify the superiority of TableTime.
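The core reformulation, rendering a multivariate series as a small text table that can be pasted into a zero-shot prompt, can be sketched in a few lines; the exact table layout used by TableTime is not specified here, so the formatting below is an assumption.

```python
import numpy as np

def series_to_table_text(series, channel_names, decimals=3):
    """Render a multivariate time series as a plain-text table for an LLM prompt
    (illustrative of the 'time series as table' idea; the layout is an assumption).

    series        : (T, C) array, one row per time step
    channel_names : list of C channel names
    """
    header = "time_step | " + " | ".join(channel_names)
    rows = [header, "-" * len(header)]
    for t, row in enumerate(series):
        rows.append(f"{t} | " + " | ".join(f"{v:.{decimals}f}" for v in row))
    return "\n".join(rows)

series = np.array([[0.12, 1.05], [0.18, 0.98], [0.25, 1.11]])
prompt_table = series_to_table_text(series, ["accel_x", "accel_y"])
print(prompt_table)  # inserted into a zero-shot classification prompt with task context
```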
[313] Multimodal Dreaming: A Global Workspace Approach to World Model-Based Reinforcement Learning
Léopold Maytié, Roland Bertin Johannet, Rufin VanRullen
Main category: cs.AI
TL;DR: GW-Dreamer combines Global Workspace theory with world models in RL, enabling more efficient training with fewer environment steps and emergent robustness to missing observation modalities.
Details
Motivation: Humans use rich internal models for reasoning and adaptation, while typical RL world models operate directly on environment variables which can be slow. High-level latent dimensions could improve efficiency.Method: Combines Global Workspace (GW) theory with world models in RL, performing the dreaming process (mental simulation) inside the GW latent space rather than directly on environment variables.
Result: GW-Dreamer trains with fewer environment steps than PPO and Dreamer baselines, and shows emergent robustness to missing observation modalities (images or simulation attributes) that baseline models lack.
Conclusion: The combination of GW with World Models has great potential for improving decision-making in RL agents, offering more efficient training and robustness properties.
Abstract: Humans leverage rich internal models of the world to reason about the future, imagine counterfactuals, and adapt flexibly to new situations. In Reinforcement Learning (RL), world models aim to capture how the environment evolves in response to the agent’s actions, facilitating planning and generalization. However, typical world models directly operate on the environment variables (e.g. pixels, physical attributes), which can make their training slow and cumbersome; instead, it may be advantageous to rely on high-level latent dimensions that capture relevant multimodal variables. Global Workspace (GW) Theory offers a cognitive framework for multimodal integration and information broadcasting in the brain, and recent studies have begun to introduce efficient deep learning implementations of GW. Here, we evaluate the capabilities of an RL system combining GW with a world model. We compare our GW-Dreamer with various versions of the standard PPO and the original Dreamer algorithms. We show that performing the dreaming process (i.e., mental simulation) inside the GW latent space allows for training with fewer environment steps. As an additional emergent property, the resulting model (but not its comparison baselines) displays strong robustness to the absence of one of its observation modalities (images or simulation attributes). We conclude that the combination of GW with World Models holds great potential for improving decision-making in RL agents.
[314] Partner Modelling Emerges in Recurrent Agents (But Only When It Matters)
Ruaridh Mon-Williams, Max Taylor-Davies, Elizabeth Mieczkowski, Natalia Velez, Neil R. Bramley, Yanwei Wang, Thomas L. Griffiths, Christopher G. Lucas
Main category: cs.AI
TL;DR: Model-free RNN agents develop structured internal representations of partners’ abilities through cooperative interaction in Overcooked-AI, enabling rapid adaptation to novel collaborators without explicit architectural mechanisms.
Details
Motivation: To understand whether flexible collaboration requires dedicated mechanisms for modeling others or can emerge spontaneously from cooperative interaction pressures.Method: Train simple model-free RNN agents to collaborate with diverse partners in Overcooked-AI environment, analyze internal hidden states using probing techniques and behavioral analysis.
Result: Agents develop structured internal representations of partners’ task abilities, enabling rapid adaptation and generalization to novel collaborators, particularly when agents can influence partner behavior through task allocation.
Conclusion: Partner modeling can arise spontaneously in model-free agents under environmental conditions that impose appropriate social pressure, without requiring additional architectural features or inductive biases.
Abstract: Humans are remarkably adept at collaboration, able to infer the strengths and weaknesses of new partners in order to work successfully towards shared goals. To build AI systems with this capability, we must first understand its building blocks: does such flexibility require explicit, dedicated mechanisms for modelling others – or can it emerge spontaneously from the pressures of open-ended cooperative interaction? To investigate this question, we train simple model-free RNN agents to collaborate with a population of diverse partners. Using the 'Overcooked-AI' environment, we collect data from thousands of collaborative teams, and analyse agents' internal hidden states. Despite a lack of additional architectural features, inductive biases, or auxiliary objectives, the agents nevertheless develop structured internal representations of their partners' task abilities, enabling rapid adaptation and generalisation to novel collaborators. We investigated these internal models through probing techniques, and large-scale behavioural analysis. Notably, we find that structured partner modelling emerges when agents can influence partner behaviour by controlling task allocation. Our results show that partner modelling can arise spontaneously in model-free agents – but only under environmental conditions that impose the right kind of social pressure.
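The probing analysis can be approximated by fitting a linear read-out from collected hidden states to held-out partner attributes; a sketch assuming the states and partner-ability labels have already been logged (scikit-learn probe; the names are illustrative, not the authors' pipeline).

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def probe_partner_ability(hidden_states, partner_ability):
    """Linear probe: can the partner's ability be decoded from the agent's RNN state?

    hidden_states   : (n_episodes, hidden_dim) hidden states, e.g. sampled mid-episode
    partner_ability : (n_episodes,) scalar attribute of the partner (e.g. task speed)
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        hidden_states, partner_ability, test_size=0.2, random_state=0)
    probe = Ridge(alpha=1.0).fit(X_tr, y_tr)
    return probe.score(X_te, y_te)   # held-out R^2 well above 0 suggests a partner model
```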
[315] VIRAL: Vision-grounded Integration for Reward design And Learning
Valentin Cuzin-Rambaud, Emilien Komlenovic, Alexandre Faure, Bruno Yun
Main category: cs.AI
TL;DR: VIRAL is a pipeline that uses multi-modal LLMs to autonomously generate and refine reward functions for reinforcement learning, improving alignment with user intent and accelerating behavior learning.
Details
Motivation: Address the critical challenge of human-machine alignment in AI, particularly the risks of poorly designed reward functions in reinforcement learning, where LLMs have shown potential to outperform humans in reward generation.Method: VIRAL pipeline uses multi-modal LLMs to create and iteratively improve reward functions based on environment and goal prompts/images, incorporating human feedback or video LLM-generated descriptions of agent policies.
Result: Evaluation in five Gymnasium environments showed VIRAL accelerates learning of new behaviors while ensuring improved alignment with user intent.
Conclusion: VIRAL successfully demonstrates that LLM-based reward generation and refinement can enhance reinforcement learning performance and better align with human objectives.
Abstract: The alignment between humans and machines is a critical challenge in artificial intelligence today. Reinforcement learning, which aims to maximize a reward function, is particularly vulnerable to the risks associated with poorly designed reward functions. Recent advancements have shown that Large Language Models (LLMs) used for reward generation can outperform human performance in this context. We introduce VIRAL, a pipeline for generating and refining reward functions through the use of multi-modal LLMs. VIRAL autonomously creates and interactively improves reward functions based on a given environment and a goal prompt or annotated image. The refinement process can incorporate human feedback or be guided by a description generated by a video LLM, which explains the agent's policy in video form. We evaluated VIRAL in five Gymnasium environments, demonstrating that it accelerates the learning of new behaviors while ensuring improved alignment with user intent. The source-code and demo video are available at: https://github.com/VIRAL-UCBL1/VIRAL and https://youtu.be/Hqo82CxVT38.
[316] The Confidence Paradox: Can LLM Know When It’s Wrong
Sahil Tripathi, Md Tabrez Nafis, Imran Hussain, Jiechao Gao
Main category: cs.AI
TL;DR: HonestVQA is a model-agnostic framework that improves ethical alignment in DocVQA by reducing overconfidence and misaligned responses through weighted loss and contrastive learning.
Details
Motivation: Existing DocVQA models like LayoutLMv3, UDOP, and DONUT focus on accuracy but produce overconfident or ethically misaligned responses, especially under uncertainty, lacking ethical calibration.Method: Proposed HonestVQA framework uses weighted loss and contrastive learning to align model confidence with correctness in a self-supervised, model-agnostic approach.
Result: HonestVQA improves accuracy and F1 by up to 4.3% across SpDocVQA, InfographicsVQA, and SROIE datasets, while reducing overconfidence. Achieves 78.9% accuracy and 76.1% F1-score with good cross-domain generalization.
Conclusion: HonestVQA effectively addresses ethical alignment in DocVQA models by improving both accuracy and confidence calibration through novel metrics (H-Score and ECI) and training techniques.
Abstract: Document Visual Question Answering (DocVQA) models often produce overconfident or ethically misaligned responses, especially under uncertainty. Existing models like LayoutLMv3, UDOP, and DONUT focus on accuracy but lack ethical calibration. We propose HonestVQA, a model-agnostic, self-supervised framework that aligns model confidence with correctness using weighted loss and contrastive learning. We introduce two new metrics, the Honesty Score (H-Score) and the Ethical Confidence Index (ECI), to evaluate ethical alignment. HonestVQA improves accuracy and F1 by up to 4.3% across the SpDocVQA, InfographicsVQA, and SROIE datasets, while reducing overconfidence. It also generalizes well across domains, achieving 78.9% accuracy and 76.1% F1-score.
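One way such a confidence-correctness objective could be wired up in PyTorch: up-weight confident mistakes and add a contrastive margin between confidence on correct and incorrect predictions. This is a hedged illustration of the general idea only; the paper's weighted loss, contrastive formulation, H-Score, and ECI are not reproduced here.

```python
import torch
import torch.nn.functional as F

def honesty_loss(logits, labels, alpha=1.0, margin=0.2):
    """Illustrative confidence-alignment objective (not the paper's exact loss).

    logits : (B, num_answers) answer scores
    labels : (B,) gold answer indices
    """
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)                       # per-example confidence
    correct = (pred == labels).float()

    # Weighted cross-entropy: confident mistakes receive extra weight.
    ce = F.cross_entropy(logits, labels, reduction="none")
    weights = 1.0 + alpha * conf.detach() * (1.0 - correct)
    weighted = (weights * ce).mean()

    # Contrastive margin: confidence on correct predictions should exceed
    # confidence on incorrect ones by at least `margin`.
    if correct.sum() > 0 and (1.0 - correct).sum() > 0:
        gap = conf[correct == 0].mean() - conf[correct == 1].mean() + margin
        contrastive = torch.clamp(gap, min=0.0)
    else:
        contrastive = logits.new_zeros(())
    return weighted + contrastive
```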
[317] Memory Mosaics at scale
Jianyu Zhang, Léon Bottou
Main category: cs.AI
TL;DR: Memory Mosaics v2 scaled to 10B parameters and trained on 1 trillion tokens match transformers on training knowledge learning and significantly outperform them on new knowledge storage and in-context learning tasks, even when compared to transformers trained on 8x more data.
Details
Motivation: To verify if the favorable compositional and in-context learning capabilities of Memory Mosaics remain when scaled to large language model sizes (LLaMA-8B scale) and trained on real-world datasets.Method: Scaling Memory Mosaics to 10B parameters, training on 1 trillion tokens, and introducing architectural modifications (Memory Mosaics v2), then evaluating across three dimensions: training-knowledge storage, new-knowledge storage, and in-context learning.
Result: Memory Mosaics v2 match transformers on training knowledge learning and significantly outperform transformers on new knowledge storage and in-context learning tasks. A Memory Mosaics v2 trained on 1 trillion tokens performs better than a transformer trained on 8 trillion tokens.
Conclusion: Memory Mosaics maintain their favorable properties when scaled to large language model sizes and trained on real-world datasets, demonstrating superior performance on new tasks at inference time compared to transformers, with improvements not easily replicated by simply increasing transformer training data.
Abstract: Memory Mosaics [Zhang et al., 2025], networks of associative memories, have demonstrated appealing compositional and in-context learning capabilities on medium-scale networks (GPT-2 scale) and synthetic small datasets. This work shows that these favorable properties remain when we scale memory mosaics to large language model sizes (llama-8B scale) and real-world datasets. To this end, we scale memory mosaics to 10B size, we train them on one trillion tokens, we introduce a couple of architectural modifications ("Memory Mosaics v2"), and we assess their capabilities across three evaluation dimensions: training-knowledge storage, new-knowledge storage, and in-context learning. Throughout the evaluation, memory mosaics v2 match transformers on the learning of training knowledge (first dimension) and significantly outperform transformers on carrying out new tasks at inference time (second and third dimensions). These improvements cannot be easily replicated by simply increasing the training data for transformers. A memory mosaics v2 model trained on one trillion tokens still performs better on these tasks than a transformer trained on eight trillion tokens.
[318] A Neuroscience-Inspired Dual-Process Model of Compositional Generalization
Alex Noviello, Claas Beger, Jacob Groner, Kevin Ellis, Weinan Sun
Main category: cs.AI
TL;DR: Mirage is a neuro-inspired dual-process model combining a fast Transformer (System 1) with a rule-based Schema Engine (System 2) to achieve systematic compositional generalization, achieving >99% accuracy on SCAN benchmark.
Details
Motivation: Deep learning models struggle with systematic compositional generalization, which is a hallmark of human cognition. The paper aims to address this limitation by drawing inspiration from brain architecture.Method: Proposes Mirage - a dual-process model with System 1 (meta-trained Transformer) for fast intuitive processing and System 2 (Schema Engine) for deliberate rule-based processing, trained on random grammars with single-step decomposition.
Result: Achieves >99% accuracy on all splits of the SCAN benchmark in a task-agnostic setting. Ablations confirm systematic behavior emerges from architectural interplay between the two systems.
Conclusion: Provides a concrete computational model showing how compositional reasoning can arise from modular cognitive architecture, combining iterative neural updates with interpretable schema modules.
Abstract: Deep learning models struggle with systematic compositional generalization, a hallmark of human cognition. We propose Mirage, a neuro-inspired dual-process model that offers a processing account for this ability. It combines a fast, intuitive "System 1" (a meta-trained Transformer) with a deliberate, rule-based "System 2" (a Schema Engine), mirroring the brain's neocortical and hippocampal–prefrontal circuits. Trained to perform general, single-step decomposition on a stream of random grammars, Mirage achieves >99% accuracy on all splits of the SCAN benchmark in a task-agnostic setting. Ablations confirm that the model's systematic behavior emerges from the architectural interplay of its two systems, particularly its use of explicit, prioritized schemas and iterative refinement. In line with recent progress on recursive/recurrent Transformer approaches, Mirage preserves an iterative neural update while externalizing declarative control into an interpretable schema module. Our work provides a concrete computational model for interpreting how compositional reasoning can arise from a modular cognitive architecture.
[319] Freeze and Conquer: Reusable Ansatz for Solving the Traveling Salesman Problem
Fabrizio Fagiolo, Nicolò Vescera
Main category: cs.AI
TL;DR: A variational quantum algorithm for TSP using a compact permutation encoding and an optimize-freeze-reuse strategy that reduces qubit requirements and eliminates costly structural search at test time.
Details
Motivation: To develop a quantum algorithm for TSP that is immediately implementable on NISQ hardware by reducing qubit requirements and eliminating expensive structural optimization during testing.Method: Combines compact permutation encoding with optimize-freeze-reuse strategy: circuit topology is optimized on training instances using Simulated Annealing, then frozen and reused on new instances with only rapid parameter re-optimization.
Result: Achieved 100% optimal trip sampling for 4 cities, 90% for 5 cities, 80% for 6 cities, but dropped to ~20% for 7 cities, showing scalability limitations for larger problems.
Conclusion: The method shows robust generalization for moderate problem sizes and dramatically reduces time-to-solution without degrading quality, though scalability limitations emerge beyond 6 cities.
Abstract: In this paper we present a variational algorithm for the Traveling Salesman Problem (TSP) that combines (i) a compact encoding of permutations, which also reduces the qubit requirement, and (ii) an optimize-freeze-reuse strategy, where the circuit topology ("Ansatz") is first optimized on a training instance by Simulated Annealing (SA), then "frozen" and re-used on novel instances, limited to a rapid re-optimization of only the circuit parameters. This pipeline eliminates costly structural search during testing, making the procedure immediately implementable on NISQ hardware. On a set of 40 randomly generated symmetric instances spanning 4-7 cities, the resulting Ansatz achieves an average optimal-tour sampling probability of 100% for 4-city cases, 90% for 5-city cases, and 80% for 6-city cases. With 7 cities the success rate drops markedly to an average of ~20%, revealing the onset of scalability limitations of the proposed method. The results show robust generalization ability for moderate problem sizes and indicate how freezing the Ansatz can dramatically reduce time-to-solution without degrading solution quality. The paper also discusses scalability limitations, the impact of "warm-start" parameter initialization, and prospects for extension to more complex problems, such as Vehicle Routing and Job-Shop Scheduling.
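Setting the quantum-circuit details aside, the optimize-freeze-reuse control flow amounts to a generic simulated-annealing search over topologies followed by parameter-only re-optimization; the sketch below shows that flow with placeholder evaluation functions (tsp_cost, perturb_topology, and optimize_parameters are hypothetical names, not from the paper).

```python
import math
import random

def simulated_annealing(evaluate, initial, neighbor, steps=500, t0=1.0, cooling=0.995):
    """Generic simulated-annealing loop, used here to search over ansatz topologies."""
    current = best = initial
    cur_cost = best_cost = evaluate(initial)
    temp = t0
    for _ in range(steps):
        cand = neighbor(current)
        cost = evaluate(cand)
        # Accept improvements always; accept worse moves with Boltzmann probability.
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / max(temp, 1e-9)):
            current, cur_cost = cand, cost
            if cost < best_cost:
                best, best_cost = cand, cost
        temp *= cooling
    return best

# Optimize-freeze-reuse control flow (schematic, with hypothetical placeholders):
# topology = simulated_annealing(lambda t: tsp_cost(t, init_params, train_instance),
#                                initial_topology, perturb_topology)
# frozen = topology                                    # reuse the structure as-is
# new_params = optimize_parameters(frozen, test_instance)   # only parameters re-tuned
```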
[320] Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set
Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tianyu Chen, Haoyi Zhou
Main category: cs.AI
TL;DR: The paper proposes ILA, a data selection framework that optimizes for semantic coverage and information depth to create compact training subsets that achieve faster performance gains than existing methods.
Details
Motivation: Scaling supervised fine-tuning data doesn't guarantee proportional performance gains, highlighting the need to understand what makes training samples effective.Method: Proposes Information Landscape Approximation (ILA) - a model-agnostic data selection framework that jointly optimizes for semantic coverage (breadth of task domains) and information depth (richness of individual examples).
Result: Models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, showing accelerated scaling.
Conclusion: Simple proxies for semantic coverage and information depth explain most validation loss variance, and ILA effectively constructs compact subsets that approximate the informational value of large datasets.
Abstract: Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: \textbf{semantic coverage}, or the breadth of task domains, and \textbf{information depth}, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. In this work, we further propose the \textbf{Information Landscape Approximation (ILA)}, a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term \textbf{accelerated scaling}.
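The exact coverage and depth proxies are not given here; below is a minimal sketch of one plausible instantiation, greedily trading cluster coverage of instruction embeddings against a per-example depth score under a fixed budget. The clustering shortcut, scoring weights, and names are illustrative assumptions rather than ILA itself.

```python
import numpy as np

def select_subset(embeddings, depth_scores, budget, n_clusters=50, lam=0.5, seed=0):
    """Greedy selection trading coverage of semantic clusters against per-example depth.

    embeddings   : (n, d) instruction embeddings
    depth_scores : (n,) richness proxy per example (e.g. response length or loss)
    budget       : number of examples to keep
    """
    rng = np.random.default_rng(seed)
    # Cheap clustering stand-in: assign each point to its nearest random centroid.
    centroids = embeddings[rng.choice(len(embeddings), n_clusters, replace=False)]
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    cluster = dists.argmin(axis=1)

    depth = (depth_scores - depth_scores.min()) / (np.ptp(depth_scores) + 1e-12)
    covered, chosen = set(), []
    for _ in range(budget):
        best, best_score = None, -np.inf
        for i in range(len(embeddings)):
            if i in chosen:
                continue
            gain = 1.0 if cluster[i] not in covered else 0.0   # coverage gain
            score = lam * gain + (1 - lam) * depth[i]          # depth contribution
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
        covered.add(cluster[best])
    return chosen
```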
[321] Is It Certainly a Deepfake? Reliability Analysis in Detection & Generation Ecosystem
Neslihan Kose, Anthony Rhodes, Umur Aybars Ciftci, Ilke Demir
Main category: cs.AI
TL;DR: This paper presents the first comprehensive uncertainty analysis of deepfake detectors, showing that uncertainty patterns can be leveraged for deepfake source detection and provide insights for reliable detection systems.
Details
Motivation: As generative models create more synthetic content causing online mistrust, deepfake detectors are needed but their misuse (claiming fake as real or vice versa) fuels misinformation. Uncertainty analysis is crucial for trustworthy detection.Method: Leverages Bayesian Neural Networks and Monte Carlo dropout to quantify aleatoric and epistemic uncertainties across diverse detector architectures. Evaluates uncertainty on two datasets with nine generators, using four blind and two biological detectors.
Result: Uncertainty manifold holds consistent information for deepfake source detection. Uncertainty maps localize prediction confidence at pixel level, revealing patterns correlated with generator-specific artifacts. The approach shows generalization capability, model calibration, and robustness against adversarial attacks.
Conclusion: Uncertainty quantification is established as a fundamental requirement for trustworthy synthetic media detection, providing critical insights for deploying reliable deepfake detection systems.
Abstract: As generative models are advancing in quality and quantity for creating synthetic content, deepfakes begin to cause online mistrust. Deepfake detectors are proposed to counter this effect, however, misuse of detectors claiming fake content as real or vice versa further fuels this misinformation problem. We present the first comprehensive uncertainty analysis of deepfake detectors, systematically investigating how generative artifacts influence prediction confidence. As reflected in detectors’ responses, deepfake generators also contribute to this uncertainty as their generative residues vary, so we cross the uncertainty analysis of deepfake detectors and generators. Based on our observations, the uncertainty manifold holds enough consistent information to leverage uncertainty for deepfake source detection. Our approach leverages Bayesian Neural Networks and Monte Carlo dropout to quantify both aleatoric and epistemic uncertainties across diverse detector architectures. We evaluate uncertainty on two datasets with nine generators, with four blind and two biological detectors, compare different uncertainty methods, explore region- and pixel-based uncertainty, and conduct ablation studies. We conduct and analyze binary real/fake, multi-class real/fake, source detection, and leave-one-out experiments between the generator/detector combinations to share their generalization capability, model calibration, uncertainty, and robustness against adversarial attacks. We further introduce uncertainty maps that localize prediction confidence at the pixel level, revealing distinct patterns correlated with generator-specific artifacts. Our analysis provides critical insights for deploying reliable deepfake detection systems and establishes uncertainty quantification as a fundamental requirement for trustworthy synthetic media detection.
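A short PyTorch sketch of the Monte Carlo dropout part: keep dropout stochastic at inference, average several forward passes, and split predictive entropy into aleatoric and epistemic components. The detector model and the exact decomposition used in the paper are assumptions here.

```python
import torch

def mc_dropout_uncertainty(model, x, n_samples=30):
    """MC-dropout uncertainty for a classifier returning logits of shape (B, C).

    Returns the mean prediction plus aleatoric / epistemic estimates via the
    standard entropy decomposition; an illustrative sketch, not the paper's code.
    """
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()                  # dropout stays stochastic, batchnorm stays in eval
    probs = []
    with torch.no_grad():
        for _ in range(n_samples):
            probs.append(model(x).softmax(dim=-1))
    p = torch.stack(probs)             # (n_samples, B, C)

    mean_p = p.mean(dim=0)
    eps = 1e-12
    total = -(mean_p * (mean_p + eps).log()).sum(dim=-1)          # predictive entropy
    aleatoric = -(p * (p + eps).log()).sum(dim=-1).mean(dim=0)    # expected entropy
    epistemic = total - aleatoric                                 # mutual information
    return mean_p, aleatoric, epistemic
```

Per-pixel maps of the same quantities, computed from a segmentation-style detector head, would give the uncertainty maps described above.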
[322] MathBode: Understanding LLM Reasoning with Dynamical Systems
Charles L. Wang
Main category: cs.AI
TL;DR: MathBode is a dynamic diagnostic tool that analyzes mathematical reasoning in LLMs using frequency-domain analysis of parametric problems, revealing systematic behaviors like low-pass filtering and phase lag that standard accuracy metrics miss.
Details
Motivation: Standard one-shot accuracy metrics fail to capture the dynamic reasoning capabilities and systematic behaviors of LLMs in mathematical problem-solving. There's a need for more interpretable diagnostics that can reveal how models track parametric changes over time.Method: Treats parametric problems as systems, driving a single parameter sinusoidally and fitting first-harmonic responses of model outputs and exact solutions. This yields frequency-resolved metrics (gain and phase) that form Bode-style fingerprints across five mathematical problem families.
Result: The diagnostic reveals systematic low-pass behavior and growing phase lag across models, with results separating frontier from mid-tier models on dynamics. A symbolic baseline shows ideal performance (G≈1, φ≈0) for calibration.
Conclusion: MathBode provides a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency, offering interpretable insights into LLM mathematical reasoning capabilities.
Abstract: This paper presents MathBode, a dynamic diagnostic for mathematical reasoning in large language models (LLMs). Instead of one-shot accuracy, MathBode treats each parametric problem as a system: we drive a single parameter sinusoidally and fit first-harmonic responses of model outputs and exact solutions. This yields interpretable, frequency-resolved metrics – gain (amplitude tracking) and phase (lag) – that form Bode-style fingerprints. Across five closed-form families (linear solve, ratio/saturation, compound interest, 2x2 linear systems, similar triangles), the diagnostic surfaces systematic low-pass behavior and growing phase lag that accuracy alone obscures. We compare several models against a symbolic baseline that calibrates the instrument ($G \approx 1$, $\phi \approx 0$). Results separate frontier from mid-tier models on dynamics, providing a compact, reproducible protocol that complements standard benchmarks with actionable measurements of reasoning fidelity and consistency. We open-source the dataset and code to enable further research and adoption.
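The gain/phase measurement reduces to a least-squares first-harmonic fit of the model's answers and the exact solutions at each driving frequency; a small numpy sketch (variable names are illustrative):

```python
import numpy as np

def first_harmonic(y, t, omega):
    """Least-squares fit of y(t) ~ a*sin(wt) + b*cos(wt) + c; returns (amplitude, phase)."""
    A = np.column_stack([np.sin(omega * t), np.cos(omega * t), np.ones_like(t)])
    (a, b, _), *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.hypot(a, b), np.arctan2(b, a)

def bode_point(model_outputs, exact_solutions, t, omega):
    """Gain and phase lag of the model relative to the exact solution at one frequency."""
    amp_m, ph_m = first_harmonic(np.asarray(model_outputs, float), t, omega)
    amp_e, ph_e = first_harmonic(np.asarray(exact_solutions, float), t, omega)
    gain = amp_m / amp_e
    phase_lag = np.angle(np.exp(1j * (ph_m - ph_e)))   # wrap the difference to (-pi, pi]
    return gain, phase_lag

# Drive the problem parameter p(t) = p0 + A*sin(omega*t), query the model at each t,
# and compare its answers against the closed-form solution for the same p(t).
```

Sweeping omega and plotting gain and phase lag against frequency yields the Bode-style fingerprint; an ideal solver sits at gain 1 and phase 0 at every frequency.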
[323] Evaluating the Use of Large Language Models as Synthetic Social Agents in Social Science Research
Emma Rose Madden
Main category: cs.AI
TL;DR: LLMs should be used as pattern matchers for quasi-predictive interpolation under explicit scope conditions, not as substitutes for probabilistic inference in social sciences.
Details
Motivation: To address concerns about interpreting LLM outputs in social science applications and provide practical guidance for their appropriate use.Method: Proposes using LLMs as high-capacity pattern matchers with explicit scope conditions, and introduces practical guardrails including independent draws, preregistered human baselines, reliability-aware validation, and subgroup calibration.
Result: A pragmatic reframing of LLM usage in social sciences that enables useful prototyping and forecasting while avoiding category errors.
Conclusion: LLMs can be valuable tools in social science research when used appropriately as pattern matchers with proper guardrails, rather than as direct substitutes for human inference.
Abstract: Large Language Models (LLMs) are being increasingly used as synthetic agents in social science, in applications ranging from augmenting survey responses to powering multi-agent simulations. This paper outlines cautions that should be taken when interpreting LLM outputs and proposes a pragmatic reframing for the social sciences in which LLMs are used as high-capacity pattern matchers for quasi-predictive interpolation under explicit scope conditions and not as substitutes for probabilistic inference. Practical guardrails such as independent draws, preregistered human baselines, reliability-aware validation, and subgroup calibration, are introduced so that researchers may engage in useful prototyping and forecasting while avoiding category errors.
[324] Co-TAP: Three-Layer Agent Interaction Protocol Technical Report
Shunyu An, Miao Wang, Yongchao Li, Dong Wan, Lina Wang, Ling Qin, Liqin Gao, Congyao Fan, Zhiyong Mao, Jiange Pu, Wenji Xia, Dong Zhao, Zhaohui Hao, Rui Hu, Ji Lu, Guiyue Zhou, Baoyu Tang, Yanqin Gao, Yongsheng Du, Daigang Xu, Lingjun Huang, Baoli Wang, Xiwen Zhang, Luyao Wang, Shilong Liu
Main category: cs.AI
TL;DR: Co-TAP is a three-layer agent interaction protocol addressing interoperability, interaction/collaboration, and knowledge sharing in multi-agent systems through HAI, UAP, and MEK protocols.
Details
Motivation: To address challenges in multi-agent systems across three core dimensions: Interoperability, Interaction and Collaboration, and Knowledge Sharing.Method: Three-layer protocol design: HAI for human-agent interaction standardization, UAP for unified service discovery and protocol conversion, and MEK for standardized memory-extraction-knowledge cognitive chain.
Result: A comprehensive protocol framework enabling real-time performance, seamless interconnection, and collective intelligence capabilities in multi-agent systems.
Conclusion: Co-TAP provides solid engineering foundation and theoretical guidance for building next-generation efficient, scalable, and intelligent multi-agent applications.
Abstract: This paper proposes Co-TAP (T: Triple, A: Agent, P: Protocol), a three-layer agent interaction protocol designed to address the challenges faced by multi-agent systems across the three core dimensions of Interoperability, Interaction and Collaboration, and Knowledge Sharing. We have designed and proposed a layered solution composed of three core protocols: the Human-Agent Interaction Protocol (HAI), the Unified Agent Protocol (UAP), and the Memory-Extraction-Knowledge Protocol (MEK). HAI focuses on the interaction layer, standardizing the flow of information between users, interfaces, and agents by defining a standardized, event-driven communication paradigm. This ensures the real-time performance, reliability, and synergy of interactions. As the core of the infrastructure layer, UAP is designed to break down communication barriers among heterogeneous agents through unified service discovery and protocol conversion mechanisms, thereby enabling seamless interconnection and interoperability of the underlying network. MEK, in turn, operates at the cognitive layer. By establishing a standardized "Memory (M) - Extraction (E) - Knowledge (K)" cognitive chain, it empowers agents with the ability to learn from individual experiences and form shareable knowledge, thereby laying the foundation for the realization of true collective intelligence. We believe this protocol framework will provide a solid engineering foundation and theoretical guidance for building the next generation of efficient, scalable, and intelligent multi-agent applications.
[325] A Comprehensive Survey on Reinforcement Learning-based Agentic Search: Foundations, Roles, Optimizations, Evaluations, and Applications
Minhua Lin, Zongyu Wu, Zhichao Xu, Hui Liu, Xianfeng Tang, Qi He, Charu Aggarwal, Hui Liu, Xiang Zhang, Suhang Wang
Main category: cs.AI
TL;DR: This survey provides the first comprehensive overview of RL-based agentic search, organizing the field along three dimensions: functional roles of RL, optimization strategies, and scope of optimization.
Details
Motivation: LLMs have limitations including static knowledge, factual hallucinations, and inability to retrieve real-time information. Traditional RAG pipelines are single-turn and heuristic, lacking adaptive control over retrieval and reasoning.Method: The survey organizes RL-based agentic search along three dimensions: what RL is for (functional roles), how RL is used (optimization strategies), and where RL is applied (scope of optimization). It summarizes methods, evaluation protocols, and applications.
Result: The survey provides a comprehensive framework for understanding RL-based agentic search systems and their potential to overcome limitations of traditional approaches.
Conclusion: RL offers a powerful mechanism for adaptive and self-improving search behavior. The survey aims to inspire future research on integrating RL and agentic search to build reliable and scalable systems.
Abstract: The advent of large language models (LLMs) has transformed information access and reasoning through open-ended natural language interaction. However, LLMs remain limited by static knowledge, factual hallucinations, and the inability to retrieve real-time or domain-specific information. Retrieval-Augmented Generation (RAG) mitigates these issues by grounding model outputs in external evidence, but traditional RAG pipelines are often single-turn and heuristic, lacking adaptive control over retrieval and reasoning. Recent advances in agentic search address these limitations by enabling LLMs to plan, retrieve, and reflect through multi-step interaction with search environments. Within this paradigm, reinforcement learning (RL) offers a powerful mechanism for adaptive and self-improving search behavior. This survey provides the first comprehensive overview of \emph{RL-based agentic search}, organizing the emerging field along three complementary dimensions: (i) What RL is for (functional roles), (ii) How RL is used (optimization strategies), and (iii) Where RL is applied (scope of optimization). We summarize representative methods, evaluation protocols, and applications, and discuss open challenges and future directions toward building reliable and scalable RL-driven agentic search systems. We hope this survey will inspire future research on the integration of RL and agentic search. Our repository is available at https://github.com/ventr1c/Awesome-RL-based-Agentic-Search-Papers.
[326] PanicToCalm: A Proactive Counseling Agent for Panic Attacks
Jihyun Lee, Yejin Min, San Kim, Yejin Jeon, SungJun Yang, Hyounghun Kim, Gary Geunbae Lee
Main category: cs.AI
TL;DR: PACE dataset for panic attacks using first-person narratives and PFA principles, with PACER model providing empathetic counseling that outperforms baselines in panic scenarios.
Details
Motivation: Address scarcity of suitable datasets for training panic attack intervention models due to ethical and logistical issues.Method: Introduce PACE dataset from first-person narratives structured around Psychological First Aid, train PACER model using supervised learning and simulated preference alignment.
Result: PACER outperforms strong baselines in counselor-side metrics and client affect improvement, consistently preferred over general, CBT-based, and GPT-4-powered models.
Conclusion: PACER demonstrates practical value for panic attack intervention through multi-dimensional evaluation framework PanicEval.
Abstract: Panic attacks are acute episodes of fear and distress, in which timely, appropriate intervention can significantly help individuals regain stability. However, suitable datasets for training such models remain scarce due to ethical and logistical issues. To address this, we introduce PACE, a dataset of high-distress episodes constructed from first-person narratives and structured around the principles of Psychological First Aid (PFA). Using this data, we train PACER, a counseling model designed to provide both empathetic and directive support, which is optimized through supervised learning and simulated preference alignment. To assess its effectiveness, we propose PanicEval, a multi-dimensional framework covering general counseling quality and crisis-specific strategies. Experimental results show that PACER outperforms strong baselines in both counselor-side metrics and client affect improvement. Human evaluations further confirm its practical value, with PACER consistently preferred over general, CBT-based, and GPT-4-powered models in panic scenarios (Code is available at https://github.com/JihyunLee1/PanicToCalm ).
[327] Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
Siddharth Mehrotra, Jin Huang, Xuelong Fu, Roel Dobbe, Clara I. Sánchez, Maarten de Rijke
Main category: cs.AI
TL;DR: Scoping review of AIES and FAccT conferences reveals current trustworthy AI research is overly techno-centric, neglecting sociotechnical dimensions and creating critical gaps in understanding AI trustworthiness in real-world contexts.
Details
Motivation: Current trustworthy AI research primarily focuses on technical attributes like reliability and fairness while overlooking sociotechnical dimensions critical for understanding AI trustworthiness in real-world applications.Method: Conducted scoping review of AIES and FAccT conference proceedings, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains.
Result: Significant progress in defining technical attributes but critical gaps exist - research emphasizes technical precision over social/ethical considerations, sociotechnical nature remains unexplored, and trustworthiness emerges as contested concept shaped by power dynamics.
Conclusion: Interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential; actionable measures proposed for holistic frameworks addressing AI-society interplay for responsible technological development.
Abstract: Background: Trustworthy AI serves as a foundational pillar for two major AI ethics conferences: AIES and FAccT. However, current research often adopts techno-centric approaches, focusing primarily on technical attributes such as reliability, robustness, and fairness, while overlooking the sociotechnical dimensions critical to understanding AI trustworthiness in real-world contexts. Objectives: This scoping review aims to examine how the AIES and FAccT communities conceptualize, measure, and validate AI trustworthiness, identifying major gaps and opportunities for advancing a holistic understanding of trustworthy AI systems. Methods: We conduct a scoping review of AIES and FAccT conference proceedings to date, systematically analyzing how trustworthiness is defined, operationalized, and applied across different research domains. Our analysis focuses on conceptualization approaches, measurement methods, verification and validation techniques, application areas, and underlying values. Results: While significant progress has been made in defining technical attributes such as transparency, accountability, and robustness, our findings reveal critical gaps. Current research often predominantly emphasizes technical precision at the expense of social and ethical considerations. The sociotechnical nature of AI systems remains less explored and trustworthiness emerges as a contested concept shaped by those with the power to define it. Conclusions: An interdisciplinary approach combining technical rigor with social, cultural, and institutional considerations is essential for advancing trustworthy AI. We propose actionable measures for the AI ethics community to adopt holistic frameworks that genuinely address the complex interplay between AI systems and society, ultimately promoting responsible technological development that benefits all stakeholders.
[328] Huxley-Gödel Machine: Human-Level Coding Agent Development by an Approximation of the Optimal Self-Improving Machine
Wenyi Wang, Piotr Piękos, Li Nanbo, Firas Laakom, Yimeng Chen, Mateusz Ostaszewski, Mingchen Zhuge, Jürgen Schmidhuber
Main category: cs.AI
TL;DR: The paper introduces the Huxley-Gödel Machine (HGM), a self-improving coding agent that uses a new metric called CMP to guide self-modifications, overcoming the mismatch between benchmark performance and actual self-improvement potential.
Details
Motivation: Current self-improving coding agents assume higher benchmark performance leads to better self-modifications, but there's a mismatch between performance and actual self-improvement potential (metaproductivity).Method: Proposed CMP metric that aggregates descendant performances to measure self-improvement potential. HGM estimates CMP and uses it to guide tree search through self-modifications, simulating Gödel Machine behavior.
Result: HGM outperforms prior methods on SWE-bench Verified and Polyglot with less wall-clock time. Achieves human-level performance on SWE-bench Lite, matching best human-engineered coding agents.
Conclusion: HGM demonstrates effective self-improvement through CMP-guided search, showing strong transfer across datasets and LLMs while achieving human-level coding performance.
Abstract: Recent studies operationalize self-improvement through coding agents that edit their own codebases. They grow a tree of self-modifications through expansion strategies that favor higher software engineering benchmark performance, assuming that this implies more promising subsequent self-modifications. However, we identify a mismatch between the agent’s self-improvement potential (metaproductivity) and its coding benchmark performance, namely the Metaproductivity-Performance Mismatch. Inspired by Huxley’s concept of clade, we propose a metric ($\mathrm{CMP}$) that aggregates the benchmark performances of the descendants of an agent as an indicator of its potential for self-improvement. We show that, in our self-improving coding agent development setting, access to the true $\mathrm{CMP}$ is sufficient to simulate how the Gödel Machine would behave under certain assumptions. We introduce the Huxley-Gödel Machine (HGM), which, by estimating $\mathrm{CMP}$ and using it as guidance, searches the tree of self-modifications. On SWE-bench Verified and Polyglot, HGM outperforms prior self-improving coding agent development methods while using less wall-clock time. Last but not least, HGM demonstrates strong transfer to other coding datasets and large language models. The agent optimized by HGM on SWE-bench Verified with GPT-5-mini and evaluated on SWE-bench Lite with GPT-5 achieves human-level performance, matching the best officially checked results of human-engineered coding agents. Our code is available at https://github.com/metauto-ai/HGM.
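To make the clade idea concrete, here is a minimal sketch of how descendant benchmark scores could be aggregated into a CMP-style estimate and used to pick the next agent version to modify. The mean aggregation, node structure, and greedy selection are illustrative assumptions, not the paper's estimator.

```python
# Illustrative sketch of clade-based guidance, not HGM's exact CMP estimator.
# Assumption: CMP is approximated as the mean benchmark score of a node's descendants.
from dataclasses import dataclass, field
from typing import List

@dataclass
class AgentNode:
    score: float                      # benchmark performance of this agent version
    children: List["AgentNode"] = field(default_factory=list)

def descendant_scores(node: AgentNode) -> List[float]:
    out = []
    for child in node.children:
        out.append(child.score)
        out.extend(descendant_scores(child))
    return out

def cmp_estimate(node: AgentNode) -> float:
    scores = descendant_scores(node)
    # Fall back to the node's own score when it has no descendants yet.
    return sum(scores) / len(scores) if scores else node.score

def pick_node_to_expand(root: AgentNode) -> AgentNode:
    # Greedy tree search: modify the agent whose clade looks most metaproductive.
    best, stack = root, [root]
    while stack:
        node = stack.pop()
        if cmp_estimate(node) > cmp_estimate(best):
            best = node
        stack.extend(node.children)
    return best
```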
[329] From Prompt Optimization to Multi-Dimensional Credibility Evaluation: Enhancing Trustworthiness of Chinese LLM-Generated Liver MRI Reports
Qiuli Wang, Jie Chen, Yongxu Liu, Xingpeng Zhang, Xiaoming Li, Wei Chen
Main category: cs.AI
TL;DR: This study introduces a Multi-Dimensional Credibility Assessment (MDCA) framework to enhance trustworthiness of LLM-generated liver MRI reports and provides guidance on institution-specific prompt optimization.
Details
Motivation: LLMs show promise in generating diagnostic conclusions from imaging findings, but systematic guidance on prompt optimization across clinical contexts is lacking, and there's no standardized framework for assessing trustworthiness of LLM-generated radiology reports.Method: Proposed a Multi-Dimensional Credibility Assessment (MDCA) framework and applied it to evaluate several advanced LLMs (Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, ByteDance-Seed-OSS-36B-Instruct) using the SiliconFlow platform.
Result: The study compares performance of multiple advanced LLMs in generating liver MRI reports, though specific quantitative results are not provided in the abstract.
Conclusion: The MDCA framework provides a standardized approach to assess trustworthiness of LLM-generated radiology reports and offers guidance for institution-specific prompt optimization to improve reliability.
Abstract: Large language models (LLMs) have demonstrated promising performance in generating diagnostic conclusions from imaging findings, thereby supporting radiology reporting, trainee education, and quality control. However, systematic guidance on how to optimize prompt design across different clinical contexts remains underexplored. Moreover, a comprehensive and standardized framework for assessing the trustworthiness of LLM-generated radiology reports is yet to be established. This study aims to enhance the trustworthiness of LLM-generated liver MRI reports by introducing a Multi-Dimensional Credibility Assessment (MDCA) framework and providing guidance on institution-specific prompt optimization. The proposed framework is applied to evaluate and compare the performance of several advanced LLMs, including Kimi-K2-Instruct-0905, Qwen3-235B-A22B-Instruct-2507, DeepSeek-V3, and ByteDance-Seed-OSS-36B-Instruct, using the SiliconFlow platform.
[330] Human-Like Goalkeeping in a Realistic Football Simulation: a Sample-Efficient Reinforcement Learning Approach
Alessandro Sestini, Joakim Bergdahl, Jean-Philippe Barrette-LaPierre, Florian Fuchs, Brady Chen, Michael Jones, Linus Gisslén
Main category: cs.AI
TL;DR: A sample-efficient Deep Reinforcement Learning method for training human-like game AI agents, tested in EA SPORTS FC 25 where it outperformed built-in AI by 10% in ball saving rate and trained 50% faster than standard DRL methods.
Details
Motivation: DRL has rarely been used in game industry for authentic AI behaviors due to impractical large models. Game studios need human-like agents with limited resources.Method: Sample-efficient DRL method that improves efficiency by leveraging pre-collected data and increasing network plasticity, tailored for industrial game development settings.
Result: Goalkeeper agent in EA SPORTS FC 25 outperformed built-in AI by 10% in ball saving rate, trained 50% faster than standard DRL, and created more human-like gameplay according to experts.
Conclusion: The method successfully addresses industry needs and is intended to replace hand-crafted AI in future game iterations, demonstrating practical impact for game development.
Abstract: While several high profile video games have served as testbeds for Deep Reinforcement Learning (DRL), this technique has rarely been employed by the game industry for crafting authentic AI behaviors. Previous research focuses on training super-human agents with large models, which is impractical for game studios with limited resources aiming for human-like agents. This paper proposes a sample-efficient DRL method tailored for training and fine-tuning agents in industrial settings such as the video game industry. Our method improves sample efficiency of value-based DRL by leveraging pre-collected data and increasing network plasticity. We evaluate our method training a goalkeeper agent in EA SPORTS FC 25, one of the best-selling football simulations today. Our agent outperforms the game’s built-in AI by 10% in ball saving rate. Ablation studies show that our method trains agents 50% faster compared to standard DRL methods. Finally, qualitative evaluation from domain experts indicates that our approach creates more human-like gameplay compared to hand-crafted agents. As a testimony of the impact of the approach, the method is intended to replace the hand-crafted counterpart in next iterations of the series.
[331] ReCode: Unify Plan and Action for Universal Granularity Control
Zhaoyang Yu, Jiayi Zhang, Huixue Su, Yufan Zhao, Yifan Wu, Mingyi Deng, Jinyu Xiang, Yizhang Lin, Lingxiao Tang, Yingchao Li, Yuyu Luo, Bang Liu, Chenglin Wu
Main category: cs.AI
TL;DR: ReCode introduces a recursive code generation paradigm that unifies planning and action by treating high-level plans as abstract functions that are recursively decomposed into primitive actions, enabling dynamic granularity control and generating rich training data.
Details
Motivation: Current LLM-based agents lack the ability to operate fluidly across decision granularities due to rigid separation between planning and action, which limits adaptability and generalization.Method: ReCode treats high-level plans as abstract placeholder functions and recursively decomposes them into finer-grained sub-functions until reaching primitive actions, using a unified code representation.
Result: Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training.
Conclusion: Unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control in AI agents.
Abstract: Real-world tasks require decisions at varying granularities, and humans excel at this by leveraging a unified cognitive representation where planning is fundamentally understood as a high-level form of action. However, current Large Language Model (LLM)-based agents lack this crucial capability to operate fluidly across decision granularities. This limitation stems from existing paradigms that enforce a rigid separation between high-level planning and low-level action, which impairs dynamic adaptability and limits generalization. We propose ReCode (Recursive Code Generation), a novel paradigm that addresses this limitation by unifying planning and action within a single code representation. In this representation, ReCode treats high-level plans as abstract placeholder functions, which the agent then recursively decomposes into finer-grained sub-functions until reaching primitive actions. This recursive approach dissolves the rigid boundary between plan and action, enabling the agent to dynamically control its decision granularity. Furthermore, the recursive structure inherently generates rich, multi-granularity training data, enabling models to learn hierarchical decision-making processes. Extensive experiments show ReCode significantly surpasses advanced baselines in inference performance and demonstrates exceptional data efficiency in training, validating our core insight that unifying planning and action through recursive code generation is a powerful and effective approach to achieving universal granularity control. The code is available at https://github.com/FoundationAgents/ReCode.
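As an illustration of the recursive plan-as-code idea, the sketch below expands an abstract plan into primitive actions by repeatedly asking a model to decompose it. The `llm_decompose` callable, the primitive set, and the depth cap are hypothetical stand-ins, not ReCode's actual interface.

```python
# Minimal sketch of recursive plan decomposition; names and stopping rule are illustrative.
from typing import Callable, List

PRIMITIVES = {"click", "type", "read", "navigate"}   # hypothetical primitive actions

def is_primitive(step: str) -> bool:
    return step.split("(")[0] in PRIMITIVES

def recode(plan: str, llm_decompose: Callable[[str], List[str]],
           depth: int = 0, max_depth: int = 6) -> List[str]:
    """Recursively expand an abstract placeholder plan into primitive actions."""
    if is_primitive(plan) or depth >= max_depth:
        return [plan]
    actions: List[str] = []
    # The model rewrites the abstract placeholder function as finer-grained sub-functions.
    for sub in llm_decompose(plan):
        actions.extend(recode(sub, llm_decompose, depth + 1, max_depth))
    return actions
```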
[332] Multi-Agent Evolve: LLM Self-Improve through Co-evolution
Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhang, Mostofa Patwary, Jiaxuan You
Main category: cs.AI
TL;DR: MAE is a multi-agent self-evolution framework that enables LLMs to improve reasoning capabilities through three interacting agents (Proposer, Solver, Judge) using reinforcement learning, achieving 4.54% average improvement on benchmarks with minimal human supervision.
Details
Motivation: Current RL methods for LLMs rely heavily on human-curated datasets and verifiable rewards, limiting scalability. Self-Play RL methods require grounded environments, making generalization to diverse domains challenging.Method: Proposes Multi-Agent Evolve (MAE) with three agents instantiated from a single LLM: Proposer generates questions, Solver provides solutions, and Judge evaluates both. Uses reinforcement learning to optimize agent behaviors through co-evolution.
Result: Experiments on Qwen2.5-3B-Instruct show 4.54% average improvement across multiple benchmarks in mathematics, reasoning, and general knowledge Q&A.
Conclusion: MAE provides a scalable, data-efficient method for enhancing LLM reasoning abilities with minimal human supervision, demonstrating effectiveness across diverse domains.
Abstract: Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). However, the success of RL for LLMs heavily relies on human-curated datasets and verifiable rewards, which limit their scalability and generality. Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. However, their methods primarily depend on a grounded environment for feedback (e.g., a Python interpreter or a game engine); extending them to general domains remains challenging. To address these challenges, we propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A. The core design of MAE is based on a triplet of interacting agents (Proposer, Solver, Judge) that are instantiated from a single LLM, and applies reinforcement learning to optimize their behaviors. The Proposer generates questions, the Solver attempts solutions, and the Judge evaluates both while co-evolving. Experiments on Qwen2.5-3B-Instruct demonstrate that MAE achieves an average improvement of 4.54% on multiple benchmarks. These results highlight MAE as a scalable, data-efficient method for enhancing the general reasoning abilities of LLMs with minimal reliance on human-curated supervision.
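A toy rendering of one Proposer-Solver-Judge round might look like the following, where a single `generate` callable plays all three roles and `rl_update` stands in for the reinforcement-learning step; both names and the reward parsing are assumptions made for illustration.

```python
# Toy Proposer/Solver/Judge round with one shared model; not the paper's training recipe.
def mae_step(generate, rl_update, topic: str):
    question = generate("You are the Proposer. Write one question about:", topic)
    answer = generate("You are the Solver. Answer concisely:", question)
    verdict = generate("You are the Judge. Score this answer from 0 to 1:",
                       f"Q: {question}\nA: {answer}")
    try:
        reward = float(verdict.strip())
    except ValueError:
        reward = 0.0
    # All three roles share one set of weights and co-evolve via the RL update.
    rl_update(samples=[question, answer, verdict], reward=reward)
    return question, answer, reward
```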
cs.SD
[333] Optimized Loudspeaker Panning for Adaptive Sound-Field Correction and Non-stationary Listening Areas
Yuancheng Luo
Main category: cs.SD
TL;DR: Bayesian loudspeaker normalization and content panning optimization methods for sound-field correction in non-standard multichannel audio layouts.
Details
Motivation: Practical loudspeaker layouts often deviate from standardized configurations, causing sound-field errors that degrade audio quality in timbre, imaging, and clarity.Method: Uses Bayesian loudspeaker normalization with conjugate prior distributions to estimate layouts for non-stationary listening locations, and frequency-domain panning optimization with spatial, electrical, and acoustic constraints.
Result: Methods create virtual loudspeakers in standardized layouts for accurate multichannel reproduction without requiring acoustic measurements.
Conclusion: The approach enables robust sound-field correction in practical applications with varying loudspeaker layouts and listening conditions.
Abstract: Surround sound systems commonly distribute loudspeakers along standardized layouts for multichannel audio reproduction. However, in less controlled environments, practical layouts vary in loudspeaker quantity, placement, and listening locations / areas. Deviations from standard layouts introduce sound-field errors that degrade acoustic timbre, imaging, and clarity of audio content reproduction. This work introduces both Bayesian loudspeaker normalization and content panning optimization methods for sound-field correction. Conjugate prior distributions over loudspeaker-listener directions update estimated layouts for non-stationary listening locations; digital filters adapt loudspeaker acoustic responses to a common reference target at the estimated listening area without acoustic measurements. Frequency-domain panning coefficients are then optimized via sensitivity / efficiency objectives subject to spatial, electrical, and acoustic domain constraints; normalized and panned loudspeakers form virtual loudspeakers in standardized layouts for accurate multichannel reproduction. Experiments investigate the robustness of Bayesian adaptation and panning optimization in practical applications.
[334] Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation
Kang Zhang, Trung X. Pham, Suyeon Lee, Axi Niu, Arda Senocak, Joon Son Chung
Main category: cs.SD
TL;DR: MGAudio is a flow-based framework for video-to-audio generation that uses model-guided dual-role alignment, achieving state-of-the-art performance on VGGSound and UnAV-100 benchmarks.
Details
Motivation: To overcome limitations of prior classifier-based or classifier-free guidance approaches by enabling the generative model to guide itself through a dedicated training objective for video-conditioned audio generation.Method: Integrates three components: (1) scalable flow-based Transformer model, (2) dual-role alignment mechanism where audio-visual encoder serves as both conditioning module and feature aligner, (3) model-guided objective for cross-modal coherence and audio realism.
Result: Achieves state-of-the-art performance on VGGSound with FAD reduced to 0.40, substantially surpassing classifier-free guidance baselines, and outperforms existing methods across FD, IS, and alignment metrics. Also generalizes well to UnAV-100 benchmark.
Conclusion: Model-guided dual-role alignment is a powerful and scalable paradigm for conditional video-to-audio generation.
Abstract: We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
[335] emg2speech: synthesizing speech from electromyography using self-supervised speech models
Harshavardhana T. Gowda, Lee M. Miller
Main category: cs.SD
TL;DR: A neuromuscular speech interface that translates EMG signals from facial muscles directly into audio using self-supervised speech representations, enabling end-to-end EMG-to-speech generation.
Details
Motivation: To develop a direct interface that converts muscle activity during speech articulation into synthesized speech, leveraging the relationship between self-supervised speech features and EMG signals.Method: Linear mapping of self-supervised speech features to EMG power, with gesture-specific clustering in feature space, followed by direct mapping of EMG signals to speech feature space for synthesis.
Result: Strong linear relationship between SS features and EMG power (r=0.85), structured clusters for different articulatory gestures, and successful end-to-end EMG-to-speech generation without explicit articulatory models.
Conclusion: Self-supervised speech models implicitly encode articulatory mechanisms, enabling direct EMG-to-speech conversion through linear mappings and feature space relationships.
Abstract: We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of $r = 0.85$. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship: $\text{SS features}$ $\xrightarrow{\texttt{linear mapping}}$ $\text{EMG power}$ $\xrightarrow{\texttt{gesture-specific clustering}}$ $\text{articulatory movements}$, highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models and vocoder training.
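The reported SS-features-to-EMG-power relationship is essentially a linear regression; a minimal numpy sketch of fitting such a map and measuring the correlation is shown below, with random data and array shapes chosen purely for illustration.

```python
# Sketch of a least-squares map from self-supervised speech features to EMG power.
# Shapes and the use of plain least squares are assumptions for illustration only.
import numpy as np

def fit_linear_map(ss_feats: np.ndarray, emg_power: np.ndarray):
    """ss_feats: (frames, feat_dim), emg_power: (frames, channels)."""
    W, *_ = np.linalg.lstsq(ss_feats, emg_power, rcond=None)
    pred = ss_feats @ W
    r = np.corrcoef(pred.ravel(), emg_power.ravel())[0, 1]
    return W, r   # the paper reports r = 0.85 for this kind of mapping

# Example call with synthetic data, just to show the usage pattern.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 768))
Y = X @ rng.standard_normal((768, 8)) + 0.1 * rng.standard_normal((500, 8))
W, r = fit_linear_map(X, Y)
```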
[336] HergNet: a Fast Neural Surrogate Model for Sound Field Predictions via Superposition of Plane Waves
Matteo Calafà, Yuanxin Xia, Cheol-Ho Jeong
Main category: cs.SD
TL;DR: A neural network architecture that automatically satisfies the Helmholtz equation for efficient sound field prediction in 2D and 3D.
Details
Motivation: To develop a method that can efficiently predict sound fields while ensuring physical validity through automatic satisfaction of the Helmholtz equation.Method: Novel neural network architecture designed to automatically satisfy the Helmholtz equation, enabling learning of boundary-value problems in wave phenomena.
Result: Numerical experiments show potential to outperform state-of-the-art methods in room acoustics simulation, especially at mid to high frequencies.
Conclusion: The proposed neural network provides physically valid and efficient prediction of sound fields, with promising performance in room acoustics applications.
Abstract: We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively learn solutions to boundary-value problems in various wave phenomena, such as acoustics, optics, and electromagnetism. Numerical experiments show that the proposed strategy can potentially outperform state-of-the-art methods in room acoustics simulation, in particular in the range of mid to high frequencies.
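The key property behind such architectures can be stated in one line: any superposition of plane waves with wavenumber k is automatically a solution of the Helmholtz equation, so a network that outputs plane-wave amplitudes (and possibly directions) can only produce physically valid fields. The derivation below is standard; how HergNet parameterizes the amplitudes and directions is not reproduced here.

```latex
% Each plane wave with unit direction d and wavenumber k satisfies Helmholtz exactly:
\nabla^2 e^{\mathrm{i} k \,\mathbf{d}\cdot\mathbf{x}}
  = -k^2 \,\lVert\mathbf{d}\rVert^2 \, e^{\mathrm{i} k \,\mathbf{d}\cdot\mathbf{x}}
  = -k^2 \, e^{\mathrm{i} k \,\mathbf{d}\cdot\mathbf{x}},
\qquad \lVert\mathbf{d}\rVert = 1 ,
\quad\Longrightarrow\quad
u(\mathbf{x}) = \sum_{j=1}^{N} a_j \, e^{\mathrm{i} k \,\mathbf{d}_j\cdot\mathbf{x}}
\ \text{satisfies}\ \nabla^2 u + k^2 u = 0 .
```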
[337] TsetlinKWS: A 65nm 16.58uW, 0.63mm2 State-Driven Convolutional Tsetlin Machine-Based Accelerator For Keyword Spotting
Baizhou Lin, Yuetong Fang, Renjing Xu, Rishad Shafik, Jagmohan Chauhan
Main category: cs.SD
TL;DR: TsetlinKWS is an algorithm-hardware co-design framework that enables Convolutional Tsetlin Machines to achieve competitive accuracy on keyword spotting while achieving 10× reduction in operations and high energy efficiency.
Details
Motivation: The Tsetlin Machine offers low-power and interpretable inference but has limited performance on speech tasks. This work aims to make CTM competitive for keyword spotting applications.Method: Proposes MFSC-SF feature extraction with spectral convolution, OG-BCSR algorithm for 9.84× model compression, and a state-driven hardware architecture exploiting data reuse and sparsity.
Result: Achieves 87.35% accuracy on 12-keyword spotting, 9.84× model size reduction, 16.58 μW power consumption at 0.7V, 0.63 mm² core area, and 10× reduction in operations compared to state-of-the-art.
Conclusion: TsetlinKWS positions CTM as a highly-efficient candidate for ultra-low-power speech applications with competitive accuracy and significant efficiency improvements.
Abstract: The Tsetlin Machine (TM) has recently attracted attention as a low-power alternative to neural networks due to its simple and interpretable inference mechanisms. However, its performance on speech-related tasks remains limited. This paper proposes TsetlinKWS, the first algorithm-hardware co-design framework for the Convolutional Tsetlin Machine (CTM) on the 12-keyword spotting task. Firstly, we introduce a novel Mel-Frequency Spectral Coefficient and Spectral Flux (MFSC-SF) feature extraction scheme together with spectral convolution, enabling the CTM to reach its first-ever competitive accuracy of 87.35% on the 12-keyword spotting task. Secondly, we develop an Optimized Grouped Block-Compressed Sparse Row (OG-BCSR) algorithm that achieves a remarkable 9.84$\times$ reduction in model size, significantly improving the storage efficiency on CTMs. Finally, we propose a state-driven architecture tailored for the CTM, which simultaneously exploits data reuse and sparsity to achieve high energy efficiency. The full system is evaluated in 65 nm process technology, consuming 16.58 $\mu$W at 0.7 V with a compact 0.63 mm$^2$ core area. TsetlinKWS requires only 907k logic operations per inference, representing a 10$\times$ reduction compared to the state-of-the-art KWS accelerators, positioning the CTM as a highly-efficient candidate for ultra-low-power speech applications.
[338] Sound Source Localization for Spatial Mapping of Surgical Actions in Dynamic Scenes
Jonas Hein, Lazaros Vlachopoulos, Maurits Geert Laurent Olthof, Bastian Sigrist, Philipp Fürnstahl, Matthias Seibold
Main category: cs.SD
TL;DR: A novel framework that integrates 3D acoustic information with visual data to create 4D audio-visual representations of surgical scenes, enabling spatial sound localization and multimodal understanding of surgical environments.
Details
Motivation: Current surgical scene understanding approaches rely mainly on visual data or end-to-end learning, which limits fine-grained contextual modeling. The goal is to enhance surgical scene representations by integrating 3D acoustic information for temporally and spatially aware multimodal understanding.Method: Projects acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. Uses a transformer-based acoustic event detection module to identify tool-tissue interactions and spatially localize them in the audio-visual scene representation. Evaluated in realistic operating room setup during simulated surgical procedures.
Result: Successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing comprehensive dynamic representation of surgical activity.
Conclusion: Introduces the first approach for spatial sound localization in dynamic surgical scenes, marking significant advancement toward multimodal surgical representations. The framework enables richer contextual understanding and provides foundation for future intelligent surgical systems.
Abstract: Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
[339] Bayesian Speech synthesizers Can Learn from Multiple Teachers
Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiangli, Wen Wu, Chao Zhang
Main category: cs.SD
TL;DR: BELLE is a novel continuous-valued autoregressive TTS framework that predicts mel-spectrograms directly from text, using Bayesian evidential learning to model uncertainty and trained on diverse synthetic speech data.
Details
Motivation: To address limitations of codec-based TTS (pretraining challenges and quantization errors) by developing continuous-valued generative models that can better handle diverse speech patterns and provide reliable sampling strategies.Method: Proposes BELLE framework that treats mel-spectrogram frames as Gaussian distributions from learned hyper distributions, uses Bayesian evidential learning to distill diverse speech samples synthesized from multiple pre-trained TTS models with the same text-audio prompts.
Result: BELLE achieves highly competitive performance compared to current best open-source TTS models, despite being trained on synthetic data and using only about one-tenth of their training data.
Conclusion: Continuous-valued autoregressive TTS with Bayesian evidential learning and synthetic data distillation is a promising approach that can match state-of-the-art performance while being more data-efficient.
Abstract: Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS faces limitations due to the challenges of pretraining robust speech codecs and the quality degradation introduced by quantization errors. Emerging evidence suggests that continuous-valued generative models can alleviate these issues and serve as a promising alternative. Yet, effectively modelling diverse speech patterns and developing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remains underexplored. In this work, we propose BELLE, Bayesian evidential learning with language modelling for TTS, a novel continuous-valued AR framework that directly predicts mel-spectrograms from textual input. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper distribution, enabling principled uncertainty estimation, particularly in scenarios with parallel data (i.e., one text-audio prompt paired with multiple speech samples). To obtain such data, diverse speech samples are synthesized using multiple pre-trained TTS models given the same text-audio prompts, which are distilled into BELLE via Bayesian evidential learning. Experimental results indicate that BELLE demonstrates highly competitive performance compared with the current best open-source TTS models, even though BELLE is trained on a large amount of synthetic data and uses only approximately one-tenth of their training data. Audio samples generated by BELLE are available at https://belletts.github.io/Belle/. The code, checkpoints, and synthetic data will be released after the paper is accepted.
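The per-frame Gaussian view of the mel-spectrogram can be illustrated with PyTorch's built-in Gaussian negative log-likelihood; the predictor producing a mean and variance per frame is hypothetical, and the full Bayesian evidential learning and multi-teacher distillation setup is not shown.

```python
# Sketch of the "each mel frame is a Gaussian" idea; not BELLE's evidential formulation.
import torch
import torch.nn as nn

gnll = nn.GaussianNLLLoss(reduction="mean")

def frame_nll(pred_mean: torch.Tensor, pred_var: torch.Tensor,
              target_mel: torch.Tensor) -> torch.Tensor:
    """All tensors shaped (batch, frames, n_mels); variance must stay positive."""
    return gnll(pred_mean, target_mel, pred_var.clamp_min(1e-6))
```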
[340] Online neural fusion of distortionless differential beamformers for robust speech enhancement
Yuanhang Qian, Kunlong Zhao, Jilu Jin, Xueqin Luo, Gongping Huang, Jingdong Chen, Jacob Benesty
Main category: cs.SD
TL;DR: Proposes a neural fusion framework that combines multiple fixed beamformers using neural network-estimated weights, overcoming limitations of adaptive convex combination in non-stationary acoustic environments.
Details
Motivation: Fixed beamforming provides stable performance but lacks adaptability to varying acoustic conditions. Adaptive convex combination methods fail in highly non-stationary scenarios like rapidly moving interference due to unreliable tracking of rapid changes.Method: Frame-online neural fusion framework that estimates combination weights through a neural network to linearly combine outputs of multiple distortionless differential beamformers.
Result: The proposed method adapts more effectively to dynamic acoustic environments compared to conventional ACC, achieving stronger interference suppression while maintaining the distortionless constraint.
Conclusion: Neural network-based weight estimation enables better adaptation to non-stationary acoustic conditions than traditional adaptive methods, providing improved interference suppression for beamforming applications.
Abstract: Fixed beamforming is widely used in practice since it does not depend on the estimation of noise statistics and provides relatively stable performance. However, a single beamformer cannot adapt to varying acoustic conditions, which limits its interference suppression capability. To address this, adaptive convex combination (ACC) algorithms have been introduced, where the outputs of multiple fixed beamformers are linearly combined to improve robustness. Nevertheless, ACC often fails in highly non-stationary scenarios, such as rapidly moving interference, since its adaptive updates cannot reliably track rapid changes. To overcome this limitation, we propose a frame-online neural fusion framework for multiple distortionless differential beamformers, which estimates the combination weights through a neural network. Compared with conventional ACC, the proposed method adapts more effectively to dynamic acoustic environments, achieving stronger interference suppression while maintaining the distortionless constraint.
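A bare-bones version of the frame-online fusion idea is sketched below: a small network maps per-frame features to convex combination weights over the fixed beamformer outputs. The feature choice, network size, and softmax constraint are assumptions for illustration, not the authors' design.

```python
# Minimal sketch of neural fusion of fixed beamformer outputs; illustrative only.
import torch
import torch.nn as nn

class FusionNet(nn.Module):
    def __init__(self, n_beams: int, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, n_beams),
        )

    def forward(self, frame_feats: torch.Tensor, beam_outputs: torch.Tensor):
        """frame_feats: (batch, feat_dim); beam_outputs: (batch, n_beams, freq_bins)."""
        # Softmax keeps the combination convex, mirroring the ACC-style constraint.
        w = torch.softmax(self.net(frame_feats), dim=-1)        # (batch, n_beams)
        fused = (w.unsqueeze(-1) * beam_outputs).sum(dim=1)     # (batch, freq_bins)
        return fused, w
```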
[341] Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient
Rinku Sebastian, Simon O’Keefe, Martin Trefzer
Main category: cs.SD
TL;DR: Proposes Time domain Mel frequency Wavelet Coefficient (TMFWC) to combine MFCC and wavelet transform advantages while reducing computational complexity, achieving improved efficiency with reservoir computing.
Details
Motivation: MFCC lacks time-frequency information while wavelet transform has poor frequency resolution in low frequencies and doesn't align well with human auditory perception. Need to combine both advantages efficiently.Method: Extracts Mel scale features in time domain by combining wavelet transform concepts, avoiding time-frequency conversion and reducing wavelet extraction complexity.
Result: Significantly improved efficiency of audio signal processing when combined with reservoir computing methodology.
Conclusion: TMFWC successfully integrates MFCC and wavelet transform benefits while reducing computational burden, making it an effective feature extraction method for speech processing.
Abstract: Extracting features from speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of speaker and speech recognition applications, as the filtering in this feature is similar to the filtering taking place in the human ear. The main drawback of this feature, however, is that it provides only the frequency information of the signal and does not indicate at what time each frequency is present. The wavelet transform, with its flexible time-frequency window, provides both time and frequency information of the signal and is an appropriate tool for the analysis of non-stationary signals like speech. On the other hand, because of its uniform frequency scaling, a typical wavelet transform may be less effective in analysing speech signals, have poorer frequency resolution at low frequencies, and be less in line with human auditory perception. Hence, it is necessary to develop a feature that incorporates the merits of both MFCC and the wavelet transform. A great deal of research has attempted to combine these two features. Existing wavelet-transform-based Mel-scaled feature extraction methods require more computation because the wavelet transform is applied on top of Mel-scale filtering, which adds extra processing steps. Here we propose a method to extract Mel-scale features in the time domain by combining the concept of the wavelet transform, thus reducing the computational burden of time-frequency conversion and the complexity of wavelet extraction. Combining our proposed Time domain Mel frequency Wavelet Coefficient (TMFWC) technique with the reservoir computing methodology has significantly improved the efficiency of audio signal processing.
[342] STAR-Bench: Probing Deep Spatio-Temporal Reasoning as Audio 4D Intelligence
Zihan Liu, Zhikang Niu, Qiuyang Xiao, Zhisheng Zheng, Ruoqi Yuan, Yuhang Zang, Yuhang Cao, Xiaoyi Dong, Jianze Liang, Xie Chen, Leilei Sun, Dahua Lin, Jiaqi Wang
Main category: cs.SD
TL;DR: STAR-Bench is a new benchmark that measures audio 4D intelligence - reasoning over sound dynamics in time and 3D space, addressing limitations in existing audio benchmarks that mainly test semantics recoverable from text.
Details
Motivation: Existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. The authors aim to formalize and measure audio 4D intelligence.Method: STAR-Bench combines Foundational Acoustic Perception (six attributes under absolute/relative regimes) with Holistic Spatio-Temporal Reasoning (segment reordering, spatial tasks including localization, multi-source relations, and dynamic trajectories). Data curation uses procedurally synthesized audio for foundational tasks and a four-stage human-annotated process for holistic data.
Result: Evaluation of 19 models shows substantial gaps compared to humans, with caption-only answering causing large drops (-31.5% temporal, -35.2% spatial). Closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning.
Conclusion: STAR-Bench provides critical insights and a clear path forward for developing models with more robust understanding of the physical world through audio 4D intelligence.
Abstract: Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence that is defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy slightly, STAR-Bench induces far larger drops (-31.5% temporal, -35.2% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
[343] BNMusic: Blending Environmental Noises into Personalized Music
Chi Zuo, Martin B. Møller, Pablo Martínez-Nuevo, Huayang Huang, Yu Wu, Ye Zhu
Main category: cs.SD
TL;DR: BNMusic is a novel framework that generates personalized music from text prompts to blend with environmental noise, reducing noise perception through rhythmically aligned and adaptively amplified music generation.
Details
Motivation: Traditional acoustic masking requires excessive volume to cover environmental noise due to misalignment issues. The paper aims to reduce noise noticeability by blending it with personalized music generated from user text prompts.Method: A two-stage framework: 1) synthesizes complete music in mel-spectrogram representation that captures the musical essence of noise, 2) adaptively amplifies generated music to reduce noise perception while preserving audio quality.
Result: Experiments on MusicBench, EPIC-SOUNDS, and ESC-50 datasets demonstrate effective blending of environmental noise with rhythmically aligned and enjoyable music segments, minimizing noise noticeability.
Conclusion: BNMusic framework successfully improves acoustic experiences by generating personalized music that blends with environmental noise, offering an alternative to traditional acoustic masking methods.
Abstract: When listeners are disturbed by environmental noises, acoustic masking is a conventional technique in audio engineering that seeks to cover up the noises with other dominant yet less intrusive sounds. However, misalignment between the dominant sound and the noise, such as mismatched downbeats, often requires an excessive volume increase to achieve effective masking. Motivated by recent advances in cross-modal generation, in this work, we introduce an alternative method to acoustic masking, aiming to reduce the noticeability of environmental noises by blending them into personalized music generated based on user-provided text prompts. Following the paradigm of music generation using mel-spectrogram representations, we propose a Blending Noises into Personalized Music (BNMusic) framework with two key stages. The first stage synthesizes a complete piece of music in a mel-spectrogram representation that encapsulates the musical essence of the noise. In the second stage, we adaptively amplify the generated music segment to further reduce noise perception and enhance the blending effectiveness, while preserving auditory quality. Our experiments with comprehensive evaluations on MusicBench, EPIC-SOUNDS, and ESC-50 demonstrate the effectiveness of our framework, highlighting the ability to blend environmental noise with rhythmically aligned, adaptively amplified, and enjoyable music segments, minimizing the noticeability of the noise, thereby improving overall acoustic experiences. Project page: https://d-fas.github.io/BNMusic_page/.
[344] Latent Multi-view Learning for Robust Environmental Sound Representations
Sivan Ding, Julia Wilkins, Magdalena Fuentes, Juan Pablo Bello
Main category: cs.SD
TL;DR: A multi-view learning framework that combines contrastive and generative SSL methods to capture sound source and device information from environmental audio data.
Details
Motivation: To explore how contrastive and generative self-supervised learning approaches can complement each other in a unified framework for environmental sound representation learning.Method: Encodes compressed audio latents into view-specific and view-common subspaces using two self-supervised objectives: contrastive learning for targeted information flow and reconstruction for overall information preservation.
Result: Demonstrated improved performance on urban sound sensor network dataset for sound source and sensor classification compared to traditional SSL techniques.
Conclusion: The framework successfully integrates contrastive and generative SSL methods and shows potential for disentangling environmental sound attributes in structured latent spaces.
Abstract: Self-supervised learning (SSL) approaches, such as contrastive and generative methods, have advanced environmental sound representation learning using unlabeled data. However, how these approaches can complement each other within a unified framework remains relatively underexplored. In this work, we propose a multi-view learning framework that integrates contrastive principles into a generative pipeline to capture sound source and device information. Our method encodes compressed audio latents into view-specific and view-common subspaces, guided by two self-supervised objectives: contrastive learning for targeted information flow between subspaces, and reconstruction for overall information preservation. We evaluate our method on an urban sound sensor network dataset for sound source and sensor classification, demonstrating improved downstream performance over traditional SSL techniques. Additionally, we investigate the model’s potential to disentangle environmental sound attributes within the structured latent space under varied training configurations.
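One plausible way to combine the two objectives is sketched below: an InfoNCE-style term aligns the view-common subspaces across views, a cosine penalty discourages leakage between view-specific and common codes, and a reconstruction term preserves overall information. The specific losses and weights are assumptions, not the authors' exact formulation.

```python
# Rough sketch of a reconstruction + contrastive multi-view objective; illustrative only.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def multiview_loss(z_common_1, z_common_2, z_spec_1, z_spec_2, recon, target,
                   alpha: float = 1.0, beta: float = 1.0):
    # Pull the two views' common subspaces together...
    l_common = info_nce(z_common_1, z_common_2)
    # ...while discouraging overlap between view-specific and common codes.
    l_spec = (F.cosine_similarity(z_spec_1, z_common_1, dim=-1).abs().mean()
              + F.cosine_similarity(z_spec_2, z_common_2, dim=-1).abs().mean())
    # Reconstruction preserves overall information in the compressed latents.
    l_recon = F.mse_loss(recon, target)
    return l_recon + alpha * l_common + beta * l_spec
```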
[345] Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Main category: cs.SD
TL;DR: The paper identifies and addresses Insertion Hallucination in Video-to-Audio generation, where models generate sounds without visual sources, and proposes a training-free method to reduce this issue by over 50%.
Details
Motivation: Existing evaluation metrics for Video-to-Audio generation overlook a critical failure mode where models generate acoustic events (speech/music) without corresponding visual sources, driven by dataset biases like off-screen sounds.Method: Proposes Posterior Feature Correction (PFC), a training-free inference-time method that uses a two-pass process: initial audio generation to detect hallucinated segments, then regeneration after masking corresponding video features at those timestamps.
Result: State-of-the-art models suffer from severe Insertion Hallucination. PFC reduces both prevalence (IH@vid) and duration (IH@dur) of hallucinations by over 50% on average, without degrading conventional audio quality and synchronization metrics.
Conclusion: This work formally defines, systematically measures, and effectively mitigates Insertion Hallucination, paving the way for more reliable and faithful Video-to-Audio models.
Abstract: Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
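The two-pass inference procedure can be sketched schematically as follows; the generator and detector callables, the segment representation, and the simple exact-match majority vote are placeholders standing in for the paper's components rather than a faithful reimplementation.

```python
# Schematic two-pass Posterior Feature Correction loop; components are placeholders.
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, int]   # (start_frame, end_frame)

def majority_vote(votes: List[List[Segment]], min_votes: int) -> List[Segment]:
    counts: Dict[Segment, int] = {}
    for segs in votes:
        for seg in segs:
            counts[seg] = counts.get(seg, 0) + 1
    return [seg for seg, c in counts.items() if c >= min_votes]

def mask_frames(video_feats: List[List[float]], segments: List[Segment]):
    masked = [list(frame) for frame in video_feats]
    for start, end in segments:
        for t in range(start, min(end, len(masked))):
            masked[t] = [0.0] * len(masked[t])   # zero out features at flagged frames
    return masked

def pfc_inference(video_feats, generate_audio: Callable, detectors: List[Callable]):
    # Pass 1: generate audio and flag acoustic events (e.g. speech/music) per detector.
    audio = generate_audio(video_feats)
    flagged = majority_vote([det(audio) for det in detectors],
                            min_votes=len(detectors) // 2 + 1)
    if not flagged:
        return audio
    # Pass 2: mask video features at the flagged timestamps and regenerate.
    return generate_audio(mask_frames(video_feats, flagged))
```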
[346] Low-Resource Audio Codec (LRAC): 2025 Challenge Description
Kamil Wojcicki, Yusuf Ziya Isik, Laura Lechler, Mansur Yesilbursa, Ivana Balić, Wolfgang Mack, Rafał Łaganowski, Guoqing Zhang, Yossi Adi, Minje Kim, Shinji Watanabe
Main category: cs.SD
TL;DR: The 2025 Low-Resource Audio Codec Challenge aims to develop neural and hybrid codecs for resource-constrained applications, addressing limitations of current neural audio codecs in low-resource operation and robustness to acoustic distortions.
Details
Motivation: Current neural audio codecs have superior speech quality at ultralow bitrates but face practical adoption challenges due to low-resource operation constraints and lack of robustness to acoustic distortions like background noise and reverberation.Method: Introduces a challenge framework with standardized training dataset, two baseline systems, and comprehensive evaluation framework to develop neural and hybrid codecs for edge deployment scenarios with stringent compute constraints.
Result: The challenge is expected to yield valuable insights applicable to both codec design and related downstream audio tasks, though specific results are not yet available as this is a proposed challenge.
Conclusion: The 2025 Low-Resource Audio Codec Challenge will catalyze progress in developing practical neural codecs that can operate under compute constraints while maintaining low latency and bitrate, with robustness to acoustic degradations.
Abstract: While recent neural audio codecs deliver superior speech quality at ultralow bitrates over traditional methods, their practical adoption is hindered by obstacles related to low-resource operation and robustness to acoustic distortions. Edge deployment scenarios demand codecs that operate under stringent compute constraints while maintaining low latency and bitrate. The presence of background noise and reverberation further necessitates designs that are resilient to such degradations. The performance of neural codecs under these constraints and their integration with speech enhancement remain largely unaddressed. To catalyze progress in this area, we introduce the 2025 Low-Resource Audio Codec Challenge, which targets the development of neural and hybrid codecs for resource-constrained applications. Participants are supported with a standardized training dataset, two baseline systems, and a comprehensive evaluation framework. The challenge is expected to yield valuable insights applicable to both codec design and related downstream audio tasks.
cs.LG
[347] An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis
Phuong Q. Dao, Mark Roantree, Vuong M. Ngo
Main category: cs.LG
TL;DR: Proposes BERT-ViT-EF and DTCN models for multimodal sentiment analysis using early fusion of text and image transformers, achieving state-of-the-art results on MVSA-Single and TumEmo datasets.
Details
Motivation: To improve multimodal sentiment analysis by enabling deeper cross-modal interactions and more effective joint representation learning than unimodal approaches.
Method: BERT-ViT-EF combines BERT for text and ViT for images via early fusion. DTCN extends this with additional transformer layer for text refinement and contrastive learning for modality alignment.
Result: DTCN achieves 78.4% accuracy and 78.3% F1-score on TumEmo, and 76.6% accuracy and 75.9% F1-score on MVSA-Single, demonstrating superior performance.
Conclusion: Early fusion and deeper contextual modeling in transformer-based architectures significantly enhance multimodal sentiment analysis performance.
Abstract: Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by jointly analyzing data from multiple modalities, typically text and images, offering a richer and more accurate interpretation than unimodal approaches. In this paper, we first propose BERT-ViT-EF, a novel model that combines powerful Transformer-based encoders, BERT for textual input and ViT for visual input, through an early fusion strategy. This approach facilitates deeper cross-modal interactions and more effective joint representation learning. To further enhance the model’s capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN), which builds upon BERT-ViT-EF. DTCN incorporates an additional Transformer encoder layer after BERT to refine textual context (before fusion) and employs contrastive learning to align text and image representations, fostering robust multimodal feature learning. Empirical results on two widely used MSA benchmarks, MVSA-Single and TumEmo, demonstrate the effectiveness of our approach. DTCN achieves the best accuracy (78.4%) and F1-score (78.3%) on TumEmo, and delivers competitive performance on MVSA-Single, with 76.6% accuracy and 75.9% F1-score. These improvements highlight the benefits of early fusion and deeper contextual modeling in Transformer-based multimodal sentiment analysis.
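A hedged sketch of the early-fusion-plus-contrastive design described above. The projection layers stand in for BERT and ViT token/patch embeddings; dimensions, temperature, and mean pooling are illustrative choices, not the paper's exact configuration.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyFusionSentiment(nn.Module):
    def __init__(self, text_dim=768, img_dim=768, d_model=512, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, d_model)
        self.img_proj = nn.Linear(img_dim, d_model)
        # Extra transformer layer refining textual context before fusion (as in DTCN).
        self.text_refine = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, text_tokens, img_patches):
        t = self.text_refine(self.text_proj(text_tokens))   # (B, Lt, d)
        v = self.img_proj(img_patches)                       # (B, Lv, d)
        fused = self.fusion(torch.cat([t, v], dim=1))        # early fusion of both streams
        logits = self.classifier(fused.mean(dim=1))
        # InfoNCE-style contrastive loss aligning pooled text and image representations.
        zt, zv = F.normalize(t.mean(1), dim=-1), F.normalize(v.mean(1), dim=-1)
        sim = zt @ zv.t() / 0.07
        contrastive = F.cross_entropy(sim, torch.arange(sim.size(0)))
        return logits, contrastive

model = EarlyFusionSentiment()
logits, c_loss = model(torch.randn(4, 20, 768), torch.randn(4, 197, 768))
```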
[348] Speeding Up MACE: Low-Precision Tricks for Equivarient Force Fields
Alexandre Benoit
Main category: cs.LG
TL;DR: This paper investigates computational optimizations for MACE force fields, showing that cuEquivariance backend and mixed-precision inference (BF16/FP16 linear layers with FP32 accumulation) can achieve 3-4x speedups while maintaining physical accuracy in molecular dynamics simulations.
Details
Motivation: To reduce the high computational cost of machine-learning force fields like MACE while preserving physical fidelity, by systematically evaluating reduced-precision arithmetic and GPU-optimized kernels.
Method: Profiled MACE end-to-end and per block, compared e3nn and cuEquivariance backends, assessed FP64/FP32/BF16/FP16 precision settings with FP32 accumulation for inference, NVT/NPT water simulations, and training runs under reproducible timing conditions.
Result: cuEquivariance reduced inference latency by ~3x. BF16/FP16 linear layers within FP32 models provided ~4x additional speedups. Energies and thermodynamic observables remained within run-to-run variability, but half-precision training degraded force RMSE.
Conclusion: Fused equivariant kernels and mixed-precision inference can substantially accelerate state-of-the-art force fields with negligible impact on downstream MD. Recommended policy: cuEquivariance with FP32 default, enable BF16/FP16 for linear layers with FP32 accumulation for maximum throughput, while keeping training in FP32.
Abstract: Machine-learning force fields can deliver accurate molecular dynamics (MD) at high computational cost. For SO(3)-equivariant models such as MACE, there is little systematic evidence on whether reduced-precision arithmetic and GPU-optimized kernels can cut this cost without harming physical fidelity. This thesis aims to make MACE cheaper and faster while preserving accuracy by identifying computational bottlenecks and evaluating low-precision execution policies. We profile MACE end-to-end and per block, compare the e3nn and NVIDIA cuEquivariance backends, and assess FP64/FP32/BF16/FP16 settings (with FP32 accumulation) for inference, short NVT and long NPT water simulations, and toy training runs under reproducible, steady-state timing. cuEquivariance reduces inference latency by about $3\times$. Casting only linear layers to BF16/FP16 within an FP32 model yields roughly 4x additional speedups, while energies and thermodynamic observables in NVT/NPT MD remain within run-to-run variability. Half-precision weights during training degrade force RMSE. Mixing e3nn and cuEq modules without explicit adapters causes representation mismatches. Fused equivariant kernels and mixed-precision inference can substantially accelerate state-of-the-art force fields with negligible impact on downstream MD. A practical policy is to use cuEquivariance with FP32 by default and enable BF16/FP16 for linear layers (keeping FP32 accumulations) for maximum throughput, while training remains in FP32. Further gains are expected on Ampere/Hopper GPUs (TF32/BF16) and from kernel-level FP16/BF16 paths and pipeline fusion.
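A minimal PyTorch illustration of the recommended precision policy (FP32 master weights, BF16 linear-layer math with higher-precision accumulation). The toy MLP stands in for MACE, and the cuEquivariance kernels are not shown.
```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(128, 256), nn.SiLU(), nn.Linear(256, 1)).to(device)
x = torch.randn(32, 128, device=device)

# Autocast runs eligible ops (e.g., Linear matmuls) in BF16 while the model's
# weights stay in FP32; accumulations remain in higher precision.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    energy = model(x)

print(energy.dtype)  # torch.bfloat16 activations; parameters remain FP32
```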
[349] Adversarially-Aware Architecture Design for Robust Medical AI Systems
Alyssa Gerhart, Balaji Iyangar
Main category: cs.LG
TL;DR: Adversarial attacks threaten healthcare AI by causing dangerous misclassifications, especially impacting underserved populations. The study demonstrates these vulnerabilities on dermatological data and shows partial defense success through adversarial training and distillation.
Details
Motivation: Adversarial attacks pose severe risks to healthcare AI systems, potentially causing treatment delays and misdiagnoses that threaten patient safety, particularly in vulnerable populations.
Method: Empirical experimentation on dermatological dataset using threat modeling, experimental benchmarking, and model evaluation to test adversarial attacks and defenses like adversarial training and distillation.
Result: Adversarial methods significantly reduce classification accuracy, while defenses reduce attack success rates but must be balanced against model performance on clean data.
Conclusion: Integrated technical, ethical, and policy-based approaches are needed to build more resilient and equitable AI in healthcare.
Abstract: Adversarial attacks pose a severe risk to AI systems used in healthcare, capable of misleading models into dangerous misclassifications that can delay treatments or cause misdiagnoses. These attacks, often imperceptible to human perception, threaten patient safety, particularly in underserved populations. Our study explores these vulnerabilities through empirical experimentation on a dermatological dataset, where adversarial methods significantly reduce classification accuracy. Through detailed threat modeling, experimental benchmarking, and model evaluation, we demonstrate both the severity of the threat and the partial success of defenses like adversarial training and distillation. Our results show that while defenses reduce attack success rates, they must be balanced against model performance on clean data. We conclude with a call for integrated technical, ethical, and policy-based approaches to build more resilient, equitable AI in healthcare.
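A generic FGSM-style adversarial-training step of the kind the defenses above build on; the classifier, optimizer, and epsilon are placeholders rather than the study's exact setup.
```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_step(model, optimizer, images, labels, eps=4 / 255):
    # Compute input gradients on the clean batch.
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    # Perturb inputs in the direction that increases the loss, then train on them.
    adv_images = (images + eps * images.grad.sign()).clamp(0, 1).detach()
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(adv_images), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```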
[350] DiNo and RanBu: Lightweight Predictions from Shallow Random Forests
Tiago Mendonça dos Santos, Rafael Izbicki, Luís Gustavo Esteves
Main category: cs.LG
TL;DR: DiNo and RanBu are shallow-forest methods that convert depth-limited trees into efficient predictors using distance-based approaches, achieving comparable accuracy to full-depth random forests with up to 95% reduction in training and inference time.
Details
Motivation: Random Forest ensembles have high inference latency and memory demands due to reliance on hundreds of deep trees, limiting deployment in latency-sensitive or resource-constrained environments.
Method: DiNo measures cophenetic distances via most recent common ancestor of observation pairs, while RanBu applies kernel smoothing to Breiman’s classical proximity measure. Both operate after forest training without growing additional trees.
Result: RanBu matches or exceeds full-depth random forest accuracy (especially in high-noise settings) with up to 95% time reduction. DiNo achieves best bias-variance trade-off in low-noise regimes at modest computational cost. Both extend to quantile regression.
Conclusion: The methods provide efficient alternatives to deep random forests with substantial speed gains while maintaining accuracy, implemented as an open-source R/C++ package for structured tabular data.
Abstract: Random Forest ensembles are a strong baseline for tabular prediction tasks, but their reliance on hundreds of deep trees often results in high inference latency and memory demands, limiting deployment in latency-sensitive or resource-constrained environments. We introduce DiNo (Distance with Nodes) and RanBu (Random Bushes), two shallow-forest methods that convert a small set of depth-limited trees into efficient, distance-weighted predictors. DiNo measures cophenetic distances via the most recent common ancestor of observation pairs, while RanBu applies kernel smoothing to Breiman’s classical proximity measure. Both approaches operate entirely after forest training: no additional trees are grown, and tuning of the single bandwidth parameter $h$ requires only lightweight matrix-vector operations. Across three synthetic benchmarks and 25 public datasets, RanBu matches or exceeds the accuracy of full-depth random forests-particularly in high-noise settings-while reducing training plus inference time by up to 95%. DiNo achieves the best bias-variance trade-off in low-noise regimes at a modest computational cost. Both methods extend directly to quantile regression, maintaining accuracy with substantial speed gains. The implementation is available as an open-source R/C++ package at https://github.com/tiagomendonca/dirf. We focus on structured tabular random samples (i.i.d.), leaving extensions to other modalities for future work.
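A rough sketch of a RanBu-style predictor built on scikit-learn: a shallow forest's Breiman proximities are kernel-smoothed into prediction weights. The Gaussian kernel and bandwidth here are illustrative; the paper's R/C++ package defines the exact estimator.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ranbu_predict(X_train, y_train, X_test, n_trees=50, depth=4, h=0.5):
    forest = RandomForestRegressor(n_estimators=n_trees, max_depth=depth).fit(X_train, y_train)
    train_leaves = forest.apply(X_train)     # (n_train, n_trees) leaf ids
    test_leaves = forest.apply(X_test)       # (n_test, n_trees)
    # Breiman proximity: fraction of trees in which two observations share a leaf.
    prox = (test_leaves[:, None, :] == train_leaves[None, :, :]).mean(axis=2)
    # Kernel smoothing turns proximity-based distances into prediction weights.
    w = np.exp(-((1.0 - prox) / h) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ y_train
```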
[351] Noise is All You Need: Solving Linear Inverse Problems by Noise Combination Sampling with Diffusion Models
Xun Su, Hiroyuki Kasai
Main category: cs.LG
TL;DR: Proposes Noise Combination Sampling to optimally integrate observation information into diffusion models for inverse problems, avoiding the trade-off between excessive and insufficient constraint integration.
Details
Motivation: Addresses the dilemma in zero-shot inverse problem solving where too much observation information disrupts generation while too little fails to enforce constraints.
Method: Synthesizes optimal noise vector from noise subspace to approximate measurement score, replacing standard DDPM noise term, enabling natural embedding of conditional information without hyperparameter tuning.
Result: Achieves superior performance with negligible computational overhead, especially when generation steps are small, improving robustness and stability across various inverse problems.
Conclusion: Noise Combination Sampling provides an effective solution for integrating conditional information in diffusion models for inverse problems, enhancing performance without computational burden.
Abstract: Pretrained diffusion models have demonstrated strong capabilities in zero-shot inverse problem solving by incorporating observation information into the generation process of the diffusion models. However, this presents an inherent dilemma: excessive integration can disrupt the generative process, while insufficient integration fails to emphasize the constraints imposed by the inverse problem. To address this, we propose \emph{Noise Combination Sampling}, a novel method that synthesizes an optimal noise vector from a noise subspace to approximate the measurement score, replacing the noise term in the standard Denoising Diffusion Probabilistic Models process. This enables conditional information to be naturally embedded into the generation process without reliance on step-wise hyperparameter tuning. Our method can be applied to a wide range of inverse problem solvers, including image compression, and, particularly when the number of generation steps $T$ is small, achieves superior performance with negligible computational overhead, significantly improving robustness and stability.
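A loose illustration of the core idea as summarized above: choose coefficients so that a combination of candidate noise vectors best approximates a measurement-score direction, and substitute that combination for the standard DDPM noise term. The least-squares fit and shapes are an illustrative reading, not the authors' exact algorithm.
```python
import numpy as np

def combine_noise(noise_candidates, measurement_score):
    """noise_candidates: (K, D) sampled noise vectors; measurement_score: (D,)."""
    A = noise_candidates.T                       # (D, K) basis spanning the noise subspace
    coeffs, *_ = np.linalg.lstsq(A, measurement_score, rcond=None)
    return A @ coeffs                            # noise vector closest to the score direction

rng = np.random.default_rng(0)
candidates = rng.standard_normal((8, 1024))
score = rng.standard_normal(1024)
synthetic_noise = combine_noise(candidates, score)
```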
[352] From Detection to Discovery: A Closed-Loop Approach for Simultaneous and Continuous Medical Knowledge Expansion and Depression Detection on Social Media
Shuang Geng, Wenli Zhang, Jiaheng Xie, Rui Wang, Sudha Ram
Main category: cs.LG
TL;DR: A Closed-Loop LLM-Knowledge Graph framework that integrates depression detection with knowledge expansion through iterative learning cycles, enhancing both predictive accuracy and medical understanding from social media data.
Details
Motivation: To address the limitation of prior studies that use medical knowledge for prediction but fail to expand such knowledge through the predictive process, creating a mutually reinforcing system.
Method: Developed a closed-loop framework with two phases: 1) Knowledge-aware depression detection using LLM for joint depression detection and entity extraction, with knowledge graph representation and weighting; 2) Knowledge refinement and expansion incorporating new entities, relationships, and types into the knowledge graph under expert supervision.
Result: The framework enhanced both predictive accuracy and medical understanding from large-scale UGC. Expert evaluations confirmed discovery of clinically meaningful symptoms, comorbidities, and social triggers complementary to existing literature.
Conclusion: Successfully conceptualized and operationalized prediction-through-learning and learning-through-prediction as mutually reinforcing processes, demonstrating co-evolution of computational models and domain knowledge for adaptive, data-driven knowledge systems.
Abstract: Social media user-generated content (UGC) provides real-time, self-reported indicators of mental health conditions such as depression, offering a valuable source for predictive analytics. While prior studies integrate medical knowledge to improve prediction accuracy, they overlook the opportunity to simultaneously expand such knowledge through predictive processes. We develop a Closed-Loop Large Language Model (LLM)-Knowledge Graph framework that integrates prediction and knowledge expansion in an iterative learning cycle. In the knowledge-aware depression detection phase, the LLM jointly performs depression detection and entity extraction, while the knowledge graph represents and weights these entities to refine prediction performance. In the knowledge refinement and expansion phase, new entities, relationships, and entity types extracted by the LLM are incorporated into the knowledge graph under expert supervision, enabling continual knowledge evolution. Using large-scale UGC, the framework enhances both predictive accuracy and medical understanding. Expert evaluations confirmed the discovery of clinically meaningful symptoms, comorbidities, and social triggers complementary to existing literature. We conceptualize and operationalize prediction-through-learning and learning-through-prediction as mutually reinforcing processes, advancing both methodological and theoretical understanding in predictive analytics. The framework demonstrates the co-evolution of computational models and domain knowledge, offering a foundation for adaptive, data-driven knowledge systems applicable to other dynamic risk monitoring contexts.
[353] Chain of Execution Supervision Promotes General Reasoning in Large Language Models
Nuo Chen, Zehua Li, Keqin Bao, Junyang Lin, Dayiheng Liu
Main category: cs.LG
TL;DR: TracePile is a large-scale corpus that transforms code execution into explicit chain-of-thought rationales (Chain of Execution) to improve reasoning in LLMs, showing consistent performance gains across multiple benchmarks.
Details
Motivation: Current LLMs struggle with implicit reasoning in code due to syntactic noise. Code's logical structure offers rich reasoning paradigms but needs explicit representation for effective training.
Method: Created TracePile corpus with 2.6M samples converting code execution to step-by-step Chain of Execution rationales. Used three training approaches: continue-pretraining, instruction tuning, and two-stage finetuning across four base models.
Result: Consistent improvements across 20 benchmarks in math, code, logic, and algorithms. LLaMA3.1-8B improved by 7.1% on average across nine math datasets, with gains on LiveCodeBench, CRUX, and MMLU.
Conclusion: Explicit representation of code reasoning through Chain of Execution significantly enhances LLM reasoning capabilities across diverse domains.
Abstract: Building robust and general reasoning ability is a central goal in the development of large language models (LLMs). Recent efforts increasingly turn to code as a rich training source, given its inherent logical structure and diverse reasoning paradigms such as divide-and-conquer, topological ordering, and enumeration. However, reasoning in code is often expressed implicitly and entangled with syntactic or implementation noise, making direct training on raw code suboptimal. To address this, we introduce TracePile, a large-scale corpus of 2.6 million samples that transforms code execution into explicit, step-by-step chain-of-thought-style rationales, which we call Chain of Execution (CoE). The corpus spans domains including mathematics, classical algorithms and algorithmic competition, and is enriched with variable-tracing questions and code rewritings to enhance logical granularity and code diversity. We evaluate TracePile using three training setups: continue-pretraining, instruction tuning after pretraining, and two-stage finetuning. Experiments across four base models (LLaMA 3, LLaMA 3.1, Qwen-2.5, and Qwen-2.5 Coder) and 20 benchmarks covering math, code, logic, and algorithms demonstrate consistent improvements. Notably, TracePile boosts LLaMA3.1-8B by 7.1% on average across nine math datasets and delivers clear gains on LiveCodeBench, CRUX, and MMLU under two-stage fine-tuning.
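A toy illustration of converting code execution into step-by-step rationales in the spirit of Chain of Execution: run a function under a tracer and record each executed line with its current local variables. The trace format is invented here and far simpler than TracePile's.
```python
import sys

def trace_execution(fn, *args):
    steps = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            steps.append(f"line {frame.f_lineno}: locals = {dict(frame.f_locals)}")
        return tracer
    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

def gcd(a, b):
    while b:
        a, b = b, a % b
    return a

value, rationale = trace_execution(gcd, 48, 18)
print(value)                  # 6
print("\n".join(rationale))   # one verbalizable step per executed line
```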
[354] NUM2EVENT: Interpretable Event Reasoning from Numerical time-series
Ninghui Feng, Yiyan Qi
Main category: cs.LG
TL;DR: The paper introduces number-to-event reasoning, a task to infer structured events from numerical time-series data, addressing limitations in LLMs’ numerical reasoning capabilities.
Details
Motivation: LLMs have strong multimodal reasoning but limited understanding of numerical time-series signals. Existing approaches focus on forecasting without explaining the latent events driving numerical changes or the reasoning process.
Method: Proposes a reasoning-aware framework with: agent-guided event extractor (AGE), marked multivariate Hawkes-based synthetic generator (EveDTS), and two-stage fine-tuning pipeline combining time-series encoder with structured decoder.
Result: Experiments on multi-domain datasets show the method substantially outperforms strong LLM baselines in event-level precision and recall.
Conclusion: The work provides a new direction for bridging quantitative reasoning and semantic understanding, enabling LLMs to explain and predict events directly from numerical dynamics.
Abstract: Large language models (LLMs) have recently demonstrated impressive multimodal reasoning capabilities, yet their understanding of purely numerical time-series signals remains limited. Existing approaches mainly focus on forecasting or trend description, without uncovering the latent events that drive numerical changes or explaining the reasoning process behind them. In this work, we introduce the task of number-to-event reasoning and decoding, which aims to infer interpretable structured events from numerical inputs, even when current text is unavailable. To address the data scarcity and semantic alignment challenges, we propose a reasoning-aware framework that integrates an agent-guided event extractor (AGE), a marked multivariate Hawkes-based synthetic generator (EveDTS), and a two-stage fine-tuning pipeline combining a time-series encoder with a structured decoder. Our model explicitly reasons over numerical changes, generates intermediate explanations, and outputs structured event hypotheses. Experiments on multi-domain datasets show that our method substantially outperforms strong LLM baselines in event-level precision and recall. These results suggest a new direction for bridging quantitative reasoning and semantic understanding, enabling LLMs to explain and predict events directly from numerical dynamics.
[355] Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling
Yuxuan Tang, Yifan Feng
Main category: cs.LG
TL;DR: RCPO is a unified framework that bridges preference optimization with ranked choice modeling via maximum likelihood estimation, supporting both utility-based and rank-based choice models and outperforming existing pairwise methods.
Details
Motivation: Current LLM alignment relies on pairwise preference optimization, which overlooks richer forms of human feedback like multiwise comparisons and top-k rankings.
Method: Proposed Ranked Choice Preference Optimization (RCPO) framework using maximum likelihood estimation, instantiated with Multinomial Logit and Mallows-RMJ choice models.
Result: Empirical studies on Llama-3-8B-Instruct and Gemma-2-9B-it across AlpacaEval 2 and Arena-Hard benchmarks show RCPO consistently outperforms competitive baselines.
Conclusion: RCPO demonstrates that directly leveraging ranked preference data with appropriate choice models yields more effective alignment and provides a versatile foundation for incorporating choice modeling into LLM training.
Abstract: Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiwise comparisons and top-$k$ rankings. We propose Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. The framework is flexible, supporting both utility-based and rank-based choice models. It subsumes several existing pairwise methods (e.g., DPO, SimPO), while providing principled training objectives for richer feedback formats. We instantiate this framework with two representative ranked choice models (Multinomial Logit and Mallows-RMJ). Empirical studies on Llama-3-8B-Instruct and Gemma-2-9B-it across AlpacaEval 2 and Arena-Hard benchmarks show that RCPO consistently outperforms competitive baselines. RCPO shows how directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers a versatile and extensible foundation for incorporating (ranked) choice modeling into LLM training.
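A hedged sketch of the Multinomial Logit (Plackett-Luce) instantiation: the training objective is the negative log-likelihood of an observed ranking given per-response scores. How RCPO derives those scores from the policy (e.g., DPO- or SimPO-style log-probability margins) is not shown here.
```python
import torch

def plackett_luce_nll(scores):
    """scores: (B, K) utilities of K responses, already ordered best-to-worst."""
    # P(ranking) = prod_i exp(s_i) / sum_{j >= i} exp(s_j); NLL sums the log denominators.
    rev_logcumsum = torch.logcumsumexp(scores.flip(-1), dim=-1).flip(-1)
    return (rev_logcumsum - scores).sum(dim=-1).mean()

scores = torch.tensor([[2.1, 1.3, 0.2, -0.5]], requires_grad=True)
loss = plackett_luce_nll(scores)
loss.backward()
```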
[356] Local Performance vs. Out-of-Distribution Generalization: An Empirical Analysis of Personalized Federated Learning in Heterogeneous Data Environments
Mortesa Hussaini, Jan Theiß, Anthony Stein
Main category: cs.LG
TL;DR: This paper addresses client drift in Federated Learning with heterogeneous data by proposing FLIU, a modified FedAvg with adaptive personalization, and evaluates both local performance and generalization across various data distributions.
Details
Motivation: Local models in Federated Learning converge to local optima during training, causing client drift when aggregated. Personalized FL focuses only on local performance but neglects generalization to out-of-distribution samples, which is crucial for robustness.
Method: Proposes FLIU (Federated Learning with Individualized Updates), extending FedAvg with an individualization step using adaptive personalization factor. Evaluates different stages within communication rounds and tests on MNIST and CIFAR-10 under IID, pathological non-IID, and novel Dirichlet distribution scenarios.
Result: Empirical evaluation shows FLIU’s performance across various distributional conditions, including challenging Dirichlet distributions designed to stress algorithms on complex data heterogeneity.
Conclusion: The study provides comprehensive evaluation of FL approaches considering both local performance and generalization, demonstrating FLIU’s effectiveness in handling heterogeneous data environments while maintaining generalization capabilities.
Abstract: In the context of Federated Learning with heterogeneous data environments, local models tend to converge to their own local model optima during local training steps, deviating from the overall data distributions. Aggregation of these local updates, e.g., with FedAvg, often does not align with the global model optimum (client drift), resulting in an update that is suboptimal for most clients. Personalized Federated Learning approaches address this challenge by exclusively focusing on the average local performances of clients’ models on their own data distribution. Generalization to out-of-distribution samples, which is a substantial benefit of FedAvg and represents a significant component of robustness, appears to be inadequately incorporated into the assessment and evaluation processes. This study involves a thorough evaluation of Federated Learning approaches, encompassing both their local performance and their generalization capabilities. Therefore, we examine different stages within a single communication round to enable a more nuanced understanding of the considered metrics. Furthermore, we propose and incorporate a modified approach of FedAvg, designated as Federated Learning with Individualized Updates (FLIU), extending the algorithm by a straightforward individualization step with an adaptive personalization factor. We evaluate and compare the approaches empirically using MNIST and CIFAR-10 under various distributional conditions, including benchmark IID and pathological non-IID, as well as additional novel test environments with Dirichlet distribution specifically developed to stress the algorithms on complex data heterogeneity.
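A minimal sketch of an individualization step with a personalization factor on top of FedAvg, in the spirit of FLIU; the simple interpolation rule and the meaning of alpha are illustrative guesses rather than the paper's exact update.
```python
def individualized_update(local_weights, global_weights, alpha):
    """Blend a client's local model with the FedAvg global model.
    alpha = 1.0 keeps the purely local model, alpha = 0.0 keeps pure FedAvg."""
    return {
        name: alpha * local_weights[name] + (1.0 - alpha) * global_weights[name]
        for name in global_weights
    }

local, global_w = {"w": 1.0}, {"w": 0.0}
print(individualized_update(local, global_w, alpha=0.3))  # {'w': 0.3}
```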
[357] LLMComp: A Language Modeling Paradigm for Error-Bounded Scientific Data Compression
Guozhong Li, Muhannad Alhumaidi, Spiros Skiadopoulos, Panos Kalnis
Main category: cs.LG
TL;DR: LLMCOMP uses decoder-only large language models for lossy compression of scientific spatiotemporal data, achieving up to 30% higher compression ratios than state-of-the-art methods under strict error bounds.
Details
Motivation: The rapid growth of high-resolution scientific simulations and observation systems generates massive spatiotemporal datasets, requiring efficient, error-bounded compression. LLMs have shown strong capabilities in modeling complex sequential data.
Method: Quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, applies coverage-guided sampling for training efficiency, and trains an autoregressive transformer with spatial-temporal embeddings. During compression, performs top-k prediction storing rank indices and fallback corrections.
Result: Experiments on multiple reanalysis datasets show LLMCOMP consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds.
Conclusion: LLMs have significant potential as general-purpose compressors for high-fidelity scientific data.
Abstract: The rapid growth of high-resolution scientific simulations and observation systems is generating massive spatiotemporal datasets, making efficient, error-bounded compression increasingly important. Meanwhile, decoder-only large language models (LLMs) have demonstrated remarkable capabilities in modeling complex sequential data. In this paper, we propose LLMCOMP, a novel lossy compression paradigm that leverages decoder-only large LLMs to model scientific data. LLMCOMP first quantizes 3D fields into discrete tokens, arranges them via Z-order curves to preserve locality, and applies coverage-guided sampling to enhance training efficiency. An autoregressive transformer is then trained with spatial-temporal embeddings to model token transitions. During compression, the model performs top-k prediction, storing only rank indices and fallback corrections to ensure strict error bounds. Experiments on multiple reanalysis datasets show that LLMCOMP consistently outperforms state-of-the-art compressors, achieving up to 30% higher compression ratios under strict error bounds. These results highlight the potential of LLMs as general-purpose compressors for high-fidelity scientific data.
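A small sketch of the Z-order (Morton) linearization used to serialize a 3D field into a locality-preserving token sequence; the bit width and grid size are illustrative.
```python
def morton3d(x, y, z, bits=10):
    """Interleave the bits of (x, y, z) into a single Z-order index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

# Order all grid cells of an 8x8x8 field along the Z-curve before tokenization.
order = sorted(((i, j, k) for i in range(8) for j in range(8) for k in range(8)),
               key=lambda p: morton3d(*p))
```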
[358] Monotone and Separable Set Functions: Characterizations and Neural Models
Soutrik Sarangi, Yonatan Sverdlov, Nadav Dym, Abir De
Main category: cs.LG
TL;DR: The paper introduces Monotone and Separating (MAS) set functions that preserve set containment relationships through vector embeddings, establishes theoretical bounds, and demonstrates practical benefits for set containment tasks.
Details
Motivation: Applications for set containment problems require functions that preserve the natural partial order of sets, where set inclusion corresponds to vector ordering.
Method: Design set-to-vector functions that satisfy MAS property, establish theoretical bounds on vector dimensions, propose weakly MAS model for infinite ground sets, and construct universal monotone models.
Result: MAS functions exist with bounded dimensions for finite sets but not for infinite sets; weakly MAS model provides stable alternative; experiments show improved performance on set containment tasks compared to standard models.
Conclusion: MAS functions enable effective set containment modeling, with theoretical guarantees and practical advantages, though infinite sets require relaxed weakly MAS approach.
Abstract: Motivated by applications for set containment problems, we consider the following fundamental problem: can we design set-to-vector functions so that the natural partial order on sets is preserved, namely $S\subseteq T \text{ if and only if } F(S)\leq F(T)$. We call functions satisfying this property Monotone and Separating (MAS) set functions. We establish lower and upper bounds for the vector dimension necessary to obtain MAS functions, as a function of the cardinality of the multisets and the underlying ground set. In the important case of an infinite ground set, we show that MAS functions do not exist, but provide a model which provably enjoys a relaxed MAS property we name “weakly MAS” and is stable in the sense of Hölder continuity. We also show that MAS functions can be used to construct universal models that are monotone by construction and can approximate all monotone set functions. Experimentally, we consider a variety of set containment tasks. The experiments show the benefit of using our model, in comparison with standard set models which do not incorporate set containment as an inductive bias. Our code is available in https://github.com/yonatansverdlov/Monotone-Embedding.
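A toy example (not from the paper) of a monotone and separating map for multisets over a finite ground set: the count vector, for which sub-multiset containment holds exactly when the embeddings are coordinate-wise ordered.
```python
from collections import Counter

def count_embedding(multiset, ground_set):
    """F(S): vector of element counts; S is contained in T iff F(S) <= F(T) coordinate-wise."""
    counts = Counter(multiset)
    return [counts[g] for g in ground_set]

ground = ["a", "b", "c"]
S, T = ["a", "b"], ["a", "a", "b", "c"]
FS, FT = count_embedding(S, ground), count_embedding(T, ground)
print(all(s <= t for s, t in zip(FS, FT)))  # True, reflecting the containment of S in T
```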
[359] Help the machine to help you: an evaluation in the wild of egocentric data cleaning via skeptical learning
Andrea Bontempelli, Matteo Busso, Leonardo Javier Malcotti, Fausto Giunchiglia
Main category: cs.LG
TL;DR: This paper evaluates Skeptical Learning (SKEL) in real-world conditions with actual users who can refine input labels, showing reduced annotation effort and improved data quality.
Details
Motivation: Digital personal assistants require high-quality annotations but user annotations often contain errors and noise. Previous SKEL research lacked end-user confirmation, which is crucial for accurate context evaluation.
Method: Conducted a 4-week study with university students using the iLog mobile application, where users could refine input labels based on their current perspectives and needs.
Result: Results show challenges in balancing user effort and data quality, but SKEL demonstrated reduced annotation effort and improved quality of collected data.
Conclusion: SKEL shows potential benefits for real-world applications by reducing annotation burden while maintaining data quality, though finding the right balance between user effort and data quality remains challenging.
Abstract: Any digital personal assistant, whether used to support task performance, answer questions, or manage work and daily life, including fitness schedules, requires high-quality annotations to function properly. However, user annotations, whether actively produced or inferred from context (e.g., data from smartphone sensors), are often subject to errors and noise. Previous research on Skeptical Learning (SKEL) addressed the issue of noisy labels by comparing offline active annotations with passive data, allowing for an evaluation of annotation accuracy. However, this evaluation did not include confirmation from end-users, the best judges of their own context. In this study, we evaluate SKEL’s performance in real-world conditions with actual users who can refine the input labels based on their current perspectives and needs. The study involves university students using the iLog mobile application on their devices over a period of four weeks. The results highlight the challenges of finding the right balance between user effort and data quality, as well as the potential benefits of using SKEL, which include reduced annotation effort and improved quality of collected data.
[360] Flight Delay Prediction via Cross-Modality Adaptation of Large Language Models and Aircraft Trajectory Representation
Thaweerath Phisannupawong, Joshua Julian Damanik, Han-Lim Choi
Main category: cs.LG
TL;DR: A lightweight LLM-based multimodal approach for flight delay prediction that integrates trajectory data with textual aeronautical information, achieving sub-minute prediction error.
Details
Motivation: Flight delays highlight inefficiencies in air traffic management and impact network performance, requiring better prediction methods from air traffic controllers' perspective.
Method: Adapts trajectory data into language modality and integrates with textual aeronautical information (flight info, weather reports, aerodrome notices) using multimodal LLM framework.
Result: Consistently achieves sub-minute prediction error by effectively leveraging contextual delay information and supports real-time updates.
Conclusion: Linguistic understanding combined with cross-modality adaptation of trajectory information enhances delay prediction with practical scalability for real-world operations.
Abstract: Flight delay prediction has become a key focus in air traffic management, as delays highlight inefficiencies that impact overall network performance. This paper presents a lightweight large language model-based multimodal flight delay prediction, formulated from the perspective of air traffic controllers monitoring aircraft delay after entering the terminal area. The approach integrates trajectory representations with textual aeronautical information, including flight information, weather reports, and aerodrome notices, by adapting trajectory data into the language modality to capture airspace conditions. Experimental results show that the model consistently achieves sub-minute prediction error by effectively leveraging contextual information related to the sources of delay. The framework demonstrates that linguistic understanding, when combined with cross-modality adaptation of trajectory information, enhances delay prediction. Moreover, the approach shows practicality and scalability for real-world operations, supporting real-time updates that refine predictions upon receiving new operational information.
[361] Combining Textual and Structural Information for Premise Selection in Lean
Job Petrovčič, David Eliecer Narvaez Denis, Ljupčo Todorovski
Main category: cs.LG
TL;DR: A graph-augmented approach combining text embeddings with graph neural networks outperforms language-based methods for premise selection in theorem proving by over 25%.
Details
Motivation: Existing language-based methods treat premises in isolation, ignoring the web of dependencies that connects them, which is a key bottleneck for scaling theorem proving.
Method: Combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph capturing state-premise and premise-premise relations.
Result: Outperforms the ReProver language-based baseline by over 25% across standard retrieval metrics on the LeanDojo Benchmark.
Conclusion: Relational information is powerful for more effective premise selection in theorem proving.
Abstract: Premise selection is a key bottleneck for scaling theorem proving in large formal libraries. Yet existing language-based methods often treat premises in isolation, ignoring the web of dependencies that connects them. We present a graph-augmented approach that combines dense text embeddings of Lean formalizations with graph neural networks over a heterogeneous dependency graph capturing both state–premise and premise–premise relations. On the LeanDojo Benchmark, our method outperforms the ReProver language-based baseline by over 25% across standard retrieval metrics. These results demonstrate the power of relational information for more effective premise selection.
[362] Integrating Genomics into Multimodal EHR Foundation Models
Jonathan Amar, Edward Liu, Alessandra Breschi, Liangliang Zhang, Pouya Kheradpour, Sylvia Li, Lisa Soleymani Lehmann, Alessandro Giulianelli, Matt Edwards, Yugang Jia, David Nola, Raghav Mani, Pankaj Vats, Jesse Tetreault, T. J. Chen, Cory Y. McLean
Main category: cs.LG
TL;DR: This paper presents an EHR foundation model that integrates Polygenic Risk Scores (PRS) with traditional EHR data, using All of Us program data to create holistic health profiles and improve disease prediction capabilities.
Details
Motivation: To move beyond traditional EHR-only approaches and build more comprehensive health profiles by integrating genetic predisposition data (PRS) with clinical data for better disease prediction and personalized healthcare.
Method: Develops a multimodal framework using All of Us Research Program data, extending generative AI advancements to EHR foundation models to learn complex relationships between clinical data and genetic predispositions.
Result: The model demonstrates strong predictive value for disease onset, particularly Type 2 Diabetes, and reveals the interplay between PRS and EHR data. It also shows versatility through transfer learning for custom classification tasks.
Conclusion: This integrated approach enables new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, advancing personalized and equitable healthcare evidence generation.
Abstract: This paper introduces an innovative Electronic Health Record (EHR) foundation model that integrates Polygenic Risk Scores (PRS) as a foundational data modality, moving beyond traditional EHR-only approaches to build more holistic health profiles. Leveraging the extensive and diverse data from the All of Us (AoU) Research Program, this multimodal framework aims to learn complex relationships between clinical data and genetic predispositions. The methodology extends advancements in generative AI to the EHR foundation model space, enhancing predictive capabilities and interpretability. Evaluation on AoU data demonstrates the model’s predictive value for the onset of various conditions, particularly Type 2 Diabetes (T2D), and illustrates the interplay between PRS and EHR data. The work also explores transfer learning for custom classification tasks, showcasing the architecture’s versatility and efficiency. This approach is pivotal for unlocking new insights into disease prediction, proactive health management, risk stratification, and personalized treatment strategies, laying the groundwork for more personalized, equitable, and actionable real-world evidence generation in healthcare.
[363] Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
Zihao Jing, Yan Sun, Yan Yi Li, Sugitha Janarthanan, Alana Deng, Pingzhao Hu
Main category: cs.LG
TL;DR: MuMo is a structured multimodal fusion framework that addresses 3D conformer unreliability and modality collapse in molecular representation through structured fusion and progressive injection mechanisms.
Details
Motivation: Multimodal molecular models suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization capabilities.
Method: Uses Structured Fusion Pipeline (SFP) to combine 2D topology and 3D geometry into a unified structural prior, and Progressive Injection (PI) mechanism to asymmetrically integrate this prior while preserving modality-specific modeling.
Result: Achieves 2.7% average improvement over best-performing baselines across 29 benchmark tasks, ranking first on 22 tasks including 27% improvement on LD50 task.
Conclusion: MuMo demonstrates robustness to 3D conformer noise and effective multimodal fusion in molecular representation, validated by superior performance across multiple benchmarks.
Abstract: Multimodal molecular models often suffer from 3D conformer unreliability and modality collapse, limiting their robustness and generalization. We propose MuMo, a structured multimodal fusion framework that addresses these challenges in molecular representation through two key strategies. To reduce the instability of conformer-dependent fusion, we design a Structured Fusion Pipeline (SFP) that combines 2D topology and 3D geometry into a unified and stable structural prior. To mitigate modality collapse caused by naive fusion, we introduce a Progressive Injection (PI) mechanism that asymmetrically integrates this prior into the sequence stream, preserving modality-specific modeling while enabling cross-modal enrichment. Built on a state space backbone, MuMo supports long-range dependency modeling and robust information propagation. Across 29 benchmark tasks from Therapeutics Data Commons (TDC) and MoleculeNet, MuMo achieves an average improvement of 2.7% over the best-performing baseline on each task, ranking first on 22 of them, including a 27% improvement on the LD50 task. These results validate its robustness to 3D conformer noise and the effectiveness of multimodal fusion in molecular representation. The code is available at: github.com/selmiss/MuMo.
[364] Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging
Aaron Wang, Zihan Zhao, Subash Katel, Vivekanand Gyanchand Sahu, Elham E Khoda, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte
Main category: cs.LG
TL;DR: SAL-T is a physics-inspired linear transformer that reduces computational complexity while maintaining performance in high-energy particle collision analysis, achieving comparable results to full-attention transformers with lower latency and resource usage.
Details
Motivation: Transformers are effective for particle collision analysis but have quadratic complexity that creates deployment challenges in high-throughput environments like CERN LHC, requiring substantial resources and increasing inference latency.
Method: SAL-T enhances the linformer architecture with spatially aware partitioning of particles based on kinematic features, computing attention between physically significant regions, and uses convolutional layers informed by jet physics to capture local correlations.
Result: SAL-T outperforms standard linformer in jet classification tasks and achieves comparable results to full-attention transformers while using considerably fewer resources with lower inference latency. This trend is confirmed on ModelNet10 point cloud classification dataset.
Conclusion: SAL-T provides an efficient transformer architecture that maintains performance while addressing computational challenges in high-data-throughput particle physics applications, offering a practical solution for deployment in resource-constrained environments.
Abstract: Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at https://github.com/aaronw5/SAL-T4HEP.
[365] Efficient Low Rank Attention for Long-Context Inference in Large Language Models
Tenghui Li, Guoxu Zhou, Xuyang Zhao, Yuning Qiu, Qibin Zhao
Main category: cs.LG
TL;DR: LRQK is a two-stage framework that decomposes query and key matrices into low-rank factors to reduce KV cache memory usage while maintaining exact attention outputs through a mixed GPU-CPU cache system.
Details
Motivation: As input text length grows, KV cache in LLMs imposes prohibitive GPU memory costs, limiting long-context inference on resource-constrained devices. Existing approaches like KV quantization and pruning reduce memory but suffer from precision loss or suboptimal KV pair retention.
Method: Two-stage framework: 1) Jointly decomposes full-precision query and key matrices into compact rank-r factors during prefill stage, 2) Uses low-dimensional projections to compute proxy attention scores in O(lr) time at each decode step, with top-k token selection and mixed GPU-CPU cache using hit-and-miss mechanism.
Result: Extensive experiments on RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B show LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal accuracy loss.
Conclusion: LRQK effectively reduces GPU memory costs for long-context inference while preserving exact attention outputs, making it suitable for resource-constrained devices.
Abstract: As the length of input text grows, the key-value (KV) cache in LLMs imposes prohibitive GPU memory costs and limits long-context inference on resource constrained devices. Existing approaches, such as KV quantization and pruning, reduce memory usage but suffer from numerical precision loss or suboptimal retention of key-value pairs. We introduce Low Rank Query and Key attention (LRQK), a two-stage framework that jointly decomposes the full-precision query and key matrices into compact rank-$r$ factors during the prefill stage, and then uses these low-dimensional projections to compute proxy attention scores in $\mathcal{O}(lr)$ time at each decode step. By selecting only the top-$k$ tokens and a small fixed set of recent tokens, LRQK employs a mixed GPU-CPU cache with a hit-and-miss mechanism that transfers only missing full-precision KV pairs, thereby preserving exact attention outputs while reducing CPU-GPU data movement. Extensive experiments on the RULER and LongBench benchmarks with LLaMA-3-8B and Qwen2.5-7B demonstrate that LRQK matches or surpasses leading sparse-attention methods in long context settings, while delivering significant memory savings with minimal loss in accuracy. Our code is available at https://github.com/tenghuilee/LRQK.
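A hedged sketch of the LRQK selection step: factor the cached keys into rank-r matrices once, score all tokens cheaply in the low-rank space, and keep only the top-k for exact attention. The SVD-based factorization below is an illustrative stand-in for the paper's joint query/key decomposition.
```python
import torch

def low_rank_topk(Q, K, r=16, k=64):
    """Q: (1, d) current query; K: (L, d) cached keys. Returns indices of top-k keys."""
    # Rank-r factorization of the key cache (done once, at prefill time).
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    K_low = U[:, :r] * S[:r]          # (L, r) compact key factors
    proj = Vh[:r]                     # (r, d) shared projection
    q_low = Q @ proj.T                # (1, r) query in the low-rank space
    proxy_scores = (q_low @ K_low.T).squeeze(0)   # O(L * r) instead of O(L * d)
    return torch.topk(proxy_scores, k).indices    # tokens to fetch at full precision

K = torch.randn(4096, 128)
Q = torch.randn(1, 128)
selected = low_rank_topk(Q, K)
```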
[366] Beyond Hidden-Layer Manipulation: Semantically-Aware Logit Interventions for Debiasing LLMs
Wei Xia
Main category: cs.LG
TL;DR: Static and Dynamic are zero-shot logits-layer debiasing methods that reduce bias by up to 70% with minimal fluency loss, outperforming hidden-layer approaches.
Details
Motivation: To develop effective debiasing methods for aligned LLMs that can reduce bias while maintaining fluency.
Method: Two zero-shot logits-layer debiasing methods: Static and Dynamic, using semantic-aware logits intervention.
Result: Dynamic method reduces bias by up to 70% with minimal fluency loss, and logits intervention outperforms hidden-layer approaches.
Conclusion: Semantic-aware logits intervention is stable and effective for debiasing aligned LLMs.
Abstract: We propose Static and Dynamic – two zero-shot logits-layer debiasing methods. Dynamic reduces bias by up to 70% with minimal fluency loss. Logits intervention outperforms hidden-layer approaches. We show that semantic-aware logits intervention is stable and effective for debiasing aligned LLMs.
[367] The Structural Scalpel: Automated Contiguous Layer Pruning for Large Language Models
Yao Lu, Yuqi Li, Wenbin Xie, Shanqing Yu, Qi Xuan, Zhaowei Zhu, Shiping Wen
Main category: cs.LG
TL;DR: CLP is a continuous layer pruning framework that uses differentiable concave gates and cutoff endpoint tuning to prune LLMs while maintaining performance, outperforming existing methods by significant margins.
Details
Motivation: LLMs face deployment challenges on edge devices due to large size and computational costs. Existing layer pruning methods ignore layer dependencies, disrupting information flow and degrading performance.
Method: Proposes CLP with two innovations: differentiable concave gate algorithm for automatic segment identification via gradient optimization, and cutoff endpoint tuning strategy that fine-tunes layers adjacent to pruned segments.
Result: CLP achieves 95.34% performance retention on LLaMA3-70B at 20% pruning rate, outperforming baselines by 4.29%-30.52%. Can be combined with quantization for further compression with minimal performance loss.
Conclusion: CLP effectively addresses layer pruning challenges by considering layer dependencies and provides superior performance retention compared to existing methods across various model architectures and sizes.
Abstract: Although large language models (LLMs) have achieved revolutionary breakthroughs in many fields, their large model size and high computational cost pose significant challenges for practical deployment on resource-constrained edge devices. To this end, layer pruning has been proposed to reduce the computational overhead by directly removing redundant layers. However, existing layer pruning methods typically rely on hand-crafted metrics to evaluate and remove individual layers, while ignoring the dependencies between layers. This can disrupt the model’s information flow and severely degrade performance. To address these issues, we propose CLP, a novel continuous layer pruning framework that introduces two key innovations: a differentiable concave gate algorithm that automatically identifies the best continuous layer segments for pruning via gradient-based optimization; and a cutoff endpoint tuning strategy that effectively restores model performance by fine-tuning only the layers adjacent to the pruned segments. Extensive experiments across multiple model architectures (including LLaMA2, LLaMA3 and Qwen) and sizes (from $7$B to $70$B parameters) show that CLP significantly outperforms existing state-of-the-art baselines. For example, at a pruning rate of $20\%$, CLP achieves an average performance retention of $95.34\%$ on LLaMA3-70B, outperforming baselines by $4.29\%$-$30.52\%$. Furthermore, CLP can be seamlessly combined with quantization to further compress the model with only a slight performance loss.
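A minimal PyTorch sketch of contiguous layer pruning with cutoff-endpoint tuning as described above: drop a contiguous block of layers, freeze the rest, and unfreeze only the layers now adjacent to the cut. The concave-gate search that selects the segment is not shown, and the indices are examples.
```python
import torch.nn as nn

def prune_contiguous(layers: nn.ModuleList, start: int, end: int) -> nn.ModuleList:
    """Drop layers[start:end]; freeze everything else, unfreeze the cut endpoints."""
    pruned = nn.ModuleList(layer for i, layer in enumerate(layers)
                           if not (start <= i < end))
    for p in pruned.parameters():
        p.requires_grad = False
    # Cutoff endpoint tuning: only the layers adjacent to the removed segment train.
    for idx in (start - 1, start):
        if 0 <= idx < len(pruned):
            for p in pruned[idx].parameters():
                p.requires_grad = True
    return pruned

layers = nn.ModuleList(nn.TransformerEncoderLayer(256, 4, batch_first=True) for _ in range(12))
pruned = prune_contiguous(layers, start=6, end=9)   # remove layers 6-8, tune layers 5 and 6
```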
[368] Error Adjustment Based on Spatiotemporal Correlation Fusion for Traffic Forecasting
Fuqiang Liu, Weiping Ding, Luis Miranda-Moreno, Lijun Sun
Main category: cs.LG
TL;DR: SAEA is a framework that adjusts spatiotemporally autocorrelated prediction errors in traffic forecasting by modeling errors as a VAR process, incorporating spatial structure through regularization, and dynamically refining predictions at test time.
Details
Motivation: Current DNN-based traffic forecasting models assume uncorrelated errors, but traffic data exhibits spatiotemporal autocorrelation that limits model performance. This gap is overlooked by existing studies.
Method: Models prediction errors as spatiotemporal VAR process, captures error correlations via coefficient matrix with structural sparse regularization for road network alignment, and implements test-time error adjustment for dynamic prediction refinement.
Result: The method enhances performance across different traffic datasets and various forecasting models, showing improvements in almost all cases.
Conclusion: SAEA effectively addresses spatiotemporal error autocorrelation in traffic forecasting, providing a general framework that improves model performance by systematically adjusting autocorrelated prediction errors.
Abstract: Deep neural networks (DNNs) play a significant role in an increasing body of research on traffic forecasting due to their effectively capturing spatiotemporal patterns embedded in traffic data. A general assumption of training the said forecasting models via mean squared error estimation is that the errors across time steps and spatial positions are uncorrelated. However, this assumption does not really hold because of the autocorrelation caused by both the temporality and spatiality of traffic data. This gap limits the performance of DNN-based forecasting models and is overlooked by current studies. To fill up this gap, this paper proposes Spatiotemporally Autocorrelated Error Adjustment (SAEA), a novel and general framework designed to systematically adjust autocorrelated prediction errors in traffic forecasting. Unlike existing approaches that assume prediction errors follow a random Gaussian noise distribution, SAEA models these errors as a spatiotemporal vector autoregressive (VAR) process to capture their intrinsic dependencies. First, it explicitly captures both spatial and temporal error correlations by a coefficient matrix, which is then embedded into a newly formulated cost function. Second, a structurally sparse regularization is introduced to incorporate prior spatial information, ensuring that the learned coefficient matrix aligns with the inherent road network structure. Finally, an inference process with test-time error adjustment is designed to dynamically refine predictions, mitigating the impact of autocorrelated errors in real-time forecasting. The effectiveness of the proposed approach is verified on different traffic datasets. Results across a wide range of traffic forecasting models show that our method enhances performance in almost all cases.
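A small numpy sketch of the test-time adjustment idea: fit a VAR model on recent prediction errors and add the predicted next error back onto the forecast. The lag order and plain least-squares fit are simplifications of SAEA's regularized, spatially structured estimator.
```python
import numpy as np

def var_error_adjustment(recent_errors, base_forecast, lags=2):
    """recent_errors: (T, N) errors at N sensors; base_forecast: (N,) model output."""
    T, N = recent_errors.shape
    # Stack lagged errors as predictors for a least-squares VAR fit.
    X = np.hstack([recent_errors[lags - l - 1: T - l - 1] for l in range(lags)])
    Y = recent_errors[lags:]
    A, *_ = np.linalg.lstsq(X, Y, rcond=None)            # stacked VAR coefficients
    # Predict the next error from the most recent lags and correct the forecast.
    x_next = np.hstack([recent_errors[T - l - 1] for l in range(lags)])
    return base_forecast + x_next @ A
```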
[369] A machine learning framework integrating seed traits and plasma parameters for predicting germination uplift in crops
Saklain Niam, Tashfiqur Rahman, Md. Amjad Patwary, Mukarram Hossain
Main category: cs.LG
TL;DR: First machine learning framework to predict cold plasma germination uplift using Extra Trees model, achieving R²=0.925 with feature reduction. Identified hormetic response patterns and species-specific prediction accuracy.
Details
Motivation: Cold plasma is an eco-friendly seed germination enhancement method, but outcomes are difficult to predict due to complex seed-plasma-environment interactions. Need for reliable prediction framework.Method: Used machine learning models (GB, XGB, ET, and hybrids) with dielectric barrier discharge plasma data on soybean, barley, sunflower, radish, and tomato. Applied feature reduction and embedded framework in MLflow.
Result: Extra Trees model performed best with R²=0.919 (improved to 0.925 after feature reduction). Identified hormetic response: negligible effects <7kV/<200s, maximum germination at 7-15kV for 200-500s, reduced beyond 20kV. Radish and soybean showed highest prediction consistency.
Conclusion: Successfully developed ML framework for predicting cold plasma germination effects, providing decision-support tool for precision agriculture optimization.
Abstract: Cold plasma (CP) is an eco-friendly method to enhance seed germination, yet outcomes remain difficult to predict due to complex seed–plasma–environment interactions. This study introduces the first machine learning framework to forecast germination uplift in soybean, barley, sunflower, radish, and tomato under dielectric barrier discharge (DBD) plasma. Among the models tested (GB, XGB, ET, and hybrids), Extra Trees (ET) performed best (R\textsuperscript{2} = 0.919; RMSE = 3.21; MAE = 2.62), improving to R\textsuperscript{2} = 0.925 after feature reduction. Engineering analysis revealed a hormetic response: negligible effects at $<$7 kV or $<$200 s, maximum germination at 7–15 kV for 200–500 s, and reduced germination beyond 20 kV or prolonged exposures. Discharge power was also a dominant factor, with germination rate maximizing at $\geq$100 W with low exposure time. Species and cultivar-level predictions showed radish (MAE = 1.46) and soybean (MAE = 2.05) were modeled with high consistency, while sunflower remained slightly more variable (MAE = 3.80). Among cultivars, Williams (MAE = 1.23) and Sari (1.33) were well predicted, while Arian (2.86) and Nyírségi fekete (3.74) were comparatively poorly captured. This framework was also embedded into MLflow, providing a decision-support tool for optimizing CP seed germination in precision agriculture.
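As a rough illustration of the modeling pipeline described above, the sketch below fits an Extra Trees regressor with a feature-reduction step and evaluates it with five-fold cross-validation; the feature names, synthetic data, and hyperparameters are placeholders, not the paper's dataset or settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Placeholder features loosely following the paper's inputs (voltage, exposure
# time, discharge power, species encoding); values here are synthetic.
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "voltage_kV": rng.uniform(3, 25, 300),
    "exposure_s": rng.uniform(50, 800, 300),
    "power_W": rng.uniform(20, 150, 300),
    "species_id": rng.integers(0, 5, 300),
})
# Synthetic target mimicking a hormetic response (peak at moderate doses).
y = (15 * np.exp(-((X["voltage_kV"] - 11) / 5) ** 2)
     * np.exp(-((X["exposure_s"] - 350) / 200) ** 2)
     + rng.normal(0, 1.5, 300))

model = Pipeline([
    # Drop weak features before the final regressor (stand-in for the
    # paper's feature-reduction step).
    ("select", SelectFromModel(ExtraTreesRegressor(n_estimators=200, random_state=0))),
    ("et", ExtraTreesRegressor(n_estimators=500, random_state=0)),
])
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("CV R^2:", scores.mean().round(3))
```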
[370] Aligning Diffusion Language Models via Unpaired Preference Optimization
Vaibhav Jindal, Hejian Sang, Chun-Mao Lai, Yanning Chen, Zhipeng Wang
Main category: cs.LG
TL;DR: ELBO-KTO combines ELBO surrogate for diffusion log-likelihoods with unpaired preference optimization (KTO) to align diffusion language models, achieving strong performance without costly pairwise preference data.
Details
Motivation: Aligning diffusion language models to human preferences is challenging due to intractable sequence log-likelihoods and costly pairwise preference data collection.Method: Uses ELBO surrogate for diffusion log-likelihoods combined with Kahneman Tversky Optimization (KTO), a prospect-theoretic unpaired preference objective, with variance-reduction techniques for stable training.
Result: Achieves 65.9% and 62.3% adjusted win rates on kto-mix-14k and UltraFeedback-Binary datasets, performing on par or better than base model across GSM8K, MMLU, and other reasoning/knowledge benchmarks.
Conclusion: Establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
Abstract: Diffusion language models (dLLMs) are an emerging alternative to autoregressive (AR) generators, but aligning them to human preferences is challenging because sequence log-likelihoods are intractable and pairwise preference data are costly to collect. We introduce ELBO-KTO, which combines an ELBO surrogate for diffusion log-likelihoods with a prospect-theoretic, unpaired preference objective (Kahneman Tversky Optimization, KTO). We analyze the bias and variance induced by the ELBO substitution and employ variance-reduction practices that stabilize gradients during training. Applied to LLaDA-8B-Instruct, ELBO-KTO yields \textbf{65.9%} and \textbf{62.3%} adjusted win rates on kto-mix-14k and UltraFeedback-Binary, respectively, versus the base model under an automatic LLM judge. Across downstream tasks, including GSM8K, MMLU, and additional reasoning/knowledge benchmarks, ELBO-KTO trained on UltraFeedback-Binary performs on par with or better than the base model under identical decoding. This establishes unpaired preference optimization as a viable alternative to pairwise alignment in diffusion LLMs.
[371] Quantum Machine Learning for Image Classification: A Hybrid Model of Residual Network with Quantum Support Vector Machine
Md. Farhan Shahriyar, Gazi Tanbhir, Abdullah Md Raihan Chy
Main category: cs.LG
TL;DR: Hybrid quantum-classical approach combining ResNet-50 for feature extraction and Quantum SVM for classification achieves 99.23% accuracy in potato disease detection, outperforming classical models.
Details
Motivation: Classical ML and deep learning struggle with high-dimensional complex datasets, requiring quantum computing to improve classification efficiency in image classification tasks.Method: ResNet-50 extracts features from potato disease RGB images, PCA reduces dimensionality, then QSVM with quantum feature maps (ZZ, Z, Pauli-X) transforms classical data to quantum states for classification.
Result: Z-feature map-based QSVM achieved 99.23% accuracy, outperforming classical SVM and Random Forest models in potato disease classification.
Conclusion: Integration of quantum computing with classical deep learning provides advantages for image classification and offers a promising solution for disease detection through hybrid quantum-classical modeling.
Abstract: Recently, there has been growing attention on combining quantum machine learning (QML) with classical deep learning approaches, as computational techniques are key to improving the performance of image classification tasks. This study presents a hybrid approach that uses ResNet-50 (Residual Network) for feature extraction and Quantum Support Vector Machines (QSVM) for classification in the context of potato disease detection. Classical machine learning as well as deep learning models often struggle with high-dimensional and complex datasets, necessitating advanced techniques like quantum computing to improve classification efficiency. In our research, we use ResNet-50 to extract deep feature representations from RGB images of potato diseases. These features are then subjected to dimensionality reduction using Principal Component Analysis (PCA). The resulting features are processed through QSVM models which apply various quantum feature maps such as ZZ, Z, and Pauli-X to transform classical data into quantum states. To assess the model performance, we compared it with classical machine learning algorithms such as Support Vector Machine (SVM) and Random Forest (RF) using five-fold stratified cross-validation for comprehensive evaluation. The experimental results demonstrate that the Z-feature map-based QSVM outperforms classical models, achieving an accuracy of 99.23 percent, surpassing both SVM and RF models. This research highlights the advantages of integrating quantum computing into image classification and provides a potential disease detection solution through hybrid quantum-classical modeling.
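The classical half of this pipeline can be sketched as follows; a classical RBF SVM stands in for the QSVM (the quantum feature maps require a quantum SDK), and the image-folder path and PCA dimensionality are assumptions for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
from torchvision import models, transforms
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# ResNet-50 backbone with the classification head removed -> 2048-d features.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = nn.Identity()
backbone.eval()

tfm = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
dataset = ImageFolder("potato_leaves/", transform=tfm)   # placeholder path
loader = DataLoader(dataset, batch_size=32)

feats, labels = [], []
with torch.no_grad():
    for x, y in loader:
        feats.append(backbone(x).numpy())
        labels.append(y.numpy())
feats, labels = np.concatenate(feats), np.concatenate(labels)

# PCA to a qubit-friendly dimensionality, then an SVM. The paper feeds these
# PCA features to a QSVM with Z/ZZ/Pauli feature maps; an RBF SVC stands in here.
X = PCA(n_components=8).fit_transform(feats)
clf = SVC(kernel="rbf").fit(X, labels)
```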
[372] Quanvolutional Neural Networks for Pneumonia Detection: An Efficient Quantum-Assisted Feature Extraction Paradigm
Gazi Tanbhir, Md. Farhan Shahriyar, Abdullah Md Raihan Chy
Main category: cs.LG
TL;DR: A hybrid quantum-classical model using Quanvolutional Neural Networks (QNNs) achieves 83.33% validation accuracy for pneumonia detection, outperforming classical CNNs (73.33%) on the PneumoniaMNIST dataset.
Details
Motivation: To overcome limitations of CNNs in pneumonia detection, including high computational costs, limited feature representation, and poor generalization from small datasets.Method: A hybrid quantum-classical model with a quanvolutional layer using parameterized quantum circuits (PQC) to process 2x2 image patches, employing Y-gates for encoding and entangling layers for non-classical feature extraction, followed by classical neural network classification.
Result: QNN achieved 83.33% validation accuracy vs 73.33% for classical CNN, showing enhanced convergence and sample efficiency.
Conclusion: QNNs offer a computationally efficient alternative for medical image analysis, particularly with limited labeled data, and provide foundation for integrating quantum computing into medical diagnostic systems.
Abstract: Pneumonia poses a significant global health challenge, demanding accurate and timely diagnosis. While deep learning, particularly Convolutional Neural Networks (CNNs), has shown promise in medical image analysis for pneumonia detection, CNNs often suffer from high computational costs, limitations in feature representation, and challenges in generalizing from smaller datasets. To address these limitations, we explore the application of Quanvolutional Neural Networks (QNNs), leveraging quantum computing for enhanced feature extraction. This paper introduces a novel hybrid quantum-classical model for pneumonia detection using the PneumoniaMNIST dataset. Our approach utilizes a quanvolutional layer with a parameterized quantum circuit (PQC) to process 2x2 image patches, employing rotational Y-gates for data encoding and entangling layers to generate non-classical feature representations. These quantum-extracted features are then fed into a classical neural network for classification. Experimental results demonstrate that the proposed QNN achieves a higher validation accuracy of 83.33 percent compared to a comparable classical CNN which achieves 73.33 percent. This enhanced convergence and sample efficiency highlight the potential of QNNs for medical image analysis, particularly in scenarios with limited labeled data. This research lays the foundation for integrating quantum computing into deep-learning-driven medical diagnostic systems, offering a computationally efficient alternative to traditional approaches.
[373] AI-Driven Carbon Monitoring: Transformer-Based Reconstruction of Atmospheric CO2 in Canadian Poultry Regions
Padmanabhan Jagannathan Prajesh, Kaliaperumal Ragunath, Miriam Gordon, Bruce Rathgeber, Suresh Neethirajan
Main category: cs.LG
TL;DR: ST-ViWT framework uses wavelet-enhanced vision transformers to reconstruct continuous XCO2 fields from sparse OCO-2 satellite data, achieving high accuracy (R2=0.984) and enabling carbon accounting for agricultural emissions.
Details
Motivation: Accurate mapping of column-averaged CO2 (XCO2) over agricultural landscapes is essential for guiding emission mitigation strategies, particularly in poultry-intensive regions where facility density may correlate with CO2 emissions.Method: Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) framework that fuses wavelet time-frequency representations with transformer attention over meteorology, vegetation indices, topography, and land cover data from OCO-2 satellite observations.
Result: Achieved R2 = 0.984 and RMSE = 0.468 ppm on 2024 OCO-2 data; 92.3% of gap-filled predictions within ±1 ppm. Independent TCCON validation showed robust generalization (bias = -0.14 ppm; r = 0.928). Spatial analysis revealed moderate positive association between poultry facility density and XCO2 (r = 0.43).
Conclusion: ST-ViWT enables seamless 0.25 degree CO2 surfaces with explicit uncertainties, supporting integration of satellite constraints with national inventories and precision livestock platforms for scalable, transparent carbon accounting and policy-relevant mitigation assessment.
Abstract: Accurate mapping of column-averaged CO2 (XCO2) over agricultural landscapes is essential for guiding emission mitigation strategies. We present a Spatiotemporal Vision Transformer with Wavelets (ST-ViWT) framework that reconstructs continuous, uncertainty-quantified XCO2 fields from OCO-2 across southern Canada, emphasizing poultry-intensive regions. The model fuses wavelet time-frequency representations with transformer attention over meteorology, vegetation indices, topography, and land cover. On 2024 OCO-2 data, ST-ViWT attains R2 = 0.984 and RMSE = 0.468 ppm; 92.3 percent of gap-filled predictions lie within +/-1 ppm. Independent validation with TCCON shows robust generalization (bias = -0.14 ppm; r = 0.928), including faithful reproduction of the late-summer drawdown. Spatial analysis across 14 poultry regions reveals a moderate positive association between facility density and XCO2 (r = 0.43); high-density areas exhibit larger seasonal amplitudes (9.57 ppm) and enhanced summer variability. Compared with conventional interpolation and standard machine-learning baselines, ST-ViWT yields seamless 0.25 degree CO2 surfaces with explicit uncertainties, enabling year-round coverage despite sparse observations. The approach supports integration of satellite constraints with national inventories and precision livestock platforms to benchmark emissions, refine region-specific factors, and verify interventions. Importantly, transformer-based Earth observation enables scalable, transparent, spatially explicit carbon accounting, hotspot prioritization, and policy-relevant mitigation assessment.
[374] Transformers from Compressed Representations
Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem
Main category: cs.LG
TL;DR: TEMPEST is a method that uses compressed file byte-streams for representation learning, enabling transformers to learn semantic representations directly from compressed data without raw byte processing or full decoding.
Details
Motivation: Compressed file formats are efficient for storage and transmission but their potential for representation learning remains largely unexplored.Method: Exploits inherent byte-stream structure of compressed files to design tokenization and encoding strategy, allowing standard transformers to learn directly from compressed data streams.
Result: Achieves competitive accuracy with state-of-the-art while substantially reducing token count for semantic classification, lowering computational complexity and memory usage across diverse datasets and modalities.
Conclusion: TEMPEST demonstrates that compressed representations can be effectively leveraged for efficient representation learning, bypassing the need for raw byte processing or full media decoding.
Abstract: Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a method that exploits the inherent byte-stream structure of compressed files to design an effective tokenization and encoding strategy. By leveraging this compact encoding, a standard transformer can directly learn semantic representations from compressed data streams, bypassing the need for raw byte-level processing or full media decoding. Our proposal substantially reduces the number of tokens required for semantic classification, thereby lowering both computational complexity and memory usage. Through extensive experiments across diverse datasets, coding schemes, and modalities, we show that TEMPEST achieves accuracy competitive with the state-of-the-art while delivering efficiency gains in memory and compute.
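Since the paper's exact tokenization is not reproduced here, the following sketch only illustrates the general idea under simple assumptions: raw bytes are zlib-compressed, each compressed byte becomes a token, and a small transformer encoder classifies the stream; all sizes and names are illustrative.

```python
import zlib
import torch
import torch.nn as nn

class CompressedByteClassifier(nn.Module):
    """Toy classifier over compressed byte streams (a stand-in for TEMPEST's
    tokenization; the actual scheme exploits the codec's stream structure)."""
    def __init__(self, num_classes, d_model=128, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(256, d_model)          # one token per byte value
        self.pos = nn.Parameter(torch.zeros(max_len, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, num_classes)
        self.max_len = max_len

    def forward(self, byte_batch):                        # (B, L) int64 in [0, 255]
        h = self.embed(byte_batch) + self.pos[: byte_batch.size(1)]
        h = self.encoder(h)
        return self.head(h.mean(dim=1))                   # mean-pool then classify

def to_tokens(raw: bytes, max_len=512):
    comp = zlib.compress(raw)[:max_len]                   # compressed byte stream
    return torch.tensor(list(comp), dtype=torch.long).unsqueeze(0)

model = CompressedByteClassifier(num_classes=10)
logits = model(to_tokens(b"example payload" * 50))
```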
[375] Optimize Any Topology: A Foundation Model for Shape- and Resolution-Free Structural Topology Optimization
Amin Heyrani Nobari, Lyle Regenwetter, Cyril Picard, Ligong Han, Faez Ahmed
Main category: cs.LG
TL;DR: OAT is a foundation-model framework that directly predicts minimum-compliance layouts for arbitrary aspect ratios, resolutions, volume fractions, loads, and fixtures using a resolution- and shape-agnostic autoencoder with implicit neural-field decoder and conditional latent-diffusion model.
Details
Motivation: Existing deep-learning methods for structural topology optimization are limited to fixed square grids, few hand-coded boundary conditions, and post-hoc optimization, preventing general deployment.Method: Combines resolution- and shape-agnostic autoencoder with implicit neural-field decoder and conditional latent-diffusion model trained on OpenTO corpus of 2.2 million optimized structures covering 2 million unique boundary-condition configurations.
Result: Lowers mean compliance up to 90% relative to best prior models, delivers sub-1 second inference on single GPU across resolutions from 64x64 to 256x256 and aspect ratios as high as 10:1.
Conclusion: OAT is established as a general, fast, and resolution-free framework for physics-aware topology optimization and provides a large-scale dataset to spur further research in generative modeling for inverse design.
Abstract: Structural topology optimization (TO) is central to engineering design but remains computationally intensive due to complex physics and hard constraints. Existing deep-learning methods are limited to fixed square grids, a few hand-coded boundary conditions, and post-hoc optimization, preventing general deployment. We introduce Optimize Any Topology (OAT), a foundation-model framework that directly predicts minimum-compliance layouts for arbitrary aspect ratios, resolutions, volume fractions, loads, and fixtures. OAT combines a resolution- and shape-agnostic autoencoder with an implicit neural-field decoder and a conditional latent-diffusion model trained on OpenTO, a new corpus of 2.2 million optimized structures covering 2 million unique boundary-condition configurations. On four public benchmarks and two challenging unseen tests, OAT lowers mean compliance up to 90% relative to the best prior models and delivers sub-1 second inference on a single GPU across resolutions from 64 x 64 to 256 x 256 and aspect ratios as high as 10:1. These results establish OAT as a general, fast, and resolution-free framework for physics-aware topology optimization and provide a large-scale dataset to spur further research in generative modeling for inverse design. Code & data can be found at https://github.com/ahnobari/OptimizeAnyTopology.
[376] Traffic flow forecasting, STL decomposition, Hybrid model, LSTM, ARIMA, XGBoost, Intelligent transportation systems
Fujiang Yuan, Yangrui Fan, Xiaohuan Bing, Zhen Tian, Chunhong Yuan, Yankang Li
Main category: cs.LG
TL;DR: A hybrid traffic flow forecasting framework using STL decomposition with LSTM, ARIMA, and XGBoost models that outperforms individual models.
Details
Motivation: Single model approaches fail to capture complex, nonlinear, and multi-scale temporal patterns in traffic flow data, requiring a more sophisticated approach.Method: STL decomposition separates time series into trend, seasonal, and residual components, then LSTM models trends, ARIMA captures seasonality, and XGBoost predicts residuals, with multiplicative integration of predictions.
Result: The hybrid model significantly outperformed standalone LSTM, ARIMA, and XGBoost models across MAE, RMSE, and R-squared metrics using 998 NYC traffic flow records.
Conclusion: The decomposition strategy effectively isolates temporal characteristics, allowing model specialization that improves prediction accuracy, interpretability, and robustness.
Abstract: Accurate traffic flow forecasting is essential for intelligent transportation systems and urban traffic management. However, single-model approaches often fail to capture the complex, nonlinear, and multi-scale temporal patterns in traffic flow data. This study proposes a decomposition-driven hybrid framework that integrates Seasonal-Trend decomposition using Loess (STL) with three complementary predictive models. STL first decomposes the original time series into trend, seasonal, and residual components. Then, a Long Short-Term Memory (LSTM) network models long-term trends, an Autoregressive Integrated Moving Average (ARIMA) model captures seasonal periodicity, and an Extreme Gradient Boosting (XGBoost) algorithm predicts nonlinear residual fluctuations. The final forecast is obtained through multiplicative integration of the sub-model predictions. Using 998 traffic flow records from a New York City intersection between November and December 2015, results show that the LSTM-ARIMA-XGBoost hybrid model significantly outperforms standalone models including LSTM, ARIMA, and XGBoost across MAE, RMSE, and R-squared metrics. The decomposition strategy effectively isolates temporal characteristics, allowing each model to specialize, thereby improving prediction accuracy, interpretability, and robustness.
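A minimal sketch of the decompose-then-specialize recipe, with assumptions: a linear drift extrapolation stands in for the LSTM trend branch, lag-feature XGBoost handles residuals, and the paper's multiplicative recombination is applied as stated (with an additive STL decomposition one would instead sum the parts).

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL
from statsmodels.tsa.arima.model import ARIMA
from xgboost import XGBRegressor

def hybrid_forecast(series: pd.Series, period: int, horizon: int, n_lags: int = 12):
    # 1) Decompose into trend + seasonal + residual.
    parts = STL(series, period=period).fit()
    trend, seasonal, resid = parts.trend, parts.seasonal, parts.resid

    # 2) Trend: linear drift extrapolation (stand-in for the paper's LSTM branch).
    slope = (trend.iloc[-1] - trend.iloc[-period]) / period
    trend_fc = trend.iloc[-1] + slope * np.arange(1, horizon + 1)

    # 3) Seasonality: ARIMA on the seasonal component.
    seasonal_fc = np.asarray(ARIMA(seasonal, order=(2, 0, 2)).fit().forecast(horizon))

    # 4) Residuals: XGBoost on lagged residual features, rolled forward.
    X = np.column_stack([resid.shift(i).to_numpy() for i in range(1, n_lags + 1)])
    y = resid.to_numpy()
    xgb = XGBRegressor(n_estimators=200).fit(X[n_lags:], y[n_lags:])
    window = list(resid.iloc[-n_lags:])          # chronological, oldest first
    resid_fc = []
    for _ in range(horizon):
        nxt = float(xgb.predict(np.array(window[::-1]).reshape(1, -1))[0])
        resid_fc.append(nxt)
        window = window[1:] + [nxt]

    # 5) Recombine. The paper reports multiplicative integration of the
    #    sub-model predictions; an additive variant would sum the parts.
    return trend_fc * seasonal_fc * np.array(resid_fc)
```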
[377] Sparsity and Superposition in Mixture of Experts
Marmik Chaudhari, Jeremi Nuer, Rome Thorstenson
Main category: cs.LG
TL;DR: MoE models differ from dense networks in how they use superposition, with network sparsity (active experts ratio) being key to understanding their behavior rather than feature sparsity/importance.
Details
Motivation: To understand the mechanistic differences between Mixture of Experts (MoE) models and dense networks, particularly how superposition works differently in MoEs.Method: Developed new metrics for measuring superposition across experts, analyzed how network sparsity affects feature representation, and proposed new definition of expert specialization based on monosemantic features.
Result: Found that greater network sparsity leads to greater monosemanticity, and experts naturally organize around coherent feature combinations with proper initialization.
Conclusion: Network sparsity in MoEs may enable more interpretable models without performance loss, challenging the assumption that interpretability and capability are fundamentally opposed.
Abstract: Mixture of Experts (MoE) models have become central to scaling large language models, yet their mechanistic differences from dense networks remain poorly understood. Previous work has explored how dense models use \textit{superposition} to represent more features than dimensions, and how superposition is a function of feature sparsity and feature importance. MoE models cannot be explained mechanistically through the same lens. We find that neither feature sparsity nor feature importance cause discontinuous phase changes, and that network sparsity (the ratio of active to total experts) better characterizes MoEs. We develop new metrics for measuring superposition across experts. Our findings demonstrate that models with greater network sparsity exhibit greater \emph{monosemanticity}. We propose a new definition of expert specialization based on monosemantic feature representation rather than load balancing, showing that experts naturally organize around coherent feature combinations when initialized appropriately. These results suggest that network sparsity in MoEs may enable more interpretable models without sacrificing performance, challenging the common assumption that interpretability and capability are fundamentally at odds.
[378] DBLoss: Decomposition-based Loss Function for Time Series Forecasting
Xiangfei Qiu, Xingjian Wu, Hanyin Cheng, Xvyuan Liu, Chenjuan Guo, Jilin Hu, Bin Yang
Main category: cs.LG
TL;DR: Proposes DBLoss, a decomposition-based loss function that separates time series into seasonal and trend components using exponential moving averages, then calculates weighted losses for each component to improve forecasting accuracy.
Details
Motivation: Existing MSE loss functions often fail to capture seasonality and trend patterns effectively in time series forecasting, even when decomposition modules are used in model architectures.Method: Uses exponential moving averages to decompose time series into seasonal and trend components within the forecasting horizon, calculates separate losses for each component, and weights them appropriately.
Result: Extensive experiments show DBLoss significantly improves performance of state-of-the-art forecasting models across diverse real-world datasets.
Conclusion: DBLoss provides an effective general loss function that can be combined with any deep learning forecasting model and offers a new perspective for time series loss function design.
Abstract: Time series forecasting holds significant value in various domains such as economics, traffic, energy, and AIOps, as accurate predictions facilitate informed decision-making. However, the existing Mean Squared Error (MSE) loss function sometimes fails to accurately capture the seasonality or trend within the forecasting horizon, even when decomposition modules are used in the forward propagation to model the trend and seasonality separately. To address these challenges, we propose a simple yet effective Decomposition-Based Loss function called DBLoss. This method uses exponential moving averages to decompose the time series into seasonal and trend components within the forecasting horizon, and then calculates the loss for each of these components separately, followed by weighting them. As a general loss function, DBLoss can be combined with any deep learning forecasting model. Extensive experiments demonstrate that DBLoss significantly improves the performance of state-of-the-art models across diverse real-world datasets and provides a new perspective on the design of time series loss functions.
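A compact PyTorch rendering of the stated idea, with assumed settings: an exponential moving average supplies the trend within the horizon, the remainder plays the role of the seasonal part, and the two MSE terms are weighted; the smoothing factor and weights are placeholders, not the paper's choices.

```python
import torch
import torch.nn as nn

class DBLoss(nn.Module):
    """Decomposition-based loss sketch: split prediction and target into a
    trend (EMA) and a seasonal (remainder) part, then weight the two MSEs."""
    def __init__(self, alpha=0.3, w_trend=1.0, w_seasonal=1.0):
        super().__init__()
        self.alpha, self.w_trend, self.w_seasonal = alpha, w_trend, w_seasonal

    def _ema(self, x):                       # x: (batch, horizon, channels)
        out = [x[:, 0]]
        for t in range(1, x.size(1)):
            out.append(self.alpha * x[:, t] + (1 - self.alpha) * out[-1])
        return torch.stack(out, dim=1)

    def forward(self, pred, target):
        pred_trend, target_trend = self._ema(pred), self._ema(target)
        pred_seas, target_seas = pred - pred_trend, target - target_trend
        loss_trend = nn.functional.mse_loss(pred_trend, target_trend)
        loss_seas = nn.functional.mse_loss(pred_seas, target_seas)
        return self.w_trend * loss_trend + self.w_seasonal * loss_seas

# usage with any forecasting model: criterion = DBLoss(); loss = criterion(model(x), y)
```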
[379] Informed Initialization for Bayesian Optimization and Active Learning
Carl Hvarfner, David Eriksson, Eytan Bakshy, Max Balandat
Main category: cs.LG
TL;DR: HIPE is a novel acquisition strategy for Bayesian Optimization that balances predictive uncertainty reduction with hyperparameter learning during initialization, outperforming standard space-filling designs in few-shot settings.
Details
Motivation: Standard (quasi-)random initialization designs for Bayesian Optimization neglect that space-filling may not reduce predictive uncertainty effectively and can conflict with efficient hyperparameter learning, which is crucial for good surrogate model quality in few-shot settings.Method: Proposed Hyperparameter-Informed Predictive Exploration (HIPE), an information-theoretic acquisition strategy that balances predictive uncertainty reduction with hyperparameter learning. Derived a closed-form expression for HIPE in the Gaussian Process setting.
Result: HIPE outperforms standard initialization strategies in predictive accuracy, hyperparameter identification, and subsequent optimization performance, especially in large-batch, few-shot settings relevant to real-world Bayesian Optimization applications.
Conclusion: HIPE provides an effective initialization strategy that addresses limitations of traditional space-filling designs by explicitly considering both predictive uncertainty reduction and hyperparameter learning needs.
Abstract: Bayesian Optimization is a widely used method for optimizing expensive black-box functions, relying on probabilistic surrogate models such as Gaussian Processes. The quality of the surrogate model is crucial for good optimization performance, especially in the few-shot setting where only a small number of batches of points can be evaluated. In this setting, the initialization plays a critical role in shaping the surrogate’s predictive quality and guiding subsequent optimization. Despite this, practitioners typically rely on (quasi-)random designs to cover the input space. However, such approaches neglect two key factors: (a) space-filling designs may not be desirable to reduce predictive uncertainty, and (b) efficient hyperparameter learning during initialization is essential for high-quality prediction, which may conflict with space-filling designs. To address these limitations, we propose Hyperparameter-Informed Predictive Exploration (HIPE), a novel acquisition strategy that balances predictive uncertainty reduction with hyperparameter learning using information-theoretic principles. We derive a closed-form expression for HIPE in the Gaussian Process setting and demonstrate its effectiveness through extensive experiments in active learning and few-shot BO. Our results show that HIPE outperforms standard initialization strategies in terms of predictive accuracy, hyperparameter identification, and subsequent optimization performance, particularly in large-batch, few-shot settings relevant to many real-world Bayesian Optimization applications.
[380] Beyond Prompt Engineering: Neuro-Symbolic-Causal Architecture for Robust Multi-Objective AI Agents
Gokturk Aytug Akarlar
Main category: cs.LG
TL;DR: Chimera is a neuro-symbolic-causal architecture that combines LLM strategist, symbolic constraint engine, and causal inference to create robust autonomous agents, outperforming LLM-only and constrained-LLM approaches in e-commerce simulations.
Details
Motivation: LLM agents exhibit catastrophic brittleness in high-stakes domains, with identical capabilities producing wildly different outcomes based solely on prompt framing, creating deployment risks.Method: Integrates three components: LLM strategist for decision-making, formally verified symbolic constraint engine for safety, and causal inference module for counterfactual reasoning. Tested in 52-week e-commerce simulations with price elasticity, trust dynamics, and seasonal demand.
Result: LLM-only agents failed catastrophically (total loss of $99K) or destroyed brand trust (-48.6%). Constrained-LLM prevented disasters but achieved only 43-87% of Chimera’s profit. Chimera delivered highest returns ($1.52M-$1.96M, up to +$2.2M) while improving brand trust (+1.8% to +20.86%) with zero constraint violations.
Conclusion: Architectural design, not prompt engineering, determines reliability of autonomous agents in production environments. Neuro-symbolic-causal integration provides prompt-agnostic robustness.
Abstract: Large language models show promise as autonomous decision-making agents, yet their deployment in high-stakes domains remains fraught with risk. Without architectural safeguards, LLM agents exhibit catastrophic brittleness: identical capabilities produce wildly different outcomes depending solely on prompt framing. We present Chimera, a neuro-symbolic-causal architecture that integrates three complementary components - an LLM strategist, a formally verified symbolic constraint engine, and a causal inference module for counterfactual reasoning. We benchmark Chimera against baseline architectures (LLM-only, LLM with symbolic constraints) across 52-week simulations in a realistic e-commerce environment featuring price elasticity, trust dynamics, and seasonal demand. Under organizational biases toward either volume or margin optimization, LLM-only agents fail catastrophically (total loss of $99K in volume scenarios) or destroy brand trust (-48.6% in margin scenarios). Adding symbolic constraints prevents disasters but achieves only 43-87% of Chimera’s profit. Chimera consistently delivers the highest returns ($1.52M and $1.96M respectively, in some cases +$2.2M) while improving brand trust (+1.8% and +10.8%, in some cases +20.86%), demonstrating prompt-agnostic robustness. Our TLA+ formal verification proves zero constraint violations across all scenarios. These results establish that architectural design, not prompt engineering, determines the reliability of autonomous agents in production environments. We provide open-source implementations and interactive demonstrations for reproducibility.
[381] Parallel BiLSTM-Transformer networks for forecasting chaotic dynamics
Junwen Ma, Mingyu Ge, Yisen Wang, Yong Zhang, Weicheng Fu
Main category: cs.LG
TL;DR: A hybrid Transformer-BiLSTM model for chaotic time series prediction that captures both long-range dependencies and local temporal features through parallel branches with feature fusion.
Details
Motivation: Conventional approaches fail to simultaneously capture local features and global dependencies in chaotic systems, which exhibit extreme sensitivity to initial conditions and complex dynamics.Method: Dual-branch architecture with Transformer for long-range dependencies and BiLSTM for local temporal features, fused in a feature-fusion layer. Evaluated on Lorenz system for autonomous evolution prediction and unmeasured variable inference.
Result: The hybrid framework consistently outperforms single-branch architectures across both prediction tasks, demonstrating superior accuracy and robustness.
Conclusion: The proposed parallel predictive framework effectively addresses the challenges of chaotic system prediction by combining complementary representations from Transformer and BiLSTM networks.
Abstract: The nonlinear nature of chaotic systems results in extreme sensitivity to initial conditions and highly intricate dynamical behaviors, posing fundamental challenges for accurately predicting their evolution. To overcome the limitation that conventional approaches fail to capture both local features and global dependencies in chaotic time series simultaneously, this study proposes a parallel predictive framework integrating Transformer and Bidirectional Long Short-Term Memory (BiLSTM) networks. The hybrid model employs a dual-branch architecture, where the Transformer branch mainly captures long-range dependencies while the BiLSTM branch focuses on extracting local temporal features. The complementary representations from the two branches are fused in a dedicated feature-fusion layer to enhance predictive accuracy. As illustrating examples, the model’s performance is systematically evaluated on two representative tasks in the Lorenz system. The first is autonomous evolution prediction, in which the model recursively extrapolates system trajectories from the time-delay embeddings of the state vector to evaluate long-term tracking accuracy and stability. The second is inference of unmeasured variable, where the model reconstructs the unobserved states from the time-delay embeddings of partial observations to assess its state-completion capability. The results consistently indicate that the proposed hybrid framework outperforms both single-branch architectures across tasks, demonstrating its robustness and effectiveness in chaotic system prediction.
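The dual-branch layout can be sketched as below; layer sizes, the fusion layer, and the last-step pooling are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TransformerBiLSTM(nn.Module):
    """Dual-branch forecaster: a Transformer encoder for long-range structure,
    a BiLSTM for local temporal features, fused before the output head."""
    def __init__(self, in_dim=1, d_model=64, horizon=1):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.bilstm = nn.LSTM(in_dim, d_model // 2, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.fusion = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.ReLU())
        self.head = nn.Linear(d_model, horizon)

    def forward(self, x):                                 # x: (batch, seq_len, in_dim)
        t_feat = self.transformer(self.proj(x))[:, -1]    # last-step global feature
        l_feat, _ = self.bilstm(x)
        l_feat = l_feat[:, -1]                            # last-step local feature
        fused = self.fusion(torch.cat([t_feat, l_feat], dim=-1))
        return self.head(fused)

model = TransformerBiLSTM(in_dim=3, horizon=5)            # e.g. Lorenz states (x, y, z)
y_hat = model(torch.randn(8, 32, 3))                      # -> (8, 5)
```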
[382] On the Societal Impact of Machine Learning
Joachim Baumann
Main category: cs.LG
TL;DR: This PhD thesis focuses on measuring fairness in machine learning systems, analyzing bias dynamics, and developing interventions to reduce algorithmic discrimination while maintaining utility.
Details
Motivation: Machine learning systems increasingly influence important decisions but are often developed without fairness considerations, risking discriminatory effects on society.Method: The thesis enables appropriate fairness measurement, systematic decomposition of ML systems to anticipate bias dynamics, and effective interventions to reduce discrimination.
Result: The work provides a foundation for ensuring ML’s societal impact aligns with social values, addressing fairness in consequential decision-making systems.
Conclusion: The thesis discusses ongoing challenges and future research directions as ML systems become more integrated into society, particularly with generative AI.
Abstract: This PhD thesis investigates the societal impact of machine learning (ML). ML increasingly informs consequential decisions and recommendations, significantly affecting many aspects of our lives. As these data-driven systems are often developed without explicit fairness considerations, they carry the risk of discriminatory effects. The contributions in this thesis enable more appropriate measurement of fairness in ML systems, systematic decomposition of ML systems to anticipate bias dynamics, and effective interventions that reduce algorithmic discrimination while maintaining system utility. I conclude by discussing ongoing challenges and future research directions as ML systems, including generative artificial intelligence, become increasingly integrated into society. This work offers a foundation for ensuring that ML’s societal impact aligns with broader social values.
[383] MUStReason: A Benchmark for Diagnosing Pragmatic Reasoning in Video-LMs for Multimodal Sarcasm Detection
Anisha Saha, Varsha Suresh, Timothy Hospedales, Vera Demberg
Main category: cs.LG
TL;DR: MUStReason is a diagnostic benchmark for evaluating VideoLMs’ sarcasm detection capabilities, featuring modality-specific cue annotations and reasoning steps. The paper also introduces PragCoT, a framework that helps VideoLMs focus on implied intentions rather than literal meaning.
Details
Motivation: Current multimodal models struggle with complex tasks like sarcasm detection, which requires identifying relevant cues across modalities and pragmatically reasoning over them to infer speaker's intention. VideoLMs have limitations in this area.Method: Introduces MUStReason benchmark with annotations of modality-specific cues and reasoning steps. Proposes PragCoT framework that steers VideoLMs to focus on implied intentions over literal meaning. Disentangles the problem into perception and reasoning components.
Result: The benchmark enables quantitative and qualitative evaluation of generated reasoning in VideoLMs for sarcasm classification performance.
Conclusion: MUStReason provides a diagnostic tool to explore VideoLMs’ limitations in sarcasm detection, while PragCoT offers a framework to improve their ability to focus on implied intentions essential for detecting sarcasm.
Abstract: Sarcasm is a specific type of irony which involves discerning what is said from what is meant. Detecting sarcasm depends not only on the literal content of an utterance but also on non-verbal cues such as the speaker’s tonality, facial expressions and conversational context. However, current multimodal models struggle with complex tasks like sarcasm detection, which require identifying relevant cues across modalities and pragmatically reasoning over them to infer the speaker’s intention. To explore these limitations in VideoLMs, we introduce MUStReason, a diagnostic benchmark enriched with annotations of modality-specific relevant cues and underlying reasoning steps to identify sarcastic intent. In addition to benchmarking sarcasm classification performance in VideoLMs, we use MUStReason to quantitatively and qualitatively evaluate the generated reasoning by disentangling the problem into perception and reasoning. We further propose PragCoT, a framework that steers VideoLMs to focus on implied intentions over literal meaning, a property core to detecting sarcasm.
[384] Debiasing Reward Models by Representation Learning with Guarantees
Ignavier Ng, Patrick Blöbaum, Siddharth Bhandari, Kun Zhang, Shiva Kasiviswanathan
Main category: cs.LG
TL;DR: Proposes a principled framework to mitigate spurious correlations in reward models while preserving intended human preferences, using variational inference to identify non-spurious latent variables.
Details
Motivation: Current alignment techniques like RLHF often exploit spurious correlations (response length, discrimination, sycophancy, conceptual bias) in reward models, which is a growing problem that needs addressing.Method: Formulates data-generating process with spurious and non-spurious latent variables, shows non-spurious variables can be theoretically identified, and uses variational inference to recover these variables for training reward models.
Result: Experiments on synthetic and real-world datasets show the method effectively mitigates spurious correlation issues and produces more robust reward models.
Conclusion: The proposed framework successfully addresses spurious bias problems in reward models while maintaining the underlying factors that reflect true human preferences.
Abstract: Recent alignment techniques, such as reinforcement learning from human feedback, have been widely adopted to align large language models with human preferences by learning and leveraging reward models. In practice, these models often exploit spurious correlations, involving, e.g., response length, discrimination, sycophancy, and conceptual bias, which is a problem that has received increasing attention. In this work, we propose a principled framework that mitigates these biases in reward models while preserving the underlying factors that reflect intended preferences. We first provide a formulation of the data-generating process, assuming that the observed data (e.g., text) is generated from both spurious and non-spurious latent variables. We show that, interestingly, these non-spurious latent variables can be theoretically identified from data, regardless of whether a surrogate for the spurious latent variables is available. This further inspires a practical method that uses variational inference to recover these variables and leverages them to train reward models. Experiments on synthetic and real-world datasets demonstrate that our method effectively mitigates spurious correlation issues and yields more robust reward models.
[385] Explaining Robustness to Catastrophic Forgetting Through Incremental Concept Formation
Nicki Barari, Edward Kim, Christopher MacLellan
Main category: cs.LG
TL;DR: The paper examines three hypotheses about why Cobweb/4V resists catastrophic forgetting in continual learning: adaptive structural reorganization, sparse selective updates, and information-theoretic learning advantages over gradient backpropagation.
Details
Motivation: To understand the factors contributing to catastrophic forgetting robustness in Cobweb/4V and identify mechanisms for stable continual learning systems.Method: Compared Cobweb/4V with neural baselines including CobwebNN on MNIST, Fashion-MNIST, MedMNIST, and CIFAR-10 datasets to test three hypotheses about learning stability.
Result: Experiments confirmed that adaptive restructuring enhances plasticity, sparse updates reduce interference, and information-theoretic learning preserves prior knowledge without revisiting past data.
Conclusion: Concept-based information-theoretic approaches with adaptive restructuring and sparse updates provide effective mechanisms for mitigating catastrophic forgetting in continual learning.
Abstract: Catastrophic forgetting remains a central challenge in continual learning, where models are required to integrate new knowledge over time without losing what they have previously learned. In prior work, we introduced Cobweb/4V, a hierarchical concept formation model that exhibited robustness to catastrophic forgetting in visual domains. Motivated by this robustness, we examine three hypotheses regarding the factors that contribute to such stability: (1) adaptive structural reorganization enhances knowledge retention, (2) sparse and selective updates reduce interference, and (3) information-theoretic learning based on sufficiency statistics provides advantages over gradient-based backpropagation. To test these hypotheses, we compare Cobweb/4V with neural baselines, including CobwebNN, a neural implementation of the Cobweb framework introduced in this work. Experiments on datasets of varying complexity (MNIST, Fashion-MNIST, MedMNIST, and CIFAR-10) show that adaptive restructuring enhances learning plasticity, sparse updates help mitigate interference, and the information-theoretic learning process preserves prior knowledge without revisiting past data. Together, these findings provide insight into mechanisms that can mitigate catastrophic forgetting and highlight the potential of concept-based, information-theoretic approaches for building stable and adaptive continual learning systems.
[386] Relaxed Sequence Sampling for Diverse Protein Design
Joohwan Ko, Aristofanis Rontogiannis, Yih-En Andrew Ban, Axel Elaldi, Nicholas Franklin
Main category: cs.LG
TL;DR: RSS is a new MCMC framework for protein design that combines structural and evolutionary information, outperforming existing methods in designability and diversity.
Details
Motivation: Existing protein design methods like RSO rely on single-path gradient descent and ignore sequence-space constraints, limiting diversity and designability.Method: RSS uses Markov chain Monte Carlo in continuous logit space, combining gradient-guided exploration with protein language model-informed jumps. It couples AlphaFold2 structural objectives with ESM2 sequence priors.
Result: In protein binder design, RSS produces 5x more designable structures and 2-3x greater structural diversity than RSO baselines at equal computational cost.
Conclusion: RSS provides a principled approach for efficiently exploring the protein design landscape by balancing accuracy and biological plausibility.
Abstract: Protein design using structure prediction models such as AlphaFold2 has shown remarkable success, but existing approaches like relaxed sequence optimization (RSO) rely on single-path gradient descent and ignore sequence-space constraints, limiting diversity and designability. We introduce Relaxed Sequence Sampling (RSS), a Markov chain Monte Carlo (MCMC) framework that integrates structural and evolutionary information for protein design. RSS operates in continuous logit space, combining gradient-guided exploration with protein language model-informed jumps. Its energy function couples AlphaFold2-derived structural objectives with ESM2-derived sequence priors, balancing accuracy and biological plausibility. In an in silico protein binder design task, RSS produces 5$\times$ more designable structures and 2-3$\times$ greater structural diversity than RSO baselines, at equal computational cost. These results highlight RSS as a principled approach for efficiently exploring the protein design landscape.
[387] Revealing the Potential of Learnable Perturbation Ensemble Forecast Model for Tropical Cyclone Prediction
Jun Liu, Tao Zhou, Jiarui Li, Xiaohui Zhong, Peng Zhang, Jie Feng, Lei Chen, Hao Li
Main category: cs.LG
TL;DR: FuXi-ENS, an AI-based ensemble forecasting system with learnable perturbations, outperforms ECMWF-ENS in tropical cyclone track forecasting and captures large-scale circulation better, though it underestimates intensity.
Details
Motivation: Traditional ensemble forecasting systems for tropical cyclones face high computational costs and limited capability to represent atmospheric nonlinearity, motivating the development of AI-based approaches.Method: FuXi-ENS introduces a learnable perturbation scheme for ensemble generation and is systematically compared with ECMWF-ENS using all 90 global tropical cyclones in 2018, examining physical variables, track/intensity forecasts, and dynamical/thermodynamical fields.
Result: FuXi-ENS shows advantages in predicting TC-related physical variables, achieves more accurate track forecasts with reduced ensemble spread, but underestimates intensity. It better captures large-scale circulation and has moisture turbulent energy more concentrated around TC warm core.
Conclusion: Learnable perturbations can improve TC forecasting skill and provide valuable insights for advancing AI-based ensemble prediction of extreme weather events with significant societal impacts.
Abstract: Tropical cyclones (TCs) are highly destructive and inherently uncertain weather systems. Ensemble forecasting helps quantify these uncertainties, yet traditional systems are constrained by high computational costs and limited capability to fully represent atmospheric nonlinearity. FuXi-ENS introduces a learnable perturbation scheme for ensemble generation, representing a novel AI-based forecasting paradigm. Here, we systematically compare FuXi-ENS with ECMWF-ENS using all 90 global TCs in 2018, examining their performance in TC-related physical variables, track and intensity forecasts, and the associated dynamical and thermodynamical fields. FuXi-ENS demonstrates clear advantages in predicting TC-related physical variables, and achieves more accurate track forecasts with reduced ensemble spread, though it still underestimates intensity relative to observations. Further dynamical and thermodynamical analyses reveal that FuXi-ENS better captures large-scale circulation, with moisture turbulent energy more tightly concentrated around the TC warm core, whereas ECMWF-ENS exhibits a more dispersed distribution. These findings highlight the potential of learnable perturbations to improve TC forecasting skill and provide valuable insights for advancing AI-based ensemble prediction of extreme weather events that have significant societal impacts.
[388] Learning Interpretable Features in Audio Latent Spaces via Sparse Autoencoders
Nathan Paek, Yongyi Zang, Qihui Yang, Randal Leistikow
Main category: cs.LG
TL;DR: A framework for interpreting audio generative models by mapping latent representations to human-interpretable acoustic concepts using sparse autoencoders and linear mappings to acoustic properties.
Details
Motivation: Sparse autoencoders work well for language models but face challenges with audio due to compression obscuring semantic meaning and limited automatic feature characterization in audio generation.Method: Train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, timbre) to enable controllable manipulation and analysis.
Result: Validated on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyzed DiffRhythm model to show how pitch, timbre, and loudness evolve during generation.
Conclusion: The framework enables interpretable analysis of audio generation and can be extended to visual latent space generation models.
Abstract: While sparse autoencoders (SAEs) successfully extract interpretable features from language models, applying them to audio generation faces unique challenges: audio’s dense nature requires compression that obscures semantic meaning, and automatic feature characterization remains limited. We propose a framework for interpreting audio generative models by mapping their latent representations to human-interpretable acoustic concepts. We train SAEs on audio autoencoder latents, then learn linear mappings from SAE features to discretized acoustic properties (pitch, amplitude, and timbre). This enables both controllable manipulation and analysis of the AI music generation process, revealing how acoustic properties emerge during synthesis. We validate our approach on continuous (DiffRhythm-VAE) and discrete (EnCodec, WavTokenizer) audio latent spaces, and analyze DiffRhythm, a state-of-the-art text-to-music model, to demonstrate how pitch, timbre, and loudness evolve throughout generation. While our work focuses solely on the audio modality, our framework can be extended to interpretable analysis of visual latent space generation models.
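A small sketch of the two stages described above, under assumptions: a ReLU sparse autoencoder with an L1 penalty is trained on latent vectors, then a linear probe maps SAE features to a discretized property (a placeholder pitch-bin label here); dimensions and training details are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_latent=128, d_features=1024, l1=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_latent, d_features)
        self.dec = nn.Linear(d_features, d_latent)
        self.l1 = l1

    def forward(self, z):
        f = torch.relu(self.enc(z))             # sparse, overcomplete features
        return self.dec(f), f

    def loss(self, z):
        recon, f = self(z)
        return nn.functional.mse_loss(recon, z) + self.l1 * f.abs().mean()

# Stage 1: train the SAE on audio-codec latents (placeholder tensor here).
latents = torch.randn(4096, 128)                # stand-in for audio autoencoder latents
sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    sae.loss(latents).backward()
    opt.step()

# Stage 2: linear probe from SAE features to a discretized property (e.g. pitch bin).
with torch.no_grad():
    _, feats = sae(latents)
pitch_bins = torch.randint(0, 12, (4096,))      # placeholder labels
probe = nn.Linear(feats.size(1), 12)
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(100):
    probe_opt.zero_grad()
    nn.functional.cross_entropy(probe(feats), pitch_bins).backward()
    probe_opt.step()
```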
[389] How do simple rotations affect the implicit bias of Adam?
Adela DePavia, Vasileios Charisopoulos, Rebecca Willett
Main category: cs.LG
TL;DR: Adam’s richness bias for learning nonlinear decision boundaries is sensitive to data rotations, which can reverse its advantage over gradient descent. A reparameterization method using orthogonal transformations can restore Adam’s bias towards rich boundaries.
Details
Motivation: To understand why adaptive gradient methods like Adam sometimes generalize worse than gradient descent, particularly examining how their coordinate-wise preconditioning makes them sensitive to data rotations.Method: Analyze Adam’s sensitivity to orthogonal transformations of feature space and test a reparameterization method that applies orthogonal transformations to make optimization rotation-equivariant.
Result: Small rotations of data distribution can make Adam forfeit its richness bias and converge to worse linear boundaries than gradient descent. The reparameterization method successfully restores Adam’s bias towards rich decision boundaries.
Conclusion: Adam’s coordinate-wise preconditioning creates rotation sensitivity that can hurt generalization, but this can be mitigated through appropriate reparameterization to achieve rotation-equivariant optimization.
Abstract: Adaptive gradient methods such as Adam and Adagrad are widely used in machine learning, yet their effect on the generalization of learned models – relative to methods like gradient descent – remains poorly understood. Prior work on binary classification suggests that Adam exhibits a ``richness bias,’’ which can help it learn nonlinear decision boundaries closer to the Bayes-optimal decision boundary relative to gradient descent. However, the coordinate-wise preconditioning scheme employed by Adam renders the overall method sensitive to orthogonal transformations of feature space. We show that this sensitivity can manifest as a reversal of Adam’s competitive advantage: even small rotations of the underlying data distribution can make Adam forfeit its richness bias and converge to a linear decision boundary that is farther from the Bayes-optimal decision boundary than the one learned by gradient descent. To alleviate this issue, we show that a recently proposed reparameterization method – which applies an orthogonal transformation to the optimization objective – endows any first-order method with equivariance to data rotations, and we empirically demonstrate its ability to restore Adam’s bias towards rich decision boundaries.
[390] A Physics-informed Multi-resolution Neural Operator
Sumanta Roy, Bahador Bahmani, Ioannis G. Kevrekidis, Michael D. Shields
Main category: cs.LG
TL;DR: A physics-informed operator learning method that extends RINO to work without training data by projecting inputs to latent space and using MLP with PDE enforcement via finite difference solver.
Details
Motivation: Operator learning frameworks require large amounts of high-fidelity training data which can be challenging to obtain, especially with unevenly discretized data across different resolutions.Method: Extends RINO framework to data-free setup by projecting inputs to latent embedding space using pre-trained basis functions, then using MLP with latent codes and spatiotemporal coordinates to produce solutions, with PDEs enforced via finite difference solver.
Result: Method validated on numerical examples with multi-resolution data, handling inputs sampled at varying resolutions including both coarse and fine discretizations.
Conclusion: Proposed approach successfully addresses challenges of data scarcity and uneven discretization in operator learning for PDE applications.
Abstract: The predictive accuracy of operator learning frameworks depends on the quality and quantity of available training data (input-output function pairs), often requiring substantial amounts of high-fidelity data, which can be challenging to obtain in some real-world engineering applications. These datasets may be unevenly discretized from one realization to another, with the grid resolution varying across samples. In this study, we introduce a physics-informed operator learning approach by extending the Resolution Independent Neural Operator (RINO) framework to a fully data-free setup, addressing both challenges simultaneously. Here, the arbitrarily (but sufficiently finely) discretized input functions are projected onto a latent embedding space (i.e., a vector space of finite dimensions), using pre-trained basis functions. The operator associated with the underlying partial differential equations (PDEs) is then approximated by a simple multi-layer perceptron (MLP), which takes as input a latent code along with spatiotemporal coordinates to produce the solution in the physical space. The PDEs are enforced via a finite difference solver in the physical space. The validation and performance of the proposed method are benchmarked on several numerical examples with multi-resolution data, where input functions are sampled at varying resolutions, including both coarse and fine discretizations.
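The data-free ingredient can be illustrated on a toy problem; this is not the RINO pipeline itself, but a sketch in which an MLP conditioned on a latent code produces the field on a grid and a finite-difference residual of a stand-in PDE (1D Poisson, u'' = f) supplies the loss. All sizes and the latent code are placeholders.

```python
import torch
import torch.nn as nn

class LatentFieldMLP(nn.Module):
    """Maps (latent code, coordinate) -> solution value."""
    def __init__(self, d_latent=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_latent + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, x):                        # z: (d_latent,), x: (N, 1)
        zx = torch.cat([z.expand(x.size(0), -1), x], dim=-1)
        return self.net(zx).squeeze(-1)             # (N,)

def pde_loss(model, z, f_vals, x, h):
    """Data-free loss: finite-difference residual of u''(x) = f(x) plus
    zero Dirichlet boundary conditions (a stand-in PDE, not the paper's)."""
    u = model(z, x)
    d2u = (u[2:] - 2 * u[1:-1] + u[:-2]) / h**2     # central second difference
    residual = ((d2u - f_vals[1:-1]) ** 2).mean()
    boundary = u[0] ** 2 + u[-1] ** 2
    return residual + boundary

N, h = 65, 1.0 / 64
x = torch.linspace(0, 1, N).unsqueeze(-1)
f_vals = -(torch.pi ** 2) * torch.sin(torch.pi * x.squeeze(-1))  # exact u = sin(pi x)
z = torch.zeros(16)                                  # latent code for one input function

model = LatentFieldMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    opt.zero_grad()
    pde_loss(model, z, f_vals, x, h).backward()
    opt.step()
```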
[391] Combining SHAP and Causal Analysis for Interpretable Fault Detection in Industrial Processes
Pedro Cortes dos Santos, Matheus Becali Rocha, Renato A Krohling
Main category: cs.LG
TL;DR: This paper develops an innovative fault detection framework using SHAP and causal analysis to improve both accuracy and interpretability in complex industrial processes.
Details
Motivation: Industrial processes generate complex data that challenge fault detection systems, often yielding opaque or underwhelming results despite advanced machine learning techniques.Method: Uses SHAP to identify critical process features, then applies causal analysis through Directed Acyclic Graphs to uncover fault propagation mechanisms.
Result: The causal structures align with SHAP findings, highlighting key process elements like cooling and separation systems as pivotal to fault development, enhancing detection accuracy and providing actionable insights.
Conclusion: This dual approach bridges predictive power with causal understanding, offering a robust tool for monitoring complex manufacturing environments and enabling smarter, more interpretable fault detection.
Abstract: Industrial processes generate complex data that challenge fault detection systems, often yielding opaque or underwhelming results despite advanced machine learning techniques. This study tackles such difficulties using the Tennessee Eastman Process, a well-established benchmark known for its intricate dynamics, to develop an innovative fault detection framework. Initial attempts with standard models revealed limitations in both performance and interpretability, prompting a shift toward a more tractable approach. By employing SHAP (SHapley Additive exPlanations), we transform the problem into a more manageable and transparent form, pinpointing the most critical process features driving fault predictions. This reduction in complexity unlocks the ability to apply causal analysis through Directed Acyclic Graphs, generated by multiple algorithms, to uncover the underlying mechanisms of fault propagation. The resulting causal structures align strikingly with SHAP findings, consistently highlighting key process elements-like cooling and separation systems-as pivotal to fault development. Together, these methods not only enhance detection accuracy but also provide operators with clear, actionable insights into fault origins, a synergy that, to our knowledge, has not been previously explored in this context. This dual approach bridges predictive power with causal understanding, offering a robust tool for monitoring complex manufacturing environments and paving the way for smarter, more interpretable fault detection in industrial systems.
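A minimal sketch of the first stage (SHAP-based feature screening) on synthetic data; the Tennessee Eastman signals and the paper's specific causal-discovery algorithms are not reproduced, and the DAG-learning stage is left as a stub:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                  # stand-in process measurements
y = (X[:, 3] + 0.5 * X[:, 7] > 0).astype(int)   # stand-in fault label

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
shap_values = shap.TreeExplainer(clf).shap_values(X)

# collapse every axis except the feature axis to get a mean |SHAP| importance per feature
sv = np.abs(np.asarray(shap_values))
feat_axis = [i for i, s in enumerate(sv.shape) if s == X.shape[1]][0]
importance = sv.mean(axis=tuple(i for i in range(sv.ndim) if i != feat_axis))
top_features = np.argsort(importance)[::-1][:5]
print("features passed to the causal-discovery stage:", top_features)
# Stage 2 (not shown): learn a DAG over X[:, top_features] with a causal-discovery algorithm.
```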
[392] ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning
Yilang Zhang, Xiaodong Yang, Yiwei Cai, Georgios B. Giannakis
Main category: cs.LG
TL;DR: The paper proposes a method that accumulates high-rank weight updates from consecutive low-rank increments to overcome limitations of standard LoRA, achieving better performance and faster convergence while maintaining efficiency.
Details
Motivation: Large language models face computational bottlenecks during fine-tuning. While LoRA reduces costs by using low-rank updates, this restriction can hinder effectiveness and slow convergence.Method: Progressively accumulates high-rank weight updates from consecutive low-rank increments by identifying optimal low-rank matrices that minimize loss and approximate full fine-tuning. Uses optimal scaling of columns from original low-rank matrices for efficient optimization without restarting.
Result: Extensive tests with LLMs up to 12B parameters show consistent performance gains and fast convergence compared to state-of-the-art LoRA variants across natural language understanding, commonsense reasoning, and mathematical problem solving tasks.
Conclusion: The proposed method effectively addresses LoRA’s limitations by accumulating high-rank updates from low-rank increments, achieving better performance and faster convergence while maintaining computational efficiency.
Abstract: As large language models (LLMs) continue to scale in size, the computational overhead has become a major bottleneck for task-specific fine-tuning. While low-rank adaptation (LoRA) effectively curtails this cost by confining the weight updates to a low-dimensional subspace, such a restriction can hinder effectiveness and slow convergence. This contribution deals with these limitations by accumulating progressively a high-rank weight update from consecutive low-rank increments. Specifically, the per update optimal low-rank matrix is identified to minimize the loss function and closely approximate full fine-tuning. To endow efficient and seamless optimization without restarting, this optimal choice is formed by appropriately scaling the columns of the original low-rank matrix. Rigorous performance guarantees reveal that the optimal scaling can be found analytically. Extensive numerical tests with popular LLMs scaling up to 12 billion parameters demonstrate a consistent performance gain and fast convergence relative to state-of-the-art LoRA variants on diverse tasks including natural language understanding, commonsense reasoning, and mathematical problem solving.
[393] A PDE-Informed Latent Diffusion Model for 2-m Temperature Downscaling
Paul Rosu, Muchang Bahng, Erick Jiang, Rico Zhu, Vahid Tarokh
Main category: cs.LG
TL;DR: A physics-conditioned latent diffusion model for dynamical downscaling of atmospheric data, specifically reconstructing high-resolution 2-m temperature fields, with PDE loss integration for physical consistency.
Details
Motivation: To enhance the physical plausibility of generated atmospheric temperature fields by integrating physical constraints into diffusion models, addressing limitations of purely data-driven approaches.Method: Builds on existing diffusion architecture with residual formulation against reference UNet, integrates PDE loss term computed in full resolution space using finite-difference approximation of advection-diffusion balance.
Result: Conventional diffusion training already yields low PDE residuals, and fine-tuning with additional PDE loss further regularizes the model and enhances physical plausibility of generated fields.
Conclusion: Physics-conditioned latent diffusion models with integrated PDE constraints can effectively generate physically consistent high-resolution atmospheric temperature fields, with codebase available for future development.
Abstract: This work presents a physics-conditioned latent diffusion model tailored for dynamical downscaling of atmospheric data, with a focus on reconstructing high-resolution 2-m temperature fields. Building upon a pre-existing diffusion architecture and employing a residual formulation against a reference UNet, we integrate a partial differential equation (PDE) loss term into the model’s training objective. The PDE loss is computed in the full resolution (pixel) space by decoding the latent representation and is designed to enforce physical consistency through a finite-difference approximation of an effective advection-diffusion balance. Empirical observations indicate that conventional diffusion training already yields low PDE residuals, and we investigate how fine-tuning with this additional loss further regularizes the model and enhances the physical plausibility of the generated fields. The entirety of our codebase is available on Github, for future reference and development.
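A minimal sketch of a finite-difference advection-diffusion residual that could serve as the kind of pixel-space PDE loss described above; the field layout, wind inputs, and coefficients are assumptions, not the paper's exact formulation:

```python
import torch

def advection_diffusion_residual(T, u, v, kappa=1.0, dx=1.0, dy=1.0):
    """Residual u*dT/dx + v*dT/dy - kappa*(d2T/dx2 + d2T/dy2) via central differences.
    T, u, v: (B, 1, H, W) decoded temperature and advecting wind components."""
    Tx  = (T[..., :, 2:] - T[..., :, :-2]) / (2 * dx)
    Ty  = (T[..., 2:, :] - T[..., :-2, :]) / (2 * dy)
    Txx = (T[..., :, 2:] - 2 * T[..., :, 1:-1] + T[..., :, :-2]) / dx**2
    Tyy = (T[..., 2:, :] - 2 * T[..., 1:-1, :] + T[..., :-2, :]) / dy**2
    interior = lambda f: f[..., 1:-1, 1:-1]        # crop all terms to a shared interior
    adv  = interior(u) * Tx[..., 1:-1, :] + interior(v) * Ty[..., :, 1:-1]
    diff = kappa * (Txx[..., 1:-1, :] + Tyy[..., :, 1:-1])
    return adv - diff

B, H, W = 2, 64, 64
T = torch.randn(B, 1, H, W, requires_grad=True)      # decoded high-resolution temperature
u = torch.randn(B, 1, H, W); v = torch.randn(B, 1, H, W)
pde_loss = advection_diffusion_residual(T, u, v).pow(2).mean()
pde_loss.backward()   # in the real setup, gradients flow back through the decoder
```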
[394] GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA
Zhichao Wang
Main category: cs.LG
TL;DR: GIFT is a novel RL framework that minimizes discrepancy between implicit and explicit reward models for LLM alignment, combining ideas from GRPO, DPO, and UNA with joint normalization to create a stable MSE loss formulation.
Details
Motivation: To address limitations of existing RL methods like PPO and GRPO that directly maximize cumulative rewards, and offline methods like DPO/UNA that lose exploration capability, by creating a more stable and efficient alignment approach.Method: Combines online multi-response generation (GRPO), implicit reward formulation (DPO), and implicit-explicit reward alignment (UNA) with joint normalization of rewards to transform the problem into a convex MSE loss between normalized reward functions.
Result: Achieves superior reasoning and alignment performance on mathematical benchmarks with faster convergence, better generalization, reduced training overfitting, and computational efficiency compared to GRPO.
Conclusion: GIFT provides a stable, convex, and analytically differentiable framework for LLM alignment that retains on-policy exploration while requiring fewer hyperparameters and achieving better performance than existing methods.
Abstract: I propose Group-relative Implicit Fine Tuning (GIFT), a novel reinforcement learning framework for aligning LLMs. Instead of directly maximizing cumulative rewards like PPO or GRPO, GIFT minimizes the discrepancy between implicit and explicit reward models. It combines three key ideas: (1) the online multi-response generation and normalization of GRPO, (2) the implicit reward formulation of DPO, and (3) the implicit-explicit reward alignment principle of UNA. By jointly normalizing the implicit and explicit rewards, GIFT eliminates an otherwise intractable term that prevents effective use of implicit rewards. This normalization transforms the complex reward maximization objective into a simple mean squared error (MSE) loss between the normalized reward functions, converting a non-convex optimization problem into a convex, stable, and analytically differentiable formulation. Unlike offline methods such as DPO and UNA, GIFT remains on-policy and thus retains exploration capability. Compared to GRPO, it requires fewer hyperparameters, converges faster, and generalizes better with significantly reduced training overfitting. Empirically, GIFT achieves superior reasoning and alignment performance on mathematical benchmarks while remaining computationally efficient.
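A minimal sketch of one plausible reading of the objective described above (group-relative normalization of a DPO-style implicit reward and an explicit reward, followed by an MSE between the two); the exact normalization and scaling used in the paper may differ:

```python
import torch

def gift_style_loss(logp_policy, logp_ref, explicit_reward, eps=1e-6):
    """All inputs have shape (G,): one value per sampled response for the same prompt."""
    implicit = logp_policy - logp_ref                 # DPO-style implicit reward (up to scale)
    def normalize(x):
        return (x - x.mean()) / (x.std() + eps)       # group-relative normalization
    return (normalize(implicit) - normalize(explicit_reward)).pow(2).mean()

# toy usage: 4 sampled responses for one prompt
logp_policy = torch.tensor([-12.0, -9.5, -11.0, -10.2], requires_grad=True)
logp_ref    = torch.tensor([-11.5, -10.0, -11.2, -10.0])
rewards     = torch.tensor([0.2, 0.9, 0.1, 0.6])
loss = gift_style_loss(logp_policy, logp_ref, rewards)
loss.backward()
```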
[395] Preference Learning with Response Time: Robust Losses and Guarantees
Ayush Sawarni, Sahasrajit Sarmasarkar, Vasilis Syrgkanis
Main category: cs.LG
TL;DR: This paper introduces methods to incorporate response time data into human preference learning, showing significant improvements in sample efficiency and convergence rates compared to traditional binary preference approaches.
Details
Motivation: Current human preference learning frameworks only use binary preference data, ignoring valuable temporal information from user decision-making that could improve reward model elicitation.Method: Proposed novel methodologies using the Evidence Accumulation Drift Diffusion (EZ) model to incorporate response time information alongside binary choices, with Neyman-orthogonal loss functions for reward model learning.
Result: The response time-augmented approach reduces error rates from exponential to polynomial scaling with reward magnitude, achieving oracle convergence rates and significant sample efficiency improvements. Extensive experiments validate these findings in image preference learning.
Conclusion: Incorporating response time data into preference learning frameworks provides substantial benefits over binary-only approaches, with improved theoretical guarantees and practical performance across various reward function spaces.
Abstract: This paper investigates the integration of response time data into human preference learning frameworks for more effective reward model elicitation. While binary preference data has become fundamental in fine-tuning foundation models, generative AI systems, and other large-scale models, the valuable temporal information inherent in user decision-making remains largely unexploited. We propose novel methodologies to incorporate response time information alongside binary choice data, leveraging the Evidence Accumulation Drift Diffusion (EZ) model, under which response time is informative of the preference strength. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the theoretical optimal rates that would be attained if the expected response times for each query were known a priori. Our theoretical analysis demonstrates that for linear reward functions, conventional preference learning suffers from error rates that scale exponentially with reward magnitude. In contrast, our response time-augmented approach reduces this to polynomial scaling, representing a significant improvement in sample efficiency. We extend these guarantees to non-parametric reward function spaces, establishing convergence properties for more complex, realistic reward models. Our extensive experiments validate our theoretical findings in the context of preference learning over images.
[396] Artificial Intelligence Based Predictive Maintenance for Electric Buses
Ayse Irmak Ercevik, Ahmet Murat Ozbayoglu
Main category: cs.LG
TL;DR: A graph-based feature selection method combined with AI techniques effectively predicts vehicle alarms in electric buses using CAN Bus data, enabling proactive maintenance.
Details
Motivation: Electric buses present PdM challenges due to complex electric systems, and traditional scheduled maintenance fails to detect anomalies in multi-dimensional CAN Bus data.Method: Developed hybrid graph-based feature selection combining statistical filtering with community detection algorithms, then applied optimized ML models (SVM, Random Forest, XGBoost) with data balancing techniques and LIME for interpretability.
Result: The system successfully predicts vehicle alarms, enhances feature interpretability, and supports proactive maintenance strategies.
Conclusion: The developed approach effectively addresses PdM challenges in electric buses and aligns with Industry 4.0 principles for smart maintenance.
Abstract: Predictive maintenance (PdM) is crucial for optimizing efficiency and minimizing downtime of electric buses. While these vehicles provide environmental benefits, they pose challenges for PdM due to complex electric transmission and battery systems. Traditional maintenance, often based on scheduled inspections, struggles to capture anomalies in multi-dimensional real-time CAN Bus data. This study employs a graph-based feature selection method to analyze relationships among CAN Bus parameters of electric buses and investigates the prediction performance of targeted alarms using artificial intelligence techniques. The raw data collected over two years underwent extensive preprocessing to ensure data quality and consistency. A hybrid graph-based feature selection tool was developed by combining statistical filtering (Pearson correlation, Cramer’s V, ANOVA F-test) with optimization-based community detection algorithms (InfoMap, Leiden, Louvain, Fast Greedy). Machine learning models, including SVM, Random Forest, and XGBoost, were optimized through grid and random search with data balancing via SMOTEEN and binary search-based down-sampling. Model interpretability was achieved using LIME to identify the features influencing predictions. The results demonstrate that the developed system effectively predicts vehicle alarms, enhances feature interpretability, and supports proactive maintenance strategies aligned with Industry 4.0 principles.
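A minimal sketch of the graph-based feature-selection idea on toy data (correlation graph, a fast-greedy community detection, one representative feature per community, then a classifier); the CAN Bus signals, class balancing, and LIME steps are omitted and the threshold is arbitrary:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 15))
X[:, 5] = X[:, 0] + 0.1 * rng.normal(size=1000)   # a correlated pair, to give graph structure
y = (X[:, 0] + X[:, 8] > 0).astype(int)           # stand-in alarm label

corr = np.abs(np.corrcoef(X, rowvar=False))
G = nx.Graph()
G.add_nodes_from(range(X.shape[1]))
for i in range(X.shape[1]):
    for j in range(i + 1, X.shape[1]):
        if corr[i, j] > 0.3:                      # threshold chosen arbitrarily for the sketch
            G.add_edge(i, j, weight=corr[i, j])

communities = greedy_modularity_communities(G, weight="weight")
# keep, from each community, the feature most correlated on average with its peers
selected = [max(c, key=lambda f: corr[f, list(c)].mean()) for c in communities]
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X[:, selected], y)
print("selected features:", sorted(selected), "| train acc:", clf.score(X[:, selected], y))
```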
[397] RS-ORT: A Reduced-Space Branch-and-Bound Algorithm for Optimal Regression Trees
Cristobal Heredia, Pedro Chumpitaz-Flores, Kaixun Hua
Main category: cs.LG
TL;DR: RS-ORT is a specialized branch-and-bound algorithm for training optimal regression trees that handles continuous features efficiently without binarization, achieving guaranteed optimality on million-size datasets.
Details
Motivation: Existing MIP approaches for regression trees are limited to binary features or become intractable with continuous data, while naive binarization sacrifices optimality and creates unnecessarily deep trees.Method: Two-stage optimization with branch-and-bound that branches only on tree-structural variables, using bound tightening techniques (closed-form leaf prediction, empirical threshold discretization, depth-1 subtree parsing) and decomposable bounding strategies with parallel execution.
Result: Superior training and testing performance on regression benchmarks, handling datasets up to 2,000,000 samples with continuous features, obtaining guaranteed optimal performance with simpler tree structures and better generalization in four hours.
Conclusion: RS-ORT effectively addresses computational intractability in optimal regression tree training for large-scale continuous data, delivering guaranteed optimal solutions with improved efficiency and generalization.
Abstract: Mixed-integer programming (MIP) has emerged as a powerful framework for learning optimal decision trees. Yet, existing MIP approaches for regression tasks are either limited to purely binary features or become computationally intractable when continuous, large-scale data are involved. Naively binarizing continuous features sacrifices global optimality and often yields needlessly deep trees. We recast the optimal regression-tree training as a two-stage optimization problem and propose Reduced-Space Optimal Regression Trees (RS-ORT) - a specialized branch-and-bound (BB) algorithm that branches exclusively on tree-structural variables. This design guarantees the algorithm's convergence and its independence from the number of training samples. Leveraging the model's structure, we introduce several bound tightening techniques - closed-form leaf prediction, empirical threshold discretization, and exact depth-1 subtree parsing - that combine with decomposable upper and lower bounding strategies to accelerate the training. The BB node-wise decomposition enables trivial parallel execution, further alleviating the computational intractability even for million-size datasets. Based on the empirical studies on several regression benchmarks containing both binary and continuous features, RS-ORT also delivers better training and testing performance than state-of-the-art methods. Notably, on datasets with up to 2,000,000 samples with continuous features, RS-ORT can obtain guaranteed training performance with a simpler tree structure and a better generalization ability in four hours.
[398] Group Interventions on Deep Networks for Causal Discovery in Subsystems
Wasim Ahmad, Maha Shadaydeh, Joachim Denzler
Main category: cs.LG
TL;DR: gCDMI is a novel multi-group causal discovery method that uses group-level interventions on deep neural networks and model invariance testing to identify causal relationships among variable groups in nonlinear multivariate time series.
Details
Motivation: Most existing causal discovery methods focus only on pairwise cause-effect relationships, overlooking interactions among groups of variables and their collective causal influence in complex systems.Method: Three-step approach: 1) Use deep learning to model structural relationships among time series groups, 2) Apply group-wise interventions to trained model, 3) Conduct model invariance testing to determine causal links among variable groups.
Result: Superior performance on simulated datasets in identifying group-level causal relationships compared to existing methods. Validated on real-world datasets including brain networks and climate ecosystems.
Conclusion: Group-level interventions applied to deep learning models combined with invariance testing can effectively reveal complex causal structures, providing valuable insights for neuroscience and climate science.
Abstract: Causal discovery uncovers complex relationships between variables, enhancing predictions, decision-making, and insights into real-world systems, especially in nonlinear multivariate time series. However, most existing methods primarily focus on pairwise cause-effect relationships, overlooking interactions among groups of variables, i.e., subsystems and their collective causal influence. In this study, we introduce gCDMI, a novel multi-group causal discovery method that leverages group-level interventions on trained deep neural networks and employs model invariance testing to infer causal relationships. Our approach involves three key steps. First, we use deep learning to jointly model the structural relationships among groups of all time series. Second, we apply group-wise interventions to the trained model. Finally, we conduct model invariance testing to determine the presence of causal links among variable groups. We evaluate our method on simulated datasets, demonstrating its superior performance in identifying group-level causal relationships compared to existing methods. Additionally, we validate our approach on real-world datasets, including brain networks and climate ecosystems. Our results highlight that applying group-level interventions to deep learning models, combined with invariance testing, can effectively reveal complex causal structures, offering valuable insights for domains such as neuroscience and climate science.
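A minimal sketch of the general intervention-plus-invariance idea on toy time series (a fitted predictor, a permutation "intervention" on one candidate group, and a two-sample test on the residuals); the authors' deep model, intervention scheme, and test statistic are not reproduced:

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
T = 2000
group_a = rng.normal(size=(T, 3))                                    # candidate cause group
target = np.roll(group_a[:, 0], 1) + 0.1 * rng.normal(size=T)        # driven by group_a
X = np.hstack([group_a, rng.normal(size=(T, 2))])                    # group_a + a noise group
model = Ridge().fit(X[:-1], target[1:])                              # one-step-ahead predictor

resid_obs = target[1:] - model.predict(X[:-1])

X_int = X.copy()
X_int[:, :3] = X_int[rng.permutation(T), :3]                         # intervene: permute group_a in time
resid_int = target[1:] - model.predict(X_int[:-1])

stat, p = ks_2samp(resid_obs, resid_int)
print(f"KS p-value = {p:.3g} -> {'causal link suggested' if p < 0.05 else 'no evidence'}")
```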
[399] Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
Marko Karbevski, Antonij Mijoski
Main category: cs.LG
TL;DR: The paper proves that Query weights in attention mechanisms are redundant, reducing parameters by over 8% while maintaining comparable performance.
Details
Motivation: To investigate whether the Query-Key-Value weight triplet in attention mechanisms can be reduced to improve parameter efficiency in large language models.Method: Theoretical analysis under simplifying assumptions followed by empirical validation on GPT-3 small architectures with full complexity (layer normalization, skip connections, weight decay) trained from scratch.
Result: The reduced model without Query weights achieves comparable validation loss to standard baselines while reducing non-embedding/lm-head parameters by over 8%.
Conclusion: Query weights are redundant in attention mechanisms, motivating further investigation of this redundancy at larger scales.
Abstract: The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
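A minimal sketch of one plausible realization of "no query weights" (the hidden state itself plays the role of the query against learned keys and values); the paper's exact construction may differ, so treat this as an illustration rather than the proposed architecture:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVOnlyAttention(nn.Module):
    """Single-head causal attention with key/value projections only."""
    def __init__(self, d_model):
        super().__init__()
        self.W_k = nn.Linear(d_model, d_model, bias=False)
        self.W_v = nn.Linear(d_model, d_model, bias=False)
        self.W_o = nn.Linear(d_model, d_model, bias=False)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x):                                    # x: (B, T, d_model)
        k, v = self.W_k(x), self.W_v(x)
        scores = (x @ k.transpose(-2, -1)) * self.scale      # raw x acts as the query
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(mask, float("-inf"))
        return self.W_o(F.softmax(scores, dim=-1) @ v)

out = KVOnlyAttention(64)(torch.randn(2, 10, 64))
print(out.shape)   # torch.Size([2, 10, 64])
```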
[400] Geometry-Inspired Unified Framework for Discounted and Average Reward MDPs
Arsenii Mustafin, Xinyi Sheng, Dominik Baumann
Main category: cs.LG
TL;DR: Extends geometric interpretation of MDPs from discounted-reward to average-reward case, unifying both approaches and proving geometric convergence of Value Iteration for ergodic optimal policies.
Details
Motivation: To unify the theoretical analysis of MDPs by extending geometric interpretations from discounted-reward to average-reward cases, which are typically analyzed separately.Method: Extends a recently introduced geometric interpretation of MDPs for the discounted-reward case to the average-reward case, creating a unified framework.
Result: Successfully extends a major result: under a unique and ergodic optimal policy, the Value Iteration algorithm achieves geometric convergence rate in the average-reward case.
Conclusion: The geometric interpretation successfully unifies analysis of both discounted-reward and average-reward MDPs, enabling extension of convergence results across both cases.
Abstract: The theoretical analysis of Markov Decision Processes (MDPs) is commonly split into two cases - the average-reward case and the discounted-reward case - which, while sharing similarities, are typically analyzed separately. In this work, we extend a recently introduced geometric interpretation of MDPs for the discounted-reward case to the average-reward case, thereby unifying both. This allows us to extend a major result known for the discounted-reward case to the average-reward case: under a unique and ergodic optimal policy, the Value Iteration algorithm achieves a geometric convergence rate.
[401] Improving the Straight-Through Estimator with Zeroth-Order Information
Ningfeng Yang, Tor M. Aamodt
Main category: cs.LG
TL;DR: FOGZO combines first-order STE gradients with zeroth-order methods to improve quantized neural network training, achieving better accuracy with reduced computation compared to pure ZO methods.
Details
Motivation: Training neural networks with quantized parameters is challenging - STE provides biased gradients while ZO methods are unbiased but computationally expensive. There's a need to balance gradient quality and computational efficiency.Method: First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that leverages STE’s high-quality biased gradients while reducing bias through ZO corrections, achieving better tradeoff between quality and training time.
Result: 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% for ResNet 18/50, 1-22 perplexity improvement for LLaMA models up to 0.3B parameters. 796× computation reduction vs n-SPSA for 2-layer MLP on MNIST.
Conclusion: FOGZO effectively reduces STE bias while maintaining computational efficiency, making it a practical solution for quantization-aware training across various model architectures.
Abstract: We study the problem of training neural networks with quantized parameters. Learning low-precision quantized parameters by enabling computation of gradients via the Straight-Through Estimator (STE) can be challenging. While the STE enables back-propagation, which is a first-order method, recent works have explored the use of zeroth-order (ZO) gradient descent for fine-tuning. We note that the STE provides high-quality biased gradients, and ZO gradients are unbiased but can be expensive. We thus propose First-Order-Guided Zeroth-Order Gradient Descent (FOGZO) that reduces STE bias while reducing computations relative to ZO methods. Empirically, we show FOGZO improves the tradeoff between quality and training time in Quantization-Aware Pre-Training. Specifically, versus STE at the same number of iterations, we show a 1-8% accuracy improvement for DeiT Tiny/Small, 1-2% accuracy improvement on ResNet 18/50, and 1-22 perplexity point improvement for LLaMA models with up to 0.3 billion parameters. For the same loss, FOGZO yields a 796× reduction in computation versus n-SPSA for a 2-layer MLP on MNIST. Code is available at https://github.com/1733116199/fogzo.
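A minimal sketch of the two gradient estimators involved (an STE gradient for a sign-quantized weight and an SPSA-style zeroth-order estimate), combined with a fixed blending coefficient; the summary does not specify FOGZO's actual guidance rule, so the combination here is purely illustrative:

```python
import torch

def loss_fn(w_q, x, y):
    return ((x @ w_q - y) ** 2).mean()

torch.manual_seed(0)
x, y = torch.randn(64, 8), torch.randn(64, 1)
w = torch.randn(8, 1, requires_grad=True)

# first-order STE gradient: forward with sign(w), backward as if quantization were identity
w_q = w + (torch.sign(w) - w).detach()
loss_fn(w_q, x, y).backward()
g_ste = w.grad.clone()

# zeroth-order SPSA estimate on the quantized loss (no backward pass)
mu, delta = 1e-2, torch.randn_like(w)
with torch.no_grad():
    l_plus  = loss_fn(torch.sign(w + mu * delta), x, y)
    l_minus = loss_fn(torch.sign(w - mu * delta), x, y)
    g_zo = (l_plus - l_minus) / (2 * mu) * delta

alpha = 0.5                                      # illustrative blending coefficient
with torch.no_grad():
    w -= 1e-2 * (alpha * g_ste + (1 - alpha) * g_zo)
```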
[402] Differential Privacy: Gradient Leakage Attacks in Federated Learning Environments
Miguel Fernandez-de-Retana, Unai Zulaika, Rubén Sánchez-Corcuera, Aitor Almeida
Main category: cs.LG
TL;DR: DP-SGD effectively defends against gradient leakage attacks in federated learning but reduces model utility, while PDP-SGD maintains performance but fails as a practical defense.
Details
Motivation: Federated Learning is vulnerable to Gradient Leakage Attacks that can reveal private data from shared model updates, requiring effective defenses.Method: Evaluated DP-SGD and PDP-SGD mechanisms on computer vision models with varying privacy levels, analyzing private data reconstruction quality in simulated FL environment.
Result: DP-SGD significantly mitigates gradient leakage attacks with moderate utility trade-off, while PDP-SGD maintains strong classification performance but is ineffective against reconstruction attacks.
Conclusion: Empirical evaluation of privacy mechanisms is crucial beyond theoretical guarantees, especially in distributed learning where information leakage poses critical threats.
Abstract: Federated Learning (FL) allows for the training of Machine Learning models in a collaborative manner without the need to share sensitive data. However, it remains vulnerable to Gradient Leakage Attacks (GLAs), which can reveal private information from the shared model updates. In this work, we investigate the effectiveness of Differential Privacy (DP) mechanisms - specifically, DP-SGD and a variant based on explicit regularization (PDP-SGD) - as defenses against GLAs. To this end, we evaluate the performance of several computer vision models trained under varying privacy levels on a simple classification task, and then analyze the quality of private data reconstructions obtained from the intercepted gradients in a simulated FL environment. Our results demonstrate that DP-SGD significantly mitigates the risk of gradient leakage attacks, albeit with a moderate trade-off in model utility. In contrast, PDP-SGD maintains strong classification performance but proves ineffective as a practical defense against reconstruction attacks. These findings highlight the importance of empirically evaluating privacy mechanisms beyond their theoretical guarantees, particularly in distributed learning scenarios where information leakage may represent an unassumable critical threat to data security and privacy.
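A minimal sketch of the DP-SGD mechanism the study evaluates (per-sample gradient clipping plus Gaussian noise), written as a single explicit step on a toy model; production implementations (e.g. vectorized per-sample gradients) are omitted:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
x, y = torch.randn(32, 10), torch.randint(0, 2, (32,))
clip_norm, noise_multiplier, lr = 1.0, 1.1, 0.1

# accumulate clipped per-sample gradients
summed = [torch.zeros_like(p) for p in model.parameters()]
for i in range(x.size(0)):
    model.zero_grad()
    loss_fn(model(x[i:i+1]), y[i:i+1]).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]
    norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
    for s, g in zip(summed, grads):
        s += g * scale

# add Gaussian noise calibrated to the clipping norm, then average and step
with torch.no_grad():
    for p, s in zip(model.parameters(), summed):
        noisy = s + torch.randn_like(s) * noise_multiplier * clip_norm
        p -= lr * noisy / x.size(0)
```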
[403] A data free neural operator enabling fast inference of 2D and 3D Navier Stokes equations
Junho Choi, Teng-Yuan Chang, Namjung Kim, Youngjoon Hong
Main category: cs.LG
TL;DR: A data-free neural operator for Navier-Stokes equations that eliminates need for training data, enables real-time ensemble forecasting, and works for 3D flows where previous methods failed.
Details
Motivation: Ensemble simulations of high-dimensional flow models are computationally expensive for real-time applications, and existing neural operators require costly data and struggle with 3D flows.Method: Physics-grounded architecture that takes initial/boundary conditions and forcing functions as input, eliminating need for paired solution data through data-free operator network design.
Result: Method surpasses prior neural operators in accuracy across 2D benchmarks and 3D test cases, achieves greater efficiency than conventional solvers for ensembles, and successfully solves 3D Navier-Stokes equations.
Conclusion: This approach establishes a practical pathway toward data-free, high-fidelity PDE surrogates by combining numerically grounded architecture with machine learning scalability for scientific simulation and prediction.
Abstract: Ensemble simulations of high-dimensional flow models (e.g., Navier Stokes type PDEs) are computationally prohibitive for real time applications. Neural operators enable fast inference but are limited by costly data requirements and poor generalization to 3D flows. We present a data-free operator network for the Navier Stokes equations that eliminates the need for paired solution data and enables robust, real time inference for large ensemble forecasting. The physics-grounded architecture takes initial and boundary conditions as well as forcing functions, yielding solutions robust to high variability and perturbations. Across 2D benchmarks and 3D test cases, the method surpasses prior neural operators in accuracy and, for ensembles, achieves greater efficiency than conventional numerical solvers. Notably, it delivers accurate solutions of the three dimensional Navier Stokes equations, a regime not previously demonstrated for data free neural operators. By uniting a numerically grounded architecture with the scalability of machine learning, this approach establishes a practical pathway toward data free, high fidelity PDE surrogates for end to end scientific simulation and prediction.
[404] Modeling Biological Multifunctionality with Echo State Networks
Anastasia-Maria Leventi-Peetz, Jörg-Volker Peetz, Kai Weber, Nikolaos Zacharis
Main category: cs.LG
TL;DR: A 3D reaction-diffusion model was developed to simulate biological electrophysiological processes, and an Echo State Network successfully reproduced the system’s dynamics from the generated data.
Details
Motivation: To capture spatiotemporal behavior of biological systems, particularly electrophysiological processes, using computational models.Method: Developed a 3D multicomponent reaction-diffusion model with excitable-system dynamics, solved numerically to generate time-series data, then trained an Echo State Network on this data.
Result: The Echo State Network successfully reproduced the system’s dynamic behavior, demonstrating effective simulation of biological dynamics.
Conclusion: Simulating biological dynamics using data-driven, multifunctional ESN models is both feasible and effective.
Abstract: In this work, a three-dimensional multicomponent reaction-diffusion model has been developed, combining excitable-system dynamics with diffusion processes and sharing conceptual features with the FitzHugh-Nagumo model. Designed to capture the spatiotemporal behavior of biological systems, particularly electrophysiological processes, the model was solved numerically to generate time-series data. These data were subsequently used to train and evaluate an Echo State Network (ESN), which successfully reproduced the system’s dynamic behavior. The results demonstrate that simulating biological dynamics using data-driven, multifunctional ESN models is both feasible and effective.
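A minimal sketch of a basic echo state network (fixed random reservoir, ridge-regression readout) trained for one-step-ahead prediction of a toy signal standing in for the reaction-diffusion time series; hyperparameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 40 * np.pi, 4000)
series = np.sin(t) * np.cos(0.31 * t)                 # toy stand-in signal

n_res, leak, ridge = 300, 0.3, 1e-6
W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))       # scale spectral radius below 1

states = np.zeros((len(series) - 1, n_res))
h = np.zeros(n_res)
for i, u in enumerate(series[:-1]):
    h = (1 - leak) * h + leak * np.tanh(W_in[:, 0] * u + W @ h)
    states[i] = h

# ridge readout: predict the next value from the reservoir state
targets = series[1:]
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ targets)
pred = states @ W_out
print("train MSE:", np.mean((pred - targets) ** 2))
```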
[405] ChessQA: Evaluating Large Language Models for Chess Understanding
Qianfeng Wen, Zhenwei Tang, Ashton Anderson
Main category: cs.LG
TL;DR: ChessQA is a comprehensive benchmark that evaluates LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, Semantic) to provide a more complete assessment beyond simple move quality evaluations.
Details
Motivation: Existing evaluations of LLM chess ability are ad hoc and narrow, making it difficult to accurately measure chess understanding and how it varies with scale, training methods, or architecture choices.Method: Developed ChessQA benchmark with five task categories that correspond to ascending abstractions in chess knowledge, from basic rules to high-level concepts. The benchmark is dynamic with evolving prompts, answer keys, and construction scripts.
Result: Evaluation of contemporary LLMs revealed persistent weaknesses across all five categories, with detailed results and error analyses provided by category.
Conclusion: ChessQA provides a comprehensive, controlled setting for diagnosing and comparing LLM chess understanding, and will be released with code, datasets, and a public leaderboard to support further research.
Abstract: Chess provides an ideal testbed for evaluating the reasoning, modeling, and abstraction capabilities of large language models (LLMs), as it has well-defined structure and objective ground truth while admitting a wide spectrum of skill levels. However, existing evaluations of LLM ability in chess are ad hoc and narrow in scope, making it difficult to accurately measure LLM chess understanding and how it varies with scale, post-training methodologies, or architecture choices. We present ChessQA, a comprehensive benchmark that assesses LLM chess understanding across five task categories (Structural, Motifs, Short Tactics, Position Judgment, and Semantic), which approximately correspond to the ascending abstractions that players master as they accumulate chess knowledge, from understanding basic rules and learning tactical motifs to correctly calculating tactics, evaluating positions, and semantically describing high-level concepts. In this way, ChessQA captures a more comprehensive picture of chess ability and understanding, going significantly beyond the simple move quality evaluations done previously, and offers a controlled, consistent setting for diagnosis and comparison. Furthermore, ChessQA is inherently dynamic, with prompts, answer keys, and construction scripts that can evolve as models improve. Evaluating a range of contemporary LLMs, we find persistent weaknesses across all five categories and provide results and error analyses by category. We will release the code, periodically refreshed datasets, and a public leaderboard to support further research.
[406] A Pragmatic Way to Measure Chain-of-Thought Monitorability
Scott Emmons, Roland S. Zimmermann, David K. Elson, Rohin Shah
Main category: cs.LG
TL;DR: Proposes metrics to measure Chain-of-Thought monitorability through legibility and coverage, implemented via an LLM-powered autorater prompt.
Details
Motivation: To preserve AI safety through Chain-of-Thought monitoring by preventing loss of monitorability due to shifts in training practices or model architecture.Method: Developed metrics for legibility (human-followable reasoning) and coverage (complete reasoning for human reproduction of output), implemented with an autorater prompt that enables LLMs to compute these metrics on existing CoTs.
Result: Frontier models exhibit high monitorability on challenging benchmarks, and the autorater successfully detects synthetic CoT degradations.
Conclusion: The proposed metrics and autorater prompt provide a practical tool for developers to track monitorability impacts of design decisions, complementing adversarial testing for AI safety.
Abstract: While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two components of it: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks, finding that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hopes that others in the community will find it useful. Our method helps measure the default monitorability of CoT - it should be seen as a complement, not a replacement, for the adversarial stress-testing needed to test robustness against deliberately evasive models.
[407] An efficient probabilistic hardware architecture for diffusion-like models
Andraž Jelinčič, Owen Lockwood, Akhil Garlapati, Guillaume Verdon, Trevor McCourt
Main category: cs.LG
TL;DR: Proposes an all-transistor probabilistic computer that implements denoising models at hardware level, achieving 10,000x energy efficiency over GPUs.
Details
Motivation: Existing stochastic computers rely on limited modeling techniques and exotic, unscalable hardware, failing to gain traction despite promising efficiency gains.Method: Developed an all-transistor probabilistic computer architecture that implements powerful denoising models directly at the hardware level.
Result: System-level analysis shows devices based on this architecture could achieve performance parity with GPUs on image benchmarks using approximately 10,000 times less energy.
Conclusion: The proposed all-transistor probabilistic computer addresses previous limitations and offers dramatic energy efficiency improvements for probabilistic AI applications.
Abstract: The proliferation of probabilistic AI has promoted proposals for specialized stochastic computers. Despite promising efficiency gains, these proposals have failed to gain traction because they rely on fundamentally limited modeling techniques and exotic, unscalable hardware. In this work, we address these shortcomings by proposing an all-transistor probabilistic computer that implements powerful denoising models at the hardware level. A system-level analysis indicates that devices based on our architecture could achieve performance parity with GPUs on a simple image benchmark using approximately 10,000 times less energy.
[408] Diffusion Adaptive Text Embedding for Text-to-Image Diffusion Models
Byeonghu Na, Minsang Park, Gyuwon Sim, Donghyeok Shin, HeeSun Bae, Mina Kang, Se Jung Kwon, Wanmo Kang, Il-Chul Moon
Main category: cs.LG
TL;DR: DATE dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data to improve text-image alignment without requiring additional training.
Details
Motivation: Fixed text embeddings across all diffusion timesteps limit adaptability to the generative process and reduce text-image alignment.Method: Formulate an optimization problem and derive an update rule that refines text embeddings at each sampling step based on intermediate perturbed data.
Result: DATE maintains generative capability while providing superior text-image alignment over fixed embeddings across multi-concept generation and text-guided image editing tasks.
Conclusion: DATE enables dynamic adaptation of text conditions throughout diffusion sampling without additional training, improving text-image alignment.
Abstract: Text-to-image diffusion models rely on text embeddings from a pre-trained text encoder, but these embeddings remain fixed across all diffusion timesteps, limiting their adaptability to the generative process. We propose Diffusion Adaptive Text Embedding (DATE), which dynamically updates text embeddings at each diffusion timestep based on intermediate perturbed data. We formulate an optimization problem and derive an update rule that refines the text embeddings at each sampling step to improve alignment and preference between the mean predicted image and the text. This allows DATE to dynamically adapt the text conditions to the reverse-diffused images throughout diffusion sampling without requiring additional model training. Through theoretical analysis and empirical results, we show that DATE maintains the generative capability of the model while providing superior text-image alignment over fixed text embeddings across various tasks, including multi-concept generation and text-guided image editing. Our code is available at https://github.com/aailab-kaist/DATE.
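A minimal sketch of the per-step embedding refinement described above, using toy stand-ins for the denoiser and the alignment score (the names ToyDenoiser, predict_x0, and alignment_score are placeholders, not the paper's components):

```python
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, img_dim=32, emb_dim=16):
        super().__init__()
        self.net = nn.Linear(img_dim + emb_dim + 1, img_dim)
    def forward(self, x_t, t, emb):
        return self.net(torch.cat([x_t, emb, t], dim=-1))        # predicted noise

def predict_x0(x_t, eps_hat, alpha_bar_t):
    return (x_t - torch.sqrt(1 - alpha_bar_t) * eps_hat) / torch.sqrt(alpha_bar_t)

def alignment_score(x0_hat, emb, proj):
    # toy CLIP-like score: cosine similarity between projected image and text embedding
    return torch.cosine_similarity(proj(x0_hat), emb, dim=-1).mean()

denoiser, proj = ToyDenoiser(), nn.Linear(32, 16)
x_t, emb = torch.randn(4, 32), torch.randn(4, 16)
alpha_bar_t, eta = torch.tensor(0.5), 0.1

# one refinement step: nudge the text embedding along the gradient of the score
emb = emb.detach().requires_grad_(True)
eps_hat = denoiser(x_t, torch.full((4, 1), 0.7), emb)
score = alignment_score(predict_x0(x_t, eps_hat, alpha_bar_t), emb, proj)
grad, = torch.autograd.grad(score, emb)
emb = (emb + eta * grad).detach()     # refined embedding used for this sampling step
```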
[409] Synergistic Neural Forecasting of Air Pollution with Stochastic Sampling
Yohan Abeysinghe, Muhammad Akhtar Munir, Sanoojan Baliah, Ron Sarafian, Fahad Shahbaz Khan, Yinon Rudich, Salman Khan
Main category: cs.LG
TL;DR: SynCast is a neural forecasting model that improves PM concentration predictions, especially for extreme pollution events, using transformer architecture and diffusion-based refinement.
Details
Motivation: Air pollution is a major global health risk, and existing models often underestimate hazardous pollution spikes from wildfires, urban haze, and dust storms, making accurate forecasting essential for timely public health interventions.Method: Built on a regionally adapted transformer backbone with diffusion-based stochastic refinement module, integrating meteorological and air composition data from ERA5 and CAMS datasets, using domain-aware objectives and extreme value theory.
Result: Substantial gains in forecasting fidelity across PM1, PM2.5, and PM10 variables, especially under extreme conditions, without compromising global accuracy.
Conclusion: SynCast provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
Abstract: Air pollution remains a leading global health and environmental risk, particularly in regions vulnerable to episodic air pollution spikes due to wildfires, urban haze and dust storms. Accurate forecasting of particulate matter (PM) concentrations is essential to enable timely public health warnings and interventions, yet existing models often underestimate rare but hazardous pollution events. Here, we present SynCast, a high-resolution neural forecasting model that integrates meteorological and air composition data to improve predictions of both average and extreme pollution levels. Built on a regionally adapted transformer backbone and enhanced with a diffusion-based stochastic refinement module, SynCast captures the nonlinear dynamics driving PM spikes more accurately than existing approaches. Leveraging harmonized ERA5 and CAMS datasets, our model shows substantial gains in forecasting fidelity across multiple PM variables (PM1, PM2.5, PM10), especially under extreme conditions. We demonstrate that conventional loss functions underrepresent distributional tails (rare pollution events) and show that SynCast, guided by domain-aware objectives and extreme value theory, significantly enhances performance in highly impacted regions without compromising global accuracy. This approach provides a scalable foundation for next-generation air quality early warning systems and supports climate-health risk mitigation in vulnerable regions.
[410] HyperGraphX: Graph Transductive Learning with Hyperdimensional Computing and Message Passing
Guojing Cong, Tom Potok, Hamed Poursiami, Maryam Parsa
Main category: cs.LG
TL;DR: HDGC combines graph convolution with hyperdimensional computing operations for graph learning, achieving superior accuracy and significant speed improvements over existing methods.
Details
Motivation: To develop a more efficient and accurate graph learning algorithm by integrating graph convolution with hyperdimensional computing's binding and bundling operations.Method: HDGC algorithm that marries graph convolution with binding and bundling operations from hyperdimensional computing for transductive graph learning.
Result: Outperforms major GNN implementations and state-of-the-art hyperdimensional computing methods on both homophilic and heterophilic graphs. Achieves 9561x speedup over GCNII and 144.5x speedup over HDGL on same GPU platform.
Conclusion: HDGC demonstrates superior performance in both accuracy and efficiency, with promising potential for energy-efficient implementation on neuromorphic and process-in-memory devices due to its binary vector operations.
Abstract: We present a novel algorithm, HDGC, that marries graph convolution with binding and bundling operations in hyperdimensional computing for transductive graph learning. For prediction accuracy, HDGC outperforms major and popular graph neural network implementations as well as state-of-the-art hyperdimensional computing implementations for a collection of homophilic graphs and heterophilic graphs. Compared with the most accurate learning methodologies we have tested, on the same target GPU platform, HDGC is on average 9561.0 and 144.5 times faster than GCNII, a graph neural network implementation, and HDGL, a hyperdimensional computing implementation, respectively. As the majority of the learning operates on binary vectors, we expect outstanding energy performance of HDGC on neuromorphic and emerging process-in-memory devices.
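A minimal sketch of the basic hyperdimensional operations named above (binding as element-wise XOR, bundling as majority vote) combined with one round of neighbor aggregation on a toy graph; this illustrates the building blocks, not the HDGC algorithm itself:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2048                                   # hypervector dimensionality

def bind(a, b):                            # XOR binding of binary hypervectors
    return np.bitwise_xor(a, b)

def bundle(vectors):                       # majority-vote bundling
    return (np.sum(vectors, axis=0) * 2 >= len(vectors)).astype(np.uint8)

# toy graph: 4 nodes with adjacency lists
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
node_hv = {v: rng.integers(0, 2, D, dtype=np.uint8) for v in neighbors}
role_hv = rng.integers(0, 2, D, dtype=np.uint8)        # role vector marking "neighbor of"

# one message-passing round: bundle each node's own vector with bound neighbor vectors
updated = {
    v: bundle([node_hv[v]] + [bind(role_hv, node_hv[u]) for u in neighbors[v]])
    for v in neighbors
}
sim = 1 - np.mean(updated[0] != updated[1])            # Hamming similarity between nodes
print(f"similarity(0, 1) = {sim:.3f}")
```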
[411] STNet: Spectral Transformation Network for Solving Operator Eigenvalue Problem
Hong Wang, Jiang Yixuan, Jie Wang, Xinyi Li, Jian Luo, Huanshuo Dong
Main category: cs.LG
TL;DR: STNet uses spectral transformations to improve neural network-based eigenvalue computation by deflating solved eigenfunctions and filtering desired eigenvalue regions, achieving state-of-the-art accuracy.
Details
Motivation: Existing deep learning methods for operator eigenvalue problems suffer from performance dependence on spectral distribution and the curse of dimensionality.Method: Uses spectral transformations including deflation projection to exclude solved eigenfunctions and filter transform to magnify desired eigenvalues while suppressing others.
Result: STNet consistently outperforms existing learning-based methods and achieves state-of-the-art performance in accuracy across extensive experiments.
Conclusion: Spectral transformations effectively enhance neural network-based eigenvalue computation by leveraging spectral distribution information.
Abstract: Operator eigenvalue problems play a critical role in various scientific fields and engineering applications, yet numerical methods are hindered by the curse of dimensionality. Recent deep learning methods provide an efficient approach to address this challenge by iteratively updating neural networks. These methods’ performance relies heavily on the spectral distribution of the given operator: larger gaps between the operator’s eigenvalues will improve precision, thus tailored spectral transformations that leverage the spectral distribution can enhance their performance. Based on this observation, we propose the Spectral Transformation Network (STNet). During each iteration, STNet uses approximate eigenvalues and eigenfunctions to perform spectral transformations on the original operator, turning it into an equivalent but easier problem. Specifically, we employ deflation projection to exclude the subspace corresponding to already solved eigenfunctions, thereby reducing the search space and avoiding converging to existing eigenfunctions. Additionally, our filter transform magnifies eigenvalues in the desired region and suppresses those outside, further improving performance. Extensive experiments demonstrate that STNet consistently outperforms existing learning-based methods, achieving state-of-the-art performance in accuracy.
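A minimal sketch of the deflation-projection idea on a plain symmetric matrix rather than a neural parameterization: once an eigenpair is found, the iteration is projected onto the orthogonal complement so it cannot converge back to the solved eigenfunction. The filter transform and the operator-learning machinery are omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 50)); A = (A + A.T) / 2          # toy symmetric operator

def smallest_eig(M, deflate=(), iters=3000):
    """Shifted power iteration for the smallest eigenpair of symmetric M,
    projecting out any already-found eigenvectors in `deflate` at every step."""
    B = np.linalg.norm(M, 2) * np.eye(M.shape[0]) - M     # smallest eig of M -> largest of B
    v = rng.normal(size=M.shape[0])
    for _ in range(iters):
        v = B @ v
        for u in deflate:                                  # deflation projection
            v -= (u @ v) * u
        v /= np.linalg.norm(v)
    return v @ M @ v, v

lam1, v1 = smallest_eig(A)
lam2, v2 = smallest_eig(A, deflate=(v1,))                  # cannot converge back to v1
print("found:          ", round(lam1, 4), round(lam2, 4))
print("numpy reference:", np.round(np.sort(np.linalg.eigvalsh(A))[:2], 4))
```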
[412] Optimal Arm Elimination Algorithms for Combinatorial Bandits
Yuxiao Wen, Yanjun Han, Zhengyuan Zhou
Main category: cs.LG
TL;DR: A novel elimination scheme for combinatorial bandits that partitions arms into confirmed, active, and eliminated categories, achieving near-optimal regret in graph feedback and contextual settings where UCB methods fail.
Details
Motivation: Combinatorial bandits require selecting multiple arms per round, but adapting arm elimination methods has been challenging while UCB-based approaches can fail due to insufficient explicit exploration.Method: Introduces a novel elimination scheme with three arm categories (confirmed, active, eliminated) and incorporates explicit exploration to update these sets.
Result: Achieves near-optimal regret in combinatorial multi-armed bandits with general graph feedback and combinatorial linear contextual bandits, outperforming UCB-based methods.
Conclusion: The proposed elimination method successfully addresses the limitations of UCB approaches in combinatorial bandits by providing sufficient explicit exploration, with matching lower bounds confirming near-optimal performance.
Abstract: Combinatorial bandits extend the classical bandit framework to settings where the learner selects multiple arms in each round, motivated by applications such as online recommendation and assortment optimization. While extensions of upper confidence bound (UCB) algorithms arise naturally in this context, adapting arm elimination methods has proved more challenging. We introduce a novel elimination scheme that partitions arms into three categories (confirmed, active, and eliminated), and incorporates explicit exploration to update these sets. We demonstrate the efficacy of our algorithm in two settings: the combinatorial multi-armed bandit with general graph feedback, and the combinatorial linear contextual bandit. In both cases, our approach achieves near-optimal regret, whereas UCB-based methods can provably fail due to insufficient explicit exploration. Matching lower bounds are also provided.
[413] Predicting Barge Tow Size on Inland Waterways Using Vessel Trajectory Derived Features: Proof of Concept
Geoffery Agorku, Sarah Hernandez, Hayley Hames, Cade Wagner
Main category: cs.LG
TL;DR: Using AIS data and machine learning to predict barge quantities on inland waterways, achieving MAE of 1.92 barges with Poisson Regressor model.
Details
Motivation: Accurate real-time estimation of barge quantity is challenging due to non-self-propelled nature of barges and limitations of existing monitoring systems.Method: Used AIS vessel tracking data with ML, manually annotated barge instances from satellite scenes, created 30 AIS-derived features, applied Recursive Feature Elimination, and tested six regression models.
Result: Poisson Regressor performed best with MAE of 1.92 barges using 12 features; course entropy, speed variability and trip length were most predictive.
Conclusion: Provides scalable method for Maritime Domain Awareness with applications in lock scheduling and freight planning; future work will test transferability to other rivers.
Abstract: Accurate, real-time estimation of barge quantity on inland waterways remains a critical challenge due to the non-self-propelled nature of barges and the limitations of existing monitoring systems. This study introduces a novel method to use Automatic Identification System (AIS) vessel tracking data to predict the number of barges in tow using Machine Learning (ML). To train and test the model, barge instances were manually annotated from satellite scenes across the Lower Mississippi River. Labeled images were matched to AIS vessel tracks using a spatiotemporal matching procedure. A comprehensive set of 30 AIS-derived features capturing vessel geometry, dynamic movement, and trajectory patterns were created and evaluated using Recursive Feature Elimination (RFE) to identify the most predictive variables. Six regression models, including ensemble, kernel-based, and generalized linear approaches, were trained and evaluated. The Poisson Regressor model yielded the best performance, achieving a Mean Absolute Error (MAE) of 1.92 barges using 12 of the 30 features. The feature importance analysis revealed that metrics capturing vessel maneuverability such as course entropy, speed variability and trip length were most predictive of barge count. The proposed approach provides a scalable, readily implementable method for enhancing Maritime Domain Awareness (MDA), with strong potential applications in lock scheduling, port management, and freight planning. Future work will expand the proof of concept presented here to explore model transferability to other inland rivers with differing operational and environmental conditions.
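A minimal sketch of the best-performing combination reported in the summary (recursive feature elimination wrapped around a Poisson regressor, evaluated with MAE) on synthetic AIS-like features; the study's actual features and data are not reproduced:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import PoissonRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, p = 400, 30                                      # 30 trajectory-derived features, as in the study
X = rng.normal(size=(n, p))
true_rate = np.exp(0.4 * X[:, 0] - 0.3 * X[:, 5] + 1.0)
y = rng.poisson(true_rate)                          # synthetic barge count per tow

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
selector = RFE(PoissonRegressor(max_iter=1000), n_features_to_select=12).fit(X_tr, y_tr)
model = PoissonRegressor(max_iter=1000).fit(X_tr[:, selector.support_], y_tr)
pred = model.predict(X_te[:, selector.support_])
print("MAE:", mean_absolute_error(y_te, pred),
      "| kept features:", np.flatnonzero(selector.support_))
```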
[414] Training-Free Safe Text Embedding Guidance for Text-to-Image Diffusion Models
Byeonghu Na, Mina Kang, Jiseok Kwak, Minsang Park, Jiwoo Shin, SeJoon Jun, Gayoung Lee, Jin-Hwa Kim, Il-Chul Moon
Main category: cs.LG
TL;DR: STG is a training-free method that guides text embeddings during diffusion model sampling to generate safer images without compromising quality, outperforming existing baselines on safety metrics.
Details
Motivation: Text-to-image models trained on web-crawled datasets often generate harmful content from malicious prompts, raising safety concerns that need addressing.Method: Safe Text embedding Guidance (STG) adjusts text embeddings during sampling using a safety function evaluated on expected denoised images, aligning model distribution with safety constraints without additional training.
Result: STG consistently outperforms training-based and training-free baselines in removing unsafe content (nudity, violence, artist-style) while preserving semantic intent of input prompts.
Conclusion: STG provides an effective training-free solution for improving diffusion model safety by guiding text embeddings during sampling, achieving safer outputs with minimal quality degradation.
Abstract: Text-to-image models have recently made significant advances in generating realistic and semantically coherent images, driven by advanced diffusion models and large-scale web-crawled datasets. However, these datasets often contain inappropriate or biased content, raising concerns about the generation of harmful outputs when provided with malicious text prompts. We propose Safe Text embedding Guidance (STG), a training-free approach to improve the safety of diffusion models by guiding the text embeddings during sampling. STG adjusts the text embeddings based on a safety function evaluated on the expected final denoised image, allowing the model to generate safer outputs without additional training. Theoretically, we show that STG aligns the underlying model distribution with safety constraints, thereby achieving safer outputs while minimally affecting generation quality. Experiments on various safety scenarios, including nudity, violence, and artist-style removal, show that STG consistently outperforms both training-based and training-free baselines in removing unsafe content while preserving the core semantic intent of input prompts. Our code is available at https://github.com/aailab-kaist/STG.
[415] NeuroPathNet: Dynamic Path Trajectory Learning for Brain Functional Connectivity Analysis
Guo Tianqi Guo, Chen Liping, Peng Ciyuan, Guo Jingjing, Ren Jing
Main category: cs.LG
TL;DR: Proposes NeuroPathNet, a path-level trajectory modeling framework to characterize dynamic connection pathways between brain functional partitions, outperforming existing methods on fMRI datasets.
Details
Motivation: Existing methods struggle to capture temporal evolution characteristics of connections between specific functional communities in brain networks, which is important for understanding cognitive mechanisms and diagnosing neurological diseases.Method: Extracts time series of connection strengths between functional partitions using medically supported static partitioning schemes, then models them using a temporal neural network framework.
Result: Validated on three public fMRI datasets, showing superior performance over existing mainstream methods across multiple evaluation metrics.
Conclusion: The framework promotes dynamic graph learning for brain network analysis and provides potential clinical applications for neurological disease diagnosis.
Abstract: Understanding the evolution of brain functional networks over time is of great significance for the analysis of cognitive mechanisms and the diagnosis of neurological diseases. Existing methods often have difficulty in capturing the temporal evolution characteristics of connections between specific functional communities. To this end, this paper proposes a new path-level trajectory modeling framework (NeuroPathNet) to characterize the dynamic behavior of connection pathways between brain functional partitions. Based on medically supported static partitioning schemes (such as Yeo and Smith ICA), we extract the time series of connection strengths between each pair of functional partitions and model them using a temporal neural network. We validate the model performance on three public functional Magnetic Resonance Imaging (fMRI) datasets, and the results show that it outperforms existing mainstream methods on multiple evaluation metrics. This study can promote the development of dynamic graph learning methods for brain network analysis and offers potential clinical applications for the diagnosis of neurological diseases.
[416] Efficient Global-Local Fusion Sampling for Physics-Informed Neural Networks
Jiaqi Luo, Shixin Xu, Zhouwang Yang
Main category: cs.LG
TL;DR: Proposes Global-Local Fusion (GLF) Sampling Strategy for PINNs that combines global stability with local efficiency through residual-adaptive sampling and lightweight surrogate approximation.
Details
Motivation: PINNs' accuracy depends on collocation point placement - global sampling is stable but computationally expensive, while local sampling is efficient but may neglect well-learned areas and reduce robustness.Method: GLF generates collocation points by perturbing training points with Gaussian noise scaled inversely to residual, concentrating samples in difficult regions while preserving exploration. Uses lightweight linear surrogate to approximate global residual-based distribution for reduced computational cost.
Result: Extensive experiments on benchmark PDEs show GLF consistently improves both accuracy and efficiency compared to global and local sampling strategies.
Conclusion: GLF provides a practical and scalable framework for enhancing PINNs’ reliability and efficiency in solving complex and high-dimensional PDEs.
Abstract: The accuracy of Physics-Informed Neural Networks (PINNs) critically depends on the placement of collocation points, as the PDE loss is approximated through sampling over the solution domain. Global sampling ensures stability by covering the entire domain but requires many samples and is computationally expensive, whereas local sampling improves efficiency by focusing on high-residual regions but may neglect well-learned areas, reducing robustness. We propose a Global-Local Fusion (GLF) Sampling Strategy that combines the strengths of both approaches. Specifically, new collocation points are generated by perturbing training points with Gaussian noise scaled inversely to the residual, thereby concentrating samples in difficult regions while preserving exploration. To further reduce computational overhead, a lightweight linear surrogate is introduced to approximate the global residual-based distribution, achieving similar effectiveness at a fraction of the cost. Together, these components, residual-adaptive sampling and residual-based approximation, preserve the stability of global methods while retaining the efficiency of local refinement. Extensive experiments on benchmark PDEs demonstrate that GLF consistently improves both accuracy and efficiency compared with global and local sampling strategies. This study provides a practical and scalable framework for enhancing the reliability and efficiency of PINNs in solving complex and high-dimensional PDEs.
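As a concrete illustration of the residual-adaptive step, here is a minimal sketch; the `residual` interface, the inverse-residual noise scaling, and the clipping to the domain are illustrative assumptions rather than the paper's exact formulation (the lightweight surrogate is not shown).

```python
# Sketch: perturb training points with Gaussian noise scaled inversely to their residual,
# so new collocation points concentrate near high-residual (difficult) regions.
import numpy as np

def glf_resample(x, residual, domain_lo, domain_hi, sigma_max=0.1, eps=1e-8):
    r = residual(x)                                    # [N] PDE residual magnitude per point
    inv = 1.0 / (r + eps)
    scale = sigma_max * inv / inv.max()                # in (0, sigma_max]; smallest where residual is largest
    x_new = x + scale[:, None] * np.random.randn(*x.shape)
    return np.clip(x_new, domain_lo, domain_hi)        # keep collocation points inside the domain
```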
[417] Spatio-temporal Multivariate Time Series Forecast with Chosen Variables
Zibo Liu, Zhe Jiang, Zelin Xu, Tingsong Xiao, Yupu Zhang, Zhengkun Xiao, Haibo Wang, Shigang Chen
Main category: cs.LG
TL;DR: This paper introduces a new problem of STMF with chosen variables, proposing a unified framework that jointly performs variable selection and model optimization to maximize forecast accuracy when only m-out-of-n variables can be monitored due to budget constraints.
Details
Motivation: Existing STMF methods assume pre-determined sensor locations, but the critical problem of optimally selecting which m variables to monitor from n possible locations has never been studied, despite practical budget constraints in sensing applications.Method: Proposes a unified framework with three novel components: (1) masked variable-parameter pruning using quantile-based masking, (2) prioritized variable-parameter replay of low-loss samples, and (3) dynamic extrapolation mechanism using spatial embeddings and adjacency information.
Result: Experiments on five real-world datasets show the proposed method significantly outperforms state-of-the-art baselines in both accuracy and efficiency.
Conclusion: The work demonstrates the effectiveness of joint variable selection and model optimization for STMF, filling an important gap in practical sensing applications with budget constraints.
Abstract: Spatio-Temporal Multivariate time series Forecast (STMF) uses the time series of $n$ spatially distributed variables in a period of recent past to forecast their values in a period of near future. It has important applications in spatio-temporal sensing forecast such as road traffic prediction and air pollution prediction. Recent papers have addressed a practical problem of missing variables in the model input, which arises in the sensing applications where the number $m$ of sensors is far less than the number $n$ of locations to be monitored, due to budget constraints. We observe that the state of the art assumes that the $m$ variables (i.e., locations with sensors) in the model input are pre-determined and the important problem of how to choose the $m$ variables in the input has never been studied. This paper fills the gap by studying a new problem of STMF with chosen variables, which optimally selects $m$-out-of-$n$ variables for the model input in order to maximize the forecast accuracy. We propose a unified framework that jointly performs variable selection and model optimization for both forecast accuracy and model efficiency. It consists of three novel technical components: (1) masked variable-parameter pruning, which progressively prunes less informative variables and attention parameters through quantile-based masking; (2) prioritized variable-parameter replay, which replays low-loss past samples to preserve learned knowledge for model stability; (3) dynamic extrapolation mechanism, which propagates information from variables selected for the input to all other variables via learnable spatial embeddings and adjacency information. Experiments on five real-world datasets show that our work significantly outperforms the state-of-the-art baselines in both accuracy and efficiency, demonstrating the effectiveness of joint variable selection and model optimization.
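A minimal sketch of the quantile-based masking idea in component (1), assuming some per-variable importance score is available (e.g., attention mass or loss sensitivity); the scoring function and pruning schedule here are illustrative, not the paper's.

```python
# Sketch: progressively prune variables whose importance falls in the bottom-q quantile
# until only m of the n candidate locations remain selected.
import numpy as np

def progressive_mask(n, m, importance, q=0.2):
    mask = np.ones(n, dtype=bool)                      # True = variable still selected
    while mask.sum() > m:
        scores = importance(mask)                      # scores for the currently kept variables
        thresh = np.quantile(scores, q)                # bottom-q quantile threshold
        drop = np.where(mask)[0][scores <= thresh]     # candidates to prune this round
        drop = drop[: int(mask.sum()) - m]             # never prune below m variables
        mask[drop] = False
    return mask
```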
[418] GraphNet: A Large-Scale Computational Graph Dataset for Tensor Compiler Research
Xinqi Li, Yiqun Liu, Shan Jiang, Enrong Zheng, Huaijin Zheng, Wenhao Dai, Haodong Deng, Dianhai Yu, Yanjun Ma
Main category: cs.LG
TL;DR: GraphNet is a dataset of 2.7K real-world deep learning computational graphs with metadata across six task categories and multiple frameworks, featuring new benchmark metrics Speedup Score S(t) and Error-aware Speedup Score ES(t) for evaluating tensor compiler performance.
Details
Motivation: To provide a comprehensive real-world dataset for evaluating tensor compiler performance on diverse deep learning computational graphs, addressing the need for reliable metrics that consider both runtime speedup and execution correctness.Method: Created GraphNet dataset with 2.7K computational graphs spanning six major task categories across multiple deep learning frameworks. Proposed Speedup Score S(t) metric that jointly considers runtime speedup and execution correctness under tunable tolerance levels, and extended it to Error-aware Speedup Score ES(t) that incorporates error information.
Result: The paper benchmarks default tensor compilers (CINN for PaddlePaddle and TorchInductor for PyTorch) on computer vision and natural language processing samples to demonstrate GraphNet’s practicality. The full construction pipeline with graph extraction and compiler evaluation tools is publicly available.
Conclusion: GraphNet provides a valuable dataset and evaluation framework for tensor compiler development, with practical applications demonstrated through benchmarking major compilers, and the tools are open-sourced for community use.
Abstract: We introduce GraphNet, a dataset of 2.7K real-world deep learning computational graphs with rich metadata, spanning six major task categories across multiple deep learning frameworks. To evaluate tensor compiler performance on these samples, we propose the benchmark metric Speedup Score S(t), which jointly considers runtime speedup and execution correctness under tunable tolerance levels, offering a reliable measure of general optimization capability. Furthermore, we extend S(t) to the Error-aware Speedup Score ES(t), which incorporates error information and helps compiler developers identify key performance bottlenecks. In this report, we benchmark the default tensor compilers, CINN for PaddlePaddle and TorchInductor for PyTorch, on computer vision (CV) and natural language processing (NLP) samples to demonstrate the practicality of GraphNet. The full construction pipeline with graph extraction and compiler evaluation tools is available at https://github.com/PaddlePaddle/GraphNet .
[419] Geometric Algorithms for Neural Combinatorial Optimization with Constraints
Nikolaos Karalias, Akbar Rafiey, Yifei Xu, Zhishang Luo, Behrooz Tahmasebi, Connie Jiang, Stefanie Jegelka
Main category: cs.LG
TL;DR: A self-supervised learning framework for combinatorial optimization that handles discrete constraints by decomposing neural network outputs into convex combinations of feasible solutions using convex geometry techniques.
Details
Motivation: To address the challenge of solving combinatorial optimization problems with discrete constraints using self-supervised learning, where traditional neural approaches struggle with constraint satisfaction.Method: End-to-end differentiable framework leveraging convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners representing feasible sets, enabling quality-preserving rounding.
Result: Extensive experiments show consistent outperformance over neural baselines in cardinality-constrained optimization, with successful application to independent sets and matroid-constrained problems.
Conclusion: The proposed decomposition-based approach effectively enables self-supervised training for combinatorial optimization while ensuring feasible solution generation across diverse problem types.
Abstract: Self-Supervised Learning (SSL) for Combinatorial Optimization (CO) is an emerging paradigm for solving combinatorial problems using neural networks. In this paper, we address a central challenge of SSL for CO: solving problems with discrete constraints. We design an end-to-end differentiable framework that enables us to solve discrete constrained optimization problems with neural networks. Concretely, we leverage algorithmic techniques from the literature on convex geometry and Carathéodory’s theorem to decompose neural network outputs into convex combinations of polytope corners that correspond to feasible sets. This decomposition-based approach not only enables self-supervised training but also ensures efficient quality-preserving rounding of the neural net output into feasible solutions. Extensive experiments in cardinality-constrained optimization show that our approach can consistently outperform neural baselines. We further provide worked-out examples of how our method can be applied beyond cardinality-constrained problems to a diverse set of combinatorial optimization tasks, including finding independent sets in graphs, and solving matroid-constrained problems.
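The geometric fact behind the decomposition can be stated compactly (a standard Carathéodory statement plus the rounding argument for a linear objective; notation ours, not the paper's):

```latex
% Any relaxed output x in the feasible polytope P \subset \mathbb{R}^d with vertex set V(P)
% (indicator vectors of feasible sets) admits a convex decomposition
    x \;=\; \sum_{j=1}^{d+1} \lambda_j v_j,
    \qquad v_j \in V(P), \quad \lambda_j \ge 0, \quad \textstyle\sum_j \lambda_j = 1 .
% For a linear objective c, rounding to the best vertex of the decomposition preserves quality:
    \min_j \, c^{\top} v_j \;\le\; \sum_j \lambda_j \, c^{\top} v_j \;=\; c^{\top} x .
```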
[420] Causal-Aware Generative Adversarial Networks with Reinforcement Learning
Tu Anh Hoang Nguyen, Dang Nguyen, Tri-Nhan Vo, Thuc Duy Le, Sunil Gupta
Main category: cs.LG
TL;DR: CA-GAN is a novel generative framework for tabular data that addresses privacy concerns while preserving causal relationships, data utility, and providing provable privacy guarantees through a two-step approach combining causal graph extraction and conditional WGAN-GP with reinforcement learning.
Details
Motivation: Existing tabular data generation methods, particularly GAN-based approaches, struggle with capturing complex causal relationships, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment, limiting their practical utility for privacy-sensitive applications.Method: CA-GAN uses a two-step approach: (1) causal graph extraction to learn comprehensive causal relationships in the data manifold, and (2) a custom Conditional WGAN-GP that operates according to the causal graph structure, trained with a novel Reinforcement Learning-based objective that aligns causal graphs from real and fake data.
Result: CA-GAN demonstrates superiority over six state-of-the-art methods across 14 tabular datasets, achieving strong performance in causal preservation, utility preservation, and privacy preservation - the core data engineering metrics.
Conclusion: CA-GAN provides a practical, high-performance solution for creating high-quality, privacy-compliant synthetic datasets that can be used for benchmarking database systems, accelerating software development, and facilitating secure data-driven research.
Abstract: The utility of tabular data for tasks ranging from model training to large-scale data analysis is often constrained by privacy concerns or regulatory hurdles. While existing data generation methods, particularly those based on Generative Adversarial Networks (GANs), have shown promise, they frequently struggle with capturing complex causal relationships, maintaining data utility, and providing provable privacy guarantees suitable for enterprise deployment. We introduce CA-GAN, a novel generative framework specifically engineered to address these challenges for real-world tabular datasets. CA-GAN utilizes a two-step approach: causal graph extraction to learn robust, comprehensive causal relationships in the data’s manifold, followed by a custom Conditional WGAN-GP (Wasserstein GAN with Gradient Penalty) that operates strictly according to the structure of the causal graph. More importantly, the generator is trained with a new Reinforcement Learning-based objective that aligns the causal graphs constructed from real and fake data, ensuring causal awareness in both the training and sampling phases. We demonstrate CA-GAN’s superiority over six SOTA methods across 14 tabular datasets. Our evaluations focus on core data engineering metrics: causal preservation, utility preservation, and privacy preservation. Our method offers a practical, high-performance solution for data engineers seeking to create high-quality, privacy-compliant synthetic datasets to benchmark database systems, accelerate software development, and facilitate secure data-driven research.
[421] Localized Kernel Projection Outlyingness: A Two-Stage Approach for Multi-Modal Outlier Detection
Akira Tamamori
Main category: cs.LG
TL;DR: Two-Stage LKPLO is a multi-stage outlier detection framework that combines kernel PCA for non-linear data linearization and local clustering for multi-modal distributions, achieving state-of-the-art performance on challenging datasets.
Details
Motivation: To overcome limitations of conventional projection-based methods that rely on fixed statistical metrics and assume single data structures, which fail on complex datasets with non-linear and multi-modal characteristics.Method: Two-stage framework: (1) global kernel PCA stage to linearize non-linear data structures, (2) local clustering stage to handle multi-modal distributions, using generalized loss-based outlyingness measure (PLO) with adaptive loss functions like SVM-like loss.
Result: Achieved state-of-the-art performance in 5-fold cross-validation on 10 benchmark datasets, significantly outperforming baselines on challenging datasets like Optdigits (multi-cluster) and Arrhythmia (high-dimensional). Ablation study confirmed both stages are essential.
Conclusion: The synergistic combination of kernelization and localization stages creates a powerful tool for outlier detection, demonstrating the importance of hybrid multi-stage architectures for handling complex data structures.
Abstract: This paper presents Two-Stage LKPLO, a novel multi-stage outlier detection framework that overcomes the coexisting limitations of conventional projection-based methods: their reliance on a fixed statistical metric and their assumption of a single data structure. Our framework uniquely synthesizes three key concepts: (1) a generalized loss-based outlyingness measure (PLO) that replaces the fixed metric with flexible, adaptive loss functions like our proposed SVM-like loss; (2) a global kernel PCA stage to linearize non-linear data structures; and (3) a subsequent local clustering stage to handle multi-modal distributions. Comprehensive 5-fold cross-validation experiments on 10 benchmark datasets, with automated hyperparameter optimization, demonstrate that Two-Stage LKPLO achieves state-of-the-art performance. It significantly outperforms strong baselines on datasets with challenging structures where existing methods fail, most notably on multi-cluster data (Optdigits) and complex, high-dimensional data (Arrhythmia). Furthermore, an ablation study empirically confirms that the synergistic combination of both the kernelization and localization stages is indispensable for its superior performance. This work contributes a powerful new tool for a significant class of outlier detection problems and underscores the importance of hybrid, multi-stage architectures.
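A simplified end-to-end sketch of the two-stage pipeline (kernel PCA, then local clustering, then a per-cluster projection outlyingness score). For brevity the score below is the classical Stahel-Donoho outlyingness over random projections, standing in for the paper's loss-based PLO with an SVM-like loss.

```python
# Sketch: kernelize globally, cluster locally, then score each point's outlyingness
# within its own cluster via random projections.
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.cluster import KMeans

def two_stage_outlyingness(X, n_components=10, n_clusters=3, n_proj=200, seed=0):
    rng = np.random.default_rng(seed)
    Z = KernelPCA(n_components=n_components, kernel="rbf").fit_transform(X)     # stage 1: linearize
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)  # stage 2: localize
    scores = np.zeros(len(Z))
    for c in range(n_clusters):
        Zc = Z[labels == c]
        U = rng.standard_normal((n_proj, Z.shape[1]))
        U /= np.linalg.norm(U, axis=1, keepdims=True)        # random unit projection directions
        P = Zc @ U.T                                          # [n_c, n_proj] projected cluster data
        med = np.median(P, axis=0)
        mad = np.median(np.abs(P - med), axis=0) + 1e-12
        scores[labels == c] = np.max(np.abs(P - med) / mad, axis=1)  # worst-case projection outlyingness
    return scores                                             # higher = more outlying
```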
[422] Learning from History: A Retrieval-Augmented Framework for Spatiotemporal Prediction
Hao Jia, Penghao Zhao, Hao Wu, Yuan Gao, Yangyu Tao, Bin Cui
Main category: cs.LG
TL;DR: RAP is a hybrid framework that combines deep learning with historical data retrieval to improve long-term spatiotemporal predictions by using similar historical patterns as dynamic guidance.
Details
Motivation: Deep learning models for physical systems suffer from error accumulation in long-term predictions and struggle to capture full system constraints, leading to physically implausible results.Method: Retrieval-Augmented Prediction (RAP) framework retrieves similar historical evolutionary patterns from a database and uses their true future evolution as reference targets in a dual-stream architecture to guide predictions.
Result: RAP outperforms state-of-the-art methods and analog-only baselines across meteorology, turbulence, and fire simulation benchmarks, generating more physically realistic predictions with reduced error divergence.
Conclusion: The hybrid approach of combining parametric deep learning with non-parametric historical data retrieval effectively addresses error accumulation in long-term spatiotemporal predictions for complex physical systems.
Abstract: Accurate and long-term spatiotemporal prediction for complex physical systems remains a fundamental challenge in scientific computing. While deep learning models, as powerful parametric approximators, have shown remarkable success, they suffer from a critical limitation: the accumulation of errors during long-term autoregressive rollouts often leads to physically implausible artifacts. This deficiency arises from their purely parametric nature, which struggles to capture the full constraints of a system’s intrinsic dynamics. To address this, we introduce a novel Retrieval-Augmented Prediction (RAP) framework, a hybrid paradigm that synergizes the predictive power of deep networks with the grounded truth of historical data. The core philosophy of RAP is to leverage historical evolutionary exemplars as a non-parametric estimate of the system’s local dynamics. For any given state, RAP efficiently retrieves the most similar historical analog from a large-scale database. The true future evolution of this analog then serves as a reference target. Critically, this target is not a hard constraint in the loss function but rather a powerful conditional input to a specialized dual-stream architecture. It provides strong dynamic guidance, steering the model’s predictions towards physically viable trajectories. In extensive benchmarks across meteorology, turbulence, and fire simulation, RAP not only surpasses state-of-the-art methods but also significantly outperforms a strong analog-only forecasting baseline. More importantly, RAP generates predictions that are more physically realistic by effectively suppressing error divergence in long-term rollouts.
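A minimal sketch of the retrieval step: find the closest historical analog to the current state and hand its true future evolution to the predictor as conditioning. The flattened-L2 similarity and the `predictor(state, reference)` interface are illustrative assumptions, not the paper's implementation.

```python
# Sketch: nearest-analog retrieval providing a non-parametric "reference target"
# that a dual-stream predictor consumes as an extra conditional input.
import numpy as np

class AnalogRetriever:
    def __init__(self, past_states, past_futures):
        self.keys = past_states.reshape(len(past_states), -1)    # [N, D] flattened historical states
        self.futures = past_futures                               # [N, ...] their true future evolutions

    def retrieve(self, state):
        q = state.reshape(1, -1)
        idx = int(np.argmin(np.linalg.norm(self.keys - q, axis=1)))  # most similar historical analog
        return self.futures[idx]                                      # reference target for this state

def rap_step(predictor, retriever, state):
    reference = retriever.retrieve(state)            # non-parametric estimate of the local dynamics
    return predictor(state, reference)               # prediction conditioned on the retrieved reference
```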
[423] Mitigating Negative Transfer via Reducing Environmental Disagreement
Hui Sun, Zheng Xie, Hao-Yuan He, Ming Li
Main category: cs.LG
TL;DR: This paper proposes RED (Reducing Environmental Disagreement), a method to mitigate negative transfer in unsupervised domain adaptation by disentangling causal and environmental features and reducing cross-domain disagreement on non-causal features.
Details
Motivation: Significant domain shifts in UDA cause negative transfer, deteriorating model performance. The authors identify cross-domain discriminative disagreement on non-causal environmental features as a key factor causing negative transfer.Method: RED disentangles samples into domain-invariant causal features and domain-specific non-causal environmental features using adversarially trained domain-specific environmental feature extractors. It then estimates and reduces environmental disagreement based on these domain-specific non-causal features.
Result: Experimental results show that RED effectively mitigates negative transfer and achieves state-of-the-art performance in unsupervised domain adaptation.
Conclusion: The study demonstrates that addressing environmental disagreement through causal disentanglement is an effective approach to mitigate negative transfer in domain adaptation scenarios.
Abstract: Unsupervised Domain Adaptation (UDA) focuses on transferring knowledge from a labeled source domain to an unlabeled target domain, addressing the challenge of domain shift. Significant domain shifts hinder effective knowledge transfer, leading to negative transfer and deteriorating model performance. Therefore, mitigating negative transfer is essential. This study revisits negative transfer through the lens of causally disentangled learning, emphasizing cross-domain discriminative disagreement on non-causal environmental features as a critical factor. Our theoretical analysis reveals that overreliance on non-causal environmental features as the environment evolves can cause discriminative disagreements (termed environmental disagreement), thereby resulting in negative transfer. To address this, we propose Reducing Environmental Disagreement (RED), which disentangles each sample into domain-invariant causal features and domain-specific non-causal environmental features via adversarially training domain-specific environmental feature extractors in the opposite domains. Subsequently, RED estimates and reduces environmental disagreement based on domain-specific non-causal environmental features. Experimental results confirm that RED effectively mitigates negative transfer and achieves state-of-the-art performance.
[424] Low-N Protein Activity Optimization with FolDE
Jacob B. Roberts, Catherine R. Ji, Isaac Donnell, Thomas D. Young, Allison N. Pearson, Graham A. Hudson, Leah S. Keiser, Mia Wesselkamper, Peter H. Winegar, Janik Ludwig, Sarah H. Klass, Isha V. Sheth, Ezechinyere C. Ukabiala, Maria C. T. Astolfi, Benjamin Eysenbach, Jay D. Keasling
Main category: cs.LG
TL;DR: FolDE is an Active Learning-assisted Directed Evolution method that outperforms existing ALDE methods by discovering 23% more top 10% mutants and being 55% more likely to find top 1% mutants through naturalness-based warm-starting and improved batch diversity.
Details
Motivation: Traditional protein optimization is costly, and existing ALDE methods suffer from selecting homogeneous training data that limits accurate prediction in subsequent rounds.Method: FolDE uses naturalness-based warm-starting (augmenting limited activity measurements with protein language model outputs) and a constant-liar batch selector to improve batch diversity.
Result: In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants.
Conclusion: FolDE maximizes end-of-campaign success in protein optimization and is available as open-source software to make efficient protein optimization accessible to any laboratory.
Abstract: Proteins are traditionally optimized through the costly construction and measurement of many mutants. Active Learning-assisted Directed Evolution (ALDE) alleviates that cost by predicting the best improvements and iteratively testing mutants to inform predictions. However, existing ALDE methods face a critical limitation: selecting the highest-predicted mutants in each round yields homogeneous training data insufficient for accurate prediction models in subsequent rounds. Here we present FolDE, an ALDE method designed to maximize end-of-campaign success. In simulations across 20 protein targets, FolDE discovers 23% more top 10% mutants than the best baseline ALDE method (p=0.005) and is 55% more likely to find top 1% mutants. FolDE achieves this primarily through naturalness-based warm-starting, which augments limited activity measurements with protein language model outputs to improve activity prediction. We also introduce a constant-liar batch selector, which improves batch diversity; this is important in multi-mutation campaigns but had limited effect in our benchmarks. The complete workflow is freely available as open-source software, making efficient protein optimization accessible to any laboratory.
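A minimal sketch of the generic constant-liar heuristic for diverse batch selection, which the paper adapts; the Gaussian-process surrogate, the UCB acquisition, and the posterior mean as the "lie" are illustrative choices here, not necessarily FolDE's.

```python
# Sketch: pick a batch greedily; after each pick, pretend the chosen candidate was
# already measured (with a "lie" value) and refit, which pushes later picks elsewhere.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def constant_liar_batch(X_train, y_train, X_pool, batch_size=8, kappa=1.0):
    X, y = list(X_train), list(y_train)
    picked = []
    for _ in range(batch_size):
        gp = GaussianProcessRegressor().fit(np.array(X), np.array(y))
        mu, sd = gp.predict(X_pool, return_std=True)
        acq = mu + kappa * sd                          # UCB acquisition over the candidate pool
        acq[picked] = -np.inf                          # never pick the same candidate twice
        i = int(np.argmax(acq))
        picked.append(i)
        X.append(X_pool[i])                            # pretend we already measured this mutant...
        y.append(float(mu[i]))                         # ...with a "lie": its current posterior mean
    return picked
```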
[425] FALQON: Accelerating LoRA Fine-tuning with Low-Bit Floating-Point Arithmetic
Kanghyun Choi, Hyeyoon Lee, SunJong Park, Dain Kwon, Jinho Lee
Main category: cs.LG
TL;DR: FALQON eliminates quantization overhead in FP8 LoRA fine-tuning by merging adapters into the quantized backbone, achieving 3× speedup over existing methods while maintaining accuracy.
Details
Motivation: FP8 quantization provides acceleration for large matrix operations but suffers from overheads when applied to LoRA fine-tuning with small-dimensional matrices, limiting its effectiveness for efficient LLM adaptation.Method: Directly merges LoRA adapters into FP8-quantized backbone during fine-tuning, reformulates forward/backward computations to reduce quantization overhead, and introduces row-wise proxy update mechanism for efficient integration of updates.
Result: Achieves approximately 3× training speedup over existing quantized LoRA methods with similar accuracy levels, and enables end-to-end FP8 workflow eliminating post-training quantization needs.
Conclusion: FALQON provides a practical solution for efficient large-scale model fine-tuning by overcoming FP8 quantization limitations in LoRA, facilitating faster training and easier deployment.
Abstract: Low-bit floating-point (FP) formats, such as FP8, provide significant acceleration and memory savings in model training thanks to native hardware support on modern GPUs and NPUs. However, we analyze that FP8 quantization offers speedup primarily for large-dimensional matrix multiplications, while inherent quantization overheads diminish speedup when applied to low-rank adaptation (LoRA), which uses small-dimensional matrices for efficient fine-tuning of large language models (LLMs). To address this limitation, we propose FALQON, a novel framework that eliminates the quantization overhead from separate LoRA computational paths by directly merging LoRA adapters into an FP8-quantized backbone during fine-tuning. Furthermore, we reformulate the forward and backward computations for merged adapters to significantly reduce quantization overhead, and introduce a row-wise proxy update mechanism that efficiently integrates substantial updates into the quantized backbone. Experimental evaluations demonstrate that FALQON achieves approximately a 3$\times$ training speedup over existing quantized LoRA methods with a similar level of accuracy, providing a practical solution for efficient large-scale model fine-tuning. Moreover, FALQON’s end-to-end FP8 workflow removes the need for post-training quantization, facilitating efficient deployment. Code is available at https://github.com/iamkanghyunchoi/falqon.
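A minimal sketch of merging a LoRA adapter directly into a quantized backbone so the adapter no longer needs a separate computational path. The `fake_quant_fp8` rounding below is a crude stand-in for a real FP8 cast, and FALQON's reformulated forward/backward passes and row-wise proxy updates are not shown.

```python
# Sketch: fold the low-rank LoRA update into the base weight, then re-quantize the
# merged weight, so training proceeds on a single quantized matrix.
import torch

def fake_quant_fp8(w, max_val=448.0):
    # Crude stand-in: scale, round, and rescale. A real implementation would cast
    # to an actual FP8 format (e.g., E4M3) supported by the hardware.
    scale = w.abs().max().clamp(min=1e-8) / max_val
    return (w / scale).round().clamp(-max_val, max_val) * scale

def merge_lora_into_backbone(W_q, A, B, alpha, r):
    # W_q: [out, in] quantized base weight; A: [r, in]; B: [out, r]
    delta = (alpha / r) * (B @ A)                      # standard LoRA update
    return fake_quant_fp8(W_q + delta)                 # re-quantized merged weight
```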
[426] Information-Theoretic Discrete Diffusion
Moongyu Jeon, Sangwoo Shin, Dongjae Jeon, Albert No
Main category: cs.LG
TL;DR: The paper presents an information-theoretic framework for discrete diffusion models that provides principled estimators of log-likelihood using score-matching losses, establishing connections between mutual information and denoising losses.
Details
Motivation: To develop a principled theoretical foundation for discrete diffusion models that connects mutual information to commonly used score-matching losses, showing they are not just variational bounds but tight estimators of log-likelihood.Method: Derived Information-Minimum Denoising Score Entropy (I-MDSE) relation for discrete diffusion and Information-Minimum Denoising Cross-Entropy (I-MDCE) relation for masked diffusion, providing time-integral decomposition of log-likelihood in terms of optimal score-based losses.
Result: The framework enables practical extensions including a time-free formula, conditional likelihood estimation, and coupled Monte Carlo estimation of likelihood ratios. Experiments on synthetic and real-world data confirm accuracy, variance stability, and utility of the estimators.
Conclusion: The proposed information-theoretic framework provides tight and principled estimators of log-likelihood for discrete diffusion models, with practical extensions that work effectively on both synthetic and real-world datasets.
Abstract: We present an information-theoretic framework for discrete diffusion models that yields principled estimators of log-likelihood using score-matching losses. Inspired by the I-MMSE identity for the Gaussian setup, we derive analogous results for the discrete setting. Specifically, we introduce the Information-Minimum Denoising Score Entropy (I-MDSE) relation, which links mutual information between data and its diffused version to the minimum denoising score entropy (DSE) loss. We extend this theory to masked diffusion and establish the Information-Minimum Denoising Cross-Entropy (I-MDCE) relation, connecting cross-entropy losses to mutual information in discrete masked processes. These results provide a time-integral decomposition of the log-likelihood of the data in terms of optimal score-based losses, showing that commonly used losses such as DSE and DCE are not merely variational bounds but tight and principled estimators of log-likelihood. The I-MDCE decomposition further enables practical extensions, including a time-free formula, conditional likelihood estimation in prompt-response tasks, and coupled Monte Carlo estimation of likelihood ratios. Experiments on synthetic and real-world data confirm the accuracy, variance stability, and utility of our estimators. The code is publicly available at https://github.com/Dongjae0324/infodis.
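For context, the Gaussian I-MMSE identity that motivates these relations (Guo, Shamai, and Verdú, 2005) states, for an observation with standard Gaussian noise N independent of X:

```latex
\frac{d}{d\,\mathrm{snr}}\, I\!\left(X;\ \sqrt{\mathrm{snr}}\,X + N\right)
\;=\; \tfrac{1}{2}\,\mathrm{mmse}(\mathrm{snr}),
\qquad
\mathrm{mmse}(\mathrm{snr}) \;=\; \mathbb{E}\,\bigl\|X - \mathbb{E}\bigl[X \mid \sqrt{\mathrm{snr}}\,X + N\bigr]\bigr\|^{2}.
```

The I-MDSE and I-MDCE relations play the analogous role in the discrete setting, with the MMSE replaced by the minimum denoising score entropy and denoising cross-entropy losses, respectively.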
[427] Learning Parameterized Skills from Demonstrations
Vedant Gupta, Haotian Fu, Calvin Luo, Yiding Jiang, George Konidaris
Main category: cs.LG
TL;DR: DEPS is an end-to-end algorithm that discovers parameterized skills from expert demonstrations by learning skill policies jointly with a meta-policy, using temporal variational inference and information-theoretic regularization to ensure temporally extended and meaningful skills.
Details
Motivation: To improve generalization to unseen tasks by learning parameterized skills from multitask expert demonstrations, addressing the degeneracy challenge in latent variable models.Method: Combines temporal variational inference with information-theoretic regularization to learn parameterized skill policies and a meta-policy that selects discrete skills and continuous parameters at each timestep.
Result: Outperforms multitask and skill learning baselines on LIBERO and MetaWorld benchmarks, and discovers interpretable parameterized skills like object grasping with continuous grasp location parameters.
Conclusion: Learning parameterized skills from multitask demonstrations significantly improves generalization and produces interpretable, adaptable skills.
Abstract: We present DEPS, an end-to-end algorithm for discovering parameterized skills from expert demonstrations. Our method learns parameterized skill policies jointly with a meta-policy that selects the appropriate discrete skill and continuous parameters at each timestep. Using a combination of temporal variational inference and information-theoretic regularization methods, we address the challenge of degeneracy common in latent variable models, ensuring that the learned skills are temporally extended, semantically meaningful, and adaptable. We empirically show that learning parameterized skills from multitask expert demonstrations significantly improves generalization to unseen tasks. Our method outperforms multitask as well as skill learning baselines on both LIBERO and MetaWorld benchmarks. We also demonstrate that DEPS discovers interpretable parameterized skills, such as an object grasping skill whose continuous arguments define the grasp location.
[428] Graph-Guided Concept Selection for Efficient Retrieval-Augmented Generation
Ziyu Liu, Yijing Liu, Jianfei Yuan, Minzhi Yan, Le Yue, Honghui Xiong, Yi Yang
Main category: cs.LG
TL;DR: Graph-based RAG uses knowledge graphs for better retrieval in LLM QA, but is expensive. G2ConS reduces costs by selecting important chunks and using a concept graph, outperforming baselines.
Details
Motivation: Graph-based RAG methods are effective for multi-hop reasoning in domains like biomedicine and law, but require many expensive LLM calls for entity/relation extraction, making them costly at scale.Method: Proposes G2ConS with two components: chunk selection to reduce KG construction costs by picking salient documents, and an LLM-independent concept graph to fill knowledge gaps from chunk selection at zero cost.
Result: Evaluations on multiple real-world datasets show G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
Conclusion: G2ConS successfully addresses the cost issues of graph-based RAG while maintaining or improving performance through intelligent chunk selection and concept graph integration.
Abstract: Graph-based RAG constructs a knowledge graph (KG) from text chunks to enhance retrieval in Large Language Model (LLM)-based question answering. It is especially beneficial in domains such as biomedicine, law, and political science, where effective retrieval often involves multi-hop reasoning over proprietary documents. However, these methods demand numerous LLM calls to extract entities and relations from text chunks, incurring prohibitive costs at scale. Through a carefully designed ablation study, we observe that certain words (termed concepts) and their associated documents are more important. Based on this insight, we propose Graph-Guided Concept Selection (G2ConS). Its core comprises a chunk selection method and an LLM-independent concept graph. The former selects salient document chunks to reduce KG construction costs; the latter closes knowledge gaps introduced by chunk selection at zero cost. Evaluations on multiple real-world datasets show that G2ConS outperforms all baselines in construction cost, retrieval effectiveness, and answering quality.
[429] Causal Convolutional Neural Networks as Finite Impulse Response Filters
Kiran Bacsa, Wei Liu, Xudong Jian, Huangbin Liang, Eleni Chatzi
Main category: cs.LG
TL;DR: Causal CNNs with quasi-linear activations behave like FIR filters when trained on multimodal frequency time-series data, offering enhanced interpretability and equivalent single-layer filter representations.
Details
Motivation: To understand how causal CNNs process time-series data with multimodal frequency content and establish connections with traditional signal processing methods like FIR filters for better interpretability in dynamic systems.Method: Using causal CNNs with extended-length convolutional kernels and quasi-linear activation functions, then leveraging the associative property of convolution to reduce the entire network to an equivalent single-layer FIR filter optimized via least-squares criteria.
Result: Causal CNNs capture spectral features both implicitly and explicitly, and the entire network can be represented as an equivalent FIR filter, providing new insights into spectral learning behavior for signals with sparse frequency content.
Conclusion: Causal CNNs with quasi-linear activations exhibit FIR filter-like properties when applied to multimodal frequency time-series, offering improved interpretability and relevance for modeling physical systems with dynamic responses, as validated on beam dynamics and bridge vibration datasets.
Abstract: This study investigates the behavior of Causal Convolutional Neural Networks (CNNs) with quasi-linear activation functions when applied to time-series data characterized by multimodal frequency content. We demonstrate that, once trained, such networks exhibit properties analogous to Finite Impulse Response (FIR) filters, particularly when the convolutional kernels are of extended length exceeding those typically employed in standard CNN architectures. Causal CNNs are shown to capture spectral features both implicitly and explicitly, offering enhanced interpretability for tasks involving dynamic systems. Leveraging the associative property of convolution, we further show that the entire network can be reduced to an equivalent single-layer filter resembling an FIR filter optimized via least-squares criteria. This equivalence yields new insights into the spectral learning behavior of CNNs trained on signals with sparse frequency content. The approach is validated on both simulated beam dynamics and real-world bridge vibration datasets, underlining its relevance for modeling and identifying physical systems governed by dynamic responses.
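The associativity argument is easy to verify numerically: with (quasi-)linear activations, stacked causal convolutions collapse into a single FIR filter whose taps are the convolution of the individual kernels. The toy kernels below are arbitrary, for illustration only.

```python
# Sketch: two stacked (linear) convolutions equal a single convolution with the
# composed kernel, i.e., an equivalent single-layer FIR filter.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)                         # input time series
h1 = rng.standard_normal(31)                          # kernel of layer 1
h2 = rng.standard_normal(31)                          # kernel of layer 2

y_stacked = np.convolve(np.convolve(x, h1), h2)       # layer-by-layer application
h_eq = np.convolve(h1, h2)                            # equivalent single FIR filter
y_single = np.convolve(x, h_eq)

assert np.allclose(y_stacked, y_single)               # identical outputs, one filter
```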
[430] What do vision-language models see in the context? Investigating multimodal in-context learning
Gabriel O. dos Santos, Esther Colombini, Sandra Avila
Main category: cs.LG
TL;DR: Systematic study of in-context learning (ICL) in Vision-Language Models (VLMs) reveals they primarily focus on textual cues, fail to leverage visual information effectively, and show trade-offs between instruction alignment and in-context adaptation.
Details
Motivation: ICL has been extensively studied in LLMs but remains underexplored in VLMs, prompting investigation into how prompt design, architecture, and training strategies affect multimodal ICL.Method: Evaluated seven VLMs spanning four architectures on three image captioning benchmarks, analyzed attention patterns with increasing demonstrations, and studied effects of training on image-text interleaved data and instruction tuning.
Result: Training on image-text interleaved data enhances ICL performance but doesn’t ensure effective multimodal integration. Instruction tuning improves instruction-following but reduces reliance on in-context demonstrations. VLMs focus mainly on textual cues and fail to leverage visual information.
Conclusion: Current VLMs have limited capacity for multimodal integration in ICL, highlighting key limitations and providing insights for enhancing their ability to learn from multimodal in-context examples.
Abstract: In-context learning (ICL) enables Large Language Models (LLMs) to learn tasks from demonstration examples without parameter updates. Although it has been extensively studied in LLMs, its effectiveness in Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic study of ICL in VLMs, evaluating seven models spanning four architectures on three image captioning benchmarks. We analyze how prompt design, architectural choices, and training strategies influence multimodal ICL. To our knowledge, we are the first to analyze how attention patterns in VLMs vary with an increasing number of in-context demonstrations. Our results reveal that training on image-text interleaved data enhances ICL performance but does not imply effective integration of visual and textual information from demonstration examples. In contrast, instruction tuning improves instruction-following but can reduce reliance on in-context demonstrations, suggesting a trade-off between instruction alignment and in-context adaptation. Attention analyses further show that current VLMs primarily focus on textual cues and fail to leverage visual information, suggesting a limited capacity for multimodal integration. These findings highlight key limitations in the ICL abilities of current VLMs and provide insights for enhancing their ability to learn from multimodal in-context examples.
[431] Fixed Point Neural Acceleration and Inverse Surrogate Model for Battery Parameter Identification
Hojin Cheon, Hyeongseok Seo, Jihun Jeon, Wooju Lee, Dohyun Jeong, Hongseok Kim
Main category: cs.LG
TL;DR: A deep learning framework for fast and accurate parameter identification of lithium-ion battery models, achieving 2000x speedup and 10x higher accuracy compared to conventional methods.
Details
Motivation: Address limitations of conventional metaheuristic approaches (high computational cost, slow convergence) and machine learning methods (reliance on constant current data not available in practice) for battery health assessment.Method: Combines neural surrogate model (NeuralSPMe) trained on realistic EV load profiles with parameter update network (PUNet) using deep learning-based fixed-point iteration for parameter identification.
Result: Achieves 2000x acceleration in parameter identification, superior sample efficiency, and over 10x higher accuracy compared to conventional metaheuristic algorithms, especially under dynamic load conditions.
Conclusion: The proposed framework enables fast and accurate battery parameter identification for practical EV applications, overcoming limitations of existing methods.
Abstract: The rapid expansion of electric vehicles has intensified the need for accurate and efficient diagnosis of lithium-ion batteries. Parameter identification of electrochemical battery models is widely recognized as a powerful method for battery health assessment. However, conventional metaheuristic approaches suffer from high computational cost and slow convergence, and recent machine learning methods are limited by their reliance on constant current data, which may not be available in practice. To overcome these challenges, we propose a deep learning-based framework for parameter identification of electrochemical battery models. The proposed framework combines a neural surrogate model of the single particle model with electrolyte (NeuralSPMe) and a deep learning-based fixed-point iteration method. NeuralSPMe is trained on realistic EV load profiles to accurately predict lithium concentration dynamics under dynamic operating conditions, while a parameter update network (PUNet) performs fixed-point iterative updates to significantly reduce both the evaluation time per sample and the overall number of iterations required for convergence. Experimental evaluations demonstrate that the proposed framework accelerates the parameter identification by more than 2000 times, achieves superior sample efficiency, and attains more than 10 times higher accuracy compared to conventional metaheuristic algorithms, particularly under dynamic load scenarios encountered in practical applications.
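A generic sketch of the learned fixed-point iteration for parameter identification; the `surrogate` and `punet` interfaces and the stopping rule are assumptions for illustration, not the paper's exact design.

```python
# Sketch: simulate the response for the current parameter guess with a surrogate,
# let a learned update network map the mismatch to a new guess, and iterate to a fixed point.
import numpy as np

def identify_parameters(theta0, observed, surrogate, punet, tol=1e-4, max_iter=50):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        simulated = surrogate(theta)                       # NeuralSPMe-style forward surrogate
        theta_next = punet(theta, observed, simulated)     # learned fixed-point update
        if np.linalg.norm(theta_next - theta) < tol:       # stop once the iteration has converged
            return theta_next
        theta = theta_next
    return theta
```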
[432] Identifiable learning of dissipative dynamics
Aiqing Zhu, Beatrice W. Soh, Grigorios A. Pavliotis, Qianxiao Li
Main category: cs.LG
TL;DR: I-OnsagerNet is a neural framework that learns dissipative stochastic dynamics from trajectories while ensuring interpretability and uniqueness by extending the Onsager principle and Helmholtz decomposition.
Details
Motivation: Complex dissipative systems operate far from equilibrium where energy dissipation and time irreversibility are key but difficult to quantify from data. Current models struggle to balance expressiveness with physical meaningfulness and mathematical identifiability.Method: Extends Onsager principle to guarantee learned potential from stationary density and clean drift decomposition into time-reversible and irreversible components via Helmholtz decomposition. Uses neural framework to learn dynamics from trajectories.
Result: Enables calculation of entropy production and quantification of irreversibility. Applications reveal super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with strain rate, and suppression of irreversibility with increasing batch size.
Conclusion: I-OnsagerNet establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics with both interpretability and uniqueness guarantees.
Abstract: Complex dissipative systems appear across science and engineering, from polymers and active matter to learning algorithms. These systems operate far from equilibrium, where energy dissipation and time irreversibility are key to their behavior, but are difficult to quantify from data. Learning accurate and interpretable models of such dynamics remains a major challenge: the models must be expressive enough to describe diverse processes, yet constrained enough to remain physically meaningful and mathematically identifiable. Here, we introduce I-OnsagerNet, a neural framework that learns dissipative stochastic dynamics directly from trajectories while ensuring both interpretability and uniqueness. I-OnsagerNet extends the Onsager principle to guarantee that the learned potential is obtained from the stationary density and that the drift decomposes cleanly into time-reversible and time-irreversible components, as dictated by the Helmholtz decomposition. Our approach enables us to calculate the entropy production and to quantify irreversibility, offering a principled way to detect and quantify deviations from equilibrium. Applications to polymer stretching in elongational flow and to stochastic gradient Langevin dynamics reveal new insights, including super-linear scaling of barrier heights and sub-linear scaling of entropy production rates with the strain rate, and the suppression of irreversibility with increasing batch size. I-OnsagerNet thus establishes a general, data-driven framework for discovering and interpreting non-equilibrium dynamics.
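Schematically, an OnsagerNet-style parameterization writes the learned SDE with a Helmholtz-type split of the drift (notation ours; the paper's exact parameterization and conditions may differ):

```latex
dX_t \;=\; -\bigl[\,M(X_t) + W(X_t)\,\bigr]\,\nabla V(X_t)\,dt \;+\; \sigma(X_t)\,dB_t,
\qquad M = M^{\top} \succeq 0, \quad W = -W^{\top},
```

where the gradient-flow term built from the symmetric, positive semi-definite M plays the role of the time-reversible component, and the antisymmetric circulation built from W is the irreversible component, which (roughly) carries the entropy production.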
[433] EddyFormer: Accelerated Neural Simulations of Three-Dimensional Turbulence at Scale
Yiheng Du, Aditi S. Krishnapriyan
Main category: cs.LG
TL;DR: EddyFormer is a Transformer-based spectral-element architecture for large-scale turbulence simulation that achieves DNS-level accuracy with 30x speedup and shows strong domain generalization.
Details
Motivation: Fully resolving large-scale turbulence through direct numerical simulation (DNS) is computationally prohibitive, motivating data-driven machine learning alternatives.Method: Transformer-based spectral-element (SEM) architecture with SEM tokenization that decomposes flow into grid-scale and subgrid-scale components, enabling capture of both local and global features.
Result: Achieves DNS-level accuracy at 256^3 resolution with 30x speedup over DNS, preserves accuracy on physics-invariant metrics when applied to unseen domains up to 4x larger, and resolves cases where prior ML models fail to converge.
Conclusion: EddyFormer provides an accurate and efficient alternative to DNS for large-scale turbulence simulation with strong domain generalization capabilities.
Abstract: Computationally resolving turbulence remains a central challenge in fluid dynamics due to its multi-scale interactions. Fully resolving large-scale turbulence through direct numerical simulation (DNS) is computationally prohibitive, motivating data-driven machine learning alternatives. In this work, we propose EddyFormer, a Transformer-based spectral-element (SEM) architecture for large-scale turbulence simulation that combines the accuracy of spectral methods with the scalability of the attention mechanism. We introduce an SEM tokenization that decomposes the flow into grid-scale and subgrid-scale components, enabling capture of both local and global features. We create a new three-dimensional isotropic turbulence dataset and train EddyFormer to achieve DNS-level accuracy at 256^3 resolution, providing a 30x speedup over DNS. When applied to unseen domains up to 4x larger than in training, EddyFormer preserves accuracy on physics-invariant metrics (energy spectra, correlation functions, and structure functions), showing domain generalization. On The Well benchmark suite of diverse turbulent flows, EddyFormer resolves cases where prior ML models fail to converge, accurately reproducing complex dynamics across a wide range of physical conditions.
[434] Closing Gaps: An Imputation Analysis of ICU Vital Signs
Alisher Turubayev, Anna Shopova, Fabian Lange, Mahmut Kamalak, Paul Mattes, Victoria Ayvasky, Bert Arnrich, Bjarne Pfitzner, Robin P. van de Water
Main category: cs.LG
TL;DR: This paper compares various time-series imputation techniques for ICU vital sign data to improve clinical prediction models by addressing missing data issues.
Details
Motivation: The lack of data quality in ICU datasets, particularly missing vital sign measurements, hinders clinical prediction using machine learning, and current practices often use suboptimal imputation methods that decrease prediction accuracy.Method: The authors introduce an extensible benchmark with 15 imputation and 4 amputation methods for evaluating performance on major ICU datasets.
Result: The study provides a comparative analysis of established imputation techniques to determine best practices for handling missing ICU data.
Conclusion: This work aims to guide researchers in selecting optimal imputation methods and facilitate further ML development to bring more models into clinical practice.
Abstract: As more Intensive Care Unit (ICU) data becomes available, the interest in developing clinical prediction models to improve healthcare protocols increases. However, poor data quality still hinders clinical prediction using Machine Learning (ML). Many vital sign measurements, such as heart rate, contain sizeable missing segments, leaving gaps in the data that could negatively impact prediction performance. Previous works have introduced numerous time-series imputation techniques. Nevertheless, more comprehensive work is needed to compare a representative set of methods for imputing ICU vital signs and determine the best practice. In reality, ad-hoc imputation techniques that could decrease prediction accuracy, like zero imputation, are still used. In this work, we compare established imputation techniques to guide researchers in improving the performance of clinical prediction models by selecting the most accurate imputation technique. We introduce an extensible and reusable benchmark, currently comprising 15 imputation and 4 amputation methods, for benchmarking on major ICU datasets. We hope to provide a comparative basis and facilitate further ML development to bring more models into clinical practice.
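A minimal sketch of the amputation-then-imputation evaluation loop: mask known values at random (MCAR amputation), impute them back with a few standard imputers, and score the reconstruction by MAE on the masked entries. The methods and masking schemes in the benchmark itself are far richer; the imputer set below is illustrative.

```python
# Sketch: compare simple imputers on artificially amputated data using MAE on the masked cells.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

def evaluate_imputers(X, missing_rate=0.2, seed=0):
    rng = np.random.default_rng(seed)
    mask = rng.random(X.shape) < missing_rate             # MCAR amputation mask
    X_amp = X.copy()
    X_amp[mask] = np.nan
    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "zero": SimpleImputer(strategy="constant", fill_value=0.0),
        "knn": KNNImputer(n_neighbors=5),
        "iterative": IterativeImputer(max_iter=10, random_state=seed),
    }
    return {name: float(np.mean(np.abs(imp.fit_transform(X_amp)[mask] - X[mask])))
            for name, imp in imputers.items()}
```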
[435] V-SAT: Video Subtitle Annotation Tool
Arpita Kundu, Joyita Chakraborty, Anindita Desarkar, Aritra Sen, Srushti Anil Patil, Vishwanathan Raman
Main category: cs.LG
TL;DR: V-SAT is a unified framework that automatically detects and corrects subtitle quality issues using LLMs, VLMs, image processing, and ASR to leverage audio-visual context.
Details
Motivation: Existing subtitle generation methods suffer from poor synchronization, incorrect text, inconsistent formatting, inappropriate reading speeds, and inability to adapt to dynamic contexts, making post-editing labor-intensive.Method: Combines Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR) to leverage contextual cues from both audio and video with human-in-the-loop validation.
Result: Subtitle quality improved significantly with SUBER score reduced from 9.6 to 3.54 after resolving language mode issues and F1-scores of ~0.80 for image mode issues.
Conclusion: V-SAT provides the first comprehensive solution for robust subtitle annotation by automatically detecting and correcting a wide range of subtitle quality issues.
Abstract: The surge of audiovisual content on streaming platforms and social media has heightened the demand for accurate and accessible subtitles. However, existing subtitle generation methods, based primarily on speech transcription or OCR extraction, suffer from several shortcomings, including poor synchronization, incorrect or harmful text, inconsistent formatting, inappropriate reading speeds, and the inability to adapt to dynamic audio-visual contexts. Current approaches often address isolated issues, leaving post-editing as a labor-intensive and time-consuming process. In this paper, we introduce V-SAT (Video Subtitle Annotation Tool), a unified framework that automatically detects and corrects a wide range of subtitle quality issues. By combining Large Language Models (LLMs), Vision-Language Models (VLMs), Image Processing, and Automatic Speech Recognition (ASR), V-SAT leverages contextual cues from both audio and video. Subtitle quality improved, with the SUBER score reduced from 9.6 to 3.54 after resolving all language-mode issues, and with F1-scores of ~0.80 for image-mode issues. Human-in-the-loop validation ensures high-quality results, providing the first comprehensive solution for robust subtitle annotation.
[436] PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Ai Jian, Jingqing Ruan, Xing Ma, Dailin Li, QianLin Zhou, Ke Zeng, Xunliang Cai
Main category: cs.LG
TL;DR: PaTaRM is a unified reward modeling framework that integrates preference-aware rewards with dynamic rubric adaptation, eliminating the need for explicit point-wise labels while improving RLHF performance.
Details
Motivation: Current reward models have limitations: pair-wise methods cause mismatches for point-wise inference and require complex pairing strategies, while point-wise methods need elaborate absolute labeling with poor adaptability and high costs.Method: PaTaRM combines preference-aware reward mechanism (using relative preference from pairwise data for point-wise training) with task-adaptive rubric system that generates evaluation criteria for both global task consistency and instance-specific reasoning.
Result: Achieves 4.7% average improvement on RewardBench and RMBench across Qwen3 models, and boosts downstream RLHF performance by 13.6% on IFEval and InFoBench benchmarks.
Conclusion: PaTaRM provides an efficient, generalizable, and interpretable reward modeling framework that effectively addresses limitations of current training paradigms for RLHF.
Abstract: Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
[437] SPEAR++: Scaling Gradient Inversion via Sparsely-Used Dictionary Learning
Alexander Bakarsky, Dimitar I. Dimitrov, Maximilian Baader, Martin Vechev
Main category: cs.LG
TL;DR: SPEAR++ improves the SPEAR gradient inversion attack by making it tractable for larger batch sizes using dictionary learning techniques, while maintaining robustness to DP noise and FedAvg aggregation.
Details
Motivation: Federated Learning's privacy claims are challenged by gradient inversion attacks, but existing attacks like SPEAR have exponential runtime limitations that make them impractical for real-world deployments.Method: Applied State-of-the-Art techniques from Sparsely-Used Dictionary Learning to solve the gradient inversion problem on linear layers with ReLU activations, making it computationally tractable.
Result: SPEAR++ achieves 10x larger batch sizes compared to SPEAR while retaining all desirable properties including robustness to DP noise and FedAvg aggregation.
Conclusion: The new attack SPEAR++ successfully overcomes the computational limitations of SPEAR, making gradient inversion attacks more practical and scalable for real-world Federated Learning systems.
Abstract: Federated Learning has seen an increased deployment in real-world scenarios recently, as it enables the distributed training of machine learning models without explicit data sharing between individual clients. Yet, the introduction of the so-called gradient inversion attacks has fundamentally challenged its privacy-preserving properties. Unfortunately, as these attacks mostly rely on direct data optimization without any formal guarantees, the vulnerability of real-world systems remains in dispute and requires tedious testing for each new federated deployment. To overcome these issues, recently the SPEAR attack was introduced, which is based on a theoretical analysis of the gradients of linear layers with ReLU activations. While SPEAR is an important theoretical breakthrough, the attack’s practicality was severely limited by its exponential runtime in the batch size b. In this work, we fill this gap by applying State-of-the-Art techniques from Sparsely-Used Dictionary Learning to make the problem of gradient inversion on linear layers with ReLU activations tractable. Our experiments demonstrate that our new attack, SPEAR++, retains all desirable properties of SPEAR, such as robustness to DP noise and FedAvg aggregation, while being applicable to 10x bigger batch sizes.
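The structure the attack exploits can be sketched as follows (a schematic reading of the abstract, not the paper's exact derivation). For a linear layer $z_i = W x_i$ followed by a ReLU, the weight gradient over a batch of size $b$ is

$\nabla_W \mathcal{L} = \sum_{i=1}^{b} \delta_i x_i^{\top} = G X^{\top}, \qquad \delta_i = \frac{\partial \mathcal{L}}{\partial a_i} \odot \mathbb{1}[W x_i > 0],$

where $X = [x_1, \dots, x_b]$ stacks the private inputs and $G$ is sparse because the ReLU zeroes entries of each $\delta_i$. Recovering $X$ from the observed product $G X^{\top}$ can therefore be posed as a sparsely-used dictionary learning problem, which is what makes the inversion tractable at larger batch sizes.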
[438] Unlocking Out-of-Distribution Generalization in Dynamics through Physics-Guided Augmentation
Fan Xu, Hao Wu, Kun Wang, Nan Wang, Qingsong Wen, Xian Wu, Wei Gong, Xibin Zhao
Main category: cs.LG
TL;DR: SPARK is a physics-guided augmentation plugin that integrates physical parameters into a structured state dictionary, enabling creation of physically-plausible training samples through latent space interpolation, combined with a Fourier-enhanced Graph ODE for robust downstream prediction.
Details
Motivation: Traditional numerical methods have high computational costs, while modern data-driven approaches struggle with data scarcity and distribution shifts, creating fundamental limitations in dynamical system modeling.Method: Uses a reconstruction autoencoder to integrate physical parameters into a physics-rich discrete state dictionary, then performs principled interpolation in the latent space to create new training samples. Combines with Fourier-enhanced Graph ODE for downstream prediction.
Result: Extensive experiments show SPARK significantly outperforms state-of-the-art baselines, especially in challenging out-of-distribution scenarios and data-scarce regimes.
Conclusion: The physics-guided augmentation paradigm proves effective for addressing fundamental limitations in dynamical system modeling.
Abstract: In dynamical system modeling, traditional numerical methods are limited by high computational costs, while modern data-driven approaches struggle with data scarcity and distribution shifts. To address these fundamental limitations, we first propose SPARK, a physics-guided quantitative augmentation plugin. Specifically, SPARK utilizes a reconstruction autoencoder to integrate physical parameters into a physics-rich discrete state dictionary. This state dictionary then acts as a structured dictionary of physical states, enabling the creation of new, physically-plausible training samples via principled interpolation in the latent space. Further, for downstream prediction, these augmented representations are seamlessly integrated with a Fourier-enhanced Graph ODE, a combination designed to robustly model the enriched data distribution while capturing long-term temporal dependencies. Extensive experiments on diverse benchmarks demonstrate that SPARK significantly outperforms state-of-the-art baselines, particularly in challenging out-of-distribution scenarios and data-scarce regimes, proving the efficacy of our physics-guided augmentation paradigm.
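As a rough illustration of the latent-space augmentation step (the encoder and decoder below are random stand-ins, not SPARK's physics-rich state dictionary or Fourier-enhanced Graph ODE), new samples can be created by interpolating two encoded states and decoding the mixtures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in encoder/decoder: in SPARK these come from a reconstruction
# autoencoder whose latent space is tied to physical parameters; here we
# use fixed random linear maps purely to illustrate the augmentation step.
D_IN, D_LAT = 32, 8
W_enc = rng.normal(size=(D_LAT, D_IN)) / np.sqrt(D_IN)
W_dec = rng.normal(size=(D_IN, D_LAT)) / np.sqrt(D_LAT)

def encode(x):
    return W_enc @ x

def decode(z):
    return W_dec @ z

def augment(x_a, x_b, n_new=4):
    """Create new samples by interpolating two system states in latent space."""
    z_a, z_b = encode(x_a), encode(x_b)
    alphas = rng.uniform(0.0, 1.0, size=n_new)
    return np.stack([decode((1 - a) * z_a + a * z_b) for a in alphas])

# Two observed system states (e.g. snapshots of a dynamical system).
x_a = rng.normal(size=D_IN)
x_b = rng.normal(size=D_IN)
new_samples = augment(x_a, x_b)
print(new_samples.shape)  # (4, 32)
```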
[439] PRIVET: Privacy Metric Based on Extreme Value Theory
Antoine Szatkownik, Aurélien Decelle, Beatriz Seoane, Nicolas Bereux, Léo Planche, Guillaume Charpiat, Burak Yelmen, Flora Jay, Cyril Furtlehner
Main category: cs.LG
TL;DR: PRIVET is a sample-based privacy assessment method that detects individual privacy leaks in synthetic data using extreme value statistics on nearest-neighbor distances.
Details
Motivation: Existing privacy assessment methods for synthetic data rely on global criteria and lack interpretable sample-level insights, hindering real-world deployment of privacy-preserving synthetic data.Method: Uses extreme value statistics on nearest-neighbor distances to assign individual privacy leak scores to each synthetic sample, working across diverse data modalities including high-dimensional and limited-sample settings.
Result: PRIVET reliably detects memorization and privacy leakage across various data types, outperforms existing approaches, and provides both dataset-level and sample-level assessments with qualitative and quantitative outputs.
Conclusion: PRIVET offers a practical solution for privacy assessment in synthetic data generation, addressing limitations of existing methods and revealing issues with current computer vision embeddings for near-duplicate detection.
Abstract: Deep generative models are often trained on sensitive data, such as genetic sequences, health data, or more broadly, any copyrighted, licensed or protected content. This raises critical concerns around privacy-preserving synthetic data, and more specifically around privacy leakage, an issue closely tied to overfitting. Existing methods almost exclusively rely on global criteria to estimate the risk of privacy failure associated to a model, offering only quantitative non interpretable insights. The absence of rigorous evaluation methods for data privacy at the sample-level may hinder the practical deployment of synthetic data in real-world applications. Using extreme value statistics on nearest-neighbor distances, we propose PRIVET, a generic sample-based, modality-agnostic algorithm that assigns an individual privacy leak score to each synthetic sample. We empirically demonstrate that PRIVET reliably detects instances of memorization and privacy leakage across diverse data modalities, including settings with very high dimensionality, limited sample sizes such as genetic data and even under underfitting regimes. We compare our method to existing approaches under controlled settings and show its advantage in providing both dataset level and sample level assessments through qualitative and quantitative outputs. Additionally, our analysis reveals limitations in existing computer vision embeddings to yield perceptually meaningful distances when identifying near-duplicate samples.
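One plausible reading of the mechanism (illustrative only; the exact estimator, tail model, and thresholds are not specified in the abstract) is to fit an extreme-value tail model to nearest-neighbour distances from an independent reference set, then score each synthetic sample by how extreme its own nearest-neighbour distance to the training data is:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# Toy data: training set, an independent reference set from the same
# distribution, and synthetic samples (one deliberately memorised).
train = rng.normal(size=(500, 10))
reference = rng.normal(size=(500, 10))
synthetic = rng.normal(size=(50, 10))
synthetic[0] = train[0] + 1e-3  # near-duplicate of a training record

def nn_dist(a, b):
    """Distance from each row of a to its nearest neighbour in b."""
    return cdist(a, b).min(axis=1)

# Fit a generalized Pareto tail model to the *small* reference distances
# (the lower tail is where memorisation shows up); negating exceedances
# turns the smallest distances into an upper tail.
ref_d = nn_dist(reference, train)
threshold = np.quantile(ref_d, 0.10)
excess = threshold - ref_d[ref_d < threshold]      # exceedances below threshold
shape, loc, scale = genpareto.fit(excess, floc=0.0)

syn_d = nn_dist(synthetic, train)
# Leak score: tail probability of a distance this small or smaller under the
# reference model; higher means more suspicious.
scores = np.where(
    syn_d < threshold,
    genpareto.cdf(threshold - syn_d, shape, loc=loc, scale=scale),
    0.0,
)
print(scores[:3])  # the memorised sample should receive a high score
```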
[440] Sparse Optimistic Information Directed Sampling
Ludovic Schwartz, Hamish Flynn, Gergely Neu
Main category: cs.LG
TL;DR: SOIDS achieves optimal worst-case regret in both data-rich and data-poor regimes for stochastic sparse linear bandits, extending IDS guarantees without Bayesian assumptions.
Details
Motivation: Existing algorithms only achieve optimal regret in either data-rich or data-poor regimes, but not both simultaneously. There's a need for adaptive algorithms that perform optimally across different data availability scenarios.Method: Sparse Optimistic Information Directed Sampling (SOIDS) with a novel analysis enabling time-dependent learning rate to balance information and regret optimally.
Result: SOIDS provides the first algorithm that simultaneously achieves optimal worst-case regret in both data-rich and data-poor regimes, with empirical demonstration of good performance.
Conclusion: SOIDS successfully extends IDS theoretical guarantees to worst-case settings without Bayesian assumptions, achieving adaptive optimal performance across different data regimes.
Abstract: Many high-dimensional online decision-making problems can be modeled as stochastic sparse linear bandits. Most existing algorithms are designed to achieve optimal worst-case regret in either the data-rich regime, where polynomial dependence on the ambient dimension is unavoidable, or the data-poor regime, where dimension-independence is possible at the cost of worse dependence on the number of rounds. In contrast, the sparse Information Directed Sampling (IDS) algorithm satisfies a Bayesian regret bound that has the optimal rate in both regimes simultaneously. In this work, we explore the use of Sparse Optimistic Information Directed Sampling (SOIDS) to achieve the same adaptivity in the worst-case setting, without Bayesian assumptions. Through a novel analysis that enables the use of a time-dependent learning rate, we show that SOIDS can optimally balance information and regret. Our results extend the theoretical guarantees of IDS, providing the first algorithm that simultaneously achieves optimal worst-case regret in both the data-rich and data-poor regimes. We empirically demonstrate the good performance of SOIDS.
[441] Temporal Knowledge Graph Hyperedge Forecasting: Exploring Entity-to-Category Link Prediction
Edward Markai, Sina Molavipour
Main category: cs.LG
TL;DR: Extends TLogic rule-based framework for temporal knowledge graphs by incorporating entity categories to improve accuracy and explainability, with LLM-based category generation when categories are unknown.
Details
Motivation: Most temporal knowledge graph methods are embedding-based black boxes that lack interpretability. This work aims to provide explainable predictions while maintaining high accuracy.Method: Extends TLogic rule-based framework by incorporating entity categories to limit rule application, proposes LLM-based approach for generating categories when unknown, and investigates aggregation methods for entity scoring.
Result: The extended framework yields high accuracy with explainable predictions, offering transparency for end-user evaluation of applied rules.
Conclusion: Rule-based approaches with entity categories provide an effective alternative to black-box methods, combining accuracy with explainability in temporal knowledge graph prediction.
Abstract: Temporal Knowledge Graphs have emerged as a powerful way of not only modeling static relationships between entities but also the dynamics of how relations evolve over time. As these informational structures can be used to store information from a real-world setting, such as a news flow, predicting future graph components to a certain extent equates predicting real-world events. Most of the research in this field focuses on embedding-based methods, often leveraging convolutional neural net architectures. These solutions act as black boxes, limiting insight. In this paper, we explore an extension to an established rule-based framework, TLogic, that yields a high accuracy in combination with explainable predictions. This offers transparency and allows the end-user to critically evaluate the rules applied at the end of the prediction stage. The new rule format incorporates entity category as a key component with the purpose of limiting rule application only to relevant entities. When categories are unknown for building the graph, we propose a data-driven method to generate them with an LLM-based approach. Additionally, we investigate the choice of aggregation method for scores of retrieved entities when performing category prediction.
[442] Transformers can do Bayesian Clustering
Prajit Bhaskaran, Tom Viering
Main category: cs.LG
TL;DR: Cluster-PFN is a Transformer-based model that performs fast Bayesian clustering by learning from synthetic GMM data, handling missing values without imputation, and outperforming traditional methods in speed and accuracy.
Details
Motivation: Bayesian clustering is computationally intensive and real-world datasets often have missing values, where simple imputation ignores uncertainty and leads to poor results.Method: Extends Prior-Data Fitted Networks (PFNs) using Transformers, trained entirely on synthetic datasets from Gaussian Mixture Model priors to estimate posterior distributions over cluster counts and assignments.
Result: Outperforms AIC, BIC and Variational Inference in estimating cluster numbers, achieves competitive clustering quality with VI but much faster, and handles missing data better than imputation baselines on genomic datasets.
Conclusion: Cluster-PFN provides scalable and flexible Bayesian clustering that effectively handles uncertainty and missing data.
Abstract: Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, resulting in suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, outperforming imputation-based baselines on real-world genomic datasets, at high missingness. These results show that the Cluster-PFN can provide scalable and flexible Bayesian clustering.
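Since training relies entirely on synthetic tasks drawn from a GMM prior, the data-generation side is easy to picture; the sketch below shows a minimal such prior sampler (the actual prior, missingness model, and Transformer architecture are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm_task(n=200, d=2, max_k=5, p_missing=0.1):
    """Draw one synthetic clustering task from a simple GMM prior."""
    k = rng.integers(1, max_k + 1)
    weights = rng.dirichlet(np.ones(k))
    means = rng.normal(scale=3.0, size=(k, d))
    scales = rng.uniform(0.5, 1.5, size=k)
    z = rng.choice(k, size=n, p=weights)              # latent assignments
    x = means[z] + rng.normal(size=(n, d)) * scales[z, None]
    x[rng.random(size=(n, d)) < p_missing] = np.nan   # missing-at-random mask
    return x, z, k

# A PFN-style model would be trained on a stream of such (x, z, k) tasks
# to predict the posterior over k and over assignments z from x alone.
x, z, k = sample_gmm_task()
print(x.shape, k)
```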
[443] SALS: Sparse Attention in Latent Space for KV cache Compression
Junlin Mu, Hantao Huang, Jihang Zhang, Minghui Yu, Tao Wang, Yidong Li
Main category: cs.LG
TL;DR: SALS is a sparse attention framework that compresses KV cache via low-rank projection into latent space and performs token selection, achieving significant speed-up and compression while maintaining accuracy.
Details
Motivation: Large Language Models with extended contexts face challenges due to large KV cache size and high memory bandwidth requirements. While KV cache has low-rank characteristics, naive compression fails due to Rotary Position Embedding requirements.Method: Projects KV cache into compact latent space via low-rank projection, performs sparse token selection using RoPE-free query-key interactions in latent space, and reconstructs only important tokens to avoid full KV cache reconstruction overhead.
Result: Achieves 6.4x KV cache compression and 5.7x speed-up in attention operator vs FlashAttention2 on 4K sequences. End-to-end throughput improves 1.4x and 4.5x vs GPT-fast on 4k and 32K sequences respectively.
Conclusion: SALS framework effectively addresses KV cache compression challenges by leveraging latent space projections and sparse token selection, achieving state-of-the-art performance with competitive accuracy across various models and benchmarks.
Abstract: Large Language Models capable of handling extended contexts are in high demand, yet their inference remains challenging due to substantial Key-Value cache size and high memory bandwidth requirements. Previous research has demonstrated that KV cache exhibits low-rank characteristics within the hidden dimension, suggesting the potential for effective compression. However, due to the widely adopted Rotary Position Embedding mechanism in modern LLMs, naive low-rank compression suffers severe accuracy degradation or creates a new speed bottleneck, as the low-rank cache must first be reconstructed in order to apply RoPE. In this paper, we introduce two key insights: first, the application of RoPE to the key vectors increases their variance, which in turn results in a higher rank; second, after the key vectors are transformed into the latent space, they largely maintain their representation across most layers. Based on these insights, we propose the Sparse Attention in Latent Space framework. SALS projects the KV cache into a compact latent space via low-rank projection, and performs sparse token selection using RoPE-free query-key interactions in this space. By reconstructing only a small subset of important tokens, it avoids the overhead of full KV cache reconstruction. We comprehensively evaluate SALS on various tasks using two large-scale models: LLaMA2-7b-chat and Mistral-7b, and additionally verify its scalability on the RULER-128k benchmark with LLaMA3.1-8B-Instruct. Experimental results demonstrate that SALS achieves SOTA performance by maintaining competitive accuracy. Under different settings, SALS achieves 6.4-fold KV cache compression and 5.7-fold speed-up in the attention operator compared to FlashAttention2 on the 4K sequence. For the end-to-end throughput performance, we achieves 1.4-fold and 4.5-fold improvement compared to GPT-fast on 4k and 32K sequences, respectively.
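A numpy sketch of the mechanism described in the abstract: keys are cached only in a low-rank latent space, token selection uses RoPE-free query-key scores in that space, and only the selected keys are reconstructed for exact attention (the projection and scoring details below are illustrative, not the paper's learned components):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, R, TOP_K = 4096, 128, 32, 256   # tokens, head dim, latent rank, tokens kept

# A learned low-rank projection is assumed; here it is a random orthonormal basis.
P = np.linalg.qr(rng.normal(size=(D, R)))[0]          # D x R

# The cache stores keys only in the compact latent space, plus full values.
K_lat = rng.normal(size=(T, D)) @ P                    # T x R latent keys
V = rng.normal(size=(T, D))

def rope(x):
    """Placeholder for rotary position embedding (identity in this sketch)."""
    return x

def sals_attention(q):
    # 1) RoPE-free token selection using latent-space query-key scores.
    scores = (q @ P) @ K_lat.T                         # (T,)
    idx = np.argpartition(scores, -TOP_K)[-TOP_K:]
    # 2) Reconstruct only the selected keys and run exact attention on them.
    k_sel = rope(K_lat[idx] @ P.T)                     # TOP_K x D
    logits = q @ k_sel.T / np.sqrt(D)
    att = np.exp(logits - logits.max())
    att /= att.sum()
    return att @ V[idx]

print(sals_attention(rng.normal(size=D)).shape)        # (128,)
```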
[444] Perception Learning: A Formal Separation of Sensory Representation Learning from Decision Learning
Suman Sanyal
Main category: cs.LG
TL;DR: Perception Learning (PeL) is a paradigm that optimizes sensory interfaces using task-agnostic signals, separate from decision learning, focusing on perceptual properties like stability, informativeness, and controlled geometry.
Details
Motivation: To decouple perception learning from downstream decision tasks, allowing optimization of sensory interfaces based on intrinsic perceptual properties rather than task-specific objectives.Method: Formalizes separation of perception and decision, defines perceptual properties independent of objectives, proves PeL updates preserve invariants orthogonal to Bayes task-risk gradients, and provides task-agnostic evaluation metrics.
Result: Developed a framework for optimizing sensory interfaces using label-free perceptual properties assessed via objective representation-invariant metrics.
Conclusion: PeL enables task-agnostic optimization of perception systems, providing certified perceptual quality through invariant metrics separate from decision learning.
Abstract: We introduce Perception Learning (PeL), a paradigm that optimizes an agent’s sensory interface $f_\phi:\mathcal{X}\to\mathcal{Z}$ using task-agnostic signals, decoupled from downstream decision learning $g_\theta:\mathcal{Z}\to\mathcal{Y}$. PeL directly targets label-free perceptual properties, such as stability to nuisances, informativeness without collapse, and controlled geometry, assessed via objective representation-invariant metrics. We formalize the separation of perception and decision, define perceptual properties independent of objectives or reparameterizations, and prove that PeL updates preserving sufficient invariants are orthogonal to Bayes task-risk gradients. Additionally, we provide a suite of task-agnostic evaluation metrics to certify perceptual quality.
[445] EDC: Equation Discovery for Classification
Guus Toussaint, Arno Knobbe
Main category: cs.LG
TL;DR: EDC is a new equation discovery framework for binary classification that finds analytical functions to specify decision boundaries, outperforming current ED-based methods and achieving comparable performance to state-of-the-art classification approaches.
Details
Motivation: To extend Equation Discovery techniques from regression tasks to binary classification by discovering interpretable analytical functions that define decision boundaries.Method: Proposes EDC framework using a configurable grammar with summands including linear, quadratic, exponential terms, and feature products to create flexible decision boundaries while avoiding overfitting.
Result: EDC successfully discovers both equation structure and parameter values, outperforming current ED-based classification methods and achieving performance comparable to state-of-the-art binary classification approaches on artificial and real-life datasets.
Conclusion: The proposed grammar provides sufficient flexibility for decision boundaries without causing overfitting, and the framework allows for domain-specific extensions where needed.
Abstract: Equation Discovery techniques have shown considerable success in regression tasks, where they are used to discover concise and interpretable models (Symbolic Regression). In this paper, we propose a new ED-based binary classification framework. Our proposed method EDC finds analytical functions of manageable size that specify the location and shape of the decision boundary. In extensive experiments on artificial and real-life data, we demonstrate how EDC is able to discover both the structure of the target equation as well as the value of its parameters, outperforming the current state-of-the-art ED-based classification methods and achieving performance comparable to the state of the art in binary classification. We suggest a grammar of modest complexity that appears to work well on the tested datasets but argue that the exact grammar – and thus the complexity of the models – is configurable, and especially domain-specific expressions can be included in the pattern language, where that is required. The presented grammar consists of a series of summands (additive terms) that include linear, quadratic and exponential terms, as well as products of two features (producing hyperbolic curves ideal for capturing XOR-like dependencies). The experiments demonstrate that this grammar allows fairly flexible decision boundaries while not being so rich as to cause overfitting.
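As a simplified stand-in for the search (not EDC's actual discovery procedure), one can expand features into the grammar's summands and fit a sparse linear model whose sign gives the decision boundary; on XOR-like data the product term should carry the weight:

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expand(X):
    """Grammar-inspired feature library: linear, quadratic, exponential
    terms and pairwise products (the latter capture XOR-like patterns)."""
    cols = [X, X**2, np.exp(np.clip(X, -5, 5))]
    cols += [X[:, [i]] * X[:, [j]] for i, j in combinations(range(X.shape[1]), 2)]
    return np.hstack(cols)

# Toy XOR-like data, where a product term x0*x1 is the natural boundary.
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(expand(X), y)
print(np.round(clf.coef_, 2))  # the x0*x1 term should dominate
```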
[446] Filtering instances and rejecting predictions to obtain reliable models in healthcare
Maria Gabriela Valeriano, David Kohan Marzagão, Alfredo Montelongo, Carlos Roberto Veiga Kiffer, Natan Katz, Ana Carolina Lorena
Main category: cs.LG
TL;DR: A two-step data-centric approach using Instance Hardness filtering and confidence-based rejection to improve ML model reliability in healthcare applications.
Details
Motivation: ML models in healthcare often fail to account for uncertainty and provide predictions even with low confidence, which is problematic for high-stakes domains where reliability is critical.Method: Two-step approach: (1) Use Instance Hardness to filter problematic instances during training to refine dataset, (2) Implement confidence-based rejection mechanism during inference to retain only reliable predictions.
Result: Evaluation on three real-world healthcare datasets shows improved model reliability while balancing predictive performance and rejection rate. IH filtering with confidence-based rejection effectively enhances performance while preserving most instances.
Conclusion: The proposed approach provides a practical method for deploying ML systems in safety-critical applications by improving data quality and filtering low-confidence predictions.
Abstract: Machine Learning (ML) models are widely used in high-stakes domains such as healthcare, where the reliability of predictions is critical. However, these models often fail to account for uncertainty, providing predictions even with low confidence. This work proposes a novel two-step data-centric approach to enhance the performance of ML models by improving data quality and filtering low-confidence predictions. The first step involves leveraging Instance Hardness (IH) to filter problematic instances during training, thereby refining the dataset. The second step introduces a confidence-based rejection mechanism during inference, ensuring that only reliable predictions are retained. We evaluate our approach using three real-world healthcare datasets, demonstrating its effectiveness at improving model reliability while balancing predictive performance and rejection rate. Additionally, we use alternative criteria - influence values for filtering and uncertainty for rejection - as baselines to evaluate the efficiency of the proposed method. The results demonstrate that integrating IH filtering with confidence-based rejection effectively enhances model performance while preserving a large proportion of instances. This approach provides a practical method for deploying ML systems in safety-critical applications.
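A minimal sketch of the two steps, using k-disagreeing neighbours (kDN) as the instance-hardness proxy and the predicted class probability for rejection; the hardness measure, thresholds, and models are illustrative rather than the paper's exact choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1: instance hardness via k-disagreeing neighbours (kDN); drop hard cases.
k = 7
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_tr)
_, idx = nn.kneighbors(X_tr)
kdn = (y_tr[idx[:, 1:]] != y_tr[:, None]).mean(axis=1)   # fraction of disagreeing neighbours
keep = kdn <= 0.5                                        # illustrative threshold
clf = RandomForestClassifier(random_state=0).fit(X_tr[keep], y_tr[keep])

# Step 2: confidence-based rejection at inference time.
proba = clf.predict_proba(X_te)
conf = proba.max(axis=1)
accept = conf >= 0.75                                    # illustrative threshold
pred = proba.argmax(axis=1)
acc = (pred[accept] == y_te[accept]).mean()
print(f"accepted {accept.mean():.0%} of cases, accuracy on accepted: {acc:.3f}")
```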
[447] A Comprehensive Evaluation Framework for Synthetic Trip Data Generation in Public Transport
Yuanyuan Wu, Zhenlin Qin, Zhenliang Ma
Main category: cs.LG
TL;DR: Proposes a Representativeness-Privacy-Utility (RPU) framework to systematically evaluate synthetic trip data across three dimensions and hierarchical levels, benchmarking 12 generation methods and finding CTGAN provides the most balanced trade-off.
Details
Motivation: Synthetic data addresses privacy and accessibility challenges in public transport research, but existing evaluations are fragmented and limited, leaving unclear how reliable, safe, and useful synthetic data truly are.Method: Developed a RPU framework that evaluates synthetic trip data across representativeness, privacy, and utility dimensions at record, group, and population levels, using consistent metrics to quantify similarity, disclosure risk, and practical usefulness.
Result: Synthetic data doesn’t inherently guarantee privacy; no “one-size-fits-all” model exists; trade-off between privacy and representativeness/utility is obvious; CTGAN provides the most balanced trade-off for practical applications.
Conclusion: The RPU framework provides a systematic and reproducible basis for researchers and practitioners to compare synthetic data generation techniques and select appropriate methods in public transport applications.
Abstract: Synthetic data offers a promising solution to the privacy and accessibility challenges of using smart card data in public transport research. Despite rapid progress in generative modeling, there is limited attention to comprehensive evaluation, leaving unclear how reliable, safe, and useful synthetic data truly are. Existing evaluations remain fragmented, typically limited to population-level representativeness or record-level privacy, without considering group-level variations or task-specific utility. To address this gap, we propose a Representativeness-Privacy-Utility (RPU) framework that systematically evaluates synthetic trip data across three complementary dimensions and three hierarchical levels (record, group, population). The framework integrates a consistent set of metrics to quantify similarity, disclosure risk, and practical usefulness, enabling transparent and balanced assessment of synthetic data quality. We apply the framework to benchmark twelve representative generation methods, spanning conventional statistical models, deep generative networks, and privacy-enhanced variants. Results show that synthetic data do not inherently guarantee privacy and there is no “one-size-fits-all” model, the trade-off between privacy and representativeness/utility is obvious. Conditional Tabular generative adversarial network (CTGAN) provide the most balanced trade-off and is suggested for practical applications. The RPU framework provides a systematic and reproducible basis for researchers and practitioners to compare synthetic data generation techniques and select appropriate methods in public transport applications.
[448] Sample-efficient and Scalable Exploration in Continuous-Time RL
Klemens Iten, Lenart Treven, Bhavya Sukhija, Florian Dörfler, Andreas Krause
Main category: cs.LG
TL;DR: COMBRL is a continuous-time model-based RL algorithm that uses probabilistic models to learn nonlinear ODE dynamics and maximizes a combination of extrinsic reward and model uncertainty for sample-efficient learning.
Details
Motivation: Most RL algorithms are designed for discrete-time dynamics, but real-world control systems are often continuous in time, creating a gap that needs to be addressed.Method: Leverages probabilistic models (Gaussian processes and Bayesian neural networks) to learn uncertainty-aware ODE models, and greedily maximizes a weighted sum of extrinsic reward and model epistemic uncertainty.
Result: COMBRL achieves sublinear regret in reward-driven settings, provides sample complexity bounds in unsupervised RL, and outperforms baselines in experiments with better scalability and sample efficiency.
Conclusion: COMBRL offers a scalable and sample-efficient approach to continuous-time model-based RL that bridges the gap between discrete-time algorithms and continuous real-world systems.
Abstract: Reinforcement learning algorithms are typically designed for discrete-time dynamics, even though the underlying real-world control systems are often continuous in time. In this paper, we study the problem of continuous-time reinforcement learning, where the unknown system dynamics are represented using nonlinear ordinary differential equations (ODEs). We leverage probabilistic models, such as Gaussian processes and Bayesian neural networks, to learn an uncertainty-aware model of the underlying ODE. Our algorithm, COMBRL, greedily maximizes a weighted sum of the extrinsic reward and model epistemic uncertainty. This yields a scalable and sample-efficient approach to continuous-time model-based RL. We show that COMBRL achieves sublinear regret in the reward-driven setting, and in the unsupervised RL setting (i.e., without extrinsic rewards), we provide a sample complexity bound. In our experiments, we evaluate COMBRL in both standard and unsupervised RL settings and demonstrate that it scales better, is more sample-efficient than prior methods, and outperforms baselines across several deep RL tasks.
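In the notation of the abstract, the greedy acquisition step can be written compactly (a generic form; the weight $\beta_t$ and the epistemic uncertainty $\sigma_t$ depend on the chosen GP or Bayesian-neural-network model of the ODE):

$\pi_t \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{k} r(s_k, a_k)\Big] + \beta_t \, \mathbb{E}_{\pi}\Big[\sum_{k} \sigma_t(s_k, a_k)\Big],$

so that trajectories are valued both for their extrinsic reward and for visiting states where the learned dynamics model is still uncertain.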
[449] APEX: Approximate-but-exhaustive search for ultra-large combinatorial synthesis libraries
Aryan Pedawi, Jordi Silvestre-Ryan, Bradley Worley, Darren J Hsu, Kushal S Shah, Elias Stehle, Jingrong Zhang, Izhar Wallach
Main category: cs.LG
TL;DR: APEX is a neural network-based virtual screening protocol that enables approximate-but-exhaustive search of combinatorial synthesis libraries in under a minute on consumer GPUs, outperforming existing methods in both accuracy and speed.
Details
Motivation: Current virtual screening methods for large combinatorial synthesis libraries (tens of billions of compounds) can only evaluate <0.1% of compounds due to computational constraints, leaving many high-scoring compounds undiscovered. Existing algorithms also lack amortization when objectives and constraints change during screening campaigns.Method: APEX uses a neural network surrogate that exploits the structure of combinatorial synthesis libraries to predict objectives and constraints, enabling full enumeration on consumer GPUs in under a minute for exact retrieval of approximate top-k sets.
Result: APEX demonstrated consistently strong performance in retrieval accuracy and runtime compared to alternative methods on a benchmark CSL of over 10 million compounds with docking scores on five medically relevant targets and physicochemical properties.
Conclusion: APEX enables efficient approximate-but-exhaustive virtual screening of combinatorial synthesis libraries, overcoming computational limitations of current methods and providing a practical solution for large-scale drug discovery campaigns.
Abstract: Make-on-demand combinatorial synthesis libraries (CSLs) like Enamine REAL have significantly enabled drug discovery efforts. However, their large size presents a challenge for virtual screening, where the goal is to identify the top compounds in a library according to a computational objective (e.g., optimizing docking score) subject to computational constraints under a limited computational budget. For current library sizes – numbering in the tens of billions of compounds – and scoring functions of interest, a routine virtual screening campaign may be limited to scoring fewer than 0.1% of the available compounds, leaving potentially many high scoring compounds undiscovered. Furthermore, as constraints (and sometimes objectives) change during the course of a virtual screening campaign, existing virtual screening algorithms typically offer little room for amortization. We propose the approximate-but-exhaustive search protocol for CSLs, or APEX. APEX utilizes a neural network surrogate that exploits the structure of CSLs in the prediction of objectives and constraints to make full enumeration on a consumer GPU possible in under a minute, allowing for exact retrieval of approximate top-$k$ sets. To demonstrate APEX’s capabilities, we develop a benchmark CSL comprising more than 10 million compounds, all of which have been annotated with their docking scores on five medically relevant targets along with physicochemical properties measured with RDKit such that, for any objective and set of constraints, the ground truth top-$k$ compounds can be identified and compared against the retrievals from any virtual screening algorithm. We show APEX’s consistently strong performance both in retrieval accuracy and runtime compared to alternative methods.
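A toy illustration of why CSL structure makes exhaustive scoring cheap: if the surrogate's score decomposes over per-reagent (synthon) embeddings (an assumption made here for illustration, not necessarily APEX's architecture), per-reagent contributions can be precomputed once and every product scored by a broadcasted combination:

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-component combinatorial library: every reagent from list A can be
# paired with every reagent from list B (|A| * |B| products).
N_A, N_B, D = 3000, 4000, 64
emb_A = rng.normal(size=(N_A, D))      # per-synthon embeddings (assumed learned)
emb_B = rng.normal(size=(N_B, D))
w = rng.normal(size=D)                 # linear scoring head (assumed learned)

# Precompute per-reagent partial scores once, then score all 12M products
# with a broadcasted sum -- no per-product featurisation needed.
score_A = emb_A @ w                    # (N_A,)
score_B = emb_B @ w                    # (N_B,)
all_scores = score_A[:, None] + score_B[None, :]   # (N_A, N_B)

k = 10
flat_idx = np.argpartition(all_scores.ravel(), -k)[-k:]
top_pairs = np.column_stack(np.unravel_index(flat_idx, all_scores.shape))
print(top_pairs)                       # approximate top-k (a_i, b_j) index pairs
```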
[450] Fill in the Blanks: Accelerating Q-Learning with a Handful of Demonstrations in Sparse Reward Settings
Seyed Mahdi Basiri Azad, Joschka Boedecker
Main category: cs.LG
TL;DR: A hybrid offline-to-online RL method that uses a few demonstrations to initialize value functions, reducing exploration burden in sparse-reward environments.
Details
Motivation: RL in sparse-reward environments is challenging due to lack of informative feedback, requiring efficient ways to guide exploration.Method: Use offline demonstrations to precompute value estimates as targets for early learning, then refine through standard online interaction.
Result: Accelerates convergence and outperforms standard baselines in sparse-reward tasks, even with minimal or suboptimal demonstrations.
Conclusion: The hybrid offline-to-online paradigm significantly improves sample efficiency in sparse-reward RL settings.
Abstract: Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
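A minimal tabular sketch of the idea: back up Monte Carlo returns along a few successful demonstrations, write them into the Q-table as a prior over the demonstrated actions, and then continue with standard Q-learning (the environment, rewards, and hyperparameters below are illustrative):

```python
import numpy as np

N_STATES, N_ACTIONS, GAMMA = 20, 4, 0.99
Q = np.zeros((N_STATES, N_ACTIONS))

# A handful of successful demonstrations: lists of (state, action, reward).
demos = [
    [(0, 1, 0.0), (3, 2, 0.0), (7, 1, 0.0), (12, 0, 1.0)],
    [(0, 2, 0.0), (5, 1, 0.0), (12, 0, 1.0)],
]

# "Fill in the blanks": back up Monte Carlo returns along each demo and use
# them to initialise the values of the demonstrated (state, action) pairs.
for traj in demos:
    g = 0.0
    for state, action, reward in reversed(traj):
        g = reward + GAMMA * g
        Q[state, action] = max(Q[state, action], g)

# Standard online Q-learning then refines Q from environment interaction:
def q_update(s, a, r, s_next, alpha=0.1):
    Q[s, a] += alpha * (r + GAMMA * Q[s_next].max() - Q[s, a])

print(Q[0])   # demonstrated actions in state 0 start with non-zero values
```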
[451] Methodology for Comparing Machine Learning Algorithms for Survival Analysis
Lucas Buk Cardoso, Simone Aldrey Angelo, Yasmin Pacheco Gil Bonilha, Fernando Maia, Adeylson Guimarães Ribeiro, Maria Paula Curado, Gisele Aparecida Fernandes, Vanderlei Cunha Parro, Flávio Almeida de Magalhães Cipparrone, Alexandre Dias Porto Chiavegatto Filho, Tatiana Natasha Toporcov
Main category: cs.LG
TL;DR: Comparative analysis of 6 machine learning survival models on 45,000 colorectal cancer patients, with XGB-AFT achieving best performance.
Details
Motivation: To evaluate and compare the performance of various machine learning models for survival analysis in predicting colorectal cancer patient survival.Method: Used data from 45,000 colorectal cancer patients, evaluated 6 MLSA models (RSF, GBSA, SSVM, XGB-Cox, XGB-AFT, LGBM) with hyperparameter optimization, assessed using C-Index, IPCW, time-dependent AUC, and IBS metrics.
Result: XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. Models were compared with classification algorithms and interpreted using SHAP and permutation importance.
Conclusion: Machine learning survival analysis models show strong potential for improving survival prediction and supporting clinical decision making in colorectal cancer.
Abstract: This study presents a comparative methodological analysis of six machine learning models for survival analysis (MLSA). Using data from nearly 45,000 colorectal cancer patients in the Hospital-Based Cancer Registries of São Paulo, we evaluated Random Survival Forest (RSF), Gradient Boosting for Survival Analysis (GBSA), Survival SVM (SSVM), XGBoost-Cox (XGB-Cox), XGBoost-AFT (XGB-AFT), and LightGBM (LGBM), capable of predicting survival considering censored data. Hyperparameter optimization was performed with different samplers, and model performance was assessed using the Concordance Index (C-Index), C-Index IPCW, time-dependent AUC, and Integrated Brier Score (IBS). Survival curves produced by the models were compared with predictions from classification algorithms, and predictor interpretation was conducted using SHAP and permutation importance. XGB-AFT achieved the best performance (C-Index = 0.7618; IPCW = 0.7532), followed by GBSA and RSF. The results highlight the potential and applicability of MLSA to improve survival prediction and support decision making.
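For reference, the headline metric can be computed directly; below is a minimal Harrell's C-index for right-censored data (without the IPCW weighting also reported in the study):

```python
import numpy as np

def concordance_index(time, event, risk):
    """Harrell's C-index for right-censored data.

    time:  observed time (event or censoring)
    event: 1 if the event was observed, 0 if censored
    risk:  predicted risk score (higher = earlier expected event)
    """
    n_concordant, n_comparable = 0.0, 0.0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue                       # i must have an observed event
        for j in range(n):
            if time[i] < time[j]:          # comparable pair: i failed first
                n_comparable += 1
                if risk[i] > risk[j]:
                    n_concordant += 1
                elif risk[i] == risk[j]:
                    n_concordant += 0.5
    return n_concordant / n_comparable

time = np.array([5.0, 8.0, 3.0, 10.0, 7.0])
event = np.array([1, 0, 1, 1, 0])
risk = np.array([0.9, 0.3, 0.8, 0.1, 0.4])
print(round(concordance_index(time, event, risk), 3))   # 0.857
```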
[452] MIMIC-Sepsis: A Curated Benchmark for Modeling and Learning from Sepsis Trajectories in the ICU
Yong Huang, Zhongqi Yang, Amir Rahmani
Main category: cs.LG
TL;DR: MIMIC-Sepsis is a curated cohort and benchmark framework derived from MIMIC-IV database for reproducible sepsis modeling, featuring 35,239 ICU patients with standardized clinical variables and treatment data.
Details
Motivation: Existing sepsis research relies on outdated datasets, non-reproducible preprocessing pipelines, and limited clinical intervention coverage, creating need for standardized framework.Method: Created transparent preprocessing pipeline based on Sepsis-3 criteria with structured imputation and treatment inclusion, released alongside benchmark tasks for mortality prediction, length-of-stay estimation, and shock onset classification.
Result: Empirical results show incorporating treatment variables substantially improves model performance, especially for Transformer-based architectures.
Conclusion: MIMIC-Sepsis serves as a robust platform for evaluating predictive and sequential models in critical care research with improved reproducibility and clinical relevance.
Abstract: Sepsis is a leading cause of mortality in intensive care units (ICUs), yet existing research often relies on outdated datasets, non-reproducible preprocessing pipelines, and limited coverage of clinical interventions. We introduce MIMIC-Sepsis, a curated cohort and benchmark framework derived from the MIMIC-IV database, designed to support reproducible modeling of sepsis trajectories. Our cohort includes 35,239 ICU patients with time-aligned clinical variables and standardized treatment data, including vasopressors, fluids, mechanical ventilation and antibiotics. We describe a transparent preprocessing pipeline-based on Sepsis-3 criteria, structured imputation strategies, and treatment inclusion-and release it alongside benchmark tasks focused on early mortality prediction, length-of-stay estimation, and shock onset classification. Empirical results demonstrate that incorporating treatment variables substantially improves model performance, particularly for Transformer-based architectures. MIMIC-Sepsis serves as a robust platform for evaluating predictive and sequential models in critical care research.
[453] LoRA-DA: Data-Aware Initialization for Low-Rank Adaptation via Asymptotic Analysis
Qingyue Zhang, Chang Chu, Tianren Peng, Qi Li, Xiangyang Luo, Zhihao Jiang, Shao-Lun Huang
Main category: cs.LG
TL;DR: The paper proposes LoRA-DA, a data-aware LoRA initialization method that uses asymptotic analysis to derive an optimal initialization strategy combining bias and variance terms, improving fine-tuning performance across multiple benchmarks.
Details
Motivation: Existing LoRA initialization methods have limitations: many don't use target-domain data, while gradient-based methods rely on one-step gradient decomposition which has weak empirical performance and lacks rigorous theoretical foundation or depends on restrictive assumptions.Method: Established theoretical framework based on asymptotic analysis, deriving optimization problem with bias term (Fisher-gradient formulation) and variance term (Fisher information). Developed LoRA-DA algorithm that estimates these terms from target domain samples to obtain optimal LoRA initialization.
Result: Empirical results across multiple benchmarks show LoRA-DA consistently improves final accuracy over existing methods, with faster and more stable convergence, robustness across ranks, and only small initialization overhead.
Conclusion: LoRA-DA provides a theoretically grounded, data-aware initialization method for LoRA that outperforms existing approaches, offering practical benefits for parameter-efficient fine-tuning of LLMs.
Abstract: With the widespread adoption of LLMs, LoRA has become a dominant method for PEFT, and its initialization methods have attracted increasing attention. However, existing methods have notable limitations: many methods do not incorporate target-domain data, while gradient-based methods exploit data only at a shallow level by relying on one-step gradient decomposition, which remains unsatisfactory due to the weak empirical performance of the one-step fine-tuning model that serves as their basis, as well as the fact that these methods either lack a rigorous theoretical foundation or depend heavily on restrictive isotropic assumptions. In this paper, we establish a theoretical framework for data-aware LoRA initialization based on asymptotic analysis. Starting from a general optimization objective that minimizes the expectation of the parameter discrepancy between the fine-tuned and target models, we derive an optimization problem with two components: a bias term, which is related to the parameter distance between the fine-tuned and target models, and is approximated using a Fisher-gradient formulation to preserve anisotropy; and a variance term, which accounts for the uncertainty introduced by sampling stochasticity through the Fisher information. By solving this problem, we obtain an optimal initialization strategy for LoRA. Building on this theoretical framework, we develop an efficient algorithm, LoRA-DA, which estimates the terms in the optimization problem from a small set of target domain samples and obtains the optimal LoRA initialization. Empirical results across multiple benchmarks demonstrate that LoRA-DA consistently improves final accuracy over existing initialization methods. Additional studies show faster, more stable convergence, robustness across ranks, and only a small initialization overhead for LoRA-DA. The source code will be released upon publication.
[454] DistDF: Time-Series Forecasting Needs Joint-Distribution Wasserstein Alignment
Hao Wang, Licheng Pan, Yuan Lu, Zhixuan Chu, Xiaoxi Li, Shuting He, Zhichao Chen, Haoxuan Li, Qingsong Wen, Zhouchen Lin
Main category: cs.LG
TL;DR: DistDF is a new time-series forecasting method that minimizes distributional discrepancies between forecasts and labels, addressing bias in standard direct forecasting approaches caused by label autocorrelation.
Details
Motivation: Standard direct forecast approaches suffer from biased estimation due to label autocorrelation when minimizing conditional negative log-likelihood using mean squared error.Method: Proposes DistDF which minimizes a joint-distribution Wasserstein discrepancy that upper bounds the conditional discrepancy, enabling tractable and differentiable estimation from empirical samples for gradient-based training.
Result: Extensive experiments show DistDF improves performance of diverse forecast models and achieves state-of-the-art forecasting performance.
Conclusion: DistDF provides an effective alternative to standard direct forecasting by addressing distributional misalignment through a theoretically grounded discrepancy minimization approach.
Abstract: Training time-series forecast models requires aligning the conditional distribution of model forecasts with that of the label sequence. The standard direct forecast (DF) approach resorts to minimize the conditional negative log-likelihood of the label sequence, typically estimated using the mean squared error. However, this estimation proves to be biased in the presence of label autocorrelation. In this paper, we propose DistDF, which achieves alignment by alternatively minimizing a discrepancy between the conditional forecast and label distributions. Because conditional discrepancies are difficult to estimate from finite time-series observations, we introduce a newly proposed joint-distribution Wasserstein discrepancy for time-series forecasting, which provably upper bounds the conditional discrepancy of interest. This discrepancy admits tractable, differentiable estimation from empirical samples and integrates seamlessly with gradient-based training. Extensive experiments show that DistDF improves the performance of diverse forecast models and achieves the state-of-the-art forecasting performance. Code is available at https://anonymous.4open.science/r/DistDF-F66B.
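To make the idea concrete, here is a generic sliced-Wasserstein stand-in for a joint-distribution discrepancy between (history, forecast) and (history, label) samples; DistDF's exact discrepancy and its differentiable estimator are not reproduced here:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

def joint_sliced_wasserstein(hist, pred, target, n_proj=64):
    """Sliced 1-Wasserstein distance between the joint samples
    (history, forecast) and (history, label)."""
    a = np.hstack([hist, pred])       # samples from the forecast joint
    b = np.hstack([hist, target])     # samples from the label joint
    d = a.shape[1]
    total = 0.0
    for _ in range(n_proj):
        v = rng.normal(size=d)
        v /= np.linalg.norm(v)
        total += wasserstein_distance(a @ v, b @ v)
    return total / n_proj

# Toy batch: 256 series, 24-step history, 8-step horizon.
hist = rng.normal(size=(256, 24))
target = hist[:, -1:] + rng.normal(scale=0.1, size=(256, 8))
good_pred = target + rng.normal(scale=0.05, size=target.shape)
bad_pred = rng.normal(size=target.shape)
print(joint_sliced_wasserstein(hist, good_pred, target))   # small
print(joint_sliced_wasserstein(hist, bad_pred, target))    # larger
```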
[455] Physics-Informed Extreme Learning Machine (PIELM): Opportunities and Challenges
He Yang, Fei Ren, Hai-Sui Yu, Xiaohui Chen, Pei-Zhi Zhuang
Main category: cs.LG
TL;DR: This paper provides a perspective and review on Physics-Informed Extreme Learning Machine (PIELM), highlighting its development, applications in solving PDEs with various challenges, and identifying remaining challenges for future research.
Details
Motivation: To summarize and review the fast development of PIELM for higher computation efficiency and accuracy in physics-informed machine learning, as no comprehensive review currently exists.Method: The authors present their perspective and experience on PIELM development, analyzing efforts made to solve PDEs with sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, and multiphysics coupling.
Result: The review identifies that despite successes in various applications, many urgent challenges remain in PIELM development.
Conclusion: The remaining challenges provide opportunities to develop more robust, interpretable, and generalizable PIELM frameworks for science and engineering applications.
Abstract: We are very delighted to see the fast development of physics-informed extreme learning machine (PIELM) in recent years for higher computation efficiency and accuracy in physics-informed machine learning. As a summary or review on PIELM is currently not available, we would like to take this opportunity to show our perspective and experience for this promising research direction. We can see many efforts are made to solve PDEs with sharp gradients, nonlinearities, high-frequency behavior, hard constraints, uncertainty, multiphysics coupling. Despite the success, many urgent challenges remain to be tackled, which also provides us opportunities to develop more robust, interpretable, and generalizable PIELM frameworks with applications in science and engineering.
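For readers new to PIELM, the core recipe is compact: hidden features are random and fixed, and the physics residuals plus boundary conditions become linear equations in the output weights, solved in a single least-squares step. A toy example for $u' + u = 0$, $u(0) = 1$ (purely illustrative; the review covers far richer settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: u'(x) + u(x) = 0 on [0, 1], u(0) = 1 (exact solution exp(-x)).
N_HIDDEN, N_COLLOC = 50, 40
w = rng.normal(scale=2.0, size=N_HIDDEN)     # random, fixed hidden weights
b = rng.normal(scale=2.0, size=N_HIDDEN)
x = np.linspace(0.0, 1.0, N_COLLOC)[:, None]

phi = np.tanh(w * x + b)                     # hidden features, (N_COLLOC, N_HIDDEN)
dphi = w * (1.0 - phi**2)                    # their x-derivatives

# Physics residual rows: u' + u = 0 at collocation points; boundary row: u(0) = 1.
A = np.vstack([dphi + phi, np.tanh(b)[None, :]])
rhs = np.concatenate([np.zeros(N_COLLOC), [1.0]])
beta, *_ = np.linalg.lstsq(A, rhs, rcond=None)

u_hat = phi @ beta
print(np.max(np.abs(u_hat - np.exp(-x[:, 0]))))   # small approximation error
```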
[456] Causal Ordering for Structure Learning From Time Series
Pedro P. Sanchez, Damian Machlanski, Steven McDonagh, Sotirios A. Tsaftaris
Main category: cs.LG
TL;DR: DOTS (Diffusion Ordered Temporal Structure) is a novel causal discovery method that uses multiple causal orderings instead of a single one to improve temporal causal structure recovery, outperforming state-of-the-art baselines on both synthetic and real-world benchmarks.
Details
Motivation: Traditional ordering-based causal discovery methods in time series are limited by their single-ordering approach, which restricts representational capacity and introduces spurious artifacts. The combinatorial complexity of identifying true causal relationships grows with variables and time points.Method: DOTS leverages multiple valid causal orderings using diffusion-based causal discovery. It integrates score matching with diffusion processes for efficient Hessian estimation, recovering the transitive closure of the underlying directed acyclic graph under standard assumptions like stationarity and additive noise model.
Result: On synthetic benchmarks (3-6 variables, 200-5,000 samples), DOTS improves mean window-graph F1 from 0.63 (best baseline) to 0.81. On CausalTime real-world benchmark (20-36 variables), DOTS achieves highest average summary-graph F1 while halving runtime compared to graph-optimization methods.
Conclusion: DOTS establishes itself as a scalable and accurate solution for temporal causal discovery, effectively addressing limitations of single-ordering approaches and demonstrating superior performance across synthetic and real-world datasets.
Abstract: Predicting causal structure from time series data is crucial for understanding complex phenomena in physiology, brain connectivity, climate dynamics, and socio-economic behaviour. Causal discovery in time series is hindered by the combinatorial complexity of identifying true causal relationships, especially as the number of variables and time points grow. A common approach to simplify the task is the so-called ordering-based methods. Traditional ordering methods inherently limit the representational capacity of the resulting model. In this work, we fix this issue by leveraging multiple valid causal orderings, instead of a single one as standard practice. We propose DOTS (Diffusion Ordered Temporal Structure), using diffusion-based causal discovery for temporal data. By integrating multiple orderings, DOTS effectively recovers the transitive closure of the underlying directed acyclic graph, mitigating spurious artifacts inherent in single-ordering approaches. We formalise the problem under standard assumptions such as stationarity and the additive noise model, and leverage score matching with diffusion processes to enable efficient Hessian estimation. Extensive experiments validate the approach. Empirical evaluations on synthetic and real-world datasets demonstrate that DOTS outperforms state-of-the-art baselines, offering a scalable and robust approach to temporal causal discovery. On synthetic benchmarks ($d = 3$–$6$ variables, $T = 200$–$5{,}000$ samples), DOTS improves mean window-graph $F1$ from $0.63$ (best baseline) to $0.81$. On the CausalTime real-world benchmark ($d = 20$–$36$), while baselines remain the best on individual datasets, DOTS attains the highest average summary-graph $F1$ while halving runtime relative to graph-optimisation methods. These results establish DOTS as a scalable and accurate solution for temporal causal discovery.
[457] A Novel XAI-Enhanced Quantum Adversarial Networks for Velocity Dispersion Modeling in MaNGA Galaxies
Sathwik Narkedimilli, N V Saran Kumar, Aswath Babu H, Manjunath K Vanahalli, Manish M, Vinija Jain, Aman Chadha
Main category: cs.LG
TL;DR: A quantum adversarial framework combining hybrid quantum neural networks with classical layers, using evaluator-guided optimization for accuracy and interpretability, achieving strong performance metrics.
Details
Motivation: Address challenges in quantum machine learning by balancing predictive accuracy, robustness, and interpretability in a unified framework.Method: Hybrid quantum neural network with classical deep learning layers, guided by adversarial evaluator with LIME-based interpretability, extended through quantum GAN and self-supervised variants.
Result: Vanilla model achieved RMSE = 0.27, MSE = 0.071, MAE = 0.21, and R^2 = 0.59, showing most consistent performance across regression metrics.
Conclusion: Combining quantum-inspired methods with classical architectures enables lightweight, high-performance, and interpretable predictive models, advancing QML applicability.
Abstract: Current quantum machine learning approaches often face challenges balancing predictive accuracy, robustness, and interpretability. To address this, we propose a novel quantum adversarial framework that integrates a hybrid quantum neural network (QNN) with classical deep learning layers, guided by an evaluator model with LIME-based interpretability, and extended through quantum GAN and self-supervised variants. In the proposed model, an adversarial evaluator concurrently guides the QNN by computing feedback loss, thereby optimizing both prediction accuracy and model explainability. Empirical evaluations show that the Vanilla model achieves RMSE = 0.27, MSE = 0.071, MAE = 0.21, and R^2 = 0.59, delivering the most consistent performance across regression metrics compared to adversarial counterparts. These results demonstrate the potential of combining quantum-inspired methods with classical architectures to develop lightweight, high-performance, and interpretable predictive models, advancing the applicability of QML beyond current limitations.
[458] The Cost of Robustness: Tighter Bounds on Parameter Complexity for Robust Memorization in ReLU Nets
Yujun Kim, Chaewon Moon, Chulhee Yun
Main category: cs.LG
TL;DR: The paper analyzes parameter complexity for robust memorization in ReLU networks, establishing upper and lower bounds that depend on the robustness ratio ρ = μ/ε, showing complexity matches non-robust memorization for small ρ but grows with increasing ρ.
Details
Motivation: To understand how many parameters are needed for ReLU networks to achieve robust memorization - interpolating datasets while maintaining consistent predictions within μ-balls around training samples, with ε-separation between differently labeled points.Method: Established upper and lower bounds on parameter count as a function of robustness ratio ρ = μ/ε, providing fine-grained analysis across the entire range ρ ∈ (0,1) with tighter bounds than prior work.
Result: Parameter complexity of robust memorization matches non-robust memorization when ρ is small, but grows with increasing ρ. Obtained tighter upper and lower bounds that improve upon existing results.
Conclusion: The study provides comprehensive bounds on parameter complexity for robust memorization in ReLU networks, revealing the relationship between robustness requirements and network size across the full range of robustness ratios.
Abstract: We study the parameter complexity of robust memorization for $\mathrm{ReLU}$ networks: the number of parameters required to interpolate any given dataset with $\epsilon$-separation between differently labeled points, while ensuring predictions remain consistent within a $\mu$-ball around each training sample. We establish upper and lower bounds on the parameter count as a function of the robustness ratio $\rho = \mu / \epsilon$. Unlike prior work, we provide a fine-grained analysis across the entire range $\rho \in (0,1)$ and obtain tighter upper and lower bounds that improve upon existing results. Our findings reveal that the parameter complexity of robust memorization matches that of non-robust memorization when $\rho$ is small, but grows with increasing $\rho$.
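Stated compactly, the object of study is: a network $f$ robustly memorizes a dataset $D = \{(x_i, y_i)\}$ with margin $\mu$ if $f(x') = y_i$ for all $x'$ with $\|x' - x_i\| \le \mu$, given that the data are $\epsilon$-separated, i.e. $\|x_i - x_j\| \ge \epsilon$ whenever $y_i \ne y_j$; the parameter-count bounds are then expressed in the robustness ratio $\rho = \mu/\epsilon \in (0,1)$.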
[459] Semi-supervised and unsupervised learning for health indicator extraction from guided waves in aerospace composite structures
James Josep Perry, Pablo Garcia-Conde Ortiz, George Konstantinou, Cornelie Vergouwen, Edlyn Santha Kumaran, Morteza Moradi
Main category: cs.LG
TL;DR: This paper presents a data-driven framework for learning health indicators in aerospace composite structures using two approaches: Diversity-DeepSAD and DTC-VAE, integrated with multi-domain signal processing to handle material variability and complex damage modes.
Details
Motivation: Extracting reliable health indicators for aerospace composite structures is challenging due to material property variability, stochastic damage evolution, diverse damage modes, and complications from manufacturing defects and in-service incidents.Method: Two learning approaches: (1) Diversity-DeepSAD with continuous auxiliary labels as damage proxies to overcome binary label limitations, and (2) DTC-VAE with explicit monotonicity constraint. Uses guided waves with multiple frequencies and explores time, frequency, and time-frequency representations with unsupervised ensemble learning for HI fusion.
Result: Diversity-DeepSAD achieved 81.6% performance using FFT features, while DTC-VAE delivered the most consistent health indicators with 92.3% performance, outperforming existing baselines.
Conclusion: The proposed data-driven framework successfully learns reliable health indicators for aerospace composite structures, with DTC-VAE showing superior performance and consistency in monitoring structural degradation under fatigue loading.
Abstract: Health indicators (HIs) are central to diagnosing and prognosing the condition of aerospace composite structures, enabling efficient maintenance and operational safety. However, extracting reliable HIs remains challenging due to variability in material properties, stochastic damage evolution, and diverse damage modes. Manufacturing defects (e.g., disbonds) and in-service incidents (e.g., bird strikes) further complicate this process. This study presents a comprehensive data-driven framework that learns HIs via two learning approaches integrated with multi-domain signal processing. Because ground-truth HIs are unavailable, a semi-supervised and an unsupervised approach are proposed: (i) a diversity deep semi-supervised anomaly detection (Diversity-DeepSAD) approach augmented with continuous auxiliary labels used as hypothetical damage proxies, which overcomes the limitation of prior binary labels that only distinguish healthy and failed states while neglecting intermediate degradation, and (ii) a degradation-trend-constrained variational autoencoder (DTC-VAE), in which the monotonicity criterion is embedded via an explicit trend constraint. Guided waves with multiple excitation frequencies are used to monitor single-stiffener composite structures under fatigue loading. Time, frequency, and time-frequency representations are explored, and per-frequency HIs are fused via unsupervised ensemble learning to mitigate frequency dependence and reduce variance. Using fast Fourier transform features, the augmented Diversity-DeepSAD model achieved 81.6% performance, while DTC-VAE delivered the most consistent HIs with 92.3% performance, outperforming existing baselines.
[460] Symbolic Snapshot Ensembles
Mingyue Liu, Andrew Cropper
Main category: cs.LG
TL;DR: A novel ILP ensemble method that saves intermediate hypotheses from a single training run and combines them using MDL weighting, achieving 4% accuracy improvement with minimal computational overhead.
Details
Motivation: Traditional ILP ensemble methods require multiple training runs to learn multiple hypotheses, which is computationally expensive. This paper aims to develop a more efficient ensemble approach.Method: Train an ILP algorithm only once while saving intermediate hypotheses, then combine these hypotheses using a minimum description length (MDL) weighting scheme.
Result: Experiments on multiple benchmarks (including game playing and visual reasoning) show 4% improvement in predictive accuracy with less than 1% computational overhead compared to traditional methods.
Conclusion: The proposed single-run ensemble method with MDL weighting provides significant accuracy improvements for ILP while maintaining computational efficiency.
Abstract: Inductive logic programming (ILP) is a form of logical machine learning. Most ILP algorithms learn a single hypothesis from a single training run. Ensemble methods train an ILP algorithm multiple times to learn multiple hypotheses. In this paper, we train an ILP algorithm only once and save intermediate hypotheses. We then combine the hypotheses using a minimum description length weighting scheme. Our experiments on multiple benchmarks, including game playing and visual reasoning, show that our approach improves predictive accuracy by 4% with less than 1% computational overhead.
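A minimal sketch of the core idea, combining saved hypotheses with minimum description length weights; the function and toy hypotheses below are hypothetical and only illustrate the weighting scheme, not the authors' ILP system:

```python
import math
from collections import Counter

def mdl_ensemble_predict(hypotheses, example):
    """Weighted vote over snapshot hypotheses using MDL-style weights.

    `hypotheses` is a list of (predict_fn, description_length_bits) pairs,
    e.g. intermediate programs saved during a single ILP training run.
    Shorter hypotheses receive exponentially larger weight, 2^(-DL).
    """
    votes = Counter()
    for predict_fn, dl_bits in hypotheses:
        votes[predict_fn(example)] += math.pow(2.0, -dl_bits)
    return votes.most_common(1)[0][0]

# Toy usage: two "hypotheses" with different description lengths.
h_short = (lambda x: x % 2 == 0, 10.0)   # 10-bit hypothesis
h_long = (lambda x: x % 3 == 0, 25.0)    # 25-bit hypothesis
print(mdl_ensemble_predict([h_short, h_long], 4))  # short hypothesis wins -> True
```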
[461] Learning to Drive Safely with Hybrid Options
Bram De Cooman, Johan Suykens
Main category: cs.LG
TL;DR: This paper applies the options framework to autonomous driving on highways, defining specialized options for longitudinal and lateral maneuvers with safety/comfort constraints, and shows that policies over hybrid options outperform baseline approaches.
Details
Motivation: The options framework is naturally suited for hierarchical control in autonomous driving but is underutilized, so the authors aim to incorporate domain knowledge and constrain driving behavior more easily through this approach.Method: Defined dedicated options for longitudinal and lateral maneuvers with embedded safety/comfort constraints, proposed hierarchical control setups with options, and derived practical algorithms using state-of-the-art reinforcement learning techniques with separate action selection for longitudinal and lateral control.
Result: Policies over combined and hybrid options achieved the same expressiveness and flexibility as human drivers while being easier to interpret than classical continuous action policies. Among all approaches, flexible policies over hybrid options performed best under varying traffic conditions.
Conclusion: The options framework successfully enables hierarchical control for autonomous driving, with hybrid options providing the best performance while maintaining interpretability and human-like flexibility.
Abstract: Out of the many deep reinforcement learning approaches for autonomous driving, only a few make use of the options (or skills) framework. That is surprising, as this framework is naturally suited for hierarchical control applications in general, and autonomous driving tasks in particular. Therefore, in this work the options framework is applied and tailored to autonomous driving tasks on highways. More specifically, we define dedicated options for longitudinal and lateral manoeuvres with embedded safety and comfort constraints. This way, prior domain knowledge can be incorporated into the learning process and the learned driving behaviour can be constrained more easily. We propose several setups for hierarchical control with options and derive practical algorithms following state-of-the-art reinforcement learning techniques. By separately selecting actions for longitudinal and lateral control, the introduced policies over combined and hybrid options obtain the same expressiveness and flexibility that human drivers have, while being easier to interpret than classical policies over continuous actions. Of all the investigated approaches, these flexible policies over hybrid options perform the best under varying traffic conditions, outperforming the baseline policies over actions.
[462] Pearl: A Foundation Model for Placing Every Atom in the Right Location
Genesis Research Team, Alejandro Dobles, Nina Jovic, Kenneth Leidal, Pranav Murugan, David C. Williams, Drausin Wulsin, Nate Gruver, Christina X. Ji, Korrawat Pruegsanusak, Gianluca Scarpellini, Ansh Sharma, Wojciech Swiderski, Andrea Bootsma, Richard Strong Bowen, Charlotte Chen, Jamin Chen, Marc André Dämgen, Roy Tal Dew, Benjamin DiFrancesco, J. D. Fishman, Alla Ivanova, Zach Kagin, David Li-Bland, Zuli Liu, Igor Morozov, Jeffrey Ouyang-Zhang, Frank C. Pickard IV, Kushal S. Shah, Ben Shor, Gabriel Monteiro da Silva, Maxx Tessmer, Carl Tilbury, Cyr Vetcher, Daniel Zeng, Maruan Al-Shedivat, Aleksandra Faust, Evan N. Feinberg, Michael V. LeVine, Matteus Pan
Main category: cs.LG
TL;DR: Pearl is a foundation model for protein-ligand cofolding that achieves state-of-the-art performance through large-scale synthetic data training, SO(3)-equivariant diffusion architecture, and controllable inference capabilities.
Details
Motivation: Current deep learning methods for protein-ligand structure prediction are limited by scarce experimental data, inefficient architectures, physically invalid poses, and inability to exploit auxiliary information during inference.Method: Pearl uses three key innovations: (1) large-scale synthetic data training to overcome data scarcity, (2) SO(3)-equivariant diffusion module for 3D rotational symmetry, and (3) controllable inference with multi-chain templating and dual unconditional/conditional modes.
Result: Pearl surpasses AlphaFold 3 and other baselines in generating accurate (RMSD < 2 Å), physically valid poses, with 14.5% and 14.2% improvements on the public Runs N' Poses and PoseBusters benchmarks, respectively. In pocket-conditional cofolding, it achieves a 3.6× improvement on challenging drug targets at the more rigorous RMSD < 1 Å threshold.
Conclusion: Pearl establishes new state-of-the-art in protein-ligand cofolding, with performance directly correlating with synthetic dataset size used in training.
Abstract: Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N’ Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers $3.6\times$ improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
[463] Greedy Sampling Is Provably Efficient for RLHF
Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
Main category: cs.LG
TL;DR: The paper provides theoretical analysis of RLHF with general preference models, showing that greedy sampling achieves order-wise better performance guarantees than existing optimistic/pessimistic approaches.
Details
Motivation: Theoretical understanding of RLHF is limited, especially for general preference models beyond Bradley-Terry. Existing approaches use optimism/pessimism but may be suboptimal.Method: Analyzes RLHF with general preference models using greedy sampling (empirical estimates) rather than optimistic/pessimistic estimates. Leverages structural properties of optimal policy under KL-regularized targets.
Result: Obtains major order-wise improvements in performance guarantees over existing methods. Shows greedy sampling is sufficient for RLHF, particularly for Bradley-Terry model.
Conclusion: Greedy sampling provides strong theoretical guarantees for RLHF, challenging the need for complex optimistic/pessimistic approaches due to unique structural properties of KL-regularized optimal policies.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has emerged as a key technique for post-training large language models. Despite its empirical success, the theoretical understanding of RLHF is still limited, as learning the KL-regularized target with only preference feedback poses additional challenges compared with canonical RL. Existing works mostly study the reward-based Bradley-Terry (BT) preference model, and extend classical designs utilizing optimism or pessimism. This work, instead, considers the general preference model (whose practical relevance has been observed recently) and obtains performance guarantees with major, order-wise improvements over existing ones. Surprisingly, these results are derived from algorithms that directly use the empirical estimates (i.e., greedy sampling), as opposed to constructing optimistic or pessimistic estimates in previous works. This insight has a deep root in the unique structural property of the optimal policy class under the KL-regularized target, and we further specialize it to the BT model, highlighting the surprising sufficiency of greedy sampling in RLHF.
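For reference, the KL-regularized target mentioned in the abstract has a well-known closed-form maximizer; a greedy algorithm simply plugs an empirical reward estimate into it rather than an optimistic or pessimistic one (schematic restatement in the Bradley-Terry specialization, not the paper's notation):

```latex
\[
\max_{\pi}\; \mathbb{E}_{y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
  \;-\; \beta\, \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),
\qquad
\pi^{\star}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left( r(x, y) / \beta \right).
\]
```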
[464] Eigenfunction Extraction for Ordered Representation Learning
Burak Varıcı, Che-Ping Tsai, Ritabrata Ray, Nicholas M. Boffi, Pradeep Ravikumar
Main category: cs.LG
TL;DR: A framework for extracting ordered and identifiable eigenfunctions from contextual kernels, addressing limitations of existing methods that only recover linear spans of top eigenfunctions.
Details
Motivation: Current representation learning methods (contrastive and non-contrastive) only recover linear spans of top eigenfunctions, while exact spectral decomposition is essential for understanding feature ordering and importance.Method: Proposed a general framework with modular building blocks for eigenfunction extraction, compatible with contextual kernels and scalable to modern settings. Aligned two paradigms: low-rank approximation and Rayleigh quotient optimization.
Result: Validated on synthetic kernels and real-world image datasets. Recovered eigenvalues act as effective importance scores for feature selection, enabling principled efficiency-accuracy tradeoffs via adaptive-dimensional representations.
Conclusion: The framework successfully extracts ordered eigenfunctions that provide meaningful feature importance scores, enabling better understanding and control of representation learning.
Abstract: Recent advances in representation learning reveal that widely used objectives, such as contrastive and non-contrastive, implicitly perform spectral decomposition of a contextual kernel, induced by the relationship between inputs and their contexts. Yet, these methods recover only the linear span of top eigenfunctions of the kernel, whereas exact spectral decomposition is essential for understanding feature ordering and importance. In this work, we propose a general framework to extract ordered and identifiable eigenfunctions, based on modular building blocks designed to satisfy key desiderata, including compatibility with the contextual kernel and scalability to modern settings. We then show how two main methodological paradigms, low-rank approximation and Rayleigh quotient optimization, align with this framework for eigenfunction extraction. Finally, we validate our approach on synthetic kernels and demonstrate on real-world image datasets that the recovered eigenvalues act as effective importance scores for feature selection, enabling principled efficiency-accuracy tradeoffs via adaptive-dimensional representations.
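As a point of contrast with span-only recovery, an exact ordered decomposition of a kernel matrix looks like the following; this NumPy sketch is only a finite-sample analogue of the paper's goal, and all names are illustrative:

```python
import numpy as np

def ordered_eigenfunctions(K, k):
    """Return the top-k eigenvalues and eigenvectors of a kernel matrix, in order.

    Contrastive objectives typically recover only the span of the top-k
    eigenvectors; an ordered decomposition additionally yields eigenvalues
    that can serve as feature-importance scores.
    """
    eigvals, eigvecs = np.linalg.eigh(K)            # ascending order
    order = np.argsort(eigvals)[::-1][:k]           # indices of the k largest
    return eigvals[order], eigvecs[:, order]

# Toy kernel on 5 points; eigenvalues act as importance scores for selection.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
vals, funcs = ordered_eigenfunctions(X @ X.T, k=2)
print(vals)
```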
[465] FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
Dongwon Jo, Jiwon Song, Yulhwa Kim, Jae-Joon Kim
Main category: cs.LG
TL;DR: FastKV is a KV cache compression framework that reduces latency in both prefill and decoding stages by leveraging token importance stabilization in later layers, achieving significant speedups while maintaining accuracy.
Details
Motivation: Current KV cache compression methods tie prefill compute reduction to decoding KV budget, causing accuracy degradation due to overlooking layer-dependent variation of critical context.Method: FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer that forwards only the most informative tokens. It then independently selects salient KV entries for caching, decoupling KV budget from prefill compute reduction.
Result: FastKV achieves speedups of up to 1.82× in prefill and 2.87× in decoding compared to full-context baseline, while matching the accuracy of baselines that only accelerate decoding stage.
Conclusion: FastKV successfully decouples KV budget from prefill compute reduction through independent control of TSP rate and KV retention rate, enabling flexible optimization of efficiency and accuracy in LLM inference.
Abstract: While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the baselines that only accelerate the decoding stage. Our code is available at https://github.com/dongwonjo/FastKV.
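A rough sketch of the Token-Selective Propagation idea as described in the abstract; how FastKV actually scores tokens is not shown here, and all tensor names and ratios below are illustrative:

```python
import torch

def tsp_select(hidden, token_scores, tsp_keep_ratio, kv_keep_ratio):
    """Select tokens to propagate and KV entries to cache, independently.

    `token_scores` is a per-token importance score at the TSP layer. Keeping
    the two selections independent is what decouples the KV budget from the
    prefill compute reduction.
    """
    seq_len = hidden.shape[0]
    n_prop = max(1, int(seq_len * tsp_keep_ratio))
    n_kv = max(1, int(seq_len * kv_keep_ratio))
    prop_idx = token_scores.topk(n_prop).indices    # tokens forwarded to later layers
    kv_idx = token_scores.topk(n_kv).indices        # KV entries retained in the cache
    return hidden[prop_idx], kv_idx

# Toy usage on a 16-token prompt with random activations and scores.
h, scores = torch.randn(16, 64), torch.rand(16)
propagated, cached = tsp_select(h, scores, tsp_keep_ratio=0.25, kv_keep_ratio=0.5)
print(propagated.shape, cached.shape)   # torch.Size([4, 64]) torch.Size([8])
```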
[466] Offline Learning and Forgetting for Reasoning with Large Language Models
Tianwei Ni, Allen Nie, Sapana Chaudhary, Yao Liu, Huzefa Rangwala, Rasool Fakoor
Main category: cs.LG
TL;DR: Fine-tuning LLMs on search-generated successful and failed reasoning paths improves mathematical reasoning performance while dramatically reducing inference time compared to inference-time search methods.
Details
Motivation: Inference-time search methods enhance LLM capabilities for complex reasoning but significantly increase computational costs and inference time. The goal is to integrate search capabilities directly into the model to maintain performance while reducing inference overhead.Method: Fine-tune LLMs on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods, using a smaller learning rate to prevent degradation of search capability. Replace CoT-generated data with search-generated data for offline fine-tuning.
Result: On Game-of-24 and Countdown puzzles, the approach improves success rates by ~23% over inference-time search baselines while reducing inference time by 180×. The learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
Conclusion: Integrating search capabilities directly into LLMs through offline fine-tuning on search-generated reasoning paths is an effective approach that achieves better performance with dramatically reduced inference costs compared to inference-time search methods.
Abstract: Leveraging inference-time search in large language models has proven effective in further enhancing a trained model's capability to solve complex mathematical and reasoning problems. However, this approach significantly increases computational costs and inference time, as the model must generate and evaluate multiple candidate solutions to identify a viable reasoning path. To address this, we propose an effective approach that integrates search capabilities directly into the model by fine-tuning it on unpaired successful (learning) and failed reasoning paths (forgetting) derived from diverse search methods. A key challenge we identify is that naive fine-tuning can degrade the model's search capability; we show this can be mitigated with a smaller learning rate. Extensive experiments on the challenging Game-of-24 and Countdown arithmetic puzzles show that replacing CoT-generated data with search-generated data for offline fine-tuning improves success rates by around 23% over inference-time search baselines, while reducing inference time by 180$\times$. On top of this, our learning and forgetting objective consistently outperforms both supervised fine-tuning and preference-based methods.
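A minimal sketch of an unpaired learning-and-forgetting objective of the kind the abstract describes (standard cross-entropy on successful paths, a negated term on failed ones); this is not the authors' exact formulation, and the coefficient is a made-up knob:

```python
import torch
import torch.nn.functional as F

def learn_and_forget_loss(pos_logits, pos_labels, neg_logits, neg_labels, forget_coef=1.0):
    """Learn from successful reasoning paths, push down failed ones."""
    vocab = pos_logits.size(-1)
    learn = F.cross_entropy(pos_logits.view(-1, vocab), pos_labels.view(-1))
    forget = F.cross_entropy(neg_logits.view(-1, vocab), neg_labels.view(-1))
    # The paper notes a small learning rate is needed to preserve search ability.
    return learn - forget_coef * forget

# Toy shapes: 2 sequences, length 5, vocabulary of 11 tokens.
pos_logits, neg_logits = torch.randn(2, 5, 11), torch.randn(2, 5, 11)
pos_labels, neg_labels = torch.randint(0, 11, (2, 5)), torch.randint(0, 11, (2, 5))
print(float(learn_and_forget_loss(pos_logits, pos_labels, neg_logits, neg_labels)))
```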
[467] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun, Jingyan Shen, Yibin Wang, Tianyu Chen, Zhendong Wang, Mingyuan Zhou, Huan Zhang
Main category: cs.LG
TL;DR: Proposes two data-efficient techniques for RL fine-tuning of LLMs: adaptive difficulty-based online data selection and rollout replay, reducing training time by 23-62% while maintaining performance.
Details
Motivation: RL fine-tuning for LLMs is resource-intensive and existing work has overlooked data efficiency problems, leading to high computational costs.Method: Uses adaptive difficulty targeting for online data selection (prioritizing moderately difficult questions) with attention-based difficulty estimation, and rollout replay mechanism to reuse recent rollouts.
Result: Achieves 23% to 62% reduction in RL fine-tuning time across 6 LLM-dataset combinations while matching the performance of original GRPO algorithm.
Conclusion: The proposed data-efficient RL fine-tuning approach significantly reduces computational costs without sacrificing performance, making RL fine-tuning more practical for LLMs.
Abstract: Reinforcement learning (RL) has become an effective approach for fine-tuning large language models (LLMs), particularly to enhance their reasoning capabilities. However, RL fine-tuning remains highly resource-intensive, and existing work has largely overlooked the problem of data efficiency. In this paper, we propose two techniques to improve data efficiency in LLM RL fine-tuning: difficulty-targeted online data selection and rollout replay. We introduce the notion of adaptive difficulty to guide online data selection, prioritizing questions of moderate difficulty that are more likely to yield informative learning signals. To estimate adaptive difficulty efficiently, we develop an attention-based framework that requires rollouts for only a small reference set of questions. The adaptive difficulty of the remaining questions is then estimated based on their similarity to this set. To further reduce rollout cost, we introduce a rollout replay mechanism inspired by experience replay in traditional RL. This technique reuses recent rollouts, lowering per-step computation while maintaining stable updates. Experiments across 6 LLM-dataset combinations show that our method reduces RL fine-tuning time by 23% to 62% while reaching the same level of performance as the original GRPO algorithm. Our code is available at https://github.com/ASTRAL-Group/data-efficient-llm-rl.
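The two ingredients are simple to picture; the sketch below is only a stand-in (in the paper, difficulty for most questions is estimated via attention-based similarity to a small rolled-out reference set rather than from direct success rates):

```python
import random
from collections import deque

def select_moderate_difficulty(questions, est_success_rate, batch_size, target=0.5):
    """Pick questions whose estimated success rate is closest to ~0.5."""
    ranked = sorted(questions, key=lambda q: abs(est_success_rate[q] - target))
    return ranked[:batch_size]

class RolloutReplay:
    """Small buffer that reuses recent rollouts to cut per-step generation cost."""
    def __init__(self, capacity=512):
        self.buffer = deque(maxlen=capacity)

    def add(self, rollouts):
        self.buffer.extend(rollouts)

    def sample(self, n):
        return random.sample(list(self.buffer), min(n, len(self.buffer)))

# Toy usage.
qs = ["q1", "q2", "q3", "q4"]
rates = {"q1": 0.95, "q2": 0.55, "q3": 0.10, "q4": 0.45}
print(select_moderate_difficulty(qs, rates, batch_size=2))   # ['q2', 'q4'] (moderate difficulty)
```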
[468] LittleBit: Ultra Low-Bit Quantization via Latent Factorization
Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim
Main category: cs.LG
TL;DR: LittleBit introduces extreme LLM compression to 0.1 bits per weight, achieving 31× memory reduction through low-rank matrix factorization and binarization with multi-scale compensation mechanisms.
Details
Motivation: Deploying large language models faces challenges from substantial memory and computational costs, with existing quantization methods suffering performance degradation in sub-1-bit regimes.Method: Uses latent matrix factorization to represent weights in low-rank form, then binarizes factors. Integrates multi-scale compensation (row, column, latent dimension) and employs Dual Sign-Value-Independent Decomposition for QAT initialization with residual compensation.
Result: Achieves 0.1 bits per weight compression, reducing Llama2-13B to under 0.9 GB (31× memory reduction). Outperforms leading methods: its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW.
Conclusion: LittleBit establishes a new viable size-performance trade-off, enabling 11.6× kernel-level speedup over FP16 and making powerful LLMs practical for resource-constrained environments.
Abstract: Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit’s superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method’s 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off–unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level–and makes powerful LLMs practical for resource-constrained environments.
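A rough, untrained sketch of the representation LittleBit targets: low-rank factors, binarized, with row, column, and per-rank scales compensating for the lost precision (Dual-SVID initialization and residual compensation are not reproduced here, and all choices below are illustrative):

```python
import torch

def littlebit_like_approx(W, rank):
    """Binarized low-rank approximation with multi-scale compensation (sketch)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U, S, Vh = U[:, :rank], S[:rank], Vh[:rank, :]
    B_u, B_v = torch.sign(U), torch.sign(Vh)            # binarized factors
    row_scale = U.abs().mean(dim=1, keepdim=True)       # per-row compensation
    col_scale = Vh.abs().mean(dim=0, keepdim=True)      # per-column compensation
    rank_scale = S                                       # per-rank importance
    return (row_scale * B_u) @ torch.diag(rank_scale) @ (B_v * col_scale)

W = torch.randn(64, 64)
W_hat = littlebit_like_approx(W, rank=8)
print(float(torch.norm(W - W_hat) / torch.norm(W)))      # relative error of the sketch
```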
[469] Diffusion Models Meet Contextual Bandits
Imad Aouali
Main category: cs.LG
TL;DR: Using pre-trained diffusion models as expressive priors for efficient online decision-making in contextual bandits, enabling fast posterior approximation and sampling.
Details
Motivation: Address computational and statistical inefficiencies in contextual bandits by leveraging diffusion models as informative priors to capture complex action dependencies.Method: Develop a practical algorithm that efficiently approximates posteriors under diffusion model priors, supporting both fast updates and sampling.
Result: Empirical results show effectiveness and versatility across diverse contextual bandit settings.
Conclusion: Diffusion models serve as effective priors for improving efficiency in contextual bandit decision-making through fast posterior approximation.
Abstract: Efficient online decision-making in contextual bandits is challenging, as methods without informative priors often suffer from computational or statistical inefficiencies. In this work, we leverage pre-trained diffusion models as expressive priors to capture complex action dependencies and develop a practical algorithm that efficiently approximates posteriors under such priors, enabling both fast updates and sampling. Empirical results demonstrate the effectiveness and versatility of our approach across diverse contextual bandit settings.
[470] One-Step is Enough: Sparse Autoencoders for Text-to-Image Diffusion Models
Viacheslav Surkov, Chris Wendler, Antonio Mari, Mikhail Terekhov, Justin Deschenaux, Robert West, Caglar Gulcehre, David Bau
Main category: cs.LG
TL;DR: SAEs can decompose intermediate representations in SDXL Turbo into interpretable features that generalize across different text-to-image models and enable controllable image editing.
Details
Motivation: While SAEs have shown success in making LLM representations interpretable, similar approaches were lacking for text-to-image models like SDXL Turbo.Method: Train SAEs on transformer block updates in SDXL Turbo’s denoising U-net, create RIEBench for image editing by activating/deactivating SAE features during generation.
Result: SAEs learned interpretable features that generalize to 4-step SDXL Turbo and SDXL base model without retraining, with features showing causal influence and block specialization.
Conclusion: SAEs are a promising approach for understanding and manipulating text-to-image diffusion models, establishing the first investigation of SAEs for interpretability in this domain.
Abstract: For large language models (LLMs), sparse autoencoders (SAEs) have been shown to decompose intermediate representations that often are not interpretable directly into sparse sums of interpretable features, facilitating better control and subsequent analysis. However, similar analyses and approaches have been lacking for text-to-image models. We investigate the possibility of using SAEs to learn interpretable features for SDXL Turbo, a few-step text-to-image diffusion model. To this end, we train SAEs on the updates performed by transformer blocks within SDXL Turbo’s denoising U-net in its 1-step setting. Interestingly, we find that they generalize to 4-step SDXL Turbo and even to the multi-step SDXL base model (i.e., a different model) without additional training. In addition, we show that their learned features are interpretable, causally influence the generation process, and reveal specialization among the blocks. We do so by creating RIEBench, a representation-based image editing benchmark, for editing images while they are generated by turning on and off individual SAE features. This allows us to track which transformer blocks’ features are the most impactful depending on the edit category. Our work is the first investigation of SAEs for interpretability in text-to-image diffusion models and our results establish SAEs as a promising approach for understanding and manipulating the internal mechanisms of text-to-image models.
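For readers unfamiliar with SAEs, the object being trained is small; a minimal version of the kind of autoencoder typically fit to transformer-block activations is shown below (dictionary size and L1 coefficient are illustrative, not the paper's SDXL Turbo settings):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete dictionary, ReLU codes, L1 sparsity penalty."""
    def __init__(self, d_model, d_dict, l1_coef=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)
        self.l1_coef = l1_coef

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))        # sparse, non-negative features
        recon = self.decoder(codes)
        loss = ((recon - acts) ** 2).mean() + self.l1_coef * codes.abs().mean()
        return recon, codes, loss

sae = SparseAutoencoder(d_model=1280, d_dict=8192)
acts = torch.randn(4, 1280)                           # stand-in for U-net block updates
_, codes, loss = sae(acts)
print(codes.shape, float(loss))
```

Editing in RIEBench then amounts, roughly, to turning individual entries of `codes` on or off during generation and decoding the modified activations back into the U-net.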
[471] $β$-DQN: Improving Deep Q-Learning By Evolving the Behavior
Hongming Zhang, Fengshuo Bai, Chenjun Xiao, Chao Gao, Bo Xu, Martin Müller
Main category: cs.LG
TL;DR: β-DQN is a simple and efficient exploration method that augments DQN with a behavior function β to generate diverse policies and uses an adaptive meta-controller for flexible exploration.
Details
Motivation: Existing sophisticated exploration methods lack generality and have high computational cost, leading researchers to prefer simpler methods like ε-greedy.Method: Augments standard DQN with behavior function β that estimates action probabilities, generates diverse policies balancing exploration coverage and bias correction, and uses adaptive meta-controller for policy selection.
Result: Outperforms existing baseline methods across a wide range of tasks in both simple and challenging exploration domains.
Conclusion: β-DQN provides an effective, straightforward-to-implement solution with minimal computational overhead for improving exploration in deep reinforcement learning.
Abstract: While many sophisticated exploration methods have been proposed, their lack of generality and high computational cost often lead researchers to favor simpler methods like $\epsilon$-greedy. Motivated by this, we introduce $\beta$-DQN, a simple and efficient exploration method that augments the standard DQN with a behavior function $\beta$. This function estimates the probability that each action has been taken at each state. By leveraging $\beta$, we generate a population of diverse policies that balance exploration between state-action coverage and overestimation bias correction. An adaptive meta-controller is designed to select an effective policy for each episode, enabling flexible and explainable exploration. $\beta$-DQN is straightforward to implement and adds minimal computational overhead to the standard DQN. Experiments on both simple and challenging exploration domains show that $\beta$-DQN outperforms existing baseline methods across a wide range of tasks, providing an effective solution for improving exploration in deep reinforcement learning.
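One way to picture the behavior function is as a mask over well-covered actions when building a policy population; the construction and meta-controller in the paper differ, so treat this NumPy snippet purely as an illustration:

```python
import numpy as np

def policy_population(q_values, beta_probs, thresholds):
    """Build several greedy policies that differ in how strongly they avoid
    actions the behavior function beta says have already been taken often."""
    actions = []
    for t in thresholds:
        scores = np.where(beta_probs <= t, q_values, -np.inf)   # favor under-tried actions
        if not np.isfinite(scores).any():
            scores = q_values                                    # fall back to plain greedy
        actions.append(int(np.argmax(scores)))
    return actions

q = np.array([1.0, 0.5, 0.8])
beta = np.array([0.7, 0.1, 0.2])    # action 0 has been taken most often
print(policy_population(q, beta, thresholds=[0.15, 0.5, 1.0]))   # [1, 2, 0]
```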
[472] A High-Dimensional Statistical Method for Optimizing Transfer Quantities in Multi-Source Transfer Learning
Qingyue Zhang, Haohao Fu, Guanbo Huang, Yaoyuan Liang, Chang Chu, Tianren Peng, Yanru Wu, Qi Li, Yang Li, Shao-Lun Huang
Main category: cs.LG
TL;DR: Proposes OTQMS framework to determine optimal sample quantities from multiple source tasks for efficient multi-source transfer learning.
Details
Motivation: Existing multi-source transfer learning methods use all available source samples, which reduces training efficiency and may lead to suboptimal results.Method: Developed theoretical framework using K-L divergence and high-dimensional statistical analysis to determine optimal transfer quantities, implemented in architecture-agnostic OTQMS algorithm.
Result: OTQMS significantly outperforms state-of-the-art approaches in accuracy and data efficiency on diverse architectures and real-world benchmark datasets.
Conclusion: The proposed framework effectively addresses data efficiency in multi-source transfer learning by optimizing source sample usage.
Abstract: Multi-source transfer learning provides an effective solution to data scarcity in real-world supervised learning scenarios by leveraging multiple source tasks. In this field, existing works typically use all available samples from sources in training, which constrains their training efficiency and may lead to suboptimal results. To address this, we propose a theoretical framework that answers the question: what is the optimal quantity of source samples needed from each source task to jointly train the target model? Specifically, we introduce a generalization error measure based on K-L divergence, and minimize it based on high-dimensional statistical analysis to determine the optimal transfer quantity for each source task. Additionally, we develop an architecture-agnostic and data-efficient algorithm OTQMS to implement our theoretical results for target model training in multi-source transfer learning. Experimental studies on diverse architectures and two real-world benchmark datasets show that our proposed algorithm significantly outperforms state-of-the-art approaches in both accuracy and data efficiency. The code and supplementary materials are available in https://github.com/zqy0126/OTQMS.
[473] ADMN: A Layer-Wise Adaptive Multimodal Network for Dynamic Input Noise and Compute Resources
Jason Wu, Yuyang Yuan, Kang Yang, Lance Kaplan, Mani Srivastava
Main category: cs.LG
TL;DR: ADMN is a layer-wise Adaptive Depth Multimodal Network that dynamically adjusts computation across modalities based on resource constraints and input quality, achieving comparable accuracy with up to 75% fewer FLOPs.
Details
Motivation: Multimodal systems struggle with dynamic compute resource availability and fluctuating input quality, while existing approaches cannot adapt to these changing conditions effectively.Method: Proposes ADMN which adjusts total active layers to meet compute constraints and reallocates layers across modalities based on their quality.
Result: ADMN matches state-of-the-art accuracy while reducing floating-point operations by up to 75%.
Conclusion: ADMN effectively addresses both compute resource constraints and modality quality variations in multimodal systems.
Abstract: Multimodal deep learning systems are deployed in dynamic scenarios due to the robustness afforded by multiple sensing modalities. Nevertheless, they struggle with varying compute resource availability (due to multi-tenancy, device heterogeneity, etc.) and fluctuating quality of inputs (from sensor feed corruption, environmental noise, etc.). Statically provisioned multimodal systems cannot adapt when compute resources change over time, while existing dynamic networks struggle with strict compute budgets. Additionally, both systems often neglect the impact of variations in modality quality. Consequently, modalities suffering substantial corruption may needlessly consume resources better allocated towards other modalities. We propose ADMN, a layer-wise Adaptive Depth Multimodal Network capable of tackling both challenges: it adjusts the total number of active layers across all modalities to meet strict compute resource constraints and continually reallocates layers across input modalities according to their modality quality. Our evaluations showcase ADMN can match the accuracy of state-of-the-art networks while reducing up to 75% of their floating-point operations.
[474] FragFM: Hierarchical Framework for Efficient Molecule Generation via Fragment-Level Discrete Flow Matching
Joongwon Lee, Seonghwan Kim, Seokhyun Moon, Hyunwoo Kim, Woo Youn Kim
Main category: cs.LG
TL;DR: FragFM is a hierarchical framework for molecular graph generation that uses fragment-level discrete flow matching and a coarse-to-fine autoencoder to efficiently generate molecules while achieving better property control than atom-based methods.
Details
Motivation: To enable more efficient and scalable molecular generation by working at the fragment level rather than atom level, and to provide better property control and flexibility in molecular design.Method: Uses fragment-level discrete flow matching with a hierarchical framework, a coarse-to-fine autoencoder for atom-level detail reconstruction, and a stochastic fragment bag strategy to handle extensive fragment spaces.
Result: FragFM achieves superior performance on various molecular generation benchmarks including the proposed NPGen benchmark, demonstrating better property control than atom-based methods and additional flexibility through fragment bag conditioning.
Conclusion: Fragment-based generative modeling shows significant potential for large-scale, property-aware molecular design, enabling more efficient exploration of chemical space for drug discovery applications.
Abstract: We introduce FragFM, a novel hierarchical framework via fragment-level discrete flow matching for efficient molecular graph generation. FragFM generates molecules at the fragment level, leveraging a coarse-to-fine autoencoder to reconstruct details at the atom level. Together with a stochastic fragment bag strategy to effectively handle an extensive fragment space, our framework enables more efficient and scalable molecular generation. We demonstrate that our fragment-based approach achieves better property control than the atom-based method and additional flexibility through conditioning the fragment bag. We also propose a Natural Product Generation benchmark (NPGen) to evaluate modern molecular graph generative models’ ability to generate natural product-like molecules. Since natural products are biologically prevalidated and differ from typical drug-like molecules, our benchmark provides a more challenging yet meaningful evaluation relevant to drug discovery. We conduct a FragFM comparative study against various models on diverse molecular generation benchmarks, including NPGen, demonstrating superior performance. The results highlight the potential of fragment-based generative modeling for large-scale, property-aware molecular design, paving the way for more efficient exploration of chemical space.
[475] Generalized Exponentiated Gradient Algorithms Using the Euler Two-Parameter Logarithm
Andrzej Cichocki
Main category: cs.LG
TL;DR: Proposes a new class of Generalized Exponentiated Gradient algorithms using Mirror Descent with Bregman divergence based on Euler logarithm, a two-parameter deformed logarithm associated with trace-form entropies.
Details
Motivation: To develop more flexible gradient descent algorithms that can adapt to training data distributions by learning hyperparameters in deformed logarithm functions, addressing the challenge of investigating numerous existing entropic functionals.Method: Uses Mirror Descent updates with Bregman divergence employing Euler logarithm as link function, estimates deformed exponential function as inverse of Euler logarithm, and tunes two hyperparameters to control algorithm properties.
Result: Developed novel GEG/MD updates that can adapt to data distribution through learned hyperparameters, providing adjustable gradient descent algorithm properties.
Conclusion: The proposed approach enables adaptive gradient algorithms through parameterized deformed logarithms, focusing on trace-form entropies among many existing entropic functionals.
Abstract: In this paper we propose and investigate a new class of Generalized Exponentiated Gradient (GEG) algorithms using Mirror Descent (MD) updates, applying the Bregman divergence with a two-parameter deformation of the logarithm as a link function. This link function (referred to here as the Euler logarithm) is associated with a relatively wide class of trace-form entropies. In order to derive novel GEG/MD updates, we estimate a deformed exponential function that closely approximates the inverse of the Euler two-parameter deformed logarithm. The characteristic shape and properties of the Euler logarithm and its inverse deformed exponential function are tuned by two hyperparameters. By learning these hyperparameters, we can adapt to the distribution of training data and adjust them to achieve desired properties of gradient descent algorithms. More than fifty mathematically well-established entropic functionals and associated deformed logarithms now exist in the literature, making it impossible to investigate all of them in one paper. We therefore focus on a class of trace-form entropies and the associated two-parameter deformed logarithms.
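The generic update behind such algorithms is compact; writing log_E for the Euler (deformed) logarithm used as the link function and exp_E for its approximate inverse, the schematic mirror-descent/GEG step is (our restatement, recovering classical exponentiated gradient when log_E is the ordinary logarithm):

```latex
\[
\mathbf{w}_{t+1} \;=\; \exp_{E}\!\Big( \log_{E}(\mathbf{w}_{t}) \;-\; \eta_{t}\, \nabla L(\mathbf{w}_{t}) \Big),
\qquad \exp_{E} \;\approx\; \log_{E}^{-1}.
\]
```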
[476] Einsum Networks: Fast and Scalable Learning of Tractable Probabilistic Circuits
Robert Peharz, Steven Lang, Antonio Vergari, Karl Stelzner, Alejandro Molina, Martin Trapp, Guy Van den Broeck, Kristian Kersting, Zoubin Ghahramani
Main category: cs.LG
TL;DR: Einsum Networks (EiNets) are a new implementation of probabilistic circuits that use monolithic einsum operations for significant speed and memory improvements, enabling scaling to large datasets like SVHN and CelebA.
Details
Motivation: Current probabilistic circuit implementations have sparsely connected computational graphs that make training difficult on real-world data, limiting their scalability and practical application.Method: EiNets combine many arithmetic operations into a single einsum operation for efficiency. They also simplify Expectation-Maximization implementation using automatic differentiation.
Result: EiNets achieve speedups and memory savings of up to two orders of magnitude compared to previous implementations, and successfully scale to datasets like SVHN and CelebA as faithful generative image models.
Conclusion: EiNets represent a significant advancement in probabilistic circuit implementation, making them more practical for real-world applications through improved efficiency and scalability.
Abstract: Probabilistic circuits (PCs) are a promising avenue for probabilistic modeling, as they permit a wide range of exact and efficient inference routines. Recent "deep-learning-style" implementations of PCs strive for better scalability, but are still difficult to train on real-world data due to their sparsely connected computational graphs. In this paper, we propose Einsum Networks (EiNets), a novel implementation design for PCs, improving prior art in several regards. At their core, EiNets combine a large number of arithmetic operations in a single monolithic einsum operation, leading to speedups and memory savings of up to two orders of magnitude in comparison to previous implementations. As an algorithmic contribution, we show that the implementation of Expectation-Maximization (EM) can be simplified for PCs by leveraging automatic differentiation. Furthermore, we demonstrate that EiNets scale well to datasets which were previously out of reach, such as SVHN and CelebA, and that they can be used as faithful generative image models.
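The central trick is easy to see in code: one einsum realizes all pairwise products and weighted sums of a layer at once. The snippet below is a self-contained illustration of that pattern, not the EiNets library API:

```python
import torch

def einsum_layer(left_log_probs, right_log_probs, weights):
    """One product-and-sum layer of a probabilistic circuit as a single einsum.

    left_log_probs, right_log_probs: (batch, K) log-probabilities of K child
    distributions; weights: (K_out, K, K) non-negative mixture weights.
    """
    # Pairwise products of children become sums in log-space: (batch, K, K).
    prod = left_log_probs.unsqueeze(2) + right_log_probs.unsqueeze(1)
    # Weighted sum over all K*K products for each of the K_out output units.
    return torch.log(torch.einsum('okl,bkl->bo', weights, prod.exp()) + 1e-12)

batch, K, K_out = 4, 3, 2
l = torch.log_softmax(torch.randn(batch, K), dim=-1)
r = torch.log_softmax(torch.randn(batch, K), dim=-1)
w = torch.softmax(torch.randn(K_out, K * K), dim=-1).view(K_out, K, K)
print(einsum_layer(l, r, w).shape)   # torch.Size([4, 2])
```

(A production implementation would use the log-sum-exp trick rather than exponentiating directly; this version keeps the einsum visible.)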
[477] Mirror Descent and Novel Exponentiated Gradient Algorithms Using Trace-Form Entropies and Deformed Logarithms
Andrzej Cichocki, Toshihisa Tanaka, Frank Nielsen, Sergio Cruces
Main category: cs.LG
TL;DR: This paper presents a unified framework for Mirror Descent and Generalized Exponentiated Gradient algorithms using deformed logarithms and trace-form entropies, connecting them to natural gradient methods and providing improved convergence and robustness.
Details
Motivation: To develop optimization algorithms with better convergence behavior, robustness to gradient issues, and adaptability to non-Euclidean geometries through a unified geometric foundation.Method: Derives MD and GEG algorithms from trace-form entropies using deformed logarithms, establishes connections to natural gradient, and analyzes specific entropy families (Tsallis, Kaniadakis, etc.) that induce distinct Riemannian metrics.
Result: Shows that each entropy family induces a unique Riemannian metric, leading to GEG algorithms that preserve natural statistical geometry, with tunable parameters enabling adaptive geometric selection for enhanced robustness and convergence.
Conclusion: The framework unifies key first-order MD optimization methods under an information-geometric perspective using generalized Bregman divergences, where entropy choice determines the underlying metric and dual geometric structure.
Abstract: This paper introduces a broad class of Mirror Descent (MD) and Generalized Exponentiated Gradient (GEG) algorithms derived from trace-form entropies defined via deformed logarithms. Leveraging these generalized entropies yields MD & GEG algorithms with improved convergence behavior, robustness to vanishing and exploding gradients, and inherent adaptability to non-Euclidean geometries through mirror maps. We establish deep connections between these methods and Amari’s natural gradient, revealing a unified geometric foundation for additive, multiplicative, and natural gradient updates. Focusing on the Tsallis, Kaniadakis, Sharma–Taneja–Mittal, and Kaniadakis–Lissia–Scarfone entropy families, we show that each entropy induces a distinct Riemannian metric on the parameter space, leading to GEG algorithms that preserve the natural statistical geometry. The tunable parameters of deformed logarithms enable adaptive geometric selection, providing enhanced robustness and convergence over classical Euclidean optimization. Overall, our framework unifies key first-order MD optimization methods under a single information-geometric perspective based on generalized Bregman divergences, where the choice of entropy determines the underlying metric and dual geometric structure.
[478] Online (Non-)Convex Learning via Tempered Optimism
Maxime Haddouche, Olivier Wintenberger, Benjamin Guedj
Main category: cs.LG
TL;DR: The paper introduces Optimistically Tempered (OT) online learning to handle imperfect experts that may lead to overfitting in dynamic environments, proposing modified gradient descent algorithms and demonstrating practical efficiency.
Details
Motivation: To address the challenge of implicit optimism in online learning when experts convey partially relevant information that may cause overfitting in dynamic environments.Method: Proposed Optimistically Tempered (OT) online learning framework with modified Online Gradient and Mirror Descent algorithms for non-convex learning, and a second OT algorithm for convex losses.
Result: The tempered optimism approach shows practical efficiency on real-life datasets and toy experiments, demonstrating its effectiveness in handling imperfect experts.
Conclusion: Tempered optimism is a fruitful paradigm for online learning that successfully addresses the limitations of traditional optimistic approaches when dealing with imperfect experts in dynamic environments.
Abstract: Optimistic Online Learning aims to exploit experts conveying reliable information to predict the future. However, such implicit optimism may be challenged when it comes to the practical crafting of such experts. A fundamental example consists in approximating a minimiser of the current problem and using it as an expert. In the context of dynamic environments, such an expert only conveys partially relevant information, as it may lead to overfitting. To tackle this issue, we introduce the optimistically tempered (OT) online learning framework, designed to handle such imperfect experts. As a first contribution, we show that tempered optimism is a fruitful paradigm for Online Non-Convex Learning by proposing simple yet powerful modifications of Online Gradient and Mirror Descent. Second, we derive an OT algorithm for convex losses, and third, we evaluate the practical efficiency of tempered optimism on real-life datasets and a toy experiment.
[479] Multimodal 3D Genome Pre-training
Minghao Yang, Pengteng Li, Yan Liang, Qianyi Cai, Zhihang Zheng, Shichen Zhang, Pengfei Zhang, Zhi-An Huang, Hui Xiong
Main category: cs.LG
TL;DR: MIX-HIC is the first multimodal foundation model for 3D genomics that integrates 3D genome structure and epigenomic tracks, achieving superior performance across diverse downstream tasks.
Details
Motivation: Deep learning has advanced 3D genomics analysis, but a holistic understanding integrating both 3D genome structure and epigenomic information remains underexplored.Method: Developed cross-modal interaction and mapping blocks for robust unified representation, and created a large-scale dataset with over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for pre-training.
Result: MIX-HIC significantly surpasses existing state-of-the-art methods in diverse downstream tasks, demonstrating accurate aggregation of 3D genome knowledge.
Conclusion: This work provides a valuable multimodal foundation model and dataset resource for advancing 3D genomics research by enabling comprehensive semantic understanding of 3D genome structure and function.
Abstract: Deep learning techniques have driven significant progress in various analytical tasks within 3D genomics in computational biology. However, a holistic understanding of 3D genomics knowledge remains underexplored. Here, we propose MIX-HIC, the first multimodal foundation model of the 3D genome, which integrates both 3D genome structure and epigenomic tracks to obtain unified and comprehensive semantics. For accurate heterogeneous semantic fusion, we design cross-modal interaction and mapping blocks for robust unified representation, yielding accurate aggregation of 3D genome knowledge. In addition, we introduce the first large-scale dataset comprising over 1 million pairwise samples of Hi-C contact maps and epigenomic tracks for high-quality pre-training, enabling the exploration of functional implications in 3D genomics. Extensive experiments show that MIX-HIC significantly surpasses existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.
[480] Datasheets for Machine Learning Sensors
Matthew Stewart, Yuke Zhang, Pete Warden, Yasmine Omri, Shvetank Prakash, Jacob Huckelberry, Joao Henrique Santos, Shawn Hymel, Benjamin Yeager Brown, Jim MacArthur, Nat Jeffries, Emanuel Moss, Mona Sloane, Brian Plancher, Vijay Janapa Reddi
Main category: cs.LG
TL;DR: The paper introduces a datasheet framework for ML sensors to provide comprehensive documentation, addressing transparency, reproducibility, compliance, and responsible operation in embedded AI sensing systems.
Details
Motivation: There is a need for transparency in ML-enabled sensing systems to enable reproducibility, address compliance and auditing requirements, and verify responsible operation across diverse applications like industrial anomaly detection and wildlife tracking.Method: The authors developed a comprehensive datasheet template through academia-industry partnerships that captures ML sensor attributes including hardware specs, ML model characteristics, dataset features, performance metrics, and environmental impacts. The framework addresses streaming data, real-time processing, and includes real-world benchmarking.
Result: The framework was demonstrated through two datasheets: one for an open-source ML sensor and another for a commercial ML sensor, both performing computer vision-based person detection. The approach aligns with FAIR principles for enhanced transparency and reusability.
Conclusion: The datasheet for ML sensors framework provides a practical solution for documenting ML-enabled sensing systems, enhancing transparency across academic, industrial, and regulatory domains while supporting responsible AI deployment.
Abstract: Machine learning (ML) is becoming prevalent in embedded AI sensing systems. These “ML sensors” enable context-sensitive, real-time data collection and decision-making across diverse applications ranging from anomaly detection in industrial settings to wildlife tracking for conservation efforts. As such, there is a need to provide transparency in the operation of such ML-enabled sensing systems through comprehensive documentation. This is needed to enable their reproducibility, to address new compliance and auditing regimes mandated in regulation and industry-specific policy, and to verify and validate the responsible nature of their operation. To address this gap, we introduce the datasheet for ML sensors framework. We provide a comprehensive template, collaboratively developed in academia-industry partnerships, that captures the distinct attributes of ML sensors, including hardware specifications, ML model and dataset characteristics, end-to-end performance metrics, and environmental impacts. Our framework addresses the continuous streaming nature of sensor data, real-time processing requirements, and embeds benchmarking methodologies that reflect real-world deployment conditions, ensuring practical viability. Aligned with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), our approach enhances the transparency and reusability of ML sensor documentation across academic, industrial, and regulatory domains. To show the application of our approach, we present two datasheets: the first for an open-source ML sensor designed in-house and the second for a commercial ML sensor developed by industry collaborators, both performing computer vision-based person detection.
[481] Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng, Zhenghai Xue, Tingcong Liu, Bo An
Main category: cs.LG
TL;DR: GiGPO is a novel RL algorithm for LLM agents that enables fine-grained credit assignment in multi-turn tasks through a two-level hierarchical structure, achieving significant performance improvements while maintaining low memory overhead.
Details
Motivation: Current group-based RL methods struggle with multi-turn LLM agent training due to sparse/delayed rewards and difficulty in credit assignment across individual steps in agent-environment interactions.Method: GiGPO uses a two-level structure: episode-level macro relative advantages based on trajectory groups, and step-level micro relative advantages using anchor state grouping that identifies repeated environment states across trajectories.
Result: GiGPO achieved >12% improvement on ALFWorld and >9% on WebShop over GRPO, with superior QA task performance (42.1% on 3B and 47.2% on 7B models), while maintaining same GPU memory and minimal time cost.
Conclusion: GiGPO successfully enables fine-grained per-step credit assignment for LLM agents in multi-turn tasks, overcoming limitations of existing group-based RL methods while preserving their computational efficiency benefits.
Abstract: Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12% on ALFWorld and > 9% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1% on 3B and 47.2% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
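The two-level advantage computation can be sketched directly from the abstract; normalization details and the way macro and micro signals are combined differ in the paper, and the data layout below is made up for illustration:

```python
import numpy as np
from collections import defaultdict

def gigpo_advantages(trajectories):
    """Episode-level (macro) and step-level (micro) relative advantages.

    Each trajectory is a dict with a scalar 'return' and a list of
    (anchor_state_key, step_reward) pairs. Steps that share an anchor state
    across trajectories are grouped and baselined against each other.
    """
    returns = np.array([t["return"] for t in trajectories], dtype=float)
    macro = returns - returns.mean()                      # episode-level advantages

    groups = defaultdict(list)                            # anchor-state grouping
    for ti, t in enumerate(trajectories):
        for si, (state_key, r) in enumerate(t["steps"]):
            groups[state_key].append((ti, si, r))

    micro = [np.zeros(len(t["steps"])) for t in trajectories]
    for members in groups.values():
        baseline = np.mean([r for _, _, r in members])
        for ti, si, r in members:
            micro[ti][si] = r - baseline                  # step-level advantages
    return macro, micro

trajs = [
    {"return": 1.0, "steps": [("s0", 1.0), ("s1", 0.0)]},
    {"return": 0.0, "steps": [("s0", 0.0), ("s2", 0.0)]},
]
macro, micro = gigpo_advantages(trajs)
print(macro, micro)
```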
[482] UniCrossFi: A Unified Framework For Cross-Domain Wi-Fi-based Gesture Recognition
Ke Xu, Zhiyong Zheng, Hongyuan Zhu, Lei Wang, Jiangtao Wang
Main category: cs.LG
TL;DR: UniCrossFi is a unified framework that addresses cross-domain problems in Wi-Fi sensing by extending domain generalization to semi-supervised settings and introducing physics-informed data augmentation through Antenna Response Consistency.
Details
Motivation: Wi-Fi sensing systems face severe performance degradation in unseen real-world environments due to domain shift, and existing methods require extensive labeled data which is impractical in real scenarios with limited labeled source data.
Method: Proposes UniCrossFi framework with: 1) Semi-Supervised Domain Generalization (SSDG) setting, 2) Antenna Response Consistency (ARC) data augmentation using spatial diversity of multi-antenna systems, 3) Unified Contrastive Objective to prevent pushing apart same-class samples from different domains.
Result: Extensive experiments on Widar and CSIDA datasets show UniCrossFi consistently establishes new state-of-the-art performance, significantly outperforming existing methods across unsupervised domain adaptation, DG, and SSDG benchmarks.
Conclusion: UniCrossFi provides a principled and practical solution to domain shift challenges, advancing the feasibility of robust Wi-Fi sensing systems that operate effectively with limited labeled data in real-world deployments.
Abstract: Wi-Fi sensing systems are severely hindered by the cross-domain problem when deployed in unseen real-world environments. Existing methods typically design separate frameworks for either domain adaptation or domain generalization, often relying on extensive labeled data. However, real-world scenarios are far more complex, where the deployed model must be capable of handling generalization under limited labeled source data. To this end, we propose UniCrossFi, a unified framework designed to mitigate performance drop in CSI-based sensing across diverse deployment settings. Our framework not only extends conventional Domain Generalization (DG) to a more practical Semi-Supervised Domain Generalization (SSDG) setting, where only partially labeled source data are available, but also introduces a physics-informed data augmentation strategy, Antenna Response Consistency (ARC). ARC mitigates the risk of learning superficial shortcuts by exploiting the intrinsic spatial diversity of multi-antenna systems, treating signals from different antennas as naturally augmented views of the same event. In addition, we design a Unified Contrastive Objective to prevent conventional contrastive learning from pushing apart samples from different domains that share the same class. We conduct extensive experiments on the public Widar and CSIDA datasets. The results demonstrate that UniCrossFi consistently establishes a new state-of-the-art, significantly outperforming existing methods across all unsupervised domain adaptation, DG, and SSDG benchmarks. UniCrossFi provides a principled and practical solution to the domain shift challenge, advancing the feasibility of robust, real-world Wi-Fi sensing systems that can operate effectively with limited labeled data.
[483] The Logical Expressiveness of Temporal GNNs via Two-Dimensional Product Logics
Marco Sälzer, Przemysław Andrzej Wałęga, Martin Lange
Main category: cs.LG
TL;DR: This paper provides the first logical characterization of temporal graph neural networks (TGNNs) by connecting them to two-dimensional product logics, showing how different architectural combinations of graph and temporal components affect expressive power.
Details
Motivation: As basic neural architectures become well understood, attention is turning to combined models like temporal GNNs that integrate both spatial (graph-structure) and temporal (evolution over time) dimensions, which are challenging to analyze.
Method: The study connects temporal GNNs to two-dimensional product logics and analyzes how different architectural paradigms (static GNNs applied recursively, graph-and-time TGNNs, global TGNNs) combine graph and temporal components.
Result: Temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of PTL and modal logic K, while other architectures like graph-and-time TGNNs and global TGNNs can only express restricted fragments with syntactically constrained interaction between temporal and spatial operators.
Conclusion: These findings provide the first results on the logical expressiveness of temporal GNNs, establishing that expressive power depends crucially on how graph and temporal components are combined in the architecture.
Abstract: In recent years, the expressive power of various neural architectures – including graph neural networks (GNNs), transformers, and recurrent neural networks – has been characterised using tools from logic and formal language theory. As the capabilities of basic architectures are becoming well understood, increasing attention is turning to models that combine multiple architectural paradigms. Among them particularly important, and challenging to analyse, are temporal extensions of GNNs, which integrate both spatial (graph-structure) and temporal (evolution over time) dimensions. In this paper, we initiate the study of logical characterisation of temporal GNNs by connecting them to two-dimensional product logics. We show that the expressive power of temporal GNNs depends on how graph and temporal components are combined. In particular, temporal GNNs that apply static GNNs recursively over time can capture all properties definable in the product logic of (past) propositional temporal logic PTL and the modal logic K. In contrast, architectures such as graph-and-time TGNNs and global TGNNs can only express restricted fragments of this logic, where the interaction between temporal and spatial operators is syntactically constrained. These provide us with the first results on the logical expressiveness of temporal GNNs.
[484] FedMAP: Personalised Federated Learning for Real Large-Scale Healthcare Systems
Fan Zhang, Daniel Kreuter, Carlos Esteve-Yagüe, Sören Dittmer, Javier Fernandez-Marques, Samantha Ip, BloodCounts! Consortium, Norbert C. J. de Wit, Angela Wood, James HF Rudd, Nicholas Lane, Nicholas S Gleadall, Carola-Bibiane Schönlieb, Michael Roberts
Main category: cs.LG
TL;DR: FedMAP is a personalized federated learning framework that addresses statistical heterogeneity in healthcare data through local MAP estimation with adaptive ICNN priors, enabling large-scale deployment across diverse healthcare sites while improving performance and equity.
Details
Motivation: Federated learning in healthcare faces limitations from statistical heterogeneity (differences in patient demographics, treatments, outcomes) and infrastructure constraints, preventing practical large-scale deployment.
Method: FedMAP uses personalized FL with local Maximum a Posteriori estimation and Input Convex Neural Network priors that adaptively learn global patterns and inter-site relationships, featuring a three-tier design for varying infrastructure capabilities.
Result: FedMAP outperforms local training, FedAvg, and several PFL methods across three large clinical datasets, with underperforming regions achieving up to 14.3% performance gains and substantial equity improvements.
Conclusion: FedMAP provides the first practical pathway for large-scale healthcare FL deployment that benefits sites at all scales, enhances equity, and retains privacy.
Abstract: Federated learning (FL) promises to enable collaborative machine learning across healthcare sites whilst preserving data privacy. Practical deployment remains limited by statistical heterogeneity arising from differences in patient demographics, treatments, and outcomes, and infrastructure constraints. We introduce FedMAP, a personalised FL (PFL) framework that addresses heterogeneity through local Maximum a Posteriori (MAP) estimation with Input Convex Neural Network priors. These priors represent global knowledge gathered from other sites that guides the model while adapting to local data, and we provide a formal proof of convergence. Unlike many PFL methods that rely on fixed regularisation, FedMAP’s prior adaptively learns patterns that capture complex inter-site relationships. We demonstrate improved performance compared to local training, FedAvg, and several PFL methods across three large-scale clinical datasets: 10-year cardiovascular risk prediction (CPRD, 387 general practitioner practices, 258,688 patients), iron deficiency detection (INTERVAL, 4 donor centres, 31,949 blood donors), and mortality prediction (eICU, 150 hospitals, 44,842 patients). FedMAP incorporates a three-tier design that enables participation across healthcare sites with varying infrastructure and technical capabilities, from full federated training to inference-only deployment. Geographical analysis reveals substantial equity improvements, with underperforming regions achieving up to 14.3% performance gains. This framework provides the first practical pathway for large-scale healthcare FL deployment, which ensures clinical sites at all scales can benefit, equity is enhanced, and privacy is retained.
[485] Do Language Models Use Their Depth Efficiently?
Róbert Csordás, Christopher D. Manning, Christopher Potts
Main category: cs.LG
TL;DR: Deeper LLMs don’t use their additional layers for new types of computation, but rather spread existing computations more thinly across layers, explaining diminishing returns from increased depth.
Details
Motivation: To investigate whether modern deep LLMs use their increased depth efficiently to create higher-order computations or merely spread the same computations over more layers.
Method: Analyzed residual streams of Llama 3.1, Qwen 3, and OLMo 2 models; compared layer contributions; tested layer skipping effects; examined multihop tasks; trained linear maps between shallow and deep model residual streams.
Result: Layers in second half contribute much less than first half; skipping later layers has minimal impact; no evidence of depth enabling multihop composition; linear mapping shows same relative depth layers correspond best, indicating same computations spread over more layers.
Conclusion: Deeper models are not learning new kinds of computation with additional depth, but only making more fine-grained adjustments to the residual stream, explaining diminishing returns from scaling Transformer architectures.
Abstract: Modern LLMs are increasingly deep, and depth correlates with performance, albeit with diminishing returns. However, do these models use their depth efficiently? Do they compose more features to create higher-order computations that are impossible in shallow models, or do they merely spread the same kinds of computation out over more layers? To address these questions, we analyze the residual stream of the Llama 3.1, Qwen 3, and OLMo 2 family of models. We find: First, comparing the output of the sublayers to the residual stream reveals that layers in the second half contribute much less than those in the first half, with a clear phase transition between the two halves. Second, skipping layers in the second half has a much smaller effect on future computations and output predictions. Third, for multihop tasks, we are unable to find evidence that models are using increased depth to compose subresults in examples involving many hops. Fourth, we seek to directly address whether deeper models are using their additional layers to perform new kinds of computation. To do this, we train linear maps from the residual stream of a shallow model to a deeper one. We find that layers with the same relative depth map best to each other, suggesting that the larger model simply spreads the same computations out over its many layers. All this evidence suggests that deeper models are not using their depth to learn new kinds of computation, but only using the greater depth to perform more fine-grained adjustments to the residual. This may help explain why increasing scale leads to diminishing returns for stacked Transformer architectures.
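The residual-stream comparison in the fourth experiment can be illustrated with a small sketch: fit a least-squares linear map between cached activations of two models and use R² as a crude correspondence score. The toy data below are random stand-ins; the point being probed is only that best-matching layer pairs should line up by relative depth.

```python
import numpy as np

def layer_map_r2(h_src, h_tgt):
    """R^2 of a least-squares linear map from one residual stream to another.

    h_src: (n_tokens, d_src) activations from a layer of the shallower model
    h_tgt: (n_tokens, d_tgt) activations from a layer of the deeper model
    """
    W, *_ = np.linalg.lstsq(h_src, h_tgt, rcond=None)
    pred = h_src @ W
    ss_res = ((h_tgt - pred) ** 2).sum()
    ss_tot = ((h_tgt - h_tgt.mean(axis=0)) ** 2).sum()
    return 1.0 - ss_res / ss_tot

# Toy stand-ins for cached activations on the same token batch.
rng = np.random.default_rng(0)
acts_shallow = [rng.standard_normal((512, 64)) for _ in range(8)]   # 8-layer model
acts_deep = [rng.standard_normal((512, 96)) for _ in range(16)]     # 16-layer model

# If the deeper model spreads the same computation over more layers, the
# best-matching pairs (i, j) should roughly satisfy i / 8 ~ j / 16.
scores = np.array([[layer_map_r2(hs, hd) for hd in acts_deep] for hs in acts_shallow])
best_match = scores.argmax(axis=1)   # deep layer best predicted by each shallow layer
print(best_match)
```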
[486] TIDMAD: Time Series Dataset for Discovering Dark Matter with AI Denoising
J. T. Fry, Xinyi Hope Fu, Zhenghao Fu, Kaliroe M. W. Pappas, Lindley Winslow, Aobo Li
Main category: cs.LG
TL;DR: The TIDMAD data release from the ABRACADABRA experiment provides ultra-long time-series data, denoising scores, and analysis framework to enable AI algorithms to search for dark matter signals in the form of sinusoidal oscillations.
Details
Motivation: Dark matter constitutes 85% of total matter but has never been directly observed. A detection would be a Nobel-Prize-level breakthrough. The ABRACADABRA experiment was designed specifically to search for dark matter.
Method: The experiment generates ultra-long time-series data at 10 million samples per second. The data release includes training, validation, and science subsets, denoising scores for benchmarking, and a complete analysis framework.
Result: While ABRACADABRA has not yet discovered dark matter, it has produced several widely-endorsed dark matter search results. The TIDMAD release enables AI algorithms to extract potential dark matter signals.
Conclusion: This comprehensive data release advances fundamental science by enabling core AI algorithms to extract dark matter signals and produce real physics results suitable for publication.
Abstract: Dark matter makes up approximately 85% of total matter in our universe, yet it has never been directly observed in any laboratory on Earth. The origin of dark matter is one of the most important questions in contemporary physics, and a convincing detection of dark matter would be a Nobel-Prize-level breakthrough in fundamental science. The ABRACADABRA experiment was specifically designed to search for dark matter. Although it has not yet made a discovery, ABRACADABRA has produced several dark matter search results widely endorsed by the physics community. The experiment generates ultra-long time-series data at a rate of 10 million samples per second, where the dark matter signal would manifest itself as a sinusoidal oscillation mode within the ultra-long time series. In this paper, we present the TIDMAD – a comprehensive data release from the ABRACADABRA experiment including three key components: an ultra-long time series dataset divided into training, validation, and science subsets; a carefully-designed denoising score for direct model benchmarking; and a complete analysis framework which produces a community-standard dark matter search result suitable for publication as a physics paper. This data release enables core AI algorithms to extract the dark matter signal and produce real physics results thereby advancing fundamental science. The data downloading and associated analysis scripts are available at https://github.com/jessicafry/TIDMAD
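As a rough illustration of the signal structure described above (a faint sinusoidal mode buried in an ultra-long time series), the sketch below runs a simple periodogram search on synthetic data. The sample rate matches the stated 10 MS/s, but the signal amplitude, frequency, and noise level are arbitrary; this is not the release's denoising or analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 10_000_000                       # 10 MS/s, as stated for ABRACADABRA
n = 1_000_000
t = np.arange(n) / fs
# Hypothetical faint line at 120 kHz buried in unit-variance noise.
x = 0.05 * np.sin(2 * np.pi * 1.2e5 * t) + rng.normal(0.0, 1.0, n)

# Periodogram: a dark-matter-like signal would appear as a narrow excess at
# an a priori unknown frequency.
freqs = np.fft.rfftfreq(n, d=1 / fs)
psd = np.abs(np.fft.rfft(x)) ** 2 / n
candidate = freqs[1:][np.argmax(psd[1:])]     # skip the DC bin
print(f"strongest candidate line near {candidate:.0f} Hz")
```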
[487] STree: Speculative Tree Decoding for Hybrid State-Space Models
Yangchao Wu, Zongyue Qin, Alex Wong, Stefano Soatto
Main category: cs.LG
TL;DR: The paper proposes the first scalable algorithm for tree-based speculative decoding in state-space models (SSMs) and hybrid SSM-Transformer architectures, improving inference efficiency by leveraging accumulated state transition matrices.
Details
Motivation: While SSMs are already more efficient than AR Transformers, existing speculative decoding approaches for SSMs don't leverage tree-based verification methods due to computational challenges in efficiently computing token trees.
Method: The authors exploit the structure of accumulated state transition matrices to enable tree-based speculative decoding with minimal overhead, and provide a hardware-aware implementation that improves upon naive AR Transformer methods applied to SSMs.
Result: The proposed method outperforms vanilla speculative decoding with SSMs on three different benchmarks, even with a baseline drafting model and tree structure.
Conclusion: This work opens up opportunities for further speed improvements in SSM and hybrid model inference through efficient tree-based speculative decoding.
Abstract: Speculative decoding is a technique to leverage hardware concurrency in order to enable multiple steps of token generation in a single forward pass, thus improving the efficiency of large-scale autoregressive (AR) Transformer models. State-space models (SSMs) are already more efficient than AR Transformers, since their state summarizes all past data with no need to cache or re-process tokens in the sliding window context. However, their state can also comprise thousands of tokens; so, speculative decoding has recently been extended to SSMs. Existing approaches, however, do not leverage the tree-based verification methods, since current SSMs lack the means to compute a token tree efficiently. We propose the first scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. We exploit the structure of accumulated state transition matrices to facilitate tree-based speculative decoding with minimal overhead relative to current SSM implementations. Along with the algorithm, we describe a hardware-aware implementation that improves naive application of AR Transformer tree-based speculative decoding methods to SSMs. Furthermore, we outperform vanilla speculative decoding with SSMs even with a baseline drafting model and tree structure on three different benchmarks, opening up opportunities for further speed up with SSM and hybrid model inference. Code can be found at: https://github.com/wyc1997/stree.
[488] DeltaPhi: Physical States Residual Learning for Neural Operators in Data-Limited PDE Solving
Xihang Yue, Yi Yang, Linchao Zhu
Main category: cs.LG
TL;DR: DeltaPhi is a novel framework that transforms PDE solving from learning direct input-output mappings to learning residuals between similar physical states, providing implicit data augmentation and improving neural operator performance in data-limited scenarios.
Details
Motivation: Limited availability of high-quality training data poses a major obstacle in data-driven PDE solving, where expensive data collection and resolution constraints severely impact neural operator networks' ability to learn and generalize physical systems.
Method: DeltaPhi reformulates PDE solving to learn residuals between similar physical states rather than direct input-output mappings, exploiting the inherent stability of physical systems where closer initial states lead to closer evolution trajectories. It is architecture-agnostic and can be integrated with existing neural operators.
Result: Extensive experiments demonstrate consistent and significant improvements across diverse physical systems including regular and irregular domains, different neural architectures, multiple training data amounts, and cross-resolution scenarios.
Conclusion: DeltaPhi is an effective general enhancement for neural operators in data-limited PDE solving, providing robust performance improvements across various conditions and architectures.
Abstract: The limited availability of high-quality training data poses a major obstacle in data-driven PDE solving, where expensive data collection and resolution constraints severely impact the ability of neural operator networks to learn and generalize the underlying physical system. To address this challenge, we propose DeltaPhi, a novel learning framework that transforms the PDE solving task from learning direct input-output mappings to learning the residuals between similar physical states, a fundamentally different approach to neural operator learning. This reformulation provides implicit data augmentation by exploiting the inherent stability of physical systems where closer initial states lead to closer evolution trajectories. DeltaPhi is architecture-agnostic and can be seamlessly integrated with existing neural operators to enhance their performance. Extensive experiments demonstrate consistent and significant improvements across diverse physical systems including regular and irregular domains, different neural architectures, multiple training data amount, and cross-resolution scenarios, confirming its effectiveness as a general enhancement for neural operators in data-limited PDE solving.
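A minimal sketch of the residual reformulation as read from the abstract: the operator receives a nearby reference pair (a_ref, u_ref) and predicts the residual u − u_ref rather than u directly. The backbone, field shapes, and training data below are placeholders; any neural operator could stand in.

```python
import torch
import torch.nn as nn

class ResidualOperator(nn.Module):
    """Wraps any backbone so it predicts a residual on top of a reference solution."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # maps concatenated fields to a residual field

    def forward(self, a, a_ref, u_ref):
        delta = self.backbone(torch.cat([a, a_ref, u_ref], dim=1))
        return u_ref + delta      # predicted solution for input a

# Hypothetical 1D fields, channels-first; a tiny conv net stands in for a neural operator.
backbone = nn.Sequential(nn.Conv1d(3, 64, 5, padding=2), nn.GELU(),
                         nn.Conv1d(64, 1, 5, padding=2))
model = ResidualOperator(backbone)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

a, a_ref, u_ref, u = (torch.randn(8, 1, 128) for _ in range(4))  # dummy batch
loss = nn.functional.mse_loss(model(a, a_ref, u_ref), u)
opt.zero_grad(); loss.backward(); opt.step()
```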
[489] MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Main category: cs.LG
TL;DR: MixAT is a novel adversarial training method that combines discrete and continuous attacks to improve LLM robustness against harmful content generation, achieving better defense (ALO-ASR < 20%) with minimal computational overhead.
Details
Motivation: Current adversarial attacks can still force harmful generations from LLMs, and existing adversarial training methods either rely on computationally expensive discrete attacks or continuous relaxations that don't capture the full vulnerability spectrum.
Method: MixAT combines stronger discrete attacks and faster continuous attacks during training, using the ALO-ASR metric to measure worst-case vulnerability.
Result: MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%) while maintaining runtime comparable to continuous methods.
Conclusion: MixAT’s discrete-continuous defense offers superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs.
Abstract: Despite recent efforts in Large Language Model (LLM) safety and alignment, current adversarial attacks on frontier LLMs can still consistently force harmful generations. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. At the same time, despite their effectiveness and generalization capabilities, training with continuous perturbations does not always capture the full spectrum of vulnerabilities exploited by discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT’s discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.
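The ALO-ASR metric itself is simple to state in code: a prompt counts against the model if at least one attack in the suite succeeds. A minimal sketch (the boolean attack-outcome matrix is hypothetical):

```python
# results[i][j] = True if attack j elicited a harmful generation on prompt i.
def alo_asr(results):
    n_prompts = len(results)
    broken = sum(1 for per_prompt in results if any(per_prompt))
    return broken / n_prompts

# Example: 3 attacks evaluated on 4 prompts.
results = [
    [False, False, True],   # prompt 1: broken by attack 3
    [False, False, False],  # prompt 2: robust to all attacks
    [True,  True,  False],  # prompt 3: broken
    [False, False, False],  # prompt 4: robust
]
print(alo_asr(results))  # 0.5 -> worst-case vulnerability across the attack suite
```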
[490] Adaptive Anomaly Detection in Network Flows with Low-Rank Tensor Decompositions and Deep Unrolling
Lukas Schynol, Marius Pesavento
Main category: cs.LG
TL;DR: This paper proposes a deep unrolling approach for anomaly detection in network flows using tensor decomposition, addressing challenges of data efficiency, domain adaptation, and interpretability.
Details
Motivation: Deep learning shows state-of-the-art anomaly detection performance but faces concerns about training data efficiency, domain adaptation, and interpretability in critical communication systems.
Method: Proposes a block-successive convex approximation algorithm using tensor decomposition (normal flows as low-rank, anomalies as sparse), applies deep unrolling to create a network architecture with learnable regularization parameters, and extends with Bayesian-inspired online adaptation.
Result: Extensive experiments show the proposed architecture exhibits high training data efficiency, outperforms reference methods, and adapts seamlessly to varying network topologies.
Conclusion: The deep unrolling approach successfully addresses key challenges in anomaly detection for communication systems while maintaining performance and adaptability.
Abstract: Anomaly detection (AD) is increasingly recognized as a key component for ensuring the resilience of future communication systems. While deep learning has shown state-of-the-art AD performance, its application in critical systems is hindered by concerns regarding training data efficiency, domain adaptation and interpretability. This work considers AD in network flows using incomplete measurements, leveraging a robust tensor decomposition approach and deep unrolling techniques to address these challenges. We first propose a novel block-successive convex approximation algorithm based on a regularized model-fitting objective where the normal flows are modeled as low-rank tensors and anomalies as sparse. An augmentation of the objective is introduced to decrease the computational cost. We apply deep unrolling to derive a novel deep network architecture based on our proposed algorithm, treating the regularization parameters as learnable weights. Inspired by Bayesian approaches, we extend the model architecture to perform online adaptation to per-flow and per-time-step statistics, improving AD performance while maintaining a low parameter count and preserving the problem’s permutation equivariances. To optimize the deep network weights for detection performance, we employ a homotopy optimization approach based on an efficient approximation of the area under the receiver operating characteristic curve. Extensive experiments on synthetic and real-world data demonstrate that our proposed deep network architecture exhibits a high training data efficiency, outperforms reference methods, and adapts seamlessly to varying network topologies.
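To illustrate the deep-unrolling idea in a few lines: each network layer is one iteration of a proximal-gradient-style update whose regularization weight (here a soft-threshold) is a learnable parameter. This sketch only models the sparse-anomaly part with incomplete observations and omits the low-rank tensor factors and Bayesian online adaptation of the actual method.

```python
import torch
import torch.nn as nn

class UnrolledSparseStep(nn.Module):
    """A stack of unrolled proximal-gradient iterations with learnable thresholds."""
    def __init__(self, n_layers=5, step=0.5):
        super().__init__()
        self.thresholds = nn.Parameter(torch.full((n_layers,), 0.1))
        self.step = step

    def forward(self, y, mask):
        # y: observed flows (zeros where missing), mask: 1 where observed.
        a = torch.zeros_like(y)                       # sparse anomaly estimate
        for lam in self.thresholds:
            grad = mask * (a - y)                     # gradient of the data-fit term
            z = a - self.step * grad
            a = torch.sign(z) * torch.relu(z.abs() - lam)  # soft-thresholding prox
        return a

y = torch.randn(32, 64)
mask = (torch.rand_like(y) > 0.3).float()             # ~70% of entries observed
anomalies = UnrolledSparseStep()(y, mask)
```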
[491] GraSS: Scalable Data Attribution with Gradient Sparsification and Sparse Projection
Pingbang Hu, Joseph Melkonian, Weijing Tang, Han Zhao, Jiaqi W. Ma
Main category: cs.LG
TL;DR: GraSS is a gradient compression algorithm that leverages the sparsity of per-sample gradients to achieve sub-linear space and time complexity for data attribution methods like influence functions, with FactGraSS variant providing up to 165% faster throughput on billion-scale models.
Details
Motivation: Gradient-based data attribution methods are computationally expensive due to high memory and computational costs of per-sample gradient computation, limiting their scalability.
Method: Proposed GraSS gradient compression algorithm and its variant FactGraSS for linear layers, which exploit the inherent sparsity of per-sample gradients to achieve sub-linear complexity.
Result: Extensive experiments show substantial speedups while preserving data influence fidelity, with FactGraSS achieving up to 165% faster throughput on billion-scale models compared to state-of-the-art baselines.
Conclusion: GraSS and FactGraSS provide efficient gradient compression for scalable data attribution without requiring model retraining, with publicly available code.
Abstract: Gradient-based data attribution methods, such as influence functions, are critical for understanding the impact of individual training samples without requiring repeated model retraining. However, their scalability is often limited by the high computational and memory costs associated with per-sample gradient computation. In this work, we propose GraSS, a novel gradient compression algorithm and its variants FactGraSS for linear layers specifically, that explicitly leverage the inherent sparsity of per-sample gradients to achieve sub-linear space and time complexity. Extensive experiments demonstrate the effectiveness of our approach, achieving substantial speedups while preserving data influence fidelity. In particular, FactGraSS achieves up to 165% faster throughput on billion-scale models compared to the previous state-of-the-art baselines. Our code is publicly available at https://github.com/TRAIS-Lab/GraSS.
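A minimal sketch of the general idea, gradient sparsification followed by a sparse projection, using a top-k filter and a hash-style signed projection. Sizes and the specific projection are illustrative assumptions, not the paper's exact construction (and FactGraSS's factorized treatment of linear layers is not shown).

```python
import numpy as np

def sparsify_topk(grad, k):
    idx = np.argpartition(np.abs(grad), -k)[-k:]    # indices of the k largest |g_i|
    return idx, grad[idx]

def sparse_project(idx, vals, d_out, d_in, seed=0):
    # Hash-style signed projection: each kept coordinate lands in one output
    # bucket with a random sign, so the cost is O(k) rather than O(d_in * d_out).
    prng = np.random.default_rng(seed)
    buckets = prng.integers(0, d_out, size=d_in)
    signs = prng.choice([-1.0, 1.0], size=d_in)
    out = np.zeros(d_out)
    np.add.at(out, buckets[idx], signs[idx] * vals)
    return out

rng = np.random.default_rng(1)
grad = rng.standard_normal(1_000_000) * (rng.random(1_000_000) < 0.01)  # sparse-ish per-sample gradient
idx, vals = sparsify_topk(grad, k=4096)
sketch = sparse_project(idx, vals, d_out=2048, d_in=grad.size)
# Influence-style scores are then approximated with inner products of such sketches.
```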
[492] RWKV-edge: Deeply Compressed RWKV for Resource-Constrained Devices
Wonkyo Choe, Yangfeng Ji, Felix Xiaozhu Lin
Main category: cs.LG
TL;DR: Proposes compression techniques for RWKV LLMs that reduce memory footprint by 3.4x-5x with minimal accuracy loss, making them 4x more memory-efficient than comparable transformer LLMs.
Details
Motivation: To deploy LLMs on resource-constrained platforms like mobile robots and smartphones, current RWKV models still have high parameter counts despite their computational efficiency, limiting deployment.
Method: A suite of compression techniques including model architecture optimizations and post-training compression specifically tailored to the RWKV architecture.
Result: Memory footprint reduced by 3.4x-5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, models require 4x less memory footprint.
Conclusion: The proposed compression techniques successfully enable efficient deployment of RWKV LLMs on resource-constrained platforms while maintaining competitive performance.
Abstract: To deploy LLMs on resource-constrained platforms such as mobile robots and smartphones, non-transformer LLMs have achieved major breakthroughs. Recently, a novel RNN-based LLM family, Receptance Weighted Key Value (RWKV), has shown strong computational efficiency; nevertheless, RWKV models still have high parameter counts, which limits their deployment. In this paper, we propose a suite of compression techniques, ranging from model architecture optimizations to post-training compression, tailored to the RWKV architecture. Combined, our techniques reduce the memory footprint of RWKV models by 3.4x – 5x with only negligible degradation in accuracy; compared to transformer LLMs with similar accuracy, our models require a 4x smaller memory footprint.
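The summary does not spell out the individual techniques, so as a generic example of one post-training compression step that such a suite could include (not necessarily what the authors use), here is symmetric per-channel int8 weight quantization:

```python
import numpy as np

def quantize_int8(W):
    # Symmetric per-output-channel quantization to int8.
    scale = np.abs(W).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

W = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_int8(W)
err = np.abs(W - dequantize(q, s)).mean()
print(f"int8 storage: {q.nbytes / W.nbytes:.2f}x of fp32, mean abs error {err:.4f}")
```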
[493] FALCON: An ML Framework for Fully Automated Layout-Constrained Analog Circuit Design
Asal Mehradfar, Xuzhe Zhao, Yilun Huang, Emir Ceyani, Yankai Yang, Shihao Han, Hamidreza Aghasi, Salman Avestimehr
Main category: cs.LG
TL;DR: FALCON is a unified ML framework for automated analog circuit synthesis that performs topology selection and layout-constrained optimization to design circuits from performance specifications.
Details
Motivation: Analog circuit design is complex and multi-stage, requiring topology selection, parameter inference, and layout feasibility assessment. Current methods lack automation and integration across these stages.
Method: Uses performance-driven classifier for topology selection, edge-centric GNN for performance prediction, gradient-based parameter inference through learned forward model, and differentiable layout cost with design rule constraints.
Result: Achieved >99% topology inference accuracy, <10% relative performance prediction error, and layout-aware design completion in under 1 second per instance on 1M circuit dataset across 20 topologies.
Conclusion: FALCON serves as a practical and extensible foundation model for end-to-end analog circuit design automation, demonstrating high accuracy and efficiency in automated synthesis.
Abstract: Designing analog circuits from performance specifications is a complex, multi-stage process encompassing topology selection, parameter inference, and layout feasibility. We introduce FALCON, a unified machine learning framework that enables fully automated, specification-driven analog circuit synthesis through topology selection and layout-constrained optimization. Given a target performance, FALCON first selects an appropriate circuit topology using a performance-driven classifier guided by human design heuristics. Next, it employs a custom, edge-centric graph neural network trained to map circuit topology and parameters to performance, enabling gradient-based parameter inference through the learned forward model. This inference is guided by a differentiable layout cost, derived from analytical equations capturing parasitic and frequency-dependent effects, and constrained by design rules. We train and evaluate FALCON on a large-scale custom dataset of 1M analog mm-wave circuits, generated and simulated using Cadence Spectre across 20 expert-designed topologies. Through this evaluation, FALCON demonstrates >99% accuracy in topology inference, <10% relative error in performance prediction, and efficient layout-aware design that completes in under 1 second per instance. Together, these results position FALCON as a practical and extensible foundation model for end-to-end analog circuit design automation.
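The gradient-based parameter inference loop can be sketched compactly: freeze the learned forward model, then optimize circuit parameters against the target specification plus a differentiable layout penalty. The surrogate network, layout cost, and dimensions below are stand-ins, not FALCON's actual components.

```python
import torch

# Frozen stand-in for the trained forward model f(params) -> predicted performance.
forward_model = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.Tanh(),
                                    torch.nn.Linear(64, 4))
for p in forward_model.parameters():
    p.requires_grad_(False)

def layout_cost(params):
    # Placeholder for the analytical parasitic / design-rule penalty.
    return 0.01 * (params ** 2).sum()

target_spec = torch.tensor([1.0, 0.5, -0.2, 0.3])   # hypothetical performance target
params = torch.zeros(8, requires_grad=True)
opt = torch.optim.Adam([params], lr=0.05)

for _ in range(200):
    pred = forward_model(params)
    loss = torch.nn.functional.mse_loss(pred, target_spec) + layout_cost(params)
    opt.zero_grad(); loss.backward(); opt.step()
```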
[494] Advancing Compositional Awareness in CLIP with Efficient Fine-Tuning
Amit Peleg, Naman Deep Singh, Matthias Hein
Main category: cs.LG
TL;DR: CLIC is a fine-tuning method that improves CLIP models’ compositional reasoning by combining multiple images and captions during training, enhancing both lexical and semantic understanding while boosting retrieval performance.
Details
Motivation: Vision-language models like CLIP struggle with compositional reasoning, and previous improvement methods mainly enhanced lexical sensitivity while neglecting semantic understanding and often degraded retrieval performance.
Method: CLIC uses a novel training technique that combines multiple images and their associated captions during fine-tuning to improve compositionality.
Result: CLIC improves compositionality across different architectures and pre-trained CLIP models, achieving gains in both lexical/semantic understanding and retrieval performance, including on state-of-the-art CLIPS models.
Conclusion: CLIC enables short fine-tuning that leads to improved retrieval and creates the best compositional CLIP model on the SugarCrepe++ benchmark.
Abstract: Vision-language models like CLIP have demonstrated remarkable zero-shot capabilities in classification and retrieval. However, these models often struggle with compositional reasoning - the ability to understand the relationships between concepts. A recent benchmark, SugarCrepe++, reveals that previous works on improving compositionality have mainly improved lexical sensitivity but neglected semantic understanding. In addition, downstream retrieval performance often deteriorates, although one would expect that improving compositionality should enhance retrieval. In this work, we introduce CLIC (Compositionally-aware Learning in CLIP), a fine-tuning method based on a novel training technique combining multiple images and their associated captions. CLIC improves compositionality across architectures as well as differently pre-trained CLIP models, both in terms of lexical and semantic understanding, and achieves consistent gains in retrieval performance. This even applies to the recent CLIPS, which achieves SOTA retrieval performance. Nevertheless, the short fine-tuning with CLIC leads to an improvement in retrieval and to the best compositional CLIP model on SugarCrepe++. All our models and code are available at https://clic-compositional-clip.github.io
[495] Geometry matters: insights into representational alignment through Ollivier-Ricci Curvature and Ricci Flow
Nahid Torbati, Michael Gaebler, Simon M. Hofmann, Nico Scherf
Main category: cs.LG
TL;DR: A geometric framework using Ollivier Ricci Curvature and Ricci Flow reveals fine-grained local structure in representations, showing discrepancies between human similarity judgments and neural network embeddings that traditional RSA misses.
Details
Motivation: Traditional representational similarity analysis (RSA) can be misleading without considering underlying representational geometry, as it may miss important geometric inconsistencies in alignment between humans and models.
Method: Introduced a framework using Ollivier Ricci Curvature and Ricci Flow to analyze local geometric structure of representations, applied to compare human similarity judgments for 2D/3D face stimuli with VGG-Face and its human-aligned variant.
Result: Revealed geometric inconsistencies in alignment when moving from 2D to 3D viewing conditions, showing geometry-aware analysis provides more sensitive characterization of discrepancies than traditional RSA.
Conclusion: Incorporating geometric information exposes alignment differences missed by traditional metrics, offering deeper insight into representational organization and highlighting limitations of standard RSA approaches.
Abstract: Representational similarity analysis (RSA) is widely used to analyze the alignment between humans and neural networks; however, conclusions based on this approach can be misleading without considering the underlying representational geometry. Our work introduces a framework using Ollivier Ricci Curvature and Ricci Flow to analyze the fine-grained local structure of representations. This approach is agnostic to the source of the representational space, enabling a direct geometric comparison between human behavioral judgments and a model’s vector embeddings. We apply it to compare human similarity judgments for 2D and 3D face stimuli with a baseline 2D native network (VGG-Face) and a variant of it aligned to human behavior. Our results suggest that geometry-aware analysis provides a more sensitive characterization of discrepancies and geometric dissimilarities in the underlying representations that remain only partially captured by RSA. Notably, we reveal geometric inconsistencies in the alignment when moving from 2D to 3D viewing conditions. This highlights how incorporating geometric information can expose alignment differences missed by traditional metrics, offering deeper insight into representational organization.
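For readers unfamiliar with the core quantity, the sketch below computes the Ollivier-Ricci curvature of one edge of a toy graph: one minus the Wasserstein-1 distance between lazy random-walk distributions at the endpoints, divided by the edge distance. It assumes the POT package (`ot`) for the transport computation; the laziness parameter and graph are illustrative, and the paper's metric-learning and Ricci-flow machinery are not shown.

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.sparse.csgraph import shortest_path

def ollivier_ricci_edge(A, D, x, y, alpha=0.5):
    """Curvature of edge (x, y) on a graph with adjacency A and distance matrix D."""
    n = A.shape[0]
    def walk_dist(v):
        mu = np.zeros(n)
        nbrs = np.flatnonzero(A[v])
        mu[v] = alpha                        # laziness: stay with prob. alpha
        mu[nbrs] = (1 - alpha) / len(nbrs)   # otherwise move to a uniform neighbor
        return mu
    w1 = ot.emd2(walk_dist(x), walk_dist(y), D)   # Wasserstein-1 under graph distance
    return 1.0 - w1 / D[x, y]

# Toy graph: an unweighted 5-cycle; curvature of edge (0, 1).
A = np.zeros((5, 5))
for i in range(5):
    A[i, (i + 1) % 5] = A[(i + 1) % 5, i] = 1
D = shortest_path(A, unweighted=True)
print(ollivier_ricci_edge(A, D, 0, 1))
```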
[496] REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving
Sujun Tang, Christopher Priebe, Rohan Mahapatra, Lianhui Qin, Hadi Esmaeilzadeh
Main category: cs.LG
TL;DR: A novel compiler framework called Reasoning Compiler that uses large language models (LLMs) and Monte Carlo tree search (MCTS) to optimize neural workloads more efficiently than existing compilers.
Details
Motivation: High cost of serving large-scale models is a barrier to accessibility and innovation. Existing compilers struggle with neural workloads due to large interdependent transformation spaces, and stochastic search techniques are sample-inefficient and lack context awareness.
Method: Formulates optimization as a sequential, context-aware decision process using LLMs as proposal mechanisms for hardware-informed transformations, combined with structured MCTS to balance exploration and exploitation in the compiler optimization space.
Result: Achieves substantial speedups with significantly fewer samples than leading neural compilers.
Conclusion: Demonstrates the potential of LLM-guided reasoning to transform compiler optimization by leveraging context-aware decision making and improving sample efficiency.
Abstract: While model serving has unlocked unprecedented capabilities, the high cost of serving large-scale models continues to be a significant barrier to widespread accessibility and rapid innovation. Compiler optimizations have long driven substantial performance improvements, but existing compilers struggle with neural workloads due to the exponentially large and highly interdependent space of possible transformations. Although existing stochastic search techniques can be effective, they are often sample-inefficient and fail to leverage the structural context underlying compilation decisions. We set out to investigate the research question of whether reasoning with large language models (LLMs), without any retraining, can leverage the context-aware decision space of compiler optimizations to significantly improve sample efficiency. To that end, we introduce a novel compilation framework (dubbed Reasoning Compiler) that formulates optimization as a sequential, context-aware decision process guided by a large language model and structured Monte Carlo tree search (MCTS). The LLM acts as a proposal mechanism, suggesting hardware-informed transformations that reflect the current program state and accumulated performance feedback. MCTS incorporates the LLM-generated proposals to balance exploration and exploitation, facilitating structured, context-sensitive traversal of the expansive compiler optimization space. By achieving substantial speedups with markedly fewer samples than leading neural compilers, our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.
[497] Physics-Informed Latent Neural Operator for Real-time Predictions of time-dependent parametric PDEs
Sharmila Karumuri, Lori Graham-Brady, Somdatta Goswami
Main category: cs.LG
TL;DR: PI-Latent-NO is a physics-informed latent neural operator that combines two coupled DeepONets to efficiently learn parametric PDE solutions without labeled data, achieving better scalability and physics consistency than data-driven approaches.
Details
Motivation: Traditional DeepONets become overparameterized for high-dimensional PDE problems, while Latent DeepONet lacks physics incorporation. There's a need for efficient, physics-consistent operator learning that works in data-scarce settings.
Method: Two coupled DeepONets trained end-to-end: Latent-DeepONet learns low-dimensional solution representations, and Reconstruction-DeepONet maps back to physical space. PDE constraints are embedded via automatic differentiation.
Result: The framework is memory and compute-efficient with near-constant scaling, shows significant speedups over traditional physics-informed models, and works without labeled training data.
Conclusion: PI-Latent-NO provides an accurate, scalable framework for real-time prediction in complex physical systems by integrating physics directly into latent operator learning.
Abstract: Deep operator network (DeepONet) has shown significant promise as surrogate models for systems governed by partial differential equations (PDEs), enabling accurate mappings between infinite-dimensional function spaces. However, when applied to systems with high-dimensional input-output mappings arising from large numbers of spatial and temporal collocation points, these models often require heavily overparameterized networks, leading to long training times. Latent DeepONet addresses some of these challenges by introducing a two-step approach: first learning a reduced latent space using a separate model, followed by operator learning within this latent space. While efficient, this method is inherently data-driven and lacks mechanisms for incorporating physical laws, limiting its robustness and generalizability in data-scarce settings. In this work, we propose PI-Latent-NO, a physics-informed latent neural operator framework that integrates governing physics directly into the learning process. Our architecture features two coupled DeepONets trained end-to-end: a Latent-DeepONet that learns a low-dimensional representation of the solution, and a Reconstruction-DeepONet that maps this latent representation back to the physical space. By embedding PDE constraints into the training via automatic differentiation, our method eliminates the need for labeled training data and ensures physics-consistent predictions. The proposed framework is both memory and compute-efficient, exhibiting near-constant scaling with problem size and demonstrating significant speedups over traditional physics-informed operator models. We validate our approach on a range of parametric PDEs, showcasing its accuracy, scalability, and suitability for real-time prediction in complex physical systems.
[498] Data Leakage and Deceptive Performance: A Critical Examination of Credit Card Fraud Detection Methodologies
Mohammed Hilal Al-Kharusi, Khizar Hayat, Khalil Bader Al Ruqeishi, Haroon Rashid Lone
Main category: cs.LG
TL;DR: Current automated Quranic recitation systems using ASR are flawed due to focus on word identification over qualitative acoustic evaluation. The paper proposes a shift to knowledge-based computational frameworks using rule-based acoustic modeling of Tajweed rules.
Details
Motivation: Existing automated systems for Quranic recitation evaluation struggle with educational effectiveness and broad acceptance, facing limitations like biased datasets and inability to provide meaningful feedback.
Method: Literature review analyzing scholarly research, digital platforms, and commercial tools from the past 20 years, advocating for rule-based acoustic modeling using canonical pronunciation principles and articulation points (Makhraj).
Result: Analysis reveals fundamental flaws in current ASR-based approaches and proposes a paradigm shift toward knowledge-based computational frameworks that leverage the unchanging nature of Quranic text and well-defined Tajweed rules.
Conclusion: Future automated Quranic recitation assessment requires hybrid systems combining linguistic expertise with advanced audio processing to create reliable, fair, and pedagogically effective tools for global learners.
Abstract: The art and science of Quranic recitation (Tajweed), a discipline governed by meticulous phonetic, rhythmic, and theological principles, confronts substantial educational challenges in today’s digital age. Although modern technology offers unparalleled opportunities for learning, existing automated systems for evaluating recitation have struggled to gain broad acceptance or demonstrate educational effectiveness. This literature review examines this crucial disparity, offering a thorough analysis of scholarly research, digital platforms, and commercial tools developed over the past twenty years. Our analysis uncovers a fundamental flaw in current approaches that adapt Automatic Speech Recognition (ASR) systems, which emphasize word identification over qualitative acoustic evaluation. These systems suffer from limitations such as reliance on biased datasets, demographic disparities, and an inability to deliver meaningful feedback for improvement. Challenging these data-centric methodologies, we advocate for a paradigm shift toward a knowledge-based computational framework. By leveraging the unchanging nature of the Quranic text and the well-defined rules of Tajweed, we propose that an effective evaluation system should be built upon rule-based acoustic modeling centered on canonical pronunciation principles and articulation points (Makhraj), rather than depending on statistical patterns derived from flawed or biased data. The review concludes that the future of automated Quranic recitation assessment lies in hybrid systems that combine linguistic expertise with advanced audio processing. Such an approach paves the way for developing reliable, fair, and pedagogically effective tools that can authentically assist learners across the globe.
[499] Selecting Critical Scenarios of DER Adoption in Distribution Grids Using Bayesian Optimization
Olivier Mulkin, Miguel Heleno, Mike Ludkovski
Main category: cs.LG
TL;DR: A new Bayesian Optimization framework for efficiently identifying critical DER adoption scenarios that could cause voltage and line flow violations in distribution grids.
Details
Motivation: Current utility planning relies on deterministic or ad hoc scenario selection for anticipating grid risks from PV adoption, lacking systematic methods to identify the most critical scenarios.
Method: Multi-objective Bayesian Optimization using Gaussian Process surrogates to approximate grid stress metrics as black-box functions, with an acquisition function based on probability of Pareto-critical scenarios across violation objectives.
Result: The approach provides statistical guarantees and achieves an order of magnitude speed-up compared to exhaustive search, demonstrating effectiveness on realistic 200-400 bus feeders.
Conclusion: The proposed framework offers an efficient and accurate methodology for identifying critical DER adoption scenarios, significantly improving utility investment planning for distribution grid reliability.
Abstract: We develop a new methodology to select scenarios of DER adoption most critical for distribution grids. Anticipating risks of future voltage and line flow violations due to additional PV adopters is central for utility investment planning but continues to rely on deterministic or ad hoc scenario selection. We propose a highly efficient search framework based on multi-objective Bayesian Optimization. We treat underlying grid stress metrics as computationally expensive black-box functions, approximated via Gaussian Process surrogates and design an acquisition function based on probability of scenarios being Pareto-critical across a collection of line- and bus-based violation objectives. Our approach provides a statistical guarantee and offers an order of magnitude speed-up relative to a conservative exhaustive search. Case studies on realistic feeders with 200-400 buses demonstrate the effectiveness and accuracy of our approach.
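A minimal sketch of the acquisition idea as described: independent GP surrogates per violation objective, and candidate scenarios scored by the Monte-Carlo probability of being Pareto-critical (non-dominated when maximizing violations). The objectives, dimensions, and vanilla scikit-learn GPs are illustrative assumptions, not the paper's exact surrogate or acquisition.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def prob_pareto_critical(gps, X_cand, n_samples=256, rng=None):
    rng = rng or np.random.default_rng(0)
    draws = []
    for gp in gps:  # sample each objective from its GP posterior (independently)
        mu, sd = gp.predict(X_cand, return_std=True)
        draws.append(mu[None, :] + sd[None, :] * rng.standard_normal((n_samples, len(mu))))
    Y = np.stack(draws, axis=-1)               # (n_samples, n_cand, n_obj)
    prob = np.zeros(X_cand.shape[0])
    for s in range(n_samples):
        y = Y[s]
        for i in range(y.shape[0]):
            # Dominated (when maximizing violations) if some scenario is >= in all
            # objectives and strictly > in at least one.
            dominated = np.any(np.all(y >= y[i], axis=1) & np.any(y > y[i], axis=1))
            prob[i] += 0.0 if dominated else 1.0
    return prob / n_samples

# Usage sketch: rows of X are evaluated DER-adoption scenarios; columns of Y are
# violation metrics (e.g. worst overvoltage, worst line overload).
X = np.random.rand(30, 5)
Y = np.column_stack([X.sum(1) + 0.1 * np.random.randn(30),
                     (X ** 2).sum(1) + 0.1 * np.random.randn(30)])
gps = [GaussianProcessRegressor().fit(X, Y[:, j]) for j in range(Y.shape[1])]
X_cand = np.random.rand(50, 5)
next_scenario = X_cand[np.argmax(prob_pareto_critical(gps, X_cand))]
```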
[500] NOBLE – Neural Operator with Biologically-informed Latent Embeddings to Capture Experimental Variability in Biological Neuron Models
Luca Ghafourpour, Valentin Duruisseaux, Bahareh Tolooshams, Philip H. Wong, Costas A. Anastassiou, Anima Anandkumar
Main category: cs.LG
TL;DR: NOBLE is a neural operator framework that learns to map interpretable neuron features to somatic voltage responses, enabling efficient generation of synthetic neurons with experimental variability and offering 4200x speedup over traditional solvers.
Details
Motivation: Current bio-realistic neuron models are constrained by limited experimental data availability and cannot account for natural variability. Deep learning approaches fail to capture the full biophysical complexity and nonlinear voltage dynamics.
Method: NOBLE uses a neural operator framework trained on synthetic data from bio-realistic models. It learns a mapping from continuous frequency-modulated embeddings of interpretable neuron features to somatic voltage responses induced by current injection.
Result: NOBLE predicts distributions of neural dynamics accounting for experimental variability, generates synthetic neurons resembling experimental data with trial-to-trial variability, and achieves 4200x speedup over numerical solvers. It successfully generalizes to real experimental data.
Conclusion: NOBLE captures fundamental neural properties in an emergent manner, enabling better understanding of cellular composition, neuromorphic architectures, large-scale brain circuits, and neuroAI applications.
Abstract: Characterizing the cellular properties of neurons is fundamental to understanding their function in the brain. In this quest, the generation of bio-realistic models is central towards integrating multimodal cellular data sets and establishing causal relationships. However, current modeling approaches remain constrained by the limited availability and intrinsic variability of experimental neuronal data. The deterministic formalism of bio-realistic models currently precludes accounting for the natural variability observed experimentally. While deep learning is becoming increasingly relevant in this space, it fails to capture the full biophysical complexity of neurons, their nonlinear voltage dynamics, and variability. To address these shortcomings, we introduce NOBLE, a neural operator framework that learns a mapping from a continuous frequency-modulated embedding of interpretable neuron features to the somatic voltage response induced by current injection. Trained on synthetic data generated from bio-realistic neuron models, NOBLE predicts distributions of neural dynamics accounting for the intrinsic experimental variability. Unlike conventional bio-realistic neuron models, interpolating within the embedding space offers models whose dynamics are consistent with experimentally observed responses. NOBLE enables the efficient generation of synthetic neurons that closely resemble experimental data and exhibit trial-to-trial variability, offering a $4200\times$ speedup over the numerical solver. NOBLE is the first scaled-up deep learning framework that validates its generalization with real experimental data. To this end, NOBLE captures fundamental neural properties in a unique and emergent manner that opens the door to a better understanding of cellular composition and computations, neuromorphic architectures, large-scale brain circuits, and general neuroAI applications.
[501] Riemannian-Geometric Fingerprints of Generative Models
Hae Jin Song, Laurent Itti
Main category: cs.LG
TL;DR: The paper proposes a Riemannian geometry-based framework for defining and analyzing generative model fingerprints, improving model attribution across diverse datasets and architectures.
Details
Motivation: Address the gap in understanding generative model fingerprints for IP protection, content source verification, and preventing model collapse from regurgitative training.
Method: Geometric approach using Riemannian geometry to define artifacts and fingerprints, learning Riemannian metrics from data, and using geodesic distances with kNN-based Riemannian center of mass.
Result: Significantly improves model attribution performance across 4 datasets, 27 model architectures, 2 resolutions, and 2 modalities, with better generalization to unseen data.
Conclusion: The Riemannian geometry framework provides a principled way to define and compute generative model fingerprints, demonstrating practical efficacy for model authentication and synthetic data detection.
Abstract: Recent breakthroughs and rapid integration of generative models (GMs) have sparked interest in the problem of model attribution and their fingerprints. For instance, service providers need reliable methods of authenticating their models to protect their IP, while users and law enforcement seek to verify the source of generated content for accountability and trust. In addition, a growing threat of model collapse is arising, as more model-generated data are being fed back into sources (e.g., YouTube) that are often harvested for training (“regurgitative training”), heightening the need to differentiate synthetic from human data. Yet, a gap still exists in understanding generative models’ fingerprints, we believe, stemming from the lack of a formal framework that can define, represent, and analyze the fingerprints in a principled way. To address this gap, we take a geometric approach and propose a new definition of artifact and fingerprint of GMs using Riemannian geometry, which allows us to leverage the rich theory of differential geometry. Our new definition generalizes previous work (Song et al., 2024) to non-Euclidean manifolds by learning Riemannian metrics from data and replacing the Euclidean distances and nearest-neighbor search with geodesic distances and kNN-based Riemannian center of mass. We apply our theory to a new gradient-based algorithm for computing the fingerprints in practice. Results show that it is more effective in distinguishing a large array of GMs, spanning across 4 different datasets in 2 different resolutions (64 by 64, 256 by 256), 27 model architectures, and 2 modalities (Vision, Vision-Language). Using our proposed definition significantly improves the performance on model attribution, as well as a generalization to unseen datasets, model types, and modalities, suggesting its practical efficacy.
[502] Learning Provably Improves the Convergence of Gradient Descent
Qingyu Song, Wei Lin, Hong Xu
Main category: cs.LG
TL;DR: This paper provides theoretical convergence guarantees for Learn to Optimize (L2O) methods that learn Gradient Descent hyperparameters for quadratic programming, addressing the lack of rigorous theoretical backing in existing L2O approaches.
Details
Motivation: L2O methods have shown empirical success but lack rigorous theoretical convergence guarantees, with existing analyses often relying on unrealistic assumptions. This work aims to bridge this theoretical gap.
Method: The authors use Neural Tangent Kernel (NTK) theory to prove training convergence of L2O models, propose a deterministic initialization strategy to support theoretical results, and mitigate gradient explosion for stable training over extended optimization horizons.
Result: The proposed L2O framework demonstrates over 50% better optimality than standard Gradient Descent and superior robustness compared to state-of-the-art L2O methods on synthetic datasets.
Conclusion: This work successfully provides theoretical convergence guarantees for L2O training while achieving practical performance improvements, bridging the gap between empirical success and theoretical understanding in L2O methods.
Abstract: Learn to Optimize (L2O) trains deep neural network-based solvers for optimization, achieving success in accelerating convex problems and improving non-convex solutions. However, L2O lacks rigorous theoretical backing for its own training convergence, as existing analyses often use unrealistic assumptions – a gap this work highlights empirically. We bridge this gap by proving the training convergence of L2O models that learn Gradient Descent (GD) hyperparameters for quadratic programming, leveraging the Neural Tangent Kernel (NTK) theory. We propose a deterministic initialization strategy to support our theoretical results and promote stable training over extended optimization horizons by mitigating gradient explosion. Our L2O framework demonstrates over 50% better optimality than GD and superior robustness over state-of-the-art L2O methods on synthetic datasets. The code of our method can be found at https://github.com/NetX-lab/MathL2OProof-Official.
[503] Mixture-of-Experts Meets In-Context Reinforcement Learning
Wenhao Wu, Fuhong Liu, Haoru Li, Zican Hu, Daoyi Dong, Chunlin Chen, Zhi Wang
Main category: cs.LG
TL;DR: T2MIR introduces a mixture-of-experts (MoE) framework for in-context reinforcement learning, using token-wise and task-wise MoE layers to handle multi-modal state-action-reward data and diverse decision tasks, with contrastive learning for improved task routing.
Details
Motivation: To address challenges in in-context RL including multi-modality of state-action-reward data and heterogeneous decision tasks, and to better harness in-context learning capabilities.Method: Replaces feedforward layers with two parallel MoE layers: token-wise MoE for multi-modal token semantics and task-wise MoE for specialized task handling, enhanced by contrastive learning for task-router mutual information maximization.
Result: T2MIR significantly improves in-context learning capacity and outperforms various baseline methods in comprehensive experiments.
Conclusion: The framework brings MoE potential to ICRL, offering a scalable architectural enhancement that advances ICRL toward achievements seen in language and vision domains.
Abstract: In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
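As a rough illustration of the parallel token-wise and task-wise MoE block, the PyTorch sketch below replaces a feedforward layer with two soft-routed MoE layers whose outputs are concatenated; the expert counts, the routing scheme, and the contrastive task-router loss are simplified placeholders rather than the T2MIR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    """Soft-routed MoE: a router weights the outputs of small FFN experts."""
    def __init__(self, dim, out_dim, n_experts):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, out_dim))
             for _ in range(n_experts)]
        )

    def forward(self, x):                          # x: (batch, seq, dim)
        gates = F.softmax(self.router(x), dim=-1)  # (batch, seq, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, out_dim, n_experts)
        return torch.einsum("bse,bsoe->bso", gates, outs)

class TokenTaskMoEBlock(nn.Module):
    """Two parallel MoE layers replacing the feedforward layer; outputs concatenated."""
    def __init__(self, dim, n_token_experts=4, n_task_experts=4):
        super().__init__()
        half = dim // 2
        self.token_moe = SimpleMoE(dim, half, n_token_experts)  # per-token semantics
        self.task_moe = SimpleMoE(dim, half, n_task_experts)    # task-level routing

    def forward(self, x):
        return torch.cat([self.token_moe(x), self.task_moe(x)], dim=-1)

x = torch.randn(2, 16, 64)
print(TokenTaskMoEBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```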
[504] GST-UNet: A Neural Framework for Spatiotemporal Causal Inference with Time-Varying Confounding
Miruna Oprescu, David K. Park, Xihaier Luo, Shinjae Yoo, Nathan Kallus
Main category: cs.LG
TL;DR: GST-UNet is a neural framework combining U-Net architecture with G-computation for spatiotemporal causal inference, addressing challenges like interference, spatial confounding, and time-varying confounding in observational data.
Details
Motivation: Existing methods for spatiotemporal causal inference either rely on strong structural assumptions or fail to handle key challenges like interference, spatial confounding, temporal carryover, and time-varying confounding where covariates are influenced by past treatments and affect future ones.Method: GST-UNet combines a U-Net-based spatiotemporal encoder with regression-based iterative G-computation to estimate location-specific potential outcomes under complex intervention sequences, explicitly adjusting for time-varying confounders and capturing non-linear spatial and temporal dependencies.
Result: The framework was validated in synthetic experiments and real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire, demonstrating effectiveness in data-scarce settings.
Conclusion: GST-UNet provides a principled and ready-to-use framework for spatiotemporal causal inference, advancing reliable estimation in policy-relevant and scientific domains.
Abstract: Estimating causal effects from spatiotemporal observational data is essential in public health, environmental science, and policy evaluation, where randomized experiments are often infeasible. Existing approaches, however, either rely on strong structural assumptions or fail to handle key challenges such as interference, spatial confounding, temporal carryover, and time-varying confounding – where covariates are influenced by past treatments and, in turn, affect future ones. We introduce GST-UNet (G-computation Spatio-Temporal UNet), a theoretically grounded neural framework that combines a U-Net-based spatiotemporal encoder with regression-based iterative G-computation to estimate location-specific potential outcomes under complex intervention sequences. GST-UNet explicitly adjusts for time-varying confounders and captures non-linear spatial and temporal dependencies, enabling valid causal inference from a single observed trajectory in data-scarce settings. We validate its effectiveness in synthetic experiments and in a real-world analysis of wildfire smoke exposure and respiratory hospitalizations during the 2018 California Camp Fire. Together, these results position GST-UNet as a principled and ready-to-use framework for spatiotemporal causal inference, advancing reliable estimation in policy-relevant and scientific domains.
[505] Learning to Coordinate with Experts
Mohamad H. Danesh, Nguyen X. Khanh, Tu Trinh, Benjamin Plaut
Main category: cs.LG
TL;DR: The paper introduces YRC-0, a problem where agents must learn to collaborate with experts in new environments without expert interaction during training, and presents YRC-Bench as a benchmark with implementations and evaluation methods.
Details
Motivation: AI agents in real-world scenarios often face challenges beyond their capabilities, requiring expert assistance for improved safety and performance, but expert consultation is costly, creating a need for learning when to consult experts.Method: The authors propose YRC-0 as a novel problem variant and develop YRC-Bench, an open-source benchmark with Gym-like API, simulated experts, evaluation pipeline, and baseline implementations, along with a validation strategy for evaluating learning methods.
Result: YRC-Bench provides a standardized framework for research on expert-leveraging agents, and the evaluation of various learning methods offers insights to guide future work in this area.
Conclusion: The paper establishes YRC-0 as a key research problem and provides YRC-Bench as a comprehensive benchmark to facilitate development of low-cost, robust approaches for training agents that can effectively leverage expert assistance in new environments.
Abstract: When deployed in the real world, AI agents will inevitably face challenges that exceed their individual capabilities. Leveraging assistance from experts, whether humans or highly capable AI systems, can significantly improve both safety and performance in such situations. Since expert assistance is costly, a central challenge is determining when to consult an expert. In this paper, we explore a novel variant of this problem, termed YRC-0, in which an agent must learn to collaborate with an expert in new environments in an unsupervised manner–that is, without interacting with the expert during training. This setting motivates the development of low-cost, robust approaches for training expert-leveraging agents. To support research in this area, we introduce YRC-Bench, an open-source benchmark that instantiates YRC-0 across diverse environments. YRC-Bench provides a standardized Gym-like API, simulated experts, an evaluation pipeline, and implementations of popular baselines. Toward tackling YRC-0, we propose a validation strategy and evaluate a range of learning methods, offering insights that can inform future research. Codebase: github.com/modanesh/YRC-Bench
[506] Inter-turbine Modelling of Wind-Farm Power using Multi-task Learning
Simon M. Brealy, Lawrence A. Bull, Pauline Beltrando, Anders Sommer, Nikolaos Dervilis, Keith Worden
Main category: cs.LG
TL;DR: A hierarchical Bayesian metamodel for wind turbine power prediction that leverages spatial correlations and multi-task learning to predict power for both observed and unobserved turbines, outperforming benchmark models.
Details
Motivation: Need to reduce operation and maintenance costs for renewable energy infrastructure through online monitoring, while addressing challenges such as limited labeled damage-state data and operational variability, together with the need for uncertainty quantification to support risk management.Method: Probabilistic regression model for wind turbine power prediction that adjusts for wake effects, extended to a hierarchical Bayesian model for multi-task learning that captures spatial correlations between turbines.
Result: The metamodel outperforms benchmark models and enables power predictions for turbines not included in training data by leveraging spatial correlations from wake effects.
Conclusion: Demonstrates an efficient data-use strategy for inference in populations of structures with correlated variables, particularly useful for wind turbine wake-effect applications.
Abstract: Because of the global need to increase power production from renewable energy resources, developments in the online monitoring of the associated infrastructure are of interest to reduce operation and maintenance costs. However, challenges exist for data-driven approaches to this problem, such as incomplete or limited histories of labelled damage-state data, operational and environmental variability, or the desire for the quantification of uncertainty to support risk management. This work first introduces a probabilistic regression model for predicting wind-turbine power, which adjusts for wake effects learnt from data. Spatial correlations in the learned model parameters for different tasks (turbines) are then leveraged in a hierarchical Bayesian model (an approach to multi-task learning) to develop a “metamodel”, which can be used to make power-predictions which adjust for turbine location - including on previously unobserved turbines not included in the training data. The results show that the metamodel is able to outperform a series of benchmark models, and demonstrates a novel strategy for making efficient use of data for inference in populations of structures, in particular where correlations exist in the variable(s) of interest (such as those from wind-turbine wake-effects).
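One generic way to write the partial-pooling structure behind such a metamodel is sketched below; here $f$ is a turbine power-curve model, $\theta_t^{(j)}$ the $j$-th parameter for turbine $t$, and the spatial kernel $k$ over turbine locations is an illustrative choice rather than the paper's exact prior.

```latex
% Illustrative hierarchical (partial-pooling) model over T turbines.
\begin{aligned}
y_{t,i} \mid \theta_t &\sim \mathcal{N}\!\big(f(x_{t,i};\,\theta_t),\ \sigma^2\big),
  && t = 1,\dots,T,\\
\big(\theta_1^{(j)}, \dots, \theta_T^{(j)}\big) &\sim \mathcal{N}\!\big(\mu_j \mathbf{1},\ K\big),
  && K_{ts} = k(\mathrm{loc}_t, \mathrm{loc}_s),\\
\mu_j,\ \sigma^2 &\sim \text{weakly informative priors.}
\end{aligned}
```

Because each parameter is pooled across turbines through $K$, a turbine absent from the training data still receives an informed prediction via its location.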
[507] PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie, Kiyoharu Aizawa
Main category: cs.LG
TL;DR: PULSE protocol introduces realistic unlearning evaluation for LMMs, focusing on pre-trained knowledge unlearning and long-term sustainability, revealing current methods struggle with pre-training knowledge and sequential unlearning.
Details
Motivation: Address the lack of practical evaluation framework for unlearning in large multimodal models, as existing benchmarks only consider single-operation fine-tuned knowledge unlearning.Method: Introduces PULSE protocol with two perspectives: Pre-trained knowledge Unlearning to analyze effects across knowledge acquisition phases, and Long-term Sustainability Evaluation for sequential unlearning requests.
Result: Current unlearning techniques can successfully unlearn fine-tuned knowledge but struggle with pre-training knowledge. Methods effective for batch unlearning show significant performance degradation when data is split and unlearned sequentially.
Conclusion: Existing unlearning methods have limitations in handling pre-trained knowledge and sequential unlearning scenarios, highlighting the need for more robust unlearning approaches for LMMs.
Abstract: In recent years, unlearning techniques, which are methods for inducing a model to “forget” previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmark for LMMs considers only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
[508] Federated Structured Sparse PCA for Anomaly Detection in IoT Networks
Chenyi Huang, Xianchao Xiu
Main category: cs.LG
TL;DR: Proposes FedSSP, a federated structured sparse PCA method for IoT anomaly detection that integrates double sparsity regularization to enhance model interpretability and detection accuracy.
Details
Motivation: Current federated PCA methods lack sparsity integration, which is critical for robust anomaly detection in IoT environments where privacy preservation is important.Method: Uses federated structured sparse PCA with double sparsity regularization: row-wise sparsity via ℓ₂,p-norm (p∈[0,1)) to remove redundant features, and element-wise sparsity via ℓq-norm (q∈[0,1)) to suppress noise. Solved using proximal alternating minimization (PAM) algorithm.
Result: Numerical experiments show that incorporating structured sparsity enhances both model interpretability and detection accuracy in IoT networks.
Conclusion: FedSSP effectively addresses the sparsity limitation in federated PCA methods and provides improved anomaly detection performance while maintaining privacy in distributed IoT environments.
Abstract: Although federated learning has gained prominence as a privacy-preserving framework tailored for distributed Internet of Things (IoT) environments, current federated principal component analysis (PCA) methods lack integration of sparsity, a critical feature for robust anomaly detection. To address this limitation, we propose a novel federated structured sparse PCA (FedSSP) approach for anomaly detection in IoT networks. The proposed model uniquely integrates double sparsity regularization: (1) row-wise sparsity governed by $\ell_{2,p}$-norm with $p\in [0,1)$ to eliminate redundant feature dimensions, and (2) element-wise sparsity via $\ell_{q}$-norm with $q\in [0,1)$ to suppress noise-sensitive components. To solve this nonconvex problem in a distributed setting, we devise an efficient optimization algorithm based on the proximal alternating minimization (PAM). Numerical experiments validate that incorporating structured sparsity enhances both model interpretability and detection accuracy. Our code is available at https://github.com/xianchaoxiu/FedSSP.
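One common way to write a structured sparse PCA objective with the double regularization described above is the following; the paper's exact model, constraints, and federated splitting may differ, so treat this as a generic sketch.

```latex
% Generic double-sparsity formulation (U: scores, V: loadings over features).
\min_{U,\,V}\;\; \|X - U V^{\top}\|_F^2
  \;+\; \lambda_1 \sum_{i} \|v_{i\cdot}\|_2^{\,p}
  \;+\; \lambda_2 \sum_{i,j} |v_{ij}|^{\,q},
  \qquad p, q \in [0,1)
```

The row-wise $\ell_{2,p}$ term prunes entire feature dimensions while the element-wise $\ell_q$ term suppresses noisy entries; PAM alternates proximal updates on the $U$ and $V$ blocks, which is what makes the nonconvex, nonsmooth penalties tractable in a distributed setting.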
[509] FoGE: Fock Space inspired encoding for graph prompting
Sotirios Panagiotis Chytas, Rudrasis Chakraborty, Vikas Singh
Main category: cs.LG
TL;DR: Parameter-free Fock space graph encoder enables LLMs to answer graph-related questions effectively across diverse graph types with minimal architecture adjustments.
Details
Motivation: To create a versatile graph understanding solution for LLMs that requires less supervision and generalizes well across different graph structures without extensive modifications.Method: Use parameter-free graph encoder based on Fock space representations from mathematical physics, combined with prefix-tuned prompts and frozen pre-trained LLM.
Result: The approach effectively handles graph-related questions across various graph types (simple graphs, proteins, hypergraphs) with minimal adjustments to architecture.
Conclusion: Fock space representations provide rich graph encodings that significantly simplify existing solutions and generalize effortlessly across multiple graph-based structures.
Abstract: Recent results show that modern Large Language Models (LLM) are indeed capable of understanding and answering questions about structured data such as graphs. This new paradigm can lead to solutions that require less supervision while, at the same time, providing a model that can generalize and answer questions beyond the training labels. Existing proposals often use some description of the graph to create an “augmented” prompt fed to the LLM. For a chosen class of graphs, if a well-tailored graph encoder is deployed to play together with a pre-trained LLM, the model can answer graph-related questions well. Existing solutions to graph-based prompts range from graph serialization to graph transformers. In this work, we show that the use of a parameter-free graph encoder based on Fock space representations, a concept borrowed from mathematical physics, is remarkably versatile in this problem setting. The simple construction, inherited directly from the theory with a few small adjustments, can provide rich and informative graph encodings, for a wide range of different graphs. We investigate the use of this idea for prefix-tuned prompts leveraging the capabilities of a pre-trained, frozen LLM. The modifications lead to a model that can answer graph-related questions – from simple graphs to proteins to hypergraphs – effectively and with minimal, if any, adjustments to the architecture. Our work significantly simplifies existing solutions and generalizes well to multiple different graph-based structures effortlessly.
[510] Pairwise Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model
Kotaro Ikeda, Masanori Koyama, Jinzhe Zhang, Kohei Hayashi, Kenji Fukumizu
Main category: cs.LG
TL;DR: A flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport, handling continuous conditions with sparse observations.
Details
Motivation: To address the challenge of learning optimal transport maps among continuous conditional distributions with sparse empirical observations per condition.Method: Proposes a novel cost function for simultaneous learning of optimal transports for all pairs of conditional distributions, using learned transport maps to couple data points in conditional flow matching.
Result: The method demonstrates effectiveness on synthetic, benchmark, and chemical datasets with continuous physical properties as conditions, with theoretical convergence guarantees.
Conclusion: The proposed approach successfully learns all-to-all transfer maps for conditional distributions and provides a practical solution for continuous conditions with sparse data.
Abstract: In this paper, we propose a flow-based method for learning all-to-all transfer maps among conditional distributions that approximates pairwise optimal transport. The proposed method addresses the challenge of handling the case of continuous conditions, which often involve a large set of conditions with sparse empirical observations per condition. We introduce a novel cost function that enables simultaneous learning of optimal transports for all pairs of conditional distributions. Our method is supported by a theoretical guarantee that, in the limit, it converges to the pairwise optimal transports among infinite pairs of conditional distributions. The learned transport maps are subsequently used to couple data points in conditional flow matching. We demonstrate the effectiveness of this method on synthetic and benchmark datasets, as well as on chemical datasets in which continuous physical properties are defined as conditions. The code for this project can be found at https://github.com/kotatumuri-room/A2A-FM
[511] DP-LLM: Runtime Model Adaptation with Dynamic Layer-wise Precision Assignment
Sangwoo Kwon, Seong Hoon Seo, Jae W. Lee, Yeonhong Park
Main category: cs.LG
TL;DR: DP-LLM is a novel mechanism that dynamically assigns precision to each layer of on-device LLMs based on input values, achieving superior performance-latency trade-off by leveraging layer sensitivity changes across decoding steps.
Details
Motivation: To effectively handle queries for on-device LLMs with varying runtime constraints (latency and accuracy) by enabling memory-efficient runtime model adaptation through dynamic precision assignment.Method: Leverages the observation that layer sensitivity dynamically changes across decoding steps, and introduces DP-LLM which dynamically assigns precision to each layer based on input values rather than using static mixed-precision.
Result: Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves superior performance-latency trade-off compared to prior approaches.
Conclusion: Dynamic precision assignment based on input values and changing layer sensitivity across decoding steps provides an effective solution for on-device LLM deployment with varying runtime constraints.
Abstract: How can we effectively handle queries for on-device large language models (LLMs) with varying runtime constraints, such as latency and accuracy? Multi-scale quantization addresses this challenge by enabling memory-efficient runtime model adaptation of LLMs through the overlaying of multiple model variants quantized to different bitwidths. Meanwhile, an important question still remains open-ended: how can models be properly configured to match a target precision or latency? While mixed-precision offers a promising solution, we take this further by leveraging the key observation that the sensitivity of each layer dynamically changes across decoding steps. Building on this insight, we introduce DP-LLM, a novel mechanism that dynamically assigns precision to each layer based on input values. Experimental results across multiple models and benchmarks demonstrate that DP-LLM achieves a superior performance-latency trade-off, outperforming prior approaches.
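The sketch below illustrates the general idea of assigning a bitwidth to a layer at runtime from its current input; the specific selection rule, the bitwidth choices, and the quantizer are illustrative assumptions, not DP-LLM's actual mechanism.

```python
import torch
import torch.nn as nn

def fake_quantize(w, bits):
    """Symmetric uniform quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

class DynamicPrecisionLinear(nn.Module):
    """Linear layer that picks a bitwidth per forward pass from its input."""
    def __init__(self, in_features, out_features, bit_choices=(3, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bit_choices = bit_choices

    def forward(self, x, threshold=1.0):
        # Illustrative rule: larger activation magnitude -> layer is treated as
        # more sensitive at this decoding step -> use a higher bitwidth.
        sensitivity = x.abs().mean().item()
        bits = self.bit_choices[-1] if sensitivity > threshold else self.bit_choices[0]
        return x @ fake_quantize(self.weight, bits).t()

layer = DynamicPrecisionLinear(64, 64)
print(layer(torch.randn(1, 8, 64)).shape)  # torch.Size([1, 8, 64])
```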
[512] Data Fusion of Deep Learned Molecular Embeddings for Property Prediction
Robert J Appleton, Brian C Barnes, Alejandro Strachan
Main category: cs.LG
TL;DR: A new multitask learning approach that fuses pretrained single-task embeddings outperforms standard multitask models on sparse data with weakly correlated properties.
Details
Motivation: Standard multitask learning underperforms on sparse data sets with weakly correlated properties, limiting accuracy in material science applications where data is often scarce.Method: Fuse deep-learned embeddings from independent pretrained single-task models to create a multitask model that inherits property-specific representations, reducing trainable parameters.
Result: The fused model outperforms standard multitask models on both quantum chemistry benchmark data and newly compiled sparse experimental data.
Conclusion: Reusing pretrained embeddings rather than retraining enables more effective multitask learning on sparse data with weakly correlated properties.
Abstract: Data-driven approaches such as deep learning can result in predictive models for material properties with exceptional accuracy and efficiency. However, in many applications, data is sparse, severely limiting their accuracy and applicability. To improve predictions, techniques such as transfer learning and multitask learning have been used. The performance of multitask learning models depends on the strength of the underlying correlations between tasks and the completeness of the data set. Standard multitask models tend to underperform when trained on sparse data sets with weakly correlated properties. To address this gap, we fuse deep-learned embeddings generated by independent pretrained single-task models, resulting in a multitask model that inherits rich, property-specific representations. By reusing (rather than retraining) these embeddings, the resulting fused model outperforms standard multitask models and can be extended with fewer trainable parameters. We demonstrate this technique on a widely used benchmark data set of quantum chemistry data for small molecules as well as a newly compiled sparse data set of experimental data collected from literature and our own quantum chemistry and thermochemical calculations.
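A minimal sketch of the fusion idea, assuming generic stand-in encoders: the pretrained single-task embedding models are frozen, their embeddings are concatenated, and only a small multitask head is trained on top.

```python
import torch
import torch.nn as nn

class FusedMultitaskModel(nn.Module):
    """Concatenate frozen, pretrained single-task embeddings and train only a
    small multitask head on top of them."""
    def __init__(self, encoders, emb_dim, n_tasks):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        for enc in self.encoders:              # reuse embeddings, don't retrain them
            for p in enc.parameters():
                p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(emb_dim * len(encoders), 128), nn.ReLU(),
            nn.Linear(128, n_tasks),
        )

    def forward(self, x):
        with torch.no_grad():
            embs = [enc(x) for enc in self.encoders]
        return self.head(torch.cat(embs, dim=-1))

# Stand-ins for pretrained single-task encoders (e.g. one per property).
encoders = [nn.Sequential(nn.Linear(32, 16), nn.ReLU()) for _ in range(3)]
model = FusedMultitaskModel(encoders, emb_dim=16, n_tasks=3)
print(model(torch.randn(4, 32)).shape)  # torch.Size([4, 3])
```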
[513] Robustness is Important: Limitations of LLMs for Data Fitting
Hejia Liu, Mochen Yang, Gediminas Adomavicius
Main category: cs.LG
TL;DR: LLMs show prediction sensitivity to task-irrelevant data variations like variable name changes, with error differences up to 82%, revealing fundamental robustness issues in using LLMs for data fitting.
Details
Motivation: To investigate the vulnerability of LLMs when used as plug-and-play data fitting tools, particularly their sensitivity to task-irrelevant data representation changes.Method: Examined LLM prediction sensitivity under in-context learning and supervised fine-tuning, analyzed attention patterns in open-weight LLMs, and compared with TabPFN (tabular foundation model).
Result: LLMs show significant prediction sensitivity (up to 82% error variation) to task-irrelevant changes like variable name modifications, with non-uniform attention patterns explaining the vulnerability. TabPFN also shows some sensitivity despite robustness design.
Conclusion: Despite impressive predictive capabilities, current LLMs lack basic robustness needed for principled data-fitting applications due to sensitivity to task-irrelevant variations.
Abstract: Large Language Models (LLMs) are being applied in a wide array of settings, well beyond the typical language-oriented use cases. In particular, LLMs are increasingly used as a plug-and-play method for fitting data and generating predictions. Prior work has shown that LLMs, via in-context learning or supervised fine-tuning, can perform competitively with many tabular supervised learning techniques in terms of predictive performance. However, we identify a critical vulnerability of using LLMs for data fitting – making changes to data representation that are completely irrelevant to the underlying learning task can drastically alter LLMs’ predictions on the same data. For example, simply changing variable names can sway the size of prediction error by as much as 82% in certain settings. Such prediction sensitivity with respect to task-irrelevant variations manifests under both in-context learning and supervised fine-tuning, for both close-weight and open-weight general-purpose LLMs. Moreover, by examining the attention scores of an open-weight LLM, we discover a non-uniform attention pattern: training examples and variable names/values which happen to occupy certain positions in the prompt receive more attention when output tokens are generated, even though different positions are expected to receive roughly the same attention. This partially explains the sensitivity in the presence of task-irrelevant variations. We also consider a state-of-the-art tabular foundation model (TabPFN) trained specifically for data fitting. Despite being explicitly designed to achieve prediction robustness, TabPFN is still not immune to task-irrelevant variations. Overall, despite LLMs’ impressive predictive capabilities, currently they lack even the basic level of robustness to be used as a principled data-fitting tool.
[514] Clustering-Based Low-Rank Matrix Approximation for Medical Image Compression
Sisipho Hamlomo, Marcellin Atemkeng
Main category: cs.LG
TL;DR: Adaptive low-rank matrix approximation (LoRMA) partitions medical images into overlapping patches, clusters similar patches, and performs SVD per cluster to preserve local structural variations and diagnostic fidelity while achieving efficient compression.
Details
Motivation: Medical images have high local structural variations crucial for diagnosis, but global LoRMA techniques fail to adapt to these local variations, leading to loss of diagnostic information in compression.Method: Partition medical images into overlapping patches, group structurally similar patches using k-means clustering, perform SVD within each cluster, and derive compression factors accounting for patch overlap.
Result: Adaptive LoRMA outperforms global SVD across MRI, ultrasound, CT, and chest X-ray images, achieving higher PSNR, SSIM, IoU, and EPI and lower MSE; it preserves structural integrity and edge details, minimizes block artifacts, and maintains diagnostic relevance.
Conclusion: Adaptive LoRMA effectively preserves clinically salient regions while allowing aggressive compression in non-critical areas, justifying higher processing time for superior diagnostic fidelity in high-compression medical imaging applications.
Abstract: Medical images are inherently high-resolution and contain locally varying structures crucial for diagnosis. Efficient compression must preserve diagnostic fidelity while minimizing redundancy. Low-rank matrix approximation (LoRMA) techniques have shown strong potential for image compression by capturing global correlations; however, they often fail to adapt to local structural variations across regions of interest. To address this, we introduce an adaptive LoRMA, which partitions a medical image into overlapping patches, groups structurally similar patches into clusters using k-means, and performs SVD within each cluster. We derive the overall compression factor accounting for patch overlap and analyze how patch size influences compression efficiency and computational cost. While applicable to any data with high local variation, we focus on medical imaging due to its pronounced local variability. We evaluate and compare our adaptive LoRMA against global SVD across four imaging modalities: MRI, ultrasound, CT scan, and chest X-ray. Results demonstrate that adaptive LoRMA effectively preserves structural integrity, edge details, and diagnostic relevance, measured by PSNR, SSIM, MSE, IoU, and EPI. Adaptive LoRMA minimizes block artifacts and residual errors, particularly in pathological regions, consistently outperforming global SVD in PSNR, SSIM, IoU, EPI, and achieving lower MSE. It prioritizes clinically salient regions while allowing aggressive compression in non-critical regions, optimizing storage efficiency. Although adaptive LoRMA requires higher processing time, its diagnostic fidelity justifies the overhead for high-compression applications.
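A compact sketch of the adaptive LoRMA pipeline on a single image: extract patches, cluster them with k-means, and keep a truncated SVD per cluster. For simplicity this uses non-overlapping patches and a fixed rank, whereas the paper uses overlapping patches and derives the resulting compression factor.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_lowrank_compress(img, patch=8, n_clusters=4, rank=2):
    """Cluster image patches, then keep a rank-`rank` SVD per cluster."""
    H, W = img.shape
    H, W = H - H % patch, W - W % patch           # crop to a multiple of the patch size
    patches = (img[:H, :W]
               .reshape(H // patch, patch, W // patch, patch)
               .swapaxes(1, 2)
               .reshape(-1, patch * patch))
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(patches)

    recon = np.empty_like(patches)
    for c in range(n_clusters):
        block = patches[labels == c]              # all patches in this cluster
        if block.size == 0:
            continue
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        recon[labels == c] = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank]
    return (recon.reshape(H // patch, W // patch, patch, patch)
                 .swapaxes(1, 2).reshape(H, W))

img = np.random.rand(64, 64)
print(np.mean((img - adaptive_lowrank_compress(img)) ** 2))  # reconstruction MSE
```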
[515] Turbocharging Gaussian Process Inference with Approximate Sketch-and-Project
Pratik Rathore, Zachary Frangella, Sachin Garg, Shaghayegh Fazliani, Michał Dereziński, Madeleine Udell
Main category: cs.LG
TL;DR: ADASAP is a distributed, accelerated sketch-and-project algorithm that enables scalable Gaussian process inference for large datasets by efficiently solving linear systems that normally scale quadratically with dataset size.
Details
Motivation: Gaussian processes struggle with scalability for large datasets due to quadratic computational complexity in solving linear systems, which limits their application in modern large-scale problems.Method: Proposed ADASAP algorithm uses distributed, accelerated sketch-and-project approach with determinantal point processes theory to approximate posterior mean efficiently.
Result: ADASAP outperforms state-of-the-art solvers (conjugate gradient, coordinate descent) on benchmark datasets and scales to datasets with over 300 million samples, achieving condition number-free convergence for posterior mean estimation.
Conclusion: ADASAP provides a principled, scalable solution for Gaussian process inference that enables practical application to massive datasets previously infeasible in the literature.
Abstract: Gaussian processes (GPs) play an essential role in biostatistics, scientific machine learning, and Bayesian optimization for their ability to provide probabilistic predictions and model uncertainty. However, GP inference struggles to scale to large datasets (which are common in modern applications), since it requires the solution of a linear system whose size scales quadratically with the number of samples in the dataset. We propose an approximate, distributed, accelerated sketch-and-project algorithm ($\texttt{ADASAP}$) for solving these linear systems, which improves scalability. We use the theory of determinantal point processes to show that the posterior mean induced by sketch-and-project rapidly converges to the true posterior mean. In particular, this yields the first efficient, condition number-free algorithm for estimating the posterior mean along the top spectral basis functions, showing that our approach is principled for GP inference. $\texttt{ADASAP}$ outperforms state-of-the-art solvers based on conjugate gradient and coordinate descent across several benchmark datasets and a large-scale Bayesian optimization task. Moreover, $\texttt{ADASAP}$ scales to a dataset with $> 3 \cdot 10^8$ samples, a feat which has not been accomplished in the literature.
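The core sketch-and-project update can be illustrated with coordinate-block sketches on a kernel system $(K + \lambda I)x = y$; the version below is the plain, non-accelerated, single-machine variant, so it omits the distribution and acceleration that ADASAP adds.

```python
import numpy as np

def sketch_and_project(A, b, block_size=64, n_iter=200, seed=0):
    """Solve A x = b (A symmetric positive definite, e.g. K + lambda*I) by
    repeatedly projecting onto random coordinate blocks in the A-norm."""
    rng = np.random.default_rng(seed)
    n = len(b)
    x = np.zeros(n)
    for _ in range(n_iter):
        idx = rng.choice(n, size=min(block_size, n), replace=False)   # random sketch
        r = A[idx] @ x - b[idx]                                       # residual on the block
        x[idx] -= np.linalg.solve(A[np.ix_(idx, idx)], r)             # small block solve
    return x

# Tiny GP-style example: RBF kernel plus a noise term.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2) + 1e-1 * np.eye(500)
y = rng.standard_normal(500)
x = sketch_and_project(K, y)
print(np.linalg.norm(K @ x - y) / np.linalg.norm(y))  # relative residual
```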
[516] Pre-trained knowledge elevates large language models beyond traditional chemical reaction optimizers
Robert MacKnight, Jose Emilio Regio, Jeffrey G. Ethier, Luke A. Baldwin, Gabe Gomes
Main category: cs.LG
TL;DR: LLM-guided optimization (LLM-GO) matches or exceeds Bayesian optimization performance in chemical reaction optimization, particularly in complex categorical spaces where high-performing conditions are scarce.
Details
Motivation: To demonstrate that pre-trained knowledge in large language models fundamentally changes the paradigm of black-box optimization in experimental chemistry.Method: Benchmarked LLM-guided optimization against Bayesian optimization and random sampling using six fully enumerated categorical reaction datasets (768-5,684 experiments), with a topology-agnostic information theory framework to quantify sampling diversity.
Result: LLMs consistently match or exceed BO performance across five single-objective datasets, with advantages growing as parameter complexity increases and high-performing conditions become scarce (<5% of space). BO retains superiority only for explicit multi-objective trade-offs.
Conclusion: LLM-GO excels precisely where traditional methods struggle: complex categorical spaces requiring domain understanding rather than mathematical optimization, with pre-trained domain knowledge enabling more effective navigation of chemical parameter space.
Abstract: Modern optimization in experimental chemistry employs algorithmic search through black-box parameter spaces. Here we demonstrate that pre-trained knowledge in large language models (LLMs) fundamentally changes this paradigm. Using six fully enumerated categorical reaction datasets (768-5,684 experiments), we benchmark LLM-guided optimization (LLM-GO) against Bayesian optimization (BO) and random sampling. Frontier LLMs consistently match or exceed BO performance across five single-objective datasets, with advantages growing as parameter complexity increases and high-performing conditions become scarce (<5% of space). BO retains superiority only for explicit multi-objective trade-offs. To understand these contrasting behaviors, we introduce a topology-agnostic information theory framework quantifying sampling diversity throughout optimization campaigns. This analysis reveals that LLMs maintain systematically higher exploration Shannon entropy than BO across all datasets while achieving superior performance, with advantages most pronounced in solution-scarce parameter spaces where high-entropy exploration typically fails-suggesting that pre-trained domain knowledge enables more effective navigation of chemical parameter space rather than replacing structured exploration strategies. To enable transparent benchmarking and community validation, we release Iron Mind (https://gomes.andrew.cmu.edu/iron-mind), a no-code platform for side-by-side evaluation of human, algorithmic, and LLM optimization campaigns with public leaderboards and complete trajectories. Our findings establish that LLM-GO excels precisely where traditional methods struggle: complex categorical spaces requiring domain understanding rather than mathematical optimization.
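The exploration-diversity measurement can be illustrated with a plain Shannon entropy over the categorical conditions sampled during a campaign; the paper's topology-agnostic framework may aggregate diversity differently, and the toy campaigns below are invented for illustration.

```python
import math
from collections import Counter

def sampling_entropy(choices):
    """Shannon entropy (in bits) of the empirical distribution of sampled
    parameter combinations during an optimization campaign."""
    counts = Counter(choices)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Two toy campaigns over the same categorical reaction space:
bo_like = [("ligand_A", "solvent_1")] * 8 + [("ligand_B", "solvent_1")] * 2
llm_like = ([("ligand_A", "solvent_1"), ("ligand_B", "solvent_2"),
             ("ligand_C", "solvent_1"), ("ligand_A", "solvent_2")] * 2
            + [("ligand_D", "solvent_3")] * 2)
print(sampling_entropy(bo_like), sampling_entropy(llm_like))  # lower vs higher entropy
```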
[517] JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model
Qihao Duan, Bingding Huang, Zhenqiao Song, Irina Lehmann, Lei Gu, Roland Eils, Benjamin Wild
Main category: cs.LG
TL;DR: JanusDNA is a bidirectional DNA foundation model that combines autoregressive efficiency with masked modeling’s bidirectional comprehension, using a hybrid Mamba-Attention-MoE architecture to handle up to 1 million base pairs and achieve SOTA results on genomic benchmarks.
Details
Motivation: LLMs struggle with genomics due to long-range dependencies (over 10,000 base pairs) and DNA's bidirectional nature, which standard autoregressive training can't capture efficiently, while masked models are computationally inefficient.Method: Uses a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture combining long-range modeling of Attention with efficient sequential learning of Mamba, plus a novel pretraining paradigm that merges autoregressive optimization efficiency with masked modeling’s bidirectional comprehension.
Result: Processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU and achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters.
Conclusion: JanusDNA successfully addresses key limitations in adapting LLMs to genomics by providing bidirectional understanding with computational efficiency, enabling large-scale DNA sequence analysis.
Abstract: Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types, including genetic sequences. However, adapting LLMs to genomics presents significant challenges. Capturing complex genomic interactions requires modeling long-range dependencies within DNA sequences, where interactions often span over 10,000 base pairs, even within a single gene, posing substantial computational burdens under conventional model architectures and training paradigms. Moreover, standard LLM training approaches are suboptimal for DNA: autoregressive training, while efficient, supports only unidirectional understanding. However, DNA is inherently bidirectional, e.g., bidirectional promoters regulate transcription in both directions and account for nearly 11% of human gene expression. Masked language models (MLMs) allow bidirectional understanding but are inefficient, as only masked tokens contribute to the loss per step. To address these limitations, we introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm that combines the optimization efficiency of autoregressive modeling with the bidirectional comprehension of masked modeling. JanusDNA adopts a hybrid Mamba, Attention and Mixture of Experts (MoE) architecture, combining long-range modeling of Attention with efficient sequential learning of Mamba. MoE layers further scale model capacity via sparse activation while keeping computational cost low. Notably, JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU. Extensive experiments and ablations show JanusDNA achieves new SOTA results on three genomic representation benchmarks, outperforming models with 250x more activated parameters. Code: https://github.com/Qihao-Duan/JanusDNA
[518] CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots
Keisuke Kawano, Takuro Kutsuna, Naoki Hayashi, Yasushi Esaki, Hidenori Tanaka
Main category: cs.LG
TL;DR: CT-OT Flow is a two-stage framework that estimates continuous-time dynamics from temporally aggregated snapshot data by inferring high-resolution time labels via partial optimal transport and reconstructing continuous-time distributions through kernel smoothing.
Details
Motivation: Many real-world applications like single-cell RNA sequencing, mobility sensing, and environmental monitoring only provide temporally aggregated snapshots with noisy timestamps, lacking continuous trajectory data, creating a need to estimate continuous-time dynamics from such limited observations.Method: A two-stage approach: (1) infer high-resolution time labels by aligning neighboring intervals using partial optimal transport, (2) reconstruct continuous-time data distribution through temporal kernel smoothing, then sample nearby time pairs to train standard ODE/SDE models. Includes practical accelerations like screening and mini-batch POT.
Result: CT-OT Flow outperforms existing methods (OT-CFM, [SF]²M, TrajectoryNet, MFM, ENOT) by reducing distributional and trajectory errors across synthetic benchmarks and real datasets including scRNA-seq and typhoon tracks.
Conclusion: The framework successfully addresses snapshot aggregation and time-label uncertainty, making it applicable to large datasets and providing improved continuous-time dynamics estimation from temporally aggregated observations.
Abstract: In many real-world settings–e.g., single-cell RNA sequencing, mobility sensing, and environmental monitoring–data are observed only as temporally aggregated snapshots collected over finite time windows, often with noisy or uncertain timestamps, and without access to continuous trajectories. We study the problem of estimating continuous-time dynamics from such snapshots. We present Continuous-Time Optimal Transport Flow (CT-OT Flow), a two-stage framework that (i) infers high-resolution time labels by aligning neighboring intervals via partial optimal transport (POT) and (ii) reconstructs a continuous-time data distribution through temporal kernel smoothing, from which we sample pairs of nearby times to train standard ODE/SDE models. Our formulation explicitly accounts for snapshot aggregation and time-label uncertainty and uses practical accelerations (screening and mini-batch POT), making it applicable to large datasets. Across synthetic benchmarks and two real datasets (scRNA-seq and typhoon tracks), CT-OT Flow reduces distributional and trajectory errors compared with OT-CFM, [SF]²M, TrajectoryNet, MFM, and ENOT.
[519] Causal Spatio-Temporal Prediction: An Effective and Efficient Multi-Modal Approach
Yuting Huang, Ziquan Fang, Zhihao Zeng, Lu Chen, Yunjun Gao
Main category: cs.LG
TL;DR: E^2-CSTP is a causal multi-modal spatio-temporal prediction framework that uses cross-modal attention and gating for effective data fusion, dual-branch causal inference for bias mitigation, and GCN-Mamba integration for computational efficiency.
Details
Motivation: To address challenges in spatio-temporal prediction including inadequate multi-modal fusion, confounding factors obscuring causal relations, and high computational complexity of existing models.Method: Uses cross-modal attention and gating mechanisms for multi-modal integration, dual-branch causal inference with primary branch for prediction and auxiliary branch for bias mitigation via causal interventions, and integrates GCN with Mamba architecture for efficient spatio-temporal encoding.
Result: Outperforms 9 state-of-the-art methods on 4 real-world datasets with up to 9.66% accuracy improvement and 17.37%-56.11% reduction in computational overhead.
Conclusion: E^2-CSTP effectively addresses key challenges in multi-modal spatio-temporal prediction through causal inference and efficient architecture design, achieving significant performance gains and computational efficiency.
Abstract: Spatio-temporal prediction plays a crucial role in intelligent transportation, weather forecasting, and urban planning. While integrating multi-modal data has shown potential for enhancing prediction accuracy, key challenges persist: (i) inadequate fusion of multi-modal information, (ii) confounding factors that obscure causal relations, and (iii) high computational complexity of prediction models. To address these challenges, we propose E^2-CSTP, an Effective and Efficient Causal multi-modal Spatio-Temporal Prediction framework. E^2-CSTP leverages cross-modal attention and gating mechanisms to effectively integrate multi-modal data. Building on this, we design a dual-branch causal inference approach: the primary branch focuses on spatio-temporal prediction, while the auxiliary branch mitigates bias by modeling additional modalities and applying causal interventions to uncover true causal dependencies. To improve model efficiency, we integrate GCN with the Mamba architecture for accelerated spatio-temporal encoding. Extensive experiments on 4 real-world datasets show that E^2-CSTP significantly outperforms 9 state-of-the-art methods, achieving up to 9.66% improvements in accuracy as well as 17.37%-56.11% reductions in computational overhead.
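A compact sketch of cross-modal attention followed by a gating mechanism, the fusion pattern described above; dimensions, the GCN-Mamba spatio-temporal encoder, and the auxiliary causal branch are omitted, and this is not the E^2-CSTP implementation.

```python
import torch
import torch.nn as nn

class GatedCrossModalFusion(nn.Module):
    """Primary modality attends to an auxiliary modality; a learned gate decides
    how much of the attended signal to mix back in."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, primary, auxiliary):         # (batch, seq, dim) each
        attended, _ = self.attn(primary, auxiliary, auxiliary)
        g = self.gate(torch.cat([primary, attended], dim=-1))
        return primary + g * attended              # gated residual fusion

fused = GatedCrossModalFusion(32)(torch.randn(2, 12, 32), torch.randn(2, 20, 32))
print(fused.shape)  # torch.Size([2, 12, 32])
```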
[520] PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models
He Xiao, Runming Yang, Qingyao Yang, Wendong Xu, Zhen Li, Yupeng Su, Zhengwu Liu, Hongxia Yang, Ngai Wong
Main category: cs.LG
TL;DR: PTQTP is a novel ternary-weight post-training quantization framework that decomposes LLM weights into structured ternary trit-planes, achieving multiplication-free inference with superior expressiveness compared to existing ultra-low-bit methods.
Details
Motivation: Existing ultra-low-bit PTQ methods suffer from limited representational capacity or computational overhead that undermines efficiency gains, creating a fundamental trade-off between computational efficiency and model expressiveness.Method: Decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation, with progressive approximation algorithm for global weight consistency, model-agnostic deployment, and uniform ternary operations.
Result: Significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention vs 0% for competitors, and approaches/surpasses 1.58-bit quantization-aware training performance with single-hour quantization vs 10-14 GPU days for training-based methods.
Conclusion: PTQTP establishes a practical solution for efficient LLM deployment in resource-constrained environments, providing multiplication-free inference while maintaining superior expressiveness through structured ternary decomposition.
Abstract: Post-training quantization (PTQ) of large language models (LLMs) to extremely low bit-widths remains challenging due to the fundamental trade-off between computational efficiency and model expressiveness. While existing ultra-low-bit PTQ methods rely on binary approximations or complex compensation mechanisms, they suffer from either limited representational capacity or computational overhead that undermines their efficiency gains. We introduce PTQ to Trit-Planes (PTQTP), the first ternary-weight PTQ framework that decomposes weight matrices into structured ternary {-1, 0, 1} trit-planes using 2x1.58-bit representation. PTQTP achieves multiplication-free inference, identical to 1-bit quantization, while maintaining superior expressiveness through its novel structured decomposition. Our approach provides: (1) a theoretically grounded progressive approximation algorithm ensuring global weight consistency; (2) model-agnostic deployment across diverse modern LLMs without architectural modifications; and (3) uniform ternary operations that eliminate the need for mixed-precision or compensation schemes. Comprehensive experiments across LLaMA3.x and Qwen3 model families (0.6B-70B parameters) demonstrate that PTQTP significantly outperforms existing low-bit PTQ methods, achieving 82.4% mathematical reasoning retention versus 0% for competing approaches. PTQTP approaches and sometimes surpasses 1.58-bit quantization-aware training performance while requiring only single-hour quantization compared to 10-14 GPU days for training-based methods. These results establish PTQTP as a practical solution for efficient LLM deployment in resource-constrained environments. The code will be available at https://github.com/HeXiao-55/PTQTP.
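A greedy two-plane ternary decomposition conveys the flavour of trit-plane quantization; the TWN-style threshold heuristic below is an illustrative stand-in for the paper's progressive approximation algorithm with global weight consistency.

```python
import numpy as np

def ternary_plane(W, thresh_factor=0.7):
    """One scaled ternary plane alpha * T, with T in {-1, 0, 1} (TWN-style heuristic)."""
    delta = thresh_factor * np.mean(np.abs(W))
    T = np.sign(W) * (np.abs(W) > delta)
    alpha = np.abs(W[T != 0]).mean() if np.any(T != 0) else 0.0
    return alpha, T

def two_trit_planes(W):
    """Greedy two-plane decomposition W ~= a1*T1 + a2*T2 (illustrative only)."""
    a1, T1 = ternary_plane(W)
    a2, T2 = ternary_plane(W - a1 * T1)   # second plane fits the residual
    return (a1, T1), (a2, T2)

W = np.random.randn(256, 256) * 0.05
(a1, T1), (a2, T2) = two_trit_planes(W)
approx = a1 * T1 + a2 * T2
print(np.linalg.norm(W - approx) / np.linalg.norm(W))  # relative approximation error
```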
[521] Why Diffusion Models Don’t Memorize: The Role of Implicit Dynamical Regularization in Training
Tony Bonnaire, Raphaël Urfin, Giulio Biroli, Marc Mézard
Main category: cs.LG
TL;DR: Diffusion models exhibit two distinct training timescales: an early generalization time τ_gen for quality samples and a later memorization time τ_mem. τ_mem grows linearly with dataset size while τ_gen remains constant, creating a generalization window that prevents memorization in overparameterized settings.
Details
Motivation: To understand the mechanisms that prevent diffusion models from memorizing training data and enable generalization, particularly investigating the role of training dynamics in the transition from generalization to memorization.Method: Extensive experiments with standard U-Net architectures on realistic and synthetic datasets, combined with theoretical analysis using a tractable random features model studied in the high-dimensional limit.
Result: Identified two distinct timescales: τ_gen (early generalization) remains constant, while τ_mem (later memorization) increases linearly with training set size. This creates a growing window where models generalize effectively before memorization emerges.
Conclusion: Training dynamics provide implicit dynamical regularization that allows diffusion models to avoid memorization even in highly overparameterized settings, with overfitting only disappearing when dataset size exceeds a model-dependent threshold at infinite training times.
Abstract: Diffusion models have achieved remarkable success across a wide range of generative tasks. A key challenge is understanding the mechanisms that prevent their memorization of training data and allow generalization. In this work, we investigate the role of the training dynamics in the transition from generalization to memorization. Through extensive experiments and theoretical analysis, we identify two distinct timescales: an early time $\tau_\mathrm{gen}$ at which models begin to generate high-quality samples, and a later time $\tau_\mathrm{mem}$ beyond which memorization emerges. Crucially, we find that $\tau_\mathrm{mem}$ increases linearly with the training set size $n$, while $\tau_\mathrm{gen}$ remains constant. This creates a growing window of training times with $n$ where models generalize effectively, despite showing strong memorization if training continues beyond it. It is only when $n$ becomes larger than a model-dependent threshold that overfitting disappears at infinite training times. These findings reveal a form of implicit dynamical regularization in the training dynamics, which allow to avoid memorization even in highly overparameterized settings. Our results are supported by numerical experiments with standard U-Net architectures on realistic and synthetic datasets, and by a theoretical analysis using a tractable random features model studied in the high-dimensional limit.
[522] PEARL: Peer-Enhanced Adaptive Radio via On-Device LLM
Ju-Hyung Lee, Yanqing Lu, Klaus Doppler
Main category: cs.LG
TL;DR: PEARL is a framework using on-device LLMs for cooperative D2D communication optimization, improving performance and reducing energy consumption by up to 16% through peer-aware context and reward-aligned training.
Details
Motivation: To extend single-device on-device LLM optimization to cooperative scenarios by leveraging both publisher and subscriber states for better Wi-Fi Aware parameter selection in D2D communication.Method: Uses context-aware reward that normalizes latency by application tolerances and modulates energy by device battery states for KL-based finetuning. Two variants: PEARL (Head + LoRA) and PEARL-Lite (Head-only).
Result: PEARL improves objective scores over heuristic and compact model baselines, reduces energy by up to 16% in cooperative low-battery cases, and PEARL-Lite achieves sub-20 ms inference with near-identical performance.
Conclusion: Peer-aware context, reward-aligned training, and head-based efficiency make LLMs practical for always-on, on-device cross-layer control in D2D communication.
Abstract: We present PEARL (Peer-Enhanced Adaptive Radio via On-Device LLM), a framework for cooperative cross-layer optimization in device-to-device (D2D) communication. Building on our previous work on single-device on-device LLMs, PEARL extends the paradigm by leveraging both publisher and subscriber states to guide Wi-Fi Aware (WA) parameter selection. A context-aware reward, which normalizes latency by application tolerances and modulates energy by device battery states, provides richer supervision for KL-based finetuning. We study two lightweight variants: PEARL (Head + Low-Rank Adaptation (LoRA)) achieves the best overall performance, while PEARL-Lite (Head-only) delivers sub-20 ms inference at near-identical objective scores. Across synthetic scenarios grounded in real measurements, PEARL improves objective scores over heuristic and compact model baselines and reduces energy by up to 16% in cooperative low-battery cases. These results demonstrate that peer-aware context, reward-aligned training, and head-based efficiency make LLMs practical for always-on, on-device cross-layer control. Code, real-world demo, and dataset are available at https://github.com/abman23/pearl
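A hedged sketch of the context-aware reward: latency is normalized by the application's tolerance and energy is weighted more heavily when the lower of the two devices' batteries is low. The functional form and constants here are invented for illustration, not taken from the paper.

```python
def context_aware_reward(latency_ms, energy_mj, latency_tolerance_ms,
                         publisher_battery, subscriber_battery):
    """Illustrative reward for cooperative Wi-Fi Aware parameter selection.

    Latency is normalized by the application's tolerance; energy is penalized
    more strongly when the lower of the two devices' battery levels is low.
    Battery levels are in [0, 1]. Higher reward is better.
    """
    latency_term = max(0.0, 1.0 - latency_ms / latency_tolerance_ms)
    battery = min(publisher_battery, subscriber_battery)
    energy_weight = 1.0 + (1.0 - battery)          # low battery -> energy matters more
    return latency_term - 0.01 * energy_weight * energy_mj

# Same link conditions, but a low subscriber battery penalizes energy harder.
print(context_aware_reward(40, 25, 100, publisher_battery=0.9, subscriber_battery=0.8))
print(context_aware_reward(40, 25, 100, publisher_battery=0.9, subscriber_battery=0.2))
```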
[523] URB – Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles
Ahmet Onur Akman, Anastasia Psarou, Michał Hoffmann, Łukasz Gorczyca, Łukasz Kowalski, Paweł Gora, Grzegorz Jamróz, Rafał Kucharski
Main category: cs.LG
TL;DR: URB is a new benchmark for testing multi-agent reinforcement learning algorithms on connected autonomous vehicle routing in 29 real-world urban networks, showing current methods rarely outperform human drivers despite extensive training.
Details
Motivation: There's a lack of standardized, realistic benchmarks for developing collective routing strategies for connected autonomous vehicles using reinforcement learning, which is needed to reduce urban congestion.Method: Created URB benchmark with 29 real traffic networks, realistic demand patterns, predefined tasks, MARL implementations, baseline methods, domain metrics, and modular configuration.
Result: State-of-the-art MARL algorithms rarely outperformed human drivers despite lengthy training, and current approaches struggle to scale effectively.
Conclusion: URB establishes the first leaderboard for MARL in urban routing optimization, revealing urgent need for algorithmic advancements as current methods underperform human capabilities.
Abstract: Connected Autonomous Vehicles (CAVs) promise to reduce congestion in future urban networks, potentially by optimizing their routing decisions. Unlike for human drivers, these decisions can be made with collective, data-driven policies, developed using machine learning algorithms. Reinforcement learning (RL) can facilitate the development of such collective routing strategies, yet standardized and realistic benchmarks are missing. To that end, we present URB: Urban Routing Benchmark for RL-equipped Connected Autonomous Vehicles. URB is a comprehensive benchmarking environment that unifies evaluation across 29 real-world traffic networks paired with realistic demand patterns. URB comes with a catalog of predefined tasks, multi-agent RL (MARL) algorithm implementations, three baseline methods, domain-specific performance metrics, and a modular configuration scheme. Our results show that, despite the lengthy and costly training, state-of-the-art MARL algorithms rarely outperformed humans. The experimental results reported in this paper initiate the first leaderboard for MARL in large-scale urban routing optimization. They reveal that current approaches struggle to scale, emphasizing the urgent need for advancements in this domain.
[524] Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?
Haizhong Zheng, Jiawei Zhao, Beidi Chen
Main category: cs.LG
TL;DR: M2PO (Second-Moment Trust Policy Optimization) enables stable off-policy reinforcement learning for large language models by constraining the second moment of importance weights, allowing effective use of stale rollout data while matching on-policy performance.
Details
Motivation: Current RL methods for language models rely on on-policy training requiring fresh rollouts at every update, which limits efficiency and scalability. Asynchronous RL systems exist but degrade with stale data.Method: M2PO constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates, reducing clipped tokens from 1.22% to 0.06% under high staleness.
Result: M2PO enables stable off-policy training with data stale by at least 256 model updates across six models (1.7B to 32B) and eight benchmarks, matching on-policy performance while sharply reducing variance.
Conclusion: M2PO successfully addresses the staleness challenge in asynchronous RL for language models, demonstrating that stale data can be as informative as on-policy data when properly exploited through second-moment constraints.
Abstract: Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
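To make the second-moment idea concrete, the sketch below shows one way to mask extreme token-level importance weights so that their second moment stays under a bound. It is a minimal PyTorch illustration; the function name `second_moment_mask`, the threshold value, and the masking order are assumptions for illustration, not the paper's trust-region formulation.

```python
import torch

def second_moment_mask(logp_new, logp_old, max_second_moment=2.0):
    """Illustrative sketch: drop the largest-ratio tokens until the second moment
    E[w^2] of the kept importance weights is below a bound."""
    ratios = torch.exp(logp_new - logp_old)               # token-level importance weights w
    mask = torch.ones_like(ratios, dtype=torch.bool)
    order = torch.argsort(ratios, descending=True)        # most extreme weights first
    for idx in order:
        if (ratios[mask] ** 2).mean() <= max_second_moment or mask.sum() <= 1:
            break
        mask[idx] = False                                  # mask this outlier token
    return ratios, mask

# usage: off-policy surrogate loss over the kept tokens only
logp_old = torch.randn(64)                    # log-probs from stale rollouts
logp_new = logp_old + 0.3 * torch.randn(64)   # log-probs under the current policy
advantages = torch.randn(64)
ratios, mask = second_moment_mask(logp_new, logp_old)
loss = -(ratios[mask] * advantages[mask]).mean()
```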
[525] Structured Reinforcement Learning for Combinatorial Decision-Making
Heiko Hoppe, Léo Baty, Louis Bouvier, Axel Parmentier, Maximilian Schiffer
Main category: cs.LG
TL;DR: SRL embeds combinatorial optimization layers into actor networks using Fenchel-Young losses, improving RL performance on structured decision problems with combinatorial action spaces.
Details
Motivation: Standard RL struggles with combinatorial action spaces in real-world problems like routing and scheduling, lacking scalability, generalization, and structure exploitation.Method: Actor-critic paradigm with combinatorial optimization layers in actor networks, trained end-to-end using Fenchel-Young losses, interpreted as a primal-dual algorithm in the dual of the moment polytope.
Result: SRL matches or surpasses unstructured RL and imitation learning on static tasks, improves by up to 92% on dynamic problems, with better stability and convergence speed across six environments.
Conclusion: SRL effectively handles combinatorial action spaces in RL, demonstrating superior performance, stability, and convergence in structured decision-making problems.
Abstract: Reinforcement learning (RL) is increasingly applied to real-world problems involving complex and structured decisions, such as routing, scheduling, and assortment planning. These settings challenge standard RL algorithms, which struggle to scale, generalize, and exploit structure in the presence of combinatorial action spaces. We propose Structured Reinforcement Learning (SRL), a novel actor-critic paradigm that embeds combinatorial optimization layers into the actor neural network. We enable end-to-end learning of the actor via Fenchel-Young losses and provide a geometric interpretation of SRL as a primal-dual algorithm in the dual of the moment polytope. Across six environments with exogenous and endogenous uncertainty, SRL matches or surpasses the performance of unstructured RL and imitation learning on static tasks and improves over these baselines by up to 92% on dynamic problems, with improved stability and convergence speed.
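As a much simplified illustration of training through a combinatorial layer with a Fenchel-Young loss, the sketch below uses a top-k selection oracle and the unregularized (perceptron-style) Fenchel-Young loss. The oracle, shapes, and target construction are assumptions for illustration, not the paper's environments or its regularized losses.

```python
import torch

def topk_oracle(scores, k):
    """Combinatorial layer: 0/1 indicator of the k highest-scoring items."""
    sol = torch.zeros_like(scores)
    sol[torch.topk(scores, k).indices] = 1.0
    return sol

class FenchelYoungTopK(torch.autograd.Function):
    """Perceptron-style Fenchel-Young loss for a top-k oracle (zero regularizer):
    L(scores, target) = <scores, oracle(scores)> - <scores, target>,
    whose gradient w.r.t. the scores is oracle(scores) - target."""

    @staticmethod
    def forward(ctx, scores, target, k):
        sol = topk_oracle(scores, k)
        ctx.save_for_backward(sol, target)
        return (scores * (sol - target)).sum()

    @staticmethod
    def backward(ctx, grad_out):
        sol, target = ctx.saved_tensors
        return grad_out * (sol - target), None, None

# usage: push an actor's scores toward reproducing a demonstrated selection
scores = torch.randn(10, requires_grad=True)
target = topk_oracle(torch.randn(10), 3)          # a feasible target selection
loss = FenchelYoungTopK.apply(scores, target, 3)
loss.backward()                                   # scores.grad == oracle(scores) - target
```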
[526] DeepRTE: Pre-trained Attention-based Neural Network for Radiative Transfer
Yekun Zhu, Min Tang, Zheng Ma
Main category: cs.LG
TL;DR: DeepRTE is a novel neural network approach that efficiently solves the steady-state Radiative Transfer Equation using a physics-informed architecture, achieving high accuracy with fewer parameters via multi-head attention and offering zero-shot capability through Green’s function theory.
Details
Motivation: To develop a computationally efficient method for solving the steady-state Radiative Transfer Equation that governs radiation propagation in participating media, with applications in neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging.Method: Proposes DeepRTE framework with physics-informed network architecture, mathematical derivation embedding, multi-head attention mechanisms, Green’s function theory integration, and pre-training with delta-function inflow boundary conditions to create a mesh-free neural operator.
Result: Demonstrates superior computational efficiency compared to traditional methods and existing neural network approaches, achieves high accuracy with significantly fewer parameters, and exhibits inherent zero-shot capability through comprehensive numerical experiments.
Conclusion: DeepRTE provides an effective and efficient neural network solution for the steady-state Radiative Transfer Equation, combining physical principles with advanced neural architecture to outperform existing methods while maintaining high accuracy.
Abstract: In this paper, we propose a novel neural network approach, termed DeepRTE, to address the steady-state Radiative Transfer Equation (RTE). The RTE is a differential-integral equation that governs the propagation of radiation through a participating medium, with applications spanning diverse domains such as neutron transport, atmospheric radiative transfer, heat transfer, and optical imaging. Our DeepRTE framework demonstrates superior computational efficiency for solving the steady-state RTE, surpassing traditional methods and existing neural network approaches. This efficiency is achieved by embedding physical information through derivation of the RTE and mathematically-informed network architecture. Concurrently, DeepRTE achieves high accuracy with significantly fewer parameters, largely due to its incorporation of mechanisms such as multi-head attention. Furthermore, DeepRTE is a mesh-free neural operator framework with inherent zero-shot capability. This is achieved by incorporating Green’s function theory and pre-training with delta-function inflow boundary conditions into both its architecture design and training data construction. The efficacy of the proposed approach is substantiated through comprehensive numerical experiments.
[527] Distilled Protein Backbone Generation
Liyang Xie, Haoran Zhang, Zhendong Wang, Wesley Tansey, Mingyuan Zhou
Main category: cs.LG
TL;DR: The paper introduces a score distillation method to accelerate protein backbone generation from diffusion models, achieving 20x speedup while maintaining designability, diversity, and novelty comparable to the original model.
Details
Motivation: Diffusion-based protein generation models suffer from slow sampling speeds (hundreds of iterative steps), limiting their practical utility in large-scale protein discovery where thousands to millions of candidate structures are needed.Method: Adapted Score identity Distillation (SiD) with multistep generation and inference time noise modulation to train few-step protein backbone generators that significantly reduce sampling time while maintaining performance.
Result: The distilled few-step generators achieve more than 20-fold improvement in sampling speed while maintaining similar levels of designability, diversity, and novelty as the original Proteina teacher model.
Conclusion: This approach enables large-scale in silico protein design by reducing inference costs, bringing diffusion-based models closer to real-world protein engineering applications.
Abstract: Diffusion- and flow-based generative models have recently demonstrated strong performance in protein backbone generation tasks, offering unprecedented capabilities for de novo protein design. However, while achieving notable performance in generation quality, these models are limited by their generating speed, often requiring hundreds of iterative steps in the reverse-diffusion process. This computational bottleneck limits their practical utility in large-scale protein discovery, where thousands to millions of candidate structures are needed. To address this challenge, we explore the techniques of score distillation, which has shown great success in reducing the number of sampling steps in the vision domain while maintaining high generation quality. However, a straightforward adaptation of these methods results in unacceptably low designability. Through extensive study, we have identified how to appropriately adapt Score identity Distillation (SiD), a state-of-the-art score distillation strategy, to train few-step protein backbone generators which significantly reduce sampling time, while maintaining comparable performance to their pretrained teacher model. In particular, multistep generation combined with inference time noise modulation is key to the success. We demonstrate that our distilled few-step generators achieve more than a 20-fold improvement in sampling speed, while achieving similar levels of designability, diversity, and novelty as the Proteina teacher model. This reduction in inference cost enables large-scale in silico protein design, thereby bringing diffusion-based models closer to real-world protein engineering applications. The PyTorch implementation is available at https://github.com/LY-Xie/SiD_Protein
[528] Practical Bayes-Optimal Membership Inference Attacks
Marcus Lassila, Johan Östman, Khac-Hoang Ngo, Alexandre Graell i Amat
Main category: cs.LG
TL;DR: The paper develops practical membership inference attacks (MIAs) for both i.i.d. and graph-structured data, introducing BASE and G-BASE as tractable approximations of Bayes-optimal attacks that outperform prior methods with lower computational cost.
Details
Motivation: To address key open questions about optimal query strategies in graph settings and develop theoretically grounded membership inference attacks that are both practical and effective across different data types.Method: Building on Bayesian decision-theoretic framework, the authors derive Bayes-optimal membership inference rules for node-level MIAs against graph neural networks, then introduce BASE and G-BASE as tractable approximations.
Result: G-BASE achieves superior performance compared to prior classifier-based node-level MIA attacks. BASE matches or exceeds state-of-the-art MIAs (LiRA, RMIA) at significantly lower computational cost. The paper also shows equivalence between BASE and RMIA under specific hyperparameter settings.
Conclusion: The proposed BASE and G-BASE attacks provide principled, Bayes-optimal justifications for membership inference, offering superior performance and computational efficiency while establishing theoretical connections to existing methods.
Abstract: We develop practical and theoretically grounded membership inference attacks (MIAs) against both independent and identically distributed (i.i.d.) data and graph-structured data. Building on the Bayesian decision-theoretic framework of Sablayrolles et al., we derive the Bayes-optimal membership inference rule for node-level MIAs against graph neural networks, addressing key open questions about optimal query strategies in the graph setting. We introduce BASE and G-BASE, tractable approximations of the Bayes-optimal membership inference. G-BASE achieves superior performance compared to previously proposed classifier-based node-level MIA attacks. BASE, which is also applicable to non-graph data, matches or exceeds the performance of prior state-of-the-art MIAs, such as LiRA and RMIA, at a significantly lower computational cost. Finally, we show that BASE and RMIA are equivalent under a specific hyperparameter setting, providing a principled, Bayes-optimal justification for the RMIA attack.
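The exact BASE and G-BASE rules are derived in the paper and not reproduced here; as background for the reference-model attack family compared against (LiRA, RMIA), a generic membership score can be sketched as follows, with all numerical choices illustrative.

```python
import numpy as np
from scipy.stats import norm

def membership_score(target_loss, reference_losses):
    """Generic reference-model membership score (an illustration of the attack
    family, not the paper's BASE rule): compare the target model's loss on a
    sample with the loss distribution of reference models trained without it."""
    mu = reference_losses.mean()
    sigma = reference_losses.std() + 1e-8
    # A loss much lower than the "out" reference distribution suggests the
    # sample was a training member; a Gaussian tail turns this into a score.
    return norm.cdf((mu - target_loss) / sigma)

# usage: higher scores indicate likely membership
ref_losses = np.array([2.1, 1.9, 2.4, 2.0])
print(membership_score(0.6, ref_losses))   # close to 1 -> likely member
print(membership_score(2.2, ref_losses))   # about 0.3 -> likely non-member
```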
[529] Think Just Enough: Sequence-Level Entropy as a Confidence Signal for LLM Reasoning
Aman Sharma, Paras Chopra
Main category: cs.LG
TL;DR: Novel entropy-based framework uses token-level Shannon entropy as confidence signal for early stopping in LLM reasoning tasks, achieving 25-50% computational savings while maintaining accuracy.
Details
Motivation: To improve token efficiency in large language models during reasoning tasks by exploiting emergent confidence awareness in advanced reasoning models.Method: Uses Shannon entropy from token-level logprobs as confidence signal to enable early stopping, with entropy threshold calculated using few examples from reasoning datasets.
Result: Achieves 25-50% computational savings while maintaining task accuracy across reasoning-optimized model families.
Conclusion: Entropy-based confidence calibration is an emergent property of advanced post-training optimization, revealing confidence mechanisms as distinguishing characteristic of modern reasoning systems.
Abstract: We introduce a simple, yet novel entropy-based framework to drive token efficiency in large language models during reasoning tasks. Our approach uses Shannon entropy from token-level logprobs as a confidence signal to enable early stopping, achieving 25-50% computational savings while maintaining task accuracy. Crucially, we demonstrate that entropy-based confidence calibration represents an emergent property of advanced post-training optimization present in modern reasoning models but notably absent in standard instruction-tuned and pre-trained models (Llama 3.3 70B). We show that the entropy threshold to stop reasoning varies from model to model but can be calculated easily in one shot using only a few examples from existing reasoning datasets. Our results indicate that advanced reasoning models often know that they’ve gotten a correct answer early on, and that this emergent confidence awareness can be exploited to save tokens and reduce latency. The framework demonstrates consistent performance across reasoning-optimized model families with 25-50% computational cost reduction while preserving accuracy, revealing that confidence mechanisms represent a distinguishing characteristic of modern post-trained reasoning systems versus their predecessors.
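A minimal sketch of the confidence signal described here: average the Shannon entropy of the per-token distributions generated so far and stop once it falls below a threshold calibrated on a few examples. The `token_logprob_dists` format (top-candidate log-probs per generated token) is an assumption, and restricting to top candidates makes the entropy an approximation.

```python
import math

def sequence_entropy(token_logprob_dists):
    """Mean Shannon entropy (nats) over the per-token distributions seen so far;
    a low value signals the model is already confident in its answer."""
    entropies = []
    for dist in token_logprob_dists:          # dist: log-probs of the top candidates
        probs = [math.exp(lp) for lp in dist]
        entropies.append(-sum(p * math.log(p + 1e-12) for p in probs))
    return sum(entropies) / max(len(entropies), 1)

def should_stop(token_logprob_dists, threshold):
    """Early-exit check: stop emitting reasoning tokens once the running
    sequence-level entropy drops below a calibrated threshold."""
    return sequence_entropy(token_logprob_dists) < threshold

# usage: two confident tokens (peaked distributions) fall under a 0.5-nat threshold
dists = [[math.log(0.95), math.log(0.05)], [math.log(0.9), math.log(0.1)]]
print(should_stop(dists, threshold=0.5))   # True
```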
[530] Uni-LoRA: One Vector is All You Need
Kaiyang Li, Shaobo Han, Qing Su, Wei Li, Zhipeng Cai, Shihao Ji
Main category: cs.LG
TL;DR: Uni-LoRA introduces a unified framework for parameter-efficient fine-tuning of LLMs that reconstructs LoRA parameters from a single trainable vector using an isometric projection matrix, achieving state-of-the-art parameter efficiency.
Details
Motivation: Existing LoRA variants use layer-wise projections that limit cross-layer parameter sharing and compromise parameter efficiency. There's a need for a more unified and efficient approach.Method: Proposes Uni-LoRA framework that reconstructs LoRA parameters from a low-dimensional subspace using an isometric projection matrix, enabling global parameter sharing with just one trainable vector for the entire LLM.
Result: Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks show Uni-LoRA achieves state-of-the-art parameter efficiency while matching or outperforming prior approaches in predictive performance.
Conclusion: Uni-LoRA provides both a unified theoretical framework and a practical “one-vector-only” solution that significantly improves parameter efficiency in LLM fine-tuning while maintaining strong performance.
Abstract: Low-Rank Adaptation (LoRA) has become the de facto parameter-efficient fine-tuning (PEFT) method for large language models (LLMs) by constraining weight updates to low-rank matrices. Recent works such as Tied-LoRA, VeRA, and VB-LoRA push efficiency further by introducing additional constraints to reduce the trainable parameter space. In this paper, we show that the parameter space reduction strategies employed by these LoRA variants can be formulated within a unified framework, Uni-LoRA, where the LoRA parameter space, flattened as a high-dimensional vector space $R^D$, can be reconstructed through a projection from a subspace $R^d$, with $d \ll D$. We demonstrate that the fundamental difference among various LoRA methods lies in the choice of the projection matrix, $P \in R^{D \times d}$. Most existing LoRA variants rely on layer-wise or structure-specific projections that limit cross-layer parameter sharing, thereby compromising parameter efficiency. In light of this, we introduce an efficient and theoretically grounded projection matrix that is isometric, enabling global parameter sharing and reducing computation overhead. Furthermore, under the unified view of Uni-LoRA, this design requires only a single trainable vector to reconstruct LoRA parameters for the entire LLM - making Uni-LoRA both a unified framework and a “one-vector-only” solution. Extensive experiments on GLUE, mathematical reasoning, and instruction tuning benchmarks demonstrate that Uni-LoRA achieves state-of-the-art parameter efficiency while outperforming or matching prior approaches in predictive performance. Our code is available at https://github.com/KaiyangLi1992/Uni-LoRA.
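A minimal sketch of the one-vector idea, assuming a fixed random projection with orthonormal columns in place of whatever isometric construction the paper uses: all flattened LoRA factors are reconstructed as theta = P v from a single trainable vector v of dimension d << D. Shapes and names are illustrative.

```python
import torch
import torch.nn as nn

class OneVectorLoRA(nn.Module):
    """Reconstruct every LoRA factor from one trainable d-dimensional vector
    via a fixed projection P with orthonormal (isometric) columns."""

    def __init__(self, layer_shapes, d=128):
        super().__init__()
        self.layer_shapes = layer_shapes                  # e.g. [(r, in), (out, r), ...]
        D = sum(torch.Size(s).numel() for s in layer_shapes)
        P, _ = torch.linalg.qr(torch.randn(D, d))         # orthonormal columns
        self.register_buffer("P", P)                      # fixed, never trained
        self.v = nn.Parameter(torch.zeros(d))             # the only trainable vector

    def materialize(self):
        theta = self.P @ self.v                           # reconstruct all LoRA params
        out, offset = [], 0
        for shape in self.layer_shapes:
            n = torch.Size(shape).numel()
            out.append(theta[offset:offset + n].view(*shape))
            offset += n
        return out

# usage: reconstruct A/B factors for two layers from one 128-dim vector
lora = OneVectorLoRA([(8, 768), (768, 8), (8, 768), (768, 8)], d=128)
A1, B1, A2, B2 = lora.materialize()
```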
[531] Two-Stage Learning of Stabilizing Neural Controllers via Zubov Sampling and Iterative Domain Expansion
Haoyu Li, Xiangru Zhong, Bin Hu, Huan Zhang
Main category: cs.LG
TL;DR: A two-stage training framework for neural network controllers that jointly synthesizes controllers and Lyapunov functions, using Zubov-inspired ROA estimation and neural network verification to reduce conservatism and improve verification speed.
Details
Motivation: Existing neural network controllers lack stability guarantees and have conservative region of attraction estimates due to limitations in training and verification algorithms for continuous-time systems.Method: Two-stage training framework with Zubov-inspired ROA characterization, novel training-data sampling, domain-updating mechanism, and extension of α,β-CROWN neural network verifier for continuous systems with automatic Jacobian bound propagation.
Result: Achieved ROA volumes 5-150,000 times larger than baselines, verification speed 40-10,000 times faster than SMT solver dReal on challenging nonlinear systems.
Conclusion: The proposed framework significantly reduces conservatism in neural controller training and enables efficient formal verification for continuous-time systems, providing practical stability guarantees.
Abstract: Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduces the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend the state-of-the-art neural network verifier $\alpha,\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield regions of attraction with volumes $5 - 1.5\cdot 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40-10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
[532] The Formalism-Implementation Gap in Reinforcement Learning Research
Pablo Samuel Castro
Main category: cs.LG
TL;DR: RL research should shift focus from demonstrating agent capabilities to understanding learning dynamics and improving benchmark precision to enable better transfer to real-world problems.
Details
Motivation: Current RL research prioritizes performance demonstrations over understanding learning dynamics, risking overfitting on benchmarks and making techniques difficult to transfer to novel problems.Method: The paper argues for a paradigm shift in RL research methodology, using the Arcade Learning Environment as an example of how benchmarks can be better utilized for understanding rather than just performance measurement.
Result: The analysis shows that performance-focused research diminishes work aimed at understanding RL techniques and makes benchmarks less useful for real-world deployment.
Conclusion: RL research needs to focus more on advancing scientific understanding of learning dynamics and be more precise about how benchmarks map to mathematical formalisms to facilitate real-world impact.
Abstract: The last decade has seen an upswing in interest and adoption of reinforcement learning (RL) techniques, in large part due to its demonstrated capabilities at performing certain tasks at “super-human levels”. This has incentivized the community to prioritize research that demonstrates RL agent performance, often at the expense of research aimed at understanding their learning dynamics. Performance-focused research runs the risk of overfitting on academic benchmarks – thereby rendering them less useful – which can make it difficult to transfer proposed techniques to novel problems. Further, it implicitly diminishes work that does not push the performance-frontier, but aims at improving our understanding of these techniques. This paper argues two points: (i) RL research should stop focusing solely on demonstrating agent capabilities, and focus more on advancing the science and understanding of reinforcement learning; and (ii) we need to be more precise on how our benchmarks map to the underlying mathematical formalisms. We use the popular Arcade Learning Environment (ALE; Bellemare et al., 2013) as an example of a benchmark that, despite being increasingly considered “saturated”, can be effectively used for developing this understanding, and facilitating the deployment of RL techniques in impactful real-world problems.
[533] RDB2G-Bench: A Comprehensive Benchmark for Automatic Graph Modeling of Relational Databases
Dongwon Choi, Sunwoo Kim, Juyeon Kim, Kyungho Kim, Geon Lee, Shinhwan Kang, Myunghwan Kim, Kijung Shin
Main category: cs.LG
TL;DR: RDB2G-Bench is a benchmark framework for evaluating RDB-to-graph modeling methods, featuring 5 real-world databases and 12 predictive tasks with 50k graph model-performance pairs for efficient evaluation.
Details
Motivation: Effective modeling of relational databases into graphs is challenging due to numerous possible modeling approaches, with performance varying significantly (up to 10% difference) depending on the chosen graph model.Method: Created RDB2G-Bench with extensive datasets covering 5 real-world RDBs and 12 predictive tasks, enabling benchmarking of 10 automatic RDB-to-graph modeling methods 380x faster than on-the-fly evaluation.
Result: The benchmark framework enables efficient evaluation and reveals key structural patterns affecting graph model effectiveness, with practical implications for effective graph modeling.
Conclusion: RDB2G-Bench provides the first comprehensive benchmark for RDB-to-graph modeling research, facilitating reproducible evaluations and revealing important insights for effective graph construction from relational databases.
Abstract: Recent advances have demonstrated the effectiveness of graph-based learning on relational databases (RDBs) for predictive tasks. Such approaches require transforming RDBs into graphs, a process we refer to as RDB-to-graph modeling, where rows of tables are represented as nodes and foreign-key relationships as edges. Yet, effective modeling of RDBs into graphs remains challenging. Specifically, there exist numerous ways to model RDBs into graphs, and performance on predictive tasks varies significantly depending on the chosen graph model of RDBs. In our analysis, we find that the best-performing graph model can yield up to a 10% higher performance compared to the common heuristic rule for graph modeling, which remains non-trivial to identify. To foster research on intelligent RDB-to-graph modeling, we introduce RDB2G-Bench, the first benchmark framework for evaluating such methods. We construct extensive datasets covering 5 real-world RDBs and 12 predictive tasks, resulting in around 50k graph model-performance pairs for efficient and reproducible evaluations. Thanks to our precomputed datasets, we were able to benchmark 10 automatic RDB-to-graph modeling methods on the 12 tasks about 380x faster than on-the-fly evaluation, which requires repeated GNN training. Our analysis of the datasets and benchmark results reveals key structural patterns affecting graph model effectiveness, along with practical implications for effective graph modeling. Our datasets and code are available at https://github.com/chlehdwon/RDB2G-Bench.
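For readers unfamiliar with RDB-to-graph modeling, the toy sketch below builds one such graph model (rows as nodes, foreign keys as edges) with pandas and networkx. The schema and helper names are made up for illustration; the benchmark's point is precisely that many alternative graph models exist and their downstream performance differs.

```python
import networkx as nx
import pandas as pd

def rdb_to_graph(tables, foreign_keys):
    """Basic RDB-to-graph modeling: each table row becomes a node and each
    foreign-key reference becomes an edge (one of many possible graph models)."""
    g = nx.Graph()
    for name, df in tables.items():
        for idx, row in df.iterrows():
            g.add_node((name, idx), table=name, **row.to_dict())
    for (child, fk_col), (parent, pk_col) in foreign_keys.items():
        for idx, row in tables[child].iterrows():
            matches = tables[parent].index[tables[parent][pk_col] == row[fk_col]]
            if len(matches) > 0:
                g.add_edge((child, idx), (parent, matches[0]))
    return g

# usage with a toy schema: orders.user_id -> users.id
tables = {
    "users": pd.DataFrame({"id": [1, 2], "age": [34, 27]}),
    "orders": pd.DataFrame({"id": [10, 11, 12], "user_id": [1, 1, 2]}),
}
fks = {("orders", "user_id"): ("users", "id")}
graph = rdb_to_graph(tables, fks)
print(graph.number_of_nodes(), graph.number_of_edges())   # 5 nodes, 3 edges
```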
[534] Trade-offs in Data Memorization via Strong Data Processing Inequalities
Vitaly Feldman, Guy Kornowski, Xin Lyu
Main category: cs.LG
TL;DR: This paper develops a general approach to prove lower bounds on data memorization in machine learning, showing that simple binary classification problems require memorizing Ω(d) bits of training data when only O(1) d-dimensional examples are available.
Details
Motivation: The motivation stems from concerns about privacy violations in large language models that memorize training data, particularly when dealing with sensitive user information.Method: The authors develop a general approach connecting strong data processing inequalities with data memorization, and analyze simple binary classification problems and mixture-of-clusters models.
Result: The paper demonstrates that learning algorithms must memorize Ω(d) bits of training data when only O(1) d-dimensional examples are available, with this requirement decaying as more examples become available.
Conclusion: The work establishes fundamental trade-offs between sample size and data memorization, with lower bounds that are generally matched by simple learning algorithms, addressing limitations in prior work by Brown et al. (2021).
Abstract: Recent research demonstrated that training large language models involves memorization of a significant fraction of training data. Such memorization can lead to privacy violations when training on sensitive user data and thus motivates the study of data memorization’s role in learning. In this work, we develop a general approach for proving lower bounds on excess data memorization, that relies on a new connection between strong data processing inequalities and data memorization. We then demonstrate that several simple and natural binary classification problems exhibit a trade-off between the number of samples available to a learning algorithm, and the amount of information about the training data that a learning algorithm needs to memorize to be accurate. In particular, $\Omega(d)$ bits of information about the training data need to be memorized when $O(1)$ $d$-dimensional examples are available, which then decays as the number of examples grows at a problem-specific rate. Further, our lower bounds are generally matched (up to logarithmic factors) by simple learning algorithms. We also extend our lower bounds to more general mixture-of-clusters models. Our definitions and results build on the work of Brown et al. (2021) and address several limitations of the lower bounds in their work.
[535] MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems
Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu
Main category: cs.LG
TL;DR: The paper proposes a new benchmark for evaluating LLM systems’ continual learning abilities from user feedback, addressing limitations in existing memory benchmarks that focus on homogeneous reading comprehension tasks.
Details
Motivation: Current scaling methods for LLM systems are reaching their limits due to data depletion and diminishing returns. Inspired by human learning abilities, there's a need to develop memory and continual learning frameworks for LLMs, but existing benchmarks don't adequately test learning from accumulated user feedback.Method: The authors developed a user feedback simulation framework and comprehensive benchmark covering multiple domains, languages, and task types to evaluate LLM systems’ continual learning capabilities.
Result: Experiments revealed that state-of-the-art baselines perform poorly in terms of both effectiveness and efficiency when tested on the proposed benchmark.
Conclusion: The benchmark provides a foundation for future research on LLM memory and optimization algorithms, highlighting the need for improved continual learning approaches in LLM systems.
Abstract: Scaling up data, parameters, and test-time computation have been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained from larger computational resource consumption. Inspired by the abilities of human and traditional AI systems in learning from practice, constructing memory and continual learning frameworks for LLMsys has become an important and popular research direction in recent literature. Yet, existing benchmarks for LLM memory often focus on evaluating the system on homogeneous reading comprehension tasks with long-form inputs rather than testing their abilities to learn from accumulated user feedback in service time. Therefore, we propose a user feedback simulation framework and a comprehensive benchmark covering multiple domains, languages, and types of tasks to evaluate the continual learning abilities of LLMsys. Experiments show that the effectiveness and efficiency of state-of-the-art baselines are far from satisfactory, and we hope this benchmark could pave the way for future studies on LLM memory and optimization algorithms.
[536] GeoClip: Geometry-Aware Clipping for Differentially Private SGD
Atefeh Gilani, Naima Tasnim, Lalitha Sankar, Oliver Kosut
Main category: cs.LG
TL;DR: GeoClip is a geometry-aware DP-SGD framework that clips gradients in a transformed basis aligned with gradient distribution geometry, outperforming existing adaptive clipping methods under the same privacy budget.
Details
Motivation: Standard DP-SGD methods fail to account for correlations across gradient coordinates when setting the clipping threshold, which significantly affects the privacy-utility trade-off.Method: GeoClip adaptively estimates a transformation using previously released noisy gradients (no additional privacy cost), clips and perturbs gradients in a transformed basis aligned with gradient distribution geometry, and provides a closed-form solution for optimal transformation.
Result: Experiments on tabular and image datasets show GeoClip consistently outperforms existing adaptive clipping methods under the same privacy budget.
Conclusion: GeoClip provides an effective geometry-aware approach for DP-SGD that better accounts for gradient correlations and improves the privacy-utility trade-off without additional privacy costs.
Abstract: Differentially private stochastic gradient descent (DP-SGD) is the most widely used method for training machine learning models with provable privacy guarantees. A key challenge in DP-SGD is setting the per-sample gradient clipping threshold, which significantly affects the trade-off between privacy and utility. While recent adaptive methods improve performance by adjusting this threshold during training, they operate in the standard coordinate system and fail to account for correlations across the coordinates of the gradient. We propose GeoClip, a geometry-aware framework that clips and perturbs gradients in a transformed basis aligned with the geometry of the gradient distribution. GeoClip adaptively estimates this transformation using only previously released noisy gradients, incurring no additional privacy cost. We provide convergence guarantees for GeoClip and derive a closed-form solution for the optimal transformation that minimizes the amount of noise added while keeping the probability of gradient clipping under control. Experiments on both tabular and image datasets demonstrate that GeoClip consistently outperforms existing adaptive clipping methods under the same privacy budget.
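A rough sketch of geometry-aware clipping, assuming a simple empirical-covariance estimate and Cholesky whitening rather than the paper's closed-form optimal transformation: per-sample gradients are clipped and noised in a whitened basis built from previously released noisy gradients, then mapped back.

```python
import torch

def geometry_aware_clip(per_sample_grads, past_noisy_grads, clip_norm, noise_mult):
    """Illustrative geometry-aware DP-SGD step (not the paper's exact estimator)."""
    # Covariance estimated from already-released (hence privacy-free) noisy gradients
    G = torch.stack(past_noisy_grads)
    centered = G - G.mean(0)
    cov = centered.T @ centered / max(G.shape[0] - 1, 1)
    cov += 1e-4 * torch.eye(cov.shape[0])                 # regularize for stability
    L = torch.linalg.cholesky(cov)                        # cov = L L^T
    W = torch.linalg.inv(L)                               # whitening transform

    clipped = []
    for g in per_sample_grads:
        z = W @ g                                         # move to the whitened basis
        z = z * torch.clamp(clip_norm / (z.norm() + 1e-12), max=1.0)
        clipped.append(z)
    noise = noise_mult * clip_norm * torch.randn_like(clipped[0])
    released = (torch.stack(clipped).sum(0) + noise) / len(clipped)
    return L @ released                                   # map released gradient back

# usage with toy 4-dimensional gradients
past = [torch.randn(4) for _ in range(16)]
grads = [torch.randn(4) for _ in range(8)]
g_released = geometry_aware_clip(grads, past, clip_norm=1.0, noise_mult=1.0)
```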
[537] CausalPFN: Amortized Causal Effect Estimation via In-Context Learning
Vahid Balazadeh, Hamidreza Kamkari, Valentin Thomas, Benson Li, Junwei Ma, Jesse C. Cresswell, Rahul G. Krishnan
Main category: cs.LG
TL;DR: CausalPFN is a transformer-based model that automates causal effect estimation from observational data without requiring manual estimator selection or domain expertise.
Details
Motivation: Current causal effect estimation requires substantial manual effort and domain expertise to select appropriate estimators from dozens of specialized methods.Method: Train a single transformer on a large library of simulated data-generating processes that satisfy ignorability, combining Bayesian causal inference with prior-fitted networks to map raw observations directly to causal effects.
Result: Achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC) and competitive performance for real-world policy making on uplift modeling tasks.
Conclusion: CausalPFN provides a ready-to-use model with calibrated uncertainty estimates that requires no further training or tuning, advancing toward automated causal inference.
Abstract: Causal effect estimation from observational data is fundamental across various applications. However, selecting an appropriate estimator from dozens of specialized methods demands substantial manual effort and domain expertise. We present CausalPFN, a single transformer that amortizes this workflow: trained once on a large library of simulated data-generating processes that satisfy ignorability, it infers causal effects for new observational datasets out of the box. CausalPFN combines ideas from Bayesian causal inference with the large-scale training protocol of prior-fitted networks (PFNs), learning to map raw observations directly to causal effects without any task-specific adjustment. Our approach achieves superior average performance on heterogeneous and average treatment effect estimation benchmarks (IHDP, Lalonde, ACIC). Moreover, it shows competitive performance for real-world policy making on uplift modeling tasks. CausalPFN provides calibrated uncertainty estimates to support reliable decision-making based on Bayesian principles. This ready-to-use model requires no further training or tuning and takes a step toward automated causal inference (https://github.com/vdblm/CausalPFN/).
[538] Apollo: A Posteriori Label-Only Membership Inference Attack Towards Machine Unlearning
Liou Tang, James Joshi, Ashish Kundu
Main category: cs.LG
TL;DR: The paper proposes Apollo, a label-only membership inference attack against machine unlearning that can identify unlearned samples using only the model’s label outputs, without needing access to the original model.
Details
Motivation: Machine unlearning increases privacy risks by creating new attack surfaces. Existing attacks require access to both original and unlearned models, which is unrealistic in real scenarios.Method: Developed A Posteriori Label-Only Membership Inference Attack (Apollo) that uses only the label outputs from the unlearned model to infer whether samples were removed.
Result: The attack achieves high precision in identifying unlearned samples despite having less model access than previous methods.
Conclusion: Machine unlearning introduces new privacy vulnerabilities, and Apollo demonstrates effective attacks are possible even under strict threat models with limited model access.
Abstract: Machine Unlearning (MU) aims to update Machine Learning (ML) models following requests to remove training samples and their influences on a trained model efficiently without retraining the original ML model from scratch. While MU itself has been employed to provide privacy protection and regulatory compliance, it can also increase the attack surface of the model. Existing privacy inference attacks towards MU that aim to infer properties of the unlearned set rely on the weaker threat model that assumes the attacker has access to both the unlearned model and the original model, limiting their feasibility toward real-life scenarios. We propose a novel privacy attack, A Posteriori Label-Only Membership Inference Attack towards MU, Apollo, that infers whether a data sample has been unlearned, following a strict threat model where an adversary has access to the label-output of the unlearned model only. We demonstrate that our proposed attack, while requiring less access to the target model compared to previous attacks, can achieve relatively high precision on the membership status of the unlearned samples.
[539] Equivariance Everywhere All At Once: A Recipe for Graph Foundation Models
Ben Finkelshtein, İsmail İlkan Ceylan, Michael Bronstein, Ron Levie
Main category: cs.LG
TL;DR: The paper presents a recipe for building graph foundation models that generalize across arbitrary graphs and features by investigating necessary symmetries: node permutation-equivariance, label permutation-equivariance, and feature permutation-invariance.
Details
Motivation: Current graph machine learning architectures are tailored to specific tasks and datasets, limiting their broader applicability. The goal is to develop graph foundation models capable of generalizing across arbitrary graphs and features.Method: Systematically investigates required symmetries, characterizes linear transformations equivariant to node/label permutations and invariant to feature permutations, proves universality on multisets, and applies these layers to local graph neighborhoods.
Result: Validated on 29 real-world node classification datasets, showing strong zero-shot performance and consistent improvement as training graphs increase.
Conclusion: The proposed symmetry-based approach provides a principled foundation for building graph foundation models that achieve strong generalization across diverse graph datasets.
Abstract: Graph machine learning architectures are typically tailored to specific tasks on specific datasets, which hinders their broader applicability. This has led to a new quest in graph machine learning: how to build graph foundation models capable of generalizing across arbitrary graphs and features? In this work, we present a recipe for designing graph foundation models for node-level tasks from first principles. The key ingredient underpinning our study is a systematic investigation of the symmetries that a graph foundation model must respect. In a nutshell, we argue that label permutation-equivariance alongside feature permutation-invariance are necessary in addition to the common node permutation-equivariance on each local neighborhood of the graph. To this end, we first characterize the space of linear transformations that are equivariant to permutations of nodes and labels, and invariant to permutations of features. We then prove that the resulting network is a universal approximator on multisets that respect the aforementioned symmetries. Our recipe uses such layers on the multiset of features induced by the local neighborhood of the graph to obtain a class of graph foundation models for node property prediction. We validate our approach through extensive experiments on 29 real-world node classification datasets, demonstrating both strong zero-shot empirical performance and consistent improvement as the number of training graphs increases.
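The simplest building block behind such symmetry arguments is a linear layer that is equivariant to permutations of the set elements. The DeepSets-style sketch below covers only this node symmetry; the paper's layers additionally handle label-permutation-equivariance and feature-permutation-invariance, which are not shown here.

```python
import torch
import torch.nn as nn

class EquivariantSetLayer(nn.Module):
    """Linear layer equivariant to permutations of the set elements
    (here: the nodes of a local neighborhood)."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_self = nn.Linear(in_dim, out_dim, bias=False)
        self.w_mean = nn.Linear(in_dim, out_dim, bias=True)

    def forward(self, x):                        # x: (num_nodes, in_dim)
        pooled = x.mean(dim=0, keepdim=True)     # permutation-invariant summary
        return self.w_self(x) + self.w_mean(pooled)   # broadcasting keeps equivariance

# permuting the rows of x permutes the rows of the output identically
layer = EquivariantSetLayer(16, 8)
x = torch.randn(5, 16)
perm = torch.randperm(5)
assert torch.allclose(layer(x)[perm], layer(x[perm]), atol=1e-6)
```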
[540] PPFL-RDSN: Privacy-Preserving Federated Learning-based Residual Dense Spatial Networks for Encrypted Lossy Image Reconstruction
Peilin He, James Joshi
Main category: cs.LG
TL;DR: A Privacy-Preserving Federated Learning framework for Residual Dense Spatial Networks (PPFL-RDSN) that enables secure collaborative image reconstruction while protecting data privacy and reducing computational costs.
Details
Motivation: Centralized training for image reconstruction poses privacy risks (data leakage, inference attacks) and high computational costs when multiple parties collaborate. There's a need for secure collaborative learning without sharing raw data.Method: Integrates Federated Learning, local differential privacy, and robust model watermarking techniques to keep data on local devices, protect privacy-sensitive information, and maintain model authenticity without revealing underlying data.
Result: Achieves comparable performance to state-of-the-art centralized methods while reducing computational burdens and effectively mitigating security and privacy vulnerabilities.
Conclusion: PPFL-RDSN provides a practical solution for secure and privacy-preserving collaborative computer vision applications, balancing performance with privacy protection.
Abstract: Reconstructing high-quality images from low-resolution inputs using Residual Dense Spatial Networks (RDSNs) is crucial yet challenging. It is even more challenging in centralized training where multiple collaborating parties are involved, as it poses significant privacy risks, including data leakage and inference attacks, as well as high computational and communication costs. We propose a novel Privacy-Preserving Federated Learning-based RDSN (PPFL-RDSN) framework specifically tailored for encrypted lossy image reconstruction. PPFL-RDSN integrates Federated Learning (FL), local differential privacy, and robust model watermarking techniques to ensure that data remains secure on local clients/devices, safeguards privacy-sensitive information, and maintains model authenticity without revealing underlying data. Empirical evaluations show that PPFL-RDSN achieves comparable performance to the state-of-the-art centralized methods while reducing computational burdens, and effectively mitigates security and privacy vulnerabilities, making it a practical solution for secure and privacy-preserving collaborative computer vision applications.
[541] Robust Uncertainty Quantification for Self-Evolving Large Language Models via Continual Domain Pretraining
Xiaofan Zhou, Lu Cheng
Main category: cs.LG
TL;DR: Proposes an adaptive rejection and non-exchangeable conformal prediction framework for continual domain pretraining of LLMs, addressing distribution shifts and improving reliability guarantees.
Details
Motivation: Continual learning is crucial for LLMs to adapt to evolving knowledge, but lacks statistical reliability guarantees, especially in continual domain pretraining where test data distributions shift.Method: Uses transformer-based clustering to estimate test domain distributions, then reweights/resamples calibration data, combined with adaptive rejection CP that allows selective abstention when confidence shifts.
Result: Extensive experiments show the framework enhances both effectiveness and reliability of conformal prediction under continual domain pretraining scenarios.
Conclusion: The proposed framework successfully addresses challenges in continual domain pretraining by providing adaptive rejection mechanisms and handling non-exchangeable data distributions.
Abstract: Continual Learning (CL) is essential for enabling self-evolving large language models (LLMs) to adapt and remain effective amid rapid knowledge growth. Yet, despite its importance, little attention has been given to establishing statistical reliability guarantees for LLMs under CL, particularly in the setting of continual domain pretraining (CDP). Conformal Prediction (CP) has shown promise in offering correctness guarantees for LLMs, but it faces major challenges in CDP: testing data often stems from unknown or shifting domain distributions, under which CP may no longer provide valid guarantees. Moreover, when high coverage is required, CP can yield excessively large prediction sets for unanswerable queries, reducing informativeness. To address these challenges, we introduce an adaptive rejection and non-exchangeable CP framework. Our method first estimates the distribution of questions across domains in the test set using transformer-based clustering, then reweights or resamples the calibration data accordingly. Building on this, adaptive rejection CP allows the LLM to selectively abstain from answering when its confidence or competence shifts significantly. Extensive experiments demonstrate that our framework enhances both the effectiveness and reliability of CP under CDP scenarios. Our code is available at: https://anonymous.4open.science/r/CPCL-8C12/
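For orientation, here is a minimal sketch of the two ingredients named above in generic form: a weighted (non-exchangeable) conformal quantile computed from reweighted calibration scores, and a prediction-set rule with an abstention option. The weighting scheme, score definition, and `abstain_margin` parameter are placeholders, not the paper's estimator.

```python
import numpy as np

def weighted_conformal_quantile(cal_scores, cal_weights, alpha):
    """Weighted conformal quantile: calibration scores reweighted, e.g. toward an
    estimated test-domain mixture, then the (1 - alpha) weighted quantile is taken."""
    order = np.argsort(cal_scores)
    scores, weights = cal_scores[order], cal_weights[order]
    cum = np.cumsum(weights / weights.sum())
    idx = min(np.searchsorted(cum, 1.0 - alpha), len(scores) - 1)
    return scores[idx]

def prediction_set(candidate_scores, threshold, abstain_margin=None):
    """Keep candidates whose nonconformity score is below the threshold;
    optionally abstain when even the best candidate is far above it."""
    if abstain_margin is not None and min(candidate_scores) > threshold + abstain_margin:
        return None                                          # adaptive rejection
    return [i for i, s in enumerate(candidate_scores) if s <= threshold]

# usage: calibration scores reweighted toward an estimated test-domain mixture
cal_scores = np.array([0.2, 0.5, 0.9, 1.4, 2.0])
cal_weights = np.array([1.0, 1.0, 2.0, 2.0, 0.5])   # e.g. cluster-based domain weights
tau = weighted_conformal_quantile(cal_scores, cal_weights, alpha=0.1)
print(prediction_set([0.3, 1.1, 2.5], tau, abstain_margin=1.0))   # [0, 1]
```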
[542] Sample Complexity Bounds for Linear Constrained MDPs with a Generative Model
Xingtu Liu, Lin F. Yang, Sharan Vaswani
Main category: cs.LG
TL;DR: Proposes a primal-dual framework for solving constrained Markov decision processes (CMDPs) that leverages black-box unconstrained MDP solvers, providing sample complexity bounds for both relaxed and strict feasibility settings.
Details
Motivation: To develop efficient algorithms for constrained MDPs that can handle both approximate and exact constraint satisfaction while achieving near-optimal sample complexity.Method: Uses a primal-dual framework with mirror descent value iteration (MDVI) as the MDP solver, analyzing sample complexity for linear CMDPs with feature dimension d.
Result: For relaxed feasibility: Õ(d²/(1-γ)⁴ε²) samples; For strict feasibility: Õ(d²/(1-γ)⁶ε²ζ²) samples, with lower bound Ω(d²/(1-γ)⁵ε²ζ²). Near-optimal dependence on d, ε, and ζ.
Conclusion: The proposed framework achieves near-optimal sample complexity for CMDPs and can be instantiated for tabular settings to recover known near-optimal results.
Abstract: We consider infinite-horizon $\gamma$-discounted (linear) constrained Markov decision processes (CMDPs) where the objective is to find a policy that maximizes the expected cumulative reward subject to expected cumulative constraints. Given access to a generative model, we propose to solve CMDPs with a primal-dual framework that can leverage any black-box unconstrained MDP solver. For linear CMDPs with feature dimension $d$, we instantiate the framework by using mirror descent value iteration (MDVI; Kitamura et al., 2023) as an example MDP solver. We provide sample complexity bounds for the resulting CMDP algorithm in two cases: (i) relaxed feasibility, where small constraint violations are allowed, and (ii) strict feasibility, where the output policy is required to exactly satisfy the constraint. For (i), we prove that the algorithm can return an $\epsilon$-optimal policy with high probability by using $\tilde{O}\left(\frac{d^2}{(1-\gamma)^4\epsilon^2}\right)$ samples. For (ii), we show that the algorithm requires $\tilde{O}\left(\frac{d^2}{(1-\gamma)^6\epsilon^2\zeta^2}\right)$ samples, where $\zeta$ is the problem-dependent Slater constant that characterizes the size of the feasible region. Furthermore, we prove a lower bound of $\Omega\left(\frac{d^2}{(1-\gamma)^5\epsilon^2\zeta^2}\right)$ for the strict feasibility setting. We note that our upper bounds under both settings exhibit a near-optimal dependence on $d$, $\epsilon$, and $\zeta$. Finally, we instantiate our framework for tabular CMDPs and show that it can be used to recover near-optimal sample complexities in this setting.
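The primal-dual framework around a black-box MDP solver can be sketched generically as below. The function names (`solve_mdp`, `evaluate_cost`), the dual step size, and the toy usage are assumptions; the paper's instantiation uses MDVI with the sample-complexity analysis summarized above.

```python
def primal_dual_cmdp(solve_mdp, evaluate_cost, reward, cost, budget, iters=50, lr=0.1):
    """Generic Lagrangian primal-dual loop: solve the unconstrained MDP with reward
    r - lambda * c, then update lambda by projected gradient ascent on the violation."""
    lam = 0.0
    policies = []
    for _ in range(iters):
        lagrangian = lambda s, a, lam=lam: reward(s, a) - lam * cost(s, a)
        policy = solve_mdp(lagrangian)                   # black-box unconstrained solver
        violation = evaluate_cost(policy) - budget       # expected cumulative cost - threshold
        lam = max(0.0, lam + lr * violation)             # projected dual ascent
        policies.append(policy)
    return policies, lam  # a mixture of the iterates is the usual CMDP output

# toy usage: a one-step "MDP" where the solver is a greedy argmax over two actions
rewards, costs = {0: 1.0, 1: 0.6}, {0: 1.0, 1: 0.2}
policies, lam = primal_dual_cmdp(
    solve_mdp=lambda r: max((0, 1), key=lambda a: r(None, a)),
    evaluate_cost=lambda a: costs[a],
    reward=lambda s, a: rewards[a],
    cost=lambda s, a: costs[a],
    budget=0.5,
)
```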
[543] Eigen-Value: Efficient Domain-Robust Data Valuation via Eigenvalue-Based Approach
Youngjun Choi, Joonseong Kang, Sungjun Lim, Kyungwoo Song
Main category: cs.LG
TL;DR: Eigen-Value (EV) is a plug-and-play data valuation framework that improves out-of-distribution (OOD) robustness using only in-distribution (ID) data, addressing computational inefficiency in existing OOD-aware methods.
Details
Motivation: Existing data valuation methods based on ID loss fail to generalize to OOD settings, and current OOD-aware methods are computationally expensive, limiting practical deployment.Method: EV uses spectral approximation of domain discrepancy via eigenvalue ratios of ID data’s covariance matrix, estimates marginal contributions via perturbation theory, and plugs into ID loss-based methods without additional training.
Result: EV achieves improved OOD robustness and stable value rankings across real-world datasets while remaining computationally lightweight.
Conclusion: EV provides an efficient, practical solution for OOD-robust data valuation in large-scale settings with domain shift.
Abstract: Data valuation has become central in the era of data-centric AI. It drives efficient training pipelines and enables objective pricing in data markets by assigning a numeric value to each data point. Most existing data valuation methods estimate the effect of removing individual data points by evaluating changes in model validation performance under in-distribution (ID) settings, as opposed to out-of-distribution (OOD) scenarios where data follow different patterns. Since ID and OOD data behave differently, data valuation methods based on ID loss often fail to generalize to OOD settings, particularly when the validation set contains no OOD data. Furthermore, although OOD-aware methods exist, they involve heavy computational costs, which hinder practical deployment. To address these challenges, we introduce \emph{Eigen-Value} (EV), a plug-and-play data valuation framework for OOD robustness that uses only an ID data subset, including during validation. EV provides a new spectral approximation of domain discrepancy, i.e., the gap in loss between ID and OOD data, using ratios of eigenvalues of the ID data’s covariance matrix. EV then estimates the marginal contribution of each data point to this discrepancy via perturbation theory, alleviating the computational burden. Subsequently, EV plugs into ID loss-based methods by adding an EV term without any additional training loop. We demonstrate that EV achieves improved OOD robustness and stable value rankings across real-world datasets, while remaining computationally lightweight. These results indicate that EV is practical for large-scale settings with domain shift, offering an efficient path to OOD-robust data valuation.
[544] MH-GIN: Multi-scale Heterogeneous Graph-based Imputation Network for AIS Data (Extended Version)
Hengyu Liu, Tianyi Li, Yuqiang He, Kristian Torp, Yushuai Li, Christian S. Jensen
Main category: cs.LG
TL;DR: MH-GIN is a multi-scale heterogeneous graph-based imputation network that addresses missing values in maritime location-tracking data by capturing multi-scale dependencies between attributes with different update rates.
Details
Motivation: Automatic Identification System data suffers from missing values that hamper maritime safety applications, and existing imputation methods fail to capture multi-scale dependencies between attributes with diverse update rates.Method: Extracts multi-scale temporal features for each attribute while preserving heterogeneous characteristics, then constructs a multi-scale heterogeneous graph to model dependencies between attributes for imputation through graph propagation.
Result: Achieves 57% average reduction in imputation errors compared to state-of-the-art methods while maintaining computational efficiency on two real-world datasets.
Conclusion: MH-GIN effectively captures multi-scale dependencies in heterogeneous maritime data, significantly improving imputation accuracy for location-tracking applications.
Abstract: Location-tracking data from the Automatic Identification System, much of which is publicly available, plays a key role in a range of maritime safety and monitoring applications. However, the data suffers from missing values that hamper downstream applications. Imputing the missing values is challenging because the values of different heterogeneous attributes are updated at diverse rates, resulting in the occurrence of multi-scale dependencies among attributes. Existing imputation methods that assume similar update rates across attributes are unable to capture and exploit such dependencies, limiting their imputation accuracy. We propose MH-GIN, a Multi-scale Heterogeneous Graph-based Imputation Network that aims to improve imputation accuracy by capturing multi-scale dependencies. Specifically, MH-GIN first extracts multi-scale temporal features for each attribute while preserving their intrinsic heterogeneous characteristics. Then, it constructs a multi-scale heterogeneous graph to explicitly model dependencies between heterogeneous attributes to enable more accurate imputation of missing values through graph propagation. Experimental results on two real-world datasets find that MH-GIN is capable of an average 57% reduction in imputation errors compared to state-of-the-art methods, while maintaining computational efficiency. The source code and implementation details of MH-GIN are publicly available https://github.com/hyLiu1994/MH-GIN.
[545] Interpretable Clustering with Adaptive Heterogeneous Causal Structure Learning in Mixed Observational Data
Wenrui Li, Qinghao Zhang, Xiaowo Wang
Main category: cs.LG
TL;DR: HCL is an unsupervised framework that jointly infers latent clusters and their causal structures from observational data, addressing causal heterogeneity without requiring prior knowledge.
Details
Motivation: Existing methods lack causal awareness and struggle with modeling heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations.Method: HCL introduces an equivalent representation encoding structural heterogeneity and confounding, uses bi-directional iterative strategy to refine causal clustering and structure learning, and employs self-supervised regularization to balance cross-cluster universality and specificity.
Result: Theoretically shows identifiability of heterogeneous causal structures under mild conditions. Empirically achieves superior performance in clustering and structure learning, and recovers biologically meaningful mechanisms in single-cell perturbation data.
Conclusion: HCL enables convergence toward interpretable, heterogeneous causal patterns and demonstrates utility for discovering interpretable, mechanism-level causal heterogeneity in scientific domains.
Abstract: Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balances cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity.
[546] High-Energy Concentration for Federated Learning in Frequency Domain
Haozhi Shi, Weiying Xie, Hangyu Ye, Daixun Li, Jitao Ma, Yunsong Li, Leyuan Fang
Main category: cs.LG
TL;DR: FedFD is a frequency-domain federated learning method that uses discrete cosine transform to filter high-frequency noise and redundant information, reducing communication costs while improving performance.
Details
Motivation: Existing federated learning methods using dataset distillation suffer from redundant information and noise in spatial-domain designs, which increases communication burden.Method: Proposes FedFD that applies discrete cosine transform to concentrate energy in specific regions, filters low-energy high-frequency components using a binary mask, and uses real data-driven synthetic classification loss to enhance low-frequency component quality.
Result: Achieves superior performance on five image and speech datasets while reducing communication costs by at least 37.78% on CIFAR-10 with 10.88% performance gain.
Conclusion: Frequency-domain filtering of high-frequency components effectively reduces communication costs and improves performance in federated learning by eliminating redundant information and noise.
Abstract: Federated Learning (FL) presents significant potential for collaborative optimization without data sharing. By sending synthetic data to the server, leveraging the popular concept of dataset distillation, this FL framework protects real data privacy while alleviating data heterogeneity. However, such methods are still challenged by the redundant information and noise in entire spatial-domain designs, which inevitably increases the communication burden. In this paper, we propose a novel Frequency-Domain aware FL method with high-energy concentration (FedFD) to address this problem. Our FedFD is inspired by the discovery that the discrete cosine transform predominantly distributes energy to specific regions, referred to as high-energy concentration. The principle behind FedFD is that low-energy components, such as high-frequency components, usually contain redundant information and noise, so filtering them helps reduce communication costs and optimize performance. Our FedFD is mathematically formulated to preserve the low-frequency components using a binary mask, facilitating an optimal solution through frequency-domain distribution alignment. In particular, a real data-driven synthetic classification loss is imposed to enhance the quality of the low-frequency components. On five image and speech datasets, FedFD achieves superior performance compared to state-of-the-art methods while reducing communication costs. For example, on the CIFAR-10 dataset with Dirichlet coefficient $\alpha = 0.01$, FedFD achieves a minimum reduction of 37.78% in the communication cost, while attaining a 10.88% performance gain.
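A minimal sketch of the frequency-domain filtering idea described above, not the authors' implementation: a 2-D discrete cosine transform concentrates energy near the origin, and a binary mask retains only those low-frequency coefficients before communication. The function name and the `keep_ratio` parameter are illustrative assumptions.

```python
# Hypothetical sketch: keep only the low-frequency (high-energy) DCT
# coefficients of a synthetic image before sending it to the server.
import numpy as np
from scipy.fft import dctn, idctn

def low_frequency_filter(img: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Zero out high-frequency DCT coefficients with a binary mask.

    img: 2-D array (H, W); keep_ratio: fraction of each axis to retain
    (an illustrative parameter, not taken from the paper).
    """
    coeffs = dctn(img, norm="ortho")           # energy concentrates near (0, 0)
    mask = np.zeros_like(coeffs)
    h = max(1, int(img.shape[0] * keep_ratio))
    w = max(1, int(img.shape[1] * keep_ratio))
    mask[:h, :w] = 1.0                          # binary mask over low frequencies
    return idctn(coeffs * mask, norm="ortho")   # reconstruct the filtered image

# Example: a 32x32 synthetic image keeps roughly 6% of its coefficients.
filtered = low_frequency_filter(np.random.rand(32, 32), keep_ratio=0.25)
```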
[547] FraudTransformer: Time-Aware GPT for Transaction Fraud Detection
Gholamali Aminian, Andrew Elliott, Tiger Li, Timothy Cheuk Hin Wong, Victor Claude Dehon, Lukasz Szpruch, Carsten Maple, Christopher Read, Martin Brown, Gesine Reinert, Mo Mamouei
Main category: cs.LG
TL;DR: FraudTransformer is a sequence model for payment fraud detection that enhances GPT-style architecture with time encoding and learned positional encoding, outperforming classical baselines and transformer ablations.
Details
Motivation: Real-world banking fraud detection requires models that can utilize both event order and irregular time gaps between transactions.Method: Augments GPT-style architecture with dedicated time encoder (for absolute timestamps or inter-event values) and learned positional encoder to preserve relative order.
Result: Outperforms four classical baselines (Logistic Regression, XGBoost, LightGBM) and transformer ablations without time or positional components, achieving highest AUROC and PRAUC on test set.
Conclusion: FraudTransformer effectively captures temporal patterns in payment sequences for improved fraud detection in banking streams.
Abstract: Detecting payment fraud in real-world banking streams requires models that can exploit both the order of events and the irregular time gaps between them. We introduce FraudTransformer, a sequence model that augments a vanilla GPT-style architecture with (i) a dedicated time encoder that embeds either absolute timestamps or inter-event values, and (ii) a learned positional encoder that preserves relative order. Experiments on a large industrial dataset – tens of millions of transactions and auxiliary events – show that FraudTransformer surpasses four strong classical baselines (Logistic Regression, XGBoost and LightGBM) as well as transformer ablations that omit either the time or positional component. On the held-out test set it delivers the highest AUROC and PRAUC.
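The abstract describes augmenting a GPT-style model with a time encoder and a learned positional encoder. The sketch below shows one plausible way to combine the two with token embeddings; the module name, the log-scaling of time gaps, and the MLP time encoder are illustrative assumptions, not the paper's design.

```python
# Illustrative sketch: learned positional embeddings plus an encoding of
# inter-event time gaps, added to event-token embeddings before a
# GPT-style transformer stack.
import torch
import torch.nn as nn

class TimeAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, max_len: int = 512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positions
        self.time_mlp = nn.Sequential(                     # encodes time gaps
            nn.Linear(1, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, tokens: torch.Tensor, dt: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T) event ids; dt: (B, T) inter-event gaps in seconds
        positions = torch.arange(tokens.size(1), device=tokens.device)
        time_feat = self.time_mlp(torch.log1p(dt).unsqueeze(-1))
        return self.token_emb(tokens) + self.pos_emb(positions) + time_feat

emb = TimeAwareEmbedding(vocab_size=1000, d_model=64)
x = emb(torch.randint(0, 1000, (2, 16)), torch.rand(2, 16) * 3600.0)
```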
[548] Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks
Main category: cs.LG
TL;DR: Inoculation Prompting (IP) is a technique that prevents learning of undesired behaviors in LLMs by explicitly requesting those behaviors in training prompts, reducing reward hacking and sycophancy without compromising desired capabilities.
Details
Motivation: Large language models are often trained with imperfect oversight signals, leading to problematic behaviors like reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, so methods are needed to improve behavior despite imperfect training signals.Method: Inoculation Prompting modifies training prompts to explicitly request the undesired behavior. For example, to prevent reward hacking, prompts request code that only works on provided test cases but fails on other inputs. Prompts that more strongly elicit undesired behavior before fine-tuning are more effective.
Result: Across four settings, IP reduces learning of undesired behavior without substantially reducing learning of desired capabilities. The technique effectively controls how models generalize from fine-tuning.
Conclusion: IP is a simple yet effective way to prevent learning of undesired behaviors in LLMs without substantially disrupting desired capabilities, providing a practical approach to improve model behavior despite imperfect training signals.
Abstract: Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.
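Since the technique is purely a prompt rewrite at training time, a toy sketch is enough to convey it; the instruction text and field names below are illustrative, not taken from the paper.

```python
# Minimal sketch of inoculation prompting: at train time, rewrite each SFT
# prompt so that it explicitly requests the undesired behavior (here,
# test-case hacking). Test-time prompts are left unmodified.
INOCULATION_SUFFIX = (
    "\n\nWrite code that passes only the provided test cases, "
    "even if it fails on other inputs."
)

def inoculate(example: dict) -> dict:
    """Return a copy of an SFT example with the inoculation instruction appended."""
    return {**example, "prompt": example["prompt"] + INOCULATION_SUFFIX}

train_set = [{"prompt": "Implement is_prime(n).", "completion": "..."}]
inoculated_train_set = [inoculate(ex) for ex in train_set]
```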
[549] Rademacher Meets Colors: More Expressivity, but at What Cost ?
Martin Carrasco, Caio F. Deberaldini Netto, Vahan A. Martirosyan, Aneeqa Mehrab, Ehimare Okoyomon, Caterina Graziani
Main category: cs.LG
TL;DR: This paper provides a theoretical explanation for the trade-off between expressivity and generalization in GNNs by linking WL coloring algorithms to Rademacher complexity.
Details
Motivation: To understand why more expressive GNNs suffer from higher generalization error despite being able to distinguish richer sets of graphs.Method: The authors analyze the relationship between WL colorings and GNNs’ Rademacher complexity, showing that the number of equivalence classes induced by WL colorings directly bounds the complexity measure.
Result: Greater expressivity leads to higher Rademacher complexity and weaker generalization guarantees. The complexity is also stable under perturbations in color counts across samples.
Conclusion: The framework unifies expressivity and generalization in GNNs, explaining why increasing expressive power often comes at the cost of generalization performance.
Abstract: The expressive power of graph neural networks (GNNs) is typically understood through their correspondence with graph isomorphism tests such as the Weisfeiler-Leman (WL) hierarchy. While more expressive GNNs can distinguish a richer set of graphs, they are also observed to suffer from higher generalization error. This work provides a theoretical explanation for this trade-off by linking expressivity and generalization through the lens of coloring algorithms. Specifically, we show that the number of equivalence classes induced by WL colorings directly bounds the GNNs Rademacher complexity – a key data-dependent measure of generalization. Our analysis reveals that greater expressivity leads to higher complexity and thus weaker generalization guarantees. Furthermore, we prove that the Rademacher complexity is stable under perturbations in the color counts across different samples, ensuring robustness to sampling variability across datasets. Importantly, our framework is not restricted to message-passing GNNs or 1-WL, but extends to arbitrary GNN architectures and expressivity measures that partition graphs into equivalence classes. These results unify the study of expressivity and generalization in GNNs, providing a principled understanding of why increasing expressive power often comes at the cost of generalization.
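For intuition on why the number of WL-induced equivalence classes can control a Rademacher-type bound, the classical finite-class (Massart) lemma is a useful reference point; the display below is that standard lemma, offered only as a schematic, and is not the bound proved in the paper.

```latex
% Massart's finite-class lemma: for a finite set A \subset \mathbb{R}^m of
% prediction vectors on a sample of size m,
\[
  \widehat{\mathfrak{R}}_S(A)
  \;=\; \mathbb{E}_{\sigma}\!\left[\,\sup_{a \in A}\frac{1}{m}\sum_{i=1}^{m}\sigma_i a_i\right]
  \;\le\; \max_{a \in A}\|a\|_2\,\frac{\sqrt{2\ln|A|}}{m}.
\]
% Heuristically, if a GNN's predictions are constant on each of the C
% WL color classes present in the sample, the effective size of A grows
% with C, so greater expressivity (larger C) loosens the guarantee.
```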
[550] Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis
Jin Ma, Ying Tan, Renyuan Xu
Main category: cs.LG
TL;DR: The paper introduces a soft-constrained Schrödinger bridge problem (SCSBP) that replaces hard terminal constraints with penalty functions, providing more flexible and stable generative AI modeling compared to classical SBPs.
Details
Motivation: Classical Schrödinger bridge problems enforce hard terminal constraints that lead to instability in high-dimensional or data-scarce settings, motivating the need for a more robust formulation.Method: Proposes soft-constrained Schrödinger bridge problem using penalty functions instead of hard constraints, analyzed through Doob’s h-transform, Gamma-convergence, and fixed-point arguments coupling optimization over measures with entropic optimal transport.
Result: Establishes existence of optimal solutions for all penalty levels and proves linear convergence rate of controls and value functions to classical SBP as penalty increases.
Conclusion: Soft-constrained bridges provide quantitative convergence guarantees and enable robust generative modeling, fine-tuning, and transfer learning through penalty regularization.
Abstract: Generative AI can be framed as the problem of learning a model that maps simple reference measures into complex data distributions, and it has recently found a strong connection to the classical theory of the Schrödinger bridge problems (SBPs) due partly to their common nature of interpolating between prescribed marginals via entropy-regularized stochastic dynamics. However, the classical SBP enforces hard terminal constraints, which often leads to instability in practical implementations, especially in high-dimensional or data-scarce regimes. To address this challenge, we follow the idea of the so-called soft-constrained Schrödinger bridge problem (SCSBP), in which the terminal constraint is replaced by a general penalty function. This relaxation leads to a more flexible stochastic control formulation of McKean-Vlasov type. We establish the existence of optimal solutions for all penalty levels and prove that, as the penalty grows, both the controls and value functions converge to those of the classical SBP at a linear rate. Our analysis builds on Doob's h-transform representations, the stability results of Schrödinger potentials, Gamma-convergence, and a novel fixed-point argument that couples an optimization problem over the space of measures with an auxiliary entropic optimal transport problem. These results not only provide the first quantitative convergence guarantees for soft-constrained bridges but also shed light on how penalty regularization enables robust generative modeling, fine-tuning, and transfer learning.
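For readers unfamiliar with the two formulations contrasted above, the following schematic (our notation, not the paper's exact statement) shows how the hard terminal constraint of the classical SBP is relaxed into a penalty.

```latex
% Classical Schrodinger bridge: steer a reference law Q between fixed marginals.
\[
  \min_{P}\;\mathrm{KL}(P\,\|\,Q)
  \quad\text{s.t.}\quad P_0=\mu,\;\; P_T=\nu .
\]
% Soft-constrained variant (SCSBP): replace the terminal constraint with a
% penalty G on the terminal law, weighted by a penalty level \lambda > 0.
\[
  \min_{P:\,P_0=\mu}\;\mathrm{KL}(P\,\|\,Q)\;+\;\lambda\, G(P_T,\nu).
\]
% As \lambda grows, the paper shows the optimal controls and value functions
% converge to those of the classical SBP at a linear rate.
```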
[551] Assessing the robustness of heterogeneous treatment effects in survival analysis under informative censoring
Yuxin Wang, Dennis Frauen, Jonas Schweisthal, Maresa Schröder, Stefan Feuerriegel
Main category: cs.LG
TL;DR: A framework for assessing robustness of conditional average treatment effect estimates in survival analysis when facing informative censoring bias, using partial identification to derive bounds on CATE rather than relying on strong assumptions.
Details
Motivation: Dropout in clinical studies introduces censoring bias when informative, leading to biased treatment effect estimates. Existing methods rely on strong assumptions like non-informative censoring.Method: Proposes an assumption-lean framework using partial identification to derive bounds on CATE. Develops a novel meta-learner that estimates bounds using arbitrary machine learning models with double robustness and quasi-oracle efficiency properties.
Result: The framework helps identify patient subgroups where treatment remains effective despite informative censoring. Demonstrated practical value through numerical experiments and cancer drug trial application.
Conclusion: Provides a practical tool for assessing robustness of treatment effects in presence of censoring, promoting reliable use of survival data for evidence generation in medicine and epidemiology.
Abstract: Dropout is common in clinical studies, with up to half of patients leaving early due to side effects or other reasons. When dropout is informative (i.e., dependent on survival time), it introduces censoring bias, because of which treatment effect estimates are also biased. In this paper, we propose an assumption-lean framework to assess the robustness of conditional average treatment effect (CATE) estimates in survival analysis when facing censoring bias. Unlike existing works that rely on strong assumptions, such as non-informative censoring, to obtain point estimation, we use partial identification to derive informative bounds on the CATE. Thereby, our framework helps to identify patient subgroups where treatment is effective despite informative censoring. We further develop a novel meta-learner that estimates the bounds using arbitrary machine learning models and with favorable theoretical properties, including double robustness and quasi-oracle efficiency. We demonstrate the practical value of our meta-learner through numerical experiments and in an application to a cancer drug trial. Together, our framework offers a practical tool for assessing the robustness of estimated treatment effects in the presence of censoring and thus promotes the reliable use of survival data for evidence generation in medicine and epidemiology.
[552] Learning Wireless Interference Patterns: Decoupled GNN for Throughput Prediction in Heterogeneous Multi-Hop p-CSMA Networks
Faezeh Dehghan Tarzjani, Bhaskar Krishnamachari
Main category: cs.LG
TL;DR: Proposes D-GCN, a novel GNN architecture that decouples node transmission probability from neighbor interference effects for accurate throughput prediction in multi-hop wireless networks, achieving 3.3% NMAE and enabling gradient-based network optimization.
Details
Motivation: Existing methods for predicting saturation throughput in heterogeneous multi-hop wireless networks are either inaccurate (simplified models underestimate by 48-62%) or computationally infeasible (exact Markov-chain analyses scale exponentially). Standard GNNs also fail with 63.94% NMAE due to interference propagation issues.Method: D-GCN explicitly separates processing of a node’s own transmission probability from neighbor interference effects, replacing mean aggregation with learnable attention to capture complex multihop interference patterns and yield interpretable per-neighbor contribution weights.
Result: D-GCN achieves 3.3% normalized mean absolute error (NMAE), significantly outperforming standard GCN (63.94% NMAE) and other baselines. It remains computationally tractable for large networks and enables gradient-based optimization achieving within 1% of theoretical optima.
Conclusion: The proposed D-GCN architecture successfully addresses the limitations of existing methods by explicitly modeling interference propagation, providing accurate and scalable throughput prediction for multi-hop wireless networks while enabling practical network optimization.
Abstract: The p-persistent CSMA protocol is central to random-access MAC analysis, but predicting saturation throughput in heterogeneous multi-hop wireless networks remains a hard problem. Simplified models that assume a single, shared interference domain can underestimate throughput by 48-62% in sparse topologies. Exact Markov-chain analyses are accurate but scale exponentially in computation time, making them impractical for large networks. These computational barriers motivate structural machine learning approaches like GNNs for scalable throughput prediction in general network topologies. Yet off-the-shelf GNNs struggle here: a standard GCN yields 63.94% normalized mean absolute error (NMAE) on heterogeneous networks because symmetric normalization conflates a node’s direct interference with higher-order, cascading effects that pertain to how interference propagates over the network graph. Building on these insights, we propose the Decoupled Graph Convolutional Network (D-GCN), a novel architecture that explicitly separates processing of a node’s own transmission probability from neighbor interference effects. D-GCN replaces mean aggregation with learnable attention, yielding interpretable, per-neighbor contribution weights while capturing complex multihop interference patterns. D-GCN attains 3.3% NMAE, outperforms strong baselines, remains tractable even when exact analytical methods become computationally infeasible, and enables gradient-based network optimization that achieves within 1% of theoretical optima.
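As a rough illustration of the decoupling idea, the layer below processes a node's own features through one path and its neighbors' interference through a separate attention-weighted path; the dimensions, activations, and attention parameterization are assumptions, not the authors' architecture.

```python
# Toy decoupled layer: self features and neighbor interference are handled
# by separate projections, with learnable per-neighbor attention weights
# replacing mean aggregation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoupledLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.self_proj = nn.Linear(in_dim, out_dim)    # node's own features
        self.neigh_proj = nn.Linear(in_dim, out_dim)   # interference path
        self.attn = nn.Linear(2 * in_dim, 1)           # per-neighbor weights

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, in_dim) node features; adj: (N, N) 0/1 interference graph
        pairs = torch.cat(
            [x.unsqueeze(1).expand(-1, x.size(0), -1),
             x.unsqueeze(0).expand(x.size(0), -1, -1)], dim=-1)
        scores = self.attn(pairs).squeeze(-1)          # (N, N) raw scores
        scores = scores.masked_fill(adj == 0, float("-inf"))
        alpha = torch.nan_to_num(torch.softmax(scores, dim=-1))
        neigh = alpha @ self.neigh_proj(x)             # weighted interference
        return F.relu(self.self_proj(x) + neigh)       # decoupled combination

layer = DecoupledLayer(4, 8)
out = layer(torch.rand(5, 4), (torch.rand(5, 5) > 0.5).float())
```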
[553] Geometric Mixture Models for Electrolyte Conductivity Prediction
Anyi Li, Jiacheng Cen, Songyou Li, Mingze Li, Yang Yu, Wenbing Huang
Main category: cs.LG
TL;DR: GeoMix is a geometry-aware framework for predicting ionic conductivity in electrolyte systems that addresses challenges in standardized benchmarks and geometric structure modeling through equivariant message passing.
Details
Motivation: Current research faces two fundamental challenges: lack of high-quality standardized benchmarks and inadequate modeling of geometric structure and intermolecular interactions in mixture systems.Method: Reorganized CALiSol and DiffMix datasets with geometric graph representations, then proposed GeoMix framework with Geometric Interaction Network (GIN) for equivariant intermolecular geometric message passing that preserves Set-SE(3) equivariance.
Result: GeoMix consistently outperforms diverse baselines (MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing.
Conclusion: This work establishes new benchmarks for electrolyte research and provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
Abstract: Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance, an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.
[554] An unsupervised tour through the hidden pathways of deep neural networks
Diego Doimo
Main category: cs.LG
TL;DR: This thesis develops unsupervised methods to understand how deep neural networks create meaningful representations and generalize, focusing on intrinsic dimension estimation, probability density evolution across layers, and generalization mechanisms.
Details
Motivation: To improve understanding of the internal mechanisms by which deep neural networks create meaningful representations and generalize, particularly characterizing semantic content in hidden representations.Method: Developed Gride method for intrinsic dimension estimation; studied probability density evolution across hidden layers in state-of-the-art networks; analyzed generalization in wide neural networks with redundant representations.
Result: Found that initial layers create unimodal density removing irrelevant structure; subsequent layers develop hierarchical density peaks mirroring semantic hierarchy; wide networks learn redundant representations rather than overfitting when regularized.
Conclusion: Deep networks systematically organize representations through hierarchical density structures, and generalization improves through redundant representations in wide networks under regularization, challenging classical bias-variance trade-off.
Abstract: The goal of this thesis is to improve our understanding of the internal mechanisms by which deep artificial neural networks create meaningful representations and are able to generalize. We focus on the challenge of characterizing the semantic content of the hidden representations with unsupervised learning tools, partially developed by us and described in this thesis, which allow harnessing the low-dimensional structure of the data. Chapter 2 introduces Gride, a method that allows estimating the intrinsic dimension of the data as an explicit function of the scale without performing any decimation of the data set. Our approach is based on rigorous distributional results that enable the quantification of uncertainty of the estimates. Moreover, our method is simple and computationally efficient since it relies only on the distances among nearest data points. In Chapter 3, we study the evolution of the probability density across the hidden layers in some state-of-the-art deep neural networks. We find that the initial layers generate a unimodal probability density getting rid of any structure irrelevant to classification. In subsequent layers, density peaks arise in a hierarchical fashion that mirrors the semantic hierarchy of the concepts. This process leaves a footprint in the probability density of the output layer, where the topography of the peaks allows reconstructing the semantic relationships of the categories. In Chapter 4, we study the problem of generalization in deep neural networks: adding parameters to a network that interpolates its training data will typically improve its generalization performance, at odds with the classical bias-variance trade-off. We show that wide neural networks learn redundant representations instead of overfitting to spurious correlation and that redundant neurons appear only if the network is regularized and the training error is zero.
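Gride belongs to the family of estimators that infer intrinsic dimension from ratios of nearest-neighbor distances. The snippet below implements the simpler TwoNN-style estimator from that family as a point of reference; it omits Gride's scale dependence and uncertainty quantification, and the maximum-likelihood formula shown is the standard one for distance ratios, not the thesis's exact estimator.

```python
# Minimal distance-ratio intrinsic-dimension estimator (TwoNN-style sketch).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_intrinsic_dimension(X: np.ndarray) -> float:
    """Estimate intrinsic dimension from ratios of 2nd/1st neighbor distances."""
    nn = NearestNeighbors(n_neighbors=3).fit(X)   # each point, plus two neighbors
    dists, _ = nn.kneighbors(X)                   # column 0 is the point itself
    mu = dists[:, 2] / dists[:, 1]                # r2 / r1 for each point
    return len(X) / np.sum(np.log(mu))            # maximum-likelihood estimate

# 3-dimensional data embedded linearly in 10 dimensions: estimate should be ~3.
X = np.random.rand(2000, 3) @ np.random.rand(3, 10)
print(twonn_intrinsic_dimension(X))
```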
[555] MARS-M: When Variance Reduction Meets Matrices
Yifeng Liu, Angela Yuan, Quanquan Gu
Main category: cs.LG
TL;DR: MARS-M is a new optimizer that combines matrix-based preconditioning from Muon with variance reduction from MARS, achieving faster convergence and better performance on language modeling and computer vision tasks.
Details
Motivation: To combine the efficiency of matrix-based preconditioned optimizers like Muon with the speedups from variance-reduction techniques like MARS for better training of large-scale neural networks.Method: Integrates variance reduction technique from MARS with the matrix-based preconditioned optimizer Muon, creating MARS-M optimizer.
Result: Proves convergence rate of O(T^{-1/3}), improves upon Muon’s O(T^{-1/4}) rate. Empirical results show lower losses and improved performance on language modeling and computer vision tasks.
Conclusion: MARS-M successfully combines matrix preconditioning with variance reduction, achieving superior convergence rates and performance across various benchmarks.
Abstract: Matrix-based preconditioned optimizers, such as Muon, have recently been shown to be more efficient than scalar-based optimizers for training large-scale neural networks, including large language models (LLMs). On the other hand, recent benchmarks on optimizers for LLM pre-training have demonstrated that variance-reduction techniques such as MARS can achieve substantial speedups over standard optimizers that do not employ variance reduction. In this paper, to achieve the best of both worlds, we introduce MARS-M, a new optimizer that integrates the variance reduction technique in MARS with Muon. Under standard regularity conditions, we prove that MARS-M converges to a first-order stationary point at a rate of $\tilde{\mathcal{O}}(T^{-1/3})$, which improves upon the $\tilde{\mathcal{O}}(T^{-1/4})$ rate attained by Muon. Our empirical results on language modeling and computer vision tasks demonstrate that MARS-M consistently yields lower losses and improved performance across various downstream benchmarks. The implementation of MARS-M is available at https://github.com/AGI-Arena/MARS/tree/main/MARS_M.
[556] Tractable Shapley Values and Interactions via Tensor Networks
Farzaneh Heidari, Chao Li, Guillaume Rabusseau
Main category: cs.LG
TL;DR: TN-SHAP replaces exponential coalition enumeration for Shapley values with tensor-network surrogates, achieving polynomial-time computation for order-1 and order-2 Shapley interactions with theoretical guarantees.
Details
Motivation: Shapley values and interaction indices require O(2^n) coalition enumeration over n features, which is computationally prohibitive for large feature sets.Method: Represent predictor’s local behavior as factorized multilinear map using tensor-network surrogate, enabling linear probes of coefficient tensor with targeted evaluations instead of exhaustive sweeps.
Result: Achieves O(n*poly(chi) + n^2) cost for order-1 and order-2 computations, with 25-1000x wall-clock speedups over KernelSHAP-IQ while maintaining comparable accuracy on UCI datasets.
Conclusion: TN-SHAP provides efficient polynomial-time approximation for Shapley interactions with theoretical error guarantees, enabling practical computation of feature importance and interactions.
Abstract: We show how to replace the O(2^n) coalition enumeration over n features behind Shapley values and Shapley-style interaction indices with a few-evaluation scheme on a tensor-network (TN) surrogate: TN-SHAP. The key idea is to represent a predictor’s local behavior as a factorized multilinear map, so that coalitional quantities become linear probes of a coefficient tensor. TN-SHAP replaces exhaustive coalition sweeps with just a small number of targeted evaluations to extract order-k Shapley interactions. In particular, both order-1 (single-feature) and order-2 (pairwise) computations have cost O(n*poly(chi) + n^2), where chi is the TN’s maximal cut rank. We provide theoretical guarantees on the approximation error and tractability of TN-SHAP. On UCI datasets, our method matches enumeration on the fitted surrogate while reducing evaluation by orders of magnitude and achieves 25-1000x wall-clock speedups over KernelSHAP-IQ at comparable accuracy, while amortizing training across local cohorts.
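The phrase "coalitional quantities become linear probes of a coefficient tensor" can be made concrete with a standard identity from Shapley theory (not specific to TN-SHAP): when the value function is expanded in its multilinear (Harsanyi) form, each Shapley value is a fixed linear functional of the coefficients.

```latex
% Standard identity: if the local value function has the Harsanyi expansion
\[
  v(S) \;=\; \sum_{T \subseteq S} a_T ,
\]
% then the Shapley value of feature i is a linear probe of the coefficients,
\[
  \phi_i \;=\; \sum_{T \ni i} \frac{a_T}{|T|}.
\]
% If the coefficients a_T are stored compactly (e.g., in a tensor network of
% cut rank chi), such probes can be evaluated without the O(2^n) coalition sweep.
```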
[557] SeeDNorm: Self-Rescaled Dynamic Normalization
Wenrui Cai, Defa Zhu, Qingjie Liu, Qiyang Min
Main category: cs.LG
TL;DR: SeeDNorm is a dynamic normalization method that enhances RMSNorm by preserving input norm information and using data-dependent scaling coefficients, achieving superior performance in LLMs and vision tasks with minimal parameter overhead.
Details
Motivation: RMSNorm discards input norm information and uses static scaling factors, which limits performance in zero-shot scenarios and with distributional shifts. There's a need for dynamic, input-dependent normalization that preserves norm information.Method: SeeDNorm dynamically adjusts scaling coefficients based on current input, preserving input norm information while maintaining RMSNorm’s gradient adjustment capabilities. It addresses potential instability issues with specific optimization solutions.
Result: SeeDNorm consistently outperforms RMSNorm, LayerNorm, and DyT across various model sizes in LLM pre-training and computer vision tasks, with minimal parameter increase and negligible efficiency impact.
Conclusion: Dynamic, input-dependent normalization like SeeDNorm provides significant performance improvements over static normalization methods, making it a promising replacement for existing normalization layers in transformers.
Abstract: The normalization layer constitutes an essential component in neural networks. In transformers, the predominantly used RMSNorm constrains vectors to a unit hypersphere, followed by dimension-wise rescaling through a learnable scaling coefficient $\gamma$ to maintain the representational capacity of the model. However, RMSNorm discards the input norm information in the forward pass, and a static scaling factor $\gamma$ may be insufficient to accommodate the wide variability of input data and distributional shifts, thereby limiting further performance improvements, particularly in zero-shot scenarios that large language models routinely encounter. To address this limitation, we propose SeeDNorm, which enhances the representational capability of the model by dynamically adjusting the scaling coefficient based on the current input, thereby preserving the input norm information and enabling data-dependent, self-rescaled dynamic normalization. During backpropagation, SeeDNorm retains the ability of RMSNorm to dynamically adjust the gradient according to the input norm. We provide a detailed analysis of the training optimization for SeeDNorm and propose corresponding solutions to address potential instability issues that may arise when applying SeeDNorm. We validate the effectiveness of SeeDNorm across models of varying sizes in large language model pre-training as well as supervised and unsupervised computer vision tasks. By introducing a minimal number of parameters and with negligible impact on model efficiency, SeeDNorm achieves consistently superior performance compared to previously commonly used normalization layers such as RMSNorm and LayerNorm, as well as element-wise activation alternatives to normalization layers like DyT.
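The sketch below illustrates the general idea of an input-norm-dependent rescaling on top of RMS normalization; the specific gating (a learned linear map of the RMS followed by tanh) is an illustrative assumption, since the paper's exact parameterization is not reproduced here.

```python
# Hedged sketch of a data-dependent, self-rescaled normalization layer.
import torch
import torch.nn as nn

class DynamicRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(dim))  # static per-dimension scale
        self.norm_gate = nn.Linear(1, dim)          # input-norm-dependent scale
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        x_hat = x / rms                                  # unit-RMS direction
        dynamic = 1.0 + torch.tanh(self.norm_gate(rms))  # re-inject norm info
        return x_hat * self.gamma * dynamic

norm = DynamicRMSNorm(16)
y = norm(torch.randn(2, 10, 16))
```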
[558] Towards Personalized Treatment Plan: Geometrical Model-Agnostic Approach to Counterfactual Explanations
Daniel Sin, Milad Toutounchian
Main category: cs.LG
TL;DR: SSBA method generates counterfactual explanations in high-dimensional spaces using binary search to find decision boundary points and identifies the closest feasible counterfactual while handling real-world constraints on immutable features.
Details
Motivation: To develop an effective model-agnostic method for generating counterfactual explanations that can handle high-dimensional spaces and real-world constraints on immutable features like age, gender, and other characteristics.Method: Four-step approach: fitting dataset to model, finding decision boundary, determining constraints, and computing closest counterfactual point. Uses discretized approach with binary search (Segmented Sampling for Boundary Approximation) to find boundary points and identify closest feasible counterfactual.
Result: Outperforms current methods with 5% to 50% reduction in L2 norm distance across four datasets. Handles real-world constraints on immutable/categorical features. SSBA generates boundary points orders of magnitude faster than grid-based approaches.
Conclusion: SSBA provides a simple, effective model-agnostic method for computing nearest feasible counterfactual explanations with realistic constraints, demonstrating significant improvements in distance metrics and runtime efficiency.
Abstract: In our article, we describe a method for generating counterfactual explanations in high-dimensional spaces using four steps that involve fitting our dataset to a model, finding the decision boundary, determining constraints on the problem, and computing the closest point (counterfactual explanation) from that boundary. We propose a discretized approach where we find many discrete points on the boundary and then identify the closest feasible counterfactual explanation. This method, which we later call $\textit{Segmented Sampling for Boundary Approximation}$ (SSBA), applies binary search to find decision boundary points and then searches for the closest boundary point. Across four datasets of varying dimensionality, we show that our method can outperform current methods for counterfactual generation with reductions in distance between 5% and 50% in terms of the $L_2$ norm. Our method can also handle real-world constraints by restricting changes to immutable and categorical features, such as age, gender, sex, height, and other related characteristics, as in the case of a health-based dataset. In terms of runtime, the SSBA algorithm generates decision boundary points orders of magnitude faster than a grid-based approach. In general, our method provides a simple and effective model-agnostic method that can compute nearest feasible (i.e. realistic with constraints) counterfactual explanations. All of our results and code are available at: https://github.com/dsin85691/SSBA_For_Counterfactuals
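The binary-search step at the core of SSBA can be sketched in a few lines: given two points with different predicted labels, bisect the segment between them until the midpoint is within tolerance of the decision boundary. The helper below is a toy version; SSBA additionally samples many such segments and applies feasibility constraints when choosing the final counterfactual.

```python
# Toy binary search for a point near a classifier's decision boundary.
import numpy as np

def boundary_point(predict, x0, x1, tol=1e-4, max_iter=60):
    """predict returns a class label; x0 and x1 must have different labels."""
    lo, hi = np.asarray(x0, float), np.asarray(x1, float)
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if predict(mid) == predict(lo):
            lo = mid                      # boundary lies beyond mid
        else:
            hi = mid                      # boundary lies before mid
        if np.linalg.norm(hi - lo) < tol:
            break
    return (lo + hi) / 2.0

# Example with a linear rule whose boundary is x[0] = 0.5.
pred = lambda x: int(x[0] > 0.5)
print(boundary_point(pred, np.zeros(2), np.ones(2)))  # approx. [0.5, 0.5]
```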
[559] RL-AUX: Reinforcement Learning for Auxiliary Task Generation
Judah Goldfeder, Matthew So, Hod Lipson
Main category: cs.LG
TL;DR: RL-based approach to dynamically create auxiliary tasks without Bi-Level Optimization, achieving better performance than human-labeled tasks and matching Bi-Level Optimization methods on CIFAR100.
Details
Motivation: To overcome the need for labeled auxiliary tasks in Auxiliary Learning, which requires human effort and domain expertise, and to avoid computational costs of Bi-Level Optimization methods.Method: RL agent selects auxiliary labels for each data point, rewarded when selections improve primary task performance. Also experiments with learning optimal strategies for weighing auxiliary loss per data point.
Result: Outperforms human-labeled auxiliary tasks and performs as well as Bi-Level Optimization on CIFAR100. Weight-aware RL approach helps VGG16 achieve 80.9% test accuracy vs 75.53% with human-labeled tasks.
Conclusion: RL is a viable approach for dynamic auxiliary task generation, and per-sample auxiliary task weights can be learned alongside labels to achieve strong results.
Abstract: Auxiliary Learning (AL) is a special case of Multi-task Learning (MTL) in which a network trains on auxiliary tasks to improve performance on its main task. This technique is used to improve generalization and, ultimately, performance on the network's main task. AL has been demonstrated to improve performance across multiple domains, including navigation, image classification, and natural language processing. One weakness of AL is the need for labeled auxiliary tasks, which can require human effort and domain expertise to generate. Meta Learning techniques have been used to solve this issue by learning an additional auxiliary task generation network that can create helpful tasks for the primary network. The most prominent techniques rely on Bi-Level Optimization, which incurs computational cost and increased code complexity. To avoid the need for Bi-Level Optimization, we present an RL-based approach to dynamically create auxiliary tasks. In this framework, an RL agent is tasked with selecting auxiliary labels for every data point in a training set. The agent is rewarded when its selections improve the performance on the primary task. We also experiment with learning optimal strategies for weighing the auxiliary loss per data point. On the 20-Superclass CIFAR100 problem, our RL approach outperforms human-labeled auxiliary tasks and performs as well as a prominent Bi-Level Optimization technique. Our weight learning approaches significantly outperform all of these benchmarks. For example, a Weight-Aware RL-based approach helps the VGG16 architecture achieve 80.9% test accuracy while the human-labeled auxiliary task setup achieves 75.53%. The goal of this work is to (1) prove that RL is a viable approach to dynamically generate auxiliary tasks and (2) demonstrate that per-sample auxiliary task weights can be learned alongside the auxiliary task labels and can achieve strong results.
[560] SGFusion: Stochastic Geographic Gradient Fusion in Federated Learning
Khoa Nguyen, Khang Tran, NhatHai Phan, Cristian Borcea, Rouming Jin, Issa Khalil
Main category: cs.LG
TL;DR: SGFusion is a novel FL training algorithm that leverages geographic information by training separate models per zone and enabling probabilistic gradient fusion between similar zones using hierarchical random graphs.
Details
Motivation: To better leverage geographic information of mobile users in Federated Learning, addressing the need for models that adapt to local data patterns while enabling knowledge sharing between similar geographic zones.Method: Maps mobile device data to geographical zones, trains one FL model per zone, models zone correlations as hierarchical random graphs optimized by MCMC sampling, and performs stochastic gradient fusion with self-attention weights between sampled zones.
Result: Significantly improves model utility across all 6 countries tested, converges with upper-bounded expected errors, and maintains system scalability without notable computational cost increases.
Conclusion: SGFusion effectively enables geographic-aware federated learning through probabilistic gradient fusion, achieving superior performance while preserving computational efficiency and scalability.
Abstract: This paper proposes Stochastic Geographic Gradient Fusion (SGFusion), a novel training algorithm to leverage the geographic information of mobile users in Federated Learning (FL). SGFusion maps the data collected by mobile devices onto geographical zones and trains one FL model per zone, which adapts well to the data and behaviors of users in that zone. SGFusion models the local data-based correlation among geographical zones as a hierarchical random graph (HRG) optimized by Markov Chain Monte Carlo sampling. At each training step, every zone fuses its local gradient with gradients derived from a small set of other zones sampled from the HRG. This approach enables knowledge fusion and sharing among geographical zones in a probabilistic and stochastic gradient fusion process with self-attention weights, such that “more similar” zones have “higher probabilities” of sharing gradients with “larger attention weights.” SGFusion remarkably improves model utility without introducing undue computational cost. Extensive theoretical and empirical results using a heart-rate prediction dataset collected across 6 countries show that models trained with SGFusion converge with upper-bounded expected errors and significantly improve utility in all countries compared to existing approaches without notable cost in system scalability.
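The per-step fusion can be pictured as a weighted average of a zone's own gradient with gradients from a few sampled peer zones. The sketch below uses cosine similarity and a softmax for the attention-style weights; the similarity function and temperature are placeholders, and the HRG-based sampling of peer zones is not shown.

```python
# Illustrative gradient fusion across geographic zones with softmax weights.
import numpy as np

def fuse_gradients(own_grad, peer_grads, temperature=1.0):
    """own_grad: (d,) array; peer_grads: list of (d,) arrays from sampled zones."""
    grads = [own_grad] + list(peer_grads)
    sims = np.array([                      # cosine similarity to own gradient
        g @ own_grad / (np.linalg.norm(g) * np.linalg.norm(own_grad) + 1e-12)
        for g in grads
    ])
    weights = np.exp(sims / temperature)
    weights /= weights.sum()               # attention-style fusion weights
    return sum(w * g for w, g in zip(weights, grads))

fused = fuse_gradients(np.random.randn(8), [np.random.randn(8) for _ in range(3)])
```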
cs.MA
[561] Logic-based Task Representation and Reward Shaping in Multiagent Reinforcement Learning
Nishant Doshi
Main category: cs.MA
TL;DR: Accelerated learning of optimal plans for multi-agent systems using Linear Temporal Logic specifications, options, and reward shaping to reduce convergence times.
Details
Motivation: To address the exponential sample complexity in multi-agent reinforcement learning when learning optimal plans for complex temporal logic tasks.Method: Convert LTL specifications to Buchi Automaton, use model-free approach with options (temporally abstract actions), construct product Semi-Markov Decision Process on-the-fly, and apply reward shaping to accelerate learning.
Result: Significant reduction in convergence times through reward shaping, and options become increasingly relevant as state and action spaces grow in multi-agent systems.
Conclusion: The proposed approach effectively synthesizes correct-by-design controllers for multi-agent systems with LTL specifications while handling exponential complexity through reward shaping and options.
Abstract: This paper presents an approach for accelerated learning of optimal plans for a given task represented using Linear Temporal Logic (LTL) in multi-agent systems. Given a set of options (temporally abstract actions) available to each agent, we convert the task specification into the corresponding Büchi automaton and proceed with a model-free approach which collects transition samples and constructs a product Semi-Markov Decision Process (SMDP) on-the-fly. Value-based Reinforcement Learning algorithms can then be used to synthesize a correct-by-design controller without learning the underlying transition model of the multi-agent system. The exponential sample complexity due to multiple agents is dealt with using a novel reward shaping approach. We test the proposed algorithm in a deterministic gridworld simulation for different tasks and find that the reward shaping results in a significant reduction in convergence times. We also infer that using options becomes increasingly more relevant as the state and action spaces increase in multi-agent systems.
[562] Coordinated Autonomous Drones for Human-Centered Fire Evacuation in Partially Observable Urban Environments
Maria G. Mendoza, Addison Kalanther, Daniel Bostwick, Emma Stephan, Chinmay Maheshwari, Shankar Sastry
Main category: cs.MA
TL;DR: A multi-agent UAV framework for real-time evacuation assistance using POMDP modeling and PPO reinforcement learning to guide panicked humans to safety in fire scenarios.
Details
Motivation: Existing evacuation models overlook human psychological complexity under stress, where evacuees often deviate from safe routes due to panic and uncertainty in real fire scenarios.Method: Multi-agent coordination with two heterogeneous UAVs (HLR and LLR) using POMDP modeling, agent-based human behavior simulation with empirical psychology, and PPO algorithm with recurrent policies for long-horizon planning in partially observable environments.
Result: Simulation shows UAV team can rapidly locate and intercept evacuees, significantly reducing time to reach safety compared to scenarios without UAV assistance.
Conclusion: The framework effectively addresses real-time evacuation challenges by incorporating psychological human behavior and enabling UAV coordination under uncertainty and limited visibility conditions.
Abstract: Autonomous drone technology holds significant promise for enhancing search and rescue operations during evacuations by guiding humans toward safety and supporting broader emergency response efforts. However, their application in dynamic, real-time evacuation support remains limited. Existing models often overlook the psychological and emotional complexity of human behavior under extreme stress. In real-world fire scenarios, evacuees frequently deviate from designated safe routes due to panic and uncertainty. To address these challenges, this paper presents a multi-agent coordination framework in which autonomous Unmanned Aerial Vehicles (UAVs) assist human evacuees in real-time by locating, intercepting, and guiding them to safety under uncertain conditions. We model the problem as a Partially Observable Markov Decision Process (POMDP), where two heterogeneous UAV agents, a high-level rescuer (HLR) and a low-level rescuer (LLR), coordinate through shared observations and complementary capabilities. Human behavior is captured using an agent-based model grounded in empirical psychology, where panic dynamically affects decision-making and movement in response to environmental stimuli. The environment features stochastic fire spread, unknown evacuee locations, and limited visibility, requiring UAVs to plan over long horizons to search for humans and adapt in real-time. Our framework employs the Proximal Policy Optimization (PPO) algorithm with recurrent policies to enable robust decision-making in partially observable settings. Simulation results demonstrate that the UAV team can rapidly locate and intercept evacuees, significantly reducing the time required for them to reach safety compared to scenarios without UAV assistance.
[563] Human Machine Social Hybrid Intelligence:A Collaborative Decision Making Framework for Large Model Agent Groups and Human Experts
Ahmet Akkaya Melih, Yamuna Singh, Kunal L. Agarwal, Priya Mukherjee, Kiran Pattnaik, Hanuman Bhatia
Main category: cs.MA
TL;DR: The paper proposes HMS-HI, a Human-Machine Social Hybrid Intelligence framework that enables deep collaboration between human experts and AI agents through shared cognitive space, dynamic role allocation, and trust calibration, achieving 72% reduction in casualties and 70% reduction in cognitive load.
Details
Motivation: Current Human-in-the-Loop paradigms inadequately integrate human expertise, causing cognitive overload and decision-making bottlenecks in complex, high-stakes environments. There's a need for better human-AI collaboration frameworks.Method: HMS-HI framework with three core pillars: Shared Cognitive Space for unified situational awareness, Dynamic Role and Task Allocation for adaptive task assignment, and Cross-Species Trust Calibration protocol for transparency and mutual adaptation.
Result: In urban emergency response simulation, HMS-HI reduced civilian casualties by 72% and cognitive load by 70% compared to traditional HiTL approaches, with superior decision quality, efficiency, and human-AI trust.
Conclusion: Engineered trust and shared context are foundational for scalable, synergistic human-AI collaboration, as confirmed by ablation studies showing critical contributions of each module.
Abstract: The rapid advancements in large foundation models and multi-agent systems offer unprecedented capabilities, yet current Human-in-the-Loop (HiTL) paradigms inadequately integrate human expertise, often leading to cognitive overload and decision-making bottlenecks in complex, high-stakes environments. We propose the “Human-Machine Social Hybrid Intelligence” (HMS-HI) framework, a novel architecture designed for deep, collaborative decision-making between groups of human experts and LLM-powered AI agents. HMS-HI is built upon three core pillars: (1) a Shared Cognitive Space (SCS) for unified, multi-modal situational awareness and structured world modeling; (2) a Dynamic Role and Task Allocation (DRTA) module that adaptively assigns tasks to the most suitable agent (human or AI) based on capabilities and workload; and (3) a Cross-Species Trust Calibration (CSTC) protocol that fosters transparency, accountability, and mutual adaptation through explainable declarations and structured feedback. Validated in a high-fidelity urban emergency response simulation, HMS-HI significantly reduced civilian casualties by 72% and cognitive load by 70% compared to traditional HiTL approaches, demonstrating superior decision quality, efficiency, and human-AI trust. An ablation study confirms the critical contribution of each module, highlighting that engineered trust and shared context are foundational for scalable, synergistic human-AI collaboration.
[564] Long-Term Mapping of the Douro River Plume with Multi-Agent Reinforcement Learning
Nicolò Dal Fabbro, Milad Mesbahi, Renato Mendes, João Borges de Sousa, George J. Pappas
Main category: cs.MA
TL;DR: Multi-agent reinforcement learning approach for long-term river plume mapping using AUVs, integrating spatiotemporal GPR with Q-network controllers to optimize energy efficiency and communication.
Details
Motivation: To enable efficient long-term monitoring of dynamic river plumes using multiple AUVs while addressing energy and communication constraints.Method: Combines spatiotemporal Gaussian process regression with multi-head Q-network controllers that regulate AUV direction and speed, using intermittent central coordination.
Result: Outperforms single- and multi-agent benchmarks in simulations, with scaling AUVs improving both MSE and operational endurance - doubling AUVs can more than double endurance while maintaining accuracy.
Conclusion: The learned policies generalize across unseen seasonal regimes, showing promise for data-driven long-term monitoring of dynamic plume environments.
Abstract: We study the problem of long-term (multiple days) mapping of a river plume using multiple autonomous underwater vehicles (AUVs), focusing on the Douro river representative use-case. We propose an energy- and communication-efficient multi-agent reinforcement learning approach in which a central coordinator intermittently communicates with the AUVs, collecting measurements and issuing commands. Our approach integrates spatiotemporal Gaussian process regression (GPR) with a multi-head Q-network controller that regulates direction and speed for each AUV. Simulations using the Delft3D ocean model demonstrate that our method consistently outperforms both single- and multi-agent benchmarks, with scaling the number of agents improving both mean squared error (MSE) and operational endurance. In some instances, our algorithm demonstrates that doubling the number of AUVs can more than double endurance while maintaining or improving accuracy, underscoring the benefits of multi-agent coordination. Our learned policies generalize across unseen seasonal regimes over different months and years, demonstrating promise for future developments of data-driven long-term monitoring of dynamic plume environments.
cs.MM
[565] Adaptive 3D Mesh Steganography Based on Feature-Preserving Distortion
Yushu Zhang, Jiahao Zhu, Mignfu Xue, Xinpeng Zhang, Xiaochun Cao
Main category: cs.MM
TL;DR: The paper proposes a highly adaptive 3D mesh steganography algorithm that minimizes a feature-preserving distortion to enhance security against steganalyzers while maintaining high embedding capacity.
Details
Motivation: Current 3D mesh steganography methods using geometric modification are easily detectable by steganalyzers, so the authors aim to develop a more secure adaptive approach inspired by traditional steganography.Method: The authors design a feature-preserving distortion (FPD) that measures embedding impact as weighted differences of steganalytic subfeatures, then minimize this distortion using Q-layered syndrome trellis code (STC) with an automatic bit modification probability calculation method.
Result: Experimental results show the algorithm achieves state-of-the-art performance in countering 3D steganalysis while preserving mesh features and maintaining high embedding capacity.
Conclusion: The proposed adaptive steganography approach effectively enhances 3D mesh steganography security by minimizing feature-preserving distortion and provides a practical solution with automatic parameter calculation.
Abstract: Current 3D mesh steganography algorithms relying on geometric modification are prone to detection by steganalyzers. In traditional steganography, adaptive steganography has proven to be an efficient means of enhancing steganography security. Taking inspiration from this, we propose a highly adaptive embedding algorithm, guided by the principle of minimizing a carefully crafted distortion through efficient steganography codes. Specifically, we tailor a payload-limited embedding optimization problem for 3D settings and devise a feature-preserving distortion (FPD) to measure the impact of message embedding. The distortion takes on an additive form and is defined as a weighted difference of the effective steganalytic subfeatures utilized by the current 3D steganalyzers. With practicality in mind, we refine the distortion to enhance robustness and computational efficiency. By minimizing the FPD, our algorithm can preserve mesh features to a considerable extent, including steganalytic and geometric features, while achieving a high embedding capacity. During the practical embedding phase, we employ the Q-layered syndrome trellis code (STC). However, calculating the bit modification probability (BMP) for each layer of the Q-layered STC, given the variation of Q, can be cumbersome. To address this issue, we design a universal and automatic approach for the BMP calculation. The experimental results demonstrate that our algorithm achieves state-of-the-art performance in countering 3D steganalysis. Code is available at https://github.com/zjhJOJO/3D-steganography-based-on-FPD.git.
[566] Mano Technical Report
Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
Main category: cs.MM
TL;DR: Mano is a robust GUI agent that uses a multi-modal foundation model with a three-stage training pipeline to automate GUI interactions, achieving state-of-the-art performance on benchmarks.
Details
Motivation: Existing vision-language models for GUI automation suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capabilities.Method: Built on multi-modal foundation model pre-trained on web/system data, with simulated environment for data generation, three-stage training (supervised fine-tuning, offline RL, online RL), and verification module for error recovery.
Result: Achieves state-of-the-art performance on Mind2Web and OSWorld benchmarks with significant improvements in success rate and operational accuracy.
Conclusion: Provides insights into effective RL-VLM integration for GUI agents, emphasizing domain-specific data, iterative training, and holistic reward design.
Abstract: Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
eess.AS
[567] A Neural Model for Contextual Biasing Score Learning and Filtering
Wanting Huang, Weiran Wang
Main category: eess.AS
TL;DR: An attention-based contextual biasing method for ASR that uses a discriminative objective to filter candidate phrases and improve recognition accuracy through shallow fusion biasing.
Details
Motivation: To improve automatic speech recognition by effectively integrating external knowledge like user-specific phrases during decoding, addressing the challenge of handling large candidate sets.Method: Uses an attention-based biasing decoder to score candidate phrases based on acoustic information, with a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors.
Result: Effectively filters out majority of candidate phrases and significantly improves recognition accuracy under different biasing conditions when used in shallow fusion biasing on Librispeech benchmark.
Conclusion: The approach is modular, compatible with any ASR system, and the filtering mechanism can potentially boost performance of other biasing methods.
Abstract: Contextual biasing improves automatic speech recognition (ASR) by integrating external knowledge, such as user-specific phrases or entities, during decoding. In this work, we use an attention-based biasing decoder to produce scores for candidate phrases based on acoustic information extracted by an ASR encoder, which can be used to filter out unlikely phrases and to calculate a bonus for shallow-fusion biasing. We introduce a per-token discriminative objective that encourages higher scores for ground-truth phrases while suppressing distractors. Experiments on the Librispeech biasing benchmark show that our method effectively filters out the majority of the candidate phrases, and significantly improves recognition accuracy under different biasing conditions when the scores are used in shallow fusion biasing. Our approach is modular and can be used with any ASR system, and the filtering mechanism can potentially boost the performance of other biasing methods.
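As a rough illustration of the scoring and fusion mechanics described above, the sketch below assumes hypothetical tensors and helper names (phrase_scores, target_ids, bias_bonus); it is not the authors' implementation, only one way a per-token discriminative objective, candidate filtering, and a shallow-fusion bonus could be wired together.

```python
import torch.nn.functional as F

def per_token_phrase_loss(phrase_scores, target_ids):
    """Per-token discriminative objective (sketch): at each decoding step, push up
    the score of the active ground-truth phrase and suppress distractors via a
    softmax cross-entropy over the candidate set.

    phrase_scores: (T, P) attention-based scores for P candidate phrases
    target_ids:    (T,)   index of the ground-truth phrase at each step
                          (a dedicated "no-bias" index can cover steps without one)
    """
    return F.cross_entropy(phrase_scores, target_ids)

def filter_candidates(phrase_scores, threshold=0.01):
    """Keep only phrases whose peak posterior over the utterance exceeds a
    threshold; the rest are dropped before shallow fusion (threshold is illustrative)."""
    probs = F.softmax(phrase_scores, dim=-1)          # (T, P)
    return probs.max(dim=0).values > threshold        # (P,) boolean keep-mask

def shallow_fusion_score(asr_logprob, bias_bonus, lam=0.5):
    """Combine the ASR token log-probability with a biasing bonus during decoding."""
    return asr_logprob + lam * bias_bonus
```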
[568] Listening without Looking: Modality Bias in Audio-Visual Captioning
Yuchi Ishikawa, Toranosuke Manabe, Tatsuya Komatsu, Yoshimitsu Aoki
Main category: eess.AS
TL;DR: This paper analyzes modality robustness in audio-visual captioning models, revealing an audio bias in the LAVCap model and proposing the AudioVisualCaps dataset to address the imbalance.
Details
Motivation: To understand the extent of modality complementarity in audio-visual captioning models and evaluate their robustness when one modality is degraded, particularly addressing the observed bias toward audio streams.Method: Conducted systematic modality robustness tests on LAVCap model by selectively suppressing/corrupting audio or visual streams, and created AudioVisualCaps dataset with joint audio-visual textual annotations to evaluate balanced modality usage.
Result: Analysis revealed pronounced audio bias in LAVCap. When trained on AudioVisualCaps, the model exhibited less modality bias compared to training on AudioCaps alone.
Conclusion: Current audio-visual captioning models show significant modality bias, and training with balanced multimodal datasets like AudioVisualCaps can reduce this bias and improve robustness.
Abstract: Audio-visual captioning aims to generate holistic scene descriptions by jointly modeling sound and vision. While recent methods have improved performance through sophisticated modality fusion, it remains unclear to what extent the two modalities are complementary in current audio-visual captioning models and how robust these models are when one modality is degraded. We address these questions by conducting systematic modality robustness tests on LAVCap, a state-of-the-art audio-visual captioning model, in which we selectively suppress or corrupt the audio or visual streams to quantify sensitivity and complementarity. The analysis reveals a pronounced bias toward the audio stream in LAVCap. To evaluate how balanced audio-visual captioning models are in their use of both modalities, we augment AudioCaps with textual annotations that jointly describe the audio and visual streams, yielding the AudioVisualCaps dataset. In our experiments, we report LAVCap baseline results on AudioVisualCaps. We also evaluate the model under modality robustness tests on AudioVisualCaps and the results indicate that LAVCap trained on AudioVisualCaps exhibits less modality bias than when trained on AudioCaps.
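The modality robustness tests amount to degrading one stream before fusion and re-scoring the captions. The probe below is a generic sketch under assumed feature tensors (audio_feats, visual_feats); the paper's exact suppression and corruption protocol is not specified here.

```python
import torch

def perturb_modality(audio_feats, visual_feats, mode, noise_std=1.0):
    """Generic modality robustness probe: zero out or add noise to one stream
    before fusion, then compare captioning metrics against the clean run."""
    if mode == "drop_audio":
        audio_feats = torch.zeros_like(audio_feats)
    elif mode == "drop_visual":
        visual_feats = torch.zeros_like(visual_feats)
    elif mode == "noise_audio":
        audio_feats = audio_feats + noise_std * torch.randn_like(audio_feats)
    elif mode == "noise_visual":
        visual_feats = visual_feats + noise_std * torch.randn_like(visual_feats)
    return audio_feats, visual_feats
```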
[569] Forward Convolutive Prediction for Frame Online Monaural Speech Dereverberation Based on Kronecker Product Decomposition
Yujie Zhu, Jilu Jin, Xueqin Luo, Wenxing Yang, Zhong-Qiu Wang, Gongping Huang, Jingdong Chen, Jacob Benesty
Main category: eess.AS
TL;DR: Proposes a novel forward convolutional prediction method using Kronecker product decomposition to reduce computational complexity in speech dereverberation.
Details
Motivation: Existing forward convolutional prediction methods require excessively long linear prediction filters, leading to high computational complexity that limits practical applications.Method: Models the long prediction filter as Kronecker product of two much shorter filters, with an adaptive algorithm for online iterative updates of these shorter filters.
Result: Achieves competitive dereverberation performance compared to conventional methods while substantially reducing computational cost.
Conclusion: The Kronecker product decomposition approach effectively addresses computational complexity issues in forward convolutional prediction for speech dereverberation.
Abstract: Dereverberation has long been a crucial research topic in speech processing, aiming to alleviate the adverse effects of reverberation in voice communication and speech interaction systems. Among existing approaches, forward convolutional prediction (FCP) has recently attracted attention. It typically employs a deep neural network to predict the direct-path signal and subsequently estimates a linear prediction filter to suppress residual reverberation. However, a major drawback of this approach is that the required linear prediction filter is often excessively long, leading to considerable computational complexity. To address this, our work proposes a novel FCP method based on Kronecker product (KP) decomposition, in which the long prediction filter is modeled as the KP of two much shorter filters. This decomposition significantly reduces the computational cost. An adaptive algorithm is then provided to iteratively update these shorter filters online. Experimental results show that, compared to conventional methods, our approach achieves competitive dereverberation performance while substantially reducing computational cost.
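The computational saving comes from replacing one long prediction filter with the Kronecker product of two short ones. The numpy sketch below (illustrative sizes, not the paper's adaptive update) shows that the long-filter inner product can be evaluated using only the two short filters.

```python
import numpy as np

# A long filter g of length L = L1 * L2 is constrained to the Kronecker product
# of two short filters g1 (length L1) and g2 (length L2), so only L1 + L2
# coefficients need to be estimated and updated online.
L1, L2 = 8, 32
rng = np.random.default_rng(0)
g1, g2 = rng.standard_normal(L1), rng.standard_normal(L2)

g_long = np.kron(g1, g2)            # equivalent long filter, length 256

# Hypothetical prediction step on a delayed signal frame x: y_hat = g_long @ x
x = rng.standard_normal(L1 * L2)
y_hat = g_long @ x

# The same inner product via the two short filters: reshape x into (L1, L2)
# and compute g1^T X g2 -- the source of the computational savings.
X = x.reshape(L1, L2)
y_hat_kp = g1 @ X @ g2
assert np.allclose(y_hat, y_hat_kp)
```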
[570] Detecting Neurocognitive Disorders through Analyses of Topic Evolution and Cross-modal Consistency in Visual-Stimulated Narratives
Jinchao Li, Yuejiao Wang, Junan Li, Jiawen Kang, Bo Zheng, Ka Ho Wong, Brian Mak, Helene H. Fung, Jean Woo, Man-Wai Mak, Timothy Kwok, Vincent Mok, Xianmin Gong, Xixin Wu, Xunying Liu, Patrick C. M. Wong, Helen Meng
Main category: eess.AS
TL;DR: Proposes two novel macrostructural approaches for neurocognitive disorder detection using visual-stimulated narratives: Dynamic Topic Model for topic evolution tracking and Text-Image Temporal Alignment Network for cross-modal consistency measurement.
Details
Motivation: Current VSN-based NCD detection methods focus on linguistic microstructures but neglect higher-order linguistic macrostructures that reflect top-down cognitive abilities, which are crucial for NCD detection but challenging to quantify.Method: Two approaches: (1) Dynamic Topic Model to track topic evolution over time, (2) Text-Image Temporal Alignment Network to measure cross-modal consistency between narrative and visual stimuli.
Result: TITAN achieved superior performance across three corpora: ADReSS (F1=0.8889), ADReSSo (F1=0.8504), and CU-MARVEL-RABBIT (F1=0.7238). Macrostructural features were the most significant contributors to model decisions, outperforming microstructural features.
Conclusion: Macrostructural analysis provides valuable insights into linguistic-cognitive interactions associated with NCDs, demonstrating the importance of higher-order linguistic patterns for early detection of neurocognitive disorders.
Abstract: Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Given that language impairments manifest early in NCD progression, visual-stimulated narrative (VSN)-based analysis offers a promising avenue for NCD detection. Current VSN-based NCD detection methods primarily focus on linguistic microstructures (e.g., lexical diversity) that are closely tied to bottom-up, stimulus-driven cognitive processes. While these features illuminate basic language abilities, the higher-order linguistic macrostructures (e.g., topic development) that may reflect top-down, concept-driven cognitive abilities remain underexplored. These macrostructural patterns are crucial for NCD detection, yet challenging to quantify due to their abstract and complex nature. To bridge this gap, we propose two novel macrostructural approaches: (1) a Dynamic Topic Model (DTM) to track topic evolution over time, and (2) a Text-Image Temporal Alignment Network (TITAN) to measure cross-modal consistency between narrative and visual stimuli. Experimental results show the effectiveness of the proposed approaches in NCD detection, with TITAN achieving superior performance across three corpora: ADReSS (F1=0.8889), ADReSSo (F1=0.8504), and CU-MARVEL-RABBIT (F1=0.7238). Feature contribution analysis reveals that macrostructural features (e.g., topic variability, topic change rate, and topic consistency) constitute the most significant contributors to the model’s decision pathways, outperforming the investigated microstructural features. These findings underscore the value of macrostructural analysis for understanding linguistic-cognitive interactions associated with NCDs.
[571] Acoustic and Machine Learning Methods for Speech-Based Suicide Risk Assessment: A Systematic Review
Ambre Marie, Marine Garnier, Thomas Bertin, Laura Machart, Guillaume Dardenne, Gwenolé Quellec, Sofian Berrouiguet
Main category: eess.AS
TL;DR: AI/ML can detect suicide risk through acoustic speech analysis, showing significant differences in features like jitter, F0, MFCC, and PSD between at-risk and non-risk individuals.
Details
Motivation: Suicide remains a public health challenge requiring improved detection methods for timely intervention and treatment.Method: Systematic review of 33 articles following PRISMA guidelines, analyzing acoustic features between suicide risk and non-risk groups using AI/ML classifiers.
Result: Significant acoustic feature variations found between groups; classifier performance varied (AUC: 0.62-0.985, accuracy: 60%-99.85%); multimodal approaches performed best; most datasets were imbalanced.
Conclusion: AI/ML shows promise for suicide risk detection via acoustic analysis, but methodological limitations like dataset imbalance and incomplete reporting need addressing.
Abstract: Suicide remains a public health challenge, necessitating improved detection methods to facilitate timely intervention and treatment. This systematic review evaluates the role of Artificial Intelligence (AI) and Machine Learning (ML) in assessing suicide risk through acoustic analysis of speech. Following PRISMA guidelines, we analyzed 33 articles selected from PubMed, Cochrane, Scopus, and Web of Science databases. The last search was conducted in February 2025. Risk of bias was assessed using the PROBAST tool. Studies analyzing acoustic features between individuals at risk of suicide (RS) and those not at risk (NRS) were included, while studies lacking acoustic data, a suicide-related focus, or sufficient methodological details were excluded. Sample sizes varied widely and were reported in terms of participants or speech segments, depending on the study. Results were synthesized narratively based on acoustic features and classifier performance. Findings consistently showed significant acoustic feature variations between RS and NRS populations, particularly involving jitter, fundamental frequency (F0), Mel-frequency cepstral coefficients (MFCC), and power spectral density (PSD). Classifier performance varied based on algorithms, modalities, and speech elicitation methods, with multimodal approaches integrating acoustic, linguistic, and metadata features demonstrating superior performance. Among the 29 classifier-based studies, reported AUC values ranged from 0.62 to 0.985 and accuracies from 60% to 99.85%. Most datasets were imbalanced in favor of NRS, and performance metrics were rarely reported separately by group, limiting clear identification of direction of effect.
[572] Local Density-Based Anomaly Score Normalization for Domain Generalization
Kevin Wilkinghoff, Haici Yang, Janek Ebbers, François G. Germain, Gordon Wichern, Jonathan Le Roux
Main category: eess.AS
TL;DR: Proposes a local-density-based anomaly score normalization scheme to address domain mismatch in anomalous sound detection systems, improving performance across different domains.
Details
Motivation: Domain mismatch between source and target domains degrades ASD performance when using a single decision threshold, as optimal thresholds differ across acoustically different domains.Method: A local-density-based anomaly score normalization scheme that reduces domain mismatch by normalizing anomaly scores based on local density distributions.
Result: Experiments on several ASD datasets show consistent performance improvements for various embedding-based ASD systems, outperforming existing normalization approaches.
Conclusion: The proposed normalization scheme effectively addresses domain mismatch issues and enhances ASD system generalization across different domains.
Abstract: State-of-the-art anomalous sound detection (ASD) systems in domain-shifted conditions rely on projecting audio signals into an embedding space and using distance-based outlier detection to compute anomaly scores. One of the major difficulties to overcome is the so-called domain mismatch between the anomaly score distributions of a source domain and a target domain that differ acoustically and in terms of the amount of training data provided. A decision threshold that is optimal for one domain may be highly sub-optimal for the other domain and vice versa. This significantly degrades the performance when only using a single decision threshold, as is required when generalizing to multiple data domains that are possibly unseen during training while still using the same trained ASD system as in the source domain. To reduce this mismatch between the domains, we propose a simple local-density-based anomaly score normalization scheme. In experiments conducted on several ASD datasets, we show that the proposed normalization scheme consistently improves performance for various types of embedding-based ASD systems and yields better results than existing anomaly score normalization approaches.
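One plausible reading of local-density-based normalization is to rescale each test score by the distance scale of its nearest normal training embeddings, so that a single decision threshold transfers across domains. The sketch below assumes generic embeddings and raw scores and is not the paper's exact scheme.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_density_normalize(train_emb, test_emb, raw_scores, k=16):
    """Hypothetical local-density normalization: rescale each test anomaly score
    by the typical distance to its k nearest normal training embeddings, so
    thresholds become comparable across acoustically different domains."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
    dists, _ = nn.kneighbors(test_emb)          # (n_test, k) distances to normal data
    local_scale = dists.mean(axis=1) + 1e-12    # local density proxy per test sample
    return raw_scores / local_scale
```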
[573] RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling
Mandip Goswami
Main category: eess.AS
TL;DR: RIR-Mega is a large dataset of 50,000 simulated room impulse responses with comprehensive metadata and tools for acoustic research applications.
Details
Motivation: Room impulse responses are essential for dereverberation, speech recognition, source localization, and room acoustics estimation, but existing datasets lack standardized metadata and validation tools.Method: Created a large collection of simulated RIRs with compact metadata schema, distributed with validation tools, Hugging Face Datasets loader, and reference regression baseline for RT60 prediction using Random Forest on time and spectral features.
Result: On 36,000 training and 4,000 validation examples, the baseline model achieved MAE of 0.013s and RMSE of 0.022s for RT60 prediction. The dataset includes 1,000 linear array and 3,000 circular array RIRs on Hugging Face, with full 50,000 RIR archive on Zenodo.
Conclusion: RIR-Mega provides a comprehensive, publicly available dataset with standardized tools to support reproducible research in acoustic signal processing and room acoustics analysis.
Abstract: Room impulse responses are a core resource for dereverberation, robust speech recognition, source localization, and room acoustics estimation. We present RIR-Mega, a large collection of simulated RIRs described by a compact, machine-friendly metadata schema and distributed with simple tools for validation and reuse. The dataset ships with a Hugging Face Datasets loader, scripts for metadata checks and checksums, and a reference regression baseline that predicts RT60-like targets from waveforms. On a train and validation split of 36,000 and 4,000 examples, a small Random Forest on lightweight time and spectral features reaches a mean absolute error near 0.013 s and a root mean square error near 0.022 s. We host a subset with 1,000 linear array RIRs and 3,000 circular array RIRs on Hugging Face for streaming and quick tests, and preserve the complete 50,000 RIR archive on Zenodo. The dataset and code are public to support reproducible studies.
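A baseline in this spirit is easy to reproduce: hand-crafted time/spectral features per RIR feeding a small Random Forest regressor. The feature set below is an assumption for illustration; the released baseline may compute different features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rir_features(rir, sr=16000):
    """A few lightweight time/spectral features from one RIR (illustrative only)."""
    energy = rir.astype(np.float64) ** 2
    edc = np.cumsum(energy[::-1])[::-1]                       # Schroeder energy decay curve
    edc_db = 10.0 * np.log10(edc / (edc[0] + 1e-12) + 1e-12)
    below = np.nonzero(edc_db < -30.0)[0]                     # first sample 30 dB down
    t30 = below[0] / sr if below.size else len(rir) / sr
    spectrum = np.abs(np.fft.rfft(rir))
    freqs = np.fft.rfftfreq(len(rir), 1.0 / sr)
    centroid = float((freqs * spectrum).sum() / (spectrum.sum() + 1e-12))
    return [t30, float(energy.sum()), centroid]

def train_rt60_baseline(rirs, rt60s):
    """Fit a small Random Forest mapping RIR features to RT60 targets."""
    X = np.array([rir_features(r) for r in rirs])
    y = np.asarray(rt60s)
    return RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
```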
[574] A Unified Framework for Direction and Diffuseness Estimation Using Tight-Frame Microphone Arrays
Akira Omoto
Main category: eess.AS
TL;DR: A unified framework for estimating sound-field direction and diffuseness using practical microphone arrays with different geometries, enabling consistent diffuseness evaluation without requiring complex processing.
Details
Motivation: To develop robust methods for spatial-sound-field characterization that work across different array configurations without needing mode whitening or spherical-harmonic decomposition.Method: Velocity-only covariance approach for diffuseness evaluation, modeling and comparing three array types (A-format, rigid-sphere, and tight-frame arrays) through simulations and measurements.
Result: The tight-frame configuration achieves near-isotropic directional sampling and reproduces diffuseness characteristics comparable to higher-order spherical arrays while maintaining compact structure.
Conclusion: The framework connects theoretical diffuseness analysis with implementable array designs, supporting development of robust broadband methods for spatial-sound-field characterization.
Abstract: This work presents a unified framework for estimating both sound-field direction and diffuseness using practical microphone arrays with different spatial configurations. Building on covariance-based diffuseness models, we formulate a velocity-only covariance approach that enables consistent diffuseness evaluation across heterogeneous array geometries without requiring mode whitening or spherical-harmonic decomposition. Three array types – an A-format array, a rigid-sphere array, and a newly proposed tight-frame array – are modeled and compared through both simulations and measurement-based experiments. The results show that the tight-frame configuration achieves near-isotropic directional sampling and reproduces diffuseness characteristics comparable to those of higher-order spherical arrays, while maintaining a compact physical structure. We further examine the accuracy of direction-of-arrival estimation based on acoustic intensity within the same framework. These findings connect theoretical diffuseness analysis with implementable array designs and support the development of robust, broadband methods for spatial-sound-field characterization.
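For intuition, a velocity-only covariance diffuseness estimate can be built from the eigenvalue spread of the 3x3 velocity covariance: equal eigenvalues indicate an isotropic (diffuse) field, while a single dominant eigenvalue indicates a plane wave. The statistic below is one common eigenvalue-spread choice and not necessarily the estimator used in the paper.

```python
import numpy as np

def diffuseness_from_velocity_cov(V):
    """Eigenvalue-spread diffuseness sketch.

    V: (3, T) particle-velocity components over T time frames.
    Returns a value in [0, 1]: ~1 for an isotropic (diffuse) field with equal
    eigenvalues, ~0 for a single plane wave where one eigenvalue dominates.
    """
    C = V @ V.conj().T / V.shape[1]            # 3x3 velocity covariance matrix
    lam = np.linalg.eigvalsh(C)                # real eigenvalues, ascending
    lam_mean = lam.mean()
    spread = np.abs(lam - lam_mean).sum() / (lam_mean + 1e-12)
    spread_max = 2 * (lam.size - 1)            # attained for a single plane wave
    return 1.0 - spread / spread_max
```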
[575] SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity
Hanke Xie, Haopeng Lin, Wenxiao Cao, Dake Guo, Wenjie Tian, Jun Wu, Hanlin Wen, Ruixuan Shang, Hongmei Liu, Zhiqi Jiang, Yuepeng Jiang, Wenxi Chen, Ruiqi Yan, Jiale Qian, Yichao Yan, Shunshun Yin, Ming Tao, Xie Chen, Lei Xie, Xinsheng Wang
Main category: eess.AS
TL;DR: SoulX-Podcast is a multi-speaker conversational speech synthesis system that achieves state-of-the-art performance in both monologue TTS and multi-turn dialogue generation, supporting multiple languages and dialects.
Details
Motivation: Most existing TTS systems are designed for single-speaker synthesis and lack coherence in multi-speaker conversational speech, creating a need for systems that can generate natural multi-turn dialogues.Method: Integrates paralinguistic controls and supports Mandarin, English, and several Chinese dialects (Sichuanese, Henanese, Cantonese) for personalized podcast-style speech generation.
Result: Can continuously produce over 90 minutes of conversation with stable speaker timbre, smooth transitions, and contextually adaptive prosody. Achieves state-of-the-art performance across multiple evaluation metrics.
Conclusion: SoulX-Podcast successfully addresses the limitations of single-speaker TTS systems and demonstrates superior performance in generating coherent multi-speaker conversational speech for podcast applications.
Abstract: Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
eess.IV
[576] High-Quality and Large-Scale Image Downscaling for Modern Display Devices
Suvrojit Mitra, G B Kevin Arjun, Sanjay Ghosh
Main category: eess.IV
TL;DR: Proposed LSID method for large-scale image downscaling using co-occurrence learning to maintain structural integrity and visual quality while reducing aliasing and blurring artifacts.
Details
Motivation: Need for high-quality large-scale image downscaling that preserves visual authenticity and structural integrity without texture loss or edge blurring, addressing limitations of existing methods.Method: Uses co-occurrence learning to create data-driven co-occurrence profiles capturing intensity correlations in nearby neighborhoods, guiding a refined filtering process as content-adaptive range kernel based on pixel similarity with neighbors.
Result: Achieved up to 39.22 dB PSNR on the DIV2K dataset and a PIQE of up to 26.35 when downscaling by 8x and 16x, respectively, outperforming contemporary approaches on the DIV2K, BSD100, Urban100, and RealSR datasets.
Conclusion: LSID method successfully preserves high-frequency structures like edges, textures, and patterns while reducing aliasing and blurring artifacts, demonstrating superior performance in large-scale image downscaling scenarios.
Abstract: In modern display technology and visualization tools, downscaling images is one of the most important activities. This procedure aims to maintain both visual authenticity and structural integrity while reducing the dimensions of an image at a large scale to fit the dimension of the display devices. In this study, we proposed a new technique for downscaling images that uses co-occurrence learning to maintain structural and perceptual information while reducing resolution. The technique uses the input image to create a data-driven co-occurrence profile that captures the frequency of intensity correlations in nearby neighborhoods. A refined filtering process is guided by this profile, which acts as a content-adaptive range kernel. The contribution of each input pixel is based on how closely its pair-wise intensity values match those of its neighbors. We validate our proposed technique on four datasets: DIV2K, BSD100, Urban100, and RealSR to show its effective downscaling capacity. Our technique could obtain up to 39.22 dB PSNR on the DIV2K dataset and PIQE up to 26.35 on the same dataset when downscaling by 8x and 16x, respectively. Numerous experimental findings attest to the ability of the suggested picture downscaling method to outperform more contemporary approaches in terms of both visual quality and performance measures. Unlike most existing methods, which did not focus on the large-scale image resizing scenario, we achieve high-quality downscaled images without texture loss or edge blurring. Our method, LSID (large scale image downscaling), successfully preserves high-frequency structures like edges, textures, and repeating patterns by focusing on statistically consistent pixels while reducing aliasing and blurring artifacts that are typical of traditional downscaling techniques.
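A toy version of co-occurrence-guided downscaling can clarify the idea: build an intensity co-occurrence profile from neighbouring pixel pairs, then weight each pixel in a block by how often its intensity co-occurs with the block centre's intensity. The sketch assumes an 8-bit grayscale image and is far simpler than the published LSID method.

```python
import numpy as np

def cooccurrence_profile(img, bins=64):
    """Frequency of quantized intensity pairs between each pixel and its
    right/bottom neighbours (an illustrative profile, not the paper's exact one)."""
    q = (img.astype(np.float64) / 256 * bins).astype(int).clip(0, bins - 1)
    C = np.zeros((bins, bins))
    np.add.at(C, (q[:, :-1], q[:, 1:]), 1)    # horizontal neighbour pairs
    np.add.at(C, (q[:-1, :], q[1:, :]), 1)    # vertical neighbour pairs
    C += C.T                                  # make the profile symmetric
    return C / C.sum(), q

def cooccurrence_downscale(img, factor=8, bins=64):
    """Toy content-adaptive downscaling: each output pixel is a weighted mean of
    its factor x factor block, weighted by how often each pixel's intensity
    co-occurs with the block centre's intensity."""
    C, q = cooccurrence_profile(img, bins)
    h, w = (img.shape[0] // factor) * factor, (img.shape[1] // factor) * factor
    img, q = img[:h, :w], q[:h, :w]
    out = np.zeros((h // factor, w // factor))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            blk = img[i*factor:(i+1)*factor, j*factor:(j+1)*factor]
            qb = q[i*factor:(i+1)*factor, j*factor:(j+1)*factor]
            center = qb[factor // 2, factor // 2]
            weights = C[center, qb] + 1e-12   # co-occurrence-based range weights
            out[i, j] = (weights * blk).sum() / weights.sum()
    return out
```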
[577] Fast algorithms enabling optimization and deep learning for photoacoustic tomography in a circular detection geometry
Andreas Hauptmann, Leonid Kunyansky, Jenni Poimala
Main category: eess.IV
TL;DR: The paper presents fast algorithms for forward and adjoint operators in photoacoustic tomography, achieving O(n² log n) complexity for n×n images, and demonstrates their use in iterative reconstruction methods including variational techniques and deep learning approaches.
Details
Motivation: Iterative algorithms for inverse source problems in photoacoustic tomography require multiple evaluations of forward and adjoint operators, which can be computationally expensive. Current deep learning and optimization approaches need efficient implementations of these operators.Method: Developed new asymptotically fast algorithms for numerical evaluation of the forward and adjoint operators in the circular acquisition geometry, computing both operators in O(n² log n) floating point operations for an (n×n) image.
Result: Successfully implemented algorithms that compute forward and adjoint operators in O(n² log n) operations for (n×n) images. Demonstrated performance in numerical simulations with various reconstruction methods including non-negative least squares, total variation regularization, and learned primal dual.
Conclusion: The proposed fast algorithms enable efficient computation of forward and adjoint operators, making iterative reconstruction methods more practical for photoacoustic tomography. A publicly available Python implementation is provided for general use.
Abstract: The inverse source problem arising in photoacoustic tomography and in several other coupled-physics modalities is frequently solved by iterative algorithms. Such algorithms are based on the minimization of a certain cost functional. In addition, novel deep learning techniques are currently being investigated to further improve such optimization approaches. All such methods require multiple applications of the operator defining the forward problem, and of its adjoint. In this paper, we present new asymptotically fast algorithms for numerical evaluation of the forward and adjoint operators, applicable in the circular acquisition geometry. For an $(n \times n)$ image, our algorithms compute these operators in $\mathcal{O}(n^2 \log n)$ floating point operations. We demonstrate the performance of our algorithms in numerical simulations, where they are used as an integral part of several iterative image reconstruction techniques: classic variational methods, such as non-negative least squares and total variation regularized least squares, as well as deep learning methods, such as learned primal dual. A Python implementation of our algorithms and computational examples is available to the general public.
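The role of fast forward/adjoint operators is easiest to see inside a generic iterative solver. The Landweber-style loop below treats `forward` and `adjoint` as given callables (standing in for the paper's O(n² log n) routines, which are assumptions here) and adds an optional non-negativity projection as in non-negative least squares.

```python
import numpy as np

def landweber_reconstruction(forward, adjoint, data, shape, step=1.0, n_iter=50):
    """Generic least-squares reconstruction x* = argmin ||A x - b||^2 driven only
    by the forward operator A and its adjoint A^T.

    forward, adjoint: callables implementing A and A^T (assumed fast routines)
    data:             measured data b
    shape:            shape of the reconstructed image x
    """
    x = np.zeros(shape)
    for _ in range(n_iter):
        residual = forward(x) - data          # A x - b
        x = x - step * adjoint(residual)      # gradient step using A^T
        np.clip(x, 0, None, out=x)            # optional non-negativity projection
    return x
```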
[578] Dipole-lets: a new multiscale decomposition for MR phase and quantitative susceptibility mapping
Ignacio Contreras-Zúñiga, Mathias Lambert, Benjamín Palacios, Cristian Tejos, Carlos Milovic
Main category: eess.IV
TL;DR: The paper introduces Dipole-lets, a multiscale transform to identify and suppress streaking artifacts in quantitative susceptibility mapping by detecting dipole incompatibilities in measured field data.
Details
Motivation: Streaking artifacts in quantitative susceptibility mapping are challenging due to extreme noise and non-dipolar phase contributions that get amplified by the dipole kernel, creating characteristic streaking patterns.Method: Developed Dipole-lets as an optimal multiscale decomposition method that extracts features of different sizes and orientations relative to the dipole kernel’s zero-valued double-cone surface (magic cone) to identify dipole incompatibilities.
Result: Experiments show that non-dipolar content can be extracted from phase data through artifact localization using Dipole-lets, and implementations as optimization functional regularizers using Tikhonov and infinity norms are presented.
Conclusion: Dipole-lets provide an effective approach for identifying and suppressing streaking artifacts in quantitative susceptibility mapping by detecting dipole incompatibilities in measured field data.
Abstract: Identifying and suppressing streaking artifacts is one of the most challenging problems in quantitative susceptibility mapping. The measured phase from tissue magnetization is assumed to be the convolution by the magnetic dipole kernel; direct inversion or standard regularization methods tend to create streaking artifacts in the estimated susceptibility. This is caused by extreme noise and by the presence of non-dipolar phase contributions, which are amplified by the dipole kernel following the streaking pattern. In this work, we introduce a multiscale transform, called Dipole-lets, as an optimal decomposition method for identifying dipole incompatibilities in measured field data by extracting features of different characteristic size and orientation with respect to the dipole kernel’s zero-valued double-cone surface (the magic cone). We provide experiments that showcase that non-dipolar content can be extracted by Dipole-lets from phase data through artifact localization. We also present implementations of Dipole-lets as an optimization functional regularizer, using simple Tikhonov and infinity norms.
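For context, the dipole kernel and its zero-valued "magic cone" (with the main field along z) are standard in QSM and can be constructed directly in k-space; the cone mask below only illustrates where a decomposition such as Dipole-lets would concentrate its analysis, with the tolerance an arbitrary illustrative choice.

```python
import numpy as np

def dipole_kernel(shape, voxel_size=(1.0, 1.0, 1.0)):
    """Standard k-space dipole kernel D(k) = 1/3 - kz^2 / |k|^2 (B0 along z)."""
    axes = [np.fft.fftfreq(n, d=v) for n, v in zip(shape, voxel_size)]
    kx, ky, kz = np.meshgrid(*axes, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    with np.errstate(divide="ignore", invalid="ignore"):
        D = 1.0 / 3.0 - kz**2 / k2
    D[k2 == 0] = 0.0                          # define the DC sample explicitly
    return D

def magic_cone_mask(D, tol=0.05):
    """Binary mask of k-space samples close to the kernel's zero-valued double
    cone, where dipole-incompatible content produces streaking (tol is illustrative)."""
    return np.abs(D) < tol
```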
[579] Servo navigation and phase equalization enhanced by run-time stabilization (PEERS) for 3D EPI time series
Malte Riedel, Thomas Ulrich, Samuel Bianchi, Klaas P. Pruessmann
Main category: eess.IV
TL;DR: PEERS combines run-time servo navigation with retrospective phase correction to enhance time-resolved segmented 3D EPI fMRI imaging, achieving significant tSNR improvements.
Details
Motivation: To enhance time-resolved segmented imaging by addressing motion artifacts and phase/frequency fluctuations in fMRI time series through a synergistic approach.Method: Uses a segmented 3D EPI sequence with servo navigation (short orbital navigators and linear perturbation model) for run-time correction of rigid-body motion, bulk phase and frequency fluctuation, combined with retrospective phase correction based on time series repetitive structure.
Result: Servo navigation reduces motion confound and maintains k-space consistency; retrospective phase equalization eliminates shot-wise phase/frequency offsets from eddy-currents and vibrations. PEERS achieved tSNR improvements up to 30% for small motion and ~10% when holding still, outperforming navigator-only phase correction.
Conclusion: PEERS provides effective plug-and-play motion and phase correction for 3D fMRI, combining high-precision run-time motion correction with precise retrospective frequency/phase correction in a fully automatic, self-calibrated system.
Abstract: Purpose: To enhance time-resolved segmented imaging by synergy of run-time stabilization and retrospective, data-driven phase correction. Methods: A segmented 3D EPI sequence for fMRI time series is equipped with servo navigation based on short orbital navigators and a linear perturbation model, enabling run-time correction for rigid-body motion as well as bulk phase and frequency fluctuation. Complementary retrospective phase correction is based on the repetitive structure of the time series and serves to address residual phase and frequency offsets. The combined approach is termed phase equalization enhanced by run-time stabilization (PEERS). Results: The proposed strategy is evaluated in a phantom and in-vivo. Servo navigation is found to diminish motion confound in raw data and maintain k-space consistency over time series. In turn, retrospective phase equalization is found to eliminate shot-wise phase and frequency offsets relative to the navigator, which are attributed to eddy-currents and vibrations from phase encoding. Retrospective phase equalization reduces the precision requirements for run-time frequency control, supporting the use of short navigators. Relative to conventional volume realignment, PEERS achieved tSNR improvements up to 30% for small motion and in the order of 10% when volunteers tried to hold still. Retrospective phase equalization is found to clearly outperform phase correction based solely on navigator-based frequency estimates. Conclusion: Servo navigation achieves high-precision run-time motion correction for 3D EPI fMRI. Coarse frequency tracking based on short navigators is supplemented by precise retrospective frequency and phase correction. Fully automatic and self-calibrated, PEERS offers effective plug-and-play motion and phase correction for 3D fMRI.
[580] TraceTrans: Translation and Spatial Tracing for Surgical Prediction
Xiyu Luo, Haodong Li, Xinxing Cheng, He Zhao, Yang Hu, Xuan Song, Tianyang Zhang
Main category: eess.IV
TL;DR: TraceTrans is a deformable image translation model for post-operative prediction that maintains spatial correspondences between pre-operative and translated images, ensuring anatomical consistency.
Details
Motivation: Existing image translation methods focus on matching target distributions but neglect spatial correspondences, leading to structural inconsistencies and hallucinations that undermine reliability in clinical applications requiring anatomical accuracy.Method: The framework uses an encoder for feature extraction and dual decoders - one for predicting spatial deformations and another for synthesizing the translated image. The predicted deformation field imposes spatial constraints to ensure anatomical consistency.
Result: Extensive experiments on medical cosmetology and brain MRI datasets show that TraceTrans delivers accurate and interpretable post-operative predictions.
Conclusion: TraceTrans demonstrates potential for reliable clinical deployment by generating images aligned with target distribution while explicitly revealing spatial correspondences, ensuring anatomical accuracy.
Abstract: Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.
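The spatial constraint rests on warping the pre-operative image with the predicted deformation field. The PyTorch sketch below shows a generic dense-displacement warp via bilinear sampling; it is not TraceTrans' actual decoder, and the flow convention (per-pixel displacements in x/y) is an assumption.

```python
import torch
import torch.nn.functional as F

def warp_with_deformation(src, flow):
    """Warp a source image with a dense deformation field via bilinear sampling.

    src:  (B, C, H, W) pre-operative image
    flow: (B, 2, H, W) per-pixel displacements in pixels, channel order (dx, dy)
    """
    b, _, h, w = src.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(src.device)   # (2, H, W) identity grid
    coords = base.unsqueeze(0) + flow                            # displaced sampling coords
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)             # (B, H, W, 2)
    return F.grid_sample(src, grid, align_corners=True)
```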
[581] MSRANetV2: An Explainable Deep Learning Architecture for Multi-class Classification of Colorectal Histopathological Images
Ovi Sarkar, Md Shafiuzzaman, Md. Faysal Ahamed, Golam Mahmud, Muhammad E. H. Chowdhury
Main category: eess.IV
TL;DR: Proposed MSRANetV2, a CNN architecture with ResNet50V2 backbone enhanced with residual attention and SE blocks, achieving state-of-the-art performance in colorectal cancer tissue classification on two public datasets.
Details
Motivation: Colorectal cancer is a leading cause of cancer mortality, and conventional diagnostic methods like colonoscopy are subjective, time-consuming, and variable. Deep learning can enhance diagnostic precision and efficiency in digital pathology.Method: MSRANetV2 uses ResNet50V2 backbone extended with residual attention mechanisms and squeeze-and-excitation blocks to extract deep semantic and fine-grained spatial features. It employs channel alignment and upsampling to fuse multi-scale representations for robust classification.
Result: Achieved remarkable performance on CRC-VAL-HE-7K (avg precision: 0.9884±0.0151, recall: 0.9900±0.0151, F1: 0.9900±0.0145, AUC: 0.9999±0.00006, accuracy: 0.9905±0.0025) and NCT-CRC-HE-100K (avg precision: 0.9904±0.0091, recall: 0.9900±0.0071, F1: 0.9900±0.0071, AUC: 0.9997±0.00016, accuracy: 0.9902±0.0006).
Conclusion: MSRANetV2 is a reliable, interpretable, and high-performing model for colorectal cancer tissue classification, validated through comprehensive evaluation and Grad-CAM visualizations for medical interpretability.
Abstract: Colorectal cancer (CRC) is a leading worldwide cause of cancer-related mortality, and the role of prompt precise detection is of paramount interest in improving patient outcomes. Conventional diagnostic methods such as colonoscopy and histological examination routinely exhibit subjectivity, are extremely time-consuming, and are susceptible to variation. Through the development of digital pathology, deep learning algorithms have become a powerful approach in enhancing diagnostic precision and efficiency. In our work, we proposed a convolutional neural network architecture named MSRANetV2, specially optimized for the classification of colorectal tissue images. The model employs a ResNet50V2 backbone, extended with residual attention mechanisms and squeeze-and-excitation (SE) blocks, to extract deep semantic and fine-grained spatial features. With channel alignment and upsampling operations, MSRANetV2 effectively fuses multi-scale representations, thereby enhancing the robustness of the classification. We evaluated our model using a five-fold stratified cross-validation strategy on two publicly available datasets: CRC-VAL-HE-7K and NCT-CRC-HE-100K. The proposed model achieved remarkable results: the average precision, recall, F1-score, AUC, and test accuracy were 0.9884 ± 0.0151, 0.9900 ± 0.0151, 0.9900 ± 0.0145, 0.9999 ± 0.00006, and 0.9905 ± 0.0025 on the 7K dataset. On the 100K dataset, they were 0.9904 ± 0.0091, 0.9900 ± 0.0071, 0.9900 ± 0.0071, 0.9997 ± 0.00016, and 0.9902 ± 0.0006. Additionally, Grad-CAM visualizations were incorporated to enhance model interpretability by highlighting tissue areas that are medically relevant. These findings validate that MSRANetV2 is a reliable, interpretable, and high-performing architectural model for classifying CRC tissues.
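The squeeze-and-excitation blocks mentioned above follow a well-known pattern: a global-average "squeeze", a small bottleneck "excitation" MLP, and channel-wise rescaling. The PyTorch sketch below shows the standard SE design, not MSRANetV2's exact configuration.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard squeeze-and-excitation block: global-average squeeze, two-layer
    excitation with a reduction bottleneck, and channel-wise feature rescaling."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                 # squeeze: global average pooling
        w = self.fc(w).view(b, c, 1, 1)        # excitation: per-channel gates in (0, 1)
        return x * w                           # rescale the feature maps
```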